2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

1897 Commits

Author SHA1 Message Date
Omar Sandoval
8e2bd3b7fa Btrfs: deal with existing encompassing extent map in btrfs_get_extent()
My QEMU VM was seeing inexplicable I/O errors that I tracked down to
errors coming from the qcow2 virtual drive in the host system. The qcow2
file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
Every once in awhile, pread() or pwrite() would return EEXIST, which
makes no sense. This turned out to be a bug in btrfs_get_extent().

Commit 8dff9c8534 ("Btrfs: deal with duplciates during extent_map
insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
two threads race on adding the same extent map to an inode's extent map
tree. However, if the added em is merged with an adjacent em in the
extent tree, then we'll end up with an existing extent that is not
identical to but instead encompasses the extent we tried to add. When we
call merge_extent_mapping() to find the nonoverlapping part of the new
em, the arithmetic overflows because there is no such thing. We then end
up trying to add a bogus em to the em_tree, which results in a EEXIST
that can bubble all the way up to userspace.

Fix it by extending the identical extent map special case.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Christoph Hellwig
cf8cddd38b btrfs: don't abuse REQ_OP_* flags for btrfs_map_block
btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations.  But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags.  This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-29 14:10:38 +01:00
Linus Torvalds
46d7cbb2c4 Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from Chris Mason:
 "Some fixes that Dave Sterba collected.  We held off on these last week
  because I was focused on the memory corruption testing"

* 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix WARNING in btrfs_select_ref_head()
  Btrfs: remove some no-op casts
  btrfs: pass correct args to btrfs_async_run_delayed_refs()
  btrfs: make file clone aware of fatal signals
  btrfs: qgroup: Prevent qgroup->reserved from going subzero
  Btrfs: kill BUG_ON in do_relocation
2016-11-04 20:08:16 -07:00
Christoph Hellwig
70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig
ef295ecf09 block: better op and flags encoding
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields.  This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits.  Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:48:16 -06:00
Wang Xiaoguang
dd4b857aab btrfs: pass correct args to btrfs_async_run_delayed_refs()
In btrfs_truncate_inode_items()->btrfs_async_run_delayed_refs(), we
swap the arg2 and arg3 wrongly, fix this.

This bug just impacts asynchronous delayed refs handle when we truncate inodes.
In delayed_ref_async_start(), there is such codes:

    trans = btrfs_join_transaction(async->root);
    if (trans->transid > async->transid)
        goto end;
    ret = btrfs_run_delayed_refs(trans, async->root, async->count);

From this codes, we can see that this just influence whether can we handle
delayed refs or the number of delayed refs to handle, this may impact
performance, but will not result in missing delayed refs, all delayed refs will
be handled in btrfs_commit_transaction().

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Goldwyn Rodrigues
0b34c261e2 btrfs: qgroup: Prevent qgroup->reserved from going subzero
While free'ing qgroup->reserved resources, we much check if
the page has not been invalidated by a truncate operation
by checking if the page is still dirty before reducing the
qgroup resources. Resources in such a case are free'd when
the entire extent is released by delayed_ref.

This fixes a double accounting while releasing resources
in case of truncating a file, reproduced by the following testcase.

SCRATCH_DEV=/dev/vdb
SCRATCH_MNT=/mnt
mkfs.btrfs -f $SCRATCH_DEV
mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT
cd $SCRATCH_MNT
btrfs quota enable $SCRATCH_MNT
btrfs subvolume create a
btrfs qgroup limit 500m a $SCRATCH_MNT
sync
for c in {1..15}; do
dd if=/dev/zero  bs=1M count=40 of=$SCRATCH_MNT/a/file;
done

sleep 10
sync
sleep 5

touch $SCRATCH_MNT/a/newfile

echo "Removing file"
rm $SCRATCH_MNT/a/file

Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:21 +02:00
Linus Torvalds
f29135b54b Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This is a big variety of fixes and cleanups.

  Liu Bo continues to fixup fuzzer related problems, and some of Josef's
  cleanups are prep for his bigger extent buffer changes (slated for
  v4.10)"

* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (39 commits)
  Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
  Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf
  Btrfs: don't BUG() during drop snapshot
  btrfs: fix btrfs_no_printk stub helper
  Btrfs: memset to avoid stale content in btree leaf
  btrfs: parent_start initialization cleanup
  btrfs: Remove already completed TODO comment
  btrfs: Do not reassign count in btrfs_run_delayed_refs
  btrfs: fix a possible umount deadlock
  Btrfs: fix memory leak in do_walk_down
  btrfs: btrfs_debug should consume fs_info when DEBUG is not defined
  btrfs: convert send's verbose_printk to btrfs_debug
  btrfs: convert pr_* to btrfs_* where possible
  btrfs: convert printk(KERN_* to use pr_* calls
  btrfs: unsplit printed strings
  btrfs: clean the old superblocks before freeing the device
  Btrfs: kill BUG_ON in run_delayed_tree_ref
  Btrfs: don't leak reloc root nodes on error
  btrfs: squash lines for simple wrapper functions
  Btrfs: improve check_node to avoid reading corrupted nodes
  ...
2016-10-11 11:23:06 -07:00
Linus Torvalds
101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Al Viro
3873691e5a Merge remote-tracking branch 'ovl/rename2' into for-linus 2016-10-10 23:02:51 -04:00
Linus Torvalds
97d2116708 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "xattr stuff from Andreas

  This completes the switch to xattr_handler ->get()/->set() from
  ->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Remove {get,set,remove}xattr inode operations
  xattr: Stop calling {get,set,remove}xattr inode operations
  vfs: Check for the IOP_XATTR flag in listxattr
  xattr: Add __vfs_{get,set,remove}xattr helpers
  libfs: Use IOP_XATTR flag for empty directory handling
  vfs: Use IOP_XATTR flag for bad-inode handling
  vfs: Add IOP_XATTR inode operations flag
  vfs: Move xattr_resolve_name to the front of fs/xattr.c
  ecryptfs: Switch to generic xattr handlers
  sockfs: Get rid of getxattr iop
  sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
  kernfs: Switch to generic xattr handlers
  hfs: Switch to generic xattr handlers
  jffs2: Remove jffs2_{get,set,remove}xattr macros
  xattr: Remove unnecessary NULL attribute name check
2016-10-10 17:11:50 -07:00
Linus Torvalds
fed41f7d03 Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice fixups from Al Viro:
 "A couple of fixups for interaction of pipe-backed iov_iter with
  O_DIRECT reads + constification of a couple of primitives in uio.h
  missed by previous rounds.

  Kudos to davej - his fuzzing has caught those bugs"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [btrfs] fix check_direct_IO() for non-iovec iterators
  constify iov_iter_count() and iter_is_iovec()
  fix ITER_PIPE interaction with direct_IO
2016-10-10 13:38:49 -07:00
Linus Torvalds
abb5a14fa2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted misc bits and pieces.

  There are several single-topic branches left after this (rename2
  series from Miklos, current_time series from Deepa Dinamani, xattr
  series from Andreas, uaccess stuff from from me) and I'd prefer to
  send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
  proc: switch auxv to use of __mem_open()
  hpfs: support FIEMAP
  cifs: get rid of unused arguments of CIFSSMBWrite()
  posix_acl: uapi header split
  posix_acl: xattr representation cleanups
  fs/aio.c: eliminate redundant loads in put_aio_ring_file
  fs/internal.h: add const to ns_dentry_operations declaration
  compat: remove compat_printk()
  fs/buffer.c: make __getblk_slow() static
  proc: unsigned file descriptors
  fs/file: more unsigned file descriptors
  fs: compat: remove redundant check of nr_segs
  cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
  cifs: don't use memcpy() to copy struct iov_iter
  get rid of separate multipage fault-in primitives
  fs: Avoid premature clearing of capabilities
  fs: Give dentry to inode_change_ok() instead of inode
  fuse: Propagate dentry down to inode_change_ok()
  ceph: Propagate dentry down to inode_change_ok()
  xfs: Propagate dentry down to inode_change_ok()
  ...
2016-10-10 13:04:49 -07:00
Al Viro
cd27e45504 [btrfs] fix check_direct_IO() for non-iovec iterators
looking for duplicate ->iov_base makes sense only for
iovec-backed iterators; for kvec-backed ones it's pointless,
for bvec-backed ones it's pointless and broken on 32bit (we
walk through an array of struct bio_vec accessing them as if
they were struct iovec; works by accident on 64bit, but on
32bit it'll blow up) and for pipe-backed ones it's pointless
and ends up oopsing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-10 13:58:16 -04:00
Al Viro
e55f1d1d13 Merge remote-tracking branch 'jk/vfs' into work.misc 2016-10-08 11:06:08 -04:00
Andreas Gruenbacher
fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Deepa Dinamani
c2050a454c fs: Replace current_fs_time() with current_time()
current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.

Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:22 -04:00
Miklos Szeredi
2773bf00ae fs: rename "rename2" i_op to "rename"
Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:58 +02:00
Jeff Mahoney
ab8d0fc48d btrfs: convert pr_* to btrfs_* where possible
For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:04 +02:00
Jeff Mahoney
5d163e0e68 btrfs: unsplit printed strings
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Josef Bacik
afcdd129e0 Btrfs: add a flags field to btrfs_fs_info
We have a lot of random ints in btrfs_fs_info that can be put into flags.  This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Qu Wenruo
ba8b04c1d4 btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band dedupe and subpage size patchset
Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc()
parameters for both in-band dedupe and subpage sector size patchset.

This should reduce conflict of both patchset and the effort to rebase
them.

Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Jan Kara
31051c85b5 fs: Give dentry to inode_change_ok() instead of inode
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Miklos Szeredi
f031221001 btrfs: use filemap_check_errors()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Cc: Chris Mason <clm@fb.com>
2016-09-16 12:44:21 +02:00
Bart Van Assche
4382e33ad3 block, dm-crypt, btrfs: Introduce bio_flags()
Introduce the bio_flags() macro. Ensure that the second argument of
bio_set_op_attrs() only contains flags and no operation. This patch
does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Josef Bacik <jbacik@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 08:48:27 -06:00
Linus Torvalds
28687b935e Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've queued up a few different fixes in here.  These range from
  enospc corners to fsync and quota fixes, and a few targeted at error
  handling for corrupt metadata/fuzzing"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix lockdep warning on deadlock against an inode's log mutex
  Btrfs: detect corruption when non-root leaf has zero item
  Btrfs: check btree node's nritems
  btrfs: don't create or leak aliased root while cleaning up orphans
  Btrfs: fix em leak in find_first_block_group
  btrfs: do not background blkdev_put()
  Btrfs: clarify do_chunk_alloc()'s return value
  btrfs: fix fsfreeze hang caused by delayed iputs deal
  btrfs: update btrfs_space_info's bytes_may_use timely
  btrfs: divide btrfs_update_reserved_bytes() into two functions
  btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
  btrfs: qgroup: Fix qgroup incorrectness caused by log replay
  btrfs: relocation: Fix leaking qgroups numbers on data extents
  btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
  btrfs: waiting on qgroup rescan should not always be interruptible
  btrfs: properly track when rescan worker is running
  btrfs: flush_space: treat return value of do_chunk_alloc properly
  Btrfs: add ASSERT for block group's memory leak
  btrfs: backref: Fix soft lockup in __merge_refs function
  Btrfs: fix memory leak of reloc_root
2016-08-26 20:22:01 -07:00
Wang Xiaoguang
18513091af btrfs: update btrfs_space_info's bytes_may_use timely
This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
	#!/bin/bash
	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
	dev=$(losetup --show -f fs.img)
	mkfs.btrfs -f -M $dev
	mkdir /tmp/mntpoint
	mount $dev /tmp/mntpoint
	cd /tmp/mntpoint
	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

Above script will fail for ENOSPC reason, but indeed fs still has free
space to satisfy this request. Please see call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
|   bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
    |-> btrfs_reserve_extent()
        |-> btrfs_add_reserved_bytes()
        |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
        |   change bytes_may_use, and bytes_reserved += 64M. Now
        |   bytes_may_use + bytes_reserved == 128M, which is greater
        |   than btrfs_space_info's total_bytes, false enospc occurs.
        |   Note, the bytes_may_use decrease operation will be done in
        |   end of btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:
                    CPU 1              |              CPU 2
                                       |
|-> cow_file_range()                   |-> __btrfs_buffered_write()
    |-> btrfs_reserve_extent()         |   |
    |                                  |   |
    |                                  |   |
    |    .....                         |   |-> btrfs_check_data_free_space()
    |                                  |
    |                                  |
    |-> extent_clear_unlock_delalloc() |

In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
operation will be delayed to be done in extent_clear_unlock_delalloc().
Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
btrfs_check_data_free_space() tries to reserve 100MB data space.
If
	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allcate new data chunk or call
btrfs_start_delalloc_roots(), or commit current transaction in order to
reserve some free space, obviously a lot of work. But indeed it's not
necessary as long as decreasing bytes_may_use timely, we still have
free space, decreasing 128M from bytes_may_use.

To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For compress path, real
extent length may not be equal to file content length, so introduce a
ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
file content length. Then compress path can update bytes_may_use
correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.

As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
PREALLOC, we also need to update bytes_may_use, but can not pass
EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
to update btrfs_space_info's bytes_may_use.

Meanwhile __btrfs_prealloc_file_range() will call
btrfs_free_reserved_data_space() internally for both sucessful and failed
path, btrfs_prealloc_file_range()'s callers does not need to call
btrfs_free_reserved_data_space() any more.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:26 -07:00
Linus Torvalds
9512c47ec2 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Some fixes for btrfs send/recv and fsync from Filipe and Robbie Ko.

  Bonus points to Filipe for already having xfstests in place for many
  of these"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: remove unused function btrfs_add_delayed_qgroup_reserve()
  Btrfs: improve performance on fsync against new inode after rename/unlink
  Btrfs: be more precise on errors when getting an inode from disk
  Btrfs: send, don't bug on inconsistent snapshots
  Btrfs: send, avoid incorrect leaf accesses when sending utimes operations
  Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
  Btrfs: send, fix warning due to late freeing of orphan_dir_info structures
  Btrfs: incremental send, fix premature rmdir operations
  Btrfs: incremental send, fix invalid paths for rename operations
  Btrfs: send, add missing error check for calls to path_loop()
  Btrfs: send, fix failure to move directories with the same name around
  Btrfs: add missing check for writeback errors on fsync
2016-08-10 11:16:03 -07:00
Jens Axboe
1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Chris Mason
1083881654 Merge branch 'integration-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.8 2016-08-05 12:25:05 -07:00
Linus Torvalds
d58b0d980f Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "This is part two of my btrfs pull, which is some cleanups and a batch
  of fixes.

  Most of the code here is from Jeff Mahoney, making the pointers we
  pass around internally more consistent and less confusing overall.  I
  noticed a small problem right before I sent this out yesterday, so I
  fixed it up and re-tested overnight"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits)
  Btrfs: fix __MAX_CSUM_ITEMS
  btrfs: btrfs_abort_transaction, drop root parameter
  btrfs: add btrfs_trans_handle->fs_info pointer
  btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
  btrfs: convert nodesize macros to static inlines
  btrfs: introduce BTRFS_MAX_ITEM_SIZE
  btrfs: cleanup, remove prototype for btrfs_find_root_ref
  btrfs: copy_to_sk drop unused root parameter
  btrfs: simpilify btrfs_subvol_inherit_props
  btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
  btrfs: tests, require fs_info for root
  btrfs: tests, move initialization into tests/
  btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
  btrfs: prefix fsid to all trace events
  btrfs: plumb fs_info into btrfs_work
  btrfs: remove obsolete part of comment in statfs
  btrfs: hide test-only member under ifdef
  btrfs: Ratelimit "no csum found" info message
  btrfs: Add ratelimit to btrfs printing
  Btrfs: fix unexpected balance crash due to BUG_ON
  ...
2016-08-04 19:56:16 -04:00
Filipe Manana
44f714dae5 Btrfs: improve performance on fsync against new inode after rename/unlink
With commit 56f23fdbb6 ("Btrfs: fix file/data loss caused by fsync after
rename and new inode") we got simple fix for a functional issue when the
following sequence of actions is done:

  at transaction N
  create file A at directory D
  at transaction N + M (where M >= 1)
  move/rename existing file A from directory D to directory E
  create a new file named A at directory D
  fsync the new file
  power fail

The solution was to simply detect such scenario and fallback to a full
transaction commit when we detect it. However this turned out to had a
significant impact on throughput (and a bit on latency too) for benchmarks
using the dbench tool, which simulates real workloads from smbd (Samba)
servers. For example on a test vm (with a debug kernel):

Unpatched:
Throughput 19.1572 MB/sec  32 clients  32 procs  max_latency=1005.229 ms

Patched:
Throughput 23.7015 MB/sec  32 clients  32 procs  max_latency=809.206 ms

The patched results (this patch is applied) are similar to the results of
a kernel with the commit 56f23fdbb6 ("Btrfs: fix file/data loss caused
by fsync after rename and new inode") reverted.

This change avoids the fallback to a transaction commit and instead makes
sure all the names of the conflicting inode (the one that had a name in a
past transaction that matches the name of the new file in the same parent
directory) are logged so that at log replay time we don't lose neither the
new file nor the old file, and the old file gets the name it was renamed
to.

This also ends up avoiding a full transaction commit for a similar case
that involves an unlink instead of a rename of the old file:

  at transaction N
  create file A at directory D
  at transaction N + M (where M >= 1)
  remove file A
  create a new file named A at directory D
  fsync the new file
  power fail

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:32:14 +01:00
Filipe Manana
67710892ec Btrfs: be more precise on errors when getting an inode from disk
When we attempt to read an inode from disk, we end up always returning an
-ESTALE error to the caller regardless of the actual failure reason, which
can be an out of memory problem (when allocating a path), some error found
when reading from the fs/subvolume btree (like a genuine IO error) or the
inode does not exists. So lets start returning the real error code to the
callers so that they don't treat all -ESTALE errors as meaning that the
inode does not exists (such as during orphan cleanup). This will also be
needed for a subsequent patch in the same series dealing with a special
fsync case.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-01 07:32:03 +01:00
Linus Torvalds
ba929b6646 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This pull is dedicated to Josef's enospc rework, which we've been
  testing for a few releases now.  It fixes some early enospc problems
  and is dramatically faster.

  This also includes an updated fix for the delalloc accounting that
  happens after a fault in copy_from_user.  My patch in v4.7 was almost
  but not quite enough"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix delalloc accounting after copy_from_user faults
  Btrfs: avoid deadlocks during reservations in btrfs_truncate_block
  Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes
  Btrfs: fill relocation block rsv after allocation
  Btrfs: always use trans->block_rsv for orphans
  Btrfs: change how we calculate the global block rsv
  Btrfs: use root when checking need_async_flush
  Btrfs: don't bother kicking async if there's nothing to reclaim
  Btrfs: fix release reserved extents trace points
  Btrfs: add fsid to some tracepoints
  Btrfs: add tracepoints for flush events
  Btrfs: fix delalloc reservation amount tracepoint
  Btrfs: trace pinned extents
  Btrfs: introduce ticketed enospc infrastructure
  Btrfs: add tracepoint for adding block groups
  Btrfs: warn_on for unaccounted spaces
  Btrfs: change delayed reservation fallback behavior
  Btrfs: always reserve metadata for delalloc extents
  Btrfs: fix callers of btrfs_block_rsv_migrate
  Btrfs: add bytes_readonly to the spaceinfo at once
2016-07-31 21:27:32 -04:00
Linus Torvalds
d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Jeff Mahoney
66642832f0 btrfs: btrfs_abort_transaction, drop root parameter
__btrfs_abort_transaction doesn't use its root parameter except to
obtain an fs_info pointer.  We can obtain that from trans->root->fs_info
for now and from trans->fs_info in a later patch.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:26 +02:00
Jeff Mahoney
f5ee5c9ac5 btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
Now that we have a dummy fs_info associated with each test that
uses a root, we don't need the DUMMY_ROOT bit anymore.  This lets
us make choices without needing an actual root like in e.g.
btrfs_find_create_tree_block.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:19 +02:00
Jeff Mahoney
3cdde2240d btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
btrfs_test_opt and friends only use the root pointer to access
the fs_info.  Let's pass the fs_info directly in preparation to
eliminate similar patterns all over btrfs.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:53:16 +02:00
Wang Xiaoguang
dda3245eca btrfs: expand cow_file_range() to support in-band dedup and subpage-blocksize
Extract cow_file_range() new parameters for both in-band dedupe and
subpage sector size patchset.

This should make conflict of both patchset to minimal, and reduce the
effort needed to rebase them.

Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Ashish Samant
c8bb0c8bd2 btrfs: Cleanup compress_file_range()
Remove unnecessary checks in compress_file_range().

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
[ minor coding style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Liu Bo
6f034ece34 Btrfs: cleanup BUG_ON in merge_bio
One can use btrfs-corrupt-block to hit BUG_ON() in merge_bio(),
thus this aims to stop anyone to panic the whole system by using
 their btrfs.

Since the error in merge_bio can only come from __btrfs_map_block()
when chunk tree mapping has something insane and __btrfs_map_block()
has already had printed the reason, we can just return errors in
merge_bio.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Nikolay Borisov
fba4b69771 btrfs: Fix slab accounting flags
BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs to be shrunk. This
means those caches are not in fact reclaimable.

To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Nikolay Borisov
3d48d9810d btrfs: Handle uninitialised inode eviction
The code flow in btrfs_new_inode allows for btrfs_evict_inode to be
called with not fully initialised inode (e.g. ->root member not
being set). This can happen when btrfs_set_inode_index in
btrfs_new_inode fails, which in turn would call iput for the newly
allocated inode. This in turn leads to vfs calling into btrfs_evict_inode.
This leads to null pointer dereference. To handle this situation check whether
the passed inode has root set and just free it in case it doesn't.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Josef Bacik
25d609f86d Btrfs: fix callers of btrfs_block_rsv_migrate
So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only this but it unconditionally changes the size of the block_rsv.  This
isn't a bug strictly speaking, but it makes truncate block rsv's look funny
because every time we migrate bytes over its size grows, even though we only
want it to be a specific size.  So collapse this into one function that takes an
update_size argument and make truncate and evict not update the size for
consistency sake.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07 18:45:53 +02:00
Linus Torvalds
da2f6aba4a Merge branch 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes part 2 from Chris Mason:
 "This has one patch from Omar to bring iterate_shared back to btrfs.

  We have a tree of work we queue up for directory items and it doesn't
  lend itself well to shared access.  While we're cleaning it up, Omar
  has changed things to use an exclusive lock when there are delayed
  items"

* 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
2016-06-25 08:53:38 -07:00
Linus Torvalds
b971712afc Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I have a two part pull this time because one of the patches Dave
  Sterba collected needed to be against v4.7-rc2 or higher (we used
  rc4).  I try to make my for-linus-xx branch testable on top of the
  last major so we can hand fixes to people on the list more easily, so
  I've split this pull in two.

  This first part has some fixes and two performance improvements that
  we've been testing for some time.

  Josef's two performance fixes are most notable.  The transid tracking
  patch makes a big improvement on pretty much every workload"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: Force stripesize to the value of sectorsize
  btrfs: fix disk_i_size update bug when fallocate() fails
  Btrfs: fix error handling in map_private_extent_buffer
  Btrfs: fix error return code in btrfs_init_test_fs()
  Btrfs: don't do nocow check unless we have to
  btrfs: fix deadlock in delayed_ref_async_start
  Btrfs: track transid for delayed ref flushing
2016-06-25 08:42:31 -07:00
Omar Sandoval
02dbfc99b4 Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
Commit fe742fd4f9 ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.

This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.

Tested with xfstests and by doing a parallel kernel build:

	while make tinyconfig && make -j4 && git clean dqfx; do
		:
	done

along with a bunch of parallel finds in another shell:

	while true; do
		for ((i=0; i<4; i++)); do
			find . >/dev/null &
		done
		wait
	done

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-25 06:20:10 -07:00
Josef Bacik
31b9655f43 Btrfs: track transid for delayed ref flushing
Using the offwakecputime bpf script I noticed most of our time was spent waiting
on the delayed ref throttling.  This is what is supposed to happen, but
sometimes the transaction can commit and then we're waiting for throttling that
doesn't matter anymore.  So change this stuff to be a little smarter by tracking
the transid we were in when we initiated the throttling.  If the transaction we
get is different then we can just bail out.  This resulted in a 50% speedup in
my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
over the entire run (which is about 30 minutes).  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22 17:54:18 -07:00
Linus Torvalds
4c6459f945 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The most user visible change here is a fix for our recent superblock
  validation checks that were causing problems on non-4k pagesized
  systems"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize
  btrfs: remove build fixup for qgroup_account_snapshot
  btrfs: use new error message helper in qgroup_account_snapshot
  btrfs: avoid blocking open_ctree from cleaner_kthread
  Btrfs: don't BUG_ON() in btrfs_orphan_add
  btrfs: account for non-CoW'd blocks in btrfs_abort_transaction
  Btrfs: check if extent buffer is aligned to sectorsize
  btrfs: Use correct format specifier
2016-06-18 05:57:59 -10:00
Josef Bacik
3b6571c180 Btrfs: don't BUG_ON() in btrfs_orphan_add
This is just a screwup for developers, so change it to an ASSERT() so developers
notice when things go wrong and deal with the error appropriately if ASSERT()
isn't enabled.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17 18:32:40 +02:00
Mike Christie
6296b9604f block, drivers, fs: shrink bi_rw from long to int
We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
81a75f6781 btrfs: use bio fields for op and flags
The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
b3d3fa5199 btrfs: update __btrfs_map_block for REQ_OP transition
We no longer pass in a bitmap of rq_flag_bits bits to __btrfs_map_block.
It will always be a REQ_OP, or the btrfs specific REQ_GET_READ_MIRRORS,
so this drops the bit tests.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
37226b2111 btrfs: use bio op accessors
This should be the easier cases to convert btrfs to
bio_set_op_attrs/bio_op.
They are mostly just cut and replace type of changes.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
8a4c1e42e0 direct-io: use bio set/get op accessors
This patch has the dio code use a REQ_OP for the op and rq_flag_bits
for bi_rw flags. To set/get the op it uses the bio_set_op_attrs/bio_op
accssors.

It also begins to convert btrfs's dio_submit_t because of the dio
submit_io callout use. The next patches will completely convert
this code and the reset of the btrfs code paths.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Linus Torvalds
b2d5ad8223 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The important part of this pull is Filipe's set of fixes for btrfs
  device replacement.  Filipe fixed a few issues seen on the list and a
  number he found on his own"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
  Btrfs: fix race between device replace and read repair
  Btrfs: fix race between device replace and discard
  Btrfs: fix race between device replace and chunk allocation
  Btrfs: fix race setting block group back to RW mode during device replace
  Btrfs: fix unprotected assignment of the left cursor for device replace
  Btrfs: fix race setting block group readonly during device replace
  Btrfs: fix race between device replace and block group removal
  Btrfs: fix race between readahead and device replace/removal
2016-06-04 11:56:28 -07:00
Chris Mason
8dff9c8534 Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
When dealing with inline extents, btrfs_get_extent will incorrectly try
to insert a duplicate extent_map.  The dup hits -EEXIST from
add_extent_map, but then we try to merge with the existing one and end
up trying to insert a zero length extent_map.

This actually works most of the time, except when there are extent maps
past the end of the inline extent.  rocksdb will trigger this sometimes
because it preallocates an extent and then truncates down.

Josef made a script to trigger with xfs_io:

	#!/bin/bash

	xfs_io -f -c "pwrite 0 1000" inline
	xfs_io -c "falloc -k 4k 1M" inline
	xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
	xfs_io -c "fadvise -d 0 1000" inline
	cat inline

You'll get EIOs trying to read inline after this because add_extent_map
is returning EEXIST

Signed-off-by: Chris Mason <clm@fb.com>
2016-06-03 12:32:34 -07:00
Linus Torvalds
559b6d90a0 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs cleanups and fixes from Chris Mason:
 "We have another round of fixes and a few cleanups.

  I have a fix for short returns from btrfs_copy_from_user, which
  finally nails down a very hard to find regression we added in v4.6.

  Dave is pushing around gfp parameters, mostly to cleanup internal apis
  and make it a little more consistent.

  The rest are smaller fixes, and one speelling fixup patch"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits)
  Btrfs: fix handling of faults from btrfs_copy_from_user
  btrfs: fix string and comment grammatical issues and typos
  btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
  Btrfs: fix unexpected return value of fiemap
  Btrfs: free sys_array eb as soon as possible
  btrfs: sink gfp parameter to convert_extent_bit
  btrfs: make state preallocation more speculative in __set_extent_bit
  btrfs: untangle gotos a bit in convert_extent_bit
  btrfs: untangle gotos a bit in __clear_extent_bit
  btrfs: untangle gotos a bit in __set_extent_bit
  btrfs: sink gfp parameter to set_record_extent_bits
  btrfs: sink gfp parameter to set_extent_new
  btrfs: sink gfp parameter to set_extent_defrag
  btrfs: sink gfp parameter to set_extent_delalloc
  btrfs: sink gfp parameter to clear_extent_dirty
  btrfs: sink gfp parameter to clear_record_extent_bits
  btrfs: sink gfp parameter to clear_extent_bits
  btrfs: sink gfp parameter to set_extent_bits
  btrfs: make find_workspace warn if there are no workspaces
  btrfs: make find_workspace always succeed
  ...
2016-05-27 16:37:36 -07:00
David Sterba
42f31734eb Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
Nicholas D Steeves
0132761017 btrfs: fix string and comment grammatical issues and typos
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25 22:35:14 +02:00
Linus Torvalds
07be1337b9 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has our merge window series of cleanups and fixes.  These target
  a wide range of issues, but do include some important fixes for
  qgroups, O_DIRECT, and fsync handling.  Jeff Mahoney moved around a
  few definitions to make them easier for userland to consume.

  Also whiteout support is included now that issues with overlayfs have
  been cleared up.

  I have one more fix pending for page faults during btrfs_copy_from_user,
  but I wanted to get this bulk out the door first"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (90 commits)
  btrfs: fix memory leak during RAID 5/6 device replacement
  Btrfs: add semaphore to synchronize direct IO writes with fsync
  Btrfs: fix race between block group relocation and nocow writes
  Btrfs: fix race between fsync and direct IO writes for prealloc extents
  Btrfs: fix number of transaction units for renames with whiteout
  Btrfs: pin logs earlier when doing a rename exchange operation
  Btrfs: unpin logs if rename exchange operation fails
  Btrfs: fix inode leak on failure to setup whiteout inode in rename
  btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT
  Btrfs: pin log earlier when renaming
  Btrfs: unpin log if rename operation fails
  Btrfs: don't do unnecessary delalloc flushes when relocating
  Btrfs: don't wait for unrelated IO to finish before relocation
  Btrfs: fix empty symlink after creating symlink and fsync parent dir
  Btrfs: fix for incorrect directory entries after fsync log replay
  btrfs: build fixup for qgroup_account_snapshot
  btrfs: qgroup: Fix qgroup accounting when creating snapshot
  Btrfs: fix fspath error deallocation
  btrfs: make find_workspace warn if there are no workspaces
  btrfs: make find_workspace always succeed
  ...
2016-05-21 10:49:22 -07:00
Linus Torvalds
e34df3344d Merge branch 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel lookup fixups from Al Viro:
 "Fix for xfs parallel readdir (turns out the cxfs exposure was not
  enough to catch all problems), and a reversion of btrfs back to
  ->iterate() until the fs/btrfs/delayed-inode.c gets fixed"

* 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xfs: concurrent readdir hangs on data buffer locks
  Revert "btrfs: switch to ->iterate_shared()"
2016-05-18 10:28:45 -07:00
Al Viro
fe742fd4f9 Revert "btrfs: switch to ->iterate_shared()"
This reverts commit 972b241f84.
Quoth Chris:
	didn't take the delayed inode stuff into account
	it got an rbtree of items and it pulls things out
	so in shared mode, its hugely racey
	sorry, lets revert and fix it for real inside of btrfs

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-18 13:19:17 -04:00
Linus Torvalds
ba5a2655c2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull remaining vfs xattr work from Al Viro:
 "The rest of work.xattr (non-cifs conversions)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  btrfs: Switch to generic xattr handlers
  ubifs: Switch to generic xattr handlers
  jfs: Switch to generic xattr handlers
  jfs: Clean up xattr name mapping
  gfs2: Switch to generic xattr handlers
  ceph: kill __ceph_removexattr()
  ceph: Switch to generic xattr handlers
  ceph: Get rid of d_find_alias in ceph_set_acl
2016-05-18 10:08:45 -07:00
Andreas Gruenbacher
e0d46f5c6e btrfs: Switch to generic xattr handlers
The btrfs_{set,remove}xattr inode operations check for a read-only root
(btrfs_root_readonly) before calling into generic_{set,remove}xattr.  If
this check is moved into __btrfs_setxattr, we can get rid of
btrfs_{set,remove}xattr.

This patch applies to mainline, I would like to keep it together with
the other xattr cleanups if possible, though.  Could you please review?

Thanks,
Andreas

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-17 19:17:09 -04:00
Linus Torvalds
c2e7b20705 Merge branch 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs cleanups from Al Viro:
 "More cleanups from Christoph"

* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nfsd: use RWF_SYNC
  fs: add RWF_DSYNC aand RWF_SYNC
  ceph: use generic_write_sync
  fs: simplify the generic_write_sync prototype
  fs: add IOCB_SYNC and IOCB_DSYNC
  direct-io: remove the offset argument to dio_complete
  direct-io: eliminate the offset argument to ->direct_IO
  xfs: eliminate the pos variable in xfs_file_dio_aio_write
  filemap: remove the pos argument to generic_file_direct_write
  filemap: remove pos variables in generic_file_read_iter
2016-05-17 15:05:23 -07:00
Chris Mason
c315ef8d9d Merge branch 'for-chris-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.7
Signed-off-by: Chris Mason <clm@fb.com>
2016-05-17 14:43:19 -07:00
Filipe Manana
5f9a8a51d8 Btrfs: add semaphore to synchronize direct IO writes with fsync
Due to the optimization of lockless direct IO writes (the inode's i_mutex
is not held) introduced in commit 38851cc19a ("Btrfs: implement unlocked
dio write"), we started having races between such writes with concurrent
fsync operations that use the fast fsync path. These races were addressed
in the patches titled "Btrfs: fix race between fsync and lockless direct
IO writes" and "Btrfs: fix race between fsync and direct IO writes for
prealloc extents". The races happened because the direct IO path, like
every other write path, does create extent maps followed by the
corresponding ordered extents while the fast fsync path collected first
ordered extents and then it collected extent maps. This made it possible
to log file extent items (based on the collected extent maps) without
waiting for the corresponding ordered extents to complete (get their IO
done). The two fixes mentioned before added a solution that consists of
making the direct IO path create first the ordered extents and then the
extent maps, while the fsync path attempts to collect any new ordered
extents once it collects the extent maps. This was simple and did not
require adding any synchonization primitive to any data structure (struct
btrfs_inode for example) but it makes things more fragile for future
development endeavours and adds an exceptional approach compared to the
other write paths.

This change adds a read-write semaphore to the btrfs inode structure and
makes the direct IO path create the extent maps and the ordered extents
while holding read access on that semaphore, while the fast fsync path
collects extent maps and ordered extents while holding write access on
that semaphore. The logic for direct IO write path is encapsulated in a
new helper function that is used both for cow and nocow direct IO writes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13 01:59:36 +01:00
Filipe Manana
f78c436c39 Btrfs: fix race between block group relocation and nocow writes
Relocation of a block group waits for all existing tasks flushing
dellaloc, starting direct IO writes and any ordered extents before
starting the relocation process. However for direct IO writes that end
up doing nocow (inode either has the flag nodatacow set or the write is
against a prealloc extent) we have a short time window that allows for a
race that makes relocation proceed without waiting for the direct IO
write to complete first, resulting in data loss after the relocation
finishes. This is illustrated by the following diagram:

           CPU 1                                     CPU 2

 btrfs_relocate_block_group(bg X)

                                               direct IO write starts against
                                               an extent in block group X
                                               using nocow mode (inode has the
                                               nodatacow flag or the write is
                                               for a prealloc extent)

                                               btrfs_direct_IO()
                                                 btrfs_get_blocks_direct()
                                                   --> can_nocow_extent() returns 1

   btrfs_inc_block_group_ro(bg X)
     --> turns block group into RO mode

   btrfs_wait_ordered_roots()
     --> returns and does not know about
         the DIO write happening at CPU 2
         (the task there has not created
          yet an ordered extent)

   relocate_block_group(bg X)
     --> rc->stage == MOVE_DATA_EXTENTS

     find_next_extent()
       --> returns extent that the DIO
           write is going to write to

     relocate_data_extent()

       relocate_file_extent_cluster()

         --> reads the extent from disk into
             pages belonging to the relocation
             inode and dirties them

                                                   --> creates DIO ordered extent

                                                 btrfs_submit_direct()
                                                   --> submits bio against a location
                                                       on disk obtained from an extent
                                                       map before the relocation started

   btrfs_wait_ordered_range()
     --> writes all the pages read before
         to disk (belonging to the
         relocation inode)

   relocation finishes

                                                 bio completes and wrote new data
                                                 to the old location of the block
                                                 group

So fix this by tracking the number of nocow writers for a block group and
make sure relocation waits for that number to go down to 0 before starting
to move the extents.

The same race can also happen with buffered writes in nocow mode since the
patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
when relocating", because we are no longer flushing all delalloc which
served as a synchonization mechanism (due to page locking) and ensured
the ordered extents for nocow buffered writes were created before we
called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
mode existed before that patch (no pages are locked or used during direct
IO) and that fixed only races with direct IO writes that do cow.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13 01:59:34 +01:00
Filipe Manana
0b901916a0 Btrfs: fix race between fsync and direct IO writes for prealloc extents
When we do a direct IO write against a preallocated extent (fallocate)
that does not go beyond the i_size of the inode, we do the write operation
without holding the inode's i_mutex (an optimization that landed in
commit 38851cc19a ("Btrfs: implement unlocked dio write")). This allows
for a very tiny time window where a race can happen with a concurrent
fsync using the fast code path, as the direct IO write path creates first
a new extent map (no longer flagged as a prealloc extent) and then it
creates the ordered extent, while the fast fsync path first collects
ordered extents and then it collects extent maps. This allows for the
possibility of the fast fsync path to collect the new extent map without
collecting the new ordered extent, and therefore logging an extent item
based on the extent map without waiting for the ordered extent to be
created and complete. This can result in a situation where after a log
replay we end up with an extent not marked anymore as prealloc but it was
only partially written (or not written at all), exposing random, stale or
garbage data corresponding to the unwritten pages and without any
checksums in the csum tree covering the extent's range.

This is an extension of what was done in commit de0ee0edb2 ("Btrfs: fix
race between fsync and lockless direct IO writes").

So fix this by creating first the ordered extent and then the extent
map, so that this way if the fast fsync patch collects the new extent
map it also collects the corresponding ordered extent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13 01:59:32 +01:00
Filipe Manana
5062af35c3 Btrfs: fix number of transaction units for renames with whiteout
When we do a rename with the whiteout flag, we need to create the whiteout
inode, which in the worst case requires 5 transaction units (1 inode item,
1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the
number of transaction units from 11 to 16 if the whiteout flag is set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:30 +01:00
Filipe Manana
376e5a57bf Btrfs: pin logs earlier when doing a rename exchange operation
The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(),
which had a race fixed by my previous patch titled "Btrfs: pin log earlier
when renaming", and so it suffers from the same problem.

We pin the logs of the affected roots after we insert the new inode
references, leaving a time window where concurrent tasks logging the
inodes can end up logging both the new and old references, resulting
in log trees that when replayed can turn the metadata into inconsistent
states. This behaviour was added to btrfs_rename() in 2009 without any
explanation about why not pinning the logs earlier, just leaving a
comment about the posibility for the race. As of today it's perfectly
safe and sane to pin the logs before we start doing any of the steps
involved in the rename operation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:28 +01:00
Filipe Manana
86e8aa0e77 Btrfs: unpin logs if rename exchange operation fails
If rename exchange operations fail at some point after we pinned any of
the logs, we end up aborting the current transaction but never unpin the
logs, which leaves concurrent tasks that are trying to sync the logs (as
part of an fsync request from user space) blocked forever and preventing
the filesystem from being unmountable.

Fix this by safely unpinning the log.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:26 +01:00
Filipe Manana
c990161888 Btrfs: fix inode leak on failure to setup whiteout inode in rename
If we failed to fully setup the whiteout inode during a rename operation
with the whiteout flag, we ended up leaking the inode, not decrementing
its link count nor removing all its items from the fs/subvol tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:23 +01:00
Dan Fuhry
cdd1fedf82 btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT
Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new
behavior in the renameat2() syscall. This behavior is primarily used by
overlayfs. This patch adds support for these flags to btrfs, enabling it to
be used as a fully functional upper layer for overlayfs.

RENAME_EXCHANGE support was written by Davide Italiano originally
submitted on 2 April 2015.

Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Signed-off-by: Dan Fuhry <dfuhry@datto.com>
[ remove unlikely ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:21 +01:00
Filipe Manana
c4aba95454 Btrfs: pin log earlier when renaming
We were pinning the log right after the first step in the rename operation
(inserting inode ref for the new name in the destination directory)
instead of doing it before. This behaviour was introduced in 2009 for some
reason that was not mentioned neither on the changelog nor any comment,
with the drawback of a small time window where concurrent log writers can
end up logging the new inode reference for the inode we are renaming while
the rename operation is in progress (so that we can end up with a log
containing both the new and old references). As of today there's no reason
to not pin the log before that first step anymore, so just fix this.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:19 +01:00
Filipe Manana
3dc9e8f767 Btrfs: unpin log if rename operation fails
If rename operations fail at some point after we pinned the log, we end
up aborting the current transaction but never unpin the log, which leaves
concurrent tasks that are trying to sync the log (as part of an fsync
request from user space) blocked forever and preventing the filesystem
from being unmountable.

Fix this by safely unpinning the log.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:18 +01:00
Filipe Manana
9cfa3e34e2 Btrfs: don't do unnecessary delalloc flushes when relocating
Before we start the actual relocation process of a block group, we do
calls to flush delalloc of all inodes and then wait for ordered extents
to complete. However we do these flush calls just to make sure we don't
race with concurrent tasks that have actually already started to run
delalloc and have allocated an extent from the block group we want to
relocate, right before we set it to readonly mode, but have not yet
created the respective ordered extents. The flush calls make us wait
for such concurrent tasks because they end up calling
filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
__start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
btrfs_run_delalloc_work()) which ends up serializing us with those tasks
due to attempts to lock the same pages (and the delalloc flush procedure
calls the allocator and creates the ordered extents before unlocking the
pages).

These flushing calls not only make us waste time (cpu, IO) but also reduce
the chances of writing larger extents (applications might be writing to
contiguous ranges and we flush before they finish dirtying the whole
ranges).

So make sure we don't flush delalloc and just wait for concurrent tasks
that have already started flushing delalloc and have allocated an extent
from the block group we are about to relocate.

This change also ends up fixing a race with direct IO writes that makes
relocation not wait for direct IO ordered extents. This race is
illustrated by the following diagram:

        CPU 1                                       CPU 2

 btrfs_relocate_block_group(bg X)

                                           starts direct IO write,
                                           target inode currently has no
                                           ordered extents ongoing nor
                                           dirty pages (delalloc regions),
                                           therefore the root for our inode
                                           is not in the list
                                           fs_info->ordered_roots

                                           btrfs_direct_IO()
                                             __blockdev_direct_IO()
                                               btrfs_get_blocks_direct()
                                                 btrfs_lock_extent_direct()
                                                   locks range in the io tree
                                                 btrfs_new_extent_direct()
                                                   btrfs_reserve_extent()
                                                     --> extent allocated
                                                         from bg X

   btrfs_inc_block_group_ro(bg X)

   btrfs_start_delalloc_roots()
     __start_delalloc_inodes()
       --> does nothing, no dealloc ranges
           in the inode's io tree so the
           inode's root is not in the list
           fs_info->delalloc_roots

   btrfs_wait_ordered_roots()
     --> does not find the inode's root in the
         list fs_info->ordered_roots

     --> ends up not waiting for the direct IO
         write started by the task at CPU 2

   relocate_block_group(rc->stage ==
     MOVE_DATA_EXTENTS)

     prepare_to_relocate()
       btrfs_commit_transaction()

     iterates the extent tree, using its
     commit root and moves extents into new
     locations

                                                   btrfs_add_ordered_extent_dio()
                                                     --> now a ordered extent is
                                                         created and added to the
                                                         list root->ordered_extents
                                                         and the root added to the
                                                         list fs_info->ordered_roots
                                                     --> this is too late and the
                                                         task at CPU 1 already
                                                         started the relocation

     btrfs_commit_transaction()

                                                   btrfs_finish_ordered_io()
                                                     btrfs_alloc_reserved_file_extent()
                                                       --> adds delayed data reference
                                                           for the extent allocated
                                                           from bg X

   relocate_block_group(rc->stage ==
     UPDATE_DATA_PTRS)

     prepare_to_relocate()
       btrfs_commit_transaction()
         --> delayed refs are run, so an extent
             item for the allocated extent from
             bg X is added to extent tree
         --> commit roots are switched, so the
             next scan in the extent tree will
             see the extent item

     sees the extent in the extent tree

When this happens the relocation produces the following warning when it
finishes:

[ 7260.832836] ------------[ cut here ]------------
[ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
[ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
[ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
[ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
[ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
[ 7260.852998] Call Trace:
[ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
[ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
[ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
[ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
[ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
[ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
[ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
[ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
[ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
[ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
[ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
[ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
[ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
[ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
[ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
[ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---

This is because at the end of the first stage, in relocate_block_group(),
we commit the current transaction, which makes delayed refs run, the
commit roots are switched and so the second stage will find the extent
item that the ordered extent added to the delayed refs. But this extent
was not moved (ordered extent completed after first stage finished), so
at the end of the relocation our block group item still has a positive
used bytes counter, triggering a warning at the end of
btrfs_relocate_block_group(). Later on when trying to read the extent
contents from disk we hit a BUG_ON() due to the inability to map a block
with a logical address that belongs to the block group we relocated and
is no longer valid, resulting in the following trace:

[ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
[ 7344.887518] ------------[ cut here ]------------
[ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
[ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
[ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
[ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
[ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
[ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
[ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
[ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
[ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
[ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
[ 7344.888431] Stack:
[ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
[ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
[ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
[ 7344.888431] Call Trace:
[ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
[ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
[ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
[ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
[ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
[ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
[ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
[ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
[ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
[ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
[ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
[ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
[ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
[ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
[ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
[ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
[ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
[ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
[ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
[ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
[ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431]  RSP <ffff8802046878f0>
[ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-05-13 01:59:16 +01:00
Al Viro
972b241f84 btrfs: switch to ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09 11:42:19 -04:00
Christoph Hellwig
c8b8e32d70 direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io.  It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
David Sterba
7cd8c7527c btrfs: sink gfp parameter to set_extent_delalloc
Callers pass GFP_NOFS and tests pass GFP_KERNEL, but using NOFS there
does not hurt. No need to pass the flags around.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
David Sterba
91166212e0 btrfs: sink gfp parameter to clear_extent_bits
Callers pass GFP_NOFS and GFP_KERNEL. No need to pass the flags around.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-29 11:01:47 +02:00
Luke Dashjr
4c63c2454e btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl
32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can
be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr
fail.

Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org>
Cc: stable@vger.kernel.org
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:40:27 +02:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Filipe Manana
ade770294d Btrfs: fix deadlock between direct IO reads and buffered writes
While running a test with a mix of buffered IO and direct IO against
the same files I hit a deadlock reported by the following trace:

[11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
[11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
[11642.149771]     0 15282      2 0x00000000
[11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
[11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
[11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
[11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
[11642.161403] Call Trace:
[11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
[11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
[11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
[11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
[11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.226295] 2 locks held by kworker/u32:3/15282:
[11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
[11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
[11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
[11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
[11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
[11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
[11642.243715] Call Trace:
[11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
[11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
[11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
[11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
[11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
[11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
[11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.282604] 3 locks held by kworker/u32:8/15289:
[11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
[11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
[11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
[11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
[11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
[11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
[11642.298433] Call Trace:
[11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
[11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
[11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
[11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
[11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
[11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
[11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
[11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
[11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.310403] 3 locks held by fdm-stress/26848:
[11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
[11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
[11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
[11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
[11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
[11642.325096] Call Trace:
[11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
[11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
[11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
[11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
[11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
[11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
[11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
[11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
[11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
[11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
[11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
[11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
[11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
[11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
[11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
[11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
[11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
[11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
[11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
[11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.350222] 4 locks held by fdm-stress/26849:
[11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
[11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
[11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
[11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
[11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
[11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
[11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
[11642.373430] Call Trace:
[11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
[11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
[11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
[11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
[11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
[11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
[11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
[11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
[11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
[11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
[11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
[11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
[11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
[11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
[11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
[11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
[11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.416572] 1 lock held by fdm-stress/26850:
[11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
[11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
[11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
[11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
[11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
[11642.426591] Call Trace:
[11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
[11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
[11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
[11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
[11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
[11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
[11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
[11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
[11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
[11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.441533] 3 locks held by fdm-stress/26851:
[11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
[11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
[11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]

This happened because under specific timings the path for direct IO reads
can deadlock with concurrent buffered writes. The diagram below shows how
this happens for an example file that has the following layout:

     [  extent A  ]  [  extent B  ]  [ ....
     0K              4K              8K

     CPU 1                                               CPU 2                             CPU 3

DIO read against range
 [0K, 8K[ starts

btrfs_direct_IO()
  --> calls btrfs_get_blocks_direct()
      which finds the extent map for the
      extent A and leaves the range
      [0K, 4K[ locked in the inode's
      io tree

                                                   buffered write against
                                                   range [4K, 8K[ starts

                                                   __btrfs_buffered_write()
                                                     --> dirties page at 4K

                                                                                     a user space
                                                                                     task calls sync
                                                                                     for e.g or
                                                                                     writepages() is
                                                                                     invoked by mm

                                                                                     writepages()
                                                                                       run_delalloc_range()
                                                                                         cow_file_range()
                                                                                           --> ordered extent X
                                                                                               for the buffered
                                                                                               write is created
                                                                                               and
                                                                                               writeback starts

  --> calls btrfs_get_blocks_direct()
      again, without submitting first
      a bio for reading extent A, and
      finds the extent map for extent B

  --> calls lock_extent_direct()

      --> locks range [4K, 8K[
      --> finds ordered extent X
          covering range [4K, 8K[
      --> unlocks range [4K, 8K[

                                                  buffered write against
                                                  range [0K, 8K[ starts

                                                  __btrfs_buffered_write()
                                                    prepare_pages()
                                                      --> locks pages with
                                                          offsets 0 and 4K
                                                    lock_and_cleanup_extent_if_need()
                                                      --> blocks attempting to
                                                          lock range [0K, 8K[ in
                                                          the inode's io tree,
                                                          because the range [0, 4K[
                                                          is already locked by the
                                                          direct IO task at CPU 1

      --> calls
          btrfs_start_ordered_extent(oe X)

          btrfs_start_ordered_extent(oe X)

            --> At this point writeback for ordered
                extent X has not finished yet

            filemap_fdatawrite_range()
              btrfs_writepages()
                extent_writepages()
                  extent_write_cache_pages()
                    --> finds page with offset 0
                        with the writeback tag
                        (and not dirty)
                    --> tries to lock it
                         --> deadlock, task at CPU 2
                             has the page locked and
                             is blocked on the io range
                             [0, 4K[ that was locked
                             earlier by this task

So fix this by falling back to a buffered read in the direct IO read path
when an ordered extent for a buffered write is found.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:37 -08:00
Chris Mason
c05c5ee5ea Btrfs patchsets for 4.6
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJW0GSnAAoJEMVl1fnXbVg75qAP/0xbZPJtvTgRMSRnARtFJ28w
 vCsxqY+AatNJDuEpg2My/vscZvAVXGcTWjnM8NkXMMKN+oags47QN4qD0cuNv2kI
 JWcz7Ppt3GY6lcQbTj/Ce6N8RPRCNGsU7vxev+sKZ+jjXn+vuc+wKXnyJgaL1qcN
 XhcP2MccrXTVVJXLbGMFoaJXWWfd2i9uJ2MplmjFP7HQi5zP+5t/dsVaAQbc1dqx
 2TqgTJkUEPQqK8geAKom5wdLTmpLSgMWvg1m4lkYpDO89Fi+hFAKeeuJZvNutxVa
 hA0QLrLyZmr4tbZhM1of35Kl7N1uwCzOd8u6xsxurB12bibz67RbQpK+fazlCjKa
 wZJvJV+N3gqgCusLHlXYX0YalQxpWRQiKkjzpMy3Pq4K4soLrw20tQOnnBFhLR1y
 ZwqmZUN33lhFNCIWqLS4BLqDG+Z7Sf2aGhFtspMDjSUJe9gLbIpvH9sW6CexJI2r
 FnxTaVZ08uY0ky1dvZcRDR6zDDbVUpoQKWmwdZpxoEO1eLKjD01VsMOw5zlAaxdc
 a5SxKMVt0Gq56oTPgp0MuLHJr20pxx03yr+yl69VM8R1dAG/y61Dq5DwiFNQ8+J6
 jrX+eVYGBgTNYw/UGb14UPwVjQFFEs/vouphy6MmOVvNz+YZI6thN1uScB0vw7BV
 p/oFts5Fo0ipJgaBzGu4
 =CRdD
 -----END PGP SIGNATURE-----

Merge tag 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6

Btrfs patchsets for 4.6
2016-03-01 08:13:56 -08:00
David Sterba
f004fae0cf Merge branch 'cleanups-4.6' into for-chris-4.6 2016-02-26 15:38:33 +01:00
David Sterba
675d276b32 Merge branch 'foreign/liubo/replace-lockup' into for-chris-4.6 2016-02-26 15:38:32 +01:00
David Sterba
e9ddd77a31 Merge branch 'foreign/josef/space-updates' into for-chris-4.6 2016-02-26 15:38:31 +01:00
David Sterba
e22b3d1fbe Merge branch 'dev/gfp-flags' into for-chris-4.6 2016-02-26 15:38:28 +01:00
David Sterba
5f1b5664d9 Merge branch 'chandan/prep-subpage-blocksize' into for-chris-4.6
# Conflicts:
#	fs/btrfs/file.c
2016-02-26 15:38:28 +01:00
Linus Torvalds
ce6b71432d Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "My for-linus-4.5 branch has a btrfs DIO error passing fix.

  I know how much you love DIO, so I'm going to suggest against reading
  it.  We'll follow up with a patch to drop the error arg from
  dio_end_io in the next merge window."

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix direct IO requests not reporting IO error to user space
2016-02-19 13:40:42 -08:00
Kinglong Mee
5598e9005a btrfs: drop null testing before destroy functions
Cleanup.

kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
Deepa Dinamani
04b285f35e btrfs: Replace CURRENT_TIME by current_fs_time()
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
Josef Bacik
dc95f7bfc5 Btrfs: fix truncate_space_check
truncate_space_check is using btrfs_csum_bytes_to_leaves() but forgetting to
multiply by nodesize so we get an actual byte count.  We need a tracepoint here
so that we have the matching reserve for the release that will come later.  Also
add a comment to make clear what the intent of truncate_space_check is.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:22:24 +01:00
Filipe Manana
1636d1d77e Btrfs: fix direct IO requests not reporting IO error to user space
If a bio for a direct IO request fails, we were not setting the error in
the parent bio (the main DIO bio), making us not return the error to
user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
return the number of bytes issued for IO and not the error a bio created
and submitted by btrfs_submit_direct() got from the block layer.
This essentially happens because when we call:

   dio_end_io(dio_bio, bio->bi_error);

It does not set dio_bio->bi_error to the value of the second argument.
So just add this missing assignment in endio callbacks, just as we do in
the error path at btrfs_submit_direct() when we fail to clone the dio bio
or allocate its private object. This follows the convention of what is
done with other similar APIs such as bio_endio() where the caller is
responsible for setting the bi_error field in the bio it passes as an
argument to bio_endio().

This was detected by the new generic test cases in xfstests: 271, 272,
276 and 278. Which essentially setup a dm error target, then load the
error table, do a direct IO write and unload the error table. They
expect the write to fail with -EIO, which was not getting reported
when testing against btrfs.

Cc: stable@vger.kernel.org  # 4.3+
Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-02-16 03:41:26 +00:00
Linus Torvalds
27c9d772e5 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has a few fixes from Filipe, along with a readdir fix from Dave
  that we've been testing for some time"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: properly set the termination value of ctx->pos in readdir
  Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
  Btrfs: remove no longer used function extent_read_full_page_nolock()
  Btrfs: fix page reading in extent_same ioctl leading to csum errors
  Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
2016-02-12 09:21:28 -08:00
David Sterba
bc4ef7592f btrfs: properly set the termination value of ctx->pos in readdir
The value of ctx->pos in the last readdir call is supposed to be set to
INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a
larger value, then it's LLONG_MAX.

There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
the increment.

We can get to that situation like that:

* emit all regular readdir entries
* still in the same call to readdir, bump the last pos to INT_MAX
* next call to readdir will not emit any entries, but will reach the
  bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX

Normally this is not a problem, but if we call readdir again, we'll find
'pos' set to LLONG_MAX and the unconditional increment will overflow.

The report from Victor at
(http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
print shows that pattern:

 Overflow: e
 Overflow: 7fffffff
 Overflow: 7fffffffffffffff
 PAX: size overflow detected in function btrfs_real_readdir
   fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
   context: dir_context;
 CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
  ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
  ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
  ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
 Call Trace:
  [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
  [<ffffffff811cb706>] report_size_overflow+0x36/0x40
  [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
  [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
  [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
  [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
 Overflow: 1a
  [<ffffffff811db070>] ? iterate_dir+0x150/0x150
  [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83

The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
are not yet synced and are processed from the delayed list. Then the code
could go to the bump section again even though it might not emit any new
dir entries from the delayed list.

The fix avoids entering the "bump" section again once we've finished
emitting the entries, both for synced and delayed entries.

References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
Reported-by: Victor <services@swwu.com>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-02-11 07:01:59 -08:00
David Sterba
49e350a491 btrfs: readdir: use GFP_KERNEL
Readdir is initiated from userspace and is not on the critical
writeback path, we don't need to use GFP_NOFS for allocations.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
Chandan Rajendra
27772b68f6 Btrfs: Clean pte corresponding to page straddling i_size
When extending a file by either "truncate up" or by writing beyond i_size, the
page which had i_size needs to be marked "read only" so that future writes to
the page via mmap interface causes btrfs_page_mkwrite() to be invoked. If not,
a write performed after extending the file via the mmap interface will find
the page to be writaeable and continue writing to the page without invoking
btrfs_page_mkwrite() i.e. we end up writing to a file without reserving disk
space.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
5a2834f808 Btrfs: Fix block size returned to user space
btrfs_getattr() returns PAGE_CACHE_SIZE as the block size. Since
generic_fillattr() already does the right thing (by obtaining block size
from inode->i_blkbits), just remove the statement from btrfs_getattr.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
0c29ba993e Btrfs: Limit inline extents to root->sectorsize
cow_file_range_inline() limits the size of an inline extent to
PAGE_CACHE_SIZE. This breaks in subpagesize-blocksize scenarios. Fix this by
comparing against root->sectorsize.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
5f4dc8fc83 Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
In subpagesize-blocksize scenario, map_length can be less than the length of a
bio vector. Such a condition may cause btrfs_submit_direct_hook() to submit a
zero length bio. Fix this by comparing map_length against block size rather
than with bv_len.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
dbfdb6d1b3 Btrfs: Search for all ordered extents that could span across a page
In subpagesize-blocksize scenario it is not sufficient to search using the
first byte of the page to make sure that there are no ordered extents
present across the page. Fix this.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
d0b7da88f6 Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units
In subpagesize-blocksize scenario, if i_size occurs in a block which is not
the last block in the page, then the space to be reserved should be calculated
appropriately.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
9703fefe0b Btrfs: fallocate: Work with sectorsized blocks
While at it, this commit changes btrfs_truncate_page() to truncate sectorsized
blocks instead of pages. Hence the function has been renamed to
btrfs_truncate_block().

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:24:29 +01:00
Chandan Rajendra
2dabb32484 Btrfs: Direct I/O read: Work on sectorsized blocks
The direct I/O read's endio and corresponding repair functions work on
page sized blocks. This commit adds the ability for direct I/O read to work on
subpagesized blocks.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-01 19:23:47 +01:00
Linus Torvalds
d3f71ae711 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Dave had a small collection of fixes to the new free space tree code,
  one of which was keeping our sysfs files more up to date with feature
  bits as different things get enabled (lzo, raid5/6, etc).

  I should have kept the sysfs stuff for rc3, since we always manage to
  trip over something.  This time it was GFP_KERNEL from somewhere that
  is NOFS only.  Instead of rebasing it out I've put a revert in, and
  we'll fix it properly for rc3.

  Otherwise, Filipe fixed a btrfs DIO race and Qu Wenruo fixed up a
  use-after-free in our tracepoints that Dave Jones reported"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Revert "btrfs: synchronize incompat feature bits with sysfs files"
  btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzalloc
  btrfs: sysfs: check initialization state before updating features
  Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"
  btrfs: async-thread: Fix a use-after-free error for trace
  Btrfs: fix race between fsync and lockless direct IO writes
  btrfs: add free space tree to the cow-only list
  btrfs: add free space tree to lockdep classes
  btrfs: tweak free space tree bitmap allocation
  btrfs: tests: switch to GFP_KERNEL
  btrfs: synchronize incompat feature bits with sysfs files
  btrfs: sysfs: introduce helper for syncing bits with sysfs files
  btrfs: sysfs: add free-space-tree bit attribute
  btrfs: sysfs: fix typo in compat_ro attribute definition
2016-01-29 15:46:49 -08:00
Filipe Manana
de0ee0edb2 Btrfs: fix race between fsync and lockless direct IO writes
An fsync, using the fast path, can race with a concurrent lockless direct
IO write and end up logging a file extent item that points to an extent
that wasn't written to yet. This is because the fast fsync path collects
ordered extents into a local list and then collects all the new extent
maps to log file extent items based on them, while the direct IO write
path creates the new extent map before it creates the corresponding
ordered extent (and submitting the respective bio(s)).

So fix this by making the direct IO write path create ordered extents
before the extent maps and make the fast fsync path collect any new
ordered extents after it collects the extent maps.
Note that making the fsync handler call inode_dio_wait() (after acquiring
the inode's i_mutex) would not work and lead to a deadlock when doing
AIO, as through AIO we end up in a path where the fsync handler is called
(through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
before the inode's dio counter is decremented (inode_dio_wait() waits
for this counter to have a value of zero).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-25 16:50:26 -08:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Linus Torvalds
2101ae4289 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "These are mostly fixes that we've been testing, but also we grabbed
  and tested a few small cleanups that had been on the list for a while.

  Zhao Lei's patchset also fixes some early ENOSPC buglets"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (21 commits)
  btrfs: raid56: Use raid_write_end_io for scrub
  btrfs: Remove unnecessary ClearPageUptodate for raid56
  btrfs: use rbio->nr_pages to reduce calculation
  btrfs: Use unified stripe_page's index calculation
  btrfs: Fix calculation of rbio->dbitmap's size calculation
  btrfs: Fix no_space in write and rm loop
  btrfs: merge functions for wait snapshot creation
  btrfs: delete unused argument in btrfs_copy_from_user
  btrfs: Use direct way to determine raid56 write/recover mode
  btrfs: Small cleanup for get index_srcdev loop
  btrfs: Enhance chunk validation check
  btrfs: Enhance super validation check
  Btrfs: fix deadlock running delayed iputs at transaction commit time
  Btrfs: fix typo in log message when starting a balance
  btrfs: remove duplicate const specifier
  btrfs: initialize the seq counter in struct btrfs_device
  Btrfs: clean up an error code in btrfs_init_space_info()
  btrfs: fix iterator with update error in backref.c
  Btrfs: fix output of compression message in btrfs_parse_options()
  Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
  ...
2016-01-22 11:49:21 -08:00
Zhao Lei
0bc19f9031 btrfs: merge functions for wait snapshot creation
wait_for_snapshot_creation() is in same group with oher two:
 btrfs_start_write_no_snapshoting()
 btrfs_end_write_no_snapshoting()

Rename wait_for_snapshot_creation() and move it into same place
with other two.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20 07:22:13 -08:00
Filipe Manana
c2d6cb1636 Btrfs: fix deadlock running delayed iputs at transaction commit time
While running a stress test I ran into a deadlock when running the delayed
iputs at transaction time, which produced the following report and trace:

[  886.399989] =============================================
[  886.400871] [ INFO: possible recursive locking detected ]
[  886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
[  886.402384] ---------------------------------------------
[  886.403182] fio/8277 is trying to acquire lock:
[  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.403568]
[  886.403568] but task is already holding lock:
[  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.403568]
[  886.403568] other info that might help us debug this:
[  886.403568]  Possible unsafe locking scenario:
[  886.403568]
[  886.403568]        CPU0
[  886.403568]        ----
[  886.403568]   lock(&fs_info->delayed_iput_sem);
[  886.403568]   lock(&fs_info->delayed_iput_sem);
[  886.403568]
[  886.403568]  *** DEADLOCK ***
[  886.403568]
[  886.403568]  May be due to missing lock nesting notation
[  886.403568]
[  886.403568] 3 locks held by fio/8277:
[  886.403568]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
[  886.403568]  #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[  886.403568]  #2:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.403568]
[  886.403568] stack backtrace:
[  886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[  886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[  886.403568]  0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
[  886.403568]  ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
[  886.403568]  0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
[  886.403568] Call Trace:
[  886.403568]  [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
[  886.403568]  [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
[  886.403568]  [<ffffffff810c22db>] ? __module_address+0xdf/0x108
[  886.403568]  [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
[  886.403568]  [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
[  886.403568]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.489542]  [<ffffffff8148556b>] down_read+0x3e/0x4d
[  886.489542]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.489542]  [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[  886.489542]  [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
[  886.489542]  [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
[  886.489542]  [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
[  886.489542]  [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
[  886.489542]  [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
[  886.489542]  [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
[  886.489542]  [<ffffffff81188e31>] evict+0xa7/0x15c
[  886.489542]  [<ffffffff81189878>] iput+0x1d3/0x266
[  886.489542]  [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
[  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[  886.489542]  [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
[  886.489542]  [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
[  886.489542]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[  886.489542]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[  886.489542]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[  886.489542]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[  886.489542]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[  886.489542]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[  886.489542]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[  886.489542]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[  886.489542]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
[  886.489542]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
[ 1081.854348]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.863227] fio        D ffff880213f9bb28     0  8244   8240 0x00000000
[ 1081.868719]  ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
[ 1081.872499]  ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
[ 1081.876834]  ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
[ 1081.880782] Call Trace:
[ 1081.881793]  [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.883340]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.895525]  [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
[ 1081.897419]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1081.899251]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1081.901063]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1081.902365]  [<ffffffff814855bd>] down_write+0x43/0x57
[ 1081.903846]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.906078]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.908846]  [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
[ 1081.910409]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 1081.912482]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 1081.914597]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 1081.919037]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 1081.920754]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 1081.922496]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 1081.923922]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 1081.925275]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 1081.926584]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
[ 1081.927968]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.985293] INFO: lockdep is turned off.
[ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
[ 1081.987434]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.990147] fio        D ffff880218febbb8     0  8249   8240 0x00000000
[ 1081.991626]  ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
[ 1081.993258]  ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
[ 1081.994850]  ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
[ 1081.996485] Call Trace:
[ 1081.997037]  [<ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.998017]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.999241]  [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
[ 1082.000306]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1082.001533]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1082.002776]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1082.003995]  [<ffffffff814855bd>] down_write+0x43/0x57
[ 1082.005000]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.007403]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.008988]  [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
[ 1082.010193]  [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
[ 1082.011280]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.012265]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.013021]  [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
[ 1082.013738]  [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
[ 1082.014778]  [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
[ 1082.015778]  [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
[ 1082.016806]  [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
[ 1082.017789]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
[ 1082.018706]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f

This happens because we can recursively acquire the semaphore
fs_info->delayed_iput_sem when attempting to allocate space to satisfy
a file write request as shown in the first trace above - when committing
a transaction we acquire (down_read) the semaphore before running the
delayed iputs, and when running a delayed iput() we can end up calling
an inode's eviction handler, which in turn commits another transaction
and attempts to acquire (down_read) again the semaphore to run more
delayed iput operations.
This results in a deadlock because if a task acquires multiple times a
semaphore it should invoke down_read_nested() with a different lockdep
class for each level of recursion.

Fix this by simplifying the implementation and use a mutex instead that
is acquired by the cleaner kthread before it runs the delayed iputs
instead of always acquiring a semaphore before delayed references are
run from anywhere.

Fixes: d7c151717a (btrfs: Fix NO_SPACE bug caused by delayed-iput)
Cc: stable@vger.kernel.org   # 4.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:21:41 -08:00
Linus Torvalds
c1a198d923 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has our usual assortment of fixes and cleanups, but the biggest
  change included is Omar Sandoval's free space tree.  It's not the
  default yet, mounting -o space_cache=v2 enables it and sets a readonly
  compat bit.  The tree can actually be deleted and regenerated if there
  are any problems, but it has held up really well in testing so far.

  For very large filesystems (30T+) our existing free space caching code
  can end up taking a huge amount of time during commits.  The new tree
  based code is faster and less work overall to update as the commit
  progresses.

  Omar worked on this during the summer and we'll hammer on it in
  production here at FB over the next few months"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (73 commits)
  Btrfs: fix fitrim discarding device area reserved for boot loader's use
  Btrfs: Check metadata redundancy on balance
  btrfs: statfs: report zero available if metadata are exhausted
  btrfs: preallocate path for snapshot creation at ioctl time
  btrfs: allocate root item at snapshot ioctl time
  btrfs: do an allocation earlier during snapshot creation
  btrfs: use smaller type for btrfs_path locks
  btrfs: use smaller type for btrfs_path lowest_level
  btrfs: use smaller type for btrfs_path reada
  btrfs: cleanup, use enum values for btrfs_path reada
  btrfs: constify static arrays
  btrfs: constify remaining structs with function pointers
  btrfs tests: replace whole ops structure for free space tests
  btrfs: use list_for_each_entry* in backref.c
  btrfs: use list_for_each_entry_safe in free-space-cache.c
  btrfs: use list_for_each_entry* in check-integrity.c
  Btrfs: use linux/sizes.h to represent constants
  btrfs: cleanup, remove stray return statements
  btrfs: zero out delayed node upon allocation
  btrfs: pass proper enum type to start_transaction()
  ...
2016-01-18 12:44:40 -08:00
Vladimir Davydov
5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Linus Torvalds
ddf1d6238d Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "Andreas' xattr cleanup series.

  It's a followup to his xattr work that went in last cycle; -0.5KLoC"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xattr handlers: Simplify list operation
  ocfs2: Replace list xattr handler operations
  nfs: Move call to security_inode_listsecurity into nfs_listxattr
  xfs: Change how listxattr generates synthetic attributes
  tmpfs: listxattr should include POSIX ACL xattrs
  tmpfs: Use xattr handler infrastructure
  btrfs: Use xattr handler infrastructure
  vfs: Distinguish between full xattr names and proper prefixes
  posix acls: Remove duplicate xattr name definitions
  gfs2: Remove gfs2_xattr_acl_chmod
  vfs: Remove vfs_xattr_cmp
2016-01-11 13:32:10 -08:00
Chris Mason
988f1f576d Merge branch 'for-chris-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11 08:39:28 -08:00
Chris Mason
b28cf57246 Merge branch 'misc-cleanups-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11 06:08:37 -08:00
Chris Mason
a3058101c1 Merge branch 'misc-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2016-01-11 05:59:32 -08:00
David Sterba
e4058b54d1 btrfs: cleanup, use enum values for btrfs_path reada
Replace the integers by enums for better readability. The value 2 does
not have any meaning since a717531942
"Btrfs: do less aggressive btree readahead" (2009-01-22).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:15 +01:00
David Sterba
4d4ab6d6bc btrfs: constify static arrays
There are a few statically initialized arrays that can be made const.
The remaining (like file_system_type, sysfs attributes or prop handlers)
do not allow that due to type mismatch when passed to the APIs or
because the structures are modified through other members.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:15 +01:00
David Sterba
20e5506baf btrfs: constify remaining structs with function pointers
* struct extent_io_ops
* struct btrfs_free_space_op

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:14 +01:00
Byongho Lee
ee22184b53 Btrfs: use linux/sizes.h to represent constants
We use many constants to represent size and offset value.  And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'.  However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.

So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:38:02 +01:00
David Sterba
7928d672ff btrfs: cleanup, remove stray return statements
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:30:52 +01:00
Byongho Lee
e40da0e58a btrfs: remove unused inode argument from uncompress_inline()
The inode argument is never used from the beginning, so remove it.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:29:02 +01:00
David Sterba
100d57025c btrfs: don't use slab cache for struct btrfs_delalloc_work
Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.

The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.

The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
David Sterba
8089fe62c6 btrfs: put delayed item hook into inode
Inodes for delayed iput allocate a trivial helper structure, let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.

The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use the list_splice to
process a bunch of inodes because we'd lost track of the count if the
inode is put into the delayed iputs again while it's processed.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
Josef Bacik
be7bd73084 Btrfs: igrab inode in writepage
We hit this panic on a few of our boxes this week where we have an
ordered_extent with an NULL inode.  We do an igrab() of the inode in writepages,
but weren't doing it in writepage which can be called directly from the VM on
dirty pages.  If the inode has been unlinked then we could have I_FREEING set
which means igrab() would return NULL and we get this panic.  Fix this by trying
to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
Filipe Manana
271dba4521 Btrfs: fix transaction handle leak on failure to create hard link
If we failed to create a hard link we were not always releasing the
the transaction handle we got before, resulting in a memory leak and
preventing any other tasks from being able to commit the current
transaction.
Fix this by always releasing our transaction handle.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-01-06 22:52:38 +00:00
Filipe Manana
9269d12b2d Btrfs: fix number of transaction units required to create symlink
We weren't accounting for the insertion of an inline extent item for the
symlink inode nor that we need to update the parent inode item (through
the call to btrfs_add_nondir()). So fix this by including two more
transaction units.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-31 18:18:40 +00:00
Filipe Manana
d50866d00f Btrfs: don't leave dangling dentry if symlink creation failed
When we are creating a symlink we might fail with an error after we
created its inode and added the corresponding directory indexes to its
parent inode. In this case we end up never removing the directory indexes
because the inode eviction handler, called for our symlink inode on the
final iput(), only removes items associated with the symlink inode and
not with the parent inode.

Example:

  $ mkfs.btrfs -f /dev/sdi
  $ mount /dev/sdi /mnt
  $ touch /mnt/foo
  $ ln -s /mnt/foo /mnt/bar
  ln: failed to create symbolic link ‘bar’: Cannot allocate memory
  $ umount /mnt
  $ btrfsck /dev/sdi
  Checking filesystem on /dev/sdi
  UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1
  checking extents
  checking free space cache
  checking fs roots
  root 5 inode 258 errors 2001, no inode item, link count wrong
	unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
  found 131073 bytes used err is 1
  total csum bytes: 0
  total tree bytes: 131072
  total fs tree bytes: 32768
  total extent tree bytes: 16384
  btree space waste bytes: 124305
  file data blocks allocated: 262144
   referenced 262144
  btrfs-progs v4.2.3

So fix this by adding the directory index entries as the very last
step of symlink creation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-31 18:10:56 +00:00
Al Viro
fceef393a5 switch ->get_link() to delayed_call, kill ->put_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-30 13:01:03 -05:00
Chris Mason
a53fe25769 Merge branch 'for-chris-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.5 2015-12-23 13:28:35 -08:00
Chris Mason
bb9d687618 Merge branch 'dev/simplify-set-bit' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-23 13:17:42 -08:00
Chris Mason
afa427cf9d Merge branch 'cleanup/misc-simplify' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2015-12-23 13:10:26 -08:00
Filipe Manana
f28a492878 Btrfs: fix leaking of ordered extents after direct IO write error
When doing a direct IO write, __blockdev_direct_IO() can call the
btrfs_get_blocks_direct() callback one or more times before it calls the
btrfs_submit_direct() callback. However it can fail after calling the
first callback and before calling the second callback, which is a problem
because the first one creates ordered extents and the second one is the
one that submits bios that cover the ordered extents created by the first
one. That means the ordered extents will never complete nor have any of
the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set, resulting in
subsequent operations (such as other direct IO writes, buffered writes or
hole punching) that lock the same IO range and lookup for ordered extents
in the range to hang forever waiting for those ordered extents because
they can not complete ever, since no bio was submitted.

Fix this by tracking a range of created ordered extents that don't have
yet corresponding bios submitted and completing the ordered extents in
the range if __blockdev_direct_IO() fails with an error.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17 10:59:51 +00:00
Filipe Manana
b850ae1427 Btrfs: fix deadlock between direct IO write and defrag/readpages
If readpages() (triggered by defrag or buffered reads) is called while a
direct IO write is in progress, we have a small time window where we can
deadlock, resulting in traces like the following being generated:

[84723.212993] INFO: task fio:2849 blocked for more than 120 seconds.
[84723.214310]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84723.217313] fio        D ffff88023ec75218     0  2849   2835 0x00000000
[84723.218778]  ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200
[84723.220458]  ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff
[84723.230597]  0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a
[84723.232085] Call Trace:
[84723.232625]  [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c
[84723.233529]  [<ffffffff8147856a>] schedule+0x7d/0x95
[84723.234398]  [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b
[84723.235384]  [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28
[84723.236426]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
[84723.237502]  [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d
[84723.238807]  [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab
[84723.242012]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
[84723.243064]  [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33
[84723.244116]  [<ffffffff810afa2e>] ? ktime_get+0x41/0x52
[84723.245029]  [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b
[84723.245942]  [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b
[84723.246596]  [<ffffffff81478953>] bit_wait_io+0x39/0x45
[84723.247503]  [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d
[84723.248540]  [<ffffffff8111684f>] __lock_page+0x66/0x68
[84723.249558]  [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a
[84723.250844]  [<ffffffff81124a04>] lock_page+0x2c/0x2f
[84723.251871]  [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa
[84723.253274]  [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146
[84723.254757]  [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15
[84723.256378]  [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs]
[84723.258556]  [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111
[84723.260064]  [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb
[84723.261479]  [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs]
[84723.262961]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.264449]  [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33
[84723.265614]  [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33
[84723.266769]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.268264]  [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs]
[84723.270954]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.272465]  [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128
[84723.273734]  [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs]
[84723.275101]  [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5
[84723.276200]  [<ffffffff8116cfab>] vfs_write+0xa0/0xe4
[84723.277298]  [<ffffffff8116d79d>] SyS_write+0x50/0x7e
[84723.278327]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[84723.279595] INFO: lockdep is turned off.
[84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds.
[84723.380323]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
[84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84723.383003] btrfs           D ffff88023ed75218     0  2923   2859 0x00000000
[84723.384277]  ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200
[84723.385748]  ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30
[84723.387152]  ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a
[84723.388620] Call Trace:
[84723.389105]  [<ffffffff8147856a>] schedule+0x7d/0x95
[84723.391882]  [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs]
[84723.393718]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[84723.395659]  [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs]
[84723.397383]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
[84723.398852]  [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs]
[84723.400561]  [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72
[84723.401787]  [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs]
[84723.403121]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
[84723.404583]  [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs]
[84723.406007]  [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4
[84723.407502]  [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e
[84723.408937]  [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e
[84723.410487]  [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f
[84723.411710]  [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs]
[84723.413007]  [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs]
[84723.414085]  [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs]
[84723.415307]  [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs]
[84723.416532]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
[84723.417731]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[84723.418699]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[84723.421532]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
[84723.422629]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
[84723.423712]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
[84723.424801]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[84723.425968]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[84723.427063]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[84723.428138]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f

Consider the following logical and physical file layout:

logical:    ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ...
                4K                    8K                    16K

physical:   ... 12853248              12857344              1103101952   ...
                                      (= 12853248 + 4K)

Extents A and B are physically adjacent. The following diagram shows a
sequence of events that lead to the deadlock when we attempt to do a
direct IO write against the file range [4K, 16K[ and a defrag is triggered
simultaneously.

           CPU 1                                               CPU 2

 btrfs_direct_IO()

   btrfs_get_blocks_direct()
     creates ordered extent A, covering
     the 4k prealloc extent A (range [4K, 8K[)

                                                    btrfs_defrag_file()
                                                      page_cache_sync_readahead([0K, 1M[)
                                                        btrfs_readpages()
                                                          extent_readpages()

                                                            locks all pages in the file
                                                            range [0K, 128K[ through calls
                                                            to add_to_page_cache_lru()

                                                            __do_contiguous_readpages()

                                                               finds ordered extent A

                                                               waits for it to complete

   btrfs_get_blocks_direct() called again

     lock_extent_direct(range [8K, 16K[)

       finds a page in range [8K, 16K[ through
       btrfs_page_exists_in_range()

       invalidate_inode_pages2_range([8K, 16K[)

         --> tries to lock pages that are already
             locked by the task at CPU 2

         --> our task, running __blockdev_direct_IO(),
             hangs waiting to lock the pages and the
             submit bio callback, btrfs_submit_direct(),
             ends up never being called, resulting in the
             ordered extent A never completing (because a
             corresponding bio is never submitted) and
             CPU 2 will wait for it forever while holding
             the pages locked
              ---> deadlock!

Fix this by removing the page invalidation approach when attempting to
lock the range for IO from the callback btrfs_get_blocks_direct() and
falling back buffered IO. This was a rare case anyway and well behaved
applications do not mix concurrent direct IO writes with buffered reads
anyway, being a concurrent defrag the only normal case that could lead
to the deadlock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17 10:59:50 +00:00
Filipe Manana
14543774bd Btrfs: fix error path when failing to submit bio for direct IO write
Commit 61de718fce ("Btrfs: fix memory corruption on failure to submit
bio for direct IO") fixed problems with the error handling code after we
fail to submit a bio for direct IO. However there were 2 problems that it
did not address when the failure is due to memory allocation failures for
direct IO writes:

1) We considered that there could be only one ordered extent for the whole
   IO range, which is not always true, as we can have multiple;

2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent,
   which can make other tasks running btrfs_wait_logged_extents() hang
   forever, since they wait for that bit to be set. The general assumption
   is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set
   and it precedes setting the bit BTRFS_ORDERED_COMPLETE.

Fix these issues by moving part of the btrfs_endio_direct_write() handler
into a new helper function and having that new helper function called when
we fail to allocate memory to submit the bio (and its private object) for
a direct IO write.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-12-17 10:59:49 +00:00
Al Viro
6b2553918d replace ->follow_link() with new method that could stay in RCU mode
new method: ->get_link(); replacement of ->follow_link().  The differences
are:
	* inode and dentry are passed separately
	* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
	* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.

It's a flagday change - the old method is gone, all in-tree instances
converted.  Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode.  That'll change
in the next commits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:54 -05:00
Al Viro
21fc61c73c don't put symlink bodies in pagecache into highmem
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.

new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases.  page_follow_link_light()
instrumented to yell about anything missed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:36 -05:00
Andreas Gruenbacher
9172abbcd3 btrfs: Use xattr handler infrastructure
Use the VFS xattr handler infrastructure and get rid of similar code in
the filesystem.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06 21:34:14 -05:00
Andreas Gruenbacher
97d7929922 posix acls: Remove duplicate xattr name definitions
Remove POSIX_ACL_XATTR_{ACCESS,DEFAULT} and GFS2_POSIX_ACL_{ACCESS,DEFAULT}
and replace them with the definitions in <include/uapi/linux/xattr.h>.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06 21:25:17 -05:00
David Sterba
3042460136 btrfs: remove wait from struct btrfs_delalloc_work
The value is 0 and never changes, we can propagate the value, remove
wait and the implied dead code.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 15:02:21 +01:00
David Sterba
651d494a67 btrfs: sink parameter wait to btrfs_alloc_delalloc_work
There's only one caller and single value, we can propagate it down to
the callee and remove the unused parameter.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 15:02:21 +01:00
David Sterba
ff13db41f1 btrfs: drop unused parameter from lock_extent_bits
We've always passed 0. Stack usage will slightly decrease.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 14:30:40 +01:00
Linus Torvalds
80e0c505b2 Merge branch 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has Mark Fasheh's patches to fix quota accounting during subvol
  deletion, which we've been working on for a while now.  The patch is
  pretty small but it's a key fix.

  Otherwise it's a random assortment"

* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix balance range usage filters in 4.4-rc
  btrfs: qgroup: account shared subtree during snapshot delete
  Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
  btrfs: qgroup: fix quota disable during rescan
  Btrfs: fix race between cleaner kthread and space cache writeout
  Btrfs: fix scrub preventing unused block groups from being deleted
  Btrfs: fix race between scrub and block group deletion
  btrfs: fix rcu warning during device replace
  btrfs: Continue replace when set_block_ro failed
  btrfs: fix clashing number of the enhanced balance usage filter
  Btrfs: fix the number of transaction units needed to remove a block group
  Btrfs: use global reserve when deleting unused block group after ENOSPC
  Btrfs: tests: checking for NULL instead of IS_ERR()
  btrfs: fix signed overflows in btrfs_sync_file
2015-11-27 15:45:45 -08:00
Filipe Manana
8eab77ff16 Btrfs: use global reserve when deleting unused block group after ENOSPC
It's possible to reach a state where the cleaner kthread isn't able to
start a transaction to delete an unused block group due to lack of enough
free metadata space and due to lack of unallocated device space to allocate
a new metadata block group as well. If this happens try to use space from
the global block group reserve just like we do for unlink operations, so
that we don't reach a permanent state where starting a transaction for
filesystem operations (file creation, renames, etc) keeps failing with
-ENOSPC. Such an unfortunate state was observed on a machine where over
a dozen unused data block groups existed and the cleaner kthread was
failing to delete them due to ENOSPC error when attempting to start a
transaction, and even running balance with a -dusage=0 filter failed with
ENOSPC as well. Also unmounting and mounting again the filesystem didn't
help. Allowing the cleaner kthread to use the global block reserve to
delete the unused data block groups fixed the problem.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:50 -08:00
Linus Torvalds
e75cdf9898 Merge branch 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes and cleanups from Chris Mason:
 "Some of this got cherry-picked from a github repo this week, but I
  verified the patches.

  We have three small scrub cleanups and a collection of fixes"

* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: Use fs_info directly in btrfs_delete_unused_bgs
  btrfs: Fix lost-data-profile caused by balance bg
  btrfs: Fix lost-data-profile caused by auto removing bg
  btrfs: Remove len argument from scrub_find_csum
  btrfs: Reduce unnecessary arguments in scrub_recheck_block
  btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum
  btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum
  btrfs: scrub: setup all fields for sblock_to_check
  btrfs: scrub: set error stats when tree block spanning stripes
  Btrfs: fix race when listing an inode's xattrs
  Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow
  Btrfs: fix race leading to incorrect item deletion when dropping extents
  Btrfs: fix sleeping inside atomic context in qgroup rescan worker
  Btrfs: fix race waiting for qgroup rescan worker
  btrfs: qgroup: exit the rescan worker during umount
  Btrfs: fix extent accounting for partial direct IO writes
2015-11-13 16:30:29 -08:00
Yaowei Bai
7cac0a8599 fs/btrfs/inode.c: remove unnecessary new_valid_dev() check
new_valid_dev() always returns 1, so the !new_valid_dev() check is not
needed.  Remove it.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-09 15:11:24 -08:00
Filipe Manana
1d512cb77b Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow
If we are using the NO_HOLES feature, we have a tiny time window when
running delalloc for a nodatacow inode where we can race with a concurrent
link or xattr add operation leading to a BUG_ON.

This happens because at run_delalloc_nocow() we end up casting a leaf item
of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
file extent item (struct btrfs_file_extent_item) and then analyse its
extent type field, which won't match any of the expected extent types
(values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
explicit BUG_ON(1).

The following sequence diagram shows how the race happens when running a
no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
neighbour leafs:

             Leaf X (has N items)                    Leaf Y

 [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
              slot N - 2         slot N - 1              slot 0

 (Note the implicit hole for inode 257 regarding the [0, 8K[ range)

       CPU 1                                         CPU 2

 run_dealloc_nocow()
   btrfs_lookup_file_extent()
     --> searches for a key with value
         (257 EXTENT_DATA 4096) in the
         fs/subvol tree
     --> returns us a path with
         path->nodes[0] == leaf X and
         path->slots[0] == N

   because path->slots[0] is >=
   btrfs_header_nritems(leaf X), it
   calls btrfs_next_leaf()

   btrfs_next_leaf()
     --> releases the path

                                              hard link added to our inode,
                                              with key (257 INODE_REF 500)
                                              added to the end of leaf X,
                                              so leaf X now has N + 1 keys

     --> searches for the key
         (257 INODE_REF 256), because
         it was the last key in leaf X
         before it released the path,
         with path->keep_locks set to 1

     --> ends up at leaf X again and
         it verifies that the key
         (257 INODE_REF 256) is no longer
         the last key in the leaf, so it
         returns with path->nodes[0] ==
         leaf X and path->slots[0] == N,
         pointing to the new item with
         key (257 INODE_REF 500)

   the loop iteration of run_dealloc_nocow()
   does not break out the loop and continues
   because the key referenced in the path
   at path->nodes[0] and path->slots[0] is
   for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
   and its offset (500) is less then our delalloc
   range's end (8192)

   the item pointed by the path, an inode reference item,
   is (incorrectly) interpreted as a file extent item and
   we get an invalid extent type, leading to the BUG_ON(1):

   if (extent_type == BTRFS_FILE_EXTENT_REG ||
      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
       (...)
   } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
       (...)
   } else {
       BUG_ON(1)
   }

The same can happen if a xattr is added concurrently and ends up having
a key with an offset smaller then the delalloc's range end.

So fix this by skipping keys with a type smaller than
BTRFS_EXTENT_DATA_KEY.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-09 11:29:14 +00:00
Filipe Manana
9c9464cc92 Btrfs: fix extent accounting for partial direct IO writes
When doing a write using direct IO we can end up not doing the whole write
operation using the direct IO path, in that case we fallback to a buffered
write to do the remaining IO. This happens for example if the range we are
writing to contains a compressed extent.
When we do a partial write and fallback to buffered IO, due to the
existence of a compressed extent for example, we end up not adjusting the
outstanding extents counter of our inode which ends up getting decremented
twice, once by the DIO ordered extent for the partial write and once again
by btrfs_direct_IO(), resulting in an arithmetic underflow at
extent-tree.c:drop_outstanding_extent(). For example if we have:

  extents        [ prealloc extent ] [ compressed extent ]
  offsets        A        B          C       D           E

and at the moment our inode's outstanding extents counter is 0, if we do a
direct IO write against the range [B, D[ (which has a length smaller than
128Mb), we end up bumping our inode's outstanding extents counter to 1, we
create a DIO ordered extent for the range [B, C[ and then fallback to a
buffered write for the range [C, D[. The direct IO handler
(inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by
1, leaving it with a value of 0, through a call to
btrfs_delalloc_release_space() and then shortly after the DIO ordered
extent finishes and calls btrfs_delalloc_release_metadata() which ends
up to attempt to decrement the inode's outstanding extents counter by 1,
resulting in an assertion failure at drop_outstanding_extent() because
the operation would result in an arithmetic underflow (0 - 1). This
produces the following trace:

  [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526
  [125471.338844] ------------[ cut here ]------------
  [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173!
  [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs]
  [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
  [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
  [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
  [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000
  [125471.340745] RIP: 0010:[<ffffffffa0550da1>]  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
  [125471.340745] RSP: 0018:ffff88040a11bc78  EFLAGS: 00010296
  [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000
  [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff
  [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000
  [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000
  [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88
  [125471.340745] FS:  0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000
  [125471.340745] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0
  [125471.340745] Stack:
  [125471.340745]  ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1
  [125471.340745]  ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000
  [125471.340745]  0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa
  [125471.340745] Call Trace:
  [125471.340745]  [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs]
  [125471.340745]  [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs]
  [125471.340745]  [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs]
  [125471.340745]  [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs]
  [125471.340745]  [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs]
  [125471.340745]  [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
  [125471.340745]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
  [125471.340745]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
  [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
  [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
  [125471.340745]  [<ffffffff8106904d>] kthread+0xef/0xf7
  [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
  [125471.340745]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
  [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
  [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41
  [125471.340745] RIP  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
  [125471.340745]  RSP <ffff88040a11bc78>
  [125471.539620] ---[ end trace 144259f7838b4aa4 ]---

So fix this by ensuring we adjust the outstanding extents counter when we
do the fallback just like we do for the case where the whole write can be
done through the direct IO path.

We were also adjusting the outstanding extents counter by a constant value
of 1, which is incorrect because we were ignorning that we account extents
in BTRFS_MAX_EXTENT_SIZE units, o fix that as well.

The following test case for fstests reproduces this issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_xfs_io_command "falloc"

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  # Create a compressed extent covering the range [700K, 800K[.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Create prealloc extent covering the range [600K, 700K[.
  $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo

  # Write 80K of data to the range [640K, 720K[ using direct IO. This
  # range covers both the prealloc extent and the compressed extent.
  # Because there's a compressed extent in the range we are writing to,
  # the DIO write code path ends up only writing the first 60k of data,
  # which goes to the prealloc extent, and then falls back to buffered IO
  # for writing the remaining 20K of data - because that remaining data
  # maps to a file range containing a compressed extent.
  # When falling back to buffered IO, we used to trigger an assertion when
  # releasing reserved space due to bad accounting of the inode's
  # outstanding extents counter, which was set to 1 but we ended up
  # decrementing it by 1 twice, once through the ordered extent for the
  # 60K of data we wrote using direct IO, and once through the main direct
  # IO handler (inode.cbtrfs_direct_IO()) because the direct IO write
  # wrote less than 80K of data (60K).
  $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Now similar test as above but for very large write operations. This
  # triggers special cases for an inode's outstanding extents accounting,
  # as internally btrfs logically splits extents into 128Mb units.
  $XFS_IO_PROG -f -s \
      -c "pwrite -S 0xaa -b 128M 258M 128M" \
      -c "falloc 0 258M" \
      $SCRATCH_MNT/bar | _filter_xfs_io
  $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \
      | _filter_xfs_io

  # Now verify the file contents are correct and that they are the same
  # even after unmounting and mounting the fs again (or evicting the page
  # cache).
  #
  # For file foo, all bytes in the range [0, 640K[ must have a value of
  # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb
  # and all bytes in the range [720K, 800K[ must have a value of 0xaa.
  #
  # For file bar, all bytes in the range [0, 3M[ must havea value of 0x00,
  # all bytes in the range [3M, 259M[ must have a value of 0xbb and all
  # bytes in the range [259M, 386M[ must have a value of 0xaa.
  #
  echo "File digests before remounting the file system:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch
  md5sum $SCRATCH_MNT/bar | _filter_scratch
  _scratch_remount
  echo "File digests after remounting the file system:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch
  md5sum $SCRATCH_MNT/bar | _filter_scratch

  status=0
  exit

Fixes: e1cbbfa5f5 ("Btrfs: fix outstanding_extents accounting in DIO")
Fixes: 3e05bde8c3 ("Btrfs: only adjust outstanding_extents when we do a short write")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-05 10:32:19 +00:00
Qu Wenruo
5846a3c268 btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans
Between btrfs_allocerved_file_extent() and
btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs
are run and delayed ref head maybe freed before
btrfs_add_delayed_qgroup_reserve().

This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT,
and cause transaction to be aborted.

This patch will record qgroup reserve space info into delayed_ref_head
at btrfs_add_delayed_ref(), to eliminate the race window.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:44:39 -07:00
Filipe Manana
b06c4bf5c8 Btrfs: fix regression running delayed references when using qgroups
In the kernel 4.2 merge window we had a big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:

  commit 0ed4792af0 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")

Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn cause the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:

  static int run_delayed_tree_ref(...)
  {
     (...)
     BUG_ON(node->ref_mod != 1);
     (...)
  }

This happens on a scenario like the following:

  1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.

  2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
     It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.

  3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
     It's not merged with the reference at the tail of the list of refs
     for bytenr X because the reference at the tail, Ref2 is incompatible
     due to Ref2->no_quota != Ref3->no_quota.

  4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
     It's not merged with the reference at the tail of the list of refs
     for bytenr X because the reference at the tail, Ref3 is incompatible
     due to Ref3->no_quota != Ref4->no_quota.

  5) We run delayed references, trigger merging of delayed references,
     through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().

  6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
     all other conditions are satisfied too. So Ref1 gets a ref_mod
     value of 2.

  7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
     all other conditions are satisfied too. So Ref2 gets a ref_mod
     value of 2.

  8) Ref1 and Ref2 aren't merged, because they have different values
     for their no_quota field.

  9) Delayed reference Ref1 is picked for running (select_delayed_ref()
     always prefers references with an action == BTRFS_ADD_DELAYED_REF).
     So run_delayed_tree_ref() is called for Ref1 which triggers the
     BUG_ON because Ref1->red_mod != 1 (equals 2).

So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").

The use of no_quota was also buggy in at least two places:

1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
   no_quota to 0 instead of 1 when the following condition was true:
   is_fstree(ref_root) || !fs_info->quota_enabled

2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
   reset a node's no_quota when the condition "!is_fstree(root_objectid)
   || !root->fs_info->quota_enabled" was true but we did it only in
   an unused local stack variable, that is, we never reset the no_quota
   value in the node itself.

This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.

Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).

Also, this fixes deadlock issue when using the clone ioctl with qgroups
enabled, as reported by Elias Probst in the mailing list. The deadlock
happens because after calling btrfs_insert_empty_item we have our path
holding a write lock on a leaf of the fs/subvol tree and then before
releasing the path we called check_ref() which did backref walking, when
qgroups are enabled, and tried to read lock the same leaf. The trace for
this case is the following:

  INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
  (...)
  Call Trace:
    [<ffffffff86999201>] schedule+0x74/0x83
    [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
    [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
    [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
    [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
    [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
    [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
    [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
    [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
    [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
    [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
    [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
    [<ffffffff863e852e>] check_ref+0x64/0xc4
    [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
    [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
    [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
    [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
  (...)

The problem goes away by eleminating check_ref(), which no longer is
needed as its purpose was to get a value for the no_quota field of
a delayed reference (this patch removes the no_quota field as mentioned
earlier).

Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
Reported-by: Elias Probst <mail@eliasprobst.eu>
Reported-by: Peter Becker <floyd.net@gmail.com>
Reported-by: Malte Schröder <malte@tnxip.de>
Reported-by: Derek Dongray <derek@valedon.co.uk>
Reported-by: Erkki Seppala <flux-btrfs@inside.org>
Cc: stable@vger.kernel.org  # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-10-25 19:53:26 +00:00
Chris Mason
a9e6d15356 Merge branch 'allocator-fixes' into for-linus-4.4
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 19:00:38 -07:00
Josef Bacik
0b670dc44c Btrfs: fix prealloc under heavy fragmentation conditions
If we are heavily fragmented we will continually try to prealloc the largest
extent size we can every time we call btrfs_reserve_extent.  This can be very
expensive when we are heavily fragmented, burning lots of CPU cycles and loops
through the allocator.  So instead notice when we get a smaller chunk from the
allocator than what we specified and use this as the new maximum size we try to
allocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:51:44 -07:00
Qu Wenruo
56fa9d0762 btrfs: qgroup: Check if qgroup reserved space leaked
Add check at btrfs_destroy_inode() time to detect qgroup reserved space
leak.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:10 -07:00
Qu Wenruo
51773bec7e btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in clear_bit_hook
In clear_bit_hook, qgroup reserved data is already handled quite well,
either released by finish_ordered_io or invalidatepage.

So calling btrfs_qgroup_free_data() here is completely meaningless, and
since btrfs_qgroup_free_data() will lock io_tree, so it can't be called
with io_tree lock hold.

This patch will add a new function
btrfs_free_reserved_data_space_noquota() for clear_bit_hook() to cease
the lockdep warning.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:09 -07:00
Qu Wenruo
b9d0b38928 btrfs: Add handler for invalidate page
For btrfs_invalidatepage() and its variant evict_inode_truncate_page(),
there will be pages don't reach disk.
In that case, their reserved space won't be release nor freed by
finish_ordered_io() nor delayed_ref handler.

So we must free their qgroup reserved space, or we will leaking reserved
space again.

So this will patch will call btrfs_qgroup_free_data() for
invalidatepage() and its variant evict_inode_truncate_page().

And due to the nature of new btrfs_qgroup_reserve/free_data() reserved
space will only be reserved or freed once, so for pages which are
already flushed to disk, their reserved space will be released and freed
by delayed_ref handler.

Double free won't be a problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:07 -07:00
Qu Wenruo
94ed938aba btrfs: qgroup: Add handler for NOCOW and inline
For NOCOW and inline case, there will be no delayed_ref created for
them, so we should free their reserved data space at proper
time(finish_ordered_io for NOCOW and cow_file_inline for inline).

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:07 -07:00
Qu Wenruo
7cf5b97650 btrfs: qgroup: Cleanup old inaccurate facilities
Cleanup the old facilities which use old btrfs_qgroup_reserve() function
call, replace them with the newer version, and remove the "__" prefix in
them.

Also, make btrfs_qgroup_reserve/free() functions private, as they are
now only used inside qgroup codes.

Now, the whole btrfs qgroup is swithed to use the new reserve facilities.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:06 -07:00
Qu Wenruo
df480633b8 btrfs: extent-tree: Switch to new delalloc space reserve and release
Use new __btrfs_delalloc_reserve_space() and
__btrfs_delalloc_release_space() to reserve and release space for
delalloc.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:41:05 -07:00
Qu Wenruo
297d750b9f btrfs: delayed_ref: release and free qgroup reserved at proper timing
Qgroup reserved space needs to be released from inode dirty map and get
freed at different timing:

1) Release when the metadata is written into tree
After corresponding metadata is written into tree, any newer write will
be COWed(don't include NOCOW case yet).
So we must release its range from inode dirty range map, or we will
forget to reserve needed range, causing accounting exceeding the limit.

2) Free reserved bytes when delayed ref is run
When delayed refs are run, qgroup accounting will follow soon and turn
the reserved bytes into rfer/excl numbers.
As run_delayed_refs and qgroup accounting are all done at
commit_transaction() time, we are safe to free reserved space in
run_delayed_ref time().

With these timing to release/free reserved space, we should be able to
resolve the long existing qgroup reserve space leak problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:37:47 -07:00
Chris Mason
a408365c62 Merge branch 'integration-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.4 2015-10-21 18:23:59 -07:00
Chris Mason
a0d58e48db Merge branch 'cleanups/for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 2015-10-21 18:21:40 -07:00
Byongho Lee
568b1c9cca btrfs: remove unnecessary list_del
We can safely iterate whole list items, without using list_del macro.
So remove the list_del call.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Chandan Rajendra
0d51e28a11 Btrfs: btrfs_submit_bio_hook: Use btrfs_wq_endio_type values instead of integer constants
btrfs_submit_bio_hook() uses integer constants instead of values from "enum
btrfs_wq_endio_type". Fix this.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:47 +02:00
Filipe Manana
0305cd5f7f Btrfs: fix truncation of compressed and inlined extents
When truncating a file to a smaller size which consists of an inline
extent that is compressed, we did not discard (or made unusable) the
data between the new file size and the old file size, wasting metadata
space and allowing for the truncated data to be leaked and the data
corruption/loss mentioned below.
We were also not correctly decrementing the number of bytes used by the
inode, we were setting it to zero, giving a wrong report for callers of
the stat(2) syscall. The fsck tool also reported an error about a mismatch
between the nbytes of the file versus the real space used by the file.

Now because we weren't discarding the truncated region of the file, it
was possible for a caller of the clone ioctl to actually read the data
that was truncated, allowing for a security breach without requiring root
access to the system, using only standard filesystem operations. The
scenario is the following:

   1) User A creates a file which consists of an inline and compressed
      extent with a size of 2000 bytes - the file is not accessible to
      any other users (no read, write or execution permission for anyone
      else);

   2) The user truncates the file to a size of 1000 bytes;

   3) User A makes the file world readable;

   4) User B creates a file consisting of an inline extent of 2000 bytes;

   5) User B issues a clone operation from user A's file into its own
      file (using a length argument of 0, clone the whole range);

   6) User B now gets to see the 1000 bytes that user A truncated from
      its file before it made its file world readbale. User B also lost
      the bytes in the range [1000, 2000[ bytes from its own file, but
      that might be ok if his/her intention was reading stale data from
      user A that was never supposed to be public.

Note that this contrasts with the case where we truncate a file from 2000
bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In
this case reading any byte from the range [1000, 2000[ will return a value
of 0x00, instead of the original data.

This problem exists since the clone ioctl was added and happens both with
and without my recent data loss and file corruption fixes for the clone
ioctl (patch "Btrfs: fix file corruption and data loss after cloning
inline extents").

So fix this by truncating the compressed inline extents as we do for the
non-compressed case, which involves decompressing, if the data isn't already
in the page cache, compressing the truncated version of the extent, writing
the compressed content into the inline extent and then truncate it.

The following test case for fstests reproduces the problem. In order for
the test to pass both this fix and my previous fix for the clone ioctl
that forbids cloning a smaller inline extent into a larger one,
which is titled "Btrfs: fix file corruption and data loss after cloning
inline extents", are needed. Without that other fix the test fails in a
different way that does not leak the truncated data, instead part of
destination file gets replaced with zeroes (because the destination file
has a larger inline extent than the source).

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  # Create our test files. File foo is going to be the source of a clone operation
  # and consists of a single inline extent with an uncompressed size of 512 bytes,
  # while file bar consists of a single inline extent with an uncompressed size of
  # 256 bytes. For our test's purpose, it's important that file bar has an inline
  # extent with a size smaller than foo's inline extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128"   \
          -c "pwrite -S 0x2a 128 384" \
          $SCRATCH_MNT/foo | _filter_xfs_io
  $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io

  # Now durably persist all metadata and data. We do this to make sure that we get
  # on disk an inline extent with a size of 512 bytes for file foo.
  sync

  # Now truncate our file foo to a smaller size. Because it consists of a
  # compressed and inline extent, btrfs did not shrink the inline extent to the
  # new size (if the extent was not compressed, btrfs would shrink it to 128
  # bytes), it only updates the inode's i_size to 128 bytes.
  $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo

  # Now clone foo's inline extent into bar.
  # This clone operation should fail with errno EOPNOTSUPP because the source
  # file consists only of an inline extent and the file's size is smaller than
  # the inline extent of the destination (128 bytes < 256 bytes). However the
  # clone ioctl was not prepared to deal with a file that has a size smaller
  # than the size of its inline extent (something that happens only for compressed
  # inline extents), resulting in copying the full inline extent from the source
  # file into the destination file.
  #
  # Note that btrfs' clone operation for inline extents consists of removing the
  # inline extent from the destination inode and copy the inline extent from the
  # source inode into the destination inode, meaning that if the destination
  # inode's inline extent is larger (N bytes) than the source inode's inline
  # extent (M bytes), some bytes (N - M bytes) will be lost from the destination
  # file. Btrfs could copy the source inline extent's data into the destination's
  # inline extent so that we would not lose any data, but that's currently not
  # done due to the complexity that would be needed to deal with such cases
  # (specially when one or both extents are compressed), returning EOPNOTSUPP, as
  # it's normally not a very common case to clone very small files (only case
  # where we get inline extents) and copying inline extents does not save any
  # space (unlike for normal, non-inlined extents).
  $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Now because the above clone operation used to succeed, and due to foo's inline
  # extent not being shinked by the truncate operation, our file bar got the whole
  # inline extent copied from foo, making us lose the last 128 bytes from bar
  # which got replaced by the bytes in range [128, 256[ from foo before foo was
  # truncated - in other words, data loss from bar and being able to read old and
  # stale data from foo that should not be possible to read anymore through normal
  # filesystem operations. Contrast with the case where we truncate a file from a
  # size N to a smaller size M, truncate it back to size N and then read the range
  # [M, N[, we should always get the value 0x00 for all the bytes in that range.

  # We expected the clone operation to fail with errno EOPNOTSUPP and therefore
  # not modify our file's bar data/metadata. So its content should be 256 bytes
  # long with all bytes having the value 0xbb.
  #
  # Without the btrfs bug fix, the clone operation succeeded and resulted in
  # leaking truncated data from foo, the bytes that belonged to its range
  # [128, 256[, and losing data from bar in that same range. So reading the
  # file gave us the following content:
  #
  # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
  # *
  # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
  # *
  # 0000400
  echo "File bar's content after the clone operation:"
  od -t x1 $SCRATCH_MNT/bar

  # Also because the foo's inline extent was not shrunk by the truncate
  # operation, btrfs' fsck, which is run by the fstests framework everytime a
  # test completes, failed reporting the following error:
  #
  #  root 5 inode 257 errors 400, nbytes wrong

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-16 21:02:53 +01:00
Chris Mason
6db4a7335d Merge branch 'fix/waitqueue-barriers' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 2015-10-12 16:24:40 -07:00
David Sterba
ee86395458 btrfs: comment the rest of implicit barriers before waitqueue_active
There are atomic operations that imply the barrier for waitqueue_active
mixed in an if-condition.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:42:00 +02:00
David Sterba
9464732266 btrfs: switch message printers to ratelimited variants
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 13:04:06 +02:00
Linus Torvalds
03e8f64486 Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assorted set I've been queuing up:

  Jeff Mahoney tracked down a tricky one where we ended up starting IO
  on the wrong mapping for special files in btrfs_evict_inode.  A few
  people reported this one on the list.

  Filipe found (and provided a test for) a difficult bug in reading
  compressed extents, and Josef fixed up some quota record keeping with
  snapshot deletion.  Chandan killed off an accounting bug during DIO
  that lead to WARN_ONs as we freed inodes"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: keep dropped roots in cache until transaction commit
  Btrfs: Direct I/O: Fix space accounting
  btrfs: skip waiting on ordered range for special files
  Btrfs: fix read corruption of compressed and shared extents
  Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock
  Btrfs: don't initialize a space info as full to prevent ENOSPC
2015-09-25 12:08:41 -07:00
chandan
50745b0a7f Btrfs: Direct I/O: Fix space accounting
The following call trace is seen when generic/095 test is executed,

WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
Modules linked in:
CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
 ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
 0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
 ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
Call Trace:
 [<ffffffff81984058>] dump_stack+0x45/0x57
 [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0
 [<ffffffff81050465>] warn_slowpath_null+0x15/0x20
 [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0
 [<ffffffff8117ce07>] destroy_inode+0x37/0x60
 [<ffffffff8117cf39>] evict+0x109/0x170
 [<ffffffff8117cfd5>] dispose_list+0x35/0x50
 [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100
 [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0
 [<ffffffff81165951>] kill_anon_super+0x11/0x20
 [<ffffffff81302093>] btrfs_kill_super+0x13/0x110
 [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70
 [<ffffffff811660cf>] deactivate_super+0x5f/0x70
 [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90
 [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10
 [<ffffffff81069c06>] task_work_run+0x96/0xb0
 [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50
 [<ffffffff8198cbc2>] int_signal+0x12/0x17

This means that the inode had non-zero "outstanding extents" during
eviction. This occurs because, during direct I/O a task which successfully
used up its reserved data space would set BTRFS_INODE_DIO_READY bit and does
not clear the bit after finishing the DIO write. A future DIO write could
actually fail and the unused reserve space won't be freed because of the
previously set BTRFS_INODE_DIO_READY bit.

Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
following issue,
|-----------------------------------+-------------------------------------|
| Task A                            | Task B                              |
|-----------------------------------+-------------------------------------|
| Start direct i/o write on inode X.|                                     |
| reserve space                     |                                     |
| Allocate ordered extent           |                                     |
| release reserved space            |                                     |
| Set BTRFS_INODE_DIO_READY bit.    |                                     |
|                                   | splice()                            |
|                                   | Transfer data from pipe buffer to   |
|                                   | destination file.                   |
|                                   | - kmap(pipe buffer page)            |
|                                   | - Start direct i/o write on         |
|                                   |   inode X.                          |
|                                   |   - reserve space                   |
|                                   |   - dio_refill_pages()              |
|                                   |     - sdio->blocks_available == 0   |
|                                   |     - Since a kernel address is     |
|                                   |       being passed instead of a     |
|                                   |       user space address,           |
|                                   |       iov_iter_get_pages() returns  |
|                                   |       -EFAULT.                      |
|                                   |   - Since BTRFS_INODE_DIO_READY is  |
|                                   |     set, we don't release reserved  |
|                                   |     space.                          |
|                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
| -EIOCBQUEUED is returned.         |                                     |
|-----------------------------------+-------------------------------------|

Hence this commit introduces "struct btrfs_dio_data" to track the usage of
reserved data space. The remaining unused "reserve space" can now be freed
reliably.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-09-21 13:47:55 -07:00
Jeff Mahoney
a30e577c96 btrfs: skip waiting on ordered range for special files
In btrfs_evict_inode, we properly truncate the page cache for evicted
inodes but then we call btrfs_wait_ordered_range for every inode as well.
It's the right thing to do for regular files but results in incorrect
behavior for device inodes for block devices.

filemap_fdatawrite_range gets called with inode->i_mapping which gets
resolved to the block device inode before getting passed to
wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi.  What happens
next depends on whether there's an open file handle associated with the
inode.  If there is, we write to the block device, which is unexpected
behavior.  If there isn't, we through normally and inode->i_data is used.
We can also end up racing against open/close which can result in crashes
when i_mapping points to a block device inode that has been closed.

Since there can't be any page cache associated with special file inodes,
it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
the problem.

Cc: <stable@vger.kernel.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
2015-09-15 02:21:08 +01:00
Linus Torvalds
e91eb6204f Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs cleanups and fixes from Chris Mason:
 "These are small cleanups, and also some fixes for our async worker
  thread initialization.

  I was having some trouble testing these, but it ended up being a
  combination of changing around my test servers and a shiny new
  schedule while atomic from the new start/finish_plug in
  writeback_sb_inodes().

  That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't
  changing a bunch of things in my test setup at once it would have been
  really clear.  Fix for writeback_sb_inodes() on the way as well"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
  btrfs: async_thread: Fix workqueue 'max_active' value when initializing
  btrfs: Add raid56 support for updating  num_tolerated_disk_barrier_failures in btrfs_balance
  btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
  btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
  btrfs: Update out-of-date "skip parity stripe" comment
2015-09-11 12:38:25 -07:00
Linus Torvalds
22365979ab Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has Jeff Mahoney's long standing trim patch that fixes corners
  where trims were missing.  Omar has some raid5/6 fixes, especially for
  using scrub and device replace when devices are missing.

  Zhao Lie continues cleaning and fixing things, this series fixes some
  really hard to hit corners in xfstests.  I had to pull it last merge
  window due to some deadlocks, but those are now resolved.

  I added support for Tejun's new blkio controllers.  It seems to work
  well for single devices, we'll expand to multi-device as well"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits)
  btrfs: fix compile when block cgroups are not enabled
  Btrfs: fix file read corruption after extent cloning and fsync
  Btrfs: check if previous transaction aborted to avoid fs corruption
  btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
  btrfs: Prevent from early transaction abort
  btrfs: Remove unused arguments in tree-log.c
  btrfs: Remove useless condition in start_log_trans()
  Btrfs: add support for blkio controllers
  Btrfs: remove unused mutex from struct 'btrfs_fs_info'
  Btrfs: fix parity scrub of RAID 5/6 with missing device
  Btrfs: fix device replace of a missing RAID 5/6 device
  Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
  Btrfs: count devices correctly in readahead during RAID 5/6 replace
  Btrfs: remove misleading handling of missing device scrub
  btrfs: fix clone / extent-same deadlocks
  Btrfs: fix defrag to merge tail file extent
  Btrfs: fix warning in backref walking
  btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
  btrfs: Remove root argument in extent_data_ref_count()
  btrfs: Fix wrong comment of btrfs_alloc_tree_block()
  ...
2015-09-05 15:14:43 -07:00
Linus Torvalds
1081230b74 Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This first core part of the block IO changes contains:

   - Cleanup of the bio IO error signaling from Christoph.  We used to
     rely on the uptodate bit and passing around of an error, now we
     store the error in the bio itself.

   - Improvement of the above from myself, by shrinking the bio size
     down again to fit in two cachelines on x86-64.

   - Revert of the max_hw_sectors cap removal from a revision again,
     from Jeff Moyer.  This caused performance regressions in various
     tests.  Reinstate the limit, bump it to a more reasonable size
     instead.

   - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
     Most devices have huge trim limits, which can cause nasty latencies
     when deleting files.  Enable the admin to configure the size down.
     We will look into having a more sane default instead of UINT_MAX
     sectors.

   - Improvement of the SGP gaps logic from Keith Busch.

   - Enable the block core to handle arbitrarily sized bios, which
     enables a nice simplification of bio_add_page() (which is an IO hot
     path).  From Kent.

   - Improvements to the partition io stats accounting, making it
     faster.  From Ming Lei.

   - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
     file in blk-mq, as well as a fix for a blk-mq timeout race
     condition.

   - Ming Lin has been carrying Kents above mentioned patches forward
     for a while, and testing them.  Ming also did a few fixes around
     that.

   - Sasha Levin found and fixed a use-after-free problem introduced by
     the bio->bi_error changes from Christoph.

   - Small blk cgroup cleanup from Viresh Kumar"

* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
  blk: Fix bio_io_vec index when checking bvec gaps
  block: Replace SG_GAPS with new queue limits mask
  block: bump BLK_DEF_MAX_SECTORS to 2560
  Revert "block: remove artifical max_hw_sectors cap"
  blk-mq: fix race between timeout and freeing request
  blk-mq: fix buffer overflow when reading sysfs file of 'pending'
  Documentation: update notes in biovecs about arbitrarily sized bios
  block: remove bio_get_nr_vecs()
  fs: use helper bio_add_page() instead of open coding on bi_io_vec
  block: kill merge_bvec_fn() completely
  md/raid5: get rid of bio_fits_rdev()
  md/raid5: split bio for chunk_aligned_read
  block: remove split code in blkdev_issue_{discard,write_same}
  btrfs: remove bio splitting and merge_bvec_fn() calls
  bcache: remove driver private bio splitting code
  block: simplify bio_add_page()
  block: make generic_make_request handle arbitrarily sized bios
  blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
  block: don't access bio->bi_error after bio_put()
  block: shrink struct bio down to 2 cache lines again
  ...
2015-09-02 13:10:25 -07:00
Tsutomu Itoh
527afb4493 Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
We need not check path before btrfs_free_path() is called because
path is checked in btrfs_free_path().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:46:41 -07:00
Kent Overstreet
b54ffb73ca block: remove bio_get_nr_vecs()
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:32:04 -06:00
Chris Mason
da2f0f74cf Btrfs: add support for blkio controllers
This attaches accounting information to bios as we submit them so the
new blkio controllers can throttle on btrfs filesystems.

Not much is required, we're just associating bios with blkcgs during clone,
calling wbc_init_bio()/wbc_account_io() during writepages submission,
and attaching the bios to the current context during direct IO.

Finally if we are splitting bios during btrfs_map_bio, this attaches
accounting information to the split.

The end result is able to throttle nicely on single disk filesystems.  A
little more work is required for multi-device filesystems.

Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:35:06 -07:00
Filipe Manana
bde6c24202 Btrfs: fix stale dir entries after unlink, inode eviction and fsync
If we remove a hard link from an inode, the inode gets evicted, then
we fsync the inode and then power fail/crash, when the log tree is
replayed, the parent directory inode still has entries pointing to
the name that no longer exists, while our inode no longer has the
BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
leaving the filesystem in an inconsistent state. The stale directory
entries can not be deleted (an attempt to delete them causes -ESTALE
errors), which makes it impossible to delete the parent directory.

This happens because we track the id of the transaction where the last
unlink operation for the inode happened (last_unlink_trans) in an
in-memory only field of the inode, that is, a value that is never
persisted in the inode item stored on the fs/subvol btree. So if an
inode is evicted and loaded again, the value for last_unlink_trans is
set to 0, which prevents the fsync from logging the parent directory
at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
to the id of the transaction that last modified the inode when we
load the inode. This is a pessimistic approach but it always ensures
correctness with the trade off of ocassional full transaction commits
when an fsync is done against the inode in the same transaction where
it was evicted and reloaded when our inode is a directory and often
logging its parent unnecessarily when our inode is not a directory.

The following test case for fstests triggers the problem:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with 2 hard links.
  mkdir $SCRATCH_MNT/testdir
  touch $SCRATCH_MNT/testdir/foo
  ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar

  # Make sure everything done so far is durably persisted.
  sync

  # Now remove one of the links, trigger inode eviction and then fsync
  # our inode.
  unlink $SCRATCH_MNT/testdir/bar
  echo 2 > /proc/sys/vm/drop_caches
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo

  # Silently drop all writes on our scratch device to simulate a power failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again and mount the fs to trigger log/journal replay.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now verify our directory entries.
  echo "Entries in testdir:"
  ls -1 $SCRATCH_MNT/testdir

  # If we remove our inode, its parent should become empty and therefore we should
  # be able to remove the parent.
  rm -f $SCRATCH_MNT/testdir/*
  rmdir $SCRATCH_MNT/testdir

  _unmount_flakey

  # The fstests framework will call fsck against our filesystem which will verify
  # that all metadata is in a consistent state.

  status=0
  exit

The test failed on btrfs with:

  generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
    --- tests/generic/098.out	2015-07-23 18:01:12.616175932 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad	2015-07-23 18:04:58.924138308 +0100
    @@ -1,3 +1,6 @@
     QA output created by 098
     Entries in testdir:
    +bar
     foo
    +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
    +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
    ...
    (Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad'  to see the entire diff)
  _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)

  $ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
  (...)
  checking fs roots
  root 5 inode 258 errors 2001, no inode item, link count wrong
     unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
     unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
  Checking filesystem on /dev/sdc
  (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 06:16:58 -07:00
Christoph Hellwig
4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Filipe Manana
c1aa45759e Btrfs: fix shrinking truncate when the no_holes feature is enabled
If the no_holes feature is enabled, we attempt to shrink a file to a size
that ends up in the middle of a hole and we don't have any file extent
items in the fs/subvol tree that go beyond the new file size (or any
ordered extents that will insert such file extent items), we end up not
updating the inode's disk_i_size, we only update the inode's i_size.

This means that after unmounting and mounting the filesystem, or after
the inode is evicted and reloaded, its i_size ends up being incorrect
(an inode's i_size is set to the disk_i_size field when an inode is
loaded). This happens when btrfs_truncate_inode_items() doesn't find
any file extent items to drop - in this case it never makes a call to
btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.

Example reproducer:

  $ mkfs.btrfs -O no-holes -f /dev/sdd
  $ mount /dev/sdd /mnt

  # Create our test file with some data and durably persist it.
  $ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
  $ sync

  # Append some data to the file, increasing its size, and leave a hole
  # between the old size and the start offset if the following write. So
  # our file gets a hole in the range [128Kb, 256Kb[.
  $ xfs_io -c "truncate 160K" /mnt/foo

  # We expect to see our file with a size of 160Kb, with the first 128Kb
  # of data all having the value 0xaa and the remaining 32Kb of data all
  # having the value 0x00.
  $ od -t x1 /mnt/foo
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0500000

  # Now cleanly unmount and mount again the filesystem.
  $ umount /mnt
  $ mount /dev/sdd /mnt

  # We expect to get the same result as before, a file with a size of
  # 160Kb, with the first 128Kb of data all having the value 0xaa and the
  # remaining 32Kb of data all having the value 0x00.
  $ od -t x1 /mnt/foo
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0400000

In the example above the file size/data do not match what they were before
the remount.

Fix this by always calling btrfs_ordered_update_i_size() with a size
matching the size the file was truncated to if btrfs_truncate_inode_items()
is not called for a log tree and no file extent items were dropped. This
ensures the same behaviour as when the no_holes feature is not enabled.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-07-11 22:33:14 +01:00
Liu Bo
ddba1bfc23 Btrfs: fix warning of bytes_may_use
While running generic/019, dmesg got several warnings from
btrfs_free_reserved_data_space().

Test generic/019 produces some disk failures so sumbit dio will get errors,
in which case, btrfs_direct_IO() goes to the error handling and free
bytes_may_use, but the problem is that bytes_may_use has been free'd
during get_block().

This adds a runtime flag to show if we've gone through get_block(), if so,
don't do the cleanup work.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:21 -07:00
Liu Bo
ad9ee2053f Btrfs: fix hang when failing to submit bio of directIO
The hang is uncoverd by generic/019.

btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
an error, thus those added ordered extents will never get processed, which
block processes that waiting for them via btrfs_start_ordered_extent().

This fixes the above, and meanwhile finish_ordered_fn will do the space
accounting work.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:20 -07:00
Filipe Manana
9c6429d96d Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()
The comment was not correct about the part where it says the endio
callback of the bio might have not yet been called - update it
to mention that by that time the endio callback execution might
still be in progress only.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:19 -07:00
Filipe Manana
61de718fce Btrfs: fix memory corruption on failure to submit bio for direct IO
If we fail to submit a bio for a direct IO request, we were grabbing the
corresponding ordered extent and decrementing its reference count twice,
once for our lookup reference and once for the ordered tree reference.
This was a problem because it caused the ordered extent to be freed
without removing it from the ordered tree and any lists it might be
attached to, leaving dangling pointers to the ordered extent around.
Example trace with CONFIG_DEBUG_PAGEALLOC=y:

[161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
[161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b
[161779.860636] PGD 34d818067 PUD 0
[161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[161779.860636] Call Trace:
[161779.860636]  [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs]
[161779.860636]  [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs]
[161779.860636]  [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
[161779.860636]  [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
[161779.860636]  [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
[161779.860636]  [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43
[161779.860636]  [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
[161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
[161779.860636]  [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
[161779.860636]  [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
(...)

We were also not freeing the btrfs_dio_private we allocated previously,
which kmemleak reported with the following trace in its sysfs file:

unreferenced object 0xffff8803f553bf80 (size 96):
  comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
  hex dump (first 32 bytes):
    88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00  .l..............
    00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81161ffe>] create_object+0x172/0x29a
    [<ffffffff8145870f>] kmemleak_alloc+0x25/0x41
    [<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18
    [<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148
    [<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs]
    [<ffffffff811968dc>] dio_bio_submit+0x62/0x8f
    [<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43
    [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
    [<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
    [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
    [<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
    [<ffffffff8116586a>] __vfs_write+0x7c/0xa5
    [<ffffffff81165da9>] vfs_write+0xa0/0xe4
    [<ffffffff81166675>] SyS_pwrite64+0x64/0x82
    [<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f
    [<ffffffffffffffff>] 0xffffffffffffffff

For read requests we weren't doing any cleanup either (none of the work
done by btrfs_endio_direct_read()), so a failure submitting a bio for a
read request would leave a range in the inode's io_tree locked forever,
blocking any future operations (both reads and writes) against that range.

So fix this by making sure we do the same cleanup that we do for the case
where the bio submission succeeds.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01 17:17:18 -07:00
Filipe Manana
6ca0709756 Btrfs: fix hang during inode eviction due to concurrent readahead
Zygo Blaxell and other users have reported occasional hangs while an
inode is being evicted, leading to traces like the following:

[ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
[ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
[ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
[ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
[ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
[ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
[ 5281.981396] Call Trace:
[ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
[ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
[ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
[ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
[ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
[ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
[ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
[ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
[ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
[ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
[ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[ 5282.001299] 1 lock held by rm/20488:
[ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b

This happens when we have readahead, which calls readpages(), happening
right before the inode eviction handler is invoked. So the reason is
essentially:

1) readpages() is called while a reference on the inode is held, so
   eviction can not be triggered before readpages() returns. It also
   locks one or more ranges in the inode's io_tree (which is done at
   extent_io.c:__do_contiguous_readpages());

2) readpages() submits several read bios, all with an end io callback
   that runs extent_io.c:end_bio_extent_readpage() and that is executed
   by other task when a bio finishes, corresponding to a work queue
   (fs_info->end_io_workers) worker kthread. This callback unlocks
   the ranges in the inode's io_tree that were previously locked in
   step 1;

3) readpages() returns, the reference on the inode is dropped;

4) One or more of the read bios previously submitted are still not
   complete (their end io callback was not yet invoked or has not
   yet finished execution);

5) Inode eviction is triggered (through an unlink call for example).
   The inode reference count was not incremented before submitting
   the read bios, therefore this is possible;

6) The eviction handler starts executing and enters the loop that
   iterates over all extent states in the inode's io_tree;

7) The loop picks one extent state record and uses its ->start and
   ->end fields, after releasing the inode's io_tree spinlock, to
   call lock_extent_bits() and clear_extent_bit(). The call to lock
   the range [state->start, state->end] blocks because the whole
   range or a part of it was locked by the previous call to
   readpages() and the corresponding end io callback, which unlocks
   the range was not yet executed;

8) The end io callback for the read bio is executed and unlocks the
   range [state->start, state->end] (or a superset of that range).
   And at clear_extent_bit() the extent_state record state is used
   as a second argument to split_state(), which sets state->start to
   a larger value;

9) The task executing the eviction handler is woken up by the task
   executing the bio's end io callback (through clear_state_bit) and
   the eviction handler locks the range
   [old value for state->start, state->end]. Shortly after, when
   calling clear_extent_bit(), it unlocks the range
   [new value for state->start, state->end], so it ends up unlocking
   only part of the range that it locked, leaving an extent state
   record in the io_tree that represents the unlocked subrange;

10) The eviction handler loop, in its next iteration, gets the
    extent_state record for the subrange that it did not unlock in the
    previous step and then tries to lock it, resulting in an hang.

So fix this by not using the ->start and ->end fields of an existing
extent_state record. This is a simple solution, and an alternative
could be to bump the inode's reference count before submitting each
read bio and having it dropped in the bio's end io callback. But that
would be a more invasive/complex change and would not protect against
other possible places that are not holding a reference on the inode
as well. Something to consider in the future.

Many thanks to Zygo Blaxell for reporting, in the mailing list, the
issue, a set of scripts to trigger it and testing this fix.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:09 -07:00
Linus Torvalds
64887b6882 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "A few more btrfs fixes.

  These range from corners Filipe found in the new free space cache
  writeback to a grab bag of fixes from the list"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
  Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
  btrfs: unlock i_mutex after attempting to delete subvolume during send
  btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
  btrfs: fix race on ENOMEM in alloc_extent_buffer
  btrfs: handle ENOMEM in btrfs_alloc_tree_block
  Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
  Btrfs: don't check for delalloc_bytes in cache_save_setup
  Btrfs: fix deadlock when starting writeback of bg caches
  Btrfs: fix race between start dirty bg cache writeout and bg deletion
2015-05-01 07:46:21 -07:00
Linus Torvalds
9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
Yang Dongsheng
6e17d30bfa Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
We need to fill inode when we found a node for it in delayed_nodes_tree.
But we did not fill the ->last_trans currently, it will cause the test
of xfstest/generic/311 fail. Scenario of the 311 is shown as below:

Problem:
	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
	(2). pwrite(test_fd, buf, 4096, 0)
	(3). close(test_fd)
	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
	(6). fsync(test_fd);
				<-------- we did not get the correct log entry for the file
Reason:
	When we re-open this file in (5), we would find a node
in delayed_nodes_tree and fill the inode we are lookup with the
information. But the ->last_trans is not filled, then the fsync()
will check the ->last_trans and found it's 0 then say this inode
is already in our tree which is commited, not recording the extents
for it.

Fix:
	This patch fill the ->last_trans properly and set the
runtime_flags if needed in this situation. Then we can get the
log entries we expected after (6) and generic/311 passed.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:03 -07:00
Jens Axboe
fe0f07d08e direct-io: only inc/dec inode->i_dio_count for file systems
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24 15:45:28 -04:00
Linus Torvalds
ba0e4ae88f Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "I've been running these through a longer set of load tests because my
  commits change the free space cache writeout.  It fixes commit stalls
  on large filesystems (~20T space used and up) that we have been
  triggering here.  We were seeing new writers blocked for 10 seconds or
  more during commits, which is far from good.

  Josef and I fixed up ENOSPC aborts when deleting huge files (3T or
  more), that are triggered because our metadata reservations were not
  properly accounting for crcs and were not replenishing during the
  truncate.

  Also in this series, a number of qgroup fixes from Fujitsu and Dave
  Sterba collected most of the pending cleanups from the list"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (93 commits)
  btrfs: quota: Update quota tree after qgroup relationship change.
  btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.
  btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota.
  btrfs: Update btrfs qgroup status item when rescan is done.
  btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value.
  btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created
  btrfs: Check qgroup level in kernel qgroup assign.
  btrfs: qgroup: allow to remove qgroup which has parent but no child.
  btrfs: qgroup: return EINVAL if level of parent is not higher than child's.
  btrfs: qgroup: do a reservation in a higher level.
  Btrfs: qgroup, Account data space in more proper timings.
  Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.
  Btrfs: qgroup: free reserved in exceeding quota.
  Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().
  btrfs: qgroup: fix limit args override whole limit struct
  btrfs: qgroup: update limit info in function btrfs_run_qgroups().
  btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item().
  btrfs: qgroup: update qgroup in memory at the same time when we update it in btree.
  btrfs: qgroup: inherit limit info from srcgroup in creating snapshot.
  btrfs: Support busy loop of write and delete
  ...
2015-04-24 07:40:02 -07:00
David Howells
2b0143b5c9 VFS: normal filesystems (and lustre): d_inode() annotations
that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:57 -04:00
Dongsheng Yang
e2d1f92399 btrfs: qgroup: do a reservation in a higher level.
There are two problems in qgroup:

a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
is not available to user.

b). When user is writing a inline data, qgroup will not reserve it,
it means this is a window we can exceed the limit of a qgroup.

The main idea of this patch is reserving the data size of write_bytes
rather than the reserve_bytes. It means qgroup will not care about
the data size btrfs will reserve for user, but only care about the
data size user is going to write. Then reserve it when user want to
write and release it in transaction committed.

In this way, qgroup can be released from the complex procedure in
btrfs and only do the reserve when user want to write and account
when the data is written in commit_transaction().

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:50 -07:00
Dongsheng Yang
31193213f1 Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.
Currently, for pre_alloc or delay_alloc, the bytes will be accounted
in space_info by the three guys.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
But on the other hand, in qgroup, there are only two counters to account the
bytes, qgroup->reserved and qgroup->excl. And qg->reserved accounts
bytes in space_info->bytes_may_use and qg->excl accounts bytes in
space_info->used. So the bytes in space_info->reserved is not accounted
in qgroup. If so, there is a window we can exceed the quota limit when
bytes is in space_info->reserved.

Example:
	# btrfs quota enable /mnt
	# btrfs qgroup limit -e 10M /mnt
	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
	# sync
	# btrfs qgroup show -pcre /mnt
qgroupid rfer     excl     max_rfer max_excl parent  child
-------- ----     ----     -------- -------- ------  -----
0/5      20987904 20987904 0        10485760 ---     ---

qg->excl is 20987904 larger than max_excl 10485760.

This patch introduce a new counter named may_use to qgroup, then
there are three counters in qgroup to account bytes in space_info
as below.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
qgroup->may_use           --- qgroup->reserved     --- qgroup->excl

With this patch applied:
	# btrfs quota enable /mnt
	# btrfs qgroup limit -e 10M /mnt
	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
	# sync
	# btrfs qgroup show -pcre /mnt
qgroupid rfer    excl    max_rfer max_excl parent  child
-------- ----    ----    -------- -------- ------  -----
0/5      9453568 9453568 0        10485760 ---     ---

Reported-by: Cyril SCETBON <cyril.scetbon@free.fr>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:52:47 -07:00
Zhao Lei
d7c151717a btrfs: Fix NO_SPACE bug caused by delayed-iput
Steps to reproduce:
  while true; do
    dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
    rm /btrfs_dir/file
    sync
  done

  And we'll see dd failed because btrfs return NO_SPACE.

Reason:
  Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
  in end to free fs space for next write, but sometimes it hadn't
  done work on time, because btrfs-cleaner thread get delayed-iputs
  from list before, but do iput() after next write.

  This is log:
  [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin

  [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
  [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
  [ 2569.087554] comm=sync func=btrfs_commit_transaction() end

  [ 2569.191081] comm=dd begin
  [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28

  [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
  [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
  ...
  [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
  [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end

Fix:
  Make btrfs_commit_transaction() wait current running btrfs-cleaner's
  delayed-iputs() done in end.

Test:
  Use script similar to above(more complex),
  before patch:
    7 failed in 100 * 20 loop.
  after patch:
    0 failed in 100 * 20 loop.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:27:41 -07:00
Omar Sandoval
22c6186ece direct_IO: remove rw from a_ops->direct_IO()
Now that no one is using rw, remove it completely.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval
6f67376318 direct_IO: use iov_iter_rw() instead of rw everywhere
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval
17f8c842d2 Remove rw from {,__,do_}blockdev_direct_IO()
Most filesystems call through to these at some point, so we'll start
here.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:44 -04:00
Al Viro
c0fec3a98b Merge branch 'iocb' into for-next 2015-04-11 22:24:41 -04:00
Josef Bacik
3bce876fd5 Btrfs: don't steal from the global reserve if we don't have the space
btrfs_evict_inode() needs to be more careful about stealing from the
global_rsv.  We dont' want to end up aborting commit with ENOSPC just
because the evict_inode code was too greedy.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:59 -07:00
Chris Mason
28f75a0e6c Btrfs: refill block reserves during truncate
When truncate starts, it allocates some space in the block reserves so
that we'll have enough to update metadata along the way.

For very large files, we can easily go through all of that space as we
loop through the extents.  This changes truncate to refill the space
reservation as it progresses through the file.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:34 -07:00
Josef Bacik
1262133b8d Btrfs: account for crcs in delayed ref processing
As we delete large extents, we end up doing huge amounts of COW in order
to delete the corresponding crcs.  This adds accounting so that we keep
track of that space and flushing of delayed refs so that we don't build
up too much delayed crc work.

This helps limit the delayed work that must be done at commit time and
tries to avoid ENOSPC aborts because the crcs eat all the global
reserves.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:04:47 -07:00
Chris Mason
28ed1345a5 btrfs: actively run the delayed refs while deleting large files
When we are deleting large files with large extents, we are building up
a huge set of delayed refs for processing.  Truncate isn't checking
often enough to see if we need to back off and process those, or let
a commit proceed.

The end result is long stalls after the rm, and very long commit times.
During the commits, other processes back up waiting to start new
transactions and we get into trouble.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:00:14 -07:00
Guenter Roeck
4a3d1caf8a fs: btrfs: Add missing include file
Building alpha:allmodconfig fails with

fs/btrfs/inode.c: In function 'check_direct_IO':
fs/btrfs/inode.c:8050:2: error: implicit declaration of function 'iov_iter_alignment'

due to a missing include file.

Fixes: 3737c63e1fb0 ("fs: move struct kiocb to fs.h")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-01 12:32:07 -07:00
Christoph Hellwig
e2e40f2c1e fs: move struct kiocb to fs.h
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25 20:28:11 -04:00
Chris Mason
9deed229fa Merge branch 'cleanups-for-4.1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 2015-03-25 10:43:16 -07:00
Linus Torvalds
521d474631 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Most of these are fixing extent reservation accounting, or corners
  with tree writeback during commit.

  Josef's set does add a test, which isn't strictly a fix, but it'll
  keep us from making this same mistake again"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix outstanding_extents accounting in DIO
  Btrfs: add sanity test for outstanding_extents accounting
  Btrfs: just free dummy extent buffers
  Btrfs: account merges/splits properly
  Btrfs: prepare block group cache before writing
  Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
  Btrfs: account for the correct number of extents for delalloc reservations
  Btrfs: fix merge delalloc logic
  Btrfs: fix comp_oper to get right order
  Btrfs: catch transaction abortion after waiting for it
  btrfs: fix sizeof format specifier in btrfs_check_super_valid()
2015-03-21 10:53:37 -07:00
Josef Bacik
e1cbbfa5f5 Btrfs: fix outstanding_extents accounting in DIO
We are keeping track of how many extents we need to reserve properly based on
the amount we want to write, but we were still incrementing outstanding_extents
if we wrote less than what we requested.  This isn't quite right since we will
be limited to our max extent size.  So instead lets do something horrible!  Keep
track of how many outstanding_extents we reserved, and decrement each time we
allocate an extent.  If we use our entire reserve make sure to jack up
outstanding_extents on the inode so the accounting works out properly.  Thanks,

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:36:35 -04:00
Josef Bacik
6a3891c551 Btrfs: add sanity test for outstanding_extents accounting
I introduced a regression wrt outstanding_extents accounting.  These are tricky
areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
at any time.  So add sanity tests to cover the various conditions that are
tricky in order to make sure we don't introduce regressions in the future.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:36:31 -04:00
Josef Bacik
ba11721355 Btrfs: account merges/splits properly
My fix

Btrfs: fix merge delalloc logic

only fixed half of the problems, it didn't fix the case where we have two large
extents on either side and then join them together with a new small extent.  We
need to instead keep track of how many extents we have accounted for with each
side of the new extent, and then see how many extents we need for the new large
extent.  If they match then we know we need to keep our reservation, otherwise
we need to drop our reservation.  This shows up with a case like this

[BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]

Previously the logic would have said that the number extents required for the
new size (3) is larger than the number of extents required for the largest side
(2) therefore we need to keep our reservation.  But this isn't the case, since
both sides require a reservation of 2 which leads to 4 for the whole range
currently reserved, but we only need 3, so we need to drop one of the
reservations.  The same problem existed for splits, we'd think we only need 3
extents when creating the hole but in reality we need 4.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:28:21 -04:00
Josef Bacik
8461a3de77 Btrfs: fix merge delalloc logic
My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a
regression when re-dirtying already dirty areas.  We have logic in split to make
sure we are taking the largest space into account but didn't have it for merge,
so it was sometimes making us think we were turning a tiny extent into a huge
extent, when in reality we already had a huge extent and needed to use the other
side in our logic.  This fixes the regression that was reported by a user on
list.  Thanks,

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:46:59 -07:00
Linus Torvalds
84399bb075 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Outside of misc fixes, Filipe has a few fsync corners and we're
  pulling in one more of Josef's fixes from production use here"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
  Btrfs: fix data loss in the fast fsync path
  Btrfs: remove extra run_delayed_refs in update_cowonly_root
  Btrfs: incremental send, don't rename a directory too soon
  btrfs: fix lost return value due to variable shadowing
  Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
  Btrfs: fix off-by-one logic error in btrfs_realloc_node
  Btrfs: add missing inode update when punching hole
  Btrfs: abort the transaction if we fail to update the free space cache inode
  Btrfs: fix fsync race leading to ordered extent memory leaks
2015-03-06 13:52:54 -08:00
David Sterba
31e818fe73 btrfs: cleanup, use kmalloc_array/kcalloc array helpers
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional
overflow checks, the zeroing variant is kcalloc.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:58 +01:00
David Sterba
1932b7be97 btrfs: fix lost return value due to variable shadowing
A block-local variable stores error code but btrfs_get_blocks_direct may
not return it in the end as there's a ret defined in the function scope.

CC: <stable@vger.kernel.org>	# 3.6+
Fixes: d187663ef2 ("Btrfs: lock extents as we map them in DIO")
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Linus Torvalds
2b9fb532d4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This pull is mostly cleanups and fixes:

   - The raid5/6 cleanups from Zhao Lei fixup some long standing warts
     in the code and add improvements on top of the scrubbing support
     from 3.19.

   - Josef has round one of our ENOSPC fixes coming from large btrfs
     clusters here at FB.

   - Dave Sterba continues a long series of cleanups (thanks Dave), and
     Filipe continues hammering on corner cases in fsync and others

  This all was held up a little trying to track down a use-after-free in
  btrfs raid5/6.  It's not clear yet if this is just made easier to
  trigger with this pull or if its a new bug from the raid5/6 cleanups.
  Dave Sterba is the only one to trigger it so far, but he has a
  consistent way to reproduce, so we'll get it nailed shortly"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
  Btrfs: don't remove extents and xattrs when logging new names
  Btrfs: fix fsync data loss after adding hard link to inode
  Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
  Btrfs: account for large extents with enospc
  Btrfs: don't set and clear delalloc for O_DIRECT writes
  Btrfs: only adjust outstanding_extents when we do a short write
  btrfs: Fix out-of-space bug
  Btrfs: scrub, fix sleep in atomic context
  Btrfs: fix scheduler warning when syncing log
  Btrfs: Remove unnecessary placeholder in btrfs_err_code
  btrfs: cleanup init for list in free-space-cache
  btrfs: delete chunk allocation attemp when setting block group ro
  btrfs: clear bio reference after submit_one_bio()
  Btrfs: fix scrub race leading to use-after-free
  Btrfs: add missing cleanup on sysfs init failure
  Btrfs: fix race between transaction commit and empty block group removal
  btrfs: add more checks to btrfs_read_sys_array
  btrfs: cleanup, rename a few variables in btrfs_read_sys_array
  btrfs: add checks for sys_chunk_array sizes
  btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
  ...
2015-02-19 14:36:00 -08:00
Josef Bacik
dcab6a3b2a Btrfs: account for large extents with enospc
On our gluster boxes we stream large tar balls of backups onto our fses.  With
160gb of ram this means we get really large contiguous ranges of dirty data, but
the way our ENOSPC stuff works is that as long as it's contiguous we only hold
metadata reservation for one extent.  The problem is we limit our extents to
128mb, so we'll end up with at least 800 extents so our enospc accounting is
quite a bit lower than what we need.  To keep track of this make sure we
increase outstanding_extents for every multiple of the max extent size so we can
be sure to have enough reserved metadata space.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:48 -08:00
Josef Bacik
3266789f9d Btrfs: don't set and clear delalloc for O_DIRECT writes
We do this to get the space accounting, but this is just needless churn on the
io_tree, so just drop setting/clearing delalloc and just drop the reserved data
space when we have a successfull allocation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:19:14 -08:00
Josef Bacik
3e05bde8c3 Btrfs: only adjust outstanding_extents when we do a short write
We have this weird dance where we always inc outstanding_extents when we do a
O_DIRECT write, even if we allocate the entire range.  To get around this we
also drop the metadata space if we successfully write.  This is an unnecessary
dance, we only need to jack up outstanding_extents if we don't satisfy the
entire range request in get_blocks_direct, otherwise we are good using our
original reservation.  So drop the unconditional inc and the drop of the
metadata space that we have for the unconditional inc.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:19:14 -08:00
chandan r
9cc97d6462 Btrfs: Add code to support file creation time
This patch adds a new member to the 'struct btrfs_inode' structure to hold
the file creation time.

Signed-off-by: chandan <chandanrmail@gmail.com>
[refreshed, removed btrfs_inode_otime]
Signed-off-by: David Sterba <dsterba@suse.cz>

Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 18:39:16 -08:00
David Sterba
a937b9791e btrfs: kill btrfs_inode_*time helpers
They just opencode taking address of the timespec member.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 18:39:07 -08:00
Zhao Lei
ffe2d2034b Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply
So we can check raid56 with:
 (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
instead of long:
 (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
David Sterba
9ee49a047d btrfs: switch extent_state state to unsigned
Currently there's a 4B hole in the structure between refs and state and there
are only 16 bits used so we can make it unsigned. This will get a better
packing and may save some stack space for local variables.

The size of extent_state gets reduced by 8B and there are usually a lot
of slab objects.

struct extent_state {
	u64                        start;                /*     0     8 */
	u64                        end;                  /*     8     8 */
	struct rb_node             rb_node;              /*    16    24 */
	wait_queue_head_t          wq;                   /*    40    24 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	atomic_t                   refs;                 /*    64     4 */

	/* XXX 4 bytes hole, try to pack */

	long unsigned int          state;                /*    72     8 */
	u64                        private;              /*    80     8 */

	/* size: 88, cachelines: 2, members: 7 */
	/* sum members: 84, holes: 1, sum holes: 4 */
	/* last cacheline: 24 bytes */
};

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
David Sterba
f0954c6637 btrfs: update message levels after checksum errors
The errors are worth noting and might get missed with INFO level.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
David Sterba
68b663d13c btrfs: update message levels for errors
Several messages that point to some internal problem, level INFO is
wrong here.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
Christoph Hellwig
b83ae6d421 fs: remove mapping->backing_dev_info
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:05 -07:00
David Sterba
1d4c08e0a6 btrfs: expand btrfs_find_item if found_key is NULL
If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper
for simple btrfs_search_slot.

After we've removed all such callers, passing a NULL key is not valid
anymore.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:48 +01:00
Linus Torvalds
03c751a5e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "None of these are huge, but my commit does fix a regression from 3.18
  that could cause lost files during log replay.

  This also adds Dave Sterba to the list of Btrfs maintainers.  It
  doesn't mean we're doing things differently, but Dave has really been
  helping with the maintainer workload for years"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: don't delay inode ref updates during log replay
  Btrfs: correctly get tree level in tree_backref_for_extent
  Btrfs: call inode_dec_link_count() on mkdir error path
  Btrfs: abort transaction if we don't find the block group
  Btrfs, scrub: uninitialized variable in scrub_extent_for_parity()
  Btrfs: add more maintainers
2015-01-09 17:46:07 -08:00
Wang Shilong
c7cfb8a540 Btrfs: call inode_dec_link_count() on mkdir error path
In btrfs_mkdir(), if it fails to create dir, we should
clean up existed items, setting inode's link properly
to make sure it could be cleaned up properly.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02 14:47:55 -05:00
Linus Torvalds
bdeb03cada Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "From a feature point of view, most of the code here comes from Miao
  Xie and others at Fujitsu to implement scrubbing and replacing devices
  on raid56.  This has been in development for a while, and it's a big
  improvement.

  Filipe and Josef have a great assortment of fixes, many of which solve
  problems corruptions either after a crash or in error conditions.  I
  still have a round two from Filipe for next week that solves
  corruptions with discard and block group removal"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
  Btrfs: make get_caching_control unconditionally return the ctl
  Btrfs: fix unprotected deletion from pending_chunks list
  Btrfs: fix fs mapping extent map leak
  Btrfs: fix memory leak after block remove + trimming
  Btrfs: make btrfs_abort_transaction consider existence of new block groups
  Btrfs: fix race between writing free space cache and trimming
  Btrfs: fix race between fs trimming and block group remove/allocation
  Btrfs, replace: enable dev-replace for raid56
  Btrfs: fix freeing used extents after removing empty block group
  Btrfs: fix crash caused by block group removal
  Btrfs: fix invalid block group rbtree access after bg is removed
  Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  ...
2014-12-12 11:15:23 -08:00
Filipe Manana
9ea24bbe17 Btrfs: fix snapshot inconsistency after a file write followed by truncate
If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Filipe Manana
b38ef71cb1 Btrfs: ensure ordered extent errors aren't missed on fsync
When doing a fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we endup
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:

1) We start all ordered extents by calling filemap_fdatawrite_range();

2) We do some other work like locking the inode's i_mutex, start a transaction,
   start a log transaction, etc;

3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
   ordered extents from inode's ordered tree into a list;

4) But by the time we do ordered extent collection, some ordered extents we started
   at step 1) might have already completed with an error, and therefore we didn't
   found them in the ordered tree and had no idea they finished with an error. This
   makes our fsync return success (0) to userspace, but has no bad effects on the log
   like for example insertion of file extent items into the log that point to unwritten
   extents, because the invalid extent maps were removed before the ordered extent
   completed (in inode.c:btrfs_finish_ordered_io).

So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.

This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the later checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21 11:59:57 -08:00
Filipe Manana
e6eb43142a Btrfs: report error after failure inlining extent in compressed write path
If cow_file_range_inline() failed, when called from compress_file_range(),
we were tagging the locked page for writeback, end its writeback and unlock it,
but not marking it with an error nor setting AS_EIO in inode's mapping flags.

This made it impossible for a caller of filemap_fdatawrite_range (writepages)
or filemap_fdatawait_range() to know that an error happened. And the return
value of compress_file_range() is useless because it's returned to a workqueue
task and not to the task calling filemap_fdatawrite_range (writepages).

This change applies on top of the previous patchset starting at the patch
titled:

    "[1/5] Btrfs: set page and mapping error on compressed write failure"

Which changed extent_clear_unlock_delalloc() to use SetPageError and
mapping_set_error().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:28 -08:00
Filipe Manana
728404dacf Btrfs: add helper btrfs_fdatawrite_range
To avoid duplicating this double filemap_fdatawrite_range() call for
inodes with async extents (compressed writes) so often.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:28 -08:00
Filipe Manana
075bdbdbe9 Btrfs: correctly flush compressed data before/after direct IO
For compressed writes, after doing the first filemap_fdatawrite_range() we
don't get the pages tagged for writeback immediately. Instead we create
a workqueue task, which is run by other kthread, and keep the pages locked.
That other kthread compresses data, creates the respective ordered extent/s,
tags the pages for writeback and unlocks them. Therefore we need a second
call to filemap_fdatawrite_range() if we have compressed writes, as this
second call will wait for the pages to become unlocked, then see they became
tagged for writeback and finally wait for the writeback to finish.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:27 -08:00
Filipe Manana
c44f649e28 Btrfs: make inode.c:compress_file_range() return void
Its return value is useless, its single caller ignores it and can't do
anything with it anyway, since it's a workqueue task and not the task
calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
Failure is communicated to such functions via start and end of writeback
with the respective pages tagged with an error and AS_EIO flag set in the
inode's imapping.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:27 -08:00
Shilong Wang
4bcbb33255 Btrfs: fix incorrect compression ratio detection
Steps to reproduce:
 # mkfs.btrfs -f /dev/sdb
 # mount -t btrfs /dev/sdb /mnt -o compress=lzo
 # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1

after previous steps, inode will be detected as bad compression ratio,
and NOCOMPRESS flag will be set for that inode.

Reason is that compress have a max limit pages every time(128K), if a
132k write in, it will be splitted into two write(128k+4k), this bug
is a leftover for commit 68bb462d42a(Btrfs: don't compress for a small write)

Fix this problem by checking every time before compression, if it is a
small write(<=blocksize), we bail out and fall into nocompression directly.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:27 -08:00
Filipe Manana
dec8f17563 Btrfs: make inode.c:submit_compressed_extents() return void
Its return value is completely ignored by its single caller and it's
useless anyway, since errors are indicated through SetPageError and
the bit AS_EIO set in the flags of the inode's mapping. The caller
can't do anything with the value, as it's invoked from a workqueue
task and not by the task calling filemap_fdatawrite_range (which calls
the writepages address space callback, which in turn calls the inode's
fill_delalloc callback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
3d7a820f71 Btrfs: process all async extents on compressed write failure
If we had an error when processing one of the async extents from our list,
we were not processing the remaining async extents, meaning we would leak
those async_extent structs, never release the pages with the compressed
data and never unlock and clear the dirty flag from the inode's pages (those
that correspond to the uncompressed content).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
40ae837b43 Btrfs: don't leak pages and memory on compressed write error
In inode.c:submit_compressed_extents(), if we fail before calling
btrfs_submit_compressed_write(), or when that function fails, we
were freeing the async_extent structure without releasing its pages
and freeing the pages array.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:26 -08:00
Filipe Manana
fce2a4e6b2 Btrfs: fix hang on compressed write error
In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
we never end the writeback on the pages, so any filemap_fdatawait_range() call will
hang forever. We were also not calling the writepage end io hook, which means the
corresponding ordered extent will never complete and all its waiters will block
forever, such as a full fsync (via btrfs_wait_ordered_range()).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:25 -08:00
Filipe Manana
704de49d2b Btrfs: set page and mapping error on compressed write failure
If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
but we don't tag the pages, nor the inode's mapping, with an error. This makes it
impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
for e.g.) know that there was an error.

Note that the return value of submit_compressed_extents() is useless, as that function
is executed by a workqueue task and not directly by the fill_delalloc callback. This
means the writepage/s callbacks of the inode's address space operations don't get that
return value.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20 17:14:25 -08:00
Al Viro
41d28bca2d switch d_materialise_unique() users to d_splice_alias()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:20 -05:00
Chris Mason
d37973082b Revert "Btrfs: race free update of commit root for ro snapshots"
This reverts commit 9c3b306e1c.

Switching only one commit root during a transaction is wrong because it
leads the fs into an inconsistent state. All commit roots should be
switched at once, at transaction commit time, otherwise backref walking
can often miss important references that were only accessible through
the old commit root.  Plus, the root item for the snapshot's root wasn't
getting updated and preventing the next transaction commit to do it.

This made several users get into random corruption issues after creation
of readonly snapshots.

A regression test for xfstests will follow soon.

Cc: stable@vger.kernel.org # 3.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-17 02:40:59 -07:00
Chris Mason
0ec31a61f0 Merge branch 'remove-unlikely' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2014-10-04 09:57:44 -07:00
Chris Mason
bbf65cf0b5 Merge branch 'cleanup/misc-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
Signed-off-by: Chris Mason <clm@fb.com>

Conflicts:
	fs/btrfs/extent_io.c
2014-10-04 09:56:45 -07:00
Qu Wenruo
32be3a1ac6 btrfs: Fix the wrong condition judgment about subset extent map
Previous commit: btrfs: Fix and enhance merge_extent_mapping() to insert
best fitted extent map
is using wrong condition to judgement whether the range is a subset of a
existing extent map.

This may cause bug in btrfs no-holes mode.

This patch will correct the judgment and fix the bug.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
David Sterba
bfebd8b544 btrfs: use enum for wq endio metadata type
The enum exists but is not consistently used.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:28 +02:00
David Sterba
ee39b432b4 btrfs: remove unlikely from data-dependent branches and slow paths
There are the branch hints that obviously depend on the data being
processed, the CPU predictor will do better job according to the actual
load. It also does not make sense to use the hints in slow paths that do
a lot of other operations like locking, waiting or IO.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:15:21 +02:00
David Sterba
5d99a998f3 btrfs: remove unlikely from NULL checks
Unlikely is implicit for NULL checks of pointers.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:06:19 +02:00
Josef Bacik
1d52c78afb Btrfs: try not to ENOSPC on log replay
When doing log replay we may have to update inodes, which traditionally goes
through our delayed inode stuff.  This will try to move space over from the
trans handle, but we don't reserve space in our trans handle on replay since we
don't know how much we will need, so instead we try to flush.  But because we
have a trans handle open we won't flush anything, so if we are out of reserve
space we will simply return ENOSPC.  Since we know that if an operation made it
into the log then we definitely had space before the box bought the farm then we
don't need to worry about doing this space reservation.  Use the
fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
item directly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:36 -07:00
Qu Wenruo
e6c4efd87a btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
The following commit enhanced the merge_extent_mapping() to reduce
fragment in extent map tree, but it can't handle case which existing
lies before map_start:
51f39 btrfs: Use right extent length when inserting overlap extent map.

[BUG]
When existing extent map's start is before map_start,
the em->len will be minus, which will corrupt the extent map and fail to
insert the new extent map.
This will happen when someone get a large extent map, but when it is
going to insert it into extent map tree, some one has already commit
some write and split the huge extent into small parts.

[REPRODUCER]
It is very easy to tiger using filebench with randomrw personality.
It is about 100% to reproduce when using 8G preallocated file in 60s
randonrw test.

[FIX]
This patch can now handle any existing extent position.
Since it does not directly use existing->start, now it will find the
previous and next extent around map_start.
So the old existing->start < map_start bug will never happen again.

[ENHANCE]
This patch will insert the best fitted extent map into extent map tree,
other than the oldest [map_start, map_start + sectorsize) or the
relatively newer but not perfect [map_start, existing->start).

The patch will first search existing extent that does not intersects with
the desired map range [map_start, map_start + len).
The existing extent will be either before or behind map_start, and based
on the existing extent, we can find out the previous and next extent
around map_start.

So the best fitted extent would be [prev->end, next->start).
For prev or next is not found, em->start would be prev->end and em->end
wold be next->start.

With this patch, the fragment in extent map tree should be reduced much
more than the 51f39 commit and reduce an unneeded extent map tree search.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18 07:14:46 -07:00
Miao Xie
f612496bca Btrfs: cleanup the read failure record after write or when the inode is freeing
After the data is written successfully, we should cleanup the read failure record
in that range because
- If we set data COW for the file, the range that the failure record pointed to is
  mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writting,
  the corrupted data is corrected, so the failure record can be removed. And if
  some errors happen on the mirrors, we also needn't worry about it because the
  failure record will be recreated if we read the same place again.

Sometimes, we may fail to correct the data, so the failure records will be left
in the tree, we need free them when we free the inode or the memory leak happens.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:02 -07:00
Miao Xie
8b110e393c Btrfs: implement repair function when direct read fails
This patch implement data repair function when direct read fails.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
  mirror.
- When the io on the mirror ends, we will insert the endio work into the
  dedicated btrfs workqueue, not common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue, if we
  insert the endio work of the io on the mirror into that workqueue, deadlock
  would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
  mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:01 -07:00
Miao Xie
c1dc08967f Btrfs: do file data check by sub-bio's self
Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.

But it was very hard to implement the data repair when dio read failure happens,
because at the final end io function, we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check in the final end io function to the sub-bio end io function, in which we can
get the mirror number of the device we access. This patch did this work as the
first step of the direct io data repair implementation.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:53 -07:00
Miao Xie
dc380aea5f Btrfs: cleanup similar code of the buffered data data check and dio read data check
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:52 -07:00
Miao Xie
23ea8e5a07 Btrfs: load checksum data once when submitting a direct read io
The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:50 -07:00
Wang Shilong
354877befa Btrfs: fix off-by-one in cow_file_range_inline()
Btrfs could still inline file data if its size is same as
page size, so don't skip max value here.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:39 -07:00
Wang Shilong
7816030eb4 Btrfs: fall into nocompression codes quickly if possible
If flag NOCOMPRESS is set which means bad compression ratio,
we could avoid call cow_file_range_async() for this case earlier.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:38 -07:00
Wang Shilong
f79707b092 Btrfs: fix wrong skipping compression for an inode
If a file's compression ratios is bad, we will set NOCOMPRESS
flag for it, and it will skip compression for that inode next time.

However, if we remount fs to COMPRESS_FORCE, it still should try
if we could compress pages for that inode, this patch fix wrong
check for this problem.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:36 -07:00
Filipe Manana
555e128640 Btrfs: set error return value in btrfs_get_blocks_direct
We were returning with 0 (success) because we weren't extracting the
error code from em (PTR_ERR(em)). Fix it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:32 -07:00
Wang Shilong
47059d930f Btrfs: make defragment work with nodatacow option
Btrfs defragment will utilize COW feature, which means this
did not work for nodatacow option, this problem was detected
by xfstests generic/018 with nodatacow mount option.

Fix this problem by forcing cow for a extent with state
@EXTETN_DEFRAG setting.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:26 -07:00
David Sterba
962a298f35 btrfs: kill the key type accessor helpers
btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:12 -07:00
Linus Torvalds
7ed641be75 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe is doing a careful pass through fsync problems, and these are
  the fixes so far.  I'll have one more for rc6 that we're still
  testing.

  My big commit is fixing up some inode hash races that Al Viro found
  (thanks Al)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: use insert_inode_locked4 for inode creation
  Btrfs: fix fsync data loss after a ranged fsync
  Btrfs: kfree()ing ERR_PTRs
  Btrfs: fix crash while doing a ranged fsync
  Btrfs: fix corruption after write/fsync failure + fsync + log recovery
  Btrfs: fix autodefrag with compression
2014-09-12 11:53:30 -07:00
Chris Mason
b0d5d10f41 Btrfs: use insert_inode_locked4 for inode creation
Btrfs was inserting inodes into the hash table before we had fully
set the inode up on disk.  This leaves us open to rare races that allow
two different inodes in memory for the same [root, inode] pair.

This patch fixes things by using insert_inode_locked4 to insert an I_NEW
inode and unlock_new_inode when we're ready for the rest of the kernel
to use the inode.

It also makes sure to init the operations pointers on the inode before
going into the error handling paths.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-08 13:56:45 -07:00
Filipe Manana
dac5705cad Btrfs: fix crash while doing a ranged fsync
While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:

[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)

The necessary conditions that lead to such crash are:

* an incremental fsync (when the inode doesn't have the
  BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
  a file extent item ending at offset X;

* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
  to a file truncate operation that reduces the file to a size smaller
  than X;

* a ranged fsync call happens (via an msync for example), with a range that
  doesn't cover the whole file and the end of this range, lets call it Y, is
  smaller than X;

* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
  calls btrfs_truncate_inode_items() to remove all items from the log
  tree that are associated with our file;

* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
  file extent item it removed is the one ending at offset X, where X > 0 and
  X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
  parameter set to X;

* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
  size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
  ordered operation with a range covering the offset X, calling a BUG_ON() if
  such ordered operation exists. This assumption is made because the disk_i_size
  is only increased after the corresponding file extent item is added to the
  btree (btrfs_finish_ordered_io);

* But because our fsync covers only a limited range, such an ordered extent might
  exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
  extent to finish when calling btrfs_wait_ordered_range();

And then by the time btrfs_ordered_update_i_size() is called, via:

   btrfs_sync_file() ->
       btrfs_log_dentry_safe() ->
           btrfs_log_inode_parent() ->
               btrfs_log_inode() ->
                   btrfs_truncate_inode_items() ->
                       btrfs_ordered_update_i_size()

We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().

So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size.

Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-02 16:46:05 -07:00
Filipe Manana
d9f85963e3 Btrfs: fix corruption after write/fsync failure + fsync + log recovery
While writing to a file, in inode.c:cow_file_range() (and same applies to
submit_compressed_extents()), after reserving an extent for the file data,
we create a new extent map for the written range and insert it into the
extent map cache. After that, we create an ordered operation, but if it
fails (due to a transient/temporary-ENOMEM), we return without dropping
that extent map, which points to a reserved extent that is freed when we
return. A subsequent incremental fsync (when the btrfs inode doesn't have
the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
logs a file extent item based on that extent map, which points to a disk
extent that doesn't contain valid data - it was freed by us earlier, at this
point it might contain any random/garbage data.

Therefore, if we reach an error condition when cowing a file range after
we added the new extent map to the cache, drop it from the cache before
returning.

Some sequence of steps that lead to this:

    $ mkfs.btrfs -f /dev/sdd
    $ mount -o commit=9999 /dev/sdd /mnt
    $ cd /mnt

    $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
    $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
    $ sync

    $ od -t x1 foo
    0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
    *
    0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
    *
    0020000

    $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo

    # Now this write + fsync fail with -ENOMEM, which was returned by
    # btrfs_add_ordered_extent() in inode.c:cow_file_range().
    $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
    $ xfs_io -c "fsync" foo
    fsync: Cannot allocate memory

    # Now do a new write + fsync, which will succeed. Our previous
    # -ENOMEM was a transient/temporary error.
    $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
    $ xfs_io -c "fsync" foo

    # Our file content (in page cache) is now:
    $ od -t x1 foo
    0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
    *
    0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *
    0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0050000

    # Now reboot the machine, and mount the fs, so that fsync log replay
    # takes place.

    # The file content is now weird, in particular the first 8Kb, which
    # do not match our data before nor after the sync command above.
    $ od -t x1 foo
    0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
    *
    0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0050000

    # In fact these first 4Kb are a duplicate of the last 4kb block.
    # The last write got an extent map/file extent item that points to
    # the same disk extent that we got in the write+fsync that failed
    # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
    # verify that:

    $ btrfs-debug-tree /dev/sdd
    (...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
		extent data disk byte 12582912 nr 8192
		extent data offset 0 nr 8192 ram 8192
	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
		extent data disk byte 0 nr 0
		extent data offset 0 nr 8192 ram 8192
	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
		extent data disk byte 12582912 nr 4096
		extent data offset 0 nr 4096 ram 4096

    $ umount /dev/sdd
    $ btrfsck /dev/sdd
    Checking filesystem on /dev/sdd
    UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
    checking extents
    extent item 12582912 has multiple extent items
    ref mismatch on [12582912 4096] extent item 1, found 2
    Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
    backpointer mismatch on [12582912 4096]
    Errors found in extent allocation tree or chunk allocation
    checking free space cache
    checking fs roots
    root 5 inode 257 errors 1000, some csum missing
    found 131074 bytes used err is 1
    total csum bytes: 4
    total tree bytes: 131072
    total fs tree bytes: 32768
    total extent tree bytes: 16384
    btree space waste bytes: 123404
    file data blocks allocated: 274432
     referenced 274432
    Btrfs v3.14.1-96-gcc7fd5a-dirty

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-02 16:46:05 -07:00
Linus Torvalds
1fb00cbca0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The biggest of these comes from Liu Bo, who tracked down a hang we've
  been hitting since moving to kernel workqueues (it's a btrfs bug, not
  in the generic code).  His patch needs backporting to 3.16 and 3.15
  stable, which I'll send once this is in.

  Otherwise these are assorted fixes.  Most were integrated last week
  during KS, but I wanted to give everyone the chance to test the
  result, so I waited for rc2 to come out before sending"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
  Btrfs: fix task hang under heavy compressed write
  Btrfs: fix filemap_flush call in btrfs_file_release
  Btrfs: fix crash on endio of reading corrupted block
  btrfs: fix leak in qgroup_subtree_accounting() error path
  btrfs: Use right extent length when inserting overlap extent map.
  Btrfs: clone, don't create invalid hole extent map
  Btrfs: don't monopolize a core when evicting inode
  Btrfs: fix hole detection during file fsync
  Btrfs: ensure tmpfile inode is always persisted with link count of 0
  Btrfs: race free update of commit root for ro snapshots
  Btrfs: fix regression of btrfs device replace
  Btrfs: don't consider the missing device when allocating new chunks
  Btrfs: Fix wrong device size when we are resizing the device
  Btrfs: don't write any data into a readonly device when scrub
  Btrfs: Fix the problem that the replace destroys the seed filesystem
  btrfs: Return right extent when fiemap gives unaligned offset and len.
  Btrfs: fix wrong extent mapping for DirectIO
  Btrfs: fix wrong write range for filemap_fdatawrite_range()
  Btrfs: fix wrong missing device counter decrease
  Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
  ...
2014-08-27 09:14:17 -07:00
Liu Bo
9e0af23764 Btrfs: fix task hang under heavy compressed write
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.

Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.

Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --

normal_work_helper(arg)
    work = container_of(arg, struct btrfs_work, normal_work);

    work->func() <---- (we name it work X)
    for ordered_work in wq->ordered_list
            ordered_work->ordered_func()
            ordered_work->ordered_free()

The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will

file a readahead request
    btrfs_readpages()
         for page that is not in page cache
                __do_readpage()
                     submit_extent_page()
                           btrfs_submit_bio_hook()
                                 btrfs_bio_wq_end_io()
                                 submit_bio()
                                 end_workqueue_bio() <--(ret by the 1st endio)
                                      queue a work(named work Y) for the 2nd
                                      also the real endio()

So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.

A bit more explanation,

A,B,C -- struct btrfs_work
arg   -- struct work_struct

kthread:
worker_thread()
    pick up a work_struct from @worklist
    process_one_work(arg)
	worker->current_work = arg;  <-- arg is A->normal_work
	worker->current_func(arg)
		normal_work_helper(arg)
		     A = container_of(arg, struct btrfs_work, normal_work);

		     A->func()
		     A->ordered_func()
		     A->ordered_free()  <-- A gets freed

		     B->ordered_func()
			  submit_compressed_extents()
			      find_free_extent()
				  load_free_space_inode()
				      ...   <-- (the above readhead stack)
				      end_workqueue_bio()
					   btrfs_queue_work(work C)
		     B->ordered_free()

As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.

Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).

When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.

So the situation is that our kthread is waiting forever on work C.

Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.

With this patch, I no long hit the above hang.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-24 07:17:02 -07:00
Qu Wenruo
51f395ad40 btrfs: Use right extent length when inserting overlap extent map.
When current btrfs finds that a new extent map is going to be insereted
but failed with -EEXIST, it will try again to insert the extent map
but with the length of sectorsize.
This is OK if we don't enable 'no-holes' feature since all extent space
is continuous, we will not go into the not found->insert routine.

But if we enable 'no-holes' feature, it will make things out of control.
e.g. in 4K sectorsize, we pass the following args to btrfs_get_extent():
btrfs_get_extent() args: start:  27874 len 4100
28672		  27874		28672	27874+4100	32768
                    |-----------------------|
|---------hole--------------------|---------data----------|

1) not found and insert
Since no extent map containing the range, btrfs_get_extent() will go
into the not_found and insert routine, which will try to insert the
extent map (27874, 27847 + 4100).

2) first overlap
But it overlaps with (28672, 32768) extent, so -EEXIST will be returned
by add_extent_mapping().

3) retry but still overlap
After catching the -EEXIST, then btrfs_get_extent() will try insert it
again but with 4K length, which still overlaps, so -EEXIST will be
returned.

This makes the following patch fail to punch hole.
d77815461f btrfs: Avoid trucating page or punching hole in a already existed hole.

This patch will use the right length, which is the (exsisting->start -
em->start) to insert, making the above patch works in 'no-holes' mode.
Also, some small code style problems in above patch is fixed too.

Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe David Manana <fdmanana@suse.com>
Tested-by: Filipe David Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:27 -07:00
Filipe Manana
7064dd5c36 Btrfs: don't monopolize a core when evicting inode
If an inode has a very large number of extent maps, we can spend
a lot of time freeing them, which triggers a soft lockup warning.
Therefore reschedule if we need to when freeing the extent maps
while evicting the inode.

I could trigger this all the time by running xfstests/generic/299 on
a file system with the no-holes feature enabled. That test creates
an inode with 11386677 extent maps.

    $ mkfs.btrfs -f -O no-holes $TEST_DEV
    $ MKFS_OPTIONS="-O no-holes" ./check generic/299
    generic/299 382s ...
    Message from syslogd@debian-vm3 at Aug  7 10:44:29 ...
     kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330]
     384s
    Ran: generic/299
    Passed all 1 tests

    $ dmesg
    (...)
    [86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330]
    (...)
    [86304.300036] Call Trace:
    [86304.300036]  [<ffffffff81698ba9>] __slab_free+0x54/0x295
    [86304.300036]  [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs]
    [86304.300036]  [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0
    [86304.300036]  [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs]
    [86304.300036]  [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs]
    [86304.300036]  [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0
    [86304.300036]  [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40
    [86304.300036]  [<ffffffff811d8cbb>] evict+0xab/0x180
    [86304.300036]  [<ffffffff811d8dce>] dispose_list+0x3e/0x60
    [86304.300036]  [<ffffffff811d9b04>] evict_inodes+0xf4/0x110
    [86304.300036]  [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110
    [86304.300036]  [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30
    [86304.300036]  [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs]
    [86304.300036]  [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80
    [86304.300036]  [<ffffffff811be44e>] deactivate_super+0x4e/0x70
    [86304.300036]  [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0
    [86304.300036]  [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0
    [86304.300036]  [<ffffffff811e0517>] SyS_umount+0x97/0x100
    (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:25 -07:00
Filipe Manana
5762b5c958 Btrfs: ensure tmpfile inode is always persisted with link count of 0
If we open a file with O_TMPFILE, don't do any further operation on
it (so that the inode item isn't updated) and then force a transaction
commit, we get a persisted inode item with a link count of 1, and not 0
as it should be.

Steps to reproduce it (requires a modern xfs_io with -T support):

    $ mkfs.btrfs -f /dev/sdd
    $ mount -o /dev/sdd /mnt
    $ xfs_io -T /mnt &
    $ sync

Then btrfs-debug-tree shows the inode item with a link count of 1:

    $ btrfs-debug-tree /dev/sdd
    (...)
    fs tree key (FS_TREE ROOT_ITEM 0)
    leaf 29556736 items 4 free space 15851 generation 6 owner 5
    fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0
    chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6
    	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1
    	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
    		inode ref index 0 namelen 2 name: ..
    	item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160
    		inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1
    	item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0
		orphan item
    checksum tree key (CSUM_TREE ROOT_ITEM 0)
    (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:23 -07:00
Filipe Manana
9c3b306e1c Btrfs: race free update of commit root for ro snapshots
This is a better solution for the problem addressed in the following
commit:

    Btrfs: update commit root on snapshot creation after orphan cleanup
    (3821f34888)

The previous solution wasn't the best because of 2 reasons:

    1) It added another full transaction commit, which is more expensive
       than just swapping the commit root with the root;

    2) If a reboot happened after the first transaction commit (the one
       that creates the snapshot) and before the second transaction commit,
       then we would end up with the same problem if a send using that
       snapshot was requested before the first transaction commit after
       the reboot.

This change addresses those 2 issues. The second issue is addressed by
switching the commit root in the dentry lookup VFS callback, which is
also called by the snapshot/subvol creation ioctl and performs orphan
cleanup if needed. Like the vfs, the ioctl locks the parent inode too,
preventing race issues between a dentry lookup and snapshot creation.

Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:21 -07:00
Wang Shilong
e2eca69dc6 Btrfs: fix wrong extent mapping for DirectIO
btrfs_next_leaf() will use current leaf's last key to search
and then return a bigger one. So it may still return a file extent
item that is smaller than expected value and we will
get an overflow here for @em->len.

This is easy to reproduce for Btrfs Direct writting, it did not
cause any problem, because writting will re-insert right mapping later.

However, by hacking code to make DIO support compression, wrong extent
mapping is kept and it encounter merging failure(EEXIST) quickly.

Fix this problem by looping to find next file extent item that is bigger
than @start or we could not find anything more.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:13 -07:00
Wang Shilong
9a025a0860 Btrfs: fix wrong write range for filemap_fdatawrite_range()
filemap_fdatawrite_range() expect the third arg to be @end
not @len, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:12 -07:00
Miao Xie
7a5c3c9be1 Btrfs: fix put dio bio twice when we submit dio bio fail
The caller of btrfs_submit_direct_hook() will put the original dio bio
when btrfs_submit_direct_hook() return a error number, so we needn't
put the original bio in btrfs_submit_direct_hook().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:24 -07:00
Linus Torvalds
e64df3ebe8 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "These are all fixes I'd like to get out to a broader audience.

  The biggest of the bunch is Mark's quota fix, which is also in the
  SUSE kernel, and makes our subvolume quotas dramatically more
  accurate.

  I've been running xfstests with these against your current git
  overnight, but I'm queueing up longer tests as well"

* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: disable strict file flushes for renames and truncates
  Btrfs: fix csum tree corruption, duplicate and outdated checksums
  Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
  Btrfs: fix compressed write corruption on enospc
  btrfs: correctly handle return from ulist_add
  btrfs: qgroup: account shared subtrees during snapshot delete
  Btrfs: read lock extent buffer while walking backrefs
  Btrfs: __btrfs_mod_ref should always use no_quota
  btrfs: adjust statfs calculations according to raid profiles
2014-08-16 09:06:55 -06:00
Chris Mason
8d875f95da btrfs: disable strict file flushes for renames and truncates
Truncates and renames are often used to replace old versions of a file
with new versions.  Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.

Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.

This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held.  It's rare, but these things can
deadlock.

This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.

Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:42 -07:00
Liu Bo
ce62003f69 Btrfs: fix compressed write corruption on enospc
When failing to allocate space for the whole compressed extent, we'll
fallback to uncompressed IO, but we've forgotten to redirty the pages
which belong to this compressed extent, and these 'clean' pages will
simply skip 'submit' part and go to endio directly, at last we got data
corruption as we write nothing.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:18 -07:00
Miklos Szeredi
80ace85c91 btrfs: add RENAME_NOREPLACE
RENAME_NOREPLACE is trivial to implement for most filesystems: switch over
to ->rename2() and check for the supported flags.  The rest is done by the
VFS.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Linus Torvalds
e13d100beb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This fixes some lockups in btrfs reported with rc1.  It probably has
  some performance impact because it is backing off our spinning locks
  more often and switching to a blocking lock.  I'll be able to nail
  that down next week, but for now I want to get the lockups taken care
  of.

  Otherwise some more stack reduction and assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong error handle when the device is missing or is not writeable
  Btrfs: fix deadlock when mounting a degraded fs
  Btrfs: use bio_endio_nodec instead of open code
  Btrfs: fix NULL pointer crash when running balance and scrub concurrently
  btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
  Btrfs: fix broken free space cache after the system crashed
  Btrfs: make free space cache write out functions more readable
  Btrfs: remove unused wait queue in struct extent_buffer
  Btrfs: fix deadlocks with trylock on tree nodes
2014-06-21 14:21:43 -10:00
Miao Xie
e570fd27f2 Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
  BTRFS error (device xxx): block group xxxx has wrong amount of free space
  BTRFS error (device xxx): failed to load free space cache for block group xxx

It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.

There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
  space cache
- account the size of the allocated space that is used to store the file
  data, if the size is not zero, don't write out the free space cache.

The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Linus Torvalds
16b9057804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "This the bunch that sat in -next + lock_parent() fix.  This is the
  minimal set; there's more pending stuff.

  In particular, I really hope to get acct.c fixes merged this cycle -
  we need that to deal sanely with delayed-mntput stuff.  In the next
  pile, hopefully - that series is fairly short and localized
  (kernel/acct.c, fs/super.c and fs/namespace.c).  In this pile: more
  iov_iter work.  Most of prereqs for ->splice_write with sane locking
  order are there and Kent's dio rewrite would also fit nicely on top of
  this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
  lock_parent: don't step on stale ->d_parent of all-but-freed one
  kill generic_file_splice_write()
  ceph: switch to iter_file_splice_write()
  shmem: switch to iter_file_splice_write()
  nfs: switch to iter_splice_write_file()
  fs/splice.c: remove unneeded exports
  ocfs2: switch to iter_file_splice_write()
  ->splice_write() via ->write_iter()
  bio_vec-backed iov_iter
  optimize copy_page_{to,from}_iter()
  bury generic_file_aio_{read,write}
  lustre: get rid of messing with iovecs
  ceph: switch to ->write_iter()
  ceph_sync_direct_write: stop poking into iov_iter guts
  ceph_sync_read: stop poking into iov_iter guts
  new helper: copy_page_from_iter()
  fuse: switch to ->write_iter()
  btrfs: switch to ->write_iter()
  ocfs2: switch to ->write_iter()
  xfs: switch to ->write_iter()
  ...
2014-06-12 10:30:18 -07:00
Linus Torvalds
859862ddd2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "The biggest change here is Josef's rework of the btrfs quota
  accounting, which improves the in-memory tracking of delayed extent
  operations.

  I had been working on Btrfs stack usage for a while, mostly because it
  had become impossible to do long stress runs with slab, lockdep and
  pagealloc debugging turned on without blowing the stack.  Even though
  you upgraded us to a nice king sized stack, I kept most of the
  patches.

  We also have some very hard to find corruption fixes, an awesome sysfs
  use after free, and the usual assortment of optimizations, cleanups
  and other fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (80 commits)
  Btrfs: convert smp_mb__{before,after}_clear_bit
  Btrfs: fix scrub_print_warning to handle skinny metadata extents
  Btrfs: make fsync work after cloning into a file
  Btrfs: use right type to get real comparison
  Btrfs: don't check nodes for extent items
  Btrfs: don't release invalid page in btrfs_page_exists_in_range()
  Btrfs: make sure we retry if page is a retriable exception
  Btrfs: make sure we retry if we couldn't get the page
  btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56
  trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/
  Btrfs: fix leaf corruption after __btrfs_drop_extents
  Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item
  Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled
  btrfs: free delayed node outside of root->inode_lock
  btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX
  Btrfs: fix transaction leak during fsync call
  btrfs: Avoid trucating page or punching hole in a already existed hole.
  Btrfs: update commit root on snapshot creation after orphan cleanup
  Btrfs: ioctl, don't re-lock extent range when not necessary
  Btrfs: avoid visiting all extent items when cloning a range
  ...
2014-06-11 09:22:21 -07:00
Filipe Manana
7ffbb598a0 Btrfs: make fsync work after cloning into a file
When cloning into a file, we were correctly replacing the extent
items in the target range and removing the extent maps. However
we weren't replacing the extent maps with new ones that point to
the new extents - as a consequence, an incremental fsync (when the
inode doesn't have the full sync flag) was a NOOP, since it relies
on the existence of extent maps in the modified list of the inode's
extent map tree, which was empty. Therefore add new extent maps to
reflect the target clone range.

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:16 -07:00
Filipe Manana
6fdef6d43c Btrfs: don't release invalid page in btrfs_page_exists_in_range()
In inode.c:btrfs_page_exists_in_range(), if the page we got from
the radix tree is an exception entry, which can't be retried, we
exit the loop with a non-NULL page and then call page_cache_release
against it, which is not ok since it's not a valid page. This could
also make us return true when we shouldn't.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:14 -07:00
Filipe Manana
809f901625 Btrfs: make sure we retry if page is a retriable exception
In inode.c:btrfs_page_exists_in_range(), if the page we get from the
radix tree is an exception which should make us retry, set page to
NULL in order to really retry, because otherwise we don't get another
loop iteration executed (page != NULL makes the while loop exit).
This also was making us call page_cache_release after exiting the loop,
which isn't correct because page doesn't point to a valid page, and
possibly return true from the function when we shouldn't.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:13 -07:00
Filipe Manana
91405151eb Btrfs: make sure we retry if we couldn't get the page
In inode.c:btrfs_page_exists_in_range(), if we can't get the page
we need to retry. However we weren't retrying because we weren't
setting page to NULL, which makes the while loop exit immediately
and will make us call page_cache_release after exiting the loop
which is incorrect because our page get didn't succeed. This could
also make us return true when we shouldn't.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:12 -07:00
Chris Mason
a79b7d4b3e Btrfs: async delayed refs
Delayed extent operations are triggered during transaction commits.
The goal is to queue up a healthly batch of changes to the extent
allocation tree and run through them in bulk.

This farms them off to async helper threads.  The goal is to have the
bulk of the delayed operations being done in the background, but this is
also important to limit our stack footprint.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:58 -07:00
Chris Mason
40f765805f Btrfs: split up __extent_writepage to lower stack usage
__extent_writepage has two unrelated parts.  First it does the delayed
allocation dance and second it does the mapping and IO for the page
we're actually writing.

This splits it up into those two parts so the stack from one doesn't
impact the stack from the other.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:58 -07:00
Alex Gartrell
fc4adbff82 btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking
In these instances, we are trying to determine if a page has been accessed
since we began the operation for the sake of retry.  This is easily
accomplished by doing a gang lookup in the page mapping radix tree, and it
saves us the dependency on the flag (so that we might eventually delete
it).

btrfs_page_exists_in_range borrows heavily from find_get_page, replacing
the radix tree look up with a gang lookup of 1, so that we can find the
next highest page >= index and see if it falls into our lock range.

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Alex Gartrell <agartrell@fb.com>
2014-06-09 17:20:57 -07:00
David Sterba
351fd35321 btrfs: remove stale newlines from log messages
I've noticed an extra line after "use no compression", but search
revealed much more in messages of more critical levels and rare errors.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:53 -07:00
Miao Xie
995946dd29 Btrfs: use helpers for last_trans_log_full_commit instead of opencode
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:45 -07:00
Miao Xie
27cdeb7096 Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:40 -07:00
Daeseok Youn
944a4515b2 btrfs: remove redundant null check in btrfs_dentry_release()
It doesn't need to check NULL for kfree()

Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:31 -07:00
Filipe Manana
ef3b9af50b Btrfs: implement inode_operations callback tmpfile
This implements the tmpfile callback of struct inode_operations, introduced
in the linux kernel 3.11, and implemented already by some filesystems. This
callback is invoked by the VFS when the flag O_TMPFILE is passed to the open
system call.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-06-09 17:20:30 -07:00
Zach Brown
166ae5a418 btrfs: fix inline compressed read err corruption
uncompress_inline() is dropping the error from btrfs_decompress() after
testing it and zeroing the page that was supposed to hold decompressed
data.  This can silently turn compressed inline data in to zeros if
decompression fails due to corrupt compressed data or memory allocation
failure.

I verified this by manually forcing the error from btrfs_decompress()
for a silly named copy of od:

	if (!strcmp(current->comm, "failod"))
		ret = -ENOMEM;

  # od -x /mnt/btrfs/dir/80 | head -1
  0000000 3031 3038 310a 2d30 6f70 6e69 0a74 3031
  # echo 3 > /proc/sys/vm/drop_caches
  # cp $(which od) /tmp/failod
  # /tmp/failod -x /mnt/btrfs/dir/80 | head -1
  0000000 0000 0000 0000 0000 0000 0000 0000 0000

The fix is to pass the error to its caller.  Which still has a BUG_ON().
So we fix that too.

There seems to be no reason for the zeroing of the page on the error
from btrfs_decompress() but not from the allocation error a few lines
above.  So the page zeroing is removed.

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:23 -07:00
Al Viro
28060d5d9b btrfs: switch check_direct_IO() to iov_iter
... and don't open-code iov_iter_alignment() there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:53 -04:00
Al Viro
31b140398c switch {__,}blockdev_direct_IO() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:46 -04:00
Al Viro
a6cbcd4a4a get rid of pointless iov_length() in ->direct_IO()
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:45 -04:00
Al Viro
d8d3d94b80 pass iov_iter to ->direct_IO()
unmodified, for now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:44 -04:00
Peter Zijlstra
4e857c58ef arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 14:20:48 +02:00
Linus Torvalds
3123bca719 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull second set of btrfs updates from Chris Mason:
 "The most important changes here are from Josef, fixing a btrfs
  regression in 3.14 that can cause corruptions in the extent allocation
  tree when snapshots are in use.

  Josef also fixed some deadlocks in send/recv and other assorted races
  when balance is running"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
  Btrfs: fix compile warnings on on avr32 platform
  btrfs: allow mounting btrfs subvolumes with different ro/rw options
  btrfs: export global block reserve size as space_info
  btrfs: fix crash in remount(thread_pool=) case
  Btrfs: abort the transaction when we don't find our extent ref
  Btrfs: fix EINVAL checks in btrfs_clone
  Btrfs: fix unlock in __start_delalloc_inodes()
  Btrfs: scrub raid56 stripes in the right way
  Btrfs: don't compress for a small write
  Btrfs: more efficient io tree navigation on wait_extent_bit
  Btrfs: send, build path string only once in send_hole
  btrfs: filter invalid arg for btrfs resize
  Btrfs: send, fix data corruption due to incorrect hole detection
  Btrfs: kmalloc() doesn't return an ERR_PTR
  Btrfs: fix snapshot vs nocow writting
  btrfs: Change the expanding write sequence to fix snapshot related bug.
  btrfs: make device scan less noisy
  btrfs: fix lockdep warning with reclaim lock inversion
  Btrfs: hold the commit_root_sem when getting the commit root during send
  Btrfs: remove transaction from send
  ...
2014-04-11 14:16:53 -07:00
Wang Shilong
a1ecaabbf9 Btrfs: fix unlock in __start_delalloc_inodes()
This patch fix a regression caused by the following patch:
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock

break while loop will make us call @spin_unlock() without
calling @spin_lock() before, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:50 -07:00
Wang Shilong
68bb462d42 Btrfs: don't compress for a small write
To compress a small file range(<=blocksize) that is not
an inline extent can not save disk space at all. skip it can
save us some cpu time.

This patch can also fix wrong setting nocompression flag for
inode, say a case when @total_in is 4096, and then we get
@total_compressed 52,because we do aligment to page cache size
firstly, and then we get into conclusion @total_in=@total_compressed
thus we will clear this inode's compression flag.

An exception comes from inserting inline extent failure but we
still have @total_compressed < @total_in,so we will still reset
inode's flag, this is ok, because we don't have good compression
effect.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:48 -07:00
Wang Shilong
e9894fd3e3 Btrfs: fix snapshot vs nocow writting
While running fsstress and snapshots concurrently, we will hit something
like followings:

Thread 1			Thread 2

|->fallocate
  |->write pages
    |->join transaction
       |->add ordered extent
    |->end transaction
				|->flushing data
				  |->creating pending snapshots
|->write data into src root's
   fallocated space

After above work flows finished, we will get a state that source and
snapshot root share same space, but source root have written data into
fallocated space, this will make fsck fail to verify checksums for
snapshot root's preallocating file extent data.Nocow writting also
has this same problem.

Fix this problem by syncing snapshots with nocow writting:

 1.for nocow writting,if there are pending snapshots, we will
 fall into COW way.

 2.if there are pending nocow writes, snapshots for this root
 will be blocked until nocow writting finish.

Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:43 -07:00
Linus Torvalds
53c566625f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs changes from Chris Mason:
 "This is a pretty long stream of bug fixes and performance fixes.

  Qu Wenruo has replaced the btrfs async threads with regular kernel
  workqueues.  We'll keep an eye out for performance differences, but
  it's nice to be using more generic code for this.

  We still have some corruption fixes and other patches coming in for
  the merge window, but this batch is tested and ready to go"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
  Btrfs: fix a crash of clone with inline extents's split
  btrfs: fix uninit variable warning
  Btrfs: take into account total references when doing backref lookup
  Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
  Btrfs: fix incremental send's decision to delay a dir move/rename
  Btrfs: remove unnecessary inode generation lookup in send
  Btrfs: fix race when updating existing ref head
  btrfs: Add trace for btrfs_workqueue alloc/destroy
  Btrfs: less fs tree lock contention when using autodefrag
  Btrfs: return EPERM when deleting a default subvolume
  Btrfs: add missing kfree in btrfs_destroy_workqueue
  Btrfs: cache extent states in defrag code path
  Btrfs: fix deadlock with nested trans handles
  Btrfs: fix possible empty list access when flushing the delalloc inodes
  Btrfs: split the global ordered extents mutex
  Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
  Btrfs: reclaim delalloc metadata more aggressively
  Btrfs: remove unnecessary lock in may_commit_transaction()
  Btrfs: remove the unnecessary flush when preparing the pages
  Btrfs: just do dirty page flush for the inode with compression before direct IO
  ...
2014-04-04 15:31:36 -07:00
Johannes Weiner
91b0abe36a mm + fs: store shadow entries in page cache
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page.  As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently.  At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate.  Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:01 -07:00
Miao Xie
573bfb72f7 Btrfs: fix possible empty list access when flushing the delalloc inodes
We didn't have a lock to protect the access to the delalloc inodes list, that is
we might access a empty delalloc inodes list if someone start flushing delalloc
inodes because the delalloc inodes were moved into a other list temporarily.
Fix it by wrapping the access with a lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:29 -04:00
Miao Xie
6c255e67ce Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
We needn't flush all delalloc inodes when we doesn't get s_umount lock,
or we would make the tasks wait for a long time.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:27 -04:00
Miao Xie
41bd9ca459 Btrfs: just do dirty page flush for the inode with compression before direct IO
As the comment in the btrfs_direct_IO says, only the compressed pages need be
flush again to make sure they are on the disk, but the common pages needn't,
so we add a if statement to check if the inode has compressed pages or not,
if no, skip the flush.

And in order to prevent the write ranges from intersecting, we need wait for
the running ordered extents. But the current code waits for them twice, one
is done before the direct IO starts (in btrfs_wait_ordered_range()), the other
is before we get the blocks, it is unnecessary. because we can do the direct
IO without holding i_mutex, it means that the intersected ordered extents may
happen during the direct IO, the first wait can not avoid this problem. So we
use filemap_fdatawrite_range() instead of btrfs_wait_ordered_range() to remove
the first wait.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:24 -04:00
Qu Wenruo
d458b0540e btrfs: Cleanup the "_struct" suffix in btrfs_workequeue
Since the "_struct" suffix is mainly used for distinguish the differnt
btrfs_work between the original and the newly created one,
there is no need using the suffix since all btrfs_workers are changed
into btrfs_workqueue.

Also this patch fixed some codes whose code style is changed due to the
too long "_struct" suffix.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:16 -04:00
Qu Wenruo
dc6e320998 btrfs: Replace fs_info->fixup_workers workqueue with btrfs_workqueue.
Replace the fs_info->fixup_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:12 -04:00
Qu Wenruo
fccb5d86d8 btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue.
Replace the fs_info->endio_* workqueues with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:08 -04:00
Qu Wenruo
a44903abe9 btrfs: Replace fs_info->flush_workers with btrfs_workqueue.
Replace the fs_info->submit_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:07 -04:00
Qu Wenruo
afe3d24267 btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue
Much like the fs_info->workers, replace the fs_info->delalloc_workers
use the same btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:06 -04:00
Miao Xie
7b2b70851f Btrfs: fix preallocate vs double nocow write
We can not release the reserved metadata space for the first write if we
find the write position is pre-allocated. Because the kernel might write
the data on the disk before we do the second write but after the can-nocow
check, if we release the space for the first write, we might fail to update
the metadata because of no space.

Fix this problem by end nocow write if there is dirty data in the range whose
space is pre-allocated.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:00 -04:00
Liu Bo
7813b3db0a Btrfs: avoid warning bomb of btrfs_invalidate_inodes
So after transaction is aborted, we need to cleanup inode resources by
calling btrfs_invalidate_inodes(), and btrfs_invalidate_inodes() hopes
roots' refs to be zero in old times and sets a WARN_ON(), however, this
is not always true within cleaning up transaction, so we get to detect
transaction abortion and not warn at all.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:38 -04:00
Wang Shilong
bcbba5e659 Btrfs: skip readonly root for snapshot-aware defragment
Btrfs send is assuming readonly root won't change, let's skip readonly root.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:34 -04:00
Josef Bacik
29bce2f399 Btrfs: unlock extent and pages on error in cow_file_range
When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I
made it so we just return an error instead of unlocking all of our various
stuff.  This is a mistake and causes us to hang when we run into this.  This
patch fixes this problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:53 -04:00
Josef Bacik
c581afc8db Btrfs: balance delayed inode updates
While trying to reproduce a delayed ref problem I noticed the box kept falling
over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's.
Turns out this is because we only throttle delayed inode updates in
btrfs_dirty_inode, which doesn't actually get called that often, especially when
all you are doing is creating a bunch of files.  So balance delayed inode
updates everytime we create a new inode.  With this patch we no longer use up
all of our ram with delayed inode updates.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:15:52 -04:00
Linus Torvalds
3962dfbe22 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We have a small collection of fixes in my for-linus branch.

  The big thing that stands out is a revert of a new ioctl.  Users
  haven't shipped yet in btrfs-progs, and Dave Sterba found a better way
  to export the information"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: use right clone root offset for compressed extents
  btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105
  Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol
  Btrfs: fix max_inline mount option
  Btrfs: fix a lockdep warning when cleaning up aborted transaction
  Revert "btrfs: add ioctl to export size of global metadata reservation"
2014-02-16 11:05:27 -08:00
Josef Bacik
3a0dfa6a12 Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol
A user was running into errors from an NFS export of a subvolume that had a
default subvol set.  When we mount a default subvol we will use d_obtain_alias()
to find an existing dentry for the subvolume in the case that the root subvol
has already been mounted, or a dummy one is allocated in the case that the root
subvol has not already been mounted.  This allows us to connect the dentry later
on if we wander into the path.  However if we don't ever wander into the path we
will keep DCACHE_DISCONNECTED set for a long time, which angers NFS.  It doesn't
appear to cause any problems but it is annoying nonetheless, so simply unset
DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to
use d_materialise_unique() instead which will make everything play nicely
together and reconnect stuff if we wander into the defaul subvol path from a
different way.  With this patch I'm no longer getting the NFS errors when
exporting a volume that has been mounted with a default subvol set.  Thanks,

cc: bfields@fieldses.org
cc: ebiederm@xmission.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-02-14 13:44:32 -08:00
Linus Torvalds
878a876b2e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe is fixing compile and boot problems with our crc32c rework, and
  Josef has disabled snapshot aware defrag for now.

  As the number of snapshots increases, we're hitting OOM.  For the
  short term we're disabling things until a bigger fix is ready"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: use late_initcall instead of module_init
  Btrfs: use btrfs_crc32c everywhere instead of libcrc32c
  Btrfs: disable snapshot aware defrag for now
2014-02-04 12:26:56 -08:00
Josef Bacik
8101c8dbf6 Btrfs: disable snapshot aware defrag for now
It's just broken and it's taking a lot of effort to fix it, so for now just
disable it so people can defrag in peace.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-02-03 09:01:27 -08:00
Linus Torvalds
e7651b819e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This is a pretty big pull, and most of these changes have been
  floating in btrfs-next for a long time.  Filipe's properties work is a
  cool building block for inheriting attributes like compression down on
  a per inode basis.

  Jeff Mahoney kicked in code to export filesystem info into sysfs.

  Otherwise, lots of performance improvements, cleanups and bug fixes.

  Looks like there are still a few other small pending incrementals, but
  I wanted to get the bulk of this in first"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits)
  Btrfs: fix spin_unlock in check_ref_cleanup
  Btrfs: setup inode location during btrfs_init_inode_locked
  Btrfs: don't use ram_bytes for uncompressed inline items
  Btrfs: fix btrfs_search_slot_for_read backwards iteration
  Btrfs: do not export ulist functions
  Btrfs: rework ulist with list+rb_tree
  Btrfs: fix memory leaks on walking backrefs failure
  Btrfs: fix send file hole detection leading to data corruption
  Btrfs: add a reschedule point in btrfs_find_all_roots()
  Btrfs: make send's file extent item search more efficient
  Btrfs: fix to catch all errors when resolving indirect ref
  Btrfs: fix protection between walking backrefs and root deletion
  btrfs: fix warning while merging two adjacent extents
  Btrfs: fix infinite path build loops in incremental send
  btrfs: undo sysfs when open_ctree() fails
  Btrfs: fix snprintf usage by send's gen_unique_name
  btrfs: fix defrag 32-bit integer overflow
  btrfs: sysfs: list the NO_HOLES feature
  btrfs: sysfs: don't show reserved incompat feature
  btrfs: call permission checks earlier in ioctls and return EPERM
  ...
2014-01-30 20:08:20 -08:00
Linus Torvalds
f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Chris Mason
90d3e592e9 Btrfs: setup inode location during btrfs_init_inode_locked
We have a race during inode init because the BTRFS_I(inode)->location is setup
after the inode hash table lock is dropped.  btrfs_find_actor uses the location
field, so our search might not find an existing inode in the hash table if we
race with the inode init code.

This commit changes things to setup the location field sooner.  Also the find actor now
uses only the location objectid to match inodes.  For inode hashing, we just
need a unique and stable test, it doesn't have to reflect the inode numbers we
show to userland.

Signed-off-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org
2014-01-29 07:06:30 -08:00
Chris Mason
514ac8ad87 Btrfs: don't use ram_bytes for uncompressed inline items
If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
the new size.  The fixe uses the size directly from the item header when
reading uncompressed inlines, and also fixes truncate to update the
size as it goes.

Reported-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org
2014-01-29 07:06:29 -08:00
Gui Hecheng
3c9665df0c btrfs: fix warning while merging two adjacent extents
When we have two adjacent extents in relink_extent_backref,
we try to merge them. When we use btrfs_search_slot to locate the
slot for the current extent, we shouldn't set "ins_len = 1",
because we will merge it into the previous extent rather than
insert a new item. Otherwise, we may happen to create a new leaf
in btrfs_search_slot and path->slot[0] will be 0. Then we try to
fetch the previous item using "path->slots[0]--", and it will cause
a warning as follows:

	[  145.713385] WARNING: CPU: 3 PID: 1796 at fs/btrfs/extent_io.c:5043 map_private_extent_buffer+0xd4/0xe0
	[  145.713387] btrfs bad mapping eb start 5337088 len 4096, wanted 167772306 8
	...
	[  145.713462]  [<ffffffffa034b1f4>] map_private_extent_buffer+0xd4/0xe0
	[  145.713476]  [<ffffffffa030097a>] ? btrfs_free_path+0x2a/0x40
	[  145.713485]  [<ffffffffa0340864>] btrfs_get_token_64+0x64/0xf0
	[  145.713498]  [<ffffffffa033472c>] relink_extent_backref+0x41c/0x820
	[  145.713508]  [<ffffffffa0334d69>] btrfs_finish_ordered_io+0x239/0xa80

I encounter this warning when running defrag having mkfs.btrfs
with option -M. At the same time there are read/writes & snapshots
running at background.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:22 -08:00
Wang Shilong
2c21b4d733 Btrfs: fix transaction abortion when remounting btrfs from RW to RO
Steps to reproduce:
 # mkfs.btrfs -f /dev/sda8
 # mount /dev/sda8 /mnt -o flushoncommit
 # dd if=/dev/zero of=/mnt/data bs=4k count=102400 &
 # mount /dev/sda8 /mnt -o remount, ro

When remounting RW to RO, the logic is to firstly set flag
to RO and then commit transaction, however with option
flushoncommit enabled,we will do RO check within committing
transaction, so we get a transaction abortion here.

Actually,here check is wrong, we should check if FS_STATE_ERROR
is set, fix it.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:36 -08:00
Filipe David Borba Manana
63541927c8 Btrfs: add support for inode properties
This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."

Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).

This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.

The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.

Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.

Basically the tests correspond to:

Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.

Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
The unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.

Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.

Test 4 - same as test 3 but with 10 properties per file.

Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.

* Without properties (test 1)

                    file creation time        ls -lha time
10 000 files              3.49                   0.76
100 000 files            47.19                   8.37
1 000 000 files         518.51                 107.06

* With 1 property (compression property set to lzo - test 2)

                    file creation time        ls -lha time
10 000 files              3.63                    0.93
100 000 files            48.56                    9.74
1 000 000 files         537.72                  125.11

* With 4 properties (test 3)

                    file creation time        ls -lha time
10 000 files              3.94                    1.20
100 000 files            52.14                   11.48
1 000 000 files         572.70                  142.13

* With 10 properties (test 4)

                    file creation time        ls -lha time
10 000 files              4.61                    1.35
100 000 files            58.86                   13.83
1 000 000 files         656.01                  177.61

The increased latencies with properties are essencialy because of:

*) When creating an inode, we now synchronously write 1 more item
   (an xattr item) for each property inherited from the parent dir
   (or subvolume). This could be done in an asynchronous way such
   as we do for dir intex items (delayed-inode.c), which could help
   reduce the file creation latency;

*) With properties, we now have larger fs trees. For this particular
   test each xattr item uses 75 bytes of leaf space in the fs tree.
   This could be less by using a new item for xattr items, instead of
   the current btrfs_dir_item, since we could cut the 'location' and
   'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
   total of 26 bytes per xattr item) from the btrfs_dir_item type.

Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.

Test script:

$ cat test.pl
  #!/usr/bin/perl -w

  use strict;
  use Time::HiRes qw(time);
  use constant NUM_FILES => 10_000;
  use constant FILE_SIZES => (64 * 1024);
  use constant DEV => '/dev/sdb4';
  use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
  use constant TEST_DIR => (MNT_POINT . '/testdir');

  system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";

  # following line for testing without properties
  #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";

  # following 2 lines for testing with properties
  system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
  system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";

  system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
  my ($t1, $t2);

  $t1 = time();
  for (my $i = 1; $i <= NUM_FILES; $i++) {
      my $p = TEST_DIR . '/file_' . $i;
      open(my $f, '>', $p) or die "Error opening file!";
      $f->autoflush(1);
      for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
          print $f ('A' x 4096) or die "Error writing to file!";
      }
      close($f);
  }
  $t2 = time();
  print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
  system("umount", DEV) == 0 or die "umount failed!";
  system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";

  $t1 = time();
  system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
  $t2 = time();
  print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
  system("umount", DEV) == 0 or die "umount failed!";

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:24 -08:00
Filipe David Borba Manana
1acae57b16 Btrfs: faster file extent item replace operations
When writing to a file we drop existing file extent items that cover the
write range and then add a new file extent item that represents that write
range.

Before this change we were doing a tree lookup to remove the file extent
items, and then after we did another tree lookup to insert the new file
extent item.
Most of the time all the file extent items we need to drop are located
within a single leaf - this is the leaf where our new file extent item ends
up at. Therefore, in this common case just combine these 2 operations into
a single one.

By avoiding the second btree navigation for insertion of the new file extent
item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
COW operations, CPU time on btree node/leaf key binary searches, etc.

Besides for file writes, this is an operation that happens for file fsync's
as well. However log btrees are much less likely to big as big as regular
fs btrees, therefore the impact of this change is smaller.

The following benchmark was performed against an SSD drive and a
HDD drive, both for random and sequential writes:

  sysbench --test=fileio --file-num=4096 --file-total-size=8G \
     --file-test-mode=[rndwr|seqwr] --num-threads=512 \
     --file-block-size=8192 \ --max-requests=1000000 \
     --file-fsync-freq=0 --file-io-mode=sync [prepare|run]

All results below are averages of 10 runs of the respective test.

** SSD sequential writes

Before this change: 225.88 Mb/sec
After this change:  277.26 Mb/sec

** SSD random writes

Before this change: 49.91 Mb/sec
After this change:  56.39 Mb/sec

** HDD sequential writes

Before this change: 68.53 Mb/sec
After this change:  69.87 Mb/sec

** HDD random writes

Before this change: 13.04 Mb/sec
After this change:  14.39 Mb/sec

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:23 -08:00
Miao Xie
e77751aad1 Btrfs: fix the wrong nocow range check
The following warning message was outputed when running the 274th case
of xfstests with nodatacow option:
 BUG: Bad page state in process kswapd0  pfn:1c66f
 page:ffffea0000636848 count:0 mapcount:0 mapping:(null) index:0x78000
 page flags: 0x1000000000100a(error|uptodate|private_2)

It is because the check of nocow range was wrong, we should compare the
start and end position of the extent with the write position to verify
if the write position was in the extent, but the current code just used
the start postion to do the check, so we got the wrong extent and told
the caller that it was a nocow write. And then when we write back the
dirty pages, we found we should cow the extent, but at that time, there
was no space in the fs, we had to the error flag for the page. When
someone reclaimed that page, the above warning outputed. Fix it.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:13 -08:00
Filipe David Borba Manana
eb653de159 Btrfs: reduce btree node locking duration on item update
If we do a btree search with the goal of updating an existing item
without changing its size (ins_len == 0 and cow == 1), then we never
need to hold locks on upper level nodes (even when slot == 0) after we
COW their child nodes/leaves, as we won't have node splits or merges
in this scenario (that is, no key additions, removals or shifts on any
nodes or leaves).

Therefore release the locks immediately after COWing the child nodes/leaves
while navigating the btree, even if their parent slot is 0, instead of
returning a path to the caller with those nodes locked, which would get
released only when the caller releases or frees the path (or if it calls
btrfs_unlock_up_safe).

This is a common scenario, for example when updating inode items in fs
trees and block group items in the extent tree.

The following benchmarks were performed on a quad core machine with 32Gb
of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more
quickly).

  sysbench --test=fileio --file-num=131072 --file-total-size=8G \
    --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
    --max-requests=100000 --file-io-mode=sync [prepare|run]

Before this change:  49.85Mb/s (average of 5 runs)
After this change:   50.38Mb/s (average of 5 runs)

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:11 -08:00
Miao Xie
67de11769b Btrfs: introduce the delayed inode ref deletion for the single link inode
The inode reference item is close to inode item, so we insert it simultaneously
with the inode item insertion when we create a file/directory.. In fact, we also
can handle the inode reference deletion by the same way. So we made this patch to
introduce the delayed inode reference deletion for the single link inode(At most
case, the file doesn't has hard link, so we don't take the hard link into account).

This function is based on the delayed inode mechanism. After applying this patch,
we can reduce the time of the file/directory deletion by ~10%.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:09 -08:00
Frank Holton
efe120a067 Btrfs: convert printk to btrfs_ and fix BTRFS prefix
Convert all applicable cases of printk and pr_* to the btrfs_* macros.

Fix all uses of the BTRFS prefix.

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:05 -08:00
Wang Shilong
180589efde Btrfs: fix a warning when iput a file
See the warning below:

[ 1209.102076]  [<ffffffffa04721b9>] remove_extent_mapping+0x69/0x70 [btrfs]
[ 1209.102084]  [<ffffffffa0466b06>] btrfs_evict_inode+0x96/0x4d0 [btrfs]
[ 1209.102089]  [<ffffffff81073010>] ? wake_atomic_t_function+0x40/0x40
[ 1209.102092]  [<ffffffff8118ab2e>] evict+0x9e/0x190
[ 1209.102094]  [<ffffffff8118b313>] iput+0xf3/0x180
[ 1209.102101]  [<ffffffffa0461fd1>] btrfs_run_delayed_iputs+0xb1/0xd0 [btrfs]
[ 1209.102107]  [<ffffffffa045d358>] __btrfs_end_transaction+0x268/0x350 [btrfs]

clear extent bit here to avoid triggering WARN_ON() in remove_extent_mapping()

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:02 -08:00
Wang Shilong
663df05330 Btrfs: remove dead comments for read_csums()
Chris introduced hleper function  read_csums() and this function
has been removed, but we forgot to remove its corresponding comments.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:59 -08:00
Tsutomu Itoh
5662344b3c Btrfs: fix error check of btrfs_lookup_dentry()
Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT)
instead. This keeps the return value convention consistent.

Callers who use btrfs_lookup_dentry() require a trivial update.

create_snapshot() in particular looks like it can also lose a BUG_ON(!inode)
which is not really needed - there seems less harm in returning ENOENT to
userspace at that point in the stack than there is to crash the machine.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:56 -08:00
Filipe David Borba Manana
131e404a2a Btrfs: fix very slow inode eviction and fs unmount
The inode eviction can be very slow, because during eviction we
tell the VFS to truncate all of the inode's pages. This results
in calls to btrfs_invalidatepage() which in turn does calls to
lock_extent_bits() and clear_extent_bit(). These calls result in
too many merges and splits of extent_state structures, which
consume a lot of time and cpu when the inode has many pages. In
some scenarios I have experienced umount times higher than 15
minutes, even when there's no pending IO (after a btrfs fs sync).

A quick way to reproduce this issue:

$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
    --file-test-mode=seqwr --num-threads=128 \
    --file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'

real	0m25.457s
user	0m0.000s
sys	0m0.092s
$ cd ..
$ time umount /mnt/btrfs

real	1m38.234s
user	0m0.000s
sys	1m25.760s

The same test on ext4 runs much faster:

$ mkfs.ext4 /dev/sdb3
$ mount /dev/sdb3 /mnt/ext4
$ cd /mnt/ext4
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
    --file-test-mode=seqwr --num-threads=128 \
    --file-block-size=16384 --max-time=60 --max-requests=0 run
$ sync
$ cd ..
$ time umount /mnt/ext4

real	0m3.626s
user	0m0.004s
sys	0m3.012s

After this patch, the unmount (inode evictions) is much faster:

$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
    --file-test-mode=seqwr --num-threads=128 \
    --file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'

real	0m26.774s
user	0m0.000s
sys	0m0.084s
$ cd ..
$ time umount /mnt/btrfs

real	0m1.811s
user	0m0.000s
sys	0m1.564s

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:44 -08:00
Kelley Nielsen
75ac2dd907 btrfs: expand btrfs_find_item() to include find_root_ref functionality
This patch is the second step in bootstrapping the btrfs_find_item
interface. The btrfs_find_root_ref() is similar to the former
__inode_info(); it accepts four of its parameters, and duplicates the
first half of its functionality.

Replace the one former call to btrfs_find_root_ref() with a call to
btrfs_find_item(), along with the defined key type that was used
internally by btrfs_find_root ref, and a null found key. In
btrfs_find_item(), add a test for the null key at the place where
the functionality of btrfs_find_root_ref() ends; btrfs_find_item()
then returns if the test passes. Finally, remove btrfs_find_root_ref().

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:36 -08:00
Valentina Giusti
99e22f783b btrfs: remove unused variable from btrfs_new_inode
Variable owner in btrfs_new_inode is unused since commit
d82a6f1d7e
(Btrfs: kill BTRFS_I(inode)->block_group)

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:31 -08:00
Josef Bacik
16e7549f04 Btrfs: incompatible format change to remove hole extents
Btrfs has always had these filler extent data items for holes in inodes.  This
has made somethings very easy, like logging hole punches and sending hole
punches.  However for large holey files these extent data items are pure
overhead.  So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files.  This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any.  I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:21 -08:00
Linus Torvalds
bf3d846b78 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted stuff; the biggest pile here is Christoph's ACL series.  Plus
  assorted cleanups and fixes all over the place...

  There will be another pile later this week"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
  __dentry_path() fixes
  vfs: Remove second variable named error in __dentry_path
  vfs: Is mounted should be testing mnt_ns for NULL or error.
  Fix race when checking i_size on direct i/o read
  hfsplus: remove can_set_xattr
  nfsd: use get_acl and ->set_acl
  fs: remove generic_acl
  nfs: use generic posix ACL infrastructure for v3 Posix ACLs
  gfs2: use generic posix ACL infrastructure
  jfs: use generic posix ACL infrastructure
  xfs: use generic posix ACL infrastructure
  reiserfs: use generic posix ACL infrastructure
  ocfs2: use generic posix ACL infrastructure
  jffs2: use generic posix ACL infrastructure
  hfsplus: use generic posix ACL infrastructure
  f2fs: use generic posix ACL infrastructure
  ext2/3/4: use generic posix ACL infrastructure
  btrfs: use generic posix ACL infrastructure
  fs: make posix_acl_create more useful
  fs: make posix_acl_chmod more useful
  ...
2014-01-28 08:38:04 -08:00
Christoph Hellwig
996a710d46 btrfs: use generic posix ACL infrastructure
Also don't bother to set up a .get_acl method for symlinks as we do not
support access control (ACLs or even mode bits) for symlinks in Linux.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:18 -05:00
Christoph Hellwig
dff6efc326 fs: fix iversion handling
Currently notify_change directly updates i_version for size updates,
which not only is counter to how all other fields are updated through
struct iattr, but also breaks XFS, which need inode updates to happen
under its own lock, and synchronized to the structure that gets written
to the log.

Remove the update in the common code, and it to btrfs and ext4,
XFS already does a proper updaste internally and currently gets a
double update with the existing code.

IMHO this is 3.13 and -stable material and should go in through the XFS
tree.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-05 16:36:21 -06:00
Kent Overstreet
4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Kent Overstreet
2c30c71bd6 block: Convert various code to bio_for_each_segment()
With immutable biovecs we don't want code accessing bi_io_vec directly -
the uses this patch changes weren't incorrect since they all own the
bio, but it makes the code harder to audit for no good reason - also,
this will help with multipage bvecs later.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-23 22:33:46 -08:00
Steven Rostedt
4cd8587ce8 btrfs: Use trace condition for get_extent tracepoint
Doing an if statement to test some condition to know if we should
trigger a tracepoint is pointless when tracing is disabled. This just
adds overhead and wastes a branch prediction. This is why the
TRACE_EVENT_CONDITION() was created. It places the check inside the jump
label so that the branch does not happen unless tracing is enabled.

That is, instead of doing:

	if (em)
		trace_btrfs_get_extent(root, em);

Which is basically this:

	if (em)
		if (static_key(trace_btrfs_get_extent)) {

Using a TRACE_EVENT_CONDITION() we can just do:

	trace_btrfs_get_extent(root, em);

And the condition trace event will do:

	if (static_key(trace_btrfs_get_extent)) {
		if (em) {
			...

The static key is a non conditional jump (or nop) that is faster than
having to check if em is NULL or not.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:47 -05:00
Josef Bacik
4724b106b9 Btrfs: don't BUG_ON() if we get an error walking backrefs
We can just return false for this so we stop doing the snapshot aware defrag
stuff.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:16 -05:00
Miao Xie
91aef86f3b Btrfs: rename btrfs_start_all_delalloc_inodes
rename the function -- btrfs_start_all_delalloc_inodes(), and make its
name be compatible to btrfs_wait_ordered_roots(), since they are always
used at the same place.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:58 -05:00
Dulshani Gunawardhana
678712545b btrfs: Fix checkpatch.pl warning of spacing issues
Fix spacing issues detected via checkpatch.pl in accordance with the
kernel style guidelines.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:31 -05:00
Dulshani Gunawardhana
fae7f21cec btrfs: Use WARN_ON()'s return value in place of WARN_ON(1)
Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:53 -05:00
Liu Bo
6f519564d7 Btrfs: do not run snapshot-aware defragment on error
If something wrong happens in write endio, running snapshot-aware defragment
can end up with undefined results, maybe a crash, so we should avoid it.

In order to share similar code, this also adds a helper to free the struct for
snapshot-aware defrag.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:00 -05:00
Josef Bacik
9f23e289ed Btrfs: make sure the delalloc workers actually flush compressed writes
When using delalloc workers in a non-waiting way (like for enospc handling) we
can end up not actually waiting for the dirty pages to be started if we have
compression.  We need to add an extra filemap flush to make sure any async
extents that have started are actually moved along before returning.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:08:22 -05:00
Josef Bacik
d788a34929 Btrfs: don't abort transaction in run_delalloc_nocow
This is just the write path, the only reason we start a transaction is so we can
check cross references, we don't make any actual changes, so there is no reason
to abort the transaction if we fail.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:58 -05:00
Josef Bacik
02ecd2c278 Btrfs: do not bug_on if we try to cow a free space cache inode
We can just return an error and we'll bail out properly.  We still want to catch
this case to make sure we don't have a bug somewhere, so just warn if this pops
up.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:49 -05:00
Josef Bacik
0ef8b72607 Btrfs: return an error from btrfs_wait_ordered_range
I noticed that if the free space cache has an error writing out it's data it
won't actually error out, it will just carry on.  This is because it doesn't
check the return value of btrfs_wait_ordered_range, which didn't actually return
anything.  So fix this in order to keep us from making free space cache look
valid when it really isnt.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:35 -05:00
Zach Brown
8b558c5f09 btrfs: remove fs/btrfs/compat.h
fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
and inc_nlink().  This doesn't belong in mainline.

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:19 -05:00
Josef Bacik
25a50341b6 Btrfs: handle a missing extent for the first file extent
While trying to kill our hole extents I noticed I was seeing problems where we
seek into a file and then start writing and then try to fiemap that file later.
This is because we search for offset 0, don't find anything and so back up one
slot, which puts us at the inode ref or something like that, which means we goto
not_found and create an extent map for our entire search area.  This isn't quite
what we want, we want to move forward one slot and see if there is an extent
there so we can limit our hole extent.  This patch fixes this problem, I will
add a testcase for this as well.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:05 -05:00
Josef Bacik
aaedb55bc0 Btrfs: add tests for btrfs_get_extent
I'm going to be removing hole extents in the near future so I wanted to make a
sanity test for btrfs_get_extent to make sure I don't break anything in the
meantime.  This patch just puts btrfs_get_extent through its paces by giving it
a completely unreasonable mapping to look at and make sure it is giving us back
maps that make sense.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:57:30 -05:00
Josef Bacik
857cc2fc29 Btrfs: free reserved space on error in a few places
While trying to track down a reserved space leak I noticed a few places where we
won't properly clean up reserved space if we have an error, this patch fixes
those up.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:41 -05:00
Filipe David Borba Manana
778ba82b17 Btrfs: improve inode hash function/inode lookup
Currently the hash value used for adding an inode to the VFS's inode
hash table consists of the plain inode number, which is a 64 bits
integer. This results in hash table buckets (hlist_head lists) with
too many elements for at least 2 important scenarios:

1) When we have many subvolumes. Each subvolume has its own btree
   where its files and directories are added to, and each has its
   own objectid (inode number) namespace. This means that if we have
   N subvolumes, and all have inode number X associated to a file or
   directory, the corresponding inodes all map to the same hash table
   entry, resulting in a bucket (hlist_head list) with N elements;

2) On 32 bits machines. Th VFS hash values are unsigned longs, which
   are 32 bits wide on 32 bits machines, and the inode (objectid)
   numbers are 64 bits unsigned integers. We simply cast the inode
   numbers to hash values, which means that for all inodes with the
   same 32 bits lower half, the same hash bucket is used for all of
   them. For example, all inodes with a number (objectid) between
   0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in
   the same hash table bucket.

This change ensures the inode's hash value depends both on the
objectid (inode number) and its subvolume's (btree root) objectid.
For 32 bits machines, this change gives better entropy by making
the hash value depend on both the upper and lower 32 bits of the
64 bits hash previously computed.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:19 -05:00
Miao Xie
fa7c14947a Btrfs: improve jitter performance of the sequential buffered write
The performance was slowed down sometimes when we ran sysbench to measure
the performance of the sequential buffered write by 2 or more threads.

It was because the write order of the test threads might be confused
by the task scheduler, and the coming write would be beyond the end of
the file, in this case, we need insert dummy file extents and create
a hole for the area we skip. But in order to avoid the ongoing ordered
extents which are in the area, we need wait for them. Unfortunately,
the current code doesn't check if there are ordered extents in the area
or not, try to find and flush the dirty pages directly, but in fact,
there is no dirty page in that area, this step of the current code is
unnecessary, and just wastes time. Sometimes, it would increase
the contention of some locks, and makes the performance slow down suddenly.

So we remove the ordered extent flush function before the check, and flush
the dirty pages and wait for the ordered extents only when we find them.

According to my test, we got 1-2 times of the performance regression when
we ran the test by 10 times before applying this patch. After applying
this patch, the regression went away.

Test Environment:
 CPU:		1CPU * 4Cores
 Memory:	6GB
 Partition:	20GB

Test Command:
 # sysbench --test=fileio --file-total-size=16G --file-test-mode=seqwr \
 > --num-threads=512 --file-block-size=16384 --max-time=60 --max-requests=0 run

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:54:38 -05:00
Josef Bacik
b6d08f0630 Btrfs: do not release metadata for space cache inodes
I've been testing our error paths and I was tripping the BUG_ON() in
drop_outstanding_extent because our outstanding_extents is 0 for space cache
inodes.  This is because we don't reserve metadata space for these inodes since
we depend on the global block reserve for our space.  To fix this we need to
make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for
space cache inodes.  With this patch I'm no longer panicing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:53:36 -05:00
Filipe David Borba Manana
703c88e035 Btrfs: fix tracking of orphan inode count
In inode.c:btrfs_orphan_add() if we failed to insert the orphan
item, we would return without decrementing the orphan count that
we just incremented before attempting the insertion, leaving the
orphan inode count wrong.

In inode.c:btrfs_orphan_del(), we were decrementing the inode
orphan count if the bit BTRFS_INODE_ORPHAN_META_RESERVED was set,
which is logically wrong because it should be decremented if the
bit BTRFS_INODE_HAS_ORPHAN_ITEM was set - after all we increment
the count when we set the bit BTRFS_INODE_HAS_ORPHAN_ITEM elsewhere.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:53:01 -05:00
Ross Kirk
dd3cc16b87 btrfs: drop unused parameter from btrfs_item_nr
Remove unused eb parameter from btrfs_item_nr

Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:50:48 -05:00
Filipe David Borba Manana
f06becc411 Btrfs: don't store NULL byte in symlink extents
It is not necessary to store the NULL byte in a symlink inline file
extent. There's currently no code that requires the NULL byte to be
present in the extent. This change also doesn't break file format
compatibility nor the send/receive feature.

The VFS also doesn't need the NULL byte to be present in the extent,
as it reads up to inode->i_size bytes (which already excluded the NULL
byte) and sets the NULL byte for us (in fs/namei.c:page_getlink()).

So with this change we save 1 byte per symlink file extent (which is
always inlined in the btree leaf) without losing backward and forward
compatibility.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:49:51 -05:00
Stefan Behrens
69e9c6c6dc Btrfs: eliminate the exceptional root_tree refs=0
The fact that btrfs_root_refs() returned 0 for the tree_root caused
bugs in the past, therefore it is set to 1 with this patch and
(hopefully) all affected code is adapted to this change.

I verified this change by temporarily adding WARN_ON() checks
everywhere where btrfs_root_refs() is used, checking whether the
logic of the code is changed by btrfs_root_refs() returning 1
instead of 0 for root->root_key.objectid == BTRFS_ROOT_TREE_OBJECTID.
With these added checks, I ran the xfstests './check -g auto'.

The two roots chunk_root and log_root_tree that are only referenced
by the superblock and the log_roots below the log_root_tree still
have btrfs_root_refs() == 0, only the tree_root is changed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:49:26 -05:00
Linus Torvalds
bdeeab62a6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "Sage hit a deadlock with ceph on btrfs, and Josef tracked it down to a
  regression in our initial rc1 pull.  When doing nocow writes we were
  sometimes starting a transaction with locks held"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: release path before starting transaction in can_nocow_extent
2013-10-18 16:46:21 -07:00
Josef Bacik
1bda19eb73 Btrfs: release path before starting transaction in can_nocow_extent
We can't be holding tree locks while we try to start a transaction, we will
deadlock.  Thanks,

Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-10-18 12:43:40 -04:00
Linus Torvalds
d64dab903f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've got more bug fixes in my for-linus branch:

  One of these fixes another corner of the compression oops from last
  time.  Miao nailed down some problems with concurrent snapshot
  deletion and drive balancing.

  I kept out one of his patches for more testing, but these are all
  stable"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix oops caused by the space balance and dead roots
  Btrfs: insert orphan roots into fs radix tree
  Btrfs: limit delalloc pages outside of find_delalloc_range
  Btrfs: use right root when checking for hash collision
2013-10-12 12:54:24 -07:00
Josef Bacik
4871c1588f Btrfs: use right root when checking for hash collision
btrfs_rename was using the root of the old dir instead of the root of the new
dir when checking for a hash collision, so if you tried to move a file into a
subvol it would freak out because it would see the file you are trying to move
in its current root.  This fixes the bug where this would fail

btrfs subvol create test1
btrfs subvol create test2
mv test1 test2.

Thanks to Chris Murphy for catching this,

Cc: stable@vger.kernel.org
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-10-10 21:27:45 -04:00
Linus Torvalds
0fbf2cc983 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are mostly bug fixes and a two small performance fixes.  The
  most important of the bunch are Josef's fix for a snapshotting
  regression and Mark's update to fix compile problems on arm"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
  Btrfs: create the uuid tree on remount rw
  btrfs: change extent-same to copy entire argument struct
  Btrfs: dir_inode_operations should use btrfs_update_time also
  btrfs: Add btrfs: prefix to kernel log output
  btrfs: refuse to remount read-write after abort
  Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0
  Btrfs: don't leak transaction in btrfs_sync_file()
  Btrfs: add the missing mutex unlock in write_all_supers()
  Btrfs: iput inode on allocation failure
  Btrfs: remove space_info->reservation_progress
  Btrfs: kill delay_iput arg to the wait_ordered functions
  Btrfs: fix worst case calculator for space usage
  Revert "Btrfs: rework the overcommit logic to be based on the total size"
  Btrfs: improve replacing nocow extents
  Btrfs: drop dir i_size when adding new names on replay
  Btrfs: replay dir_index items before other items
  Btrfs: check roots last log commit when checking if an inode has been logged
  Btrfs: actually log directory we are fsync()'ing
  Btrfs: actually limit the size of delalloc range
  Btrfs: allocate the free space by the existed max extent size when ENOSPC
  ...
2013-09-22 14:58:49 -07:00
Guangyu Sun
93fd63c2f0 Btrfs: dir_inode_operations should use btrfs_update_time also
Commit 2bc5565286 (Btrfs: don't update atime on
RO subvolumes) ensures that the access time of an inode is not updated when
the inode lives in a read-only subvolume.
However, if a directory on a read-only subvolume is accessed, the atime is
updated. This results in a write operation to a read-only subvolume. I
believe that access times should never be updated on read-only subvolumes.

To reproduce:

 # mkfs.btrfs -f /dev/dm-3
 (...)
 # mount /dev/dm-3 /mnt
 # btrfs subvol create /mnt/sub
 	Create subvolume '/mnt/sub'
 # mkdir /mnt/sub/dir
 # echo "abc" > /mnt/sub/dir/file
 # btrfs subvol snapshot -r /mnt/sub /mnt/rosnap
 	Create a readonly snapshot of '/mnt/sub' in '/mnt/rosnap'
 # stat /mnt/rosnap/dir
 	File: `/mnt/rosnap/dir'
 	Size: 8         Blocks: 0          IO Block: 4096   directory
 Device: 16h/22d    Inode: 257         Links: 1
 Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
 	Access: 2013-09-11 07:21:49.389157126 -0400
 	Modify: 2013-09-11 07:22:02.330156079 -0400
 	Change: 2013-09-11 07:22:02.330156079 -0400
 # ls /mnt/rosnap/dir
 	file
 # stat /mnt/rosnap/dir
 	File: `/mnt/rosnap/dir'
 	Size: 8         Blocks: 0          IO Block: 4096   directory
 Device: 16h/22d    Inode: 257         Links: 1
 Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
 	Access: 2013-09-11 07:22:56.797151670 -0400
 	Modify: 2013-09-11 07:22:02.330156079 -0400
 	Change: 2013-09-11 07:22:02.330156079 -0400

Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:30 -04:00
Josef Bacik
f4ab9ea706 Btrfs: iput inode on allocation failure
We don't do the iput when we fail to allocate our delayed delalloc work in
__start_delalloc_inodes, fix this.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:28 -04:00
Filipe David Borba Manana
cef2193729 Btrfs: more efficient inode tree replace operation
Instead of removing the current inode from the red black tree
and then add the new one, just use the red black tree replace
operation, which is more efficient.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:55 -04:00
Linus Torvalds
ac4de9543a Merge branch 'akpm' (patches from Andrew Morton)
Merge more patches from Andrew Morton:
 "The rest of MM.  Plus one misc cleanup"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  mm/Kconfig: add MMU dependency for MIGRATION.
  kernel: replace strict_strto*() with kstrto*()
  mm, thp: count thp_fault_fallback anytime thp fault fails
  thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
  thp: do_huge_pmd_anonymous_page() cleanup
  thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  mm: cleanup add_to_page_cache_locked()
  thp: account anon transparent huge pages into NR_ANON_PAGES
  truncate: drop 'oldsize' truncate_pagecache() parameter
  mm: make lru_add_drain_all() selective
  memcg: document cgroup dirty/writeback memory statistics
  memcg: add per cgroup writeback pages accounting
  memcg: check for proper lock held in mem_cgroup_update_page_stat
  memcg: remove MEMCG_NR_FILE_MAPPED
  memcg: reduce function dereference
  memcg: avoid overflow caused by PAGE_ALIGN
  memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
  memcg: correct RESOURCE_MAX to ULLONG_MAX
  mm: memcg: do not trap chargers with full callstack on OOM
  mm: memcg: rework and document OOM waiting and wakeup
  ...
2013-09-12 15:44:27 -07:00
Kirill A. Shutemov
7caef26767 truncate: drop 'oldsize' truncate_pagecache() parameter
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression").  Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Linus Torvalds
b7c09ad401 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This is against 3.11-rc7, but was pulled and tested against your tree
  as of yesterday.  We do have two small incrementals queued up, but I
  wanted to get this bunch out the door before I hop on an airplane.

  This is a fairly large batch of fixes, performance improvements, and
  cleanups from the usual Btrfs suspects.

  We've included Stefan Behren's work to index subvolume UUIDs, which is
  targeted at speeding up send/receive with many subvolumes or snapshots
  in place.  It closes a long standing performance issue that was built
  in to the disk format.

  Mark Fasheh's offline dedup work is also here.  In this case offline
  means the FS is mounted and active, but the dedup work is not done
  inline during file IO.  This is a building block where utilities are
  able to ask the FS to dedup a series of extents.  The kernel takes
  care of verifying the data involved really is the same.  Today this
  involves reading both extents, but we'll continue to evolve the
  patches"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
  Btrfs: optimize key searches in btrfs_search_slot
  Btrfs: don't use an async starter for most of our workers
  Btrfs: only update disk_i_size as we remove extents
  Btrfs: fix deadlock in uuid scan kthread
  Btrfs: stop refusing the relocation of chunk 0
  Btrfs: fix memory leak of uuid_root in free_fs_info
  btrfs: reuse kbasename helper
  btrfs: return btrfs error code for dev excl ops err
  Btrfs: allow partial ordered extent completion
  Btrfs: convert all bug_ons in free-space-cache.c
  Btrfs: add support for asserts
  Btrfs: adjust the fs_devices->missing count on unmount
  Btrf: cleanup: don't check for root_refs == 0 twice
  Btrfs: fix for patch "cleanup: don't check the same thing twice"
  Btrfs: get rid of one BUG() in write_all_supers()
  Btrfs: allocate prelim_ref with a slab allocater
  Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC
  Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
  Btrfs: fix race between removing a dev and writing sbs
  Btrfs: remove ourselves from the cluster list under lock
  ...
2013-09-12 09:58:51 -07:00
Linus Torvalds
27703bb4a6 PTR_RET() is a weird name, and led to some confusing usage. We ended
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
 
 This has been sitting in linux-next for a whole cycle.
 
 Thanks,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSJo+1AAoJENkgDmzRrbjxIC4QALJK95o8AUXuwUkl+2fmFkUt
 hh2/PJ1vDYgk4Xt0J6hyoK7XMa0H1RkbBrROuDdsBnorMFpEsGcgdkUZte9ufoAS
 97Bg+7N0KPbTB/S8vOwtW1vbERTJIVPN2uf6h1Wqm9Xc2puCh3HbMMr1AWMGu0WQ
 NqY5+Zz8zecy1UOrMhEP6H1CjeQcL1w1DO6YM5ydeqlKNzAz+JMfDXriLPDwiE7+
 XFPDF/O3Vtd2ckA7L70Lio7hfHwxV5U4WwFVfiwls98XB4jcZqDKIoh1r8z4SRgR
 +0Rae2DN3BaOabGMr//5XdrzQVpwJTh5m2w8BAOHJvCJ9HR7Sq29UIN4u+TowZBy
 L2xYo4dvFxkympwu5zEd3c7vHYWKIaqmSq5PIjr4gF/uIo2OeOTrpPIK782ZEYb7
 e+qUgOEM05V9AmQZCrSZeP9u474Sj8ow3sCtWxfdRtwNfoEIcUXsNNJd/zDHlVtW
 cEtXqc2xXIpcuUJQWlSaGp8fmRQjVZPzrLKYLM2m39ZcOOJbf5rzQAYS7hHPosIa
 SK+YVux/+Zzi+Xo/vXq1OlM/SruCr5S7JOgCxLowoQ88vupgXME6uPyC8EO+QQ50
 GsrHes5ZNLbk0uVsfcexIyojkUnyvDmmnDpv+1zdC6RgZLJQn8OXp5yNhHhnhrFT
 BiHX6YFWtDDqRlVv8Q0F
 =LeaW
 -----END PGP SIGNATURE-----

Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull PTR_RET() removal patches from Rusty Russell:
 "PTR_RET() is a weird name, and led to some confusing usage.  We ended
  up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.

  This has been sitting in linux-next for a whole cycle"

[ There are still some PTR_RET users scattered about, with some of them
  possibly being new, but most of them existing in Rusty's tree too.  We
  have that

      #define PTR_RET(p) PTR_ERR_OR_ZERO(p)

  thing in <linux/err.h>, so they continue to work for now  - Linus ]

* tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
  Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
  drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
  sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
  dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
  drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
  mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
  staging/zcache: don't use PTR_RET().
  remoteproc: don't use PTR_RET().
  pinctrl: don't use PTR_RET().
  acpi: Replace weird use of PTR_RET.
  s390: Replace weird use of PTR_RET.
  PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
  PTR_RET is now PTR_ERR_OR_ZERO
2013-09-04 17:31:11 -07:00
Josef Bacik
7f4f6e0a3f Btrfs: only update disk_i_size as we remove extents
This fixes a problem where if we fail a truncate we will leave the i_size set
where we wanted to truncate to instead of where we were able to truncate to.
Fix this by making btrfs_truncate_inode_items do the disk_i_size update as it
removes extents, that way it will always be consistent with where its extents
are.  Then if the truncate fails at all we can update the in-ram i_size with
what we have on disk and delete the orphan item.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:40 -04:00
Josef Bacik
77cef2ec54 Btrfs: allow partial ordered extent completion
We currently have this problem where you can truncate pages that have not yet
been written for an ordered extent.  We do this because the truncate will be
coming behind to clean us up anyway so what's the harm right?  Well if truncate
fails for whatever reason we leave an orphan item around for the file to be
cleaned up later.  But if the user goes and truncates up the file and tries to
read from the area that had been discarded previously they will get a csum error
because we never actually wrote that data out.

This patch fixes this by allowing us to either discard the ordered extent
completely, by which I mean we just free up the space we had allocated and not
add the file extent, or adjust the length of the file extent we write.  We do
this by setting the length we truncated down to in the ordered extent, and then
we set the file extent length and ram bytes to this length.  The total disk
space stays unchanged since we may be compressed and we can't just chop off the
disk space, but at least this way the file extent only points to the valid data.
Then when the file extent is free'd the extent and csums will be freed normally.

This patch is needed for the next series which will give us more graceful
recovery of failed truncates.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:34 -04:00
Josef Bacik
e8e7cff667 Btrfs: do not clear our orphan item runtime flag on eexist
We were unconditionally clearing our runtime flag on the inode on error when
trying to insert an orphan item.  This is wrong in the case of -EEXIST since we
obviously have an orphan item.  This was causing us to not do the correct
cleanup of our orphan items which caused issues on cleanup.  This happens
because currently when truncate fails we just leave the orphan item on there so
it can be cleaned up, so if we go to remove the file later we will hit this
issue.  What we do for truncate isn't right either, but we shouldn't screw this
sort of thing up on error either, so fix this and then I'll fix truncate in a
different patch.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:22 -04:00
Geert Uytterhoeven
c1c9ff7c94 Btrfs: Remove superfluous casts from u64 to unsigned long long
u64 is "unsigned long long" on all architectures now, so there's no need to
cast it when formatting it using the "ll" length modifier.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:08 -04:00
Josef Bacik
00361589d2 Btrfs: avoid starting a transaction in the write path
I noticed while looking at a deadlock that we are always starting a transaction
in cow_file_range().  This isn't really needed since we only need a transaction
if we are doing an inline extent, or if the allocator needs to allocate a chunk.
So push down all the transaction start stuff to be closer to where we actually
need a transaction in all of these cases.  This will hopefully reduce our write
latency when we are committing often.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:05:05 -04:00
Josef Bacik
4ef31a45a0 Btrfs: fix the error handling wrt orphan items
There are several places where we BUG_ON() if we fail to remove the orphan items
and such, which is not ok, so remove those and either abort or just carry on.
This also fixes a problem where if we couldn't start a transaction we wouldn't
actually remove the orphan item reserve for the inode.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:05:03 -04:00
Liu Bo
116e0024c4 Btrfs: allow compressed extents to be merged during defragment
The rule originally comes from nocow writing, but snapshot-aware
defrag is a different case, the extent has been writen and we're
not going to change the extent but add a reference on the data.

So we're able to allow such compressed extents to be merged into
one bigger extent if they're pointing to the same data.

Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:52 -04:00
Josef Bacik
151a41bc46 Btrfs: fix what bits we clear when erroring out from delalloc
First of all we no longer set EXTENT_DIRTY when we dirty an extent so this patch
removes the clearing of EXTENT_DIRTY since it confuses me.  This patch also adds
clearing EXTENT_DEFRAG and also doing EXTENT_DO_ACCOUNTING when we have errors.
This is because if we are clearing delalloc without adding an ordered extent
then we need to make sure the enospc handling stuff is accounted for.  Also if
this range was DEFRAG we need to make sure that bit is cleared so we dont leak
it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:39 -04:00
Josef Bacik
c2790a2e2b Btrfs: cleanup arguments to extent_clear_unlock_delalloc
This patch removes the io_tree argument for extent_clear_unlock_delalloc since
we always use &BTRFS_I(inode)->io_tree, and it separates out the extent tree
operations from the page operations.  This way we just pass in the extent bits
we want to clear and then pass in the operations we want done to the pages.
This is because I'm going to fix what extent bits we clear in some cases and
rather than add a bunch of new flags we'll just use the actual extent bits we
want to clear.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:38 -04:00
Miao Xie
facc8a2247 Btrfs: don't cache the csum value into the extent state tree
Before applying this patch, we cached the csum value into the extent state
tree when reading some data from the disk, this operation increased the lock
contention of the state tree.

Now, we just store the csum value into the bio structure or other unshared
structure, so we can reduce the lock contention.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:33 -04:00
Josef Bacik
50f1319cb5 Btrfs: reset ret in record_one_backref
I was getting warnings when running find ./ -type f -exec btrfs fi defrag -f {}
\; from record_one_backref because ret was set.  Turns out it was because it was
set to 1 because the search slot didn't come out exact and we never reset it.
So reset it to 0 right after the search so we don't leak this and get
uneccessary warnings.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:23 -04:00
Zach Brown
db62efbbf8 btrfs: don't loop on large offsets in readdir
When btrfs readdir() hits the last entry it sets the readdir offset to a
huge value to stop buggy apps from breaking when the same name is
returned by readdir() with concurrent rename()s.

But unconditionally setting the offset to INT_MAX causes readdir() to
loop returning any entries with offsets past INT_MAX.  It only takes a
few hours of constant file creation and removal to create entries past
INT_MAX.

So let's set the huge offset to LLONG_MAX if the last entry has already
overflowed 32bit loff_t.   Without large offsets behaviour is identical.
With large offsets 64bit apps will work and 32bit apps will be no more
broken than they currently are if they see large offsets.

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:34:56 -04:00
Liu Bo
e68afa49ae Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents
For partial extents, snapshot-aware defrag does not work as expected,
since
a) we use the wrong logical offset to search for parents, which should be
   disk_bytenr + extent_offset, not just disk_bytenr,
b) 'offset' returned by the backref walking just refers to key.offset, not
   the 'offset' stored in btrfs_extent_data_ref which is
   (key.offset - extent_offset).

The reproducer:
$ mkfs.btrfs sda
$ mount sda /mnt
$ btrfs sub create /mnt/sub
$ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
$ btrfs sub snap /mnt/sub /mnt/snap1
$ btrfs sub snap /mnt/sub /mnt/snap2
$ sync; btrfs filesystem defrag /mnt/sub/foo;
$ umount /mnt
$ btrfs-debug-tree sda (Here we can check whether the defrag operation is snapshot-awared.

This addresses the above two problems.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:29:17 -04:00
Jie Liu
7cddc19392 btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified
Create a small file and fallocate it to a big size with
FALLOC_FL_KEEP_SIZE option, then truncate it back to the
small size again, the disk free space is not changed back
in this case. i.e,

total 4
-rw-r--r-- 1 root root 512 Jun 28 11:35 test

Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G   56K  7.2G   1% /mnt

-rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test

Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G  5.1G  2.2G  70% /mnt

Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G  5.1G  2.2G  70% /mnt

With this fix, the truncated up space is back as:
Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G   56K  7.2G   1% /mnt

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:28:56 -04:00
Rusty Russell
8c6ffba0ed PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
Sweep of the simple cases.

Cc: netdev@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-15 11:25:01 +09:30
Linus Torvalds
e3a0dd98e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "These are the usual mixture of bugs, cleanups and performance fixes.
  Miao has some really nice tuning of our crc code as well as our
  transaction commits.

  Josef is peeling off more and more problems related to early enospc,
  and has a number of important bug fixes in here too"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (81 commits)
  Btrfs: wait ordered range before doing direct io
  Btrfs: only do the tree_mod_log_free_eb if this is our last ref
  Btrfs: hold the tree mod lock in __tree_mod_log_rewind
  Btrfs: make backref walking code handle skinny metadata
  Btrfs: fix crash regarding to ulist_add_merge
  Btrfs: fix several potential problems in copy_nocow_pages_for_inode
  Btrfs: cleanup the code of copy_nocow_pages_for_inode()
  Btrfs: fix oops when recovering the file data by scrub function
  Btrfs: make the chunk allocator completely tree lockless
  Btrfs: cleanup orphaned root orphan item
  Btrfs: fix wrong mirror number tuning
  Btrfs: cleanup redundant code in btrfs_submit_direct()
  Btrfs: remove btrfs_sector_sum structure
  Btrfs: check if we can nocow if we don't have data space
  Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
  Btrfs: use a percpu to keep track of possibly pinned bytes
  Btrfs: check for actual acls rather than just xattrs when caching no acl
  Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
  Btrfs: optimize reada_for_balance
  Btrfs: optimize read_block_for_search
  ...
2013-07-09 12:33:09 -07:00
Linus Torvalds
9e239bb939 Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
 block size is smaller than the page size (i.e., file systems 1k blocks
 on x86, or more interestingly file systems with 4k blocks on Power or
 ia64 systems.)
 
 In the cleanup category, the ext4's punch hole implementation was
 significantly improved by Lukas Czerner, and now supports bigalloc
 file systems.  In addition, Jan Kara significantly cleaned up the
 write submission code path.  We also improved error checking and added
 a few sanity checks.
 
 In the optimizations category, two major optimizations deserve
 mention.  The first is that ext4_writepages() is now used for
 nodelalloc and ext3 compatibility mode.  This allows writes to be
 submitted much more efficiently as a single bio request, instead of
 being sent as individual 4k writes into the block layer (which then
 relied on the elevator code to coalesce the requests in the block
 queue).  Secondly, the extent cache shrink mechanism, which was
 introduce in 3.9, no longer has a scalability bottleneck caused by the
 i_es_lru spinlock.  Other optimizations include some changes to reduce
 CPU usage and to avoid issuing empty commits unnecessarily.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
 0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
 yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
 NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
 rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
 ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
 AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
 zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
 U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
 vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
 +TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
 XrddRLGlk4Hf+2o7/D7B
 =SwaI
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Lots of bug fixes, cleanups and optimizations.  In the bug fixes
  category, of note is a fix for on-line resizing file systems where the
  block size is smaller than the page size (i.e., file systems 1k blocks
  on x86, or more interestingly file systems with 4k blocks on Power or
  ia64 systems.)

  In the cleanup category, the ext4's punch hole implementation was
  significantly improved by Lukas Czerner, and now supports bigalloc
  file systems.  In addition, Jan Kara significantly cleaned up the
  write submission code path.  We also improved error checking and added
  a few sanity checks.

  In the optimizations category, two major optimizations deserve
  mention.  The first is that ext4_writepages() is now used for
  nodelalloc and ext3 compatibility mode.  This allows writes to be
  submitted much more efficiently as a single bio request, instead of
  being sent as individual 4k writes into the block layer (which then
  relied on the elevator code to coalesce the requests in the block
  queue).  Secondly, the extent cache shrink mechanism, which was
  introduce in 3.9, no longer has a scalability bottleneck caused by the
  i_es_lru spinlock.  Other optimizations include some changes to reduce
  CPU usage and to avoid issuing empty commits unnecessarily."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
  ext4: optimize starting extent in ext4_ext_rm_leaf()
  jbd2: invalidate handle if jbd2_journal_restart() fails
  ext4: translate flag bits to strings in tracepoints
  ext4: fix up error handling for mpage_map_and_submit_extent()
  jbd2: fix theoretical race in jbd2__journal_restart
  ext4: only zero partial blocks in ext4_zero_partial_blocks()
  ext4: check error return from ext4_write_inline_data_end()
  ext4: delete unnecessary C statements
  ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
  jbd2: move superblock checksum calculation to jbd2_write_superblock()
  ext4: pass inode pointer instead of file pointer to punch hole
  ext4: improve free space calculation for inline_data
  ext4: reduce object size when !CONFIG_PRINTK
  ext4: improve extent cache shrink mechanism to avoid to burn CPU time
  ext4: implement error handling of ext4_mb_new_preallocation()
  ext4: fix corruption when online resizing a fs with 1K block size
  ext4: delete unused variables
  ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
  jbd2: remove debug dependency on debug_fs and update Kconfig help text
  jbd2: use a single printk for jbd_debug()
  ...
2013-07-02 09:39:34 -07:00
Josef Bacik
0e267c44c3 Btrfs: wait ordered range before doing direct io
My recent truncate patch uncovered this bug, but I can reproduce it without the
truncate patch.  If you mount with -o compress-force, do a direct write to some
area, do a buffered write to some other area, and then do a direct read you will
get the wrong data for where you did the buffered write.  This is because the
generic direct io helpers only call filemap_write_and_wait once, and for
compression we need it twice.  So to be safe add the btrfs_wait_ordered_range to
the start of the direct io function to make sure any compressed writes have
truly been written.  This patch makes xfstests 130 pass when you mount with -o
compress-force=lzo.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:49 -04:00
Miao Xie
e6da5d2ec9 Btrfs: cleanup redundant code in btrfs_submit_direct()
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:48 -04:00
Josef Bacik
7ee9e4405f Btrfs: check if we can nocow if we don't have data space
We always just try and reserve data space when we write, but if we are out of
space but have prealloc'ed extents we should still successfully write.  This
patch will try and see if we can write to prealloc'ed space and if we can go
ahead and allow the write to continue.  With this patch we now pass xfstests
generic/274.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:45 -04:00
Josef Bacik
f23b5a5995 Btrfs: check for actual acls rather than just xattrs when caching no acl
We have an optimization that will go ahead and cache no acls on an inode if
there are no xattrs on the inode.  This saves us a lookup later to check the
acls for writes or any other access.  The problem is I use selinux so I always
have an xattr on inodes, so make this test a little smarter and check for the
actual acl hash on the key and if it isn't there then we still get to cache no
acl which makes everybody who uses selinux a little happier.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:40 -04:00
Josef Bacik
a71754fc68 Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
This has plagued us forever and I'm so over working around it.  When we truncate
down to a non-page aligned offset we will call btrfs_truncate_page to zero out
the end of the page and write it back to disk, this will keep us from exposing
stale data if we truncate back up from that point.  The problem with this is it
requires data space to do this, and people don't really expect to get ENOSPC
from truncate() for these sort of things.  This also tends to bite the orphan
cleanup stuff too which keeps people from mounting.  To get around this we can
just move this into btrfs_cont_expand() to make sure if we are truncating up
from a non-page size aligned i_size we will zero out the rest of this page so
that we don't expose stale data.  This will give ENOSPC if you try to truncate()
up or if you try to write past the end of isize, which is much more reasonable.
This fixes xfstests generic/083 failing to mount because of the orphan cleanup
failing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:33 -04:00
Josef Bacik
fdf8e2ea3c Btrfs: unlock extent range on enospc in compressed submit
A user reported a deadlock where the async submit thread was blocked on the
lock_extent() lock, and then everybody behind him was locked on the page lock
for the page he was holding.  Looking at the code I noticed we do not unlock the
extent range when we get ENOSPC and goto retry.  This is bad because we
immediately try to lock that range again to do the cow, which will cause a
deadlock.  Fix this by unlocking the range.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:31 -04:00
Al Viro
9cdda8d31f [readdir] convert btrfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:00 +04:00
Josef Bacik
01cd33674e Btrfs: put our inode if orphan cleanup fails
When we cross into a different subvol when doing a lookup we will run the orhpan
cleanup.  If this fails however we do not drop the ref to the inode we were
looking up before we return an error, which leads to busy inodes on umount.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:16 -04:00
Josef Bacik
c69b26b011 Btrfs: add some missing iput()'s in btrfs_orphan_cleanup
There are some error cases that we don't do an iput() on our inode, fix this.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:15 -04:00
Josef Bacik
d52be818e6 Btrfs: simplify unlink reservations
Dave pointed out a problem where if you filled up a file system as much as
possible you couldn't remove any files.  The whole unlink reservation thing is
convoluted because it tries to guess if it's going to add space to unlink
something or not, and has all these odd uncommented cases where it simply does
not try.  So to fix this I've added a way to conditionally steal from the global
reserve if we can't make our normal reservation.  If we have more than half the
space in the global reserve free we will go ahead and steal from the global
reserve.  With this patch Dave's reproducer now works and I can rm all the files
on the file system.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:06 -04:00
Miao Xie
199c2a9c3d Btrfs: introduce per-subvolume ordered extent list
The reason we introduce per-subvolume ordered extent list is the same
as the per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:41 -04:00
Miao Xie
eb73c1b7ce Btrfs: introduce per-subvolume delalloc inode list
When we create a snapshot, we need flush all delalloc inodes in the
fs, just flushing the inodes in the source tree is OK. So we introduce
per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:40 -04:00
Stefan Behrens
3c64a1aba7 Btrfs: cleanup: don't check the same thing twice
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in six places.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:30 -04:00
Linus Torvalds
a2648ebb7e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assortment of crash fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: stop all workers before cleaning up roots
  Btrfs: fix use-after-free bug during umount
  Btrfs: init relocate extent_io_tree with a mapping
  btrfs: Drop inode if inode root is NULL
  Btrfs: don't delete fs_roots until after we cleanup the transaction
2013-06-13 22:34:14 -07:00
Naohiro Aota
6379ef9fb2 btrfs: Drop inode if inode root is NULL
There is a path where btrfs_drop_inode() is called with its inode's root
is NULL: In btrfs_new_inode(), when btrfs_set_inode_index() fails,
iput() is called. We should handle this case before taking look at the
root->root_item.

Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:07:53 -04:00
Lukas Czerner
d47992f86b mm: change invalidatepage prototype to accept length
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.

Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).

This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.

We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.

Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
2013-05-21 23:17:23 -04:00
Linus Torvalds
130901ba33 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Miao Xie has been very busy, fixing races and enospc problems and many
  other small but important pieces.

  Alexandre Oliva discovered some problems with how our error handling
  was interacting with the block layer and for now has disabled our
  partial handling of sub-page writes.  The real sub-page work is in a
  series of patches from IBM that we still need to integrate and test.
  The code Alexandre has turned off was really incomplete.

  Josef has more error handling fixes and an important fix for the new
  skinny extent format.

  This also has my fix for the tracepoint crash from late in 3.9.  It's
  the first stage in a larger clean up to get rid of btrfs_bio and make
  a proper bioset for all the items we need to tack into the bio.  For
  now the bioset only holds our mirror_num and stripe_index, but for the
  next merge window I'll shuffle more in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
  Btrfs: use a btrfs bioset instead of abusing bio internals
  Btrfs: make sure roots are assigned before freeing their nodes
  Btrfs: explicitly use global_block_rsv for quota_tree
  btrfs: do away with non-whole_page extent I/O
  Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
  Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
  Btrfs: pause the space balance when remounting to R/O
  Btrfs: fix unprotected root node of the subvolume's inode rb-tree
  Btrfs: fix accessing a freed tree root
  Btrfs: return errno if possible when we fail to allocate memory
  Btrfs: update the global reserve if it is empty
  Btrfs: don't steal the reserved space from the global reserve if their space type is different
  Btrfs: optimize the error handle of use_block_rsv()
  Btrfs: don't use global block reservation for inode cache truncation
  Btrfs: don't abort the current transaction if there is no enough space for inode cache
  Correct allowed raid levels on balance.
  Btrfs: fix possible memory leak in replace_path()
  Btrfs: fix possible memory leak in the find_parent_nodes()
  Btrfs: don't allow device replace on RAID5/RAID6
  Btrfs: handle running extent ops with skinny metadata
  ...
2013-05-18 11:35:28 -07:00
Chris Mason
c5cb6a0573 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next 2013-05-17 21:53:17 -04:00
Chris Mason
9be3395bcd Btrfs: use a btrfs bioset instead of abusing bio internals
Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.

As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs.  They are also used
to count IO failures on a per device basis.

Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.

This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index.  The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
2013-05-17 21:52:52 -04:00
Miao Xie
e1409cef85 Btrfs: fix unprotected root node of the subvolume's inode rb-tree
The root node of the rb-tree may be changed, so we should get it under
the lock. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:30 -04:00
Miao Xie
89042e5ad2 Btrfs: fix accessing a freed tree root
inode_tree_del() will move the tree root into the dead root list, and
then the tree will be destroyed by the cleaner. So if we remove the
delayed node which is cached in the inode after inode_tree_del(),
we may access a freed tree root. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:29 -04:00
Liu Bo
b9aa55bed1 Btrfs: return errno if possible when we fail to allocate memory
We need to set return value explicitly, otherwise we'll lose the error
value.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:27 -04:00
Linus Torvalds
983a5f84a4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "These are mostly fixes.  The biggest exceptions are Josef's skinny
  extents and Jan Schmidt's code to rebuild our quota indexes if they
  get out of sync (or you enable quotas on an existing filesystem).

  The skinny extents are off by default because they are a new variation
  on the extent allocation tree format.  btrfstune -x enables them, and
  the new format makes the extent allocation tree about 30% smaller.

  I rebased this a few days ago to rework Dave Sterba's crc checks on
  the super block, but almost all of these go back to rc6, since I
  though 3.9 was due any minute.

  The biggest missing fix is the tracepoint bug that was hit late in
  3.9.  I ran into problems with that in overnight testing and I'm still
  tracking it down.  I'll definitely have that fixed for rc2."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits)
  Btrfs: allow superblock mismatch from older mkfs
  btrfs: enhance superblock checks
  btrfs: fix misleading variable name for flags
  btrfs: use unsigned long type for extent state bits
  Btrfs: improve the loop of scrub_stripe
  btrfs: read entire device info under lock
  btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
  btrfs: handle errors returned from get_tree_block_key
  btrfs: make static code static & remove dead code
  Btrfs: deal with errors in write_dev_supers
  Btrfs: remove almost all of the BUG()'s from tree-log.c
  Btrfs: deal with free space cache errors while replaying log
  Btrfs: automatic rescan after "quota enable" command
  Btrfs: rescan for qgroups
  Btrfs: split btrfs_qgroup_account_ref into four functions
  Btrfs: allocate new chunks if the space is not enough for global rsv
  Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
  btrfs: move leak debug code to functions
  Btrfs: return free space in cow error path
  Btrfs: set UUID in root_item for created trees
  ...
2013-05-09 13:07:40 -07:00
Kent Overstreet
a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
David Sterba
410748882a btrfs: use unsigned long type for extent state bits
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:27 -04:00
Eric Sandeen
48a3b6366f btrfs: make static code static & remove dead code
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:23 -04:00
Liu Bo
ace68bac61 Btrfs: return free space in cow error path
Replace some BUG_ONs with proper handling and take allocated space back to
free space cache for later use.

We don't have to worry about extent maps since they'd be freed in releasepage
path.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:15 -04:00
Josef Bacik
eb384b55ae Btrfs: fix extent logging with O_DIRECT into prealloc
This is the same as the fix from commit

Btrfs: fix bad extent logging

but for O_DIRECT.  I missed this when I fixed the problem originally, we were
still using the em for the orig_start and orig_block_len, which would be the
merged extent.  We need to use the actual extent from the on disk file extent
item, which we have to lookup to make sure it's ok to nocow anyway so just pass
in some pointers to hold this info.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:07 -04:00
Tsutomu Itoh
afe5fea72b Btrfs: cleanup of function where fixup_low_keys() is called
If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:52 -04:00
Zach Brown
d4e3991b99 btrfs: abort unlink trans in missed error case
__btrfs_unlink_inode() aborts its transaction when it sees errors after
it removes the directory item.  But it missed the case where
btrfs_del_dir_entries_in_log() returns an error.  If this happens then
the unlink appears to fail but the items have been removed without
updating the directory size.  The directory then has leaked bytes in
i_size and can never be removed.

Adding the missing transaction abort at least makes this failure
consistent with the other failure cases.

I noticed this while reading the code after someone on irc reported
having a directory with i_size but no entries.  I tested it by forcing
btrfs_del_dir_entries_in_log() to return -ENOMEM.

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:36 -04:00
Josef Bacik
09a2a8f96e Btrfs: fix bad extent logging
A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery.  I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem.  The
problem is if your application does something like this

[prealloc][prealloc][prealloc]

the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents.  So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space.  If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later.  The data and such will still work,
but everything else is broken.  This patch fixes this by not allowing extents
that are on the modified list to be merged.  This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree.  So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents.  With this patch the testcase I've created no
longer creates a bogus file system after replaying the log.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:34 -04:00
Josef Bacik
cc95bef635 Btrfs: log ram bytes properly
When logging changed extents I was logging ram_bytes as the current length,
which isn't correct, it's supposed to be the ram bytes of the original extent.
This is for compression where even if we split the extent we need to know the
ram bytes so when we uncompress the extent we know how big it will be.  This was
still working out right with compression for some reason but I think we were
getting lucky.  It was definitely off for prealloc which is why I noticed it,
btrfsck was complaining about it.  With this patch btrfsck no longer complains
after a log replay.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:33 -04:00
David Sterba
4884b476d7 btrfs: make orphan cleanup less verbose
The messages

  btrfs: unlinked 123 orphans
  btrfs: truncated 456 orphans

are not useful to regular users and raise questions whether there are
problems with the filesystem.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:24 -04:00
Simon Kirby
c2cf52eb71 Btrfs: Include the device in most error printk()s
With more than one btrfs volume mounted, it can be very difficult to find
out which volume is hitting an error. btrfs_error() will print this, but
it is currently rigged as more of a fatal error handler, while many of
the printk()s are currently for debugging and yet-unhandled cases.

This patch just changes the functions where the device information is
already available. Some cases remain where the root or fs_info is not
passed to the function emitting the error.

This may introduce some confusion with volumes backed by multiple devices
emitting errors referring to the primary device in the set instead of the
one on which the error occurred.

Use btrfs_printk(fs_info, format, ...) rather than writing the device
string every time, and introduce macro wrappers ala XFS for brevity.
Since the function already cannot be used for continuations, print a
newline as part of the btrfs_printk() message rather than at each caller.

Signed-off-by: Simon Kirby <sim@hostway.ca>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:23 -04:00
Josef Bacik
3173a18f70 Btrfs: add a incompatible format change for smaller metadata extent refs
We currently store the first key of the tree block inside the reference for the
tree block in the extent tree.  This takes up quite a bit of space.  Make a new
key type for metadata which holds the level as the offset and completely removes
storing the btrfs_tree_block_info inside the extent ref.  This reduces the size
from 51 bytes to 33 bytes per extent reference for each tree block.  In practice
this results in a 30-35% decrease in the size of our extent tree, which means we
COW less and can keep more of the extent tree in memory which makes our heavy
metadata operations go much faster.  This is not an automatic format change, you
must enable it at mkfs time or with btrfstune.  This patch deals with having
metadata stored as either the old format or the new format so it is easy to
convert.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:18 -04:00
Liu Bo
b0496686ba Btrfs: cleanup unused arguments of btrfs_csum_data
Argument 'root' is no more used in btrfs_csum_data().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:14 -04:00
Linus Torvalds
3615db41c4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've had a busy two weeks of bug fixing.  The biggest patches in here
  are some long standing early-enospc problems (Josef) and a very old
  race where compression and mmap combine forces to lose writes (me).
  I'm fairly sure the mmap bug goes all the way back to the introduction
  of the compression code, which is proof that fsx doesn't trigger every
  possible mmap corner after all.

  I'm sure you'll notice one of these is from this morning, it's a small
  and isolated use-after-free fix in our scrub error reporting.  I
  double checked it here."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: don't drop path when printing out tree errors in scrub
  Btrfs: fix wrong return value of btrfs_lookup_csum()
  Btrfs: fix wrong reservation of csums
  Btrfs: fix double free in the btrfs_qgroup_account_ref()
  Btrfs: limit the global reserve to 512mb
  Btrfs: hold the ordered operations mutex when waiting on ordered extents
  Btrfs: fix space accounting for unlink and rename
  Btrfs: fix space leak when we fail to reserve metadata space
  Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
  Btrfs: fix race between mmap writes and compression
  Btrfs: fix memory leak in btrfs_create_tree()
  Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
  Btrfs: fix missing qgroup reservation before fallocating
  Btrfs: handle a bogus chunk tree nicely
  Btrfs: update to use fs_state bit
2013-03-29 11:13:25 -07:00
Miao Xie
39847c4d3d Btrfs: fix wrong reservation of csums
We reserve the space for csums only when we write data into a file, in
the other cases, such as tree log, log replay, we don't do reservation,
so we can use the reservation of the transaction handle just for the former.
And for the latter, we should use the tree's own reservation. But the
function - btrfs_csum_file_blocks() didn't differentiate between these
two types of the cases, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-03-28 09:51:30 -04:00
Josef Bacik
6e137ed3f3 Btrfs: fix space accounting for unlink and rename
We are way over-reserving for unlink and rename.  Rename is just some random
huge number and unlink accounts for tree log operations that don't actually
happen during unlink, not to mention the tree log doesn't take from the trans
block rsv anyway so it's completely useless.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-03-28 09:51:27 -04:00
Chris Mason
4adaa61102 Btrfs: fix race between mmap writes and compression
Btrfs uses page_mkwrite to ensure stable pages during
crc calculations and mmap workloads.  We call clear_page_dirty_for_io
before we do any crcs, and this forces any application with the file
mapped to wait for the crc to finish before it is allowed to change
the file.

With compression on, the clear_page_dirty_for_io step is happening after
we've compressed the pages.  This means the applications might be
changing the pages while we are compressing them, and some of those
modifications might not hit the disk.

This commit adds the clear_page_dirty_for_io before compression starts
and makes sure to redirty the page if we have to fallback to
uncompressed IO as well.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Alexandre Oliva <oliva@gnu.org>
cc: stable@vger.kernel.org
2013-03-26 13:19:14 -04:00
Linus Torvalds
08637024ab Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Eric's rcu barrier patch fixes a long standing problem with our
  unmount code hanging on to devices in workqueue helpers.  Liu Bo
  nailed down a difficult assertion for in-memory extent mappings."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix warning of free_extent_map
  Btrfs: fix warning when creating snapshots
  Btrfs: return as soon as possible when edquot happens
  Btrfs: return EIO if we have extent tree corruption
  btrfs: use rcu_barrier() to wait for bdev puts at unmount
  Btrfs: remove btrfs_try_spin_lock
  Btrfs: get better concurrency for snapshot-aware defrag work
2013-03-17 11:04:14 -07:00
Liu Bo
a09a0a705d Btrfs: get better concurrency for snapshot-aware defrag work
Using spinning case instead of blocking will result in better concurrency
overall.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-03-14 14:50:19 -04:00
Linus Torvalds
0aefda3e81 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are scattered fixes and one performance improvement.  The
  biggest functional change is in how we throttle metadata changes.  The
  new code bumps our average file creation rate up by ~13% in fs_mark,
  and lowers CPU usage.

  Stefan bisected out a regression in our allocation code that made
  balance loop on extents larger than 256MB."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: improve the delayed inode throttling
  Btrfs: fix a mismerge in btrfs_balance()
  Btrfs: enforce min_bytes parameter during extent allocation
  Btrfs: allow running defrag in parallel to administrative tasks
  Btrfs: avoid deadlock on transaction waiting list
  Btrfs: do not BUG_ON on aborted situation
  Btrfs: do not BUG_ON in prepare_to_reloc
  Btrfs: free all recorded tree blocks on error
  Btrfs: build up error handling for merge_reloc_roots
  Btrfs: check for NULL pointer in updating reloc roots
  Btrfs: fix unclosed transaction handler when the async transaction commitment fails
  Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails
  Btrfs: use set_nlink if our i_nlink is 0
2013-03-08 17:33:20 -08:00
Chris Mason
154ea28930 Btrfs: enforce min_bytes parameter during extent allocation
Commit 24542bf7ea changed preallocation of
extents to cap the max size we try to allocate.  It's a valid change,
but the extent reservation code is also used by balance, and that
can't tolerate a smaller extent being allocated.

__btrfs_prealloc_file_range already has a min_size parameter, which is
used by relocation to request a specific extent size.  This commit
adds an extra check to enforce that minimum extent size.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
2013-03-05 11:30:16 -05:00
Linus Torvalds
b695188dd3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "The biggest feature in the pull is the new (and still experimental)
  raid56 code that David Woodhouse started long ago.  I'm still working
  on the parity logging setup that will avoid inconsistent parity after
  a crash, so this is only for testing right now.  But, I'd really like
  to get it out to a broader audience to hammer out any performance
  issues or other problems.

  scrub does not yet correct errors on raid5/6 either.

  Josef has another pass at fsync performance.  The big change here is
  to combine waiting for metadata with waiting for data, which is a big
  latency win.  It is also step one toward using atomics from the
  hardware during a commit.

  Mark Fasheh has a new way to use btrfs send/receive to send only the
  metadata changes.  SUSE is using this to make snapper more efficient
  at finding changes between snapshosts.

  Snapshot-aware defrag is also included.

  Otherwise we have a large number of fixes and cleanups.  Eric Sandeen
  wins the award for removing the most lines, and I'm hoping we steal
  this idea from XFS over and over again."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
  btrfs: fixup/remove module.h usage as required
  Btrfs: delete inline extents when we find them during logging
  btrfs: try harder to allocate raid56 stripe cache
  Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic
  Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails
  Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree
  Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails
  Btrfs: fix missing deleted items in btrfs_clean_quota_tree
  btrfs: use only inline_pages from extent buffer
  Btrfs: fix wrong reserved space when deleting a snapshot/subvolume
  Btrfs: fix wrong reserved space in qgroup during snap/subv creation
  Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot
  btrfs: remove a printk from scan_one_device
  Btrfs: fix NULL pointer after aborting a transaction
  Btrfs: fix memory leak of log roots
  Btrfs: copy everything if we've created an inline extent
  btrfs: cleanup for open-coded alignment
  Btrfs: do not change inode flags in rename
  Btrfs: use reserved space for creating a snapshot
  clear chunk_alloc flag on retryable failure
  ...
2013-03-02 16:41:54 -08:00
Josef Bacik
bdc20e67e8 Btrfs: copy everything if we've created an inline extent
I noticed while looking into a tree logging bug that we aren't logging inline
extents properly.  Since this requires copying and it shouldn't happen too often
just force us to copy everything for the inode into the tree log when we have an
inline extent.  With this patch we have valid data after a crash when we write
an inline extent.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-28 13:33:20 -05:00
Linus Torvalds
d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
Qu Wenruo
fda2832feb btrfs: cleanup for open-coded alignment
Though most of the btrfs codes are using ALIGN macro for page alignment,
there are still some codes using open-coded alignment like the
following:
------
        u64 mask = ((u64)root->stripesize - 1);
        u64 ret = (val + mask) & ~mask;
------
Or even hidden one:
------
        num_bytes = (end - start + blocksize) & ~(blocksize - 1);
------

Sometimes these open-coded alignment is not so easy to understand for
newbie like me.

This commit changes the open-coded alignment to the ALIGN macro for a
better readability.

Also there is a previous patch from David Sterba with similar changes,
but the patch is for 3.2 kernel and seems not merged.
http://www.spinics.net/lists/linux-btrfs/msg12747.html

Cc: David Sterba <dave@jikos.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-26 11:04:13 -05:00
Liu Bo
8c4ce81e91 Btrfs: do not change inode flags in rename
Before we forced to change a file's NOCOW and COMPRESS flag due to
the parent directory's, but this ends up a bad idea, because it
confuses end users a lot about file's NOCOW status, eg. if someone
change a file to NOCOW via 'chattr' and then rename it in the current
directory which is without NOCOW attribute, the file will lose the
NOCOW flag silently.

This diables 'change flags in rename', so from now on we'll only
inherit flags from the parent directory on creation stage while in
other places we can use 'chattr' to set NOCOW or COMPRESS flags.

Reported-by: Marios Titas <redneb8888@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-26 11:01:19 -05:00
Josef Bacik
f2bdf9a8f7 Btrfs: make sure NODATACOW also gets NODATASUM set
A user reported hitting the BUG_ON() in btrfs_finished_ordered_io() where we had
csums on a NOCOW extent.  This can happen if we have NODATACOW set but not
NODATASUM set, which can happen in two cases, either we mount with -o nodatacow
and then write into preallocated space, or chattr +C a directory and move a file
into that directory.  Liu has fixed the move case in a different place, but this
fixes the mount -o nodatacow case.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-26 10:57:48 -05:00
Al Viro
496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Miao Xie
172a50497f Btrfs: fix wrong outstanding_extents when doing DIO write
When running the 083th case of xfstests on the filesystem with
"compress-force=lzo", the following WARNINGs were triggered.
  WARNING: at fs/btrfs/inode.c:7908
  WARNING: at fs/btrfs/inode.c:7909
  WARNING: at fs/btrfs/inode.c:7911
  WARNING: at fs/btrfs/extent-tree.c:4510
  WARNING: at fs/btrfs/extent-tree.c:4511

This problem was introduced by the patch "Btrfs: fix deadlock due
to unsubmitted". In this patch, there are two bugs which caused
the above problem.

The 1st one is a off-by-one bug, if the DIO write return 0, it is
also a short write, we need release the reserved space for it. But
we didn't do it in that patch. Fix it by change "ret > 0" to
"ret >= 0".

The 2nd one is ->outstanding_extents was increased twice when
a short write happened. As we know, ->outstanding_extents is
a counter to keep track of the number of extent items we may
use duo to delalloc, when we reserve the free space for a
delalloc write, we assume that the write will introduce just
one extent item, so we increase ->outstanding_extents by 1 at
that time. And then we will increase it every time we split the
write, it is done at the beginning of btrfs_get_blocks_direct().
So when a short write happens, we needn't increase
->outstanding_extents again. But this patch done.

In order to fix the 2nd problem, I re-write the logic for
->outstanding_extents operation. We don't increase it at the
beginning of btrfs_get_blocks_direct(), instead, we just
increase it when the split actually happens.

Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-21 08:11:43 -05:00
Liu Bo
38c227d87c Btrfs: snapshot-aware defrag
This comes from one of btrfs's project ideas,
As we defragment files, we break any sharing from other snapshots.
The balancing code will preserve the sharing, and defrag needs to grow this
as well.

Now we're able to fill the blank with this patch, in which we make full use of
backref walking stuff.

Here is the basic idea,
o  set the writeback ranges started by defragment with flag EXTENT_DEFRAG
o  at endio, after we finish updating fs tree, we use backref walking to find
   all parents of the ranges and re-link them with the new COWed file layout by
   adding corresponding backrefs.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-20 18:06:09 -05:00
Zach Brown
24542bf7ea btrfs: limit fallocate extent reservation to 256MB
Very large fallocate requests are cpu bound and result in extents with a
repeating pattern of ever decreasing size:

$ time fallocate -l 1T file
real	0m13.039s

( an excerpt of the extents from btrfs-debug-tree: )
  prealloc data disk byte 1536292564992 nr 397312
  prealloc data disk byte 1536292962304 nr 196608
  prealloc data disk byte 1536293158912 nr 98304
  prealloc data disk byte 1536293257216 nr 49152
  prealloc data disk byte 1536293306368 nr 24576
  prealloc data disk byte 1536293330944 nr 12288
  prealloc data disk byte 1536293343232 nr 8192
  prealloc data disk byte 1536293351424 nr 4096
  prealloc data disk byte 1536293355520 nr 4096
  prealloc data disk byte 1536293359616 nr 4096

The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
allocate the entire remaining size after each extent is allocated.
btrfs_reserve_extent() repeatedly cuts this requested size in half until
it gets down to the size that the allocators can return.  We limit the
problem for now by capping each reservation at 256 meg.

The small extents come from a masking bug when decreasing the requested
reservation size.  The high 32bits are cleared and the remaining low
bits might happen to reserve a small size.   Fix this by using
round_down() which properly casts the mask.

After these fixes huge fallocate requests are fast and result in nice
large extents:

$ time fallocate -l 1T file
real	0m0.082s

  prealloc data disk byte 1112425889792 nr 268435456
  prealloc data disk byte 1112694325248 nr 268435456
  prealloc data disk byte 1112962760704 nr 268435456

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-20 14:06:25 -05:00
Chris Mason
e942f883bc Merge branch 'raid56-experimental' into for-linus-3.9
Signed-off-by: Chris Mason <chris.mason@fusionio.com>

Conflicts:
	fs/btrfs/ctree.h
	fs/btrfs/extent-tree.c
	fs/btrfs/inode.c
	fs/btrfs/volumes.c
2013-02-20 14:06:05 -05:00
Chris Mason
b2c6b3e061 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-linus-3.9
Signed-off-by: Chris Mason <chris.mason@fusionio.com>

Conflicts:
	fs/btrfs/disk-io.c
2013-02-20 14:05:45 -05:00
Liu Bo
fa6ac8765c Btrfs: fix cleaner thread not working with inode cache option
Right now inode cache inode is treated as the same as space cache
inode, ie. keep inode in memory till putting super.

But this leads to an awkward situation.

If we're going to delete a snapshot/subvolume, btrfs will not
actually delete it and return free space, but will add it to dead
roots list until the last inode on this snap/subvol being destroyed.
Then we'll fetch deleted roots and cleanup them via cleaner thread.

So here is the problem, if we enable inode cache option, each
snap/subvol has a cached inode which is used to store inode allcation
information.  And this cache inode will be kept in memory, as the above
said.  So with inode cache, snap/subvol can only be added into
dead roots list during freeing roots stage in umount, so that we can
ONLY get space back after another remount(we cleanup dead roots on mount).

But the real thing is we'll no more use the snap/subvol if we mark it
deleted, so we can safely iput its cache inode when we delete snap/subvol.

Another thing is that we need to change the rules of droping inode, we
don't keep snap/subvol's cache inode in memory till end so that we can
add snap/subvol into dead roots list in time.

Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 13:00:06 -05:00
Miao Xie
38851cc19a Btrfs: implement unlocked dio write
This idea is from ext4. By this patch, we can make the dio write parallel,
and improve the performance. But because we can not update isize without
i_mutex, the unlocked dio write just can be done in front of the EOF.

We needn't worry about the race between dio write and truncate, because the
truncate need wait untill all the dio write end.

And we also needn't worry about the race between dio write and punch hole,
because we have extent lock to protect our operation.

I ran fio to test the performance of this feature.

== Hardware ==
CPU: Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz
Mem: 2GB
SSD: Intel X25-M 120GB (Test Partition: 60GB)

== config file ==
[global]
ioengine=psync
direct=1
bs=4k
size=32G
runtime=60
directory=/mnt/btrfs/
filename=testfile
group_reporting
thread

[file1]
numjobs=1 # 2 4
rw=randwrite

== result (KBps) ==
write	1	2	4
lock	24936	24738	24726
nolock	24962	30866	32101

== result (iops) ==
write	1	2	4
lock	6234	6184	6181
nolock	6240	7716	8025

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:48 -05:00
Miao Xie
2e60a51e62 Btrfs: serialize unlocked dio reads with truncate
Currently, we can do unlocked dio reads, but the following race
is possible:

dio_read_task			truncate_task
				->btrfs_setattr()
->btrfs_direct_IO
    ->__blockdev_direct_IO
      ->btrfs_get_block
				  ->btrfs_truncate()
				 #alloc truncated blocks
				 #to other inode
      ->submit_io()
     #INFORMATION LEAK

In order to avoid this problem, we must serialize unlocked dio reads with
truncate. There are two approaches:
- use extent lock to protect the extent that we truncate
- use inode_dio_wait() to make sure the truncating task will wait for
  the read DIO.

If we use the 1st one, we will meet the endless truncation problem due to
the nonlocked read DIO after we implement the nonlocked write DIO. It is
because we still need invoke inode_dio_wait() avoid the race between write
DIO and truncation. By that time, we have to introduce

  btrfs_inode_{block, resume}_nolock_dio()

again. That is we have to implement this patch again, so I choose the 2nd
way to fix the problem.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:47 -05:00
Miao Xie
0934856d46 Btrfs: fix deadlock due to unsubmitted
The deadlock problem happened when running fsstress(a test program in LTP).

Steps to reproduce:
 # mkfs.btrfs -b 100M <partition>
 # mount <partition> <mnt>
 # <Path>/fsstress -p 3 -n 10000000 -d <mnt>

The reason is:
btrfs_direct_IO()
 |->do_direct_IO()
     |->get_page()
     |->get_blocks()
     |	 |->btrfs_delalloc_resereve_space()
     |	 |->btrfs_add_ordered_extent() -------	Add a new ordered extent
     |->dio_send_cur_page(page0) --------------	We didn't submit bio here
     |->get_page()
     |->get_blocks()
	 |->btrfs_delalloc_resereve_space()
	     |->flush_space()
		 |->btrfs_start_ordered_extent()
		     |->wait_event() ----------	Wait the completion of
						the ordered extent that is
						mentioned above

But because we didn't submit the bio that is mentioned above, the ordered
extent can not complete, we would wait for its completion forever.

There are two methods which can fix this deadlock problem:
1. submit the bio before we invoke get_blocks()
2. reserve the space before we do dio

Though the 1st is the simplest way, we need modify the code of VFS, and it
is likely to break contiguous requests, and introduce performance regression
for the other filesystems.

So we have to choose the 2nd way.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:45 -05:00
Josef Bacik
4a7d0f6854 Btrfs: cleanup orphan reservation if truncate fails
I noticed we were getting lots of warnings with xfstest 83 because we have
reservations outstanding.  This is because we moved the orphan add outside
of the truncate, but we don't actually cleanup our reservation if something
fails.  This fixes the problem and I no longer see warnings.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:44 -05:00
Josef Bacik
5d80366e9b Btrfs: steal from global reserve if we are cleaning up orphans
Sometimes xfstest 83 will fail to remount the scratch device because we've
gotten ourselves so full that we cannot cleanup the orphan items.  In this
case check to see if we're doing the orphan cleanup and if we are allow us
to steal our reservation from the global block rsv.  With this patch I've
not been able to reproduce the failed mount problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:42 -05:00
Josef Bacik
3e04e7f10b Btrfs: handle errors in compression submission path
I noticed we would deadlock if we aborted a transaction while doing
compressed io.  This is because we don't unlock our pages if something goes
horribly wrong.  To fix this we need to make sure that we call
extent_clear_unlock_delalloc in order to unlock all the pages.  If we have
to cow in the async submission thread we need to make sure to unlock our
locked_page as the cow error path will not unlock the locked page as it
depends on the caller to unlock that page.  With this patch we no longer
deadlock on the page lock when we have an aborted transaction.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:33 -05:00
Josef Bacik
925396ecf2 Btrfs: account for orphan inodes properly during cleanup
Dave sent me a panic where we were doing the orphan cleanup and panic'ed
trying to release our reservation from the orphan block rsv.  The reason for
this is because our orphan block rsv had been free'd out from underneath us
because the transaction commit found that there were no orphan inodes
according to its count and decided to free it.  This is incorrect so make
sure we inc the orphan inodes count so the accounting is all done properly.
This would also cause the warning in the orphan commit code normally if you
had any orphans to cleanup as they would only decrement the orphan count so
you'd get a negative orphan count which could cause problems during runtime.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:31 -05:00
Josef Bacik
0bec9ef525 Btrfs: unreserve space if our ordered extent fails to work
When a transaction aborts or there's an EIO on an ordered extent or any
error really we will not free up the space we reserved for this ordered
extent.  This results in warnings from the block group cache cleanup in the
case of a transaction abort, or leaking space in the case of EIO on an
ordered extent.  Fix this up by free'ing the reserved space if we have an
error at all trying to complete an ordered extent.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:29 -05:00
Miao Xie
df0af1a57f Btrfs: use the inode own lock to protect its delalloc_bytes
We need not use a global lock to protect the delalloc_bytes of the
inode, just use its own lock. In this way, we can reduce the lock
contention and ->delalloc_lock will just protect delalloc inode
list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:06 -05:00
Miao Xie
963d678b0f Btrfs: use percpu counter for fs_info->delalloc_bytes
fs_info->delalloc_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant for it to reduce the lock
contention.

This patch also fixed the problem that we access the variant
without the lock protection.At worst, we would not flush the
delalloc inodes, and just return ENOSPC error when we still have
some free space in the fs.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:05 -05:00
Filipe Brandenburger
55e301fd57 Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h
The header file will then be installed under /usr/include/linux so that
userspace applications can refer to Btrfs ioctls by name and use the same
structs used internally in the kernel.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:28 -05:00
Josef Bacik
fe5fafbebd Revert "Btrfs: fix permissions of empty files not affected by umask"
This reverts commit 2794ed013b.

Wasn't supposed to get used in btrfs_mknod, it was supposed to be in
btrfs_create, which was done in commit
9185aa587b.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:26 -05:00
Miao Xie
63607cc86a Btrfs: traverse and flush the delalloc inodes once
btrfs_start_delalloc_inodes() needn't traverse and flush the delalloc inodes
repeatedly. It is because we can regard the data that the users write after
we start delalloc inodes flush as the one which is after the delalloc inodes
flush is done, and we can flush it next time.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:23 -05:00
Liu Bo
51fab69347 Btrfs: use token to avoid times mapping extent buffer
The API in tree log code has done sort of changes, and it proves that
we can benifit from using token, so do the same thing here.

function_graph tracer's timer shows that it costs nearly half time
of before(39.788us -> 22.391us).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:15 -05:00
Josef Bacik
2ab28f322f Btrfs: wait on ordered extents at the last possible moment
Since we don't actually copy the extent information from the source tree in
the fast case we don't need to wait for ordered io to be completed in order
to fsync, we just need to wait for the io to be completed.  So when we're
logging our file just attach all of the ordered extents to the log, and then
when the log syncs just wait for IO_DONE on the ordered extents and then
write the super.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:04 -05:00
Miao Xie
4eee4fa4f8 Btrfs: use wrapper page_offset
Use wrapper page_offset to get byte-offset into filesystem object for page.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:36:43 -05:00
Miao Xie
0e8c36a9fd Btrfs: fix lots of orphan inodes when the space is not enough
We're running into having 50-100 orphans left over with xfstests 83
because of ENOSPC when trying to start the transaction for the inode update.
But in fact, it makes no sense in updating the inode for the new size while
we're deleting the stupid thing. This patch fixes this problem.

Reported-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:36:39 -05:00
Chris Mason
0e4e026366 Merge branch 'for-linus' into raid56-experimental
Conflicts:
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-05 10:04:03 -05:00
David Woodhouse
53b381b3ab Btrfs: RAID5 and RAID6
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.

Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio.  This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.

But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle.  This will be addressed in a
later commit.

Scrub is unable to repair crc errors on raid5/6 chunks.

Discard does not work on raid5/6 (yet)

The stripe size is fixed at 64KiB per disk.  This will be tunable
in a later commit.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:23 -05:00
David Woodhouse
64a167011b Btrfs: add rw argument to merge_bio_hook()
We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 11:49:47 -05:00
Linus Torvalds
d7df025eb4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "It turns out that we had two crc bugs when running fsx-linux in a
  loop.  Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
  all down.  Miao also has a new OOM fix in this v2 pull as well.

  Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
  and resuming a running balance across drives.

  Josef's orphan truncate patch fixes an obscure corruption we'd see
  during xfstests.

  Arne's patches address problems with subvolume quotas.  If the user
  destroys quota groups incorrectly the FS will refuse to mount.

  The rest are smaller fixes and plugs for memory leaks."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
  Btrfs: fix repeated delalloc work allocation
  Btrfs: fix wrong max device number for single profile
  Btrfs: fix missed transaction->aborted check
  Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
  Btrfs: put csums on the right ordered extent
  Btrfs: use right range to find checksum for compressed extents
  Btrfs: fix panic when recovering tree log
  Btrfs: do not allow logged extents to be merged or removed
  Btrfs: fix a regression in balance usage filter
  Btrfs: prevent qgroup destroy when there are still relations
  Btrfs: ignore orphan qgroup relations
  Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
  Btrfs: fix unlock order in btrfs_ioctl_rm_dev
  Btrfs: fix unlock order in btrfs_ioctl_resize
  Btrfs: fix "mutually exclusive op is running" error code
  Btrfs: bring back balance pause/resume logic
  btrfs: update timestamps on truncate()
  btrfs: fix btrfs_cont_expand() freeing IS_ERR em
  Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
  Btrfs: fix off-by-one in lseek
  ...
2013-01-25 10:55:21 -08:00
Miao Xie
1eafa6c737 Btrfs: fix repeated delalloc work allocation
btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the
first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/
btrfs_queue_worker for this inode, and then it locks the list, checks the
head of the list again. But because we don't delete the first inode that it
deals with before, it will fetch the same inode. As a result, this function
allocates a huge amount of btrfs_delalloc_work structures, and OOM happens.

Fix this problem by splice this delalloc list.

Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:27 -05:00
Eric Sandeen
3972f2603d btrfs: update timestamps on truncate()
truncate() vs. ftruncate() differ in the VFS; truncate()
doesn't set (ATTR_CTIME | ATTR_MTIME), and it's up to the
fs to do the timestamp updates if the size changes.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
2013-01-14 13:53:37 -05:00
Zach Brown
f276795627 btrfs: fix btrfs_cont_expand() freeing IS_ERR em
btrfs_cont_expand() tries to free an IS_ERR em as it gets an error from
btrfs_get_extent() and breaks out of its loop.

An instance of -EEXIST was reported in the wild:

  https://bugzilla.redhat.com/show_bug.cgi?id=874407

I have no idea if that -EEXIST is surprising, or not.  Regardless, this
error handling should be cleaned up to handle other reasonable errors
(ENOMEM, EIO; whatever).

This seemed to be the only buggy freeing of the relatively rare IS_ERR
em so I opted to fix the caller rather than teach free_extent_map() to
use IS_ERR_OR_NULL().

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
2013-01-14 13:53:23 -05:00
Liu Bo
f9e4fb5393 Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
xfstests case 285 complains.

It it because btrfs did not try to find unwritten delalloc
bytes(only dirty pages, not yet writeback) behind prealloc
extents, it ends up finding nothing while we're with SEEK_DATA.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:53:22 -05:00
Josef Bacik
f3fe820c20 Btrfs: add orphan before truncating pagecache
Running xfstests 83 in a loop would sometimes fail the fsck.  This happens
because if we invalidate a page that already has an ordered extent setup for
it we will complete the ordered extent ourselves, assuming that the truncate
will clean everything up.  The problem with this is there is plenty of time
for the truncate to fail after we've done this work.  So to fix this we need
to add the orphan item first to make sure the cleanup gets done properly,
and then we can truncate the pagecache and all that stuff and be safe.  This
fixes the btrfsck failures I was seeing while running 83 in a loop.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:52 -05:00
Jeff Layton
39e3c9553f vfs: remove DCACHE_NEED_LOOKUP
The code that relied on that flag was ripped out of btrfs quite some
time ago, and never added back. Josef indicated that he was going to
take a different approach to the problem in btrfs, and that we
could just eliminate this flag.

Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 13:57:36 -05:00
Liu Bo
213490b301 Btrfs: fix a bug of per-file nocow
Users report a bug, the reproducer is:
$ mkfs.btrfs /dev/loop0
$ mount /dev/loop0 /mnt/btrfs/
$ mkdir /mnt/btrfs/dir
$ chattr +C /mnt/btrfs/dir/
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
$ lsattr /mnt/btrfs/dir/foo
---------------C- /mnt/btrfs/dir/foo
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 1 extent found    ---> an extent
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 3 extents found   ---> with nocow, btrfs breaks the extent into three parts

The new created file should not only inherit the NODATACOW flag, but also
honor NODATASUM flag, because we must do COW on a file extent with checksum.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-17 14:48:21 -05:00
Chris Mason
9c52057c69 Btrfs: fix hash overflow handling
The handling for directory crc hash overflows was fairly obscure,
split_leaf returns EOVERFLOW when we try to extend the item and that is
supposed to bubble up to userland.  For a while it did so, but along the
way we added better handling of errors and forced the FS readonly if we
hit IO errors during the directory insertion.

Along the way, we started testing only for EEXIST and the EOVERFLOW case
was dropped.  The end result is that we may force the FS readonly if we
catch a directory hash bucket overflow.

This fixes a few problem spots.  First I add tests for EOVERFLOW in the
places where we can safely just return the error up the chain.

btrfs_rename is harder though, because it tries to insert the new
directory item only after it has already unlinked anything the rename
was going to overwrite.  Rather than adding very complex logic, I added
a helper to test for the hash overflow case early while it is still safe
to bail out.

Snapshot and subvolume creation had a similar problem, so they are using
the new helper now too.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Pascal Junod <pascal@junod.info>
2012-12-17 14:48:21 -05:00
Filipe Brandenburger
9185aa587b Btrfs: fix permissions of empty files not affected by umask
When a new file is created with btrfs_create(), the inode will initially be
created with permissions 0666 and later on in btrfs_init_acl() it will be
adapted to mask out the umask bits. The problem is that this change won't make
it into the btrfs_inode unless there's another change to the inode (e.g. writing
content changing the size or touching the file changing the mtime.)

This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that
the change will not get lost if the in-memory inode is flushed before other
changes are made to the file.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:28 -05:00
Josef Bacik
6c760c0724 Btrfs: do not call file_update_time in aio_write
This starts a transaction and dirties the inode everytime we call it, which
is super expensive if you have a write heavy workload.  We will be updating
the inode when the IO completes and we reserve the space for the inode
update when we reserve space for the write, so there is no chance of loss of
information or enospc issues.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:27 -05:00
Josef Bacik
70c8a91ce2 Btrfs: log changed inodes based on the extent map tree
We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree.  So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:24 -05:00
Josef Bacik
b11e234d21 Btrfs: do not mark ems as prealloc if we are writing to them
We are going to use EM's to log extents in the future, so we need to not
mark them as prealloc if they aren't actually prealloc extents.  Instead
mark them with FILLING so we know to ammend mod_start/mod_len and that way
we don't confuse the extent logging code.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:23 -05:00
Josef Bacik
b493968096 Btrfs: keep track of the extents original block length
If we've written to a prealloc extent we need to know the original block len
for the extent.  We can't figure this out currently since ->block_len is
just set to the extent length.  So introduce ->orig_block_len so that we
know how many bytes were in the original extent for proper extent logging
that future patches will need.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:23 -05:00
Josef Bacik
b812ce2879 Btrfs: inline csums if we're fsyncing
The tree logging stuff needs the csums to be on the ordered extents in order
to log them properly, so mark that we're sync and inline the csum creation
so we don't have to wait on the csumming to be done when logging extents
that are still in flight.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:22 -05:00
Josef Bacik
e997615149 Btrfs: only log the inode item if we can get away with it
Currently we copy all the file information into the log, inode item, the
refs, xattrs etc.  Except most of this doesn't change from fsync to fsync,
just the inode item changes.  So set a flag if an xattr changes or a link is
added, and otherwise only log the inode item.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:21 -05:00
Miao Xie
ac6a2b36f9 Btrfs: fix wrong return value of btrfs_truncate_page()
ret variant may be set to 0 if we read page successfully, but it might be
released before we lock it again. On this case, if we fail to allocate a
new page, we will return 0, it is wrong, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:20 -05:00
Miao Xie
543eabd5e1 Btrfs: don't auto defrag a file when doing directIO
If we runt the direct IO, we should not run auto defrag, because it may
introduce buffered IO vs direcIO problem, and make direct IO slow down.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:18 -05:00
Filipe Brandenburger
43baa579b3 Btrfs: refactor error handling to drop inode in btrfs_create()
Refactor it by checking whether the inode has been created and needs to be
dropped (drop_inode_on_err) and also if the err variable is set. That way the
variable doesn't need to be set on each and every error handling block.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:17 -05:00
Filipe Brandenburger
2794ed013b Btrfs: fix permissions of empty files not affected by umask
When a new file is created with btrfs_create(), the inode will initially be
created with permissions 0666 and later on in btrfs_init_acl() it will be
adapted to mask out the umask bits. The problem is that this change won't make
it into the btrfs_inode unless there's another change to the inode (e.g. writing
content changing the size or touching the file changing the mtime.)

This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that
the change will not get lost if the in-memory inode is flushed before other
changes are made to the file.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:16 -05:00
Tsutomu Itoh
05dadc09f5 Btrfs: add fiemap's flag check
When the flag not supported is specified, it is necessary to return the error
to the caller.
So, we add the validity check of the fiemap's flag.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:16 -05:00
Stefan Behrens
618919236b Btrfs: handle errors from btrfs_map_bio() everywhere
With the addition of the device replace procedure, it is possible
for btrfs_map_bio(READ) to report an error. This happens when the
specific mirror is requested which is located on the target disk,
and the copy operation has not yet copied this block. Hence the
block cannot be read and this error state is indicated by
returning EIO.
Some background information follows now. A new mirror is added
while the device replace procedure is running.
btrfs_get_num_copies() returns one more, and
btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk
location is involved that was already handled by the device
replace copy operation. The assigned mirror num is the highest
mirror number, e.g. the value 3 in case of RAID1.
If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select
any mirror), the copy on the target drive is never selected
because that disk shall be able to perform the write requests as
quickly as possible. The parallel execution of read requests would
only slow down the disk copy procedure. Second case is that
btrfs_map_bio() is called with mirror_num > 0. This is done from
the repair code only. In this case, the highest mirror num is
assigned to the target disk, since it is used last. And when this
mirror is not available because the copy procedure has not yet
handled this area, an error is returned. Everywhere in the code
the handling of such errors is added now.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:40 -05:00
Stefan Behrens
3ec706c831 Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree
This is required for the device replace procedure in a later step.
Two calling functions also had to be changed to have the fs_info
pointer: repair_io_failure() and scrub_setup_recheck_block().

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:34 -05:00
Liu Bo
b53d3f5db2 Btrfs: cleanup for btrfs_btree_balance_dirty
- 'nr' is no more used.
- btrfs_btree_balance_dirty() and __btrfs_btree_balance_dirty() can share
  a bunch of code.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:28 -05:00
Julia Lawall
6c1500f22a fs/btrfs: drop if around WARN_ON
Just use WARN_ON rather than an if containing only WARN_ON(1).

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression e;
@@
- if (e) WARN_ON(1);
+ WARN_ON(e);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:24 -05:00
Julia Lawall
31b1a2bd75 fs/btrfs: use WARN
Use WARN rather than printk followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
  es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:23 -05:00
Miao Xie
b7d5b0a819 Btrfs: fix joining the same transaction handler more than 2 times
If we flush inodes with pending delalloc in a transaction, we may join
the same transaction handler more than 2 times.

The reason is:
  Task						use_count of trans handle
  commit_transaction				1
    |-> btrfs_start_delalloc_inodes		1
	  |-> run_delalloc_nocow		1
		|-> join_transaction		2
		|-> cow_file_range		2
			|-> join_transaction	3

In fact, cow_file_range needn't join the transaction again because the caller
have joined the transaction, so we fix this problem by this way.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:20 -05:00
Miao Xie
8ccf6f19b6 Btrfs: make delalloc inodes be flushed by multi-task
This patch introduce a new worker pool named "flush_workers", and if we
want to force all the inode with pending delalloc to the disks, we can
queue those inodes into the work queue of the worker pool, in this way,
those inodes will be flushed by multi-task.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:37 -05:00
Miao Xie
08e007d2e5 Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.

We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:31 -05:00
Linus Torvalds
f48d42773b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has our series of fixes for the next rc.  The biggest batch is
  from Jan Schmidt, fixing up some problems in our subvolume quota code
  and fixing btrfs send/receive to work with the new extended inode
  refs."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: do not bug when we fail to commit the transaction
  Btrfs: fix memory leak when cloning root's node
  Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
  Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
  Btrfs: fix deadlock caused by the nested chunk allocation
  btrfs: Return EINVAL when length to trim is less than FSB
  Btrfs: fix memory leak in btrfs_quota_enable()
  Btrfs: send correct rdev and mode in btrfs-send
  Btrfs: extended inode refs support for send mechanism
  Btrfs: Fix wrong error handling code
  Fix a sign bug causing invalid memory access in the ino_paths ioctl.
  Btrfs: comment for loop in tree_mod_log_insert_move
  Btrfs: fix extent buffer reference for tree mod log roots
  Btrfs: determine level of old roots
  Btrfs: tree mod log's old roots could still be part of the tree
  Btrfs: fix a tree mod logging issue for root replacement operations
  Btrfs: don't put removals from push_node_left into tree mod log twice
2012-10-26 09:34:04 -07:00
Josef Bacik
be6aef6049 Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
On a really full file system I was getting ENOSPC back from
btrfs_update_inode when trying to update the parent inode when creating a
snapshot.  Just use the fallback method so we can update the inode and not
have to worry about having a delayed ref.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-25 15:50:18 -04:00
Linus Torvalds
72055425e5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "This is a large pull, with the bulk of the updates coming from:

   - Hole punching

   - send/receive fixes

   - fsync performance

   - Disk format extension allowing more hardlinks inside a single
     directory (btrfs-progs patch required to enable the compat bit for
     this one)

  I'm cooking more unrelated RAID code, but I wanted to make sure this
  original batch makes it in.  The largest updates here are relatively
  old and have been in testing for some time."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (121 commits)
  btrfs: init ref_index to zero in add_inode_ref
  Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer
  Btrfs: fix page leakage
  Btrfs: do not warn_on when we cannot alloc a page for an extent buffer
  Btrfs: don't bug on enomem in readpage
  Btrfs: cleanup pages properly when ENOMEM in compression
  Btrfs: make filesystem read-only when submitting barrier fails
  Btrfs: detect corrupted filesystem after write I/O errors
  Btrfs: make compress and nodatacow mount options mutually exclusive
  btrfs: fix message printing
  Btrfs: don't bother committing delayed inode updates when fsyncing
  btrfs: move inline function code to header file
  Btrfs: remove unnecessary IS_ERR in bio_readpage_error()
  btrfs: remove unused function btrfs_insert_some_items()
  Btrfs: don't commit instead of overcommitting
  Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called
  Btrfs: be smarter about dropping things from the tree log
  Btrfs: don't lookup csums for prealloc extents
  Btrfs: cache extent state when writing out dirty metadata pages
  Btrfs: do not hold the file extent leaf locked when adding extent item
  ...
2012-10-10 10:49:20 +09:00
Tsutomu Itoh
f0bd95ea72 Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called
We should confirm the value of extent_map before calling
trace_btrfs_get_extent() because the value of extent_map has the
possibility of NULL.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-09 09:15:42 -04:00
Josef Bacik
ce19533256 Btrfs: do not hold the file extent leaf locked when adding extent item
For some reason we unlock everything except the leaf we are on, set the path
blocking and then add the extent item for the extent we just finished
writing.  I can't for the life of me figure out why we would want to do
this, and the history doesn't really indicate that there was a real reason
for it, so just remove it.  This will reduce our tree lock contention on
heavy writes.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09 09:15:40 -04:00
Miao Xie
a698d0755a Btrfs: add a type field for the transaction handle
This patch add a type field into the transaction handle structure,
in this way, we needn't implement various end-transaction functions
and can make the code more simple and readable.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-09 09:15:38 -04:00
Mark Fasheh
f186373fef btrfs: extended inode refs
This patch adds basic support for extended inode refs. This includes support
for link and unlink of the refs, which basically gets us support for rename
as well.

Inode creation does not need changing - extended refs are only added after
the ref array is full.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-10-09 09:14:45 -04:00
David Sterba
b3ae244e71 btrfs: return EPERM upon rmdir on a subvolume
A subvolume cannot be deleted via rmdir, but the error code ENOTEMPTY
is confusing. Return EPERM instead, as this is not permitted.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-04 09:39:56 -04:00
Linus Torvalds
aab174f0df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - big one - consolidation of descriptor-related logics; almost all of
   that is moved to fs/file.c

   (BTW, I'm seriously tempted to rename the result to fd.c.  As it is,
   we have a situation when file_table.c is about handling of struct
   file and file.c is about handling of descriptor tables; the reasons
   are historical - file_table.c used to be about a static array of
   struct file we used to have way back).

   A lot of stray ends got cleaned up and converted to saner primitives,
   disgusting mess in android/binder.c is still disgusting, but at least
   doesn't poke so much in descriptor table guts anymore.  A bunch of
   relatively minor races got fixed in process, plus an ext4 struct file
   leak.

 - related thing - fget_light() partially unuglified; see fdget() in
   there (and yes, it generates the code as good as we used to have).

 - also related - bits of Cyrill's procfs stuff that got entangled into
   that work; _not_ all of it, just the initial move to fs/proc/fd.c and
   switch of fdinfo to seq_file.

 - Alex's fs/coredump.c spiltoff - the same story, had been easier to
   take that commit than mess with conflicts.  The rest is a separate
   pile, this was just a mechanical code movement.

 - a few misc patches all over the place.  Not all for this cycle,
   there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
  MAX_LFS_FILESIZE should be a loff_t
  compat: fs: Generic compat_sys_sendfile implementation
  fs: push rcu_barrier() from deactivate_locked_super() to filesystems
  btrfs: reada_extent doesn't need kref for refcount
  coredump: move core dump functionality into its own file
  coredump: prevent double-free on an error path in core dumper
  usb/gadget: fix misannotations
  fcntl: fix misannotations
  ceph: don't abuse d_delete() on failure exits
  hypfs: ->d_parent is never NULL or negative
  vfs: delete surplus inode NULL check
  switch simple cases of fget_light to fdget
  new helpers: fdget()/fdput()
  switch o2hb_region_dev_write() to fget_light()
  proc_map_files_readdir(): don't bother with grabbing files
  make get_file() return its argument
  vhost_set_vring(): turn pollstart/pollstop into bool
  switch prctl_set_mm_exe_file() to fget_light()
  switch xfs_find_handle() to fget_light()
  switch xfs_swapext() to fget_light()
  ...
2012-10-02 20:25:04 -07:00
Kirill A. Shutemov
8c0a853770 fs: push rcu_barrier() from deactivate_locked_super() to filesystems
There's no reason to call rcu_barrier() on every
deactivate_locked_super().  We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths.  E.g.  on my machine exit_group() of a last process in IPC
namespace takes 0.07538s.  rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-02 21:35:55 -04:00
Linus Torvalds
437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Miao Xie
962197babe Btrfs: fix unnecessary warning when the fragments make the space alloc fail
When we wrote some data by compress mode into a btrfs filesystem which was full
of the fragments, the kernel will report:
	BTRFS warning (device xxx): Aborting unused transaction.

The reason is:
We can not find a long enough free space to store the compressed data because
of the fragmentary free space, and the compressed data can not be splited,
so the kernel outputed the above message.

In fact, btrfs can deal with this problem very well: it fall back to
uncompressed IO, split the uncompressed data into small ones, and then
store them into to the fragmentary free space. So we shouldn't output the
above warning message.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:20 -04:00
Josef Bacik
69ffb54347 Btrfs: create a pinned em when writing to a prealloc range in DIO
Wade Cline reported a problem where he was getting garbage and warnings when
writing to a preallocated range via O_DIRECT.  This is because we weren't
creating our normal pinned extent_map for the range we were writing to,
which was causing all sorts of issues.  This patch fixes the problem and
makes his testcase much happier.  Thanks,

Reported-by: Wade Cline <clinew@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:20 -04:00
Miao Xie
8407aa4643 Btrfs: fix corrupted metadata in the snapshot
When we delete a inode, we will remove all the delayed items including delayed
inode update, and then truncate all the relative metadata. If there is lots of
metadata, we will end the current transaction, and start a new transaction to
truncate the left metadata. In this way, we will leave a inode item that its
link counter is > 0, and also may leave some directory index items in fs/file tree
after the current transaction ends. In other words, the metadata in this fs/file tree
is inconsistent. If we create a snapshot for this tree now, we will find a inode with
corrupted metadata in the new snapshot, and we won't continue to drop the left metadata,
because its link counter is not 0.

We fix this problem by updating the inode item before the current transaction ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:17 -04:00
David Sterba
837e197283 btrfs: polish names of kmem caches
Usecase:

  watch 'grep btrfs < /proc/slabinfo'

easy to watch all caches in one go.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-01 15:19:16 -04:00
Liu Bo
9e8a4a8b0b Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag
We're going to use this flag EXTENT_DEFRAG to indicate which range
belongs to defragment so that we can implement snapshow-aware defrag:

We set the EXTENT_DEFRAG flag when dirtying the extents that need
defragmented, so later on writeback thread can differentiate between
normal writeback and writeback started by defragmentation.

Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:15 -04:00
Miao Xie
66d8f3dd1c Btrfs: add a new "type" field into the block reservation structure
Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:11 -04:00
Sage Weil
ac14aed665 Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs()
Josef has suggested that this is not necessary.  Removing it also avoids
this lockdep splat (after the new sb_internal locking stuff was added):

[  604.090449] ======================================================
[  604.114819] [ INFO: possible circular locking dependency detected ]
[  604.139262] 3.6.0-rc2-ceph-00144-g463b030 #1 Not tainted
[  604.162193] -------------------------------------------------------
[  604.186139] btrfs-cleaner/6669 is trying to acquire lock:
[  604.209555]  (sb_internal#2){.+.+..}, at: [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
[  604.257100]
[  604.257100] but task is already holding lock:
[  604.300366]  (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs]
[  604.352989]
[  604.352989] which lock already depends on the new lock.
[  604.352989]
[  604.427104]
[  604.427104] the existing dependency chain (in reverse order) is:
[  604.478493]
[  604.478493] -> #1 (&fs_info->cleanup_work_sem){.+.+..}:
[  604.529313]        [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
[  604.559621]        [<ffffffff81632b69>] down_read+0x39/0x4e
[  604.589382]        [<ffffffffa004db98>] btrfs_lookup_dentry+0x218/0x550 [btrfs]
[  604.596161] btrfs: unlinked 1 orphans
[  604.675002]        [<ffffffffa006aadd>] create_subvol+0x62d/0x690 [btrfs]
[  604.708859]        [<ffffffffa006d666>] btrfs_mksubvol.isra.52+0x346/0x3a0 [btrfs]
[  604.772466]        [<ffffffffa006d7f2>] btrfs_ioctl_snap_create_transid+0x132/0x190 [btrfs]
[  604.842245]        [<ffffffffa006d8ae>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs]
[  604.912852]        [<ffffffffa00708ae>] btrfs_ioctl+0x138e/0x1990 [btrfs]
[  604.951888]        [<ffffffff8118e9b8>] do_vfs_ioctl+0x98/0x560
[  604.989961]        [<ffffffff8118ef11>] sys_ioctl+0x91/0xa0
[  605.026628]        [<ffffffff8163d569>] system_call_fastpath+0x16/0x1b
[  605.064404]
[  605.064404] -> #0 (sb_internal#2){.+.+..}:
[  605.126832]        [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90
[  605.163671]        [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
[  605.200228]        [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0
[  605.236818]        [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
[  605.274029]        [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs]
[  605.340520]        [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs]
[  605.378720]        [<ffffffff811972c8>] evict+0xb8/0x1c0
[  605.416057]        [<ffffffff811974d5>] iput+0x105/0x210
[  605.452373]        [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs]
[  605.521627]        [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs]
[  605.560520]        [<ffffffff810791ee>] kthread+0xae/0xc0
[  605.598094]        [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
[  605.636499]
[  605.636499] other info that might help us debug this:
[  605.636499]
[  605.736504]  Possible unsafe locking scenario:
[  605.736504]
[  605.801931]        CPU0                    CPU1
[  605.835126]        ----                    ----
[  605.867093]   lock(&fs_info->cleanup_work_sem);
[  605.898594]                                lock(sb_internal#2);
[  605.931954]                                lock(&fs_info->cleanup_work_sem);
[  605.965359]   lock(sb_internal#2);
[  605.994758]
[  605.994758]  *** DEADLOCK ***
[  605.994758]
[  606.075281] 2 locks held by btrfs-cleaner/6669:
[  606.104528]  #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b5d5>] cleaner_kthread+0x95/0x120 [btrfs]
[  606.165626]  #1:  (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs]
[  606.231297]
[  606.231297] stack backtrace:
[  606.287723] Pid: 6669, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00144-g463b030 #1
[  606.347823] Call Trace:
[  606.376184]  [<ffffffff8162a77c>] print_circular_bug+0x1fb/0x20c
[  606.409243]  [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90
[  606.441343]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.474583]  [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
[  606.505934]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.539429]  [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
[  606.571719]  [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0
[  606.603498]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.637405]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.670165]  [<ffffffff81172e75>] ? kmem_cache_alloc+0xb5/0x160
[  606.702144]  [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
[  606.735562]  [<ffffffffa00256a6>] ? block_rsv_add_bytes+0x56/0x80 [btrfs]
[  606.769861]  [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs]
[  606.804575]  [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs]
[  606.838756]  [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
[  606.872010]  [<ffffffff811972c8>] evict+0xb8/0x1c0
[  606.903800]  [<ffffffff811974d5>] iput+0x105/0x210
[  606.935416]  [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs]
[  606.970510]  [<ffffffffa003b5d5>] ? cleaner_kthread+0x95/0x120 [btrfs]
[  607.005648]  [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs]
[  607.040724]  [<ffffffffa003b540>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
[  607.104740]  [<ffffffff810791ee>] kthread+0xae/0xc0
[  607.137119]  [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
[  607.169797]  [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
[  607.202472]  [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
[  607.235884]  [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
[  607.268731]  [<ffffffff8163e740>] ? gs_change+0x13/0x13

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 15:19:08 -04:00
Josef Bacik
2aaa665581 Btrfs: add hole punching
This patch adds hole punching via fallocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:07 -04:00
Josef Bacik
2671485d39 Btrfs: remove unused hint byte argument for btrfs_drop_extents
I audited all users of btrfs_drop_extents and found that nobody actually uses
the hint_byte argument.  I'm sure it was used for something at some point but
it's not used now, and the way the pinning works the disk bytenr would never be
immediately useful anyway so lets just remove it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:06 -04:00
Liu Bo
46d8bc3424 Btrfs: fix a bug in checking whether a inode is already in log
This is based on Josef's "Btrfs: turbo charge fsync".

The current btrfs checks if an inode is in log by comparing
root's last_log_commit to inode's last_sub_trans[2].

But the problem is that this root->last_log_commit is shared among
inodes.

Say we have N inodes to be logged, after the first inode,
root's last_log_commit is updated and the N-1 remained files will
be skipped.

This fixes the bug by keeping a local copy of root's last_log_commit
inside each inode and this local copy will be maintained itself.

[1]: we regard each log transaction as a subset of btrfs's transaction,
i.e. sub_trans

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:06 -04:00
Miao Xie
321f0e7022 Btrfs: fix wrong orphan count of the fs/file tree
If we add a new orphan item, we should increase the atomic counter,
not decrease it. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:05 -04:00
Liu Bo
4e2f84e63d Btrfs: improve fsync by filtering extents that we want
This is based on Josef's "Btrfs: turbo charge fsync".

The above Josef's patch performs very good in random sync write test,
because we won't have too much extents to merge.

However, it does not performs good on the test:
dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync

The reason is when we do sequencial sync write, we need to merge the
current extent just with the previous one, so that we can get accumulated
extents to log:

A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...

So we'll have to flush more and more checksum into log tree, which is the
bottleneck according to my tests.

But we can avoid this by telling fsync the real extents that are needed
to be logged.

With this, I did the above dd sync write test (size=50m),

         w/o (orig)   w/ (josef's)   w/ (this)
SATA      104KB/s       109KB/s       121KB/s
ramdisk   1.5MB/s       1.5MB/s       10.7MB/s (613%)

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:05 -04:00
Josef Bacik
ca7e70f590 Btrfs: do not needlessly restart the transaction for enospc
We will stop and restart a transaction every time we move to a different leaf
when truncating a file.  This is for enospc reasons, but really we could
probably get away with doing this a little better by actually working until we
hit an ENOSPC.  So add a ->failfast flag to the block_rsv and set it when we do
truncates which will fail as soon as the block rsv runs out of space, and then
at that point we can stop and restart the transaction and refill the block rsv
and carry on.  This will make rm'ing of a file with lots of extents a bit
faster.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:04 -04:00
Josef Bacik
5dc562c541 Btrfs: turbo charge fsync
At least for the vm workload.  Currently on fsync we will

1) Truncate all items in the log tree for the given inode if they exist

and

2) Copy all items for a given inode into the log

The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing.  This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them.  We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already.  Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write

		Original	Patched
SATA drive	82KB/s		140KB/s
Fusion drive	431KB/s		2532KB/s

So around 2-6 times faster depending on your hardware.  There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok.  This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get.  All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.

The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok.  I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:03 -04:00
Josef Bacik
7c735313bd Btrfs: update last trans if we don't update the inode
There is a completely impossible situation to hit where you can preallocate
a file, fsync it, write into the preallocated region, have the transaction
commit twice and then fsync and then immediately lose power and lose all of
the contents of the write.  This patch fixes this just so I feel better
about the situation and because it is lightweight, we just update the
last_trans when we finish an ordered IO and we don't update the inode
itself.  This way we are completely safe and I feel better.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:02 -04:00
Linus Torvalds
99dbb1632f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull the trivial tree from Jiri Kosina:
 "Tiny usual fixes all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  doc: fix old config name of kprobetrace
  fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
  btrfs: fix the commment for the action flags in delayed-ref.h
  btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
  vfs: fix kerneldoc for generic_fh_to_parent()
  treewide: fix comment/printk/variable typos
  ipr: fix small coding style issues
  doc: fix broken utf8 encoding
  nfs: comment fix
  platform/x86: fix asus_laptop.wled_type module parameter
  mfd: printk/comment fixes
  doc: getdelays.c: remember to close() socket on error in create_nl_socket()
  doc: aliasing-test: close fd on write error
  mmc: fix comment typos
  dma: fix comments
  spi: fix comment/printk typos in spi
  Coccinelle: fix typo in memdup_user.cocci
  tmiofb: missing NULL pointer checks
  tools: perf: Fix typo in tools/perf
  tools/testing: fix comment / output typos
  ...
2012-10-01 09:06:36 -07:00
Eric W. Biederman
2f2f43d3c7 userns: Convert btrfs to use kuid/kgid where appropriate
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-09-21 03:13:31 -07:00
Liu Bo
8bad3c0244 btrfs: fix comment typo in btrfs_finish_ordered_io
Fix typo errors in comments of btrfs_finish_ordered_io.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-09-01 08:27:57 -07:00
Linus Torvalds
318e151019 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I've split out the big send/receive update from my last pull request
  and now have just the fixes in my for-linus branch.  The send/recv
  branch will wander over to linux-next shortly though.

  The largest patches in this pull are Josef's patches to fix DIO
  locking problems and his patch to fix a crash during balance.  They
  are both well tested.

  The rest are smaller fixes that we've had queued.  The last rc came
  out while I was hacking new and exciting ways to recover from a
  misplaced rm -rf on my dev box, so these missed rc3."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
  Btrfs: fix that repair code is spuriously executed for transid failures
  Btrfs: fix ordered extent leak when failing to start a transaction
  Btrfs: fix a dio write regression
  Btrfs: fix deadlock with freeze and sync V2
  Btrfs: revert checksum error statistic which can cause a BUG()
  Btrfs: remove superblock writing after fatal error
  Btrfs: allow delayed refs to be merged
  Btrfs: fix enospc problems when deleting a subvol
  Btrfs: fix wrong mtime and ctime when creating snapshots
  Btrfs: fix race in run_clustered_refs
  Btrfs: don't run __tree_mod_log_free_eb on leaves
  Btrfs: increase the size of the free space cache
  Btrfs: barrier before waitqueue_active
  Btrfs: fix deadlock in wait_for_more_refs
  btrfs: fix second lock in btrfs_delete_delayed_items()
  Btrfs: don't allocate a seperate csums array for direct reads
  Btrfs: do not strdup non existent strings
  Btrfs: do not use missing devices when showing devname
  Btrfs: fix that error value is changed by mistake
  Btrfs: lock extents as we map them in DIO
  ...
2012-08-29 11:36:22 -07:00
Liu Bo
d280e5be94 Btrfs: fix ordered extent leak when failing to start a transaction
We cannot just return error before freeing ordered extent and releasing reserved
space when we fail to start a transacion.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-08-28 16:53:42 -04:00
Liu Bo
24c03fa5cf Btrfs: fix a dio write regression
This bug is introduced by commit 3b8bde746f6f9bd36a9f05f5f3b6e334318176a9
(Btrfs: lock extents as we map them in DIO).

In dio write, we should unlock the section which we didn't do IO on in case that
we fall back to buffered write.  But we need to not only unlock the section
but also cleanup reserved space for the section.

This bug was found while running xfstests 133, with this 133 no longer complains.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-08-28 16:53:41 -04:00
Josef Bacik
5a24e84c55 Btrfs: fix enospc problems when deleting a subvol
Subvol delete is a special kind of awful where we use the global reserve to
cover the ENOSPC requirements.  The problem is once we're done removing
everything we do a btrfs_update_inode(), which by default will try to do the
delayed update stuff which will use it's own reserve.  There will be no
space in this reserve and we'll return ENOSPC.  So instead use
btrfs_update_inode_fallback() which will just fallback to updating the inode
item in the case of enospc.  This is fine because the global reserve covers
the space requirements for this.  With this patch I can now delete a subvol
on a problem image Dave Sterba sent me.  Thanks,

Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28 16:53:37 -04:00
Josef Bacik
66657b318e Btrfs: barrier before waitqueue_active
We need a barrir before calling waitqueue_active otherwise we will miss
wakeups.  So in places that do atomic_dec(); then atomic_read() use
atomic_dec_return() which imply a memory barrier (see memory-barriers.txt)
and then add an explicit memory barrier everywhere else that need them.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:33 -04:00
Josef Bacik
c329861da4 Btrfs: don't allocate a seperate csums array for direct reads
We've been allocating a big array for csums instead of storing them in the
io_tree like we do for buffered reads because previously we were locking the
entire range, so we didn't have an extent state for each sector of the
range.  But now that we do the range locking as we map the buffers we can
limit the mapping lenght to sectorsize and use the private part of the
io_tree for our csums.  This allows us to avoid an extra memory allocation
for direct reads which could incur latency.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:30 -04:00
Josef Bacik
eb838e73dc Btrfs: lock extents as we map them in DIO
A deadlock in xfstests 113 was uncovered by commit

d187663ef2

This is because we would not return EIOCBQUEUED for short AIO reads, instead
we'd wait for the DIO to complete and then return the amount of data we
transferred, which would allow our stuff to unlock the remaning amount.  But
with this change this no longer happens, so if we have a short AIO read (for
example if we try to read past EOF), we could leave the section from EOF to
the end of where we tried to read locked.  Fixing this is tricky since there
is no clear way to know exactly how much data DIO truly submitted for IO, so
to make this less hard on ourselves and less combersome we need to lock the
extents as we try to map them, and then we unlock any areas we didn't
actually map.  This makes us completely safe from deadlocks and reliance on
a particular behavior of the DIO code.  This also lays the groundwork for
allowing us to use the normal csum storage method for reads which means we
can remove an allocation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:27 -04:00
Artem Bityutskiy
b257031408 btrfs: nuke pdflush from comments
The pdflush thread is long gone, so this patch removes references to pdflush
from btrfs comments.

Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-04 12:15:35 +04:00
Linus Torvalds
a0e881b7c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs pile from Al Viro:
 "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
  deadlock reproduced by xfstests 068), symlink and hardlink restriction
  patches, plus assorted cleanups and fixes.

  Note that another fsfreeze deadlock (emergency thaw one) is *not*
  dealt with - the series by Fernando conflicts a lot with Jan's, breaks
  userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
  for massive vfsmount leak; this is going to be handled next cycle.
  There probably will be another pull request, but that stuff won't be
  in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  delousing target_core_file a bit
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  fs: Remove old freezing mechanism
  ext2: Implement freezing
  btrfs: Convert to new freezing mechanism
  nilfs2: Convert to new freezing mechanism
  ntfs: Convert to new freezing mechanism
  fuse: Convert to new freezing mechanism
  gfs2: Convert to new freezing mechanism
  ocfs2: Convert to new freezing mechanism
  xfs: Convert to new freezing code
  ext4: Convert to new freezing mechanism
  fs: Protect write paths by sb_start_write - sb_end_write
  fs: Skip atime update on frozen filesystem
  fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  fs: Improve filesystem freezing handling
  switch the protection of percpu_counter list to spinlock
  nfsd: Push mnt_want_write() outside of i_mutex
  btrfs: Push mnt_want_write() outside of i_mutex
  fat: Push mnt_want_write() outside of i_mutex
  ...
2012-08-01 10:26:23 -07:00
Jan Kara
b2b5ef5c8e btrfs: Convert to new freezing mechanism
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.

Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 09:45:52 +04:00
Linus Torvalds
e2aed8dfa5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull large btrfs update from Chris Mason:
 "This pull request is very large, and the two main features in here
  have been under testing/devel for quite a while.

  We have subvolume quotas from the strato developers.  This enables
  full tracking of how many blocks are allocated to each subvolume (and
  all snapshots) and you can set limits on a per-subvolume basis.  You
  can also create quota groups and toss multiple subvolumes into a big
  group.  It's everything you need to be a web hosting company and give
  each user their own subvolume.

  The userland side of the quotas is being refreshed, they'll send out
  details on where to grab it soon.

  Next is the kernel side of btrfs send/receive from Alexander Block.
  This leverages the same infrastructure as the quota code to figure out
  relationships between blocks and their owners.  It can then compute
  the difference between two snapshots and sends the diffs in a neutral
  format into userland.

  The basic model:

        create a snapshot
        send that snapshot as the initial backup
        make changes
        create a second snapshot
        send the incremental as a backup
        delete the first snapshot
        (use the second snapshot for the next incremental)

  The receive portion is all in userland, and in the 'next' branch of my
  btrfs-progs repo.

  There's still some work to do in terms of optimizing the send side
  from kernel to userland.  The really important part is figuring out
  how two snapshots are different, and this is where we are
  concentrating right now.  The initial send of a dataset is a little
  slower than tar, but the incremental sends are dramatically faster
  than what rsync can do.

  On top of all of that, we have a nice queue of fixes, cleanups and
  optimizations."

Fix up trivial modify/del conflict in fs/btrfs/ioctl.c

Also fix up semantic conflict in fs/btrfs/send.c: the interface to
dentry_open() changed in commit 765927b2d5 ("switch dentry_open() to
struct path, make it grab references itself"), and since it now grabs
whatever references it needs, we should no longer do the mntget() on the
mnt (and we need to dput() the dentry reference we took).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (65 commits)
  Btrfs: uninit variable fixes in send/receive
  Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive
  Btrfs: add btrfs_compare_trees function
  Btrfs: introduce subvol uuids and times
  Btrfs: make iref_to_path non static
  Btrfs: add a barrier before a waitqueue_active check
  Btrfs: call the ordered free operation without any locks held
  Btrfs: Check INCOMPAT flags on remount and add helper function
  Btrfs: add helper for tree enumeration
  btrfs: allow cross-subvolume file clone
  Btrfs: improve multi-thread buffer read
  Btrfs: make btrfs's allocation smoothly with preallocation
  Btrfs: lock the transition from dirty to writeback for an eb
  Btrfs: fix potential race in extent buffer freeing
  Btrfs: don't return true in releasepage unless we actually freed the eb
  Btrfs: suppress printk() if all device I/O stats are zero
  Btrfs: remove unwanted printk() for btrfs device I/O stats
  Btrfs: rewrite BTRFS_SETGET_FUNCS
  Btrfs: zero unused bytes in inode item
  Btrfs: kill free_space pointer from inode structure
  ...

Conflicts:
	fs/btrfs/ioctl.c
2012-07-26 14:48:55 -07:00
Chris Mason
113c1cb530 Merge branch 'send-v2' of git://github.com/ablock84/linux-btrfs into for-linus
This is the kernel portion of btrfs send/receive

Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/backref.h
	fs/btrfs/ctree.c
	fs/btrfs/ioctl.c
	fs/btrfs/ioctl.h

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 19:19:10 -04:00
Alexander Block
8ea05e3a42 Btrfs: introduce subvol uuids and times
This patch introduces uuids for subvolumes. Each
subvolume has it's own uuid. In case it was snapshotted,
it also contains parent_uuid. In case it was received,
it also contains received_uuid.

It also introduces subvolume ctime/otime/stime/rtime. The
first two are comparable to the times found in inodes. otime
is the origin/creation time and ctime is the change time.
stime/rtime are only valid on received subvolumes.
stime is the time of the subvolume when it was
sent. rtime is the time of the subvolume when it was
received.

Additionally to the times, we have a transid for each
time. They are updated at the same place as the times.

btrfs receive uses stransid and rtransid to find out
if a received subvolume changed in the meantime.

If an older kernel mounts a filesystem with the
extented fields, all fields become invalid. The next
mount with a new kernel will detect this and reset the
fields.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
2012-07-25 23:28:38 +02:00
Li Zefan
293f7e0740 Btrfs: zero unused bytes in inode item
The otime field is not zeroed, so users will see random otime in an old
filesystem with a new kernel which has otime support in the future.

The reserved bytes are also not zeroed, and we'll have compatibility
issue if we make use of those bytes.

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:05 -04:00
Li Zefan
b4d7c3c945 Btrfs: kill free_space pointer from inode structure
Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
which means every inode has the same BTRFS_I(inode)->free_space pointer.

This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:05 -04:00
Liu Bo
83eea1f1ba Btrfs: kill root from btrfs_is_free_space_inode
Since root can be fetched via BTRFS_I macro directly, we can save an args
for btrfs_is_free_space_inode().

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:00 -04:00
Liu Bo
287082b0bd Btrfs: fix typo in cow_file_range_async and async_cow_submit
It should be 10 * 1024 * 1024.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-23 16:27:55 -04:00
Tsutomu Itoh
b995929515 Btrfs: return error of btrfs_update_inode() to caller
We didn't check error of btrfs_update_inode(), but that error looks
easy to bubble back up.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:54 -04:00
Alexander Block
2bc5565286 Btrfs: don't update atime on RO subvolumes
Before the update_time inode operation was indroduced, it was
not possible to prevent updates of atime on RO subvolumes. VFS
was only able to check for RO on the mount, but did not know
anything about btrfs subvolumes.

btrfs_update_time does now check if the root is RO and skip
updating of times.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-07-23 15:41:38 -04:00
Al Viro
ebfc3b49a7 don't pass nameidata to ->create()
boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:47 +04:00
Al Viro
00cd8dd3bf stop passing nameidata to ->lookup()
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument.  And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:32 +04:00
Al Viro
b3d9b7a3c7 vfs: switch i_dentry/d_alias to hlist
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:32:55 +04:00
Linus Torvalds
5eecb9cc90 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "I held off on my rc5 pull because I hit an oops during log recovery
  after a crash.  I wanted to make sure it wasn't a regression because
  we have some logging fixes in here.

  It turns out that a commit during the merge window just made it much
  more likely to trigger directory logging instead of full commits,
  which exposed an old bug.

  The new backref walking code got some additional fixes.  This should
  be the final set of them.

  Josef fixed up a corner where our O_DIRECT writes and buffered reads
  could expose old file contents (not stale, just not the most recent).
  He and Liu Bo fixed crashes during tree log recover as well.

  Ilya fixed errors while we resume disk balancing operations on
  readonly mounts."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: run delayed directory updates during log replay
  Btrfs: hold a ref on the inode during writepages
  Btrfs: fix tree log remove space corner case
  Btrfs: fix wrong check during log recovery
  Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS
  Btrfs: resume balance on rw (re)mounts properly
  Btrfs: restore restriper state on all mounts
  Btrfs: fix dio write vs buffered read race
  Btrfs: don't count I/O statistic read errors for missing devices
  Btrfs: resolve tree mod log locking issue in btrfs_next_leaf
  Btrfs: fix tree mod log rewind of ADD operations
  Btrfs: leave critical region in btrfs_find_all_roots as soon as possible
  Btrfs: always put insert_ptr modifications into the tree mod log
  Btrfs: fix tree mod log for root replacements at leaf level
  Btrfs: support root level changes in __resolve_indirect_ref
  Btrfs: avoid waiting for delayed refs when we must not
2012-07-05 13:06:25 -07:00
Liu Bo
6bf02314d9 Btrfs: fix wrong check during log recovery
When we're evicting an inode during log recovery, we need to ensure that the inode
is not in orphan state any more, which means inode's run_time flags has _no_
BTRFS_INODE_HAS_ORPHAN_ITEM.  Thus, the BUG_ON was triggered because of a wrong
check for the flags.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-02 15:39:17 -04:00
Josef Bacik
c3473e8300 Btrfs: fix dio write vs buffered read race
Miao pointed out there's a problem with mixing dio writes and buffered
reads.  If the read happens between us invalidating the page range and
actually locking the extent we can bring in pages into page cache.  Then
once the write finishes if somebody tries to read again it will just find
uptodate pages and we'll read stale data.  So we need to lock the extent and
check for uptodate bits in the range.  If there are uptodate bits we need to
unlock and invalidate again.  This will keep this race from happening since
we will hold the extent locked until we create the ordered extent, and then
teh read side always waits for ordered extents.  There was also a race in
how we updated i_size, previously we were relying on the generic DIO stuff
to adjust the i_size after the DIO had completed, but this happens outside
of the extent lock which means reads could come in and not see the updated
i_size.  So instead move this work into where we create the extents, and
then this way the update ordered i_size stuff works properly in the endio
handlers.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-07-02 15:36:23 -04:00
Linus Torvalds
8874e812fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is a small pull with btrfs fixes.  The biggest of the bunch is
  another fix for the new backref walking code.

  We're still hammering out one btrfs dio vs buffered reads problem, but
  that one will have to wait for the next rc."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: delay iput with async extents
  Btrfs: add a missing spin_lock
  Btrfs: don't assume to be on the correct extent in add_all_parents
  Btrfs: introduce btrfs_next_old_item
2012-06-21 13:41:07 -07:00
Josef Bacik
cb77fcd885 Btrfs: delay iput with async extents
There is some concern that these iput()'s could be the final iputs and could
induce lockups on people waiting on writeback.  This would happen in the
rare case that we don't create ordered extents because of an error, but it
is theoretically possible and we already have a mechanism to deal with this
so just make them delayed iputs to negate any worry.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-06-21 07:19:36 -04:00
Linus Torvalds
718f58ad61 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "The dates look like I had to rebase this morning because there was a
  compiler warning for a printk arg that I had missed earlier.

  These are all fixes, including one to prevent using stale pointers for
  device names, and lots of fixes around transaction abort cleanups
  (Josef, Liu Bo).

  Jan Schmidt also sent in a number of fixes for the new reference
  number tracking code.

  Liu Bo beat me to updating the MAINTAINERS file.  Since he thought to
  also fix the git url, I kept his commit."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
  Btrfs: update MAINTAINERS info for BTRFS FILE SYSTEM
  Btrfs: destroy the items of the delayed inodes in error handling routine
  Btrfs: make sure that we've made everything in pinned tree clean
  Btrfs: avoid memory leak of extent state in error handling routine
  Btrfs: do not resize a seeding device
  Btrfs: fix missing inherited flag in rename
  Btrfs: fix incompat flags setting
  Btrfs: fix defrag regression
  Btrfs: call filemap_fdatawrite twice for compression
  Btrfs: keep inode pinned when compressing writes
  Btrfs: implement ->show_devname
  Btrfs: use rcu to protect device->name
  Btrfs: unlock everything properly in the error case for nocow
  Btrfs: fix btrfs_destroy_marked_extents
  Btrfs: abort the transaction if the commit fails
  Btrfs: wake up transaction waiters when aborting a transaction
  Btrfs: fix locking in btrfs_destroy_delayed_refs
  Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error
  Btrfs: fix race in tree mod log addition
  Btrfs: add btrfs_next_old_leaf
  ...
2012-06-15 16:04:37 -07:00
Liu Bo
bc1782374b Btrfs: fix missing inherited flag in rename
When we move a file into a directory with compression flag, we need to
inherite BTRFS_INODE_COMPRESS and clear BTRFS_INODE_NOCOMPRESS as well.
But if we move a file into a directory without compression flag, we need
to clear both of them.

It is the way how our setflags deals with compression flag, so keep
the same behaviour here.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-06-15 11:33:30 -04:00
Josef Bacik
7ddf5a42d3 Btrfs: call filemap_fdatawrite twice for compression
I removed this in an earlier commit and I was wrong.  Because compression
can return from filemap_fdatawrite() without having actually set any of it's
pages as writeback() it can make filemap_fdatawait() do essentially nothing,
and then we won't find any ordered extents because they may not have been
created yet.  So not only does this make fsync() completely useless, but it
will also screw up if you truncate on a non-page aligned offset since we
zero out the end and then wait on ordered extents and then call drop caches.
We can drop the cache before the io completes and then we try to unpin the
extent we just wrote we won't find it and everything goes sideways.  So fix
this by putting it back and put a giant comment there to keep me from trying
to remove it in the future.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-14 21:30:54 -04:00
Josef Bacik
8180ef8894 Btrfs: keep inode pinned when compressing writes
A user reported lots of problems using compression on the new code and it
turns out part of the problem was that igrab() was failing when we added a
new ordered extent.  This is because when writing out an inode under
compression we immediately return without actually doing anything to the
pages, and then in another thread at some point down the line actually do
the ordered dance.  The problem is between the point that we start writeback
and we actually add the ordered extent we could be trying to reclaim the
inode, which makes igrab() return NULL.  So we need to do an igrab() when we
create the async extent and then drop it when we are done with it.  This
makes sure we stay pinned in memory until the ordered extent can get a
reference on it and we are good to go.  With this patch we no longer panic
in btrfs_finish_ordered_io().  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-14 21:30:53 -04:00
Josef Bacik
17ca04aff7 Btrfs: unlock everything properly in the error case for nocow
I was getting hung on umount when a transaction was aborted because a range
of one of the free space inodes was still locked.  This is because the nocow
stuff doesn't unlock anything on error.  This fixed the problem and I
verified that is what was happening.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-14 21:29:15 -04:00
Josef Bacik
beb42dd793 Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error
While doing my enospc work I got a transaction abortion that resulted in a
panic when we tried to unlock_page() an already unlocked page.  This is
because we aren't calling extent_clear_unlock_delalloc with the locked page
so it was unlocking all the pages in the range.  This is wrong since
__extent_writepage expects to have the page locked still unless we return
*page_started as 1.  This should keep us from panicing.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-14 21:29:09 -04:00
Linus Torvalds
1193755ac6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs changes from Al Viro.
 "A lot of misc stuff.  The obvious groups:
   * Miklos' atomic_open series; kills the damn abuse of
     ->d_revalidate() by NFS, which was the major stumbling block for
     all work in that area.
   * ripping security_file_mmap() and dealing with deadlocks in the
     area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
     general.
   * ->encode_fh() switched to saner API; insane fake dentry in
     mm/cleancache.c gone.
   * assorted annotations in fs (endianness, __user)
   * parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
   * ->update_time() work from Josef.
   * other bits and pieces all over the place.

  Normally it would've been in two or three pull requests, but
  signal.git stuff had eaten a lot of time during this cycle ;-/"

Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
  nfs: don't open in ->d_revalidate
  vfs: retry last component if opening stale dentry
  vfs: nameidata_to_filp(): don't throw away file on error
  vfs: nameidata_to_filp(): inline __dentry_open()
  vfs: do_dentry_open(): don't put filp
  vfs: split __dentry_open()
  vfs: do_last() common post lookup
  vfs: do_last(): add audit_inode before open
  vfs: do_last(): only return EISDIR for O_CREAT
  vfs: do_last(): check LOOKUP_DIRECTORY
  vfs: do_last(): make ENOENT exit RCU safe
  vfs: make follow_link check RCU safe
  vfs: do_last(): use inode variable
  vfs: do_last(): inline walk_component()
  vfs: do_last(): make exit RCU safe
  vfs: split do_lookup()
  Btrfs: move over to use ->update_time
  fs: introduce inode operation ->update_time
  reiserfs: get rid of resierfs_sync_super
  reiserfs: mark the superblock as dirty a bit later
  ...
2012-06-01 10:34:35 -07:00
Josef Bacik
e41f941a23 Btrfs: move over to use ->update_time
Btrfs had been doing it's own file_update_time so we could catch ENOSPC
properly, so just update our btrfs_update_time to work with the new stuff and
then we'll be fancy later.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-01 12:07:52 -04:00
Linus Torvalds
51eab603f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This includes a fairly large change from Josef around data writeback
  completion.  Before, the writeback wasn't completed until the metadata
  insertions for the extent were done, and this made for fairly large
  latency spikes on the last page of each ordered extent.

  We already had a separate mechanism for tracking pending metadata
  insertions, so Josef just needed to tweak things a little to end
  writeback earlier on the page.  Overall it makes us much friendly to
  memory reclaim and lowers latencies quite a lot for synchronous IO.

  Jan Schmidt has finished some background work required to track btree
  blocks as they go through changes in ownership.  It's the missing
  piece he needed for both btrfs send/receive and subvolume quotas.
  Neither of those are ready yet, but the new tracking code is included
  here.  Most of the time, the new code is off.  It is only used by
  scrub and other backref walkers.

  Stefan Behrens has added io failure tracking.  This includes counters
  for which drives are causing the most trouble so the admin (or an
  automated tool) can choose to kick them out.  We're tracking IO
  errors, crc errors, and generation checks we do on each metadata
  block.

  RAID5/6 did miss the cut this time because I'm having trouble with
  corruptions.  I'll nail it down next week and post as a beta testing
  before 3.6"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (58 commits)
  Btrfs: fix tree mod log rewinded level and rewinding of moved keys
  Btrfs: fix tree mod log del_ptr
  Btrfs: add tree_mod_dont_log helper
  Btrfs: add missing spin_lock for insertion into tree mod log
  Btrfs: add inodes before dropping the extent lock in find_all_leafs
  Btrfs: use delayed ref sequence numbers for all fs-tree updates
  Btrfs: fix false positive in check-integrity on unmount
  Btrfs: fix runtime warning in check-integrity check data mode
  Btrfs: set ioprio of scrub readahead to idle
  Btrfs: fix return code in drop_objectid_items
  Btrfs: check to see if the inode is in the log before fsyncing
  Btrfs: return value of btrfs_read_buffer is checked correctly
  Btrfs: read device stats on mount, write modified ones during commit
  Btrfs: add ioctl to get and reset the device stats
  Btrfs: add device counters for detected IO and checksum errors
  btrfs: Drop unused function btrfs_abort_devices()
  Btrfs: fix the same inode id problem when doing auto defragment
  Btrfs: fall back to non-inline if we don't have enough space
  Btrfs: fix how we deal with the orphan block rsv
  Btrfs: convert the inode bit field to use the actual bit operations
  ...
2012-06-01 08:37:31 -07:00
Josef Bacik
2adcac1a73 Btrfs: fall back to non-inline if we don't have enough space
If cow_file_range_inline fails with ENOSPC we abort the transaction which
isn't very nice.  This really shouldn't be happening anyways but there's no
sense in making it a horrible error when we can easily just go allocate
normal data space for this stuff.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:38 -04:00
Josef Bacik
8a35d95ff4 Btrfs: fix how we deal with the orphan block rsv
Ceph was hitting this race where we would remove an inode from the per-root
orphan list before we would release the space we had reserved for the inode.
We actually don't need a list or anything, we just need to make sure the
root doesn't try to free up the orphan reserve until after the inodes have
released their reservations.  So use an atomic counter instead of a list on
the root and only decrement the counter after we've released our
reservation.  I've tested this as well as several others and we no longer
see the warnings that you would see while running ceph.  Thanks,
Btrfs: fix how we deal with the orphan block rsv

Ceph was hitting this race where we would remove an inode from the per-root
orphan list before we would release the space we had reserved for the inode.
We actually don't need a list or anything, we just need to make sure the
root doesn't try to free up the orphan reserve until after the inodes have
released their reservations.  So use an atomic counter instead of a list on
the root and only decrement the counter after we've released our
reservation.  I've tested this as well as several others and we no longer
see the warnings that you would see while running ceph.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:37 -04:00
Josef Bacik
72ac3c0d79 Btrfs: convert the inode bit field to use the actual bit operations
Miao pointed this out while I was working on an orphan problem that messing
with a bitfield where different ranges are protected by different locks
doesn't work out right.  Turns out we've been doing this forever where we
have different parts of the bit field protected by either no lock at all or
different locks which could cause all sorts of weird problems including the
issue I was hitting.  So instead make a runtime_flags thing that we use the
normal bit operations on that are all atomic so we can keep having our
no/different locking for the different flags and then make force_compress
it's own thing so it can be treated normally.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:36 -04:00
Josef Bacik
5fd0204355 Btrfs: finish ordered extents in their own thread
We noticed that the ordered extent completion doesn't really rely on having
a page and that it could be done independantly of ending the writeback on a
page.  This patch makes us not do the threaded endio stuff for normal
buffered writes and direct writes so we can end page writeback as soon as
possible (in irq context) and only start threads to do the ordered work when
it is actually done.  Compression needs to be reworked some to take
advantage of this as well, but atm it has to do a find_get_page in its endio
handler so it must be done in its own thread.  This makes direct writes
quite a bit faster.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:33 -04:00
Josef Bacik
0c4d2d95d0 Btrfs: use i_version instead of our own sequence
We've been keeping around the inode sequence number in hopes that somebody
would use it, but nobody uses it and people actually use i_version which
serves the same purpose, so use i_version where we used the incore inode's
sequence number and that way the sequence is updated properly across the
board, and not just in file write.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:27 -04:00
Linus Torvalds
90324cc1b1 avoid iput() from flusher thread
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPw2J/AAoJECvKgwp+S8Ja5jkP/3uMxkhf8XQpXCI3O1QVfaQr
 uZFfM8sINqIPDVm1dtFjFj7f8Bw9mhE2KAnnJ1rKT8tQwqq9yAse1QPlhCG1ZqoP
 +AnMDDXHtx7WmQZXhBvS9b+unpZ7Jr6r6pO5XrmTL2kRL3YJPUhZ2+xbTT5belTB
 KoAu4WqORZRxfXoC76S7U8K+D4NcAGhAOxCClsIjmY+oocCiCag4FZOyzYIFViqc
 ghUN/+rLQ3fqGGv2yO7Ylx1gUM7sxIwkZQ/h962jFAtxz9czImr2NmRoMliOaOkS
 tvcnIf+E3u0n/zIjzFvzhxKgHJPP8PkcPMk60d3jKmFngBkqFTzNUeVTP8md7HrV
 4DlXisWr+z7YVyWUCFaNcJLmjiWSwQ8DV/clRLobeBf9EJKan5F1PjFgl6PLJM5F
 Qr1+LHMNaetdulBwMRTyveZTzYqw9RmDnD9dWMo4mX/kTpvtC4jTPVV7hkRD+Qlv
 5vTRR+VXL3Q50yClLf0AQMSKTnH2gBuepM/b+7cShLGfsMln8DtUjmbigv+niL63
 BibcCIbIlP2uWGnl37VhsC34AT+RKt3lggrBOpn/7XJMq/wKR7IRP/7V9TfYgaUN
 NBa+wtnLDa1pZEn/X7izdcQP62PzDtmB+ObvYT0Yb40A4+2ud3qF/lB53c1A1ewF
 /9c4zxxekjHZnn2oooEa
 =oLXf
 -----END PGP SIGNATURE-----

Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback tree from Wu Fengguang:
 "Mainly from Jan Kara to avoid iput() in the flusher threads."

* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Avoid iput() from flusher thread
  vfs: Rename end_writeback() to clear_inode()
  vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
  writeback: Refactor writeback_single_inode()
  writeback: Remove wb->list_lock from writeback_single_inode()
  writeback: Separate inode requeueing after writeback
  writeback: Move I_DIRTY_PAGES handling
  writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
  writeback: Move clearing of I_SYNC into inode_sync_complete()
  writeback: initialize global_dirty_limit
  fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
  mm: page-writeback.c: local functions should not be exposed globally
2012-05-28 09:54:45 -07:00
Jan Kara
dbd5768f87 vfs: Rename end_writeback() to clear_inode()
After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-06 13:43:41 +08:00
Linus Torvalds
f7b0069317 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has our collection of bug fixes.  I missed the last rc because I
  thought our patches were making NFS crash during my xfs test runs.
  Turns out it was an NFS client bug fixed by someone else while I tried
  to bisect it.

  All of these fixes are small, but some are fairly high impact.  The
  biggest are fixes for our mount -o remount handling, a deadlock due to
  GFP_KERNEL allocations in readdir, and a RAID10 error handling bug.

  This was tested against both 3.3 and Linus' master as of this morning."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (26 commits)
  Btrfs: reduce lock contention during extent insertion
  Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir
  Btrfs: Fix space checking during fs resize
  Btrfs: fix block_rsv and space_info lock ordering
  Btrfs: Prevent root_list corruption
  Btrfs: fix repair code for RAID10
  Btrfs: do not start delalloc inodes during sync
  Btrfs: fix that check_int_data mount option was ignored
  Btrfs: don't count CRC or header errors twice while scrubbing
  Btrfs: fix btrfs_ioctl_dev_info() crash on missing device
  btrfs: don't return EINTR
  Btrfs: double unlock bug in error handling
  Btrfs: always store the mirror we read the eb from
  fs/btrfs/volumes.c: add missing free_fs_devices
  btrfs: fix early abort in 'remount'
  Btrfs: fix max chunk size check in chunk allocator
  Btrfs: add missing read locks in backref.c
  Btrfs: don't call free_extent_buffer twice in iterate_irefs
  Btrfs: Make free_ipath() deal gracefully with NULL pointers
  Btrfs: avoid possible use-after-free in clear_extent_bit()
  ...
2012-04-28 09:30:07 -07:00
Chris Mason
fede766f28 Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir
Btrfs has an optimization where it will preallocate dentries during
readdir to fill in enough information to open the inode without an extra
lookup.

But, we're calling d_alloc, which is doing GFP_KERNEL allocations, and
that leads to deadlocks because our readdir code has tree locks held.

For now, disable this optimization.  We'll fix the gfp mask in the next
merge window.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-04-27 14:23:22 -04:00
Josef Bacik
5cf1ab5613 Btrfs: always store the mirror we read the eb from
A user reported a panic where we were trying to fix a bad mirror but the
mirror number we were giving was 0, which is invalid.  This is because we
don't do the transid verification until after the read, so as far as the
read code is concerned the read was a success.  So instead store the mirror
we read from so that if there is some failure post read we know which mirror
to try next and which mirror needs to be fixed if we find a good copy of the
block.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-04-18 19:22:30 +02:00
Arne Jansen
8c9c2bf7a3 btrfs: fix race in reada
When inserting into the radix tree returns EEXIST, get the existing
entry without giving up the spinlock in between.
There was a race for both the zones trees and the extent tree.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-04-18 19:12:44 +02:00
Li Zefan
848cce0d41 Btrfs: avoid setting ->d_op twice
Follow those instructions, and you'll trigger a warning in the
beginning of d_set_d_op():

  # mkfs.btrfs /dev/loop3
  # mount /dev/loop3 /mnt
  # btrfs sub create /mnt/sub
  # btrfs sub snap /mnt /mnt/snap
  # touch /mnt/snap/sub
  touch: cannot touch `tmp': Permission denied

__d_alloc() set d_op to sb->s_d_op (btrfs_dentry_operations), and
then simple_lookup() reset it to simple_dentry_operations, which
triggered the warning.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-18 19:12:44 +02:00
Linus Torvalds
9613bebb22 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes and features from Chris Mason:
 "We've merged in the error handling patches from SuSE.  These are
  already shipping in the sles kernel, and they give btrfs the ability
  to abort transactions and go readonly on errors.  It involves a lot of
  churn as they clarify BUG_ONs, and remove the ones we now properly
  deal with.

  Josef reworked the way our metadata interacts with the page cache.
  page->private now points to the btrfs extent_buffer object, which
  makes everything faster.  He changed it so we write an whole extent
  buffer at a time instead of allowing individual pages to go down,,
  which will be important for the raid5/6 code (for the 3.5 merge
  window ;)

  Josef also made us more aggressive about dropping pages for metadata
  blocks that were freed due to COW.  Overall, our metadata caching is
  much faster now.

  We've integrated my patch for metadata bigger than the page size.
  This allows metadata blocks up to 64KB in size.  In practice 16K and
  32K seem to work best.  For workloads with lots of metadata, this cuts
  down the size of the extent allocation tree dramatically and fragments
  much less.

  Scrub was updated to support the larger block sizes, which ended up
  being a fairly large change (thanks Stefan Behrens).

  We also have an assortment of fixes and updates, especially to the
  balancing code (Ilya Dryomov), the back ref walker (Jan Schmidt) and
  the defragging code (Liu Bo)."

Fixed up trivial conflicts in fs/btrfs/scrub.c that were just due to
removal of the second argument to k[un]map_atomic() in commit
7ac687d9e0.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (75 commits)
  Btrfs: update the checks for mixed block groups with big metadata blocks
  Btrfs: update to the right index of defragment
  Btrfs: do not bother to defrag an extent if it is a big real extent
  Btrfs: add a check to decide if we should defrag the range
  Btrfs: fix recursive defragment with autodefrag option
  Btrfs: fix the mismatch of page->mapping
  Btrfs: fix race between direct io and autodefrag
  Btrfs: fix deadlock during allocating chunks
  Btrfs: show useful info in space reservation tracepoint
  Btrfs: don't use crc items bigger than 4KB
  Btrfs: flush out and clean up any block device pages during mount
  btrfs: disallow unequal data/metadata blocksize for mixed block groups
  Btrfs: enhance superblock sanity checks
  Btrfs: change scrub to support big blocks
  Btrfs: minor cleanup in scrub
  Btrfs: introduce common define for max number of mirrors
  Btrfs: fix infinite loop in btrfs_shrink_device()
  Btrfs: fix memory leak in resolver code
  Btrfs: allow dup for data chunks in mixed mode
  Btrfs: validate target profiles only if we are going to use them
  ...
2012-03-30 12:44:29 -07:00
Liu Bo
4cb13e5d6e Btrfs: fix recursive defragment with autodefrag option
$ mkfs.btrfs disk
$ mount disk /mnt -o autodefrag
$ dd if=/dev/zero of=/mnt/foobar bs=4k count=10 2>/dev/null && sync
$ for i in `seq 9 -2 0`; do dd if=/dev/zero of=/mnt/foobar bs=4k count=1 \
  seek=$i conv=notrunc 2> /dev/null; done && sync

then we'll get to defrag "foobar" again and again.
So does option "-o autodefrag,compress".

Reasons:
When the cleaner kthread gets to fetch inodes from the defrag tree and defrag
them, it will dirty pages and submit them, this will comes to another DATA COW
where the processing inode will be inserted to the defrag tree again.

This patch sets a rule for COW code, i.e. insert an inode when we're really
going to make some defragments.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29 09:57:45 -04:00
Chris Mason
1d4284bd6e Merge branch 'error-handling' into for-linus
Conflicts:
	fs/btrfs/ctree.c
	fs/btrfs/disk-io.c
	fs/btrfs/extent-tree.c
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/inode.c
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28 20:31:37 -04:00
Josef Bacik
0b32f4bbb4 Btrfs: ensure an entire eb is written at once
This patch simplifies how we track our extent buffers.  Previously we could exit
writepages with only having written half of an extent buffer, which meant we had
to track the state of the pages and the state of the extent buffers differently.
Now we only read in entire extent buffers and write out entire extent buffers,
this allows us to simply set bits in our bflags to indicate the state of the eb
and we no longer have to do things like track uptodate with our iotree.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26 17:04:23 -04:00
Josef Bacik
81c9ad237c Btrfs: remove search_start and search_end from find_free_extent and callers
We have been passing nothing but (u64)-1 to find_free_extent for search_end in
all of the callers, so it's completely useless, and we've always been passing 0
in as search_start, so just remove them as function arguments and move
search_start into find_free_extent.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26 14:42:51 -04:00
Jeff Mahoney
79787eaab4 btrfs: replace many BUG_ONs with proper error handling
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
 progress but aims to handle most errors other than internal logic
 errors and ENOMEM more gracefully.

 This iteration prevents most crashes but can run into lockups with
 the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 11:52:54 +01:00
Mark Fasheh
ce598979be btrfs: Don't BUG_ON errors from btrfs_create_subvol_root()
This is called from only one place - create_subvol() which passes errors
safely back out to it's caller, btrfs_mksubvol where they are handled.

Additionally, btrfs_create_subvol_root() itself bug's needlessly from error
return of btrfs_update_inode(). Since create_subvol() was fixed to catch
errors we can bubble this one up too.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2012-03-22 01:45:36 +01:00
Jeff Mahoney
3fbe5c02ae btrfs: split extent_state ops
set_extent_bit can do exclusive locking but only when called by lock_extent*,

 Drop the exclusive bits argument except when called by lock_extent.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:35 +01:00
Jeff Mahoney
d0082371cf btrfs: drop gfp_t from lock_extent
lock_extent and unlock_extent are always called with GFP_NOFS, drop the
 argument and use GFP_NOFS consistently.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:35 +01:00
Jeff Mahoney
143bede527 btrfs: return void in functions without error conditions
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:34 +01:00
Jeff Mahoney
355808c296 btrfs: ->submit_bio_hook error push-up
This pushes failures from the submit_bio_hook callbacks,
btrfs_submit_bio_hook and btree_submit_bio_hook into the callers, including
callers of submit_one_bio where it catches the failures with BUG_ON.

It also pushes up through the ->readpage_io_failed_hook to
end_bio_extent_writepage where the error is already caught with BUG_ON.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:34 +01:00
Jeff Mahoney
3444a97255 btrfs: Factor out tree->ops->merge_bio_hook call
In submit_extent_page, there's a visually noisy if statement that, in
the midst of other conditions, does the tree dependency for tree->ops
and tree->ops->merge_bio_hook before calling it, and then another
condition afterwards. If an error is returned from merge_bio_hook,
there's no way to catch it. It's considered a routine "1" return
value instead of a failure.

This patch factors out the dependency check into a new local merge_bio
routine and BUG's on an error. The if statement is less noisy as a side-
effect.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:33 +01:00
Jeff Mahoney
0417341e6b btrfs: Simplify btrfs_submit_bio_hook
btrfs_submit_bio_hook currently calls btrfs_bio_wq_end_io in either case
of an if statement that determines one of the arguments.

This patch moves the function call outside of the if statement and uses it
to only determine the different argument. This allows us to catch an
error in one place in a more visually obvious way.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:33 +01:00
Cong Wang
7ac687d9e0 btrfs: remove the second argument of k[un]map_atomic()
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:21 +08:00
Linus Torvalds
855a85f704 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Quoth Chris:
 "This is later than I wanted because I got backed up running through
  btrfs bugs from the Oracle QA teams.  But they are all bug fixes that
  we've queued and tested since rc1.

  Nothing in particular stands out, this just reflects bug fixing and QA
  done in parallel by all the btrfs developers.  The most user visible
  of these is:

    Btrfs: clear the extent uptodate bits during parent transid failures

  Because that helps deal with out of date drives (say an iscsi disk
  that has gone away and come back).  The old code wasn't always
  properly retrying the other mirror for this type of failure."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
  Btrfs: fix compiler warnings on 32 bit systems
  Btrfs: increase the global block reserve estimates
  Btrfs: clear the extent uptodate bits during parent transid failures
  Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
  Btrfs: make sure we update latest_bdev
  Btrfs: improve error handling for btrfs_insert_dir_item callers
  Btrfs: be less strict on finding next node in clear_extent_bit
  Btrfs: fix a bug on overcommit stuff
  Btrfs: kick out redundant stuff in convert_extent_bit
  Btrfs: skip states when they does not contain bits to clear
  Btrfs: check return value of lookup_extent_mapping() correctly
  Btrfs: fix deadlock on page lock when doing auto-defragment
  Btrfs: fix return value check of extent_io_ops
  btrfs: honor umask when creating subvol root
  btrfs: silence warning in raid array setup
  btrfs: fix structs where bitfields and spinlock/atomic share 8B word
  btrfs: delalloc for page dirtied out-of-band in fixup worker
  Btrfs: fix memory leak in load_free_space_cache()
  btrfs: don't check DUP chunks twice
  Btrfs: fix trim 0 bytes after a device delete
  ...
2012-02-24 09:02:53 -08:00
Chris Mason
fe66a05a06 Btrfs: improve error handling for btrfs_insert_dir_item callers
This allows us to gracefully continue if we aren't able to insert
directory items, both for normal files/dirs and snapshots.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23 10:43:45 -05:00
Florian Albrechtskirchinger
12fc9d0923 btrfs: honor umask when creating subvol root
Set the subvol root inode permissions based on the current umask.
2012-02-16 16:35:41 +01:00
Jeff Mahoney
87826df0ec btrfs: delalloc for page dirtied out-of-band in fixup worker
We encountered an issue that was easily observable on s/390 systems but
 could really happen anywhere. The timing just seemed to hit reliably
 on s/390 with limited memory.

 The gist is that when an unexpected set_page_dirty() happened, we'd
 run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
 properly set up for delalloc.

 This patch does the following:
 - Performs the missing delalloc in the fixup worker
 - Allow the start hook to return -EBUSY which informs __extent_writepage
   that it should mark the page skipped and not to redirty it. This is
   required since the fixup worker can fail with -ENOSPC and the page
   will have already been redirtied. That causes an Oops in
   drop_outstanding_extents later. Retrying the fixup worker could
   lead to an infinite loop. Deferring the page redirty also saves us
   some cycles since the page would be stuck in a resubmit-redirty loop
   until the fixup worker completes. It's not harmful, just wasteful.
 - If the fixup worker fails, we mark the page and mapping as errored,
   and end the writeback, similar to what we would do had the page
   actually been submitted to writeback.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-02-15 16:40:25 +01:00
Linus Torvalds
67d2433ee7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix reservations in btrfs_page_mkwrite
  Btrfs: advance window_start if we're using a bitmap
  btrfs: mask out gfp flags in releasepage
  Btrfs: fix enospc error caused by wrong checks of the chunk
  Btrfs: do not defrag a file partially
  Btrfs: fix warning for 32-bit build of fs/btrfs/check-integrity.c
  Btrfs: use cluster->window_start when allocating from a cluster bitmap
  Btrfs: Check for NULL page in extent_range_uptodate
  btrfs: Fix busyloops in transaction waiting code
  Btrfs: make sure a bitmap has enough bytes
  Btrfs: fix uninit warning in backref.c
2012-01-28 17:00:19 -08:00
Chris Mason
9998eb7034 Btrfs: fix reservations in btrfs_page_mkwrite
Josef fixed btrfs_page_mkwrite to properly release reserved
extents if there was an error.  But if we fail to get a reservation
and we fail to dirty the inode (for ENOSPC reasons), we'll end up
trying to release a reservation we never had.

This makes sure we only release if we were able to reserve.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 10:44:44 -05:00
Linus Torvalds
f9156c7288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
  Btrfs: use larger system chunks
  Btrfs: add a delalloc mutex to inodes for delalloc reservations
  Btrfs: space leak tracepoints
  Btrfs: protect orphan block rsv with spin_lock
  Btrfs: add allocator tracepoints
  Btrfs: don't call btrfs_throttle in file write
  Btrfs: release space on error in page_mkwrite
  Btrfs: fix btrfsck error 400 when truncating a compressed
  Btrfs: do not use btrfs_end_transaction_throttle everywhere
  Btrfs: add balance progress reporting
  Btrfs: allow for resuming restriper after it was paused
  Btrfs: allow for canceling restriper
  Btrfs: allow for pausing restriper
  Btrfs: add skip_balance mount option
  Btrfs: recover balance on mount
  Btrfs: save balance parameters to disk
  Btrfs: soft profile changing mode (aka soft convert)
  Btrfs: implement online profile changing
  Btrfs: do not reduce profile in do_chunk_alloc()
  Btrfs: virtual address space subset filter
  ...

Fix up trivial conflict in fs/btrfs/ioctl.c due to the use of the new
mnt_drop_write_file() helper.
2012-01-17 15:49:54 -08:00
Josef Bacik
f248679e86 Btrfs: add a delalloc mutex to inodes for delalloc reservations
I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
that and theres no real way to get rid of those, so just stop using i_mutex to
protect delalloc metadata reservations and use a delalloc mutex instead.  This
shouldn't be contended often at all, only if you are writing and mmap writing to
the file at the same time.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-01-16 15:29:43 -05:00
Josef Bacik
90290e1982 Btrfs: protect orphan block rsv with spin_lock
We've been seeing warnings coming out of the orphan commit stuff forever from
ceph.  Turns out it's because we're racing with checking if the orphan block
reserve is set, because we clear it outside of the spin_lock.  So leave the
normal fastpath checks where they are, but take the spin_lock and _recheck_ to
make sure we haven't had an orphan block rsv added in the meantime.  Then clear
the root's orphan block rsv and release the lock.  With this patch a user said
the warnings went away and they usually showed up pretty soon after he started
ceph.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-01-16 15:29:42 -05:00
Josef Bacik
ec39e180fd Btrfs: release space on error in page_mkwrite
If updating the inode gave us an ENOSPC we were just returning in page_mkwrite,
which is a problem since we make our reservation right before trying to update
the inode, so fix the out label so that we actually free our reservation.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16 15:28:54 -05:00
Miao Xie
f70a9a6b94 Btrfs: fix btrfsck error 400 when truncating a compressed
Reproduce steps:
 # mkfs.btrfs /dev/sdb5
 # mount /dev/sdb5 -o compress=lzo /mnt
 # dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1
 # sync
 # truncate -s 64K /mnt/tmpfile
 root 5 inode 257 errors 400

This is because of the wrong if condition, which is used to check if we should
subtract the bytes of the dropped range from i_blocks/i_bytes of i-node or not.
When we truncate a compressed extent, btrfs substracts the bytes of the whole
extent, it's wrong. We should substract the real size that we truncate, no
matter it is a compressed extent or not. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16 15:28:54 -05:00
Josef Bacik
7ad85bb76a Btrfs: do not use btrfs_end_transaction_throttle everywhere
A user reported a problem where things like open with O_CREAT would take up to
30 seconds when he had nfs activity on the same mount.  This is because all of
our quick metadata operations, like create, symlink etc all do
btrfs_end_transaction_throttle, which if the transaction is blocked will wait
for the commit to complete before it returns.  This adds a ridiculous amount of
latency and isn't really needed.  The normal btrfs_end_transaction will mark the
transaction as blocked and wake the transaction kthread up if it thinks the
transaction needs to end (this being in the running out of global reserve space
scenario), and this is all that is really needed since we've already done
everything we're going to do, we just need to return.  This should help people
with the latency they were seeing when using synchronous heavy workloads.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16 15:28:54 -05:00
Chris Mason
9785dbdf26 Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into integration 2012-01-16 15:26:31 -05:00
Linus Torvalds
98793265b4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
  Kconfig: acpi: Fix typo in comment.
  misc latin1 to utf8 conversions
  devres: Fix a typo in devm_kfree comment
  btrfs: free-space-cache.c: remove extra semicolon.
  fat: Spelling s/obsolate/obsolete/g
  SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
  tools/power turbostat: update fields in manpage
  mac80211: drop spelling fix
  types.h: fix comment spelling for 'architectures'
  typo fixes: aera -> area, exntension -> extension
  devices.txt: Fix typo of 'VMware'.
  sis900: Fix enum typo 'sis900_rx_bufer_status'
  decompress_bunzip2: remove invalid vi modeline
  treewide: Fix comment and string typo 'bufer'
  hyper-v: Update MAINTAINERS
  treewide: Fix typos in various parts of the kernel, and fix some comments.
  clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
  gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
  leds: Kconfig: Fix typo 'D2NET_V2'
  sound: Kconfig: drop unknown symbol ARCH_CLPS7500
  ...

Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
kconfig additions, close to removed commented-out old ones)
2012-01-08 13:21:22 -08:00
Jan Schmidt
6bf7e080d5 Btrfs: make sure we're not using obsolete code in btrfs_get_extent
There's code in btrfs_get_extent that should never be used. This patch turns
a WARN_ON(1) into a BUG(), hoping we can remove the transaction code from
btrfs_get_extent soon.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-01-05 15:11:55 +01:00
Al Viro
175a4eb7ea fs: propagate umode_t, misc bits
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:55:10 -05:00
Al Viro
1a67aafb5f switch ->mknod() to umode_t
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:54 -05:00
Al Viro
4acdaf27eb switch ->create() to umode_t
vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:53 -05:00
Al Viro
18bb1db3e7 switch vfs_mkdir() and ->mkdir() to umode_t
vfs_mkdir() gets int, but immediately drops everything that might not
fit into umode_t and that's the only caller of ->mkdir()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:53 -05:00
Al Viro
6b520e0565 vfs: fix the stupidity with i_dentry in inode destructors
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else.  Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:40 -05:00
Linus Torvalds
827fa4c762 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: call d_instantiate after all ops are setup
  Btrfs: fix worker lock misuse in find_worker
2011-12-23 14:58:39 -08:00
Al Viro
08c422c27f Btrfs: call d_instantiate after all ops are setup
This closes races where btrfs is calling d_instantiate too soon during
inode creation.  All of the callers of btrfs_add_nondir are updated to
instantiate after the inode is fully setup in memory.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-23 08:02:26 -05:00
Arne Jansen
66d7e7f09f Btrfs: mark delayed refs as for cow
Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
from every call site. The for_cow parameter will later on be used to
determine if a ref will change anything with respect to qgroups.

Delayed refs coming from relocation are always counted as for_cow, as they
don't change subvol quota.

Also pass in the fs_info for later use.

btrfs_find_all_roots() will use this as an optimization, as changes that are
for_cow will not change anything with respect to which root points to a
certain leaf. Thus, we don't need to add the current sequence number to
those delayed refs.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-12-22 16:22:27 +01:00
Linus Torvalds
c9a7fe9672 Merge branches 'for-linus' and 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: unplug every once and a while
  Btrfs: deal with NULL srv_rsv in the delalloc inode reservation code
  Btrfs: only set cache_generation if we setup the block group
  Btrfs: don't panic if orphan item already exists
  Btrfs: fix leaked space in truncate
  Btrfs: fix how we do delalloc reservations and how we free reservations on error
  Btrfs: deal with enospc from dirtying inodes properly
  Btrfs: fix num_workers_starting bug and other bugs in async thread
  BTRFS: Establish i_ops before calling d_instantiate
  Btrfs: add a cond_resched() into the worker loop
  Btrfs: fix ctime update of on-disk inode
  btrfs: keep orphans for subvolume deletion
  Btrfs: fix inaccurate available space on raid0 profile
  Btrfs: fix wrong disk space information of the files
  Btrfs: fix wrong i_size when truncating a file to a larger size
  Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror

* 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: lower the dirty balance poll interval
2011-12-16 12:15:50 -08:00
Chris Mason
567a45e917 Merge branch 'for-chris' of http://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into integration
Conflicts:
	fs/btrfs/inode.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 13:43:49 -05:00
Josef Bacik
ee4d89f0c4 Btrfs: don't panic if orphan item already exists
I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
a loop.  This is because we will add an orphan item, do the truncate, the
truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
left with an orphan item still in the fs.  Then we come back later to do another
truncate and it blows up because we already have an orphan item.  This is ok so
just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15 11:04:24 -05:00
Josef Bacik
7041ee9728 Btrfs: fix leaked space in truncate
We were occasionaly leaking space when running xfstest 269.  This is because if
we failed to start the transaction in the truncate loop we'd just goto out, but
we need to break so that the inode is removed from the orphan list and the space
is properly freed.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15 11:04:23 -05:00
Josef Bacik
660d3f6cde Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved.  This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such.  The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for.  The other problem is with our error handling
in the reservation code.  There are two cases that we need to deal with

1) We raced with free.  In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation.  However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use.  So as it stands the code was
doing this fine and it worked out, except in case #2

2) We don't race with free.  Nobody comes in and changes anything, and our
reservation fails.  In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything.  So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.

Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex.  We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things.  With this patch my space leak scripts no longer
scream bloody murder.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15 11:04:22 -05:00
Josef Bacik
22c44fe65a Btrfs: deal with enospc from dirtying inodes properly
Now that we're properly keeping track of delayed inode space we've been getting
a lot of warnings out of btrfs_dirty_inode() when running xfstest 83.  This is
because a bunch of people call mark_inode_dirty, which is void so we can't
return ENOSPC.  This needs to be fixed in a few areas

1) file_update_time - this updates the mtime and such when writing to a file,
which will call mark_inode_dirty.  So copy file_update_time into btrfs so we can
call btrfs_dirty_inode directly and return an error if we get one appropriately.

2) fix symlinks to use btrfs_setattr for ->setattr.  For some reason we weren't
setting ->setattr for symlinks, even though we should have been.  This catches
one of the cases where we were getting errors in mark_inode_dirty.

3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly
instead of mark_inode_dirty.  This lets us return errors properly for truncate
and chown/anything related to setattr.

4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and
print an error if we have one.  The only remaining user we can't control for
this is touch_atime(), but we don't really want to keep people from walking
down the tree if we don't have space to save the atime update, so just complain
but don't worry about it.

With this patch xfstests 83 complains a handful of times instead of hundreds of
times.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15 11:04:21 -05:00
Casey Schaufler
ad19db71f4 BTRFS: Establish i_ops before calling d_instantiate
The Smack LSM hook for security_d_instantiate checks
the inode's i_op->getxattr value to determine if the
containing filesystem supports extended attributes.
The BTRFS filesystem sets the inode's i_op value only
after it has instantiated the inode. This results in
Smack incorrectly giving new BTRFS inodes attributes
from the filesystem defaults on the assumption that
values can't be stored on the filesystem. This patch
moves the assignment of inode operation vectors ahead
of the calls to d_instantiate, letting Smack know that
the filesystem supports extended attributes. There
should be no impact on the performance or behavior of
BTRFS.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 10:50:38 -05:00
Arne Jansen
f8e9e0b07b btrfs: keep orphans for subvolume deletion
Since we have the free space caches, btrfs_orphan_cleanup also runs for
the tree_root. Unfortunately this also cleans up the orphans used to mark
subvol deletions in progress.

Currently if a subvol deletion gets interrupted twice by umount/mount, the
deletion will not be continued and the space permanently lost, though it
would be possible to write a tool to recover those lost subvol deletions.
This patch checks if the orphan belongs to a subvol (dead root) and skips
the deletion.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 10:50:37 -05:00
Miao Xie
3642320e07 Btrfs: fix wrong disk space information of the files
Btrfsck report errors after the 83th case of xfstests was run, The error
number is 400, it means the used disk space of the file is wrong.

The reason of this bug is that:
The file truncation may fail when the space of the file system is not enough,
and leave some file extents, whose offset are beyond the end of the files.
When we want to expand those files, we will drop those file extents, and
put in dummy file extents, and then we should update the i-node. But btrfs
forgets to do it.

This patch adds the forgotten i-node update.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 10:50:36 -05:00
Miao Xie
f4a2f4c548 Btrfs: fix wrong i_size when truncating a file to a larger size
Btrfsck report error 100 after the 83th case of xfstests was run, it means
the i_size of the file is wrong.

The reason of this bug is that:
Btrfs increased i_size of the file at the beginning, but it failed to expand
the file, and failed to update the i_size to the old size because there is no
enough space in the file system, so we found a wrong i_size.

This patch fixes this bug by updating the i_size just when we pass the file
expanding and get enough space to update i-node.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 10:50:35 -05:00
Justin P. Mattock
42b2aa86c6 treewide: Fix typos in various parts of the kernel, and fix some comments.
The below patch fixes some typos in various parts of the kernel, as well as fixes some comments.
Please let me know if I missed anything, and I will try to get it changed and resent.

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-12-02 14:57:31 +01:00
Linus Torvalds
b930c26416 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix meta data raid-repair merge problem
  Btrfs: skip allocation attempt from empty cluster
  Btrfs: skip block groups without enough space for a cluster
  Btrfs: start search for new cluster at the beginning
  Btrfs: reset cluster's max_size when creating bitmap
  Btrfs: initialize new bitmaps' list
  Btrfs: fix oops when calling statfs on readonly device
  Btrfs: Don't error on resizing FS to same size
  Btrfs: fix deadlock on metadata reservation when evicting a inode
  Fix URL of btrfs-progs git repository in docs
  btrfs scrub: handle -ENOMEM from init_ipath()
2011-12-01 08:28:53 -08:00
Miao Xie
aa38a711a8 Btrfs: fix deadlock on metadata reservation when evicting a inode
When I ran the xfstests, I found the test tasks was blocked on meta-data
reservation.

By debugging, I found the reason of this bug:
   start transaction
        |
	v
   reserve meta-data space
	|
	v
   flush delay allocation -> iput inode -> evict inode
	^					|
	|					v
   wait for delay allocation flush <- reserve meta-data space

And besides that, the flush on evicting inode will block the thread, which
is reclaiming the memory, and make oom happen easily.

Fix this bug by skipping the flush step when evicting inode.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-11-30 18:46:03 +01:00
Linus Torvalds
af36d15f58 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: remove free-space-cache.c WARN during log replay
  Btrfs: sectorsize align offsets in fiemap
  Btrfs: clear pages dirty for io and set them extent mapped
  Btrfs: wait on caching if we're loading the free space cache
  Btrfs: prefix resize related printks with btrfs:
  btrfs: fix stat blocks accounting
  Btrfs: avoid unnecessary bitmap search for cluster setup
  Btrfs: fix to search one more bitmap for cluster setup
  btrfs: mirror_num should be int, not u64
  btrfs: Fix up 32/64-bit compatibility for new ioctls
  Btrfs: fix barrier flushes
  Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
2011-11-22 08:53:40 -08:00
David Sterba
fadc0d8be4 btrfs: fix stat blocks accounting
Round inode bytes and delalloc bytes up to real blocksize before
converting to sector size. Otherwise eg. files smaller than 512
are reported with zero blocks due to incorrect rounding.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:15 -05:00
Linus Torvalds
c1f4246716 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: rename the option to nospace_cache
  Btrfs: handle bio_add_page failure gracefully in scrub
  Btrfs: fix deadlock caused by the race between relocation
  Btrfs: only map pages if we know we need them when reading the space cache
  Btrfs: fix orphan backref nodes
  Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush}
  Btrfs: fix unreleased path in btrfs_orphan_cleanup()
  Btrfs: fix no reserved space for writing out inode cache
  Btrfs: fix nocow when deleting the item
  Btrfs: tweak the delayed inode reservations again
  Btrfs: rework error handling in btrfs_mount()
  Btrfs: close devices on all error paths in open_ctree()
  Btrfs: avoid null dereference and leaks when bailing from open_ctree()
  Btrfs: fix subvol_name leak on error in btrfs_mount()
  Btrfs: fix memory leak in btrfs_parse_early_options()
  Btrfs: fix our reservations for updating an inode when completing io
  Btrfs: fix oops on NULL trans handle in btrfs_truncate
  btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
2011-11-11 23:47:06 -02:00
Miao Xie
3254c87618 Btrfs: fix unreleased path in btrfs_orphan_cleanup()
When we did stress test for the space relocation, the deadlock happened.
By debugging, We found it was caused by the carelessness that we forgot
to unlock the read lock of the extent buffers in btrfs_orphan_cleanup()
before we end the transaction handle, so the transaction commit task waited
the task, which called btrfs_orphan_cleanup(), to unlock the extent buffer,
but that task waited the commit task to end the transaction commit, and
the deadlock happened. Fix it.

Signed-ff-by: Miao Xie <miaox@cn.fujitsu.com>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Chris Mason
2115133f8b Btrfs: tweak the delayed inode reservations again
Josef sent along an incremental to the inode reservation
code to make sure we try and fall back to directly updating
the inode item if things go horribly wrong.

This reworks that patch slightly, adding a fallback function
that will always try to update the inode item directly without
going through the delayed_inode code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:39:08 -05:00
Josef Bacik
7fd2ae21a4 Btrfs: fix our reservations for updating an inode when completing io
People have been reporting ENOSPC crashes in finish_ordered_io.  This is because
we try to steal from the delalloc block rsv to satisfy a reservation to update
the inode.  The problem with this is we don't explicitly save space for updating
the inode when doing delalloc.  This is kind of a problem and we've gotten away
with this because way back when we just stole from the delalloc reserve without
any questions, and this worked out fine because generally speaking the leaf had
been modified either by the mtime update when we did the original write or
because we just updated the leaf when we inserted the file extent item, only on
rare occasions had the leaf not actually been modified, and that was still ok
because we'd just use a block or two out of the over-reservation that is
delalloc.

Then came the delayed inode stuff.  This is amazing, except it wants a full
reservation for updating the inode since it may do it at some point down the
road after we've written the blocks and we have to recow everything again.  This
worked out because the delayed inode stuff just stole from the global reserve,
that is until recently when I changed that because it caused other problems.

So here we are, we're doing everything right and being screwed for it.  So take
an extra reservation for the inode at delalloc reservation time and carry it
through the life of the delalloc reservation.  If we need it we can steal it in
the delayed inode stuff.  If we have already stolen it try and do a normal
metadata reservation.  If that fails try to steal from the delalloc reservation.
If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
solve this and in the meantime we'll steal from the global reserve.

With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
any problems.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-08 15:47:34 -05:00
Chris Mason
917c16b2b6 Btrfs: fix oops on NULL trans handle in btrfs_truncate
If we fail to reserve space in the transaction during truncate, we can
error out with a NULL trans handle.  The cleanup code needs an extra
check to make sure we aren't trying to use the bad handle.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-08 14:49:59 -05:00
Linus Torvalds
6a6662ced4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
  Btrfs: check for a null fs root when writing to the backup root log
  Btrfs: fix race during transaction joins
  Btrfs: fix a potential btrfs_bio leak on scrub fixups
  Btrfs: rename btrfs_bio multi -> bbio for consistency
  Btrfs: stop leaking btrfs_bios on readahead
  Btrfs: stop the readahead threads on failed mount
  Btrfs: fix extent_buffer leak in the metadata IO error handling
  Btrfs: fix the new inspection ioctls for 32 bit compat
  Btrfs: fix delayed insertion reservation
  Btrfs: ClearPageError during writepage and clean_tree_block
  Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
  Btrfs: make a delayed_block_rsv for the delayed item insertion
  Btrfs: add a log of past tree roots
  btrfs: separate superblock items out of fs_info
  Btrfs: use the global reserve when truncating the free space cache inode
  Btrfs: release metadata from global reserve if we have to fallback for unlink
  Btrfs: make sure to flush queued bios if write_cache_pages waits
  Btrfs: fix extent pinning bugs in the tree log
  Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
  Btrfs: don't wait as long for more batches during SSD log commit
  ...
2011-11-06 20:03:41 -08:00
Chris Mason
806468f8bf Merge git://git.jan-o-sch.net/btrfs-unstable into integration
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:07:10 -05:00
David Sterba
6c41761fc6 btrfs: separate superblock items out of fs_info
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-11-06 03:04:01 -05:00
Josef Bacik
5a77d76c24 Btrfs: release metadata from global reserve if we have to fallback for unlink
I fixed a problem where we weren't reserving space for an orphan item when we
had to fallback to using the global reserve for an unlink, but I introduced
another problem.  I was migrating the bytes from the transaction reserve to the
global reserve and then releasing from the global reserve in
btrfs_end_transaction().  The problem with this is that a migrate will jack up
the size for the destination, but leave the size alone for the source, with the
idea that you can do a release normally on the source and it all washes out, and
then you can do a release again on the destination and it works out right.  My
way was skipping the release on the trans_block_rsv which still had the jacked
up size from our original reservation.  So instead release manually from the
global reserve if this transaction was using it, and then set the
trans->block_rsv back to the trans_block_rsv so that btrfs_end_transaction
cleans everything up properly.  With this patch xfstest 83 doesn't emit warnings
about leaking space.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:49 -05:00
Miklos Szeredi
bfe8684869 filesystems: add set_nlink()
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:43 +01:00
Li Zefan
f0dd9592a1 Btrfs: fix direct-io vs nodatacow
To reproduce the bug:

  # mount -o nodatacow /dev/sda7 /mnt/
  # dd if=/dev/zero of=/mnt/tmp bs=4K count=1
  1+0 records in
  1+0 records out
  4096 bytes (4.1 kB) copied, 0.000136115 s, 30.1 MB/s
  # dd if=/dev/zero of=/mnt/tmp bs=4K count=1 conv=notrunc oflag=direct
  dd: writing `/mnt/tmp': Input/output error
  1+0 records in
  0+0 records out

btrfs_ordered_update_i_size() may return 1, but btrfs_endio_direct_write()
mistakenly takes it as an error.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:44 +02:00
Li Zefan
560f7d7545 Btrfs: remove BUG_ON() in compress_file_range()
It's not a big deal if we fail to allocate the array, and instead of
panic we can just give up compressing.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:43 +02:00
Josef Bacik
36ba022ac0 Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions
Currently btrfs_block_rsv_check does 2 things, it will either refill a block
reserve like in the truncate or refill case, or it will check to see if there is
enough space in the global reserve and possibly refill it.  However because of
overcommit we could be well overcommitting ourselves just to try and refill the
global reserve, when really we should just be committing the transaction.  So
breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check.  Refill
will try to reserve more metadata if it can and btrfs_block_rsv_check will not,
it will only tell you if the factor of the total space is still reserved.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:59 -04:00
Josef Bacik
3880a1b46d Btrfs: reserve some space for an orphan item when unlinking
In __unlink_start_trans() if we don't have enough room for a reservation we will
check to see if the unlink will free up space.  If it does that's great, but we
will still could add an orphan item, so we need to reserve enough space to add
the orphan item.  Do this and migrate the space the global reserve so it all
works out right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:59 -04:00
Josef Bacik
e70bea5fe0 Btrfs: fix the amount of space reserved for unlink
Our unlink reservations were a bit much, we were reserving 10 and I only count 8
possible items we're touching, so comment what we're reserving for and fix the
count value.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:55 -04:00
Josef Bacik
5b0e95bf60 Btrfs: inline checksums into the disk free space cache
Yeah yeah I know this is how we used to do it and then I changed it, but damnit
I'm changing it back.  The fact is that writing out checksums will modify
metadata, which could cause us to dirty a block group we've already written out,
so we have to truncate it and all of it's checksums and re-write it which will
write new checksums which could dirty a blockg roup that has already been
written and you see where I'm going with this?  This can cause unmount or really
anything that depends on a transaction to commit to take it's sweet damned time
to happen.  So go back to the way it was, only this time we're specifically
setting NODATACOW because we can't go through the COW pathway anyway and we're
doing our own built-in cow'ing by truncating the free space cache.  The other
new thing is once we truncate the old cache and preallocate the new space, we
don't need to do that song and dance at all for the rest of the transaction, we
can just overwrite the existing space with the new cache if the block group
changes for whatever reason, and the NODATACOW will let us do this fine.  So
keep track of which transaction we last cleared our cache in and if we cleared
it in this transaction just say we're all setup and carry on.  This survives
xfstests and stress.sh.

The inode cache will continue to use the normal csum infrastructure since it
only gets written once and there will be no more modifications to the fs tree in
a transaction commit.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:54 -04:00
Josef Bacik
8f6d7f4f45 Btrfs: break out of orphan cleanup if we can't make progress
I noticed while running xfstests 83 that if we didn't have enough space to
delete our inode the orphan cleanup would just loop.  This is because it keeps
finding the same orphan item and keeps trying to kill it but can't because we
don't get an error back from iput for deleting the inode.  So keep track of the
last guy we tried to kill, if it's the same as the one we're trying to kill
currently we know we are having problems and can just error out.  I don't have a
way to test this so look hard and make sure it's right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:49 -04:00
Josef Bacik
726c35fa0e Btrfs: use the global reserve as a backup for deleting inodes
Xfstests 83 really stresses our ENOSPC since it uses a 100mb fs which ends up
with the mixed block group stuff.  Because of this we can run into a situation
where we don't have enough space to delete inodes, or even worse we can't free
the inodes when we next mount the fs which causes the orphan code to lose its
mind.  So if we fail to make our reservation, steal from the global reserve.
The global reserve will end up taking up the entire rest of the free space on
the fs in this worst case so there really is no other option.  With this patch
test 83 doesn't freak out.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:48 -04:00
Josef Bacik
a8c9e57697 Btrfs: fix orphan cleanup regression
In fixing how we deal with bad inodes, we had a regression in the orphan cleanup
code, since it expects to get a bad inode back.  So fix it to deal with getting
-ESTALE back by deleting the orphan item manually and moving on.  Thanks,

Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:46 -04:00
Josef Bacik
3b16a4e3c3 Btrfs: use the inode's mapping mask for allocating pages
Johannes pointed out we were allocating only kernel pages for doing writes,
which is kind of a big deal if you are on 32bit and have more than a gig of ram.
So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
don't re-enter.  Thanks,

Reported-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:45 -04:00
Josef Bacik
4a92b1b8d2 Btrfs: stop passing a trans handle all around the reservation code
The only thing that we need to have a trans handle for is in
reserve_metadata_bytes and thats to know how much flushing we can do.  So
instead of passing it around, just check current->journal_info for a
trans_handle so we know if we can commit a transaction to try and free up space
or not.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:44 -04:00
Josef Bacik
c09544e07f Btrfs: handle enospc accounting for free space inodes
Since free space inodes now use normal checksumming we need to make sure to
account for their metadata use.  So reserve metadata space, and then if we fail
to write out the metadata we can just release it, otherwise it will be freed up
when the io completes.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:42 -04:00
Josef Bacik
4a33854257 Btrfs: set truncate block rsv's size
While debugging a different issue I noticed that we were always reserving space
when we tried to use our truncate block rsv's.  This is because they didn't have
a ->size value, so use_block_rsv just assumes there is nothing reserved and it
does a reserve_metadata_bytes.  This is because btrfs_check_block_rsv() doesn't
actually add to the size of the block rsv.  That seems to be the right thing to
do so set ->size to the minimum truncate size we need, since we will always only
refill to that size anyway, and this way everything works out correctly.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:40 -04:00
Josef Bacik
482e6dc526 Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check
If you run xfstest 224 it you will get lots of messages about not being able to
delete inodes and that they will be cleaned up next mount.  This is because
btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
flush, so if there was not enough space, it simply failed.  But in truncate and
evict case we could easily flush space to try and get enough space to do our
work, so make btrfs_block_rsv_check take a flush argument to pass down to
reserve_metadata_bytes.  Now xfstests 224 runs fine without all those
complaints.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:38 -04:00
Josef Bacik
07127184ef Btrfs: reduce the amount of space needed for truncates
With btrfs_truncate_inode_items we always return if we have to go to another
leaf, which makes us do our reservation again.  This means we will only ever
modify one leaf at a time, so we only need 1 items worth of slack space.  Also,
since we are deleting we will not be creating nodes as we go down, if anything
we'll be free'ing them as we merge them together, so make a different
calculation for truncate which will only have the worst case useage of COW'ing
the entire path down to the leaf.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:37 -04:00
Josef Bacik
907cbcebd4 Btrfs: optimize how we account for space in truncate
Currently we're starting and stopping a transaction for no real reason, so kill
that and just reserve enough space as if we can truncate all in one transaction.
Also use btrfs_block_rsv_check() for our reserve to minimize the amount of space
we may have to allocate for our slack space.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:35 -04:00
Josef Bacik
4289a667a0 Btrfs: fix how we reserve space for deleting inodes
I converted btrfs_truncate to do sane reservations for truncate, but didn't
convert btrfs_evict_inode.  Basically we need to save the orphan_rsv for
deleting the orphan item, and do normal reservations for our truncate.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:33 -04:00
Josef Bacik
37be25bcb6 Btrfs: kill the durable block rsv stuff
This is confusing code and isn't used by anything anymore, so delete it.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:32 -04:00
Josef Bacik
dba68306f3 Btrfs: kill the orphan space calculation for snapshots
This patch kills off the calculation for the amount of space needed for the
orphan operations during a snapshot.  The thing is we only do snapshots on
commit, so any space that is in the block_rsv->freed[] isn't going to be in the
new snapshot anyway, so there isn't any reason to require that space to be
reserved for the snapshot to occur.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:32 -04:00
Josef Bacik
7709cde33f Btrfs: calculate checksum space correctly
We have not been reserving enough space for checksums.  We were just reserving
bytes for the checksum items themselves, we were not taking into account having
to cow the tree and such.  This patch adds a csum_bytes counter to the inode for
keeping track of the number of bytes outstanding we have for checksums.  Then we
calculate how many leaves would be required for the checksums we are given and
use that to reserve space.  This adds a significant amount of bytes to our
reservations, but we will handle this later.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:31 -04:00
Josef Bacik
0cbbdf7c9c Btrfs: kill reserved_bytes in inode
reserved_bytes is not used for anything in the inode, remove it.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:28 -04:00
Jan Schmidt
4a54c8c165 btrfs: Moved repair code from inode.c to extent_io.c
The raid-retry code in inode.c can be generalized so that it works for
metadata as well. Thus, this patch moves it to extent_io.c and makes the
raid-retry code a raid-repair code.

Repair works that way: Whenever a read error occurs and we have more
mirrors to try, note the failed mirror, and retry another. If we find a
good one, check if we did note a failure earlier and if so, do not allow
the read to complete until after the bad sector was written with the good
data we just fetched. As we have the extent locked while reading, no one
can change the data in between.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Jan Schmidt
1503140d3e btrfs: Do not use bio->bi_bdev after submission
The block layer modifies bio->bi_bdev and bio->bi_sector while working on
the bio, they do _not_ come back unmodified in the completion callback.

To call add_page, we need at least some bi_bdev set, which is why the code
was working, previously. With this patch, we use the latest_bdev from
fsinfo instead of the leftover in the bio. This gives us the possibility to
use the bi_bdev field for another purpose.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Jan Schmidt
8ddc7d9cd0 btrfs: add mirror_num to extent_read_full_page
Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:28 +02:00
Josef Bacik
a66e7cc626 Btrfs: only clear the need lookup flag after the dentry is setup
We can race with readdir and the RCU path walking stuff.  This is because we
clear the need lookup flag before actually instantiating the inode.  This will
lead the RCU path walk stuff to find a dentry it thinks is valid without a
d_inode attached.  So instead unhash the dentry when we first start the lookup,
and then clear the flag after we've instantiated the dentry so we're garunteed
to either try the slow lookup, or have the d_inode set properly.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:34:03 -04:00
Chris Mason
2cf4ce7c2a Merge branch 'btrfs-3.0' into for-linus 2011-09-18 10:31:44 -04:00
Hidetoshi Seto
3765fefaee btrfs: fix d_off in the first dirent
Since the d_off in the first dirent for "." (that originates from
the 4th argument "offset" of filldir() for the 2nd dirent for "..")
is wrongly assigned in btrfs_real_readdir(), telldir returns same
offset for different locations.

 | # mkfs.btrfs /dev/sdb1
 | # mount /dev/sdb1 fs0
 | # cd fs0
 | # touch file0 file1
 | # ../test
 | telldir: 0
 | readdir: d_off = 2, d_name = "."
 | telldir: 2
 | readdir: d_off = 2, d_name = ".."
 | telldir: 2
 | readdir: d_off = 3, d_name = "file0"
 | telldir: 3
 | readdir: d_off = 2147483647, d_name = "file1"
 | telldir: 2147483647

To fix this problem, pass filp->f_pos (which is loff_t) instead.

 | # ../test
 | telldir: 0
 | readdir: d_off = 1, d_name = "."
 | telldir: 1
 | readdir: d_off = 2, d_name = ".."
 | telldir: 2
 | readdir: d_off = 3, d_name = "file0"
 :

At the moment the "offset" for "." is unused because there is no
preceding dirent, however it is better to pass filp->f_pos to follow
grammatical usage.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:20:46 -04:00
Linus Torvalds
0b001b2eda Merge branch 'for-linus' of git://github.com/chrismason/linux
* 'for-linus' of git://github.com/chrismason/linux:
  Btrfs: add dummy extent if dst offset excceeds file end in
  Btrfs: calc file extent num_bytes correctly in file clone
  btrfs: xattr: fix attribute removal
  Btrfs: fix wrong nbytes information of the inode
  Btrfs: fix the file extent gap when doing direct IO
  Btrfs: fix unclosed transaction handle in btrfs_cont_expand
  Btrfs: fix misuse of trans block rsv
  Btrfs: reset to appropriate block rsv after orphan operations
  Btrfs: skip locking if searching the commit root in csum lookup
  btrfs: fix warning in iput for bad-inode
  Btrfs: fix an oops when deleting snapshots
2011-09-12 11:47:49 -07:00
Miao Xie
a39f752143 Btrfs: fix wrong nbytes information of the inode
If we write some data into the data hole of the file(no preallocation for this
hole), Btrfs will allocate some disk space, and update nbytes of the inode, but
the other element--disk_i_size needn't be updated. At this condition, we must
update inode metadata though disk_i_size is not changed(btrfs_ordered_update_i_size()
return 1).

 # mkfs.btrfs /dev/sdb1
 # mount /dev/sdb1 /mnt
 # touch /mnt/a
 # truncate -s 856002 /mnt/a
 # dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc
 # umount /mnt
 # btrfsck /dev/sdb1
 root 5 inode 257 errors 400
 found 32768 bytes used err is 1

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:25 -04:00
Miao Xie
5b397377e9 Btrfs: fix unclosed transaction handle in btrfs_cont_expand
The function - btrfs_cont_expand() forgot to close the transaction handle before
it jump out the while loop. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:24 -04:00
Sergei Trofimovich
e0b6d65be5 btrfs: fix warning in iput for bad-inode
iput() shouldn't be called for inodes in I_NEW state.
We need to mark inode as constructed first.

WARNING: at fs/inode.c:1309 iput+0x20b/0x210()
Call Trace:
 [<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff8103e805>] warn_slowpath_null+0x15/0x20
 [<ffffffff810eaf0b>] iput+0x20b/0x210
 [<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0
 [<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210
 [<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0
 [<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0
 [<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280
 [<ffffffff8105af86>] kthread+0x96/0xa0
 [<ffffffff814336d4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190
 [<ffffffff814336d0>] ? gs_change+0xb/0xb

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Konstantin Khlebnikov <khlebnikov@openvz.org>
Tested-by: David Sterba <dsterba@suse.cz>
CC: Josef Bacik <josef@redhat.com>
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:24 -04:00
Jeff Mahoney
cb6db4e576 btrfs: btrfs_permission's RO check shouldn't apply to device nodes
This patch tightens the read-only access checks in btrfs_permission to
 match the constraints in inode_permission. Currently, even though the
 device node itself will be unmodified, read-write access to device nodes
 is denied to when the device node resides on a read-only subvolume or a
 is a file that has been marked read-only by the btrfs conversion utility.

 With this patch applied, the check only affects regular files,
 directories, and symlinks. It also restructures the code a bit so that
 we don't duplicate the MAY_WRITE check for both tests.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-18 10:16:03 -04:00
Linus Torvalds
ed8f37370d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
  Btrfs: don't call writepages from within write_full_page
  Btrfs: Remove unused variable 'last_index' in file.c
  Btrfs: clean up for find_first_extent_bit()
  Btrfs: clean up for wait_extent_bit()
  Btrfs: clean up for insert_state()
  Btrfs: remove unused members from struct extent_state
  Btrfs: clean up code for merging extent maps
  Btrfs: clean up code for extent_map lookup
  Btrfs: clean up search_extent_mapping()
  Btrfs: remove redundant code for dir item lookup
  Btrfs: make acl functions really no-op if acl is not enabled
  Btrfs: remove remaining ref-cache code
  Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
  Btrfs: use wait_event()
  Btrfs: check the nodatasum flag when writing compressed files
  Btrfs: copy string correctly in INO_LOOKUP ioctl
  Btrfs: don't print the leaf if we had an error
  btrfs: make btrfs_set_root_node void
  Btrfs: fix oops while writing data to SSD partitions
  Btrfs: Protect the readonly flag of block group
  ...

Fix up trivial conflicts (due to acl and writeback cleanups) in
 - fs/btrfs/acl.c
 - fs/btrfs/ctree.h
 - fs/btrfs/extent_io.c
2011-08-02 21:14:05 -10:00
Linus Torvalds
1b8e94993c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set
  VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock
  VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree()
  VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree()
  switch posix_acl_chmod() to umode_t
  switch posix_acl_from_mode() to umode_t
  switch posix_acl_equiv_mode() to umode_t *
  switch posix_acl_create() to umode_t *
  block: initialise bd_super in bdget()
  vfs: avoid call to inode_lru_list_del() if possible
  vfs: avoid taking inode_hash_lock on pipes and sockets
  vfs: conditionally call inode_wb_list_del()
  VFS: Fix automount for negative autofs dentries
  Btrfs: load the key from the dir item in readdir into a fake dentry
  devtmpfs: missing initialialization in never-hit case
  hppfs: missing include
2011-08-01 13:48:31 -10:00
Jeff Mahoney
1bf85046e4 btrfs: Make extent-io callbacks that never fail return void
The set/clear bit and the extent split/merge hooks only ever return 0.

 Changing them to return void simplifies the error handling cases later.

 This patch changes the hook prototypes, the single implementation of each,
 and the functions that call them to return void instead.

 Since all four of these hooks execute under a spinlock, they're necessarily
 simple.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01 14:30:43 -04:00
Li Zefan
b6973aa622 Btrfs: fix readahead in file defrag
We passed the wrong value to btrfs_force_ra(). Fix this by changing
the argument of btrfs_force_ra() from last_index to nr_page.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01 14:30:42 -04:00
Tsutomu Itoh
b532402e4d Btrfs: return error to caller when btrfs_unlink() failes
When btrfs_unlink_inode() and btrfs_orphan_add() in btrfs_unlink()
are error, the error code is returned to the caller instead of
BUG_ON().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01 14:30:42 -04:00
Chris Mason
b43b31bdf2 Merge branch 'alloc_path' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/btrfs-error-handling into for-linus 2011-08-01 14:27:34 -04:00
Josef Bacik
b4aff1f874 Btrfs: load the key from the dir item in readdir into a fake dentry
In btrfs we have 2 indexes for inodes.  One is for readdir, it's in this nice
sequential order and works out brilliantly for readdir.  However if you use ls,
it usually stat's each file it gets from readdir.  This is where the second
index comes in, which is based on a hash of the name of the file.  So then the
lookup has to lookup this index, and then lookup the inode.  The index lookup is
going to be in random order (since its based on the name hash), which gives us
less than stellar performance.  Since we know the inode location from the
readdir index, I create a dummy dentry and copy the location key into
dentry->d_fsdata.  Then on lookup if we have d_fsdata we use that location to
lookup the inode, avoiding looking up the other directory index.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-08-01 01:31:42 -04:00
Linus Torvalds
22712200e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
  Btrfs: use the commit_root for reading free_space_inode crcs
  Btrfs: reduce extent_state lock contention for metadata
  Btrfs: remove lockdep magic from btrfs_next_leaf
  Btrfs: make a lockdep class for each root
  Btrfs: switch the btrfs tree locks to reader/writer
  Btrfs: fix deadlock when throttling transactions
  Btrfs: stop using highmem for extent_buffers
  Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
  Btrfs: tag pages for writeback in sync
  Btrfs: fix enospc problems with delalloc
  Btrfs: don't flush delalloc arbitrarily
  Btrfs: use find_or_create_page instead of grab_cache_page
  Btrfs: use a worker thread to do caching
  Btrfs: fix how we merge extent states and deal with cached states
  Btrfs: use the normal checksumming infrastructure for free space cache
  Btrfs: serialize flushers in reserve_metadata_bytes
  Btrfs: do transaction space reservation before joining the transaction
  Btrfs: try to only do one btrfs_search_slot in do_setxattr
2011-07-27 16:43:52 -07:00
Chris Mason
ff95acb673 Merge branch 'integration' into for-linus 2011-07-27 16:18:13 -04:00
Chris Mason
2cf8572dac Btrfs: use the commit_root for reading free_space_inode crcs
Now that we are using regular file crcs for the free space cache,
we can deadlock if we try to read the free_space_inode while we are
updating the crc tree.

This commit fixes things by using the commit_root to read the crcs.  This is
safe because we the free space cache file would already be loaded if
that block group had been changed in the current transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:48 -04:00
Chris Mason
a65917156e Btrfs: stop using highmem for extent_buffers
The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.

The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.

This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:45 -04:00
Josef Bacik
9e0baf60de Btrfs: fix enospc problems with delalloc
So I had this brilliant idea to use atomic counters for outstanding and reserved
extents, but this turned out to be a bad idea.  Consider this where we have 1
outstanding extent and 1 reserved extent

Reserver				Releaser
					atomic_dec(outstanding) now 0
atomic_read(outstanding)+1 get 1
atomic_read(reserved) get 1
don't actually reserve anything because
they are the same
					atomic_cmpxchg(reserved, 1, 0)
atomic_inc(outstanding)
atomic_add(0, reserved)
					free reserved space for 1 extent

Then the reserver now has no actual space reserved for it, and when it goes to
finish the ordered IO it won't have enough space to do it's allocation and you
get those lovely warnings.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:44 -04:00
Josef Bacik
a94733d0bc Btrfs: use find_or_create_page instead of grab_cache_page
grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
GFP_HIGHUSER_MOVABLE.  So instead use find_or_create_page in all cases where we
need GFP_NOFS so we don't deadlock.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-07-27 12:46:43 -04:00
Al Viro
569254b0cc btrfs: S_ISREG(mode) is not mode & S_IFREG...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-26 13:05:05 -04:00
Christoph Hellwig
4e34e719e4 fs: take the ACL checks to common code
Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss.  This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-25 14:30:23 -04:00
Al Viro
10d9f309d8 get rid of useless dget_parent() in btrfs rename() and link()
->d_parent is locked and stable there...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:48:00 -04:00
Al Viro
a9049376ee make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
... and simplify the living hell out of callers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:26 -04:00
Al Viro
0ee5dc676a btrfs: kill magical embedded struct superblock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:20 -04:00
Al Viro
10556cb21a ->permission() sanitizing: don't pass flags to ->permission()
not used by the instances anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:24 -04:00
Al Viro
2830ba7f34 ->permission() sanitizing: don't pass flags to generic_permission()
redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:22 -04:00
Al Viro
178ea73521 kill check_acl callback of generic_permission()
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:16 -04:00
Mark Fasheh
1748f843a0 btrfs: Don't BUG_ON alloc_path errors in btrfs_read_locked_inode
btrfs_iget() also needed an update so that errors from btrfs_locked_inode()
are caught and bubbled back up.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 14:14:45 -07:00
Mark Fasheh
0eb0e19cde btrfs: Don't BUG_ON alloc_path errors in btrfs_truncate_inode_items
I moved the path allocation up a few lines to the top of the function so
that we couldn't get into the state where we've dropped delayed items and
the extent cache but fail due to -ENOMEM.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 14:14:45 -07:00
Mark Fasheh
d8926bb3ba btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:

 - Callers of the function catch errors from it already so bubbling the
   error up will be handled.
 - Callers of the function might BUG_ON any nonzero return code in which
   case there is no behavior changed (but we still got to remove a BUG_ON)

The following functions were updated:

btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 14:14:44 -07:00