2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

695 Commits

Author SHA1 Message Date
Namjae Jeon
d0e1d66b5a writeback: remove nr_pages_dirtied arg from balance_dirty_pages_ratelimited_nr()
There is no reason to pass the nr_pages_dirtied argument, because
nr_pages_dirtied value from the caller is unused in
balance_dirty_pages_ratelimited_nr().

Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:21 -08:00
Linus Torvalds
f48d42773b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has our series of fixes for the next rc.  The biggest batch is
  from Jan Schmidt, fixing up some problems in our subvolume quota code
  and fixing btrfs send/receive to work with the new extended inode
  refs."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: do not bug when we fail to commit the transaction
  Btrfs: fix memory leak when cloning root's node
  Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
  Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
  Btrfs: fix deadlock caused by the nested chunk allocation
  btrfs: Return EINVAL when length to trim is less than FSB
  Btrfs: fix memory leak in btrfs_quota_enable()
  Btrfs: send correct rdev and mode in btrfs-send
  Btrfs: extended inode refs support for send mechanism
  Btrfs: Fix wrong error handling code
  Fix a sign bug causing invalid memory access in the ino_paths ioctl.
  Btrfs: comment for loop in tree_mod_log_insert_move
  Btrfs: fix extent buffer reference for tree mod log roots
  Btrfs: determine level of old roots
  Btrfs: tree mod log's old roots could still be part of the tree
  Btrfs: fix a tree mod logging issue for root replacement operations
  Btrfs: don't put removals from push_node_left into tree mod log twice
2012-10-26 09:34:04 -07:00
Josef Bacik
c37b2b6269 Btrfs: do not bug when we fail to commit the transaction
We BUG if we fail to commit the transaction when creating a snapshot, which
is just obnoxious.  Remove the BUG_ON().  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-25 15:59:57 -04:00
Lukas Czerner
e515c18bfe btrfs: Return EINVAL when length to trim is less than FSB
Currently if len argument in btrfs_ioctl_fitrim() is smaller than
one FSB we will continue and finally return 0 bytes discarded.
However if the length to discard is smaller then file system block
we should really return EINVAL.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2012-10-25 15:46:22 -04:00
Jeff Layton
4fa6b5ecbf audit: overhaul __audit_inode_child to accomodate retrying
In order to accomodate retrying path-based syscalls, we need to add a
new "type" argument to audit_inode_child. This will tell us whether
we're looking for a child entry that represents a create or a delete.

If we find a parent, don't automatically assume that we need to create a
new entry. Instead, use the information we have to try to find an
existing entry first. Update it if one is found and create a new one if
not.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:03 -04:00
Jeff Layton
c43a25abba audit: reverse arguments to audit_inode_child
Most of the callers get called with an inode and dentry in the reverse
order. The compiler then has to reshuffle the arg registers and/or
stack in order to pass them on to audit_inode_child.

Reverse those arguments for a micro-optimization.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:00 -04:00
Linus Torvalds
72055425e5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "This is a large pull, with the bulk of the updates coming from:

   - Hole punching

   - send/receive fixes

   - fsync performance

   - Disk format extension allowing more hardlinks inside a single
     directory (btrfs-progs patch required to enable the compat bit for
     this one)

  I'm cooking more unrelated RAID code, but I wanted to make sure this
  original batch makes it in.  The largest updates here are relatively
  old and have been in testing for some time."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (121 commits)
  btrfs: init ref_index to zero in add_inode_ref
  Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer
  Btrfs: fix page leakage
  Btrfs: do not warn_on when we cannot alloc a page for an extent buffer
  Btrfs: don't bug on enomem in readpage
  Btrfs: cleanup pages properly when ENOMEM in compression
  Btrfs: make filesystem read-only when submitting barrier fails
  Btrfs: detect corrupted filesystem after write I/O errors
  Btrfs: make compress and nodatacow mount options mutually exclusive
  btrfs: fix message printing
  Btrfs: don't bother committing delayed inode updates when fsyncing
  btrfs: move inline function code to header file
  Btrfs: remove unnecessary IS_ERR in bio_readpage_error()
  btrfs: remove unused function btrfs_insert_some_items()
  Btrfs: don't commit instead of overcommitting
  Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called
  Btrfs: be smarter about dropping things from the tree log
  Btrfs: don't lookup csums for prealloc extents
  Btrfs: cache extent state when writing out dirty metadata pages
  Btrfs: do not hold the file extent leaf locked when adding extent item
  ...
2012-10-10 10:49:20 +09:00
Stefan Behrens
5af3e8cce8 Btrfs: make filesystem read-only when submitting barrier fails
So far the return code of barrier_all_devices() is ignored, which
means that errors are ignored. The result can be a corrupt
filesystem which is not consistent.
This commit adds code to evaluate the return code of
barrier_all_devices(). The normal btrfs_error() mechanism is used to
switch the filesystem into read-only mode when errors are detected.

In order to decide whether barrier_all_devices() should return
error or success, the number of disks that are allowed to fail the
barrier submission is calculated. This calculation accounts for the
worst RAID level of metadata, system and data. If single, dup or
RAID0 is in use, a single disk error is already considered to be
fatal. Otherwise a single disk error is tolerated.

The calculation of the number of disks that are tolerated to fail
the barrier operation is performed when the filesystem gets mounted,
when a balance operation is started and finished, and when devices
are added or removed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-10-09 09:20:19 -04:00
Liu Bo
aa42ffd918 Btrfs: fix off-by-one in file clone
Btrfs uses inclusive range end for lock_extent(), unlock_extent() and
related functions, so we made off-by-one errors in file clone.

This fixes it and also fixes some style problems.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-08 20:07:32 -04:00
David Sterba
7e97b8daf6 btrfs: allow setting NOCOW for a zero sized file via ioctl
Hi,

the patch si simple, but it has user visible impact and I'm not quite sure how
to resolve it.

In short, $subj says it, chattr -C supports it and we want to use it.

The conditions that acutally allow to change the NOCOW flag are clear. What if
I try to set the flag on a file that is not empty? Options:

1) whole ioctl will fail, EINVAL
2.1) ioctl will succeed, the NOCOW flag will be silently removed, but the file
     will stay COW-ed and checksummed
2.2) ioctl will succeed, flag will not be removed and a syslog message will
     warn that the COW flag has not been changed
2.2.1) dtto, no syslog message

Man page of chattr states that

 "If it is set on a file which already has data blocks, it is undefined when
 the blocks assigned to the file will be fully stable."

Yes, it's undefined and with current implementation it'll never happen. So from
this end, the user cannot expect anything. I'm trying to find a reasonable
behaviour, so that a command like 'chattr -R -aijS +C' to tweak a broad set of
flags in a deep directory does not fail unnecessarily and does not pollute the
log.

My personal preference is 2.2.1, but my dev's oppinion is skewed, not counting
the fact that I know the code and otherwise would look there before consulting
the documentation.

The patch implements 2.2.1.

david

-------------8<-------------------
From: David Sterba <dsterba@suse.cz>

It's safe to turn off checksums for a zero sized file.

http://thread.gmane.org/gmane.comp.file-systems.btrfs/18030

"We cannot switch on NODATASUM for a file that already has extents that
are checksummed. The invariant here is that either all the extents or
none are checksummed.

Theoretically it's possible to add/remove all checksums from a given
file, but it's a potentially longtime operation, the file has to be in
some intermediate state where the checksums partially exist but have to
be ignored (for the csum->nocsum) until the file is fully converted,
this brings more special cases to extent handling, it has to survive
power failure and remain consistent, and probably needs to be restarted
after next mount."

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-04 09:40:00 -04:00
Linus Torvalds
aab174f0df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - big one - consolidation of descriptor-related logics; almost all of
   that is moved to fs/file.c

   (BTW, I'm seriously tempted to rename the result to fd.c.  As it is,
   we have a situation when file_table.c is about handling of struct
   file and file.c is about handling of descriptor tables; the reasons
   are historical - file_table.c used to be about a static array of
   struct file we used to have way back).

   A lot of stray ends got cleaned up and converted to saner primitives,
   disgusting mess in android/binder.c is still disgusting, but at least
   doesn't poke so much in descriptor table guts anymore.  A bunch of
   relatively minor races got fixed in process, plus an ext4 struct file
   leak.

 - related thing - fget_light() partially unuglified; see fdget() in
   there (and yes, it generates the code as good as we used to have).

 - also related - bits of Cyrill's procfs stuff that got entangled into
   that work; _not_ all of it, just the initial move to fs/proc/fd.c and
   switch of fdinfo to seq_file.

 - Alex's fs/coredump.c spiltoff - the same story, had been easier to
   take that commit than mess with conflicts.  The rest is a separate
   pile, this was just a mechanical code movement.

 - a few misc patches all over the place.  Not all for this cycle,
   there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
  MAX_LFS_FILESIZE should be a loff_t
  compat: fs: Generic compat_sys_sendfile implementation
  fs: push rcu_barrier() from deactivate_locked_super() to filesystems
  btrfs: reada_extent doesn't need kref for refcount
  coredump: move core dump functionality into its own file
  coredump: prevent double-free on an error path in core dumper
  usb/gadget: fix misannotations
  fcntl: fix misannotations
  ceph: don't abuse d_delete() on failure exits
  hypfs: ->d_parent is never NULL or negative
  vfs: delete surplus inode NULL check
  switch simple cases of fget_light to fdget
  new helpers: fdget()/fdput()
  switch o2hb_region_dev_write() to fget_light()
  proc_map_files_readdir(): don't bother with grabbing files
  make get_file() return its argument
  vhost_set_vring(): turn pollstart/pollstop into bool
  switch prctl_set_mm_exe_file() to fget_light()
  switch xfs_find_handle() to fget_light()
  switch xfs_swapext() to fget_light()
  ...
2012-10-02 20:25:04 -07:00
Linus Torvalds
437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Liu Bo
425d17a290 Btrfs: use larger limit for translation of logical to inode
This is the change of the kernel side.

Translation of logical to inode used to have an upper limit 4k on
inode container's size, but the limit is not large enough for a data
with a great many of refs, so when resolving logical address,
we can end up with
"ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493"

This changes to regard 64k as the upper limit and use vmalloc instead of
kmalloc to get memory more easily.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:19 -04:00
Liu Bo
df031f0752 Btrfs: use helper for logical resolve
We already have a helper, iterate_inodes_from_logical(), for logical resolve,
so just use it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:18 -04:00
Liu Bo
69917e4312 Btrfs: fix a bug in parsing return value in logical resolve
In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.

It is possible to lose our errors because
(-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.

I'm not sure if it is on purpose, it just looks too hacky if it is.
I'd rather use a real flag and a 'ret' to catch errors.

Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
2012-10-01 15:19:18 -04:00
Liu Bo
9e8a4a8b0b Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag
We're going to use this flag EXTENT_DEFRAG to indicate which range
belongs to defragment so that we can implement snapshow-aware defrag:

We set the EXTENT_DEFRAG flag when dirtying the extents that need
defragmented, so later on writeback thread can differentiate between
normal writeback and writeback started by defragmentation.

Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:15 -04:00
Miao Xie
48c03c4bcf Btrfs: fix wrong size for the reservation of the, snapshot creation
We should insert/update 6 items(root ref, root backref, dir item, dir index,
root item and parent inode) when creating a snapshot, not 5 items, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:12 -04:00
Miao Xie
66d8f3dd1c Btrfs: add a new "type" field into the block reservation structure
Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:11 -04:00
Josef Bacik
2671485d39 Btrfs: remove unused hint byte argument for btrfs_drop_extents
I audited all users of btrfs_drop_extents and found that nobody actually uses
the hint_byte argument.  I'm sure it was used for something at some point but
it's not used now, and the way the pinning works the disk bytenr would never be
immediately useful anyway so lets just remove it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:06 -04:00
Josef Bacik
5dc562c541 Btrfs: turbo charge fsync
At least for the vm workload.  Currently on fsync we will

1) Truncate all items in the log tree for the given inode if they exist

and

2) Copy all items for a given inode into the log

The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing.  This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them.  We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already.  Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write

		Original	Patched
SATA drive	82KB/s		140KB/s
Fusion drive	431KB/s		2532KB/s

So around 2-6 times faster depending on your hardware.  There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok.  This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get.  All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.

The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok.  I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:03 -04:00
Al Viro
2903ff019b switch simple cases of fget_light to fdget
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 22:20:08 -04:00
Al Viro
8319aa9127 switch btrfs_ioctl_clone() to fget_light()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:09 -04:00
Al Viro
ecd188159e switch btrfs_ioctl_snap_create_transid() to fget_light()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:07 -04:00
Eric W. Biederman
2f2f43d3c7 userns: Convert btrfs to use kuid/kgid where appropriate
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-09-21 03:13:31 -07:00
Linus Torvalds
318e151019 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I've split out the big send/receive update from my last pull request
  and now have just the fixes in my for-linus branch.  The send/recv
  branch will wander over to linux-next shortly though.

  The largest patches in this pull are Josef's patches to fix DIO
  locking problems and his patch to fix a crash during balance.  They
  are both well tested.

  The rest are smaller fixes that we've had queued.  The last rc came
  out while I was hacking new and exciting ways to recover from a
  misplaced rm -rf on my dev box, so these missed rc3."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
  Btrfs: fix that repair code is spuriously executed for transid failures
  Btrfs: fix ordered extent leak when failing to start a transaction
  Btrfs: fix a dio write regression
  Btrfs: fix deadlock with freeze and sync V2
  Btrfs: revert checksum error statistic which can cause a BUG()
  Btrfs: remove superblock writing after fatal error
  Btrfs: allow delayed refs to be merged
  Btrfs: fix enospc problems when deleting a subvol
  Btrfs: fix wrong mtime and ctime when creating snapshots
  Btrfs: fix race in run_clustered_refs
  Btrfs: don't run __tree_mod_log_free_eb on leaves
  Btrfs: increase the size of the free space cache
  Btrfs: barrier before waitqueue_active
  Btrfs: fix deadlock in wait_for_more_refs
  btrfs: fix second lock in btrfs_delete_delayed_items()
  Btrfs: don't allocate a seperate csums array for direct reads
  Btrfs: do not strdup non existent strings
  Btrfs: do not use missing devices when showing devname
  Btrfs: fix that error value is changed by mistake
  Btrfs: lock extents as we map them in DIO
  ...
2012-08-29 11:36:22 -07:00
Dan Carpenter
dadd1105ca Btrfs: fix some endian bugs handling the root times
"trans->transid" is cpu endian but we want to store the data as little
endian.  "item->ctime.nsec" is only 32 bits, not 64.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-08-28 16:53:26 -04:00
Alexander Block
e00da2067b Btrfs: remove mnt_want_write call in btrfs_mksubvol
We got a recursive lock in mksubvol because the caller already held
a lock. I think we got into this due to a merge error. Commit a874a63
removed the mnt_want_write call from btrfs_mksubvol and added a
replacement call to mnt_want_write_file in btrfs_ioctl_snap_create_transid.
Commit e7848683 however tried to move all calls to mnt_want_write above
i_mutex. So somewhere while merging this, it got mixed up. The
solution is to remove the mnt_want_write call completely from
mksubvol.

Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-09 11:01:54 -04:00
Linus Torvalds
a0e881b7c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs pile from Al Viro:
 "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
  deadlock reproduced by xfstests 068), symlink and hardlink restriction
  patches, plus assorted cleanups and fixes.

  Note that another fsfreeze deadlock (emergency thaw one) is *not*
  dealt with - the series by Fernando conflicts a lot with Jan's, breaks
  userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
  for massive vfsmount leak; this is going to be handled next cycle.
  There probably will be another pull request, but that stuff won't be
  in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  delousing target_core_file a bit
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  fs: Remove old freezing mechanism
  ext2: Implement freezing
  btrfs: Convert to new freezing mechanism
  nilfs2: Convert to new freezing mechanism
  ntfs: Convert to new freezing mechanism
  fuse: Convert to new freezing mechanism
  gfs2: Convert to new freezing mechanism
  ocfs2: Convert to new freezing mechanism
  xfs: Convert to new freezing code
  ext4: Convert to new freezing mechanism
  fs: Protect write paths by sb_start_write - sb_end_write
  fs: Skip atime update on frozen filesystem
  fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  fs: Improve filesystem freezing handling
  switch the protection of percpu_counter list to spinlock
  nfsd: Push mnt_want_write() outside of i_mutex
  btrfs: Push mnt_want_write() outside of i_mutex
  fat: Push mnt_want_write() outside of i_mutex
  ...
2012-08-01 10:26:23 -07:00
Jan Kara
e7848683ae btrfs: Push mnt_want_write() outside of i_mutex
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:51 +04:00
Chris Mason
113c1cb530 Merge branch 'send-v2' of git://github.com/ablock84/linux-btrfs into for-linus
This is the kernel portion of btrfs send/receive

Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/backref.h
	fs/btrfs/ctree.c
	fs/btrfs/ioctl.c
	fs/btrfs/ioctl.h

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 19:19:10 -04:00
Alexander Block
31db9f7c23 Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive
This patch introduces the BTRFS_IOC_SEND ioctl that is
required for send. It allows btrfs-progs to implement
full and incremental sends. Patches for btrfs-progs will
follow.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
2012-07-25 23:30:19 +02:00
Alexander Block
8ea05e3a42 Btrfs: introduce subvol uuids and times
This patch introduces uuids for subvolumes. Each
subvolume has it's own uuid. In case it was snapshotted,
it also contains parent_uuid. In case it was received,
it also contains received_uuid.

It also introduces subvolume ctime/otime/stime/rtime. The
first two are comparable to the times found in inodes. otime
is the origin/creation time and ctime is the change time.
stime/rtime are only valid on received subvolumes.
stime is the time of the subvolume when it was
sent. rtime is the time of the subvolume when it was
received.

Additionally to the times, we have a transid for each
time. They are updated at the same place as the times.

btrfs receive uses stransid and rtransid to find out
if a received subvolume changed in the meantime.

If an older kernel mounts a filesystem with the
extented fields, all fields become invalid. The next
mount with a new kernel will detect this and reset the
fields.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
2012-07-25 23:28:38 +02:00
Mitch Harder
2b0ce2c290 Btrfs: Check INCOMPAT flags on remount and add helper function
In support of the recently added capability to remount with lzo
compression, provide a helper function to check the compression
INCOMPAT flags when remounting with lzo compression, and set
the flags if necessary.

Also, implement the new helper function when defragmenting with
explicit lzo compression and when setting the default subvolume.

Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 16:14:31 -04:00
Chris Mason
b478b2baa3 Merge branch 'qgroup' of git://git.jan-o-sch.net/btrfs-unstable into for-linus
Conflicts:
	fs/btrfs/ioctl.c
	fs/btrfs/ioctl.h
	fs/btrfs/transaction.c
	fs/btrfs/transaction.h

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 16:11:38 -04:00
David Sterba
362a20c5e2 btrfs: allow cross-subvolume file clone
Lift the EXDEV condition and allow different root trees for files being
cloned, then pass source inode's root when searching for extents.
Cloning is not allowed to cross vfsmounts, ie. when two subvolumes from
one filesystem are mounted separately.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-07-25 17:33:09 +02:00
Liu Bo
b9ca0664dc Btrfs: do not set subvolume flags in readonly mode
$ mkfs.btrfs /dev/sdb7
$ btrfstune -S1 /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs
mount: block device /dev/sdb7 is write-protected, mounting read-only
$ btrfs dev add /dev/sdb8 /mnt/btrfs/

Now we get a btrfs in which mnt flags has readonly but sb flags does
not.  So for those ioctls that only check sb flags with MS_RDONLY, it
is going to be a problem.
Setting subvolume flags is such an ioctl, we should use mnt_want_write_file()
to check RO flags.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:58 -04:00
Liu Bo
e54bfa3104 Btrfs: use mnt_want_write_file instead of mnt_want_write
mnt_want_write_file is faster when file has been opened for write.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:57 -04:00
Liu Bo
768e9dfe82 Btrfs: remove redundant r/o check for superblock
mnt_want_write() and mnt_want_write_file() will check sb->s_flags with
MS_RDONLY, and we don't need to do it ourselves.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:56 -04:00
Liu Bo
a874a63e13 Btrfs: check write access to mount earlier while creating snapshots
Move check of write access to mount into upper functions so that we can
use mnt_want_write_file instead, which is faster than mnt_want_write.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:56 -04:00
David Sterba
b27f7c0c15 btrfs: join DEV_STATS ioctls to one
Commit c11d2c236c (Btrfs: add ioctl to get and reset the device
stats) introduced two ioctls doing almost the same thing distinguished
by just the ioctl number which encodes "do reset after read". I have
suggested

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html

to implement it via the ioctl args. This hasn't happen, and I think we
should use a more clean way to pass flags and should not waste ioctl
numbers.

CC: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.cz>
2012-07-23 15:41:40 -04:00
Andrew Mahone
a43a211133 btrfs: ignore unfragmented file checks in defrag when compression enabled - rebased
Rebased on btrfs-next and retested.

Inform should_defrag_range if BTRFS_DEFRAG_RANGE_COMPRESS is set. If so, skip
checks for adjacent extents and extent size when deciding whether to defrag,
as these can prevent an uncompressed and unfragmented file from being
compressed as requested.

Signed-off-by: Andrew Mahone <andrew.mahone@gmail.com>
2012-07-23 15:41:39 -04:00
Al Viro
11e62a8fab btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:43 +04:00
Arne Jansen
6f72c7e20d Btrfs: add qgroup inheritance
When creating a subvolume or snapshot, it is necessary
to initialize the qgroup account with a copy of some
other (tracking) qgroup. This patch adds parameters
to the ioctls to pass the information from which qgroup
to inherit.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-12 10:54:40 +02:00
Arne Jansen
5d13a37bd5 Btrfs: add qgroup ioctls
Ioctls to control the qgroup feature like adding and
removing qgroups and assigning qgroups.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-12 10:54:39 +02:00
Chris Mason
a8c4a33b98 Btrfs: cast devid to unsigned long long for printk %llu
Avoid warning in 32 bit machines

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-06-15 20:07:17 -04:00
Liu Bo
4e42ae1bdc Btrfs: do not resize a seeding device
Seeding devices are not supposed to change any more.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-06-15 11:42:26 -04:00
Li Zefan
6c282eb40e Btrfs: fix defrag regression
If a file has 3 small extents:

| ext1 | ext2 | ext3 |

Running "btrfs fi defrag" will only defrag the last two extents, if those
extent mappings hasn't been read into memory from disk.

This bug was introduced by commit 17ce6ef8d7
("Btrfs: add a check to decide if we should defrag the range")

The cause is, that commit looked into previous and next extents using
lookup_extent_mapping() only.

While at it, remove the code that checks the previous extent, since
it's sufficient to check the next extent.

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-06-14 21:30:55 -04:00
Josef Bacik
606686eeac Btrfs: use rcu to protect device->name
Al pointed out that we can just toss out the old name on a device and add a
new one arbitrarily, so anybody who uses device->name in printk could
possibly use free'd memory.  Instead of adding locking around all of this he
suggested doing it with RCU, so I've introduced a struct rcu_string that
does just that and have gone through and protected all accesses to
device->name that aren't under the uuid_mutex with rcu_read_lock().  This
protects us and I will use it for dealing with removing the device that we
used to mount the file system in a later patch.  Thanks,

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-14 21:29:16 -04:00
Chris Mason
1e20932a23 Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into for-linus
Conflicts:
	fs/btrfs/ulist.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-31 16:49:53 -04:00
Stefan Behrens
c11d2c236c Btrfs: add ioctl to get and reset the device stats
An ioctl interface is added to get the device statistic counters.
A second ioctl is added to atomically get and reset these counters.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-05-30 10:23:40 -04:00
Liu Bo
9ba1f6e44e Btrfs: do not do balance in readonly mode
In normal cases, we would not be allowed to do balance in RO mode.
However, when we're using a seeding device and adding another device to sprout,
things will change:

$ mkfs.btrfs /dev/sdb7
$ btrfstune -S 1 /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs -o ro
$ btrfs fi bal /mnt/btrfs   -----------------------> fail.
$ btrfs dev add /dev/sdb8 /mnt/btrfs
$ btrfs fi bal /mnt/btrfs   -----------------------> works!

It should not be designed as an exception, and we'd better add another check for
mnt flags.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:35 -04:00
Jim Meyering
a27202fbe9 Btrfs: NUL-terminate path buffer in DEV_INFO ioctl result
A device with name of length BTRFS_DEVICE_PATH_NAME_MAX or longer
would not be NUL-terminated in the DEV_INFO ioctl result buffer.

Signed-off-by: Jim Meyering <meyering@redhat.com>
2012-05-30 10:23:31 -04:00
Daniel J Blueman
2eec6c8102 Fix minor type issues
Address some minor type issues identified by sparse checker.

Signed-off-by: Daniel J Blueman <daniel@quora.org>
2012-05-30 10:23:30 -04:00
Josef Bacik
0c4d2d95d0 Btrfs: use i_version instead of our own sequence
We've been keeping around the inode sequence number in hopes that somebody
would use it, but nobody uses it and people actually use i_version which
serves the same purpose, so use i_version where we used the incore inode's
sequence number and that way the sequence is updated properly across the
board, and not just in file write.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:27 -04:00
Jan Schmidt
5581a51a59 Btrfs: don't set for_cow parameter for tree block functions
Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed
parameter for_cow = 1. In fact, these two functions should never mark
their tree modification operations as for_cow, because they can change
the number of blocks referenced by a tree.

Hence, we remove the extra for_cow parameter from these functions and
make them pass a zero down.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-26 12:17:53 +02:00
Stefan Behrens
99ba55ad69 Btrfs: fix btrfs_ioctl_dev_info() crash on missing device
When a filesystem is mounted with the degraded option, it is
possible that some of the devices are not there.
btrfs_ioctl_dev_info() crashs in this case because the device
name is a NULL pointer. This ioctl was only used for scrub.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-04-18 19:22:35 +02:00
Liu Bo
e1f041e14c Btrfs: update to the right index of defragment
When we use autodefrag, we forget to update the index which indicates
the last page we've dirty.  And we'll set dirty flags on a same set of
pages again and again.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29 09:57:45 -04:00
Liu Bo
66c2689226 Btrfs: do not bother to defrag an extent if it is a big real extent
$ mkfs.btrfs /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs/ -oautodefrag
$ dd if=/dev/zero of=/mnt/btrfs/foobar bs=4k count=10 oflag=direct 2>/dev/null
$ filefrag -v /mnt/btrfs/foobar
Filesystem type is: 9123683e
File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0     3072              10 eof
/mnt/btrfs/foobar: 1 extent found

Now we have a big real extent [0, 40960), but autodefrag will still defrag it.

$ sync
$ filefrag -v /mnt/btrfs/foobar
Filesystem type is: 9123683e
File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0     3082              10 eof
/mnt/btrfs/foobar: 1 extent found

So if we already find a big real extent, we're ok about that, just skip it.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29 09:57:45 -04:00
Liu Bo
17ce6ef8d7 Btrfs: add a check to decide if we should defrag the range
If our file's layout is as follows:
| hole | data1 | hole | data2 |

we do not need to defrag this file, because this file has holes and
cannot be merged into one extent.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29 09:57:45 -04:00
Liu Bo
1f12bd0632 Btrfs: fix the mismatch of page->mapping
commit 600a45e1d5
(Btrfs: fix deadlock on page lock when doing auto-defragment)
fixes the deadlock on page, but it also introduces another bug.

A page may have been truncated after unlock & lock.
So we need to find it again to get the right one.

And since we've held i_mutex lock, inode size remains unchanged and
we can drop isize overflow checks.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29 09:57:44 -04:00
Liu Bo
ecb8bea87d Btrfs: fix race between direct io and autodefrag
The bug is from running xfstests 209 with autodefrag.

The race is as follows:
       t1                       t2(autodefrag)
   direct IO
     invalidate pagecache
     dio(old data)             add_inode_defrag
     invalidate pagecache
   endio

   direct IO
     invalidate pagecache
                                run_defrag
                                  readpage(old data)
                                  set page dirty (old data)
     dio(new data, rewrite)
     invalidate pagecache (*)
     endio

t2(autodefrag) will get old data into pagecache via readpage and set
pagecache dirty.  Meanwhile, invalidate pagecache(*) will fail due to
dirty flags in pages.  So the old data may be flushed into disk by
flush thread, which will lead to data loss.

And so does the case of user defragment progs.

The patch fixes this race by holding i_mutex when we readpage and set page dirty.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29 09:57:44 -04:00
Chris Mason
98961a7e43 Merge git://git.jan-o-sch.net/btrfs-unstable into for-linus
Conflicts:
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28 20:33:40 -04:00
Jan Schmidt
7a3ae2f8c8 Btrfs: fix regression in scrub path resolving
In commit 4692cf58 we introduced new backref walking code for btrfs. This
assumes we're searching live roots, which requires a transaction context.
While scrubbing, however, we must not join a transaction because this could
deadlock with the commit path. Additionally, what scrub really wants to do
is resolving a logical address in the commit root it's currently checking.

This patch adds support for logical to path resolving on commit roots and
makes scrub use that.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-03-27 14:51:21 +02:00
Jeff Mahoney
79787eaab4 btrfs: replace many BUG_ONs with proper error handling
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
 progress but aims to handle most errors other than internal logic
 errors and ENOMEM more gracefully.

 This iteration prevents most crashes but can run into lockups with
 the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 11:52:54 +01:00
Mark Fasheh
ce598979be btrfs: Don't BUG_ON errors from btrfs_create_subvol_root()
This is called from only one place - create_subvol() which passes errors
safely back out to it's caller, btrfs_mksubvol where they are handled.

Additionally, btrfs_create_subvol_root() itself bug's needlessly from error
return of btrfs_update_inode(). Since create_subvol() was fixed to catch
errors we can bubble this one up too.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2012-03-22 01:45:36 +01:00
Jeff Mahoney
d0082371cf btrfs: drop gfp_t from lock_extent
lock_extent and unlock_extent are always called with GFP_NOFS, drop the
 argument and use GFP_NOFS consistently.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:35 +01:00
Linus Torvalds
855a85f704 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Quoth Chris:
 "This is later than I wanted because I got backed up running through
  btrfs bugs from the Oracle QA teams.  But they are all bug fixes that
  we've queued and tested since rc1.

  Nothing in particular stands out, this just reflects bug fixing and QA
  done in parallel by all the btrfs developers.  The most user visible
  of these is:

    Btrfs: clear the extent uptodate bits during parent transid failures

  Because that helps deal with out of date drives (say an iscsi disk
  that has gone away and come back).  The old code wasn't always
  properly retrying the other mirror for this type of failure."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
  Btrfs: fix compiler warnings on 32 bit systems
  Btrfs: increase the global block reserve estimates
  Btrfs: clear the extent uptodate bits during parent transid failures
  Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
  Btrfs: make sure we update latest_bdev
  Btrfs: improve error handling for btrfs_insert_dir_item callers
  Btrfs: be less strict on finding next node in clear_extent_bit
  Btrfs: fix a bug on overcommit stuff
  Btrfs: kick out redundant stuff in convert_extent_bit
  Btrfs: skip states when they does not contain bits to clear
  Btrfs: check return value of lookup_extent_mapping() correctly
  Btrfs: fix deadlock on page lock when doing auto-defragment
  Btrfs: fix return value check of extent_io_ops
  btrfs: honor umask when creating subvol root
  btrfs: silence warning in raid array setup
  btrfs: fix structs where bitfields and spinlock/atomic share 8B word
  btrfs: delalloc for page dirtied out-of-band in fixup worker
  Btrfs: fix memory leak in load_free_space_cache()
  btrfs: don't check DUP chunks twice
  Btrfs: fix trim 0 bytes after a device delete
  ...
2012-02-24 09:02:53 -08:00
Chris Mason
16780cabb8 Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23 10:43:45 -05:00
Miao Xie
600a45e1d5 Btrfs: fix deadlock on page lock when doing auto-defragment
When I ran xfstests circularly on a auto-defragment btrfs, the deadlock
happened.

Steps to reproduce:
[tty0]
 # export MOUNT_OPTIONS="-o autodefrag"
 # export TEST_DEV=<partition1>
 # export TEST_DIR=<mountpoint1>
 # export SCRATCH_DEV=<partition2>
 # export SCRATCH_MNT=<mountpoint2>
 # while [ 1 ]
 > do
 > ./check 091 127 263
 > sleep 1
 > done
[tty1]
 # while [ 1 ]
 > do
 > echo 3 > /proc/sys/vm/drop_caches
 > done

Several hours later, the test processes will hang on, and the deadlock will
happen on page lock.

The reason is that:
  Auto defrag task		Flush thread			Test task
				btrfs_writepages()
				  add ordered extent
				  (including page 1, 2)
				  set page 1 writeback
				  set page 2 writeback
				endio_fn()
				  end page 2 writeback
								release page 2
lock page 1
alloc and lock page 2
page 2 is not uptodate
  btrfs_readpage()
    start ordered extent()
    btrfs_writepages()
      try  to lock page 1

so deadlock happens.

Fix this bug by unlocking the page which is in writeback, and re-locking it
after the writeback end.

Signed-off-by: Miao Xie <miax@cn.fujitsu.com>
2012-02-16 17:23:16 +01:00
Linus Torvalds
67d2433ee7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix reservations in btrfs_page_mkwrite
  Btrfs: advance window_start if we're using a bitmap
  btrfs: mask out gfp flags in releasepage
  Btrfs: fix enospc error caused by wrong checks of the chunk
  Btrfs: do not defrag a file partially
  Btrfs: fix warning for 32-bit build of fs/btrfs/check-integrity.c
  Btrfs: use cluster->window_start when allocating from a cluster bitmap
  Btrfs: Check for NULL page in extent_range_uptodate
  btrfs: Fix busyloops in transaction waiting code
  Btrfs: make sure a bitmap has enough bytes
  Btrfs: fix uninit warning in backref.c
2012-01-28 17:00:19 -08:00
Liu Bo
7ec31b548a Btrfs: do not defrag a file partially
xfstests 218 complains that btrfs defrags a file partially:
 After: 1
 Write backwards sync, but contiguous - should defrag to 1 extent
 Before: 10
-After: 1
+After: 2

To fix this, we need to set max_to_defrag count properly.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:12 -05:00
Linus Torvalds
d65773b22b Merge branch 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  btrfs: take allocation of ->tree_root into open_ctree()
  btrfs: let ->s_fs_info point to fs_info, not root...
  btrfs: consolidate failure exits in btrfs_mount() a bit
  btrfs: make free_fs_info() call ->kill_sb() unconditional
  btrfs: merge free_fs_info() calls on fill_super failures
  btrfs: kill pointless reassignment of ->s_fs_info in btrfs_fill_super()
  btrfs: make open_ctree() return int
  btrfs: sanitizing ->fs_info, part 5
  btrfs: sanitizing ->fs_info, part 4
  btrfs: sanitizing ->fs_info, part 3
  btrfs: sanitizing ->fs_info, part 2
  btrfs: sanitizing ->fs_info, part 1
  btrfs: fix a deadlock in btrfs_scan_one_device()
  btrfs: fix mount/umount race
  btrfs: get ->kill_sb() of its own
  btrfs: preparation to fixing mount/umount race
2012-01-17 15:52:51 -08:00
Linus Torvalds
f9156c7288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
  Btrfs: use larger system chunks
  Btrfs: add a delalloc mutex to inodes for delalloc reservations
  Btrfs: space leak tracepoints
  Btrfs: protect orphan block rsv with spin_lock
  Btrfs: add allocator tracepoints
  Btrfs: don't call btrfs_throttle in file write
  Btrfs: release space on error in page_mkwrite
  Btrfs: fix btrfsck error 400 when truncating a compressed
  Btrfs: do not use btrfs_end_transaction_throttle everywhere
  Btrfs: add balance progress reporting
  Btrfs: allow for resuming restriper after it was paused
  Btrfs: allow for canceling restriper
  Btrfs: allow for pausing restriper
  Btrfs: add skip_balance mount option
  Btrfs: recover balance on mount
  Btrfs: save balance parameters to disk
  Btrfs: soft profile changing mode (aka soft convert)
  Btrfs: implement online profile changing
  Btrfs: do not reduce profile in do_chunk_alloc()
  Btrfs: virtual address space subset filter
  ...

Fix up trivial conflict in fs/btrfs/ioctl.c due to the use of the new
mnt_drop_write_file() helper.
2012-01-17 15:49:54 -08:00
Josef Bacik
f248679e86 Btrfs: add a delalloc mutex to inodes for delalloc reservations
I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
that and theres no real way to get rid of those, so just stop using i_mutex to
protect delalloc metadata reservations and use a delalloc mutex instead.  This
shouldn't be contended often at all, only if you are writing and mmap writing to
the file at the same time.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-01-16 15:29:43 -05:00
Chris Mason
9785dbdf26 Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into integration 2012-01-16 15:26:31 -05:00
Chris Mason
d756bd2d93 Merge branch 'for-chris' of git://repo.or.cz/linux-btrfs-devel into integration
Conflicts:
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16 15:26:17 -05:00
Ilya Dryomov
19a39dce3b Btrfs: add balance progress reporting
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:49 +02:00
Ilya Dryomov
de322263d3 Btrfs: allow for resuming restriper after it was paused
Recognize BTRFS_BALANCE_RESUME flag passed from userspace.  We use the
same heuristics used when recovering balance after a crash to try to
start where we left off last time.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:49 +02:00
Ilya Dryomov
a7e99c691a Btrfs: allow for canceling restriper
Implement an ioctl for canceling restriper.  Currently we wait until
relocation of the current block group is finished, in future this can be
done by triggering a commit.  Balance item is deleted and no memory
about the interrupted balance is kept.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:49 +02:00
Ilya Dryomov
837d5b6e46 Btrfs: allow for pausing restriper
Implement an ioctl for pausing restriper.  This pauses the relocation,
but balance is still considered to be "in progress": balance item is
not deleted, other volume operations cannot be started, etc.  If paused
in the middle of profile changing operation we will continue making
allocations with the target profile.

Add a hook to close_ctree() to pause restriper and free its data
structures on unmount.  (It's safe to unmount when restriper is in
"paused" state, we will resume with the same parameters on the next
mount)

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:49 +02:00
Ilya Dryomov
f43ffb60fd Btrfs: add basic infrastructure for selective balancing
This allows to have a separate set of filters for each chunk type
(data,meta,sys).  The code however is generic and switch on chunk type
is only done once.

This commit also adds a type filter: it allows to balance for example
meta and system chunks w/o touching data ones.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov
c9e9f97bdf Btrfs: add basic restriper infrastructure
Add basic restriper infrastructure: extended balancing ioctl and all
related ioctl data structures, add data structure for tracking
restriper's state to fs_info, etc.  The semantics of the old balancing
ioctl are fully preserved.

Explicitly disallow any volume operations when balance is in progress.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Li Zefan
4da6f1a332 Btrfs: reserve metadata space in btrfs_ioctl_setflags()
Check and reserve space for btrfs_update_inode().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11 10:26:39 +08:00
Li Zefan
f062abf089 Btrfs: remove BUG_ON()s in btrfs_ioctl_setflags()
We can recover from errors and return -errno to user space.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11 10:26:38 +08:00
Al Viro
815745cf3e btrfs: let ->s_fs_info point to fs_info, not root...
the latter can be obtained from the former (by looking as ->tree_root)
just as cheaply as we currently are doing the other way round.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-08 19:35:37 -05:00
Jan Schmidt
4692cf58aa Btrfs: new backref walking code
The old backref iteration code could only safely be used on commit roots.
Besides this limitation, it had bugs in finding the roots for these
references. This commit replaces large parts of it by btrfs_find_all_roots()
which a) really finds all roots and the correct roots, b) works correctly
under heavy file system load, c) considers delayed refs.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-01-05 10:49:43 +01:00
Al Viro
2a79f17e4a vfs: mnt_drop_write_file()
new helper (wrapper around mnt_drop_write()) to be used in pair with
mnt_want_write_file().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:40 -05:00
Al Viro
a561be7100 switch a bunch of places to mnt_want_write_file()
it's both faster (in case when file has been opened for write) and cleaner.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:35 -05:00
Arne Jansen
66d7e7f09f Btrfs: mark delayed refs as for cow
Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
from every call site. The for_cow parameter will later on be used to
determine if a ref will change anything with respect to qgroups.

Delayed refs coming from relocation are always counted as for_cow, as they
don't change subvol quota.

Also pass in the fs_info for later use.

btrfs_find_all_roots() will use this as an optimization, as changes that are
for_cow will not change anything with respect to which root points to a
certain leaf. Thus, we don't need to add the current sequence number to
those delayed refs.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-12-22 16:22:27 +01:00
Chris Mason
567a45e917 Merge branch 'for-chris' of http://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into integration
Conflicts:
	fs/btrfs/inode.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 13:43:49 -05:00
Josef Bacik
660d3f6cde Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved.  This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such.  The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for.  The other problem is with our error handling
in the reservation code.  There are two cases that we need to deal with

1) We raced with free.  In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation.  However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use.  So as it stands the code was
doing this fine and it worked out, except in case #2

2) We don't race with free.  Nobody comes in and changes anything, and our
reservation fails.  In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything.  So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.

Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex.  We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things.  With this patch my space leak scripts no longer
scream bloody murder.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15 11:04:22 -05:00
Li Zefan
306424cc88 Btrfs: fix ctime update of on-disk inode
To reproduce the bug:

    # touch /mnt/tmp
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:23.412105981 +0800
    # chattr +i /mnt/tmp
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:43.198105295 +0800
    # umount /mnt
    # mount /dev/loop1 /mnt
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:23.412105981 +0800

We should update ctime of in-memory inode before calling
btrfs_update_inode().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 10:50:37 -05:00
Mike Fleetwood
ece7d20e8b Btrfs: Don't error on resizing FS to same size
It seems overly harsh to fail a resize of a btrfs file system to the
same size when a shrink or grow would succeed.  User app GParted trips
over this error.  Allow it by bypassing the shrink or grow operation.

Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
2011-11-30 18:46:04 +01:00
Arnd Hannemann
5bb1468238 Btrfs: prefix resize related printks with btrfs:
For the user it is confusing to find something like:
[10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
in kernel log, because it doesn't point directly to btrfs.

This patch prefixes those messages with "btrfs:" like other btrfs
related printks.

Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:16 -05:00
Jeff Mahoney
745c4d8e16 btrfs: Fix up 32/64-bit compatibility for new ioctls
This patch casts to unsigned long before casting to a pointer and fixes
 the following warnings:
fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:13 -05:00
Chris Mason
740c3d226c Btrfs: fix the new inspection ioctls for 32 bit compat
The new ioctls to follow backrefs are not clean for 32/64 bit
compat.  This reworks them for u64s everywhere.  They are brand new, so
there are no problems with changing the interface now.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:08:49 -05:00
Chris Mason
806468f8bf Merge git://git.jan-o-sch.net/btrfs-unstable into integration
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:07:10 -05:00
David Sterba
6c41761fc6 btrfs: separate superblock items out of fs_info
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-11-06 03:04:01 -05:00
David Sterba
a81d3b1ba2 Merge branch 'hotfixes-20111024/josef/for-chris' into btrfs-next-stable 2011-10-24 14:47:58 +02:00
David Sterba
afd582ac8f Merge remote-tracking branch 'remotes/josef/for-chris' into btrfs-next-stable 2011-10-24 14:47:57 +02:00
Lukas Czerner
f4c697e640 btrfs: return EINVAL if start > total_bytes in fitrim ioctl
We should retirn EINVAL if the start is beyond the end of the file
system in the btrfs_ioctl_fitrim(). Fix that by adding the appropriate
check for it.

Also in the btrfs_trim_fs() it is possible that len+start might overflow
if big values are passed. Fix it by decrementing the len so that start+len
is equal to the file system size in the worst case.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2011-10-20 18:10:40 +02:00
Li Zefan
008873eafb Btrfs: honor extent thresh during defragmentation
We won't defrag an extent, if it's bigger than the threshold we
specified and there's no small extent before it, but actually
the code doesn't work this way.

There are three bugs:

- When should_defrag_range() decides we should keep on defragmenting
  an extent, last_len is not incremented. (old bug)

- The length that passes to should_defrag_range() is not the length
  we're going to defrag. (new bug)

- We always defrag 256K bytes data, and a big extent can be part of
  this range. (new bug)

For a file with 4 extents:

        | 4K | 4K | 256K | 256K |

The result of defrag with (the default) 256K extent thresh should be:

        | 264K | 256K |

but with those bugs, we'll get:

        | 520K |

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:39 +02:00
Li Zefan
5ca496604b Btrfs: fix wrong max_to_defrag in btrfs_defrag_file()
It's off-by-one, and thus we may skip the last page while defragmenting.

An example case:

  # create /mnt/file with 2 4K file extents
  # btrfs fi defrag /mnt/file
  # sync
  # filefrag /mnt/file
  /mnt/file: 2 extents found

So it's not defragmented.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:37 +02:00
Li Zefan
151a31b25e Btrfs: use i_size_read() in btrfs_defrag_file()
Don't use inode->i_size directly, since we're not holding i_mutex.

This also fixes another bug, that i_size can change after it's checked
against 0 and then (i_size - 1) can be negative.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:35 +02:00
Li Zefan
cbcc83265d Btrfs: fix defragmentation regression
There's an off-by-one bug:

  # create a file with lots of 4K file extents
  # btrfs fi defrag /mnt/file
  # sync
  # filefrag -v /mnt/file
  Filesystem type is: 9123683e
  File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
   ext logical physical expected length flags
     0       0     3372              64
     1      64     3136     3435      1
     2      65     3436     3136     64
     3     129     3201     3499      1
     4     130     3500     3201     64
     5     194     3266     3563      1
     6     195     3564     3266     64
     7     259     3331     3627      1
     8     260     3628     3331     40 eof

After this patch:

  ...
  # filefrag -v /mnt/file
  Filesystem type is: 9123683e
  File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
   ext logical physical expected length flags
     0       0     3372             300 eof
  /mnt/file: 1 extent found

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:34 +02:00
Diego Calleja
60ccf82f5b btrfs: fix memory leak in btrfs_defrag_file
kmemleak found this:
unreferenced object 0xffff8801b64af968 (size 512):
  comm "btrfs-cleaner", pid 3317, jiffies 4306810886 (age 903.272s)
  hex dump (first 32 bytes):
    00 82 01 07 00 ea ff ff c0 83 01 07 00 ea ff ff  ................
    80 82 01 07 00 ea ff ff c0 87 01 07 00 ea ff ff  ................
  backtrace:
    [<ffffffff816875cc>] kmemleak_alloc+0x5c/0xc0
    [<ffffffff8114aec3>] kmem_cache_alloc_trace+0x163/0x240
    [<ffffffff8127a290>] btrfs_defrag_file+0xf0/0xb20
    [<ffffffff8125d9a5>] btrfs_run_defrag_inodes+0x165/0x210
    [<ffffffff812479d7>] cleaner_kthread+0x177/0x190
    [<ffffffff81075c7d>] kthread+0x8d/0xa0
    [<ffffffff816af5f4>] kernel_thread_helper+0x4/0x10
    [<ffffffffffffffff>] 0xffffffffffffffff

"pages" is not always freed. Fix it removing the unnecesary additional return.

Signed-off-by: Diego Calleja <diegocg@gmail.com>
2011-10-20 18:10:33 +02:00
Josef Bacik
e27425d614 Btrfs: only inherit btrfs specific flags when creating files
Xfstests 79 was failing because we were inheriting the S_APPEND flag when we
weren't supposed to.  There isn't any specific documentation on this so I'm
taking the test as the standard of how things work, and having S_APPEND set on a
directory doesn't mean that S_APPEND gets inherited by its children according to
this test.  So only inherit btrfs specific things.  This will let us set
compress/nocompress on specific directories and everything in the directories
will inherit this flag, same with nodatacow.  With this patch test 79 passes.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:50 -04:00
Josef Bacik
3b16a4e3c3 Btrfs: use the inode's mapping mask for allocating pages
Johannes pointed out we were allocating only kernel pages for doing writes,
which is kind of a big deal if you are on 32bit and have more than a gig of ram.
So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
don't re-enter.  Thanks,

Reported-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:45 -04:00
Linus Torvalds
b2f9452bd5 Merge branch 'btrfs-3.0' of git://github.com/chrismason/linux
* 'btrfs-3.0' of git://github.com/chrismason/linux:
  Btrfs: make sure not to defrag extents past i_size
  Btrfs: fix recursive auto-defrag
2011-10-13 18:20:40 +12:00
Chris Mason
f7f43cc841 Btrfs: make sure not to defrag extents past i_size
The btrfs file defrag code will loop through the extents and
force COW on them.  But there is a concurrent truncate in the middle of
the defrag, it might end up defragging the same range over and over
again.

The problem is that writepage won't go through and do anything on pages
past i_size, so the cow won't happen, so the file will appear to still
be fragmented.  defrag will end up hitting the same extents again and
again.

In the worst case, the truncate can actually live lock with the defrag
because the defrag keeps creating new ordered extents which the truncate
code keeps waiting on.

The fix here is to make defrag check for i_size inside the main loop,
instead of just once before the looping starts.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-11 11:45:55 -04:00
Li Zefan
2a0f7f5769 Btrfs: fix recursive auto-defrag
Follow those steps:

  # mount -o autodefrag /dev/sda7 /mnt
  # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
  # sync
  # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc

and then it'll go into a loop: writeback -> defrag -> writeback ...

It's because writeback writes [8K, 200K] and then writes [0, 8K].

I tried to make writeback know if the pages are dirtied by defrag,
but the patch was a bit intrusive. Here I simply set writeback_index
when we defrag a file.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-10 15:43:34 -04:00
Jan Schmidt
d7728c960d btrfs: new ioctls to do logical->inode and inode->path resolving
these ioctls make use of the new functions initially added for scrub. they
return all inodes belonging to a logical address (BTRFS_IOC_LOGICAL_INO) and
all paths belonging to an inode (BTRFS_IOC_INO_PATHS).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:28 +02:00
Chris Mason
0a7a0519d1 Merge branch 'btrfs-3.0' into for-linus 2011-09-20 14:49:29 -04:00
Sage Weil
b6f3409b21 Btrfs: reserve sufficient space for ioctl clone
Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We
need to reserve space for:

 - adjusting the old extent (possibly splitting it)
 - adding the new extent
 - updating the inode

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-20 14:48:51 -04:00
Chris Mason
2cf4ce7c2a Merge branch 'btrfs-3.0' into for-linus 2011-09-18 10:31:44 -04:00
Li Zefan
dde820fbf7 Btrfs: don't change inode flag of the dest clone file
The dst file will have the same inode flags with dst file after
file clone, and I think it's unexpected.

For example, the dst file will suddenly become immutable after
getting some share of data with src file, if the src is immutable.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:20:46 -04:00
Li Zefan
0e7b824c4e Btrfs: don't make a file partly checksummed through file clone
To reproduce the bug:

  # mount /dev/sda7 /mnt
  # dd if=/dev/zero of=/mnt/src bs=4K count=1
  # umount /mnt

  # mount -o nodatasum /dev/sda7 /mnt
  # dd if=/dev/zero of=/mnt/dst bs=4K count=1
  # clone_range -s 4K -l 4K /mnt/src /mnt/dst

  # echo 3 > /proc/sys/vm/drop_caches
  # cat /mnt/dst
  # dmesg
  ...
  btrfs no csum found for inode 258 start 0
  btrfs csum failed ino 258 off 0 csum 2566472073 private 0

It's because part of the file is checksummed and the other part is not,
and then btrfs will complain checksum is not found when we read the file.

Disallow file clone if src and dst file have different checksum flag,
so we ensure a file is completely checksummed or unchecksummed.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:20:46 -04:00
Li Zefan
71ef078610 Btrfs: fix pages truncation in btrfs_ioctl_clone()
It's a bug in commit f81c9cdc56
(Btrfs: truncate pages from clone ioctl target range)

We should pass the dest range to the truncate function, but not the
src range.

Also move the function before locking extent state.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:20:46 -04:00
Linus Torvalds
0b001b2eda Merge branch 'for-linus' of git://github.com/chrismason/linux
* 'for-linus' of git://github.com/chrismason/linux:
  Btrfs: add dummy extent if dst offset excceeds file end in
  Btrfs: calc file extent num_bytes correctly in file clone
  btrfs: xattr: fix attribute removal
  Btrfs: fix wrong nbytes information of the inode
  Btrfs: fix the file extent gap when doing direct IO
  Btrfs: fix unclosed transaction handle in btrfs_cont_expand
  Btrfs: fix misuse of trans block rsv
  Btrfs: reset to appropriate block rsv after orphan operations
  Btrfs: skip locking if searching the commit root in csum lookup
  btrfs: fix warning in iput for bad-inode
  Btrfs: fix an oops when deleting snapshots
2011-09-12 11:47:49 -07:00
Li Zefan
d525e8ab02 Btrfs: add dummy extent if dst offset excceeds file end in
You can see there's no file extent with range [0, 4096]. Check this by
btrfsck:

 # btrfsck /dev/sda7
 root 5 inode 258 errors 100
 ...

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:25 -04:00
Li Zefan
d72c0842ff Btrfs: calc file extent num_bytes correctly in file clone
num_bytes should be 4096 not 12288.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:25 -04:00
Chris Mason
81d86e1b70 Merge branch 'btrfs-3.0' into for-linus 2011-08-18 10:38:03 -04:00
Sage Weil
f81c9cdc56 Btrfs: truncate pages from clone ioctl target range
We need to truncate page cache pages for the clone ioctl target range or
else we'll confuse ourselves to no end.  If the old data was cached, we
used to still see it (until remount).  If the page was partially updated
we used to get a mix of old and new data.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-16 21:09:31 -04:00
Linus Torvalds
ed8f37370d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
  Btrfs: don't call writepages from within write_full_page
  Btrfs: Remove unused variable 'last_index' in file.c
  Btrfs: clean up for find_first_extent_bit()
  Btrfs: clean up for wait_extent_bit()
  Btrfs: clean up for insert_state()
  Btrfs: remove unused members from struct extent_state
  Btrfs: clean up code for merging extent maps
  Btrfs: clean up code for extent_map lookup
  Btrfs: clean up search_extent_mapping()
  Btrfs: remove redundant code for dir item lookup
  Btrfs: make acl functions really no-op if acl is not enabled
  Btrfs: remove remaining ref-cache code
  Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
  Btrfs: use wait_event()
  Btrfs: check the nodatasum flag when writing compressed files
  Btrfs: copy string correctly in INO_LOOKUP ioctl
  Btrfs: don't print the leaf if we had an error
  btrfs: make btrfs_set_root_node void
  Btrfs: fix oops while writing data to SSD partitions
  Btrfs: Protect the readonly flag of block group
  ...

Fix up trivial conflicts (due to acl and writeback cleanups) in
 - fs/btrfs/acl.c
 - fs/btrfs/ctree.h
 - fs/btrfs/extent_io.c
2011-08-02 21:14:05 -10:00
Li Zefan
77906a5075 Btrfs: copy string correctly in INO_LOOKUP ioctl
Memory areas [ptr, ptr+total_len] and [name, name+total_len]
may overlap, so it's wrong to use memcpy().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01 14:30:45 -04:00
Linus Torvalds
22712200e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
  Btrfs: use the commit_root for reading free_space_inode crcs
  Btrfs: reduce extent_state lock contention for metadata
  Btrfs: remove lockdep magic from btrfs_next_leaf
  Btrfs: make a lockdep class for each root
  Btrfs: switch the btrfs tree locks to reader/writer
  Btrfs: fix deadlock when throttling transactions
  Btrfs: stop using highmem for extent_buffers
  Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
  Btrfs: tag pages for writeback in sync
  Btrfs: fix enospc problems with delalloc
  Btrfs: don't flush delalloc arbitrarily
  Btrfs: use find_or_create_page instead of grab_cache_page
  Btrfs: use a worker thread to do caching
  Btrfs: fix how we merge extent states and deal with cached states
  Btrfs: use the normal checksumming infrastructure for free space cache
  Btrfs: serialize flushers in reserve_metadata_bytes
  Btrfs: do transaction space reservation before joining the transaction
  Btrfs: try to only do one btrfs_search_slot in do_setxattr
2011-07-27 16:43:52 -07:00
Josef Bacik
9e0baf60de Btrfs: fix enospc problems with delalloc
So I had this brilliant idea to use atomic counters for outstanding and reserved
extents, but this turned out to be a bad idea.  Consider this where we have 1
outstanding extent and 1 reserved extent

Reserver				Releaser
					atomic_dec(outstanding) now 0
atomic_read(outstanding)+1 get 1
atomic_read(reserved) get 1
don't actually reserve anything because
they are the same
					atomic_cmpxchg(reserved, 1, 0)
atomic_inc(outstanding)
atomic_add(0, reserved)
					free reserved space for 1 extent

Then the reserver now has no actual space reserved for it, and when it goes to
finish the ordered IO it won't have enough space to do it's allocation and you
get those lovely warnings.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:44 -04:00
Josef Bacik
a94733d0bc Btrfs: use find_or_create_page instead of grab_cache_page
grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
GFP_HIGHUSER_MOVABLE.  So instead use find_or_create_page in all cases where we
need GFP_NOFS so we don't deadlock.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-07-27 12:46:43 -04:00
Al Viro
2fbe8c8ad1 get rid of useless dget_parent() in fs/btrfs/ioctl.c
both callers there have dentry->d_parent stabilized by the fact that
their caller had obtained dentry from lookup_one_len() and had not
dropped ->i_mutex on parent since then.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:48:00 -04:00
Josef Bacik
8351583e3f Btrfs: protect the pending_snapshots list with trans_lock
Currently there is nothing protecting the pending_snapshots list on the
transaction.  We only hold the directory mutex that we are snapshotting and a
read lock on the subvol_sem, so we could race with somebody else creating a
snapshot in a different directory and end up with list corruption.  So protect
this list with the trans_lock.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-06-15 13:24:46 -04:00
Li Zefan
027ed2f004 Btrfs: avoid stack bloat in btrfs_ioctl_fs_info()
The size of struct btrfs_ioctl_fs_info_args is as big as 1KB, so
don't declare the variable on stack.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-10 18:57:10 -04:00
David Sterba
a4689d2bd3 btrfs: use btrfs_ino to access inode number
commit 4cb5300bc ("Btrfs: add mount -o auto_defrag") accesses inode
number directly while it should use the helper with the new inode
number allocator.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:46 -04:00
Chris Mason
ff5714cca9 Merge branch 'for-chris' of
git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

Conflicts:
	fs/btrfs/disk-io.c
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-28 07:00:39 -04:00
Chris Mason
4cb5300bc8 Btrfs: add mount -o auto_defrag
This will detect small random writes into files and
queue the up for an auto defrag process.  It isn't well suited to
database workloads yet, but works for smaller files such as rpm, sqlite
or bdb databases.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-26 17:52:15 -04:00
Chris Mason
d6c0cb379c Merge branch 'cleanups_and_fixes' into inode_numbers
Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 14:37:47 -04:00
Xiao Guangrong
1f78160ce1 Btrfs: using rcu lock in the reader side of devices list
fs_devices->devices is only updated on remove and add device paths, so we can
use rcu to protect it in the reader side

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:43 -04:00
Hugo Mills
e215686715 btrfs: Ensure the tree search ioctl returns the right number of records
Btrfs's tree search ioctl has a field to indicate that no more than a
given number of records should be returned. The ioctl doesn't honour
this, as the tested value is not incremented until the end of the
copy_to_sk function. This patch removes an unnecessary local variable,
and updates the num_found counter as each key is found in the tree.

Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:05:39 -04:00
Josef Bacik
d82a6f1d7e Btrfs: kill BTRFS_I(inode)->block_group
Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything usefull.  In addition to be completely useless, when we initialize
an inode we try and find a freeish block group to set as the inodes block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable.  So just axe the thing altoghether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:12 -04:00
Josef Bacik
a4abeea41a Btrfs: kill trans_mutex
We use trans_mutex for lots of things, here's a basic list

1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists

Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction.  So replace the
trans_mutex with a trans_lock spinlock and use it to do the following

1) Protect fs_info->running_transaction.  All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list.  This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list.  This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff.  This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join.  Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work.  So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.

I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:57 -04:00
Josef Bacik
7a7eaa40a3 Btrfs: take away the num_items argument from btrfs_join_transaction
I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :).  So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:56 -04:00
Chris Mason
712673339a Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/ctree.h
	fs/btrfs/volumes.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 06:30:52 -04:00
Chris Mason
945d8962ce Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers
Conflicts:
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/tree-log.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:33:42 -04:00
Chris Mason
dcc6d07322 Merge branch 'delayed_inode' into inode_numbers
Conflicts:
	fs/btrfs/inode.c
	fs/btrfs/ioctl.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 07:07:01 -04:00
Miao Xie
16cdcec736 btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:30:56 -04:00
Chris Mason
0965537308 Merge branch 'ino-alloc' of git://repo.or.cz/linux-btrfs-devel into inode_numbers
Conflicts:
	fs/btrfs/free-space-cache.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:27:38 -04:00
Linus Torvalds
eed631e0d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix FS_IOC_SETFLAGS ioctl
  Btrfs: fix FS_IOC_GETFLAGS ioctl
  fs: remove FS_COW_FL
  Btrfs: fix easily get into ENOSPC in mixed case
  Prevent oopsing in posix_acl_valid()
2011-05-15 10:22:10 -07:00
Li Zefan
ebcb904dfe Btrfs: fix FS_IOC_SETFLAGS ioctl
Steps to reproduce the bug:

  - Call FS_IOC_SETLFAGS ioctl with flags=FS_COMPR_FL
  - Call FS_IOC_SETFLAGS ioctl with flags=0
  - Call FS_IOC_GETFLAGS ioctl, and you'll see FS_COMPR_FL is still set!

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:28 -04:00
Li Zefan
d0092bdda8 Btrfs: fix FS_IOC_GETFLAGS ioctl
As we've added per file compression/cow support.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:27 -04:00
Li Zefan
e1e8fb6a1f fs: remove FS_COW_FL
FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file
COW in btrfs, but FS_NOCOW_FL is sufficient.

The fact is we don't have corresponding BTRFS_INODE_COW flag.

COW is default, and FS_NOCOW_FL can be used to switch off COW for
a single file.

If we mount btrfs with nodatacow, a newly created file will be set with
the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
FS_NOCOW_FL flag.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:26 -04:00
Arne Jansen
8628764e1a btrfs: add readonly flag
setting the readonly flag prevents writes in case an error is detected

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:48:31 +02:00
Jan Schmidt
475f63874d btrfs: new ioctls for scrub
adds ioctls necessary to start and cancel scrubs, to get current
progress and to get info about devices to be scrubbed.
Note that the scrub is done per-device and that the ioctl only
returns after the scrub for this devices is finished or has been
canceled.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:45:38 +02:00
David Sterba
b3b4aa74b5 btrfs: drop unused parameter from btrfs_release_path
parameter tree root it's not used since commit
5f39d397df ("Btrfs: Create extent_buffer
interface for large blocksizes")

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
Li Zefan
33345d0152 Btrfs: Always use 64bit inode number
There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.

So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.

There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.

Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:09 +08:00
Li Zefan
581bb05094 Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.

This fixes it, and it works similarly to how we cache free space in block
cgroups.

We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.

Because we are searching the commit root, we have to carefully handle the
cross-transaction case.

The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:04 +08:00
Linus Torvalds
adff377bb1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
  Btrfs: fix free space cache leak
  Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
  Btrfs end_bio_extent_readpage should look for locked bits
  Btrfs: don't force chunk allocation in find_free_extent
  Btrfs: Check validity before setting an acl
  Btrfs: Fix incorrect inode nlink in btrfs_link()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
  Btrfs: make uncache_state unconditional
  btrfs: using cached extent_state in set/unlock combinations
  Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
  Btrfs: fix subvolume mount by name problem when default mount subvolume is set
  fix user annotation in ioctl.c
  Btrfs: check for duplicate iov_base's when doing dio reads
  btrfs: properly handle overlapping areas in memmove_extent_buffer
  Btrfs: fix memory leaks in btrfs_new_inode()
  Btrfs: check for duplicate iov_base's when doing dio reads
  Btrfs: reuse the extent_map we found when calling btrfs_get_extent
  Btrfs: do not use async submit for small DIO io's
  Btrfs: don't split dio bios if we don't have to
  ...
2011-04-18 12:24:05 -07:00
Daniel J Blueman
13f2696f1d fix user annotation in ioctl.c
Fix address space annotation correct in ioctl.c.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>

 		       BTRFS_BLOCK_GROUP_SYSTEM,
@@ -2387,7 +2387,7 @@ long btrfs_ioctl_space_info(struct btrfs_root
*root, void __user *arg)
 		up_read(&info->groups_sem);
 	}

-	user_dest = (struct btrfs_ioctl_space_info *)
+	user_dest = (struct btrfs_ioctl_space_info __user *)
 		(arg + sizeof(struct btrfs_ioctl_space_args));

 	if (copy_to_user(user_dest, dest_orig, alloc_size))
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:25:46 -04:00
Linus Torvalds
884b8267d5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: don't warn in btrfs_add_orphan
  Btrfs: fix free space cache when there are pinned extents and clusters V2
  Btrfs: Fix uninitialized root flags for subvolumes
  btrfs: clear __GFP_FS flag in the space cache inode
  Btrfs: fix memory leak in start_transaction()
  Btrfs: fix memory leak in btrfs_ioctl_start_sync()
  Btrfs: fix subvol_sem leak in btrfs_rename()
  Btrfs: Fix oops for defrag with compression turned on
  Btrfs: fix /proc/mounts info.
  Btrfs: fix compiler warning in file.c
2011-04-05 12:29:25 -07:00
Li Zefan
08fe4db170 Btrfs: Fix uninitialized root flags for subvolumes
root_item->flags and root_item->byte_limit are not initialized when
a subvolume is created. This bug is not revealed until we added
readonly snapshot support - now you mount a btrfs filesystem and you
may find the subvolumes in it are readonly.

To work around this problem, we steal a bit from root_item->inode_item->flags,
and use it to indicate if those fields have been properly initialized.
When we read a tree root from disk, we check if the bit is set, and if
not we'll set the flag and initialize the two fields of the root item.

Reported-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Andreas Philipp <philipp.andreas@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-05 01:20:24 -04:00
Tsutomu Itoh
8b2b2d3cbe Btrfs: fix memory leak in btrfs_ioctl_start_sync()
Call btrfs_end_transaction() if btrfs_commit_transaction_async() fails.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-05 01:19:42 -04:00
Linus Torvalds
212a17ab87 Merge branch 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits)
  Btrfs: fix __btrfs_map_block on 32 bit machines
  btrfs: fix possible deadlock by clearing __GFP_FS flag
  btrfs: check link counter overflow in link(2)
  btrfs: don't mess with i_nlink of unlocked inode in rename()
  Btrfs: check return value of btrfs_alloc_path()
  Btrfs: fix OOPS of empty filesystem after balance
  Btrfs: fix memory leak of empty filesystem after balance
  Btrfs: fix return value of setflags ioctl
  Btrfs: fix uncheck memory allocations
  btrfs: make inode ref log recovery faster
  Btrfs: add btrfs_trim_fs() to handle FITRIM
  Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
  Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP
  Btrfs: make update_reserved_bytes() public
  btrfs: return EXDEV when linking from different subvolumes
  Btrfs: Per file/directory controls for COW and compression
  Btrfs: add datacow flag in inode flag
  btrfs: use GFP_NOFS instead of GFP_KERNEL
  Btrfs: check return value of read_tree_block()
  btrfs: properly access unaligned checksum buffer
  ...

Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in
the block layer.
2011-03-28 15:31:05 -07:00
liubo
2d4e6f6ad2 Btrfs: fix return value of setflags ioctl
setflags ioctl should return error when any checks fail.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:51 -04:00
Li Dongyang
f7039b1d5c Btrfs: add btrfs_trim_fs() to handle FITRIM
We take an free extent out from allocator, trim it, then put it back,
but before we trim the block group, we should make sure the block group is
cached, so plus a little change to make cache_block_group() run without a
transaction.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:47 -04:00
Liu Bo
75e7cb7fe0 Btrfs: Per file/directory controls for COW and compression
Data compression and data cow are controlled across the entire FS by mount
options right now.  ioctls are needed to set this on a per file or per
directory basis.  This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.

According to Chris's comment, there should be just one true compression
method(probably LZO) stored in the super.  However, before this, we would
wait for that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in ram today.

After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control file and directory's datacow and compression attribute.

NOTE:
 - The compression type is selected by such rules:
   If we mount btrfs with compress options, ie, zlib/lzo, the type is it.
   Otherwise, we'll use the default compress type (zlib today).

v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW
  will be screwed by inheritance from parent directory.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:41 -04:00
Tsutomu Itoh
db5b493ac7 Btrfs: cleanup some BUG_ON()
This patch changes some BUG_ON() to the error return.
(but, most callers still use BUG_ON())

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:35 -04:00
Serge E. Hallyn
2e14967075 userns: rename is_owner_or_cap to inode_owner_or_capable
And give it a kernel-doc comment.

[akpm@linux-foundation.org: btrfs changed in linux-next]
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:47:13 -07:00
Josef Bacik
66b4ffd110 Btrfs: handle errors in btrfs_orphan_cleanup
If we cannot truncate an inode for some reason we will never delete the orphan
item associated with that inode, which means that we will loop forever in
btrfs_orphan_cleanup.  Instead of doing this just return error so we fail to
mount.  It sucks, but hey it's better than hanging.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-17 14:21:26 -04:00
Li Zefan
b4dc2b8c69 Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
- Check user-specified flags correctly
- Check the inode owership
- Search root item in root tree but not fs tree

Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-16 15:37:58 -05:00
Dan Rosenberg
51788b1bdd btrfs: prevent heap corruption in btrfs_ioctl_space_info()
Commit bf5fc093c5 refactored
btrfs_ioctl_space_info() and introduced several security issues.

space_args.space_slots is an unsigned 64-bit type controlled by a
possibly unprivileged caller.  The comparison as a signed int type
allows providing values that are treated as negative and cause the
subsequent allocation size calculation to wrap, or be truncated to 0.
By providing a size that's truncated to 0, kmalloc() will return
ZERO_SIZE_PTR.  It's also possible to provide a value smaller than the
slot count.  The subsequent loop ignores the allocation size when
copying data in, resulting in a heap overflow or write to ZERO_SIZE_PTR.

The fix changes the slot count type and comparison typecast to u64,
which prevents truncation or signedness errors, and also ensures that we
don't copy more data than we've allocated in the subsequent loop.  Note
that zero-size allocations are no longer possible since there is already
an explicit check for space_args.space_slots being 0 and truncation of
this value is no longer an issue.

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 16:04:23 -05:00
Tsutomu Itoh
98d5dc13e7 btrfs: fix return value check of btrfs_start_transaction()
The error check of btrfs_start_transaction() is added, and the mistake
of the error check on several places is corrected.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-01 07:17:27 -05:00
Tsutomu Itoh
abd30bb0af btrfs: check return value of btrfs_start_ioctl_transaction() properly
btrfs_start_ioctl_transaction() returns ERR_PTR(), not NULL.
So, it is necessary to use IS_ERR() to check the return value.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-28 16:40:37 -05:00
Tsutomu Itoh
3612b49598 btrfs: fix return value check of btrfs_join_transaction()
The error check of btrfs_join_transaction()/btrfs_join_transaction_nolock()
is added, and the mistake of the error check in several places is
corrected.

For more stable Btrfs, I think that we should reduce BUG_ON().
But, I think that long time is necessary for this.
So, I propose this patch as a short-term solution.

With this patch:
 - To more stable Btrfs, the part that should be corrected is clarified.
 - The panic isn't done by the NULL pointer reference etc. (even if
   BUG_ON() is increased temporarily)
 - The error code is returned in the place where the error can be easily
   returned.

As a long-term plan:
 - BUG_ON() is reduced by using the forced-readonly framework, etc.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-28 16:40:37 -05:00
Chris Mason
eab49bec41 Merge branch 'bug-fixes' of git://repo.or.cz/linux-btrfs-devel into btrfs-38 2011-01-28 16:24:59 -05:00
Li Zefan
4d728ec7ae Btrfs: Fix file clone when source offset is not 0
Suppose:
- the source extent is: [0, 100]
- the src offset is 10
- the clone length is 90
- the dest offset is 0

This statement:

	new_key.offset = key.offset + destoff - off

will produce such an extent for the dest file:

	[ino, BTRFS_EXTENT_DATA_KEY, -10]

, which is obviously wrong.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:11:18 +08:00
Chris Mason
f892436eb2 Merge branch 'lzo-support' of git://repo.or.cz/linux-btrfs-devel into btrfs-38 2011-01-16 11:25:54 -05:00
Li Zefan
0caa102da8 Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
This allows us to set a snapshot or a subvolume readonly or writable
on the fly.

Usage:

Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and then
call ioctl(BTRFS_IOCTL_SUBVOL_SETFLAGS);

Changelog for v3:

- Change to pass __u64 as ioctl parameter.

Changelog for v2:

- Add _GETFLAGS ioctl.
- Check if the passed fd is the root of a subvolume.
- Change the name from _SNAP_SETFLAGS to _SUBVOL_SETFLAGS.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-23 08:49:19 +08:00
Li Zefan
b83cc9693f Btrfs: Add readonly snapshots support
Usage:

Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call
ioctl(BTRFS_I0CTL_SNAP_CREATE_V2).

Implementation:

- Set readonly bit of btrfs_root_item->flags.
- Add readonly checks in btrfs_permission (inode_permission),
btrfs_setattr, btrfs_set/remove_xattr and some ioctls.

Changelog for v3:

- Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
- Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-23 08:49:17 +08:00
Li Zefan
fa0d2b9bd7 Btrfs: Refactor btrfs_ioctl_snap_create()
Split it into two functions for two different ioctls, since they
share no common code.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-23 08:49:15 +08:00
Li Zefan
1a419d85a7 btrfs: Allow to specify compress method when defrag
Update defrag ioctl, so one can choose lzo or zlib when turning
on compression in defrag operation.

Changelog:

v1 -> v2
- Add incompability flag.
- Fix to check invalid compress type.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-22 23:15:48 +08:00
Li Zefan
261507a02c btrfs: Allow to add new compression algorithm
Make the code aware of compression type, instead of always assuming
zlib compression.

Also make the zlib workspace function as common code for all
compression types.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2010-12-22 23:15:45 +08:00
Li Zefan
fdfb1e4f6c Btrfs: Make async snapshot ioctl more generic
If we had reserved some bytes in struct btrfs_ioctl_vol_args, we
wouldn't have to create a new structure for async snapshot creation.

Here we convert async snapshot ioctl to use a more generic ABI, as
we'll add more ioctls for snapshots/subvolumes in the future, readonly
snapshots for example.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:11 -05:00
Sage Weil
75eaa0e22c Btrfs: fix sync subvol/snapshot creation
We were incorrectly taking the async path even for the sync ioctls by
passing in &transid unconditionally.

There's ample room for further cleanup here, but this keeps the fix simple.

Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:10 -05:00
Josef Bacik
6a91221304 Btrfs: use dget_parent where we can UPDATED
There are lots of places where we do dentry->d_parent->d_inode without holding
the dentry->d_lock.  This could cause problems with rename.  So instead we need
to use dget_parent() and hold the reference to the parent as long as we are
going to use it's inode and then dput it at the end.

Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: raven@themaw.net
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:09 -05:00
Li Zefan
5f3888ff6f btrfs: Set file size correctly in file clone
Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the
original filesize of the dest file 'file2' is 30K:

  # ls -l /mnt/file2
  -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2

Now clone file1 to file2, the dest file should be 40K, but it
still shows 30K:

  # ls -l /mnt/file2
  -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:06 -05:00
Li Zefan
2a6b8daeda btrfs: Check if dest_offset is block-size aligned before cloning file
We've done the check for src_offset and src_length, and We should
also check dest_offset, otherwise we'll corrupt the destination
file:

  (After cloning file1 to file2 with unaligned dest_offset)
  # cat /mnt/file2
  cat: /mnt/file2: Input/output error

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:05 -05:00
Sage Weil
4260f7c751 Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2).  We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).
We additionally require that the user has write+exec permission on the
subvol root inode.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 21:42:10 -04:00
Sage Weil
531cb13f1e Btrfs: make SNAP_DESTROY async
There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.

If users _do_ want the deletion to be durable, they can call 'sync'.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 21:42:10 -04:00
Sage Weil
72fd032e94 Btrfs: add SNAP_CREATE_ASYNC ioctl
Create a snap without waiting for it to commit to disk.  The ioctl is
ordered such that subsequent operations will not be contained by the
created snapshot, and the commit is initiated, but the ioctl does not
wait for the snapshot to commit to disk.

We return the specific transid to userspace so that an application can wait
for this specific snapshot creation to commit via the WAIT_SYNC ioctl.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 21:41:57 -04:00
Sage Weil
462045928b Btrfs: add START_SYNC, WAIT_SYNC ioctls
START_SYNC will start a sync/commit, but not wait for it to
complete.  Any modification started after the ioctl returns is
guaranteed not to be included in the commit.  If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.

WAIT_SYNC will wait for any in-progress commit to complete.  If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately.  If the specified transaction doesn't exist, it
returns EINVAL.

If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.

These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.

Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.

Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result.  However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well.  Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:41:32 -04:00
Sage Weil
fccdae435c Btrfs: fix lockdep warning on clone ioctl
I'm no lockdep expert, but this appears to make the lockdep warning go
away for the i_mutex locking in the clone ioctl.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Sage Weil
050006a753 Btrfs: fix clone ioctl where range is adjacent to extent
We had an edge case issue where the requested range was just
following an existing extent. Instead of skipping to the next
extent, we used the previous one which lead to having zero
sized extents.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Sage Weil
9a019196ec Btrfs: fix delalloc checks in clone ioctl
The lookup_first_ordered_extent() was done on the wrong inode, and the
->delalloc_bytes test was wrong, as the following
btrfs_wait_ordered_range() would only invoke a range write and wouldn't
write the entire file data range. Also, a bad parameter was passed to
btrfs_wait_ordered_range().

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Andi Kleen
559af82114 Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.

Still needs more review.

Found by gcc 4.6's new warnings

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:37 -04:00
Julia Lawall
2354d08fe9 Btrfs: use memdup_user helpers
Use memdup_user when user data is immediately copied into the
allocated region.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@

-  to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+  to = memdup_user(from,size);
   if (
-      to==NULL
+      IS_ERR(to)
                 || ...) {
   <+... when != goto l1;
-  -ENOMEM
+  PTR_ERR(to)
   ...+>
   }
-  if (copy_from_user(to, from, size) != 0) {
-    <+... when != goto l2;
-    -EFAULT
-    ...+>
-  }
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:18 -04:00
Josef Bacik
bf5fc093c5 Btrfs: fix the df ioctl to report raid types
The new ENOSPC stuff broke the df ioctl since we no longer create seperate space
info's for each RAID type.  So instead, loop through each space info's raid
lists so we can get the right RAID information which will allow the df ioctl to
tell us RAID types again.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:53 -04:00
Dan Rosenberg
2ebc346478 Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE
1.  The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
whether the donor file is append-only before writing to it.

2.  The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
overflow that allows a user to specify an out-of-bounds range to copy
from the source file (if off + len wraps around).  I haven't been able
to successfully exploit this, but I'd imagine that a clever attacker
could use this to read things he shouldn't.  Even if it's not
exploitable, it couldn't hurt to be safe.

Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-19 16:58:20 -04:00
Sage Weil
b5384d48f4 Btrfs: fix CLONE ioctl destination file size expansion to block boundary
The CLONE and CLONE_RANGE ioctls round up the range of extents being
cloned to the block size when the range to clone extends to the end of file
(this is always the case with CLONE).  It was then using that offset when
extending the destination file's i_size.  Fix this by not setting i_size
beyond the originally requested ending offset.

This bug was introduced by a22285a6 (2.6.35-rc1).

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-19 16:15:06 -04:00
Dan Carpenter
cf1e99a4e0 Btrfs: btrfs_lookup_dir_item() can return ERR_PTR
btrfs_lookup_dir_item() can return either ERR_PTRs or null.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:37 -04:00
Dan Carpenter
d327099a23 Btrfs: unwind after btrfs_start_transaction() errors
This was added by a22285a6a3: "Btrfs: Integrate metadata reservation
with start_transaction".  If we goto out here then we skip all the
unwinding and there are locks still held etc.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:35 -04:00
Yan, Zheng
d68fc57b7e Btrfs: Metadata reservation for orphan inodes
reserve metadata space for handling orphan inodes

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:52 -04:00
Yan, Zheng
8929ecfa50 Btrfs: Introduce global metadata reservation
Reserve metadata space for extent tree, checksum tree and root tree

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:52 -04:00
Yan, Zheng
0ca1f7ceb1 Btrfs: Update metadata reservation for delayed allocation
Introduce metadata reservation context for delayed allocation
and update various related functions.

This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:51 -04:00
Yan, Zheng
a22285a6a3 Btrfs: Integrate metadata reservation with start_transaction
Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.

Changes since V1:

Add code that check if unlink and rmdir will free space.

Add ENOSPC handling for clone ioctl.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:50 -04:00
Linus Torvalds
18e41da89d Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: check for read permission on src file in the clone ioctl
2010-05-15 12:55:31 -07:00
Dan Rosenberg
5dc6416414 Btrfs: check for read permission on src file in the clone ioctl
The existing code would have allowed you to clone a file that was
only open for writing

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-15 12:05:50 -04:00
Linus Torvalds
795d580bae Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: add check for changed leaves in setup_leaf_for_split
  Btrfs: create snapshot references in same commit as snapshot
  Btrfs: fix small race with delalloc flushing waitqueue's
  Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
  Btrfs: fix chunk allocate size calculation
  Btrfs: kill max_extent mount option
  Btrfs: fail to mount if we have problems reading the block groups
  Btrfs: check btrfs_get_extent return for IS_ERR()
  Btrfs: handle kmalloc() failure in inode lookup ioctl
  Btrfs: dereferencing freed memory
  Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
  Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
  Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
  Btrfs: add NULL check for do_walk_down()
  Btrfs: remove duplicate include in ioctl.c

Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
cleanups.
2010-04-05 13:21:15 -07:00
Dan Carpenter
6cf8bfbf5e Btrfs: check btrfs_get_extent return for IS_ERR()
btrfs_get_extent() never returns NULL, only a valid pointer or ERR_PTR()

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:09 -04:00
Dan Carpenter
c2b96929e2 Btrfs: handle kmalloc() failure in inode lookup ioctl
Return -ENOMEM if kmalloc() fails.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:09 -04:00
Dan Carpenter
683be16eb6 Btrfs: dereferencing freed memory
The original code dereferenced range on the next line.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:09 -04:00
Andrea Gelmini
2f3014fc2a Btrfs: remove duplicate include in ioctl.c
fs/btrfs/ioctl.c: ctree.h is included more than once.

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:08 -04:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Chris Mason
8ad6fcab56 Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree
This is used by the inode lookup ioctl to follow all the backrefs up
to the subvol root.  But the search being done would sometimes land one
past the last item in the leaf instead of finding the backref.

This changes the search to look for the highest possible backref and hop
back one item.  It also fixes a leaked path on failure to find the root.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:23:10 -04:00
Chris Mason
1b53ac4d1b Btrfs: allow treeid==0 in the inode lookup ioctl
When a root id of 0 is sent to the inode lookup ioctl, it will
use the root of the file we're ioctling and pass the root id
back to userland along with the results.

This allows userland to do searches based on that root later on.


Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:17:05 -04:00
Chris Mason
90fdde147f Btrfs: return keys for large items to the search ioctl
The search ioctl was skipping large items entirely (ones that are too
big for the results buffer).  This changes things to at least copy
the item header so that we can send information about the item back to
userland.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:14:54 -04:00
Chris Mason
abc6e1341b Btrfs: fix key checks and advance in the search ioctl
The search ioctl was working well for finding tree roots, but using it for
generic searches requires a few changes to how the keys are advanced.
This treats the search control min fields for objectid, type and offset
more like a key, where we drop the offset to zero once we bump the type,
etc.

The downside of this is that we are changing the min_type and min_offset
fields during the search, and so the ioctl caller needs extra checks to make sure
the keys in the result are the ones it wanted.

This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make
things more readable.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:10:08 -04:00
Chris Mason
7fde62bffb Btrfs: buffer results in the space_info ioctl
The space_info ioctl was using copy_to_user inside rcu_read_lock.  This
commit changes things to copy into a buffer first and then dump the
result down to userland.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-16 15:40:10 -04:00
Sage Weil
854d2c3531 Btrfs: fix search_ioctl key advance
key->type is u8, not u64.

fs/btrfs/ioctl.c: In function 'copy_to_sk':
fs/btrfs/ioctl.c:1024: warning: comparison is always true due to limited range of data type

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-16 14:24:27 -04:00
Akinobu Mita
91748467a5 btrfs: use memparse
Use memparse() instead of its own private implementation.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:14 -04:00
Josef Bacik
1406e4327b Btrfs: add a "df" ioctl for btrfs
df is a very loaded question in btrfs.  This gives us a way to get the per-space
usage information so we can tell exactly what is in use where.  This will help
us figure out ENOSPC problems, and help users better understand where their disk
space is going.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:14 -04:00
Josef Bacik
2ac55d41b5 Btrfs: cache the extent state everywhere we possibly can V2
This patch just goes through and fixes everybody that does

lock_extent()
blah
unlock_extent()

to use

lock_extent_bits()
blah
unlock_extent_cached()

and pass around a extent_state so we only have to do the searches once per
function.  This gives me about a 3 mb/s boots on my random write test.  I have
not converted some things, like the relocation and ioctl's, since they aren't
heavily used and the relocation stuff is in the middle of being re-written.  I
also changed the clear_extent_bit() to only unset the cached state if we are
clearing EXTENT_LOCKED and related stuff, so we can do things like this

lock_extent_bits()
clear delalloc bits
unlock_extent_cached()

without losing our cached state.  I tested this thoroughly and turned on
LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
fine.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:13 -04:00
Chris Mason
1e701a3292 Btrfs: add new defrag-range ioctl.
The btrfs defrag ioctl was limited to doing the entire file.  This
commit adds a new interface that can defrag a specific range inside
the file.

It can also force compression on the file, allowing you to selectively
compress individual files after they were created, even when mount -o
compress isn't turned on.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:10 -04:00
Chris Mason
940100a4a7 Btrfs: be more selective in the defrag ioctl
The btrfs defrag ioctl had some bugs around delalloc accounting, and it
wasn't properly skipping pages that were not in the mapping.

It wasn't properly clearing the page checked flag, which could make the
writeback code ignore the page forever while pinning it as dirty.

This commit fixes those problems and makes defrag a little smarter.  It
skips holes and it doesn't waste time defragging large extents.  If a
tiny extent comes before a very large extent, it will defrag both of
them to make sure the tiny extent ends up next to something big.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:10 -04:00
Josef Bacik
6ef5ed0d38 Btrfs: add ioctl and incompat flag to set the default mount subvol
This patch needs to go along with my previous patch.  This lets us set the
default dir item's location to whatever root we want to use as our default
mounting subvol.  With this we don't have to use mount -o subvol=<tree id>
anymore to mount a different subvol, we can just set the new one and it will
just magically work.  I've done some moderate testing with this, mostly just
switching the default mount around, mounting subvols and the default mount at
the same time and such, everything seems to work.  Thanks,

Older kernels would generally be able to still mount the filesystem with the
default subvolume set, but it would result in a different volume being mounted,
which could be an even more unpleasant suprise for users.  So if you set your
default subvolume, you can't go back to older kernels.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:08 -04:00
Chris Mason
ac8e9819d7 Btrfs: add search and inode lookup ioctls
The search ioctl is a generic tool for doing btree searches from
userland applications.  The first user of the search ioctl is a
subvolume listing feature, but we'll also use it to find new
files in a subvolume.

The search ioctl allows you to specify min and max keys to search for,
along with min and max transid.  It returns the items along with a
header that includes the item key.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 10:55:10 -04:00
TARUISI Hiroaki
98d377a089 Btrfs: add a function to lookup a directory path by following backrefs
This will be used by the inode lookup ioctl.

Signed-off-by: TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 10:55:09 -04:00
Yan, Zheng
86b9f2eca5 Btrfs: Fix per root used space accounting
The bytes_used field in root item was originally planned to
trace the amount of used data and tree blocks. But it never
worked right since we can't trace freeing of data accurately.
This patch changes it to only trace the amount of tree blocks.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17 12:33:35 -05:00
Yan, Zheng
2e4bfab970 Btrfs: Avoid orphan inodes cleanup during committing transaction
btrfs_lookup_dentry may trigger orphan cleanup, so it's not good
to call it while committing a transaction.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17 12:33:33 -05:00
Yan, Zheng
920bbbfb05 Btrfs: Rewrite btrfs_drop_extents
Rewrite btrfs_drop_extents by using btrfs_duplicate_item, so we can
avoid calling lock_extent within transaction.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-15 21:24:52 -05:00
Chris Mason
ac6889cbb2 Btrfs: fix file clone ioctl for bookend extents
The file clone ioctl was incorrectly taking the offset into the
extent on disk into account when calculating the length of the
cloned extent.

The length never changes based on the offset into the physical extent.

Test case:

fallocate -l 1g image
mke2fs image
bcp image image2
e2fsck -f image2

(errors on image2)

The math bug ends up wrapping the length of the extent, and things
go wrong from there.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-09 11:29:53 -04:00
Yan, Zheng
efefb1438b Btrfs: remove negative dentry when deleting subvolumne
The use of btrfs_dentry_delete is removing dentries from the
dcache when deleting subvolumne. btrfs_dentry_delete ignores
negative dentries. This is incorrect since if we don't remove
the negative dentry, its parent dentry can't be removed.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-09 09:25:16 -04:00
Sage Weil
1ab86aedbc Btrfs: fix error cases for ioctl transactions
Fix leak of vfsmount write reference and open_ioctl_trans reference on
ENOMEM.  Clean up the error paths while we're at it.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-29 18:38:44 -04:00
Josef Bacik
9ed74f2dba Btrfs: proper -ENOSPC handling
At the start of a transaction we do a btrfs_reserve_metadata_space() and
specify how many items we plan on modifying.  Then once we've done our
modifications and such, just call btrfs_unreserve_metadata_space() for
the same number of items we reserved.

For keeping track of metadata needed for data I've had to add an extent_io op
for when we merge extents.  This lets us track space properly when we are doing
sequential writes, so we don't end up reserving way more metadata space than
what we need.

The only place where the metadata space accounting is not done is in the
relocation code.  This is because Yan is going to be reworking that code in the
near future, so running btrfs-vol -b could still possibly result in a ENOSPC
related panic.  This patch also turns off the metadata_ratio stuff in order to
allow users to more efficiently use their disk space.

This patch makes it so we track how much metadata we need for an inode's
delayed allocation extents by tracking how many extents are currently
waiting for allocation.  It introduces two new callbacks for the
extent_io tree's, merge_extent_hook and split_extent_hook.  These help
us keep track of when we merge delalloc extents together and split them
up.  Reservations are handled prior to any actually dirty'ing occurs,
and then we unreserve after we dirty.

btrfs_unreserve_metadata_for_delalloc() will make the appropriate
unreservations as needed based on the number of reservations we
currently have and the number of extents we currently have.  Doing the
reservation outside of doing any of the actual dirty'ing lets us do
things like filemap_flush() the inode to try and force delalloc to
happen, or as a last resort actually start allocation on all delalloc
inodes in the fs.  This has survived dbench, fs_mark and an fsx torture
test.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-28 16:29:42 -04:00
Sage Weil
1fb58a6051 Btrfs: fix arithmetic error in clone ioctl
Fix an arithmetic error that was breaking extents cloned via the clone
ioctl starting in the second half of a file.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 16:00:27 -04:00
Yan, Zheng
76dda93c6a Btrfs: add snapshot/subvolume destroy ioctl
This patch adds snapshot/subvolume destroy ioctl.  A subvolume that isn't being
used and doesn't contains links to other subvolumes can be destroyed.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 16:00:26 -04:00
Yan, Zheng
4df27c4d5c Btrfs: change how subvolumes are organized
btrfs allows subvolumes and snapshots anywhere in the directory tree.
If we snapshot a subvolume that contains a link to other subvolume
called subvolA, subvolA can be accessed through both the original
subvolume and the snapshot. This is similar to creating hard link to
directory, and has the very similar problems.

The aim of this patch is enforcing there is only one access point to
each subvolume. Only the first directory entry (the one added when
the subvolume/snapshot was created) is treated as valid access point.
The first directory entry is distinguished by checking root forward
reference. If the corresponding root forward reference is missing,
we know the entry is not the first one.

This patch also adds snapshot/subvolume rename support, the code
allows rename subvolume link across subvolumes.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 15:56:00 -04:00
Chris Mason
83ebade34b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable 2009-09-11 19:07:25 -04:00
Chris Mason
a1ed835e1a Btrfs: Fix extent replacment race
Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones.  There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.

Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.

This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:07 -04:00
Alexey Dobriyan
405f55712d headers: smp_lock.h redux
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
  It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT

  This will make hardirq.h inclusion cheaper for every PREEMPT=n config
  (which includes allmodconfig/allyesconfig, BTW)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-12 12:22:34 -07:00
Chris Mason
c8a894d77d Btrfs: fix the file clone ioctl for preallocated extents 2009-07-02 13:41:16 -04:00
Chris Mason
0b4dcea579 Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
This happens during subvol creation.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 11:13:35 -04:00
Christoph Hellwig
6cbff00f46 Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
Add support for the standard attributes set via chattr and read via
lsattr.  Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.

Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were confusing the existing code and got in the way of the
new additions.

Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Yan Zheng
5d4f98a28c Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one.  At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root.  This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces.  But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Linus Torvalds
5732c46849 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
  Btrfs: make show_options result match actual option names
  Btrfs: remove outdated comment in btrfs_ioctl_resize()
  Btrfs: remove some WARN_ONs in the IO failure path
  Btrfs: Don't loop forever on metadata IO failures
  Btrfs: init inode ordered_data_close flag properly
2009-05-14 19:18:44 -07:00
Li Hong
5d847a8ed9 Btrfs: remove outdated comment in btrfs_ioctl_resize()
In Li Zefan's commit dae7b665cf,
a combination call of kmalloc() and copy_from_user() is replaced by
memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Linus Torvalds
4ebf662337 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: look for acls during btrfs_read_locked_inode
  Btrfs: fix acl caching
  Btrfs: Fix a bunch of printk() warnings.
  Btrfs: Fix a trivial warning using max() of u64 vs ULL.
  Btrfs: remove unused btrfs_bit_radix slab
  Btrfs: ratelimit IO error printks
  Btrfs: remove #if 0 code
  Btrfs: When shrinking, only update disk size on success
  Btrfs: fix deadlocks and stalls on dead root removal
  Btrfs: fix fallocate deadlock on inode extent lock
  Btrfs: kill btrfs_cache_create
  Btrfs: don't export symbols
  Btrfs: simplify makefile
  Btrfs: try to keep a healthy ratio of metadata vs data block groups
2009-04-27 11:16:33 -07:00
Joel Becker
21380931eb Btrfs: Fix a bunch of printk() warnings.
Just happened to notice a bunch of %llu vs u64 warnings.  Here's a patch
to cast them all.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:49 -04:00
Chris Mason
e980b50cda Btrfs: fix fallocate deadlock on inode extent lock
The btrfs fallocate call takes an extent lock on the entire range
being fallocated, and then runs through insert_reserved_extent on each
extent as they are allocated.

The problem with this is that btrfs_drop_extents may decide to try
and take the same extent lock fallocate was already holding.  The solution
used here is to push down knowledge of the range that is already locked
going into btrfs_drop_extents.

It turns out that at least one other caller had the same bug.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:05 -04:00
Li Zefan
dae7b665cf btrfs: use memdup_user()
Remove open-coded memdup_user().

Note this changes some GFP_NOFS to GFP_KERNEL, since copy_from_user() may
cause pagefault, it's pointless to pass GFP_NOFS to kmalloc().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20 23:02:50 -04:00
Al Viro
ce3b0f8d5c New helper - current_umask()
current->fs->umask is what most of fs_struct users are doing.
Put that into a helper function.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31 23:00:26 -04:00
Josef Bacik
6a63209fc0 Btrfs: add better -ENOSPC handling
This is a step in the direction of better -ENOSPC handling.  Instead of
checking the global bytes counter we check the space_info bytes counters to
make sure we have enough space.

If we don't we go ahead and try to allocate a new chunk, and then if that fails
we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
bytes_delalloc and bytes_may_use.

bytes_delalloc account for extents we've actually setup for delalloc and will
be allocated at some point down the line. 

bytes_may_use is to keep track of how many bytes we may use for delalloc at
some point.  When we actually set the extent_bit for the delalloc bytes we
subtract the reserved bytes from the bytes_may_use counter.  This keeps us from
not actually being able to allocate space for any delalloc bytes.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-20 11:00:09 -05:00
Huang Weiyi
7eaebe7d50 Btrfs: removed unused #include <version.h>'s
Removed unused #include <version.h>'s in btrfs

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Chris Mason
d397712bcc Btrfs: Fix checkpatch.pl warnings
There were many, most are fixed now.  struct-funcs.c generates some warnings
but these are bogus.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-05 21:25:51 -05:00
Yan Zheng
52c2617990 Btrfs: update directory's size when creating subvol/snapshot
Make sure directory's size properly updated when creating
subvol/snapshot.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-05 15:43:43 -05:00
Chris Mason
e441d54de4 Btrfs: add permission checks to the ioctls
Only root can add/remove devices
Only root can defrag subtrees
Only files open for writing can be defragged
Only files open for writing can be the destination for a clone

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-05 16:57:23 -05:00
Yan Zheng
ab67b7c1f7 Btrfs: Add missing mnt_drop_write in ioctl.c
This patch adds the missing mnt_drop_write to match
mnt_want_write in btrfs_ioctl_defrag and btrfs_ioctl_clone

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-19 10:58:39 -05:00
Yan Zheng
d2fb3437e4 Btrfs: fix leaking block group on balance
The block group structs are referenced in many different
places, and it's not safe to free while balancing.  So, those block
group structs were simply leaked instead.

This patch replaces the block group pointer in the inode with the starting byte
offset of the block group and adds reference counting to the block group
struct.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-11 16:30:39 -05:00
Sage Weil
cfc8ea8720 Btrfs: mnt_drop_write in ioctl_trans_end
Add missing mnt_drop_write to match the mnt_want_write in
btrfs_ioctl_trans_start.

Signed-off-by: Sage Weil <sage@newdream.net>
2008-12-11 16:30:06 -05:00
Chris Mason
d20f7043fa Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block.  Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block.  This means that when we read the inode,
we've probably read in at least some checksums as well.

But, this has a few problems:

* The checksums are indexed by logical offset in the file.  When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data.  It would be faster if we could checksum
the compressed data instead.

* If we implement encryption, we'll be checksumming the plain text and
storing that on disk.  This is significantly less secure.

* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct.  This makes the raid
layer balancing and extent moving much more expensive.

* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.

* There is potentitally one copy of the checksum in each subvolume
referencing an extent.

The solution used here is to store the extent checksums in a dedicated
tree.  This allows us to index the checksums by phyiscal extent
start and length.  It means:

* The checksum is against the data stored on disk, after any compression
or encryption is done.

* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.

This makes compression significantly faster by reducing the amount of
data that needs to be checksummed.  It will also allow much faster
raid management code in general.

The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent.  This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:58:54 -05:00
Josef Bacik
607d432da0 Btrfs: add support for multiple csum algorithms
This patch gives us the space we will need in order to have different csum
algorithims at some point in the future.  We save the csum algorithim type
in the superblock, and use those instead of define's.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-12-02 07:17:45 -05:00
Christoph Hellwig
7a865e8ac3 Btrfs: btrfs: pass void __user * to btrfs_ioctl_clone_range
Cleans the code up a little and also avoids a sparse warning due to the
incorrect cast in the current version of the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-12-02 09:52:24 -05:00
Christoph Hellwig
4bcabaa30a Btrfs: clean up btrfs_ioctl a little bit
Provide a void __user *argp pointer so that we can avoid duplicating
the cast for various sub-command calls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-12-02 06:36:08 -05:00
Christoph Hellwig
b2950863c6 Btrfs: make things static and include the right headers
Shut up various sparse warnings about symbols that should be either
static or have their declarations in scope.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-12-02 09:54:17 -05:00
Sage Weil
1ffa4f426c Btrfs: remove unneeded btrfs_start_delalloc_inodes call
It is called by btrfs_sync_fs.

Signed-off-by: Sage Weil <sage@newdream.net>
2008-12-02 09:53:09 -05:00
Chris Mason
4b4e25f2a6 Btrfs: compat code fixes
The btrfs git kernel trees is used to build a standalone tree for
compiling against older kernels.  This commit makes the standalone tree
work with 2.6.27

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-20 10:22:27 -05:00
Chris Mason
ea9e8b11bd Btrfs: prevent loops in the directory tree when creating snapshots
For a directory tree:

/mnt/subvolA/subvolB

btrfsctl -s /mnt/subvolA/subvolB /mnt

Will create a directory loop with subvolA under subvolB.  This
commit uses the forward refs for each subvol and snapshot to error out
before creating the loop.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-17 21:14:24 -05:00
Chris Mason
0660b5af3f Btrfs: Add backrefs and forward refs for subvols and snapshots
Subvols and snapshots can now be referenced from any point in the directory
tree.  We need to maintain back refs for them so we can find lost
subvols.

Forward refs are added so that we know all of the subvols and
snapshots referenced anywhere in the directory tree of a single subvol.  This
can be used to do recursive snapshotting (but they aren't yet) and it is
also used to detect and prevent directory loops when creating new snapshots.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-17 20:37:39 -05:00
Chris Mason
3394e1607e Btrfs: Give each subvol and snapshot their own anonymous devid
Each subvolume has its own private inode number space, and so we need
to fill in different device numbers for each subvolume to avoid confusing
applications.

This commit puts a struct super_block into struct btrfs_root so it can
call set_anon_super() and get a different device number generated for
each root.

btrfs_rename is changed to prevent renames across subvols.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-17 20:42:26 -05:00
Chris Mason
3de4586c52 Btrfs: Allow subvolumes and snapshots anywhere in the directory tree
Before, all snapshots and subvolumes lived in a single flat directory.  This
was awkward and confusing because the single flat directory was only writable
with the ioctls.

This commit changes the ioctls to create subvols and snapshots at any
point in the directory tree.  This requires making separate ioctls for
snapshot and subvol creation instead of a combining them into one.

The subvol ioctl does:

btrfsctl -S subvol_name parent_dir

After the ioctl is done subvol_name lives inside parent_dir.

The snapshot ioctl does:

btrfsctl -s path_for_snapshot root_to_snapshot

path_for_snapshot can be an absolute or relative path.  btrfsctl breaks it up
into directory and basename components.

root_to_snapshot can be any file or directory in the FS.  The snapshot
is taken of the entire root where that file lives.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-17 21:02:50 -05:00
Yan Zheng
2b82032c34 Btrfs: Seed device support
Seed device is a special btrfs with SEEDING super flag
set and can only be mounted in read-only mode. Seed
devices allow people to create new btrfs on top of it.

The new FS contains the same contents as the seed device,
but it can be mounted in read-write mode.

This patch does the following:

1) split code in btrfs_alloc_chunk into two parts. The first part does makes
the newly allocated chunk usable, but does not do any operation that modifies
the chunk tree. The second part does the the chunk tree modifications. This
division is for the bootstrap step of adding storage to the seed device.

2) Update device management code to handle seed device.
The basic idea is: For an FS grown from seed devices, its
seed devices are put into a list. Seed devices are
opened on demand at mounting time. If any seed device is
missing or has been changed, btrfs kernel module will
refuse to mount the FS.

3) make btrfs_find_block_group not return NULL when all
block groups are read-only.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-17 21:11:30 -05:00
Yan Zheng
c146afad2c Btrfs: mount ro and remount support
This patch adds mount ro and remount support. The main
changes in patch are: adding btrfs_remount and related
helper function; splitting the transaction related code
out of close_ctree into btrfs_commit_super; updating
allocator to properly handle read only block group.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-12 14:34:12 -05:00
Sage Weil
c5c9cd4d1b Btrfs: allow clone of an arbitrary file range
This patch adds an additional CLONE_RANGE ioctl to clone an arbitrary 
(block-aligned) file range to another file.  The original CLONE ioctl 
becomes a special case of cloning the entire file range.  The logic is a 
bit more complex now since ranges may be cloned to different offsets, and 
because we may only be cloning the beginning or end of a particular extent 
or checksum item.

An additional sanity check ensures the source and destination files aren't 
the same (which would previously deadlock), although eventually this could 
be extended to allow the duplication of file data at a different offset 
within the same file.

Any extents within the destination range in the target file are dropped.

We currently do not cope with the case where a compressed inline extent 
needs to be split.  This will probably require decompressing the extent 
into a temporary address_space, and inserting just the cloned portion as a 
new compressed inline extent.  For now, just return -EINVAL in this case.  
Note that this never comes up in the more common case of cloning an entire 
file.
    
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-12 14:32:25 -05:00
Yan Zheng
d899e05215 Btrfs: Add fallocate support v2
This patch updates btrfs-progs for fallocate support.

fallocate is a little different in Btrfs because we need to tell the
COW system that a given preallocated extent doesn't need to be
cow'd as long as there are no snapshots of it.  This leverages the
-o nodatacow checks.
 
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-30 14:25:28 -04:00
Yan Zheng
80ff385665 Btrfs: update nodatacow code v2
This patch simplifies the nodatacow checker. If all references
were created after the latest snapshot, then we can avoid COW
safely. This patch also updates run_delalloc_nocow to do more
fine-grained checking.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-30 14:20:02 -04:00
Yan Zheng
84234f3a1f Btrfs: Add root tree pointer transaction ids
This patch adds transaction IDs to root tree pointers.
Transaction IDs in tree pointers are compared with the
generation numbers in block headers when reading root
blocks of trees. This can detect some types of IO errors.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-29 14:49:05 -04:00
Chris Mason
a3dddf3fc8 Btrfs: Don't call security_inode_mkdir during subvol creation
Subvol creation already requires privs, and security_inode_mkdir isn't
exported.  For now we don't need it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-10 10:23:22 -04:00
Christoph Hellwig
cb8e70901d Btrfs: Fix subvolume creation locking rules
Creating a subvolume is in many ways like a normal VFS ->mkdir, and we
really need to play with the VFS topology locking rules.  So instead of
just creating the snapshot on disk and then later getting rid of
confliting aliases do it correctly from the start.  This will become
especially important once we allow for subvolumes anywhere in the tree,
and not just below a hidden root.

Note that snapshots will need the same treatment, but do to the delay
in creating them we can't do it currently.  Chris promised to fix that
issue, so I'll wait on that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-10-09 13:39:39 -04:00
Yan Zheng
3bb1a1bc42 Btrfs: Remove offset field from struct btrfs_extent_ref
The offset field in struct btrfs_extent_ref records the position
inside file that file extent is referenced by. In the new back
reference system, tree leaves holding references to file extent
are recorded explicitly. We can scan these tree leaves very quickly, so the
offset field is not required.

This patch also makes the back reference system check the objectid
when extents are in deleting.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-09 11:46:24 -04:00
Yan Zheng
a76a3cd40c Btrfs: Count space allocated to file in bytes
This patch makes btrfs count space allocated to file in bytes instead
of 512 byte sectors.

Everything else in btrfs uses a byte count instead of sector sizes or
blocks sizes, so this fits better.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-09 11:46:29 -04:00
Zheng Yan
5b21f2ed3f Btrfs: extent_map and data=ordered fixes for space balancing
* Add an EXTENT_BOUNDARY state bit to keep the writepage code
from merging data extents that are in the process of being
relocated.  This allows us to do accounting for them properly.

* The balancing code relocates data extents indepdent of the underlying
inode.  The extent_map code was modified to properly account for
things moving around (invalidating extent_map caches in the inode).

* Don't take the drop_mutex in the create_subvol ioctl.  It isn't
required.

* Fix walking of the ordered extent list to avoid races with sys_unlink

* Change the lock ordering rules.  Transaction start goes outside
the drop_mutex.  This allows btrfs_commit_transaction to directly
drop the relocation trees.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:05:38 -04:00
Zheng Yan
31840ae1a6 Btrfs: Full back reference support
This patch makes the back reference system to explicit record the
location of parent node for all types of extents. The location of
parent node is placed into the offset field of backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Christoph Hellwig
b214107eda Btrfs: trivial sparse fixes
Fix a bunch of trivial sparse complaints.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Yan Zheng
7ea394f119 Btrfs: Fix nodatacow for the new data=ordered mode
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Yan Zheng
ae01a0abf3 Btrfs: Update clone file ioctl
This patch updates the file clone ioctl for the tree locking and new
data ordered code.

---

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
ea8c281947 Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Sage Weil
9ca9ee09c1 Btrfs: fix ioctl-initiated transactions vs wait_current_trans()
Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in
progress) breaks the transaction start/stop ioctls by making
btrfs_start_transaction conditionally wait for the next transaction to
start.  If an application artificially is holding a transaction open,
things deadlock.

This workaround maintains a count of open ioctl-initiated transactions in
fs_info, and avoids wait_current_trans() if any are currently open (in
start_transaction() and btrfs_throttle()).  The start transaction ioctl
uses a new btrfs_start_ioctl_transaction() that _does_ call
wait_current_trans(), effectively pushing the join/wait decision to the
outer ioctl-initiated transaction.

This more or less neuters btrfs_throttle() when ioctl-initiated
transactions are in use, but that seems like a pretty fundamental
consequence of wrapping lots of write()'s in a transaction.  Btrfs has no
way to tell if the application considers a given operation as part of it's
transaction.

Obviously, if the transaction start/stop ioctls aren't being used, there
is no effect on current behavior.

Signed-off-by: Sage Weil <sage@newdream.net>
---
 ctree.h       |    1 +
 ioctl.c       |   12 +++++++++++-
 transaction.c |   18 +++++++++++++-----
 transaction.h |    2 ++
 4 files changed, 27 insertions(+), 6 deletions(-)

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
f87f057b49 Btrfs: Improve and cleanup locking done by walk_down_tree
While dropping snapshots, walk_down_tree does most of the work of checking
reference counts and limiting tree traversal to just the blocks that
we are freeing.

It dropped and held the allocation mutex in strange and confusing ways,
this commit changes it to only hold the mutex while actually freeing a block.

The rest of the checks around reference counts should be safe without the lock
because we only allow one process in btrfs_drop_snapshot at a time.  Other
processes dropping reference counts should not drop it to 1 because
their tree roots already have an extra ref on the block.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Mark Fasheh
5516e5957f Btrfs: Null terminate strings passed in from userspace
The 'char name[BTRFS_PATH_NAME_MAX]' member of struct btrfs_ioctl_vol_args
is passed directly to strlen() after being copied from user. I haven't
verified this, but in theory a userspace program could pass in an
unterminated string and cause a kernel crash as strlen walks off the end of
the array.

This patch terminates the ->name string in all btrfs ioctl functions which
currently use a 'struct btrfs_ioctl_vol_args'. Since the string is now
properly terminated, it's length will never be longer than
BTRFS_PATH_NAME_MAX so that error check has been removed.

By the way, it might be better overall to just have the ioctl pass an
unterminated string + length structure but I didn't bother with that since
it'd change the kernel/user interface.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Josef Bacik
8e8a1e31f2 Btrfs: Fix a few functions that exit without stopping their transaction
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Josef Bacik
aec7477b3b Btrfs: Implement new dir index format
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
3eaa288527 Btrfs: Fix the defragmention code and the block relocation code for data=ordered
Before setting an extent to delalloc, the code needs to wait for
pending ordered extents.

Also, the relocation code needs to wait for ordered IO before scanning
the block group again.  This is because the extents are not removed
until the IO for the new extents is finished

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
7d9eb12c87 Btrfs: Add locking around volume management (device add/remove/balance)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
89ce8a63d0 Add btrfs_end_transaction_throttle to force writers to wait for pending commits
The existing throttle mechanism was often not sufficient to prevent
new writers from coming in and making a given transaction run forever.
This adds an explicit wait at the end of most operations so they will
allow the current transaction to close.

There is no wait inside file_write, inode updates, or cow filling, all which
have different deadlock possibilities.

This is a temporary measure until better asynchronous commit support is
added.  This code leads to stalls as it waits for data=ordered
writeback, and it really needs to be fixed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
a213501153 Btrfs: Replace the big fs_mutex with a collection of other locks
Extent alloctions are still protected by a large alloc_mutex.
Objectid allocations are covered by a objectid mutex
Other btree operations are protected by a lock on individual btree nodes

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
925baeddc5 Btrfs: Start btree concurrency work.
The allocation trees and the chunk trees are serialized via their own
dedicated mutexes.  This means allocation location is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree.  Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked.  Releasing or freeing the path drops any locks held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Christoph Hellwig
df5b5520b2 BTRFS_IOC_TRANS_START should be privilegued
As mentioned in the comment next to it btrfs_ioctl_trans_start can
do bad damage to filesystems and thus should be limited to privilegued
users.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Christoph Hellwig
f46b5a66b3 Btrfs: split out ioctl.c
Split the ioctl handling out of inode.c into a file of it's own.
Also fix up checkpatch.pl warnings for the moved code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00