2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

1095 Commits

Author SHA1 Message Date
Josef Bacik
723bda2083 Btrfs: fix the allocator loop logic
I was testing with empty_cluster = 0 to try and reproduce a problem and kept
hitting early enospc panics.  This was because our loop logic was a little
confused.  So this is what I did

1) Make the loop variable the ultimate decider on wether we should loop again
isntead of checking to see if we had an uncached bg, empty size or empty
cluster.

2) Increment loop before checking to see what we are on to make the loop
definitions make more sense.

3) If we are on the chunk alloc loop don't set empty_size/empty_cluster to 0
unless we didn't actually allocate a chunk.  If we did allocate a chunk we
should be able to easily setup a new cluster so clearing
empty_size/empty_cluster makes us less efficient.

This kept me from hitting panics while trying to reproduce the other problem.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-06-08 16:37:29 -04:00
Josef Bacik
f2bb8f5cfb Btrfs: don't commit the transaction if we dont have enough pinned bytes
I noticed when running an enospc test that we would get stuck committing the
transaction in check_data_space even though we truly didn't have enough space.
So check to see if bytes_pinned is bigger than num_bytes, if it's not don't
commit the transaction.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-06-08 15:08:31 -04:00
David Sterba
7841cb2898 btrfs: add helper for fs_info->closing
wrap checking of filesystem 'closing' flag and fix a few missing memory
barriers.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-06-04 08:11:22 -04:00
Chris Mason
ff5714cca9 Merge branch 'for-chris' of
git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

Conflicts:
	fs/btrfs/disk-io.c
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-28 07:00:39 -04:00
Chris Mason
d6c0cb379c Merge branch 'cleanups_and_fixes' into inode_numbers
Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 14:37:47 -04:00
Tsutomu Itoh
1cd307990d Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
Currently, btrfs_truncate_item and btrfs_extend_item returns only 0.
So, the check by BUG_ON in the caller is unnecessary.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:39 -04:00
Sergei Trofimovich
c4f675cd40 btrfs: don't spin in shrink_delalloc if there is nothing to free
Observed as a large delay when --mixed filesystem is filled up.
Test example:
1. create tiny --mixed FS:
   $ dd if=/dev/zero of=2G.img seek=$((2048 * 1024 * 1024 - 1)) count=1 bs=1
   $ mkfs.btrfs --mixed 2G.img
   $ mount -oloop 2G.img /mnt/ut/
2. Try to fill it up:
   $ dd if=/dev/urandom of=10M.file bs=10240 count=1024
   $ seq 1 256 | while read file_no; do echo $file_no; time cp 10M.file ${file_no}.copy; done

Up to '200.copy' it goes fast, but when disk fills-up each -ENOSPC
message takes 3 seconds to pop-up _every_ ENOSPC (and in usermode linux
it's even more: 30-60 seconds!). (Maybe, time depends on kernel's timer resolution).

No IO, no CPU load, just rescheduling. Some debugging revealed busy spinning
in shrink_delalloc.

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:14 -04:00
Josef Bacik
cca1c81f43 Btrfs: don't try to allocate from a block group that doesn't have enough space
If we have a very large filesystem, we can spend a lot of time in
find_free_extent just trying to allocate from empty block groups.  So instead
check to see if the block group even has enough space for the allocation, and if
not go on to the next block group.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:15 -04:00
Josef Bacik
026fd31782 Btrfs: don't always do readahead
Our readahead is sort of sloppy, and really isn't always needed.  For example if
ls is doing a stating ls (which is the default) it's going to stat in non-disk
order, so if say you have a directory with a stupid amount of files, readahead
is going to do nothing but waste time in the case of doing the stat.  Taking the
unconditional readahead out made my test go from 57 minutes to 36 minutes.  This
means that everywhere we do loop through the tree we want to make sure we do set
path->reada properly, so I went through and found all of the places where we
loop through the path and set reada to 1.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:14 -04:00
Josef Bacik
589d8ade83 Btrfs: try not to sleep as much when doing slow caching
When the fs is super full and we unmount the fs, we could get stuck in this
thing where unmount is waiting for the caching kthread to make progress and the
caching kthread keeps scheduling because we're in the middle of a commit.  So
instead just let the caching kthread keep going and only yeild if
need_resched().  This makes my horrible umount case go from taking up to 10
minutes to taking less than 20 seconds.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:13 -04:00
Josef Bacik
d82a6f1d7e Btrfs: kill BTRFS_I(inode)->block_group
Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything usefull.  In addition to be completely useless, when we initialize
an inode we try and find a freeish block group to set as the inodes block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable.  So just axe the thing altoghether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:12 -04:00
Josef Bacik
fcb80c2aff Btrfs: fix how we do space reservation for truncate
The ceph guys keep running into problems where we have space reserved in our
orphan block rsv when freeing it up.  This is because they tend to do snapshots
alot, so their truncates tend to use a bunch of space, so when we go to do
things like update the inode we have to steal reservation space in order to make
the reservation happen.  This happens because truncate can use as much space as
it freaking feels like, but we still have to hold space for removing the orphan
item and updating the inode, which will definitely always happen.  So in order
to fix this we need to split all of the reservation stuf up.  So with this patch
we have

1) The orphan block reserve which only holds the space for deleting our orphan
item when everything is over.

2) The truncate block reserve which gets allocated and used specifically for the
space that the truncate will use on a per truncate basis.

3) The transaction will always have 1 item's worth of data reserved so we can
update the inode normally.

Hopefully this will make the ceph problem go away.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:08 -04:00
Josef Bacik
a4abeea41a Btrfs: kill trans_mutex
We use trans_mutex for lots of things, here's a basic list

1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists

Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction.  So replace the
trans_mutex with a trans_lock spinlock and use it to do the following

1) Protect fs_info->running_transaction.  All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list.  This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list.  This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff.  This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join.  Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work.  So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.

I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:57 -04:00
Josef Bacik
7a7eaa40a3 Btrfs: take away the num_items argument from btrfs_join_transaction
I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :).  So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:56 -04:00
Chris Mason
945d8962ce Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers
Conflicts:
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/tree-log.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:33:42 -04:00
Chris Mason
dcc6d07322 Merge branch 'delayed_inode' into inode_numbers
Conflicts:
	fs/btrfs/inode.c
	fs/btrfs/ioctl.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 07:07:01 -04:00
Miao Xie
16cdcec736 btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:30:56 -04:00
Chris Mason
0965537308 Merge branch 'ino-alloc' of git://repo.or.cz/linux-btrfs-devel into inode_numbers
Conflicts:
	fs/btrfs/free-space-cache.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:27:38 -04:00
liubo
1aba86d67f Btrfs: fix easily get into ENOSPC in mixed case
When a btrfs disk is created by mixed data & metadata option, it will have no
pure data or pure metadata space info.

In btrfs's for-linus branch, commit 78b1ea13838039cd88afdd62519b40b344d6c920
(Btrfs: fix OOPS of empty filesystem after balance) initializes space infos at
the very beginning.  The problem is this initialization does not take the mixed
case into account, which will cause btrfs will easily get into ENOSPC in mixed
case.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:26 -04:00
David Sterba
182608c829 btrfs: remove old unused commented out code
Remove code which has been #if0-ed out for a very long time and does not
seem to be related to current codebase anymore.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-06 12:34:10 +02:00
David Sterba
8cc33e5c19 btrfs: Document a mutex lock/unlock sequence 2011-05-02 15:29:25 +02:00
David Sterba
b3b4aa74b5 btrfs: drop unused parameter from btrfs_release_path
parameter tree root it's not used since commit
5f39d397df ("Btrfs: Create extent_buffer
interface for large blocksizes")

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba
172ddd60a6 btrfs: drop gfp parameter from alloc_extent_map
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba
62a45b6092 btrfs: make functions static when possible
Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
Tsutomu Itoh
8d413713ca Btrfs: check return value of kmalloc()
The check on the return value of kmalloc() is added to some places.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Li Zefan
82d5902d9c Btrfs: Support reading/writing on disk free ino cache
This is similar to block group caching.

We dedicate a special inode in fs tree to save free ino cache.

At the very first time we create/delete a file after mount, the free ino
cache will be loaded from disk into memory. When the fs tree is commited,
the cache will be written back to disk.

To keep compatibility, we check the root generation against the generation
of the special inode when loading the cache, so the loading will fail
if the btrfs filesystem was mounted in an older kernel before.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:11 +08:00
Li Zefan
33345d0152 Btrfs: Always use 64bit inode number
There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.

So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.

There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.

Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:09 +08:00
Li Zefan
34d52cb6c5 Btrfs: Make free space cache code generic
So we can re-use the code to cache free inode numbers.

The change is quite straightforward. Two new structures are introduced.

- struct btrfs_free_space_ctl

  We move those variables that are used for caching free space from
  struct btrfs_block_group_cache to this new struct.

- struct btrfs_free_space_op

  We do block group specific work (e.g. calculation of extents threshold)
  through functions registered in this struct.

And then we can remove references to struct btrfs_block_group_cache.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:03 +08:00
Josef Bacik
6d74119f1a Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
Everytime we try to allocate disk space we try and see if we can pre-emptively
allocate a chunk, but in the common case we don't allocate anything, so there is
no sense in taking the chunk_mutex at all.  So instead if we are allocating a
chunk, mark it in the space_info so we don't get two people trying to allocate
at the same time.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Liu Bo <liubo2009@cn.fujitsu.com>
2011-04-16 07:10:56 -04:00
Chris Mason
0e4f8f8888 Btrfs: don't force chunk allocation in find_free_extent
find_free_extent likes to allocate in contiguous clusters,
which makes writeback faster, especially on SSD storage.  As
the FS fragments, these clusters become harder to find and we have
to decide between allocating a new chunk to make more clusters
or giving up on the cluster to allocate from the free space
we have.

Right now it creates too many chunks, and you can end up with
a whole FS that is mostly empty metadata chunks.  This commit
changes the allocation code to be more strict and only
allocate new chunks when we've made good use of the chunks we
already have.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-15 16:05:44 -04:00
liubo
c59021f846 Btrfs: fix OOPS of empty filesystem after balance
btrfs will remove unused block groups after balance.
When a empty filesystem is balanced, the block group with tag "DATA" may be
dropped, and after umount and mount again, it will not find "DATA" space_info
and lead to OOPS.
So we initial the necessary space_infos(DATA, SYSTEM, METADATA) to avoid OOPS.

Reported-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:53 -04:00
liubo
9f7c43c967 Btrfs: fix memory leak of empty filesystem after balance
After Josef's patch(commit 3c14874acc),
btrfs will exclude super bytes when reading block groups(by marking a extent
state UPTODATE).  However, these bytes do not get freed while balance remove
unused block groups, and we won't process those removed ones any more, when
we do umount and unload the btrfs module,  btrfs hits a memory leak.

This patch add the missing free operation.

Reproduce steps:
$ mkfs.btrfs disk
$ mount disk /mnt/btrfs -o loop
$ btrfs filesystem balance /mnt/btrfs
$ umount /mnt/btrfs
$ rmmod btrfs

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:52 -04:00
Yoshinori Sano
dac97e516c Btrfs: fix uncheck memory allocations
To make Btrfs code more robust, several return value checks where memory
allocation can fail are introduced. I use BUG_ON where I don't know how
to handle the error properly, which increases the number of using the
notorious BUG_ON, though.

Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:49 -04:00
Li Dongyang
f7039b1d5c Btrfs: add btrfs_trim_fs() to handle FITRIM
We take an free extent out from allocator, trim it, then put it back,
but before we trim the block group, we should make sure the block group is
cached, so plus a little change to make cache_block_group() run without a
transaction.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:47 -04:00
Li Dongyang
5378e60734 Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
Callers of btrfs_discard_extent() should check if we are mounted with -o discard,
as we want to make fitrim to work even the fs is not mounted with -o discard.
Also we should use REQ_DISCARD to map the free extent to get a full mapping,
last we only return errors if
1. the error is not a EOPNOTSUPP
2. no device supports discard

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:46 -04:00
Li Dongyang
b4d00d569a Btrfs: make update_reserved_bytes() public
Make the function public as we should update the reserved extents calculations
after taking out an extent for trimming.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:43 -04:00
Miao Xie
fc0e4a314e btrfs: use GFP_NOFS instead of GFP_KERNEL
In the filesystem context, we must allocate memory by GFP_NOFS,
or we may start another filesystem operation and make kswap thread hang up.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:39 -04:00
Tsutomu Itoh
97d9a8a420 Btrfs: check return value of read_tree_block()
This patch is checking return value of read_tree_block(),
and if it is NULL, error processing.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:37 -04:00
Tsutomu Itoh
db5b493ac7 Btrfs: cleanup some BUG_ON()
This patch changes some BUG_ON() to the error return.
(but, most callers still use BUG_ON())

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:35 -04:00
liubo
1abe9b8a13 Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
              dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
              dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
 btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
   flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
   flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
   flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
   flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

Here is what I have added:

1) ordere_extent:
        btrfs_ordered_extent_add
        btrfs_ordered_extent_remove
        btrfs_ordered_extent_start
        btrfs_ordered_extent_put

These provide critical information to understand how ordered_extents are
updated.

2) extent_map:
        btrfs_get_extent

extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.

3) writepage:
        __extent_writepage
        btrfs_writepage_end_io_hook

Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.

4) inode:
        btrfs_inode_new
        btrfs_inode_request
        btrfs_inode_evict

These can show where and when a inode is created, when a inode is evicted.

5) sync:
        btrfs_sync_file
        btrfs_sync_fs

These show sync arguments.

6) transaction:
        btrfs_transaction_commit

In transaction based filesystem, it will be useful to know the generation and
who does commit.

7) back reference and cow:
	btrfs_delayed_tree_ref
	btrfs_delayed_data_ref
	btrfs_delayed_ref_head
	btrfs_cow_block

Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.

8) chunk:
	btrfs_chunk_alloc
	btrfs_chunk_free

Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.

9) reserved_extent:
	btrfs_reserved_extent_alloc
	btrfs_reserved_extent_free

These can show how btrfs uses its space.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:33 -04:00
Josef Bacik
a826d6dcb3 Btrfs: check items for correctness as we search
Currently if we have corrupted items things will blow up in spectacular ways.
So as we read in blocks and they are leaves, check the entire leaf to make sure
all of the items are correct and point to valid parts in the leaf for the item
data the are responsible for.  If the item is corrupt we will kick back EIO and
not read any of the copies since they are likely to not be correct either.  This
will catch generic corruptions, it will be up to the individual callers of
btrfs_search_slot to make sure their items are right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-17 14:21:37 -04:00
Josef Bacik
66b4ffd110 Btrfs: handle errors in btrfs_orphan_cleanup
If we cannot truncate an inode for some reason we will never delete the orphan
item associated with that inode, which means that we will loop forever in
btrfs_orphan_cleanup.  Instead of doing this just return error so we fail to
mount.  It sucks, but hey it's better than hanging.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-17 14:21:26 -04:00
Josef Bacik
57a45ced94 Btrfs: change reserved_extents to an atomic_t
We track delayed allocation per inodes via 2 counters, one is
outstanding_extents and reserved_extents.  Outstanding_extents is already an
atomic_t, but reserved_extents is not and is protected by a spinlock.  So
convert this to an atomic_t and instead of using a spinlock, use atomic_cmpxchg
when releasing delalloc bytes.  This makes our inode 72 bytes smaller, and
reduces locking overhead (albiet it was minimal to begin with).  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-17 14:21:18 -04:00
Linus Torvalds
0e5b88cd99 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: break out of shrink_delalloc earlier
  btrfs: fix not enough reserved space
  btrfs: fix dip leak
  Btrfs: make sure not to return overlapping extents to fiemap
  Btrfs: deal with short returns from copy_from_user
  Btrfs: fix regressions in copy_from_user handling
2011-03-13 16:00:49 -07:00
Chris Mason
36e39c40b3 Btrfs: break out of shrink_delalloc earlier
Josef had changed shrink_delalloc to exit after three shrink
attempts, which wasn't quite enough because new writers could
race in and steal free space.

But it also fixed deadlocks and stalls as we tried to recover
delalloc reservations.  The code was tweaked to loop 1024
times, and would reset the counter any time a small amount
of progress was made.  This was too drastic, and with a
lot of writers we can end up stuck in shrink_delalloc forever.

The shrink_delalloc loop is fairly complex because the caller is looping
too, and the caller will go ahead and force a transaction commit to make
sure we reclaim space.

This reworks things to exit shrink_delalloc when we've forced some
writeback and the delalloc reservations have gone down.  This means
the writeback has not just started but has also finished at
least some of the metadata changes required to reclaim delalloc
space.

If we've got this wrong, we're returning ENOSPC too early, which
is a big improvement over the current behavior of hanging the machine.

Test 224 in xfstests hammers on this nicely, and with 1000 writers
trying to fill a 1GB drive we get our first ENOSPC at 93% full.  The
other writers are able to continue until we get 100%.

This is a worst case test for btrfs because the 1000 writers are doing
small IO, and the small FS size means we don't have a lot of room
for metadata chunks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-12 07:08:42 -05:00
Linus Torvalds
4660ba63f1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix fiemap bugs with delalloc
  Btrfs: set FMODE_EXCL in btrfs_device->mode
  Btrfs: make btrfs_rm_device() fail gracefully
  Btrfs: Avoid accessing unmapped kernel address
  Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
  Btrfs: allow balance to explicitly allocate chunks as it relocates
  Btrfs: put ENOSPC debugging under a mount option
2011-02-25 14:03:39 -08:00
Chris Mason
c87f08ca44 Btrfs: allow balance to explicitly allocate chunks as it relocates
Btrfs device shrinking and balancing ends up reallocating all the blocks
in order to allow COW to move them to new destinations.  It is somewhat
awkward in terms of ENOSPC because most of the enospc code is built
around the idea that some operation on a reference counted tree triggers
allocations in the non-reference counted trees.

This commit changes the balancing code to deal with enospc by trying to
allocate a new chunk.  If that allocation succeeds, we go ahead and
retry whatever failed due to enospc.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-16 15:28:47 -05:00
Chris Mason
91435650c2 Btrfs: put ENOSPC debugging under a mount option
ENOSPC in btrfs is getting to the point where the extra debugging isn't
required.  I've put it under mount -o enospc_debug just in case someone
is having difficult problems.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-16 15:28:36 -05:00
Linus Torvalds
007a14af26 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: check return value of alloc_extent_map()
  Btrfs - Fix memory leak in btrfs_init_new_device()
  btrfs: prevent heap corruption in btrfs_ioctl_space_info()
  Btrfs: Fix balance panic
  Btrfs: don't release pages when we can't clear the uptodate bits
  Btrfs: fix page->private races
2011-02-15 08:00:35 -08:00
Tsutomu Itoh
c26a920373 Btrfs: check return value of alloc_extent_map()
I add the check on the return value of alloc_extent_map() to several places.
In addition, alloc_extent_map() returns only the address or NULL.
Therefore, check by IS_ERR() is unnecessary. So, I remove IS_ERR() checking.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 16:21:37 -05:00
Linus Torvalds
cb5520f02c Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (33 commits)
  Btrfs: Fix page count calculation
  btrfs: Drop __exit attribute on btrfs_exit_compress
  btrfs: cleanup error handling in btrfs_unlink_inode()
  Btrfs: exclude super blocks when we read in block groups
  Btrfs: make sure search_bitmap finds something in remove_from_bitmap
  btrfs: fix return value check of btrfs_start_transaction()
  btrfs: checking NULL or not in some functions
  Btrfs: avoid uninit variable warnings in ordered-data.c
  Btrfs: catch errors from btrfs_sync_log
  Btrfs: make shrink_delalloc a little friendlier
  Btrfs: handle no memory properly in prepare_pages
  Btrfs: do error checking in btrfs_del_csums
  Btrfs: use the global block reserve if we cannot reserve space
  Btrfs: do not release more reserved bytes to the global_block_rsv than we need
  Btrfs: fix check_path_shared so it returns the right value
  btrfs: check return value of btrfs_start_ioctl_transaction() properly
  btrfs: fix return value check of btrfs_join_transaction()
  fs/btrfs/inode.c: Add missing IS_ERR test
  btrfs: fix missing break in switch phrase
  btrfs: fix several uncheck memory allocations
  ...
2011-02-07 14:06:18 -08:00
Josef Bacik
3c14874acc Btrfs: exclude super blocks when we read in block groups
This has been resulting in a BUT_ON(ret) after btrfs_reserve_extent in
btrfs_cow_file_range.  The reason is we don't actually calculate the bytes_super
for a block group until we go to cache it, which means that the space_info can
hand out reservations for space that it doesn't actually have, and we can run
out of data space.  This is also a problem if you are using space caching since
we don't ever calculate bytes_super for the block groups.  So instead everytime
we read a block group call exclude_super_stripes, which calculates the
bytes_super for the block group so it can be left out of the space_info.  Then
whenever caching completes we just call free_excluded_extents so that the super
excluded extents are freed up.  Also if we are unmounting and we hit any block
groups that haven't been cached we still need to call free_excluded_extents to
make sure things are cleaned up properly.  Thanks,

Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-06 07:17:44 -05:00
Tsutomu Itoh
98d5dc13e7 btrfs: fix return value check of btrfs_start_transaction()
The error check of btrfs_start_transaction() is added, and the mistake
of the error check on several places is corrected.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-01 07:17:27 -05:00
Tsutomu Itoh
5df6708348 btrfs: checking NULL or not in some functions
Because NULL is returned when the memory allocation fails,
it is checked whether it is NULL.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-01 07:16:37 -05:00
Josef Bacik
b1953bcec9 Btrfs: make shrink_delalloc a little friendlier
Xfstests 224 will just sit there and spin for ever until eventually we give up
flushing delalloc and exit.  On my box this took several hours.  I could not
interrupt this process either, even though we use INTERRUPTIBLE.  So do 2 things

1) Keep us from looping over and over again without reclaiming anything
2) If we get interrupted exit the loop

I tested this and the test now exits in a reasonable amount of time, and can be
interrupted with ctrl+c.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-31 16:27:28 -05:00
Josef Bacik
68a82277b8 Btrfs: use the global block reserve if we cannot reserve space
We call use_block_rsv right before we make an allocation in order to make sure
we have enough space.  Now normally people have called btrfs_start_transaction()
with the appropriate amount of space that we need, so we just use some of that
pre-reserved space and move along happily.  The problem is where people use
btrfs_join_transaction(), which doesn't actually reserve any space.  So we try
and reserve space here, but we cannot flush delalloc, so this forces us to
return -ENOSPC when in reality we have plenty of space.  The most common symptom
is seeing a bunch of "couldn't dirty inode" messages in syslog.  With
xfstests 224 we end up falling back to start_transaction and then doing all the
flush delalloc stuff which causes to hang for a very long time.

So instead steal from the global reserve, which is what this is meant for
anyway.  With this patch and the other 2 I have sent xfstests 224 now passes
successfully.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-28 16:40:37 -05:00
Josef Bacik
e9e22899de Btrfs: do not release more reserved bytes to the global_block_rsv than we need
When we do btrfs_block_rsv_release, if global_block_rsv is not full we will
release all the extra bytes to global_block_rsv, even if it's only a little
short of the amount of space that we need to reserve.  This causes us to starve
ourselves of reservable space during the transaction which will force us to
shrink delalloc bytes and commit the transaction more often than we should.  So
instead just add the amount of bytes we need to add to the global reserve so
reserved == size, and then add the rest back into the space_info for general
use.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-28 16:40:37 -05:00
Tsutomu Itoh
3612b49598 btrfs: fix return value check of btrfs_join_transaction()
The error check of btrfs_join_transaction()/btrfs_join_transaction_nolock()
is added, and the mistake of the error check in several places is
corrected.

For more stable Btrfs, I think that we should reduce BUG_ON().
But, I think that long time is necessary for this.
So, I propose this patch as a short-term solution.

With this patch:
 - To more stable Btrfs, the part that should be corrected is clarified.
 - The panic isn't done by the NULL pointer reference etc. (even if
   BUG_ON() is increased temporarily)
 - The error code is returned in the place where the error can be easily
   returned.

As a long-term plan:
 - BUG_ON() is reduced by using the forced-readonly framework, etc.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-28 16:40:37 -05:00
Linus Torvalds
eee2a817df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
  Btrfs: forced readonly mounts on errors
  btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
  Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
  btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
  btrfs: check NULL or not
  btrfs: Don't pass NULL ptr to func that may deref it.
  btrfs: mount failure return value fix
  btrfs: Mem leak in btrfs_get_acl()
  btrfs: fix wrong free space information of btrfs
  btrfs: make the chunk allocator utilize the devices better
  btrfs: restructure find_free_dev_extent()
  btrfs: fix wrong calculation of stripe size
  btrfs: try to reclaim some space when chunk allocation fails
  btrfs: fix wrong data space statistics
  fs/btrfs: Fix build of ctree
  Btrfs: fix off by one while setting block groups readonly
  Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
  Btrfs: Add readonly snapshots support
  Btrfs: Refactor btrfs_ioctl_snap_create()
  btrfs: Extract duplicate decompress code
  ...
2011-01-17 14:43:43 -08:00
liubo
acce952b02 Btrfs: forced readonly mounts on errors
This patch comes from "Forced readonly mounts on errors" ideas.

As we know, this is the first step in being more fault tolerant of disk
corruptions instead of just using BUG() statements.

The major content:
- add a framework for generating errors that should result in filesystems
  going readonly.
- keep FS state in disk super block.
- make sure that all of resource will be freed and released at umount time.
- make sure that fter FS is forced readonly on error, there will be no more
  disk change before FS is corrected. For this, we should stop write operation.

After this patch is applied, the conversion from BUG() to such a framework can
happen incrementally.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-17 15:13:08 -05:00
Josef Bacik
f690efb1aa Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
If we run low on space we could get a bunch of warnings out of
btrfs_block_rsv_check, but this is mostly just called via the transaction code
to see if we need to end the transaction, it expects to see failures, so let's
not WARN and freak everybody out for no reason.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Miao Xie
6d07bcec96 btrfs: fix wrong free space information of btrfs
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.

 # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
 # btrfs-show
 Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
 	Total devices 2 FS bytes used 28.00KB
 	devid    1 size 5.01GB used 2.03GB path /dev/sda9
 	devid    2 size 10.00GB used 2.01GB path /dev/sda10
 # btrfs device scan /dev/sda9 /dev/sda10
 # mount /dev/sda9 /mnt
 # dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
   (fill the filesystem)
 # sync
 # df -TH
 Filesystem	Type	Size	Used	Avail	Use%	Mounted on
 /dev/sda9	btrfs	17G	8.6G	5.4G	62%	/mnt
 # btrfs-show
 Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
 	Total devices 2 FS bytes used 3.99GB
 	devid    1 size 5.01GB used 5.01GB path /dev/sda9
 	devid    2 size 10.00GB used 4.99GB path /dev/sda10

It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.

This patch fixes it by calcing the free space that can be used to allocate
chunks.

Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
   3.1. if it is not zero, and then check the number of the devices that has
        more free space than this device,
        if the number of the devices is beyond the min stripe number, the free
        space can be used, and add into total free space.
        if the number of the devices is below the min stripe number, we can not
        use the free space, the check ends.
   3.2. if the free space is zero, check the next devices, goto 3.1

This implementation is just likely fake chunk allocation.

After appling this patch, df can show correct space information:
 # df -TH
 Filesystem	Type	Size	Used	Avail	Use%	Mounted on
 /dev/sda9	btrfs	17G	8.6G	0	100%	/mnt

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie
7bfc837df9 btrfs: restructure find_free_dev_extent()
- make it return the start position and length of the max free space when it can
  not find a suitable free space.
- make it more readability

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie
d52a5b5f1f btrfs: try to reclaim some space when chunk allocation fails
We cannot write data into files when when there is tiny space in the filesystem.

Reproduce steps:
 # mkfs.btrfs /dev/sda1
 # mount /dev/sda1 /mnt
 # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
 # dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=99999999999999
   (fill the filesystem)
 # umount /mnt
 # mount /dev/sda1 /mnt
 # rm -f /mnt/tmpfile0
 # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
   (failed with nospec)

But if we do the last step again, we can write data successfully. The reason of
the problem is that btrfs didn't try to commit the current transaction and
reclaim some space when chunk allocation failed.

This patch fixes it by committing the current transaction to reclaim some
space when chunk allocation fails.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Chris Mason
65e5341b9a Btrfs: fix off by one while setting block groups readonly
When we read in block groups, we'll set non-redundant groups
readonly if we find a raid1, DUP or raid10 group.  But the
ro code has an off by one bug in the math around testing to
make sure out accounting doesn't go wrong.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-04 16:41:39 -05:00
Linus Torvalds
e13cf63f2b Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: prevent RAID level downgrades when space is low
  Btrfs: account for missing devices in RAID allocation profiles
  Btrfs: EIO when we fail to read tree roots
  Btrfs: fix compiler warnings
  Btrfs: Make async snapshot ioctl more generic
  Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
  Btrfs: Fix a crash when mounting a subvolume
  Btrfs: fix sync subvol/snapshot creation
  Btrfs: Fix page leak in compressed writeback path
  Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
  Btrfs: fixup return code for btrfs_del_orphan_item
  Btrfs: do not do fast caching if we are allocating blocks for tree_root
  Btrfs: deal with space cache errors better
  Btrfs: fix use after free in O_DIRECT
2010-12-14 11:08:13 -08:00
Chris Mason
83a50de97f Btrfs: prevent RAID level downgrades when space is low
The extent allocator has code that allows us to fill
allocations from any available block group, even if it doesn't
match the raid level we've requested.

This was put in because adding a new drive to a filesystem
made with the default mkfs options actually upgrades the metadata from
single spindle dup to full RAID1.

But, the code also allows us to allocate from a raid0 chunk when we
really want a raid1 or raid10 chunk.  This can cause big trouble because
mkfs creates a small (4MB) raid0 chunk for data and metadata which then
goes unused for raid1/raid10 installs.

The allocator will happily wander in and allocate from that chunk when
things get tight, which is not correct.

The fix here is to make sure that we provide duplication when the
caller has asked for it.  It does all the dups to be any raid level,
which preserves the dup->raid1 upgrade abilities.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 20:07:01 -05:00
Chris Mason
cd02dca564 Btrfs: account for missing devices in RAID allocation profiles
When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.

This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.

The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.

The fix here is to count the missing devices as we build up the list
of devices in the system.  This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 20:06:52 -05:00
Josef Bacik
84cd948cb1 Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
Not being able to delete an orphan item isn't a horrible thing.  The worst that
happens is the next time around we try and do the orphan cleanup and we can't
find the referenced object and just delete the item and move on.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-10 16:29:04 -05:00
Josef Bacik
b8399dee47 Btrfs: do not do fast caching if we are allocating blocks for tree_root
Since the fast caching uses normal tree locking, we can possibly deadlock if we
get to the caching via a btrfs_search_slot() on the tree_root.  So just check to
see if the root we are on is the tree root, and just don't do the fast caching.

Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-09 13:57:13 -05:00
Josef Bacik
2b20982e31 Btrfs: deal with space cache errors better
Currently if the space cache inode generation number doesn't match the
generation number in the space cache header we will just fail to load the space
cache, but we won't mark the space cache as an error, so we'll keep getting that
error each time somebody tries to cache that block group until we actually clear
the thing.  Fix this by marking the space cache as having an error so we only
get the message once.  This patch also makes it so that we don't try and setup
space cache for a block group that isn't cached, since we won't be able to write
it out anyway.  None of these problems are actual problems, they are just
annoying and sub-optimal.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-09 13:57:12 -05:00
Linus Torvalds
aa3fc52546 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
  Btrfs: don't use migrate page without CONFIG_MIGRATION
  Btrfs: deal with DIO bios that span more than one ordered extent
  Btrfs: setup blank root and fs_info for mount time
  Btrfs: fix fiemap
  Btrfs - fix race between btrfs_get_sb() and umount
  Btrfs: update inode ctime when using links
  Btrfs: make sure new inode size is ok in fallocate
  Btrfs: fix typo in fallocate to make it honor actual size
  Btrfs: avoid NULL pointer deref in try_release_extent_buffer
  Btrfs: make btrfs_add_nondir take parent inode as an argument
  Btrfs: hold i_mutex when calling btrfs_log_dentry_safe
  Btrfs: use dget_parent where we can UPDATED
  Btrfs: fix more ESTALE problems with NFS
  Btrfs: handle NFS lookups properly
  btrfs: make 1-bit signed fileds unsigned
  btrfs: Show device attr correctly for symlinks
  btrfs: Set file size correctly in file clone
  btrfs: Check if dest_offset is block-size aligned before cloning file
  Btrfs: handle the space_cache option properly
  btrfs: Fix early enospc because 'unused' calculated with wrong sign.
  ...
2010-11-29 14:11:08 -08:00
Arne Jansen
6f33434850 btrfs: Fix early enospc because 'unused' calculated with wrong sign.
'unused' calculated with wrong sign in reserve_metadata_bytes().
This might have lead to unwanted over-reservations.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:04 -05:00
Linus Torvalds
925d169f5b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
  Btrfs: deal with errors from updating the tree log
  Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
  Btrfs: make SNAP_DESTROY async
  Btrfs: add SNAP_CREATE_ASYNC ioctl
  Btrfs: add START_SYNC, WAIT_SYNC ioctls
  Btrfs: async transaction commit
  Btrfs: fix deadlock in btrfs_commit_transaction
  Btrfs: fix lockdep warning on clone ioctl
  Btrfs: fix clone ioctl where range is adjacent to extent
  Btrfs: fix delalloc checks in clone ioctl
  Btrfs: drop unused variable in block_alloc_rsv
  Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
  Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
  Btrfs: Use ERR_CAST helpers
  Btrfs: use memdup_user helpers
  Btrfs: fix raid code for removing missing drives
  Btrfs: Switch the extent buffer rbtree into a radix tree
  Btrfs: restructure try_release_extent_buffer()
  Btrfs: use the flusher threads for delalloc throttling
  Btrfs: tune the chunk allocation to 5% of the FS as metadata
  ...

Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
useless and removed in commit 5e8067adfd: "rcu head remove init")
2010-10-30 09:05:48 -07:00
Chris Mason
d8e39c457b Btrfs: drop unused variable in block_alloc_rsv
The alloc_target variable is not really used.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:17:41 -04:00
Andi Kleen
559af82114 Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.

Still needs more review.

Found by gcc 4.6's new warnings

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:37 -04:00
Chris Mason
bf9022e06a Btrfs: use the flusher threads for delalloc throttling
We have a fairly complex set of loops around walking our list of
delalloc inodes when we find metadata delalloc space running low.
It doesn't work very well, can use large amounts of CPU and doesn't
do very efficient writeback.

This switches us to kick the bdi flusher threads instead.  All dirty
data in btrfs is accounted as delalloc data, so this is very similar
in terms of what it writes, but we're able to just kick off the IO
and wait for progress.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:36 -04:00
Chris Mason
e5bc245829 Btrfs: tune the chunk allocation to 5% of the FS as metadata
An earlier commit tried to keep us from allocating too many
empty metadata chunks.  It was somewhat too restrictive and could
lead to ENOSPC errors on empty filesystems.

This increases the limits to about 5% of the FS size, allowing more
metadata chunks to be preallocated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:35 -04:00
Chris Mason
6b5b817f10 Merge branch 'bug-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work
Conflicts:
	fs/btrfs/extent-tree.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 09:27:49 -04:00
Josef Bacik
8216ef866d Btrfs: let the user know space caching is enabled
If you mount -o space_cache, the option will be persistent across mounts, but to
make sure the user knows that they did this, emit a message telling them if they
didn't mount with -o space_cache but the feature is still used.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:37 -04:00
Josef Bacik
88c2ba3b06 Btrfs: Add a clear_cache mount option
If something goes wrong with the free space cache we need a way to make sure
it's not loaded on mount and that it's cleared for everybody.  When you pass the
clear_cache option it will make it so all block groups are setup to be cleared,
which keeps them from being loaded and then they will be truncated when the
transaction is committed.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:36 -04:00
Josef Bacik
67377734fd Btrfs: add support for mixed data+metadata block groups
There are just a few things that need to be fixed in the kernel to support mixed
data+metadata block groups.  Mostly we just need to make sure that if we are
using mixed block groups that we continue to allocate mixed block groups as we
need them.  Also we need to make sure __find_space_info will find our space info
if we search for DATA or METADATA only.  Tested this with xfstests and it works
nicely.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:36 -04:00
Josef Bacik
dde5abee12 Btrfs: check cache->caching_ctl before returning if caching has started
With the free space disk caching we can mark the block group as started with the
caching, but we don't have a caching ctl.  This can race with anybody else who
tries to get the caching ctl before we cache (this is very hard to do btw).  So
instead check to see if cache->caching_ctl is set, and if not return NULL.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:35 -04:00
Josef Bacik
9d66e233c7 Btrfs: load free space cache if it exists
This patch actually loads the free space cache if it exists.  The only thing
that really changes here is that we need to cache the block group if we're going
to remove an extent from it.  Previously we did not do this since the caching
kthread would pick it up.  With the on disk cache we don't have this luxury so
we need to make sure we read the on disk cache in first, and then remove the
extent, that way when the extent is unpinned the free space is added to the
block group.  This has been tested with all sorts of things.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:35 -04:00
Josef Bacik
0cb59c9953 Btrfs: write out free space cache
This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups.  There are a bunch of changes
in inode.c in order to account for special cases.  Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions.  Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock.  This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:29 -04:00
Josef Bacik
0af3d00bad Btrfs: create special free space cache inode
In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group.  So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have.  We
truncate and pre-allocate space everytime to make sure it's uptodate.

This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way.  When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-28 15:59:09 -04:00
Josef Bacik
e9bb7f10d3 Btrfs: remove warn_on from use_block_rsv
Because btrfs_dirty_inode does a btrfs_join_transaction, it doesn't actually
reserve space.  It does this so we can try and dirty the inode quickly without
having to deal with the ENOSPC problems.  But if it does get back ENOSPC it
handles it properly.  The problem is use_block_rsv does a WARN_ON whenever this
case happens, even tho btrfs_dirty_inode takes it into account and actually
expects to get -ENOSPC if things are particularly tight.  So instead just remove
the warning.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-26 12:55:03 -04:00
Josef Bacik
382279336f Btrfs: set trans to null in reserve_metadata_bytes if we commit the transaction
btrfs_commit_transaction will free our trans, but because we pass trans to
shrink_delalloc we could possibly have a use after free situation.  So instead
if we commit the transaction, set trans to null and set committed to true so we
don't keep trying to commit a transaction.  This fixes a panic I could reproduce
at will.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-26 12:52:53 -04:00
Josef Bacik
8bb8ab2e93 Btrfs: rework how we reserve metadata bytes
With multi-threaded writes we were getting ENOSPC early because somebody would
come in, start flushing delalloc because they couldn't make their reservation,
and in the meantime other threads would come in and use the space that was
getting freed up, so when the original thread went to check to see if they had
space they didn't and they'd return ENOSPC.  So instead if we have some free
space but not enough for our reservation, take the reservation and then start
doing the flushing.  The only time we don't take reservations is when we've
already overcommitted our space, that way we don't have people who come late to
the party way overcommitting ourselves.  This also moves all of the retrying and
flushing code into reserve_metdata_bytes so it's all uniform.  This keeps my
fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
fill up the disk.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:55:01 -04:00
Josef Bacik
14ed0ca6e8 Btrfs: don't allocate chunks as aggressively
Because the ENOSPC code over reserves super aggressively we end up allocating
chunks way more often than we should.  For example with my fs_mark tests on a
2gb fs I can end up reserved 1gb just for metadata, when only 34mb of that is
being used.  So instead check to see if the amount of space actually used is
less than 30% of the total space, and if so don't allocate a chunk, but only if
we have at least 256mb of free space to make sure we don't put too much pressure
on free space.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:55:00 -04:00
Josef Bacik
0019f10db6 Btrfs: re-work delalloc flushing
Currently we try and flush delalloc, but we only do that in a sort of weak way,
which works fine in most cases but if we're under heavy pressure we need to be
able to wait for flushing to happen.  Also instead of checking the bytes
reserved in the block_rsv, check the space info since it is more accurate.  The
sync option will be used in a future patch.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:58 -04:00
Josef Bacik
6d48755d02 Btrfs: fix reservation code for mixed block groups
The global reservation stuff tries to add together DATA and METADATA used in
order to figure out how much to reserve for everything, but this doesn't work
right for mixed block groups.  Instead if we have mixed block groups just set
data used to 0.  Also with mixed block groups we will use bytes_may_use for
keeping track of delalloc bytes, so we need to take that into account in our
reservation calculations.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:56 -04:00
Josef Bacik
89a55897a2 Btrfs: fix df regression
The new ENOSPC stuff breaks out the raid types which breaks the way we were
reporting df to the system.  This fixes it back so that Available is the total
space available to data and used is the actual bytes used by the filesystem.
This means that Available is Total - data used - all of the metadata space.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:55 -04:00
Josef Bacik
a1f765061e Btrfs: stop trying to shrink delalloc if there are no inodes to reclaim
In very severe ENOSPC cases we can run out of inodes to do delalloc on, which
means we'll just keep looping trying to shrink delalloc.  Instead, if we fail to
shrink delalloc 3 times in a row break out since we're not likely to make any
progress.  Tested this with a 100mb fs an xfstests test 13.  Before the patch it
would hang the box, with the patch we get -ENOSPC like we should.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:51 -04:00
Christoph Hellwig
dd3932eddf block: remove BLKDEV_IFL_WAIT
All the blkdev_issue_* helpers can only sanely be used for synchronous
caller.  To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-16 20:52:58 +02:00
Christoph Hellwig
c3b9a62c8f btrfs: replace barriers with explicit flush / FUA usage
Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:39 +02:00
Linus Torvalds
b25b550bb1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: The file argument for fsync() is never null
  Btrfs: handle ERR_PTR from posix_acl_from_xattr()
  Btrfs: avoid BUG when dropping root and reference in same transaction
  Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
  Btrfs: should add a permission check for setfacl
  Btrfs: btrfs_lookup_dir_item() can return ERR_PTR
  Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRs
  Btrfs: unwind after btrfs_start_transaction() errors
  Btrfs: btrfs_iget() returns ERR_PTR
  Btrfs: handle kzalloc() failure in open_ctree()
  Btrfs: handle error returns from btrfs_lookup_dir_item()
  Btrfs: Fix BUG_ON for fs converted from extN
  Btrfs: Fix null dereference in relocation.c
  Btrfs: fix remap_file_pages error
  Btrfs: uninitialized data is check_path_shared()
  Btrfs: fix fallocate regression
  Btrfs: fix loop device on top of btrfs
2010-06-11 14:18:47 -07:00
Yan, Zheng
3bf84a5a83 Btrfs: Fix BUG_ON for fs converted from extN
Tree blocks can live in data block groups in FS converted from extN.
So it's easy to trigger the BUG_ON.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:48:35 -04:00
Linus Torvalds
105a048a4f Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
  Btrfs: add more error checking to btrfs_dirty_inode
  Btrfs: allow unaligned DIO
  Btrfs: drop verbose enospc printk
  Btrfs: Fix block generation verification race
  Btrfs: fix preallocation and nodatacow checks in O_DIRECT
  Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
  Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
  Btrfs: rework O_DIRECT enospc handling
  Btrfs: use async helpers for DIO write checksumming
  Btrfs: don't walk around with task->state != TASK_RUNNING
  Btrfs: do aio_write instead of write
  Btrfs: add basic DIO read/write support
  direct-io: do not merge logically non-contiguous requests
  direct-io: add a hook for the fs to provide its own submit_bio function
  fs: allow short direct-io reads to be completed via buffered IO
  Btrfs: Metadata ENOSPC handling for balance
  Btrfs: Pre-allocate space for data relocation
  Btrfs: Metadata ENOSPC handling for tree log
  Btrfs: Metadata reservation for orphan inodes
  Btrfs: Introduce global metadata reservation
  ...
2010-05-27 10:43:44 -07:00
Chris Mason
933b585f70 Btrfs: drop verbose enospc printk
Less printk is good printk.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-26 21:35:34 -04:00
Yan, Zheng
3fd0a5585e Btrfs: Metadata ENOSPC handling for balance
This patch adds metadata ENOSPC handling for the balance code.
It is consisted by following major changes:

1. Avoid COW tree leave in the phrase of merging tree.

2. Handle interaction with snapshot creation.

3. make the backref cache can live across transactions.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:54 -04:00
Yan, Zheng
d68fc57b7e Btrfs: Metadata reservation for orphan inodes
reserve metadata space for handling orphan inodes

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:52 -04:00
Yan, Zheng
8929ecfa50 Btrfs: Introduce global metadata reservation
Reserve metadata space for extent tree, checksum tree and root tree

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:52 -04:00
Yan, Zheng
0ca1f7ceb1 Btrfs: Update metadata reservation for delayed allocation
Introduce metadata reservation context for delayed allocation
and update various related functions.

This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:51 -04:00
Yan, Zheng
a22285a6a3 Btrfs: Integrate metadata reservation with start_transaction
Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.

Changes since V1:

Add code that check if unlink and rmdir will free space.

Add ENOSPC handling for clone ioctl.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:50 -04:00
Yan, Zheng
f0486c68e4 Btrfs: Introduce contexts for metadata reservation
Introducing metadata reseravtion contexts has two major advantages.
First, it makes metadata reseravtion more traceable. Second, it can
reclaim freed space and re-add them to the itself after transaction
committed.

Besides add btrfs_block_rsv structure and related helper functions,
This patch contains following changes:

Move code that decides if freed tree block should be pinned into
btrfs_free_tree_block().

Make space accounting more accurate, mainly for handling read only
block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:50 -04:00
Yan, Zheng
5da9d01b66 Btrfs: Shrink delay allocated space in a synchronized
Shrink delayed allocation space in a synchronized manner is more
controllable than flushing all delay allocated space in an async
thread.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:48 -04:00
Yan, Zheng
424499dbd0 Btrfs: Kill allocate_wait in space_info
We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:48 -04:00
Yan, Zheng
b742bb82f1 Btrfs: Link block groups of different raid types
The size of reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info, changing
allocation profile will corrupt reserved space accounting.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:47 -04:00
Dmitry Monakhov
fbd9b09a17 blkdev: generalize flags for blkdev_issue_fn functions
The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-28 19:47:36 +02:00
Linus Torvalds
d6cf853d4d Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: make sure the chunk allocator doesn't create zero length chunks
  Btrfs: fix data enospc check overflow
2010-04-12 18:37:04 -07:00
Linus Torvalds
795d580bae Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: add check for changed leaves in setup_leaf_for_split
  Btrfs: create snapshot references in same commit as snapshot
  Btrfs: fix small race with delalloc flushing waitqueue's
  Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
  Btrfs: fix chunk allocate size calculation
  Btrfs: kill max_extent mount option
  Btrfs: fail to mount if we have problems reading the block groups
  Btrfs: check btrfs_get_extent return for IS_ERR()
  Btrfs: handle kmalloc() failure in inode lookup ioctl
  Btrfs: dereferencing freed memory
  Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
  Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
  Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
  Btrfs: add NULL check for do_walk_down()
  Btrfs: remove duplicate include in ioctl.c

Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
cleanups.
2010-04-05 13:21:15 -07:00
Josef Bacik
ab6e24103c Btrfs: fix data enospc check overflow
Because we account for reserved space we get from the allocator before we
actually account for allocating delalloc space, we can have a small window where
the amount of "used" space in a space_info is more than the total amount of
space in the space_info.  This will cause a overflow in our check, so it will
seem like we have _tons_ of free space, and we'll allow reservations to occur
that will end up larger than the amount of space we have.  I've seen users
report ENOSPC panic's in cow_file_range a few times recently, so I tried to
reproduce this problem and found I could reproduce it if I ran one of my tests
in a loop for like 20 minutes.  With this patch my test ran all night without
issues.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-04-05 16:04:50 -04:00
Josef Bacik
b5cb160084 Btrfs: fix small race with delalloc flushing waitqueue's
Everytime we start a new flushing thread, we init the waitqueue if there isn't a
flushing thread running.  The problem with this is we check
space_info->flushing, which we clear right before doing a wake_up on the
flushing waitqueue, which causes problems if we init the waitqueue in the middle
of clearing the flushing flagh and calling wake_up.  This is hard to hit, but
the code is wrong anyway, so init the flushing/allocating waitqueue when
creating the space info and let it be.  I haven't seen the panic since I've been
using this patch.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-04-05 14:42:00 -04:00
Josef Bacik
287a0ab91d Btrfs: kill max_extent mount option
As Yan pointed out, theres not much reason for all this complicated math to
account for file extents being split up into max_extent chunks, since they are
likely to all end up in the same leaf anyway.  Since there isn't much reason to
use max_extent, just remove the option altogether so we have one less thing we
need to test.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:09 -04:00
Josef Bacik
1b1d1f6625 Btrfs: fail to mount if we have problems reading the block groups
We don't actually check the return value of btrfs_read_block_groups, so we can
possibly succeed to mount, but then fail to say read the superblock xattr for
selinux which will cause the vfs code to deactivate the super.

This is a problem because in find_free_extent we just assume that we
will find the right space_info for the allocation we want.  But if we
failed to read the block groups, we won't have setup any space_info's,
and we'll hit a NULL pointer deref in find_free_extent.

This patch fixes that problem by checking the return value of
btrfs_read_block_groups, and failing out properly.  I've also added a
check in find_free_extent so if for some reason we don't find an
appropriate space_info, we just return -ENOSPC.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:09 -04:00
Miao Xie
90d2c51dbb Btrfs: add NULL check for do_walk_down()
btrfs_find_create_tree_block() may return NULL, so we must check the returned
value, or we will access a NULL pointer.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:08 -04:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Josef Bacik
2ac55d41b5 Btrfs: cache the extent state everywhere we possibly can V2
This patch just goes through and fixes everybody that does

lock_extent()
blah
unlock_extent()

to use

lock_extent_bits()
blah
unlock_extent_cached()

and pass around a extent_state so we only have to do the searches once per
function.  This gives me about a 3 mb/s boots on my random write test.  I have
not converted some things, like the relocation and ioctl's, since they aren't
heavily used and the relocation stuff is in the middle of being re-written.  I
also changed the clear_extent_bit() to only unset the cached state if we are
clearing EXTENT_LOCKED and related stuff, so we can do things like this

lock_extent_bits()
clear delalloc bits
unlock_extent_cached()

without losing our cached state.  I tested this thoroughly and turned on
LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
fine.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:13 -04:00
Yan, Zheng
7a7965f83e Btrfs: Fix oopsen when dropping empty tree.
When dropping a empty tree, walk_down_tree() skips checking
extent information for the tree root. This will triggers a
BUG_ON in walk_up_proc().

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:31:45 -05:00
Josef Bacik
11dfe35a01 Btrfs: fix possible panic on unmount
We can race with the unmount of an fs and the stopping of a kthread where we
will free the block group before we're done using it.  The reason for this is
because we do not hold a reference on the block group while its caching, since
the allocator drops its reference once it exits or moves on to the next block
group.  This patch fixes the problem by taking a reference to the block group
before we start caching and dropping it when we're done to make sure all
accesses to the block group are safe.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17 20:40:30 -05:00
Josef Bacik
83d3c9696f Btrfs: make metadata chunks smaller
This patch makes us a bit less zealous about making sure we have enough free
metadata space by pearing down the size of new metadata chunks to 256mb instead
of 1gb.  Also, we used to try an allocate metadata chunks when allocating data,
but that sort of thing is done elsewhere now so we can just remove it.  With my
-ENOSPC test I used to have 3gb reserved for metadata out of 75gb, now I have
1.7gb.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17 12:33:38 -05:00
Yan, Zheng
06b2331f83 Btrfs: don't add extent 0 to the free space cache v2
If block group 0 is completely free, btrfs_read_block_groups will
add extent [0, BTRFS_SUPER_INFO_OFFSET) to the free space cache.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17 12:33:36 -05:00
Yan, Zheng
86b9f2eca5 Btrfs: Fix per root used space accounting
The bytes_used field in root item was originally planned to
trace the amount of used data and tree blocks. But it never
worked right since we can't trace freeing of data accurately.
This patch changes it to only trace the amount of tree blocks.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17 12:33:35 -05:00
Yan, Zheng
24bbcf0442 Btrfs: Add delayed iput
iput() can trigger new transactions if we are dropping the
final reference, so calling it in btrfs_commit_transaction
may end up deadlock. This patch adds delayed iput to avoid
the issue.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17 12:33:35 -05:00
Yan, Zheng
8cef4e160d Btrfs: Avoid superfluous tree-log writeout
We allow two log transactions at a time, but use same flag
to mark dirty tree-log btree blocks. So we may flush dirty
blocks belonging to newer log transaction when committing a
log transaction. This patch fixes the issue by using two
flags to mark dirty tree-log btree blocks.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-15 21:24:25 -05:00
Linus Torvalds
aa021baa32 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix panic when trying to destroy a newly allocated
  Btrfs: allow more metadata chunk preallocation
  Btrfs: fallback on uncompressed io if compressed io fails
  Btrfs: find ideal block group for caching
  Btrfs: avoid null deref in unpin_extent_cache()
  Btrfs: skip btrfs_release_path in btrfs_update_root and btrfs_del_root
  Btrfs: fix some metadata enospc issues
  Btrfs: fix how we set max_size for free space clusters
  Btrfs: cleanup transaction starting and fix journal_info usage
  Btrfs: fix data allocation hint start
2009-11-11 13:38:59 -08:00
Chris Mason
33b2580864 Btrfs: allow more metadata chunk preallocation
On an FS where all of the space has not been allocated into chunks yet,
the enospc can return enospc just because the existing metadata chunks
are full.

We get around this by allowing more metadata chunks to be allocated up
to a certain limit, and finding the right limit is a little fuzzy.  The
problem is the reservations for delalloc would preallocate way too much
of the FS as metadata.  We need to start saying no and just force some
IO to happen.

But we also need to let a reasonable amount of the FS become metadata.
This bumps the hard limit up, later releases will have a better system.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:20 -05:00
Josef Bacik
ccf0e72537 Btrfs: find ideal block group for caching
This patch changes a few things.  Hopefully the comments are helpfull, but
I'll try and be as verbose here.

Problem:

My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space.  The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system.  So

Solution:

1) Don't cache block groups willy-nilly at first.  Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.

2) Don't be so picky about cached block groups.  The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more.  So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.

There is one other tweak in here.  Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation.  This could make us end up with way too many chunks, so keep
track of this particular case.

With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 14:20:19 -05:00
Linus Torvalds
dcbeb0bec5 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: always pin metadata in discard mode
  Btrfs: enable discard support
  Btrfs: add -o discard option
  Btrfs: properly wait log writers during log sync
  Btrfs: fix possible ENOSPC problems with truncate
  Btrfs: fix btrfs acl #ifdef checks
  Btrfs: streamline tree-log btree block writeout
  Btrfs: avoid tree log commit when there are no changes
  Btrfs: only write one super copy during fsync
2009-10-15 15:06:37 -07:00
Chris Mason
444528b3e6 Btrfs: always pin metadata in discard mode
We have an optimization in btrfs to allow blocks to be
immediately freed if they were allocated in this transaction and never
written.  Otherwise they are pinned and freed when the transaction
commits.

This isn't optimal for discard mode because immediately freeing
them means immediately discarding them.  It is better to give the
block to the pinning code and letting the (slow) discard happen later.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-14 10:32:50 -04:00
Christoph Hellwig
0634857488 Btrfs: enable discard support
The discard support code in btrfs currently is guarded by ifdefs for
BIO_RW_DISCARD, which is never defines as it's the name of an enum
memeber.  Just remove the useless ifdefs to actually enable the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-14 10:32:49 -04:00
Christoph Hellwig
e244a0aeb6 Btrfs: add -o discard option
Enable discard by default is not a good idea given the the trim speed
of SSD prototypes we've seen, and the carecteristics for many high-end
arrays.  Turn of discards by default and require the -o discard option
to enable them on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-14 10:32:49 -04:00
Linus Torvalds
474a503d4b Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix file clone ioctl for bookend extents
  Btrfs: fix uninit compiler warning in cow_file_range_nocow
  Btrfs: constify dentry_operations
  Btrfs: optimize back reference update during btrfs_drop_snapshot
  Btrfs: remove negative dentry when deleting subvolumne
  Btrfs: optimize fsync for the single writer case
  Btrfs: async delalloc flushing under space pressure
  Btrfs: release delalloc reservations on extent item insertion
  Btrfs: delay clearing EXTENT_DELALLOC for compressed extents
  Btrfs: cleanup extent_clear_unlock_delalloc flags
  Btrfs: fix possible softlockup in the allocator
  Btrfs: fix deadlock on async thread startup
2009-10-11 11:23:13 -07:00
Yan, Zheng
94fcca9f89 Btrfs: optimize back reference update during btrfs_drop_snapshot
This patch reading level 0 tree blocks that already use full backrefs.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-09 09:25:16 -04:00
Josef Bacik
e3ccfa9897 Btrfs: async delalloc flushing under space pressure
This patch moves the delalloc flushing that occurs when we are under space
pressure off to a async thread pool.  This helps since we only free up
metadata space when we actually insert the extent item, which means it takes
quite a while for space to be free'ed up if we wait on all ordered extents.
However, if space is freed up due to inline extents being inserted, we can
wake people who are waiting up early, and they can finish their work.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-08 15:21:23 -04:00
Josef Bacik
32c00aff71 Btrfs: release delalloc reservations on extent item insertion
This patch fixes an issue with the delalloc metadata space reservation
code.  The problem is we used to free the reservation as soon as we
allocated the delalloc region.  The problem with this is if we are not
inserting an inline extent, we don't actually insert the extent item until
after the ordered extent is written out.  This patch does 3 things,

1) It moves the reservation clearing stuff into the ordered code, so when
we remove the ordered extent we remove the reservation.
2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear
delalloc bits in the cases where we want to clear the metadata reservation
when we clear the delalloc extent, in the case that we do an inline extent
or we invalidate the page.
3) It adds another waitqueue to the space info so that when we start a fs
wide delalloc flush, anybody else who also hits that area will simply wait
for the flush to finish and then try to make their allocation.

This has been tested thoroughly to make sure we did not regress on
performance.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-08 15:21:10 -04:00
Josef Bacik
1cdda9b81a Btrfs: fix possible softlockup in the allocator
Like the cluster allocating stuff, we can lockup the box with the normal
allocation path.  This happens when we

1) Start to cache a block group that is severely fragmented, but has a decent
amount of free space.
2) Start to commit a transaction
3) Have the commit try and empty out some of the delalloc inodes with extents
that are relatively large.

The inodes will not be able to make the allocations because they will ask for
allocations larger than a contiguous area in the free space cache.  So we will
wait for more progress to be made on the block group, but since we're in a
commit the caching kthread won't make any more progress and it already has
enough free space that wait_block_group_cache_progress will just return.  So,
if we wait and fail to make the allocation the next time around, just loop and
go to the next block group.  This keeps us from getting stuck in a softlockup.
Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-06 10:04:28 -04:00
Chris Mason
25472b880c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus 2009-10-01 12:58:13 -04:00
Sage Weil
dd7e0b7b02 Btrfs: fix deadlock with free space handling and user transactions
If an ioctl-initiated transaction is open, we can't force a commit during
the free space checks in order to free up pinned extents or else we
deadlock.  Just ENOSPC instead.

A more satisfying solution that reserves space for the entire user
transaction up front is forthcoming...

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-29 19:50:07 -04:00
Josef Bacik
9ed74f2dba Btrfs: proper -ENOSPC handling
At the start of a transaction we do a btrfs_reserve_metadata_space() and
specify how many items we plan on modifying.  Then once we've done our
modifications and such, just call btrfs_unreserve_metadata_space() for
the same number of items we reserved.

For keeping track of metadata needed for data I've had to add an extent_io op
for when we merge extents.  This lets us track space properly when we are doing
sequential writes, so we don't end up reserving way more metadata space than
what we need.

The only place where the metadata space accounting is not done is in the
relocation code.  This is because Yan is going to be reworking that code in the
near future, so running btrfs-vol -b could still possibly result in a ENOSPC
related panic.  This patch also turns off the metadata_ratio stuff in order to
allow users to more efficiently use their disk space.

This patch makes it so we track how much metadata we need for an inode's
delayed allocation extents by tracking how many extents are currently
waiting for allocation.  It introduces two new callbacks for the
extent_io tree's, merge_extent_hook and split_extent_hook.  These help
us keep track of when we merge delalloc extents together and split them
up.  Reservations are handled prior to any actually dirty'ing occurs,
and then we unreserve after we dirty.

btrfs_unreserve_metadata_for_delalloc() will make the appropriate
unreservations as needed based on the number of reservations we
currently have and the number of extents we currently have.  Doing the
reservation outside of doing any of the actual dirty'ing lets us do
things like filemap_flush() the inode to try and force delalloc to
happen, or as a last resort actually start allocation on all delalloc
inodes in the fs.  This has survived dbench, fs_mark and an fsx torture
test.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-28 16:29:42 -04:00
Chris Mason
54bcf382da Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus
Conflicts:
	fs/btrfs/super.c
2009-09-24 10:00:58 -04:00
Chris Mason
7ce618db98 Btrfs: fix early enospc during balancing
We now do extra checks before a balance to make sure
there is room for the balance to take place.  One of
the checks was testing to see if we were trying to
balance away the last block group of a given type.

If there is no space available for new chunks, we
should not try and balance away the last block group
of a give type.  But, the code wasn't checking for
available chunk space, and so it was exiting too soon.

The fix here is to combine some of the checks and make
sure we try to allocate new chunks when we're balancing
the last block group.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-22 14:48:44 -04:00
Chris Mason
33b4d47f5e Btrfs: deal with NULL space info
After a balance it is briefly possible for the space info
field in the inode to be NULL.  This adds some checks
to make sure things properly deal with the NULL value.


Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-22 14:45:50 -04:00
Josef Bacik
1b2da372b0 Btrfs: account for space used by the super mirrors
As we get closer to proper -ENOSPC handling in btrfs, we need more accurate
space accounting for the space info's.  Currently we exclude the free space for
the super mirrors, but the space they take up isn't accounted for in any of the
counters.  This patch introduces bytes_super, which keeps track of the amount
of bytes used for a super mirror in the block group cache and space info.  This
makes sure that our free space caclucations will be completely accurate.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 19:23:50 -04:00
Josef Bacik
f61408b81c Btrfs: remove dead code
This patch removes a bunch of dead code from the snapshot removal stuff.  It
was confusing me when doing the metadata ENOSPC stuff so I killed it.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 19:23:49 -04:00
Josef Bacik
0a24325e6d Btrfs: don't keep retrying a block group if we fail to allocate a cluster
The box can get locked up in the allocator if we happen upon a block group
under these conditions:

1) During a commit, so caching threads cannot make progress
2) Our block group currently is in the middle of being cached
3) Our block group currently has plenty of free space in it
4) Our block group is so fragmented that it ends up having no free space chunks
larger than min_bytes calculated by btrfs_find_space_cluster.

What happens is we try and do btrfs_find_space_cluster, which fails because it
is unable to find enough free space chunks that are large than min_bytes and
are close enough together.  Since the block group is not cached we do a
wait_block_group_cache_progress, which waits for the number of bytes we need,
except the block group already has _plenty_ of free space, its just severely
fragmented, so we loop and try again, ad infinitum.  This patch keeps us from
waiting on the block group to finish caching if we failed to find a free space
cluster before.  It also makes sure that we don't even try to find a free space
cluster if we are on our last loop in the allocator, since we will have tried
everything at this point at it is futile.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 19:23:49 -04:00
Josef Bacik
ba1bf4818b Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents.  For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic.  Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk.  This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.

V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.

-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.

-check to make sure the block group we are going to relocate isn't the last one
in that particular space

-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 19:23:48 -04:00
Yan, Zheng
76dda93c6a Btrfs: add snapshot/subvolume destroy ioctl
This patch adds snapshot/subvolume destroy ioctl.  A subvolume that isn't being
used and doesn't contains links to other subvolumes can be destroyed.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 16:00:26 -04:00
Yan, Zheng
1c4850e21d Btrfs: speed up snapshot dropping
This patch contains two changes to avoid unnecessary tree block reads during
snapshot dropping.

First, check tree block's reference count and flags before reading the tree
block. if reference count > 1 and there is no need to update backrefs, we can
avoid reading the tree block.

Second, save when snapshot was created in root_key.offset. we can compare block
pointer's generation with snapshot's creation generation during updating
backrefs. If a given block was created before snapshot was created, the
snapshot can't be the tree block's owner. So we can avoid reading the block.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 15:55:59 -04:00
Yan Zheng
11833d66be Btrfs: improve async block group caching
This patch gets rid of two limitations of async block group caching.
The old code delays handling pinned extents when block group is in
caching. To allocate logged file extents, the old code need wait
until block group is fully cached. To get rid of the limitations,
This patch introduces a data structure to track the progress of
caching. Base on the caching progress, we know which extents should
be added to the free space cache when handling the pinned extents.
The logged file extents are also handled in a similar way.

This patch also changes how pinned extents are tracked. The old
code uses one tree to track pinned extents, and copy the pinned
extents tree at transaction commit time. This patch makes it use
two trees to track pinned extents. One tree for extents that are
pinned in the running transaction, one tree for extents that can
be unpinned. At transaction commit time, we swap the two trees.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-17 15:47:36 -04:00
Christoph Hellwig
746cd1e7e4 block: use blkdev_issue_discard in blk_ioctl_discard
blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
the only difference between the two is that blkdev_issue_discard needs to
send a barrier discard request and blk_ioctl_discard a non-barrier one,
and blk_ioctl_discard needs to wait on the request.  To facilitates this
add a flags argument to blkdev_issue_discard to control both aspects of the
behaviour.  This will be very useful later on for using the waiting
funcitonality for other callers.

Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-14 08:24:53 +02:00
Chris Mason
890871be85 Btrfs: switch extent_map to a rw lock
There are two main users of the extent_map tree.  The
first is regular file inodes, where it is evenly spread
between readers and writers.

The second is the chunk allocation tree, which maps blocks from
logical addresses to phyiscal ones, and it is 99.99% reads.

The mapping tree is a point of lock contention during heavy IO
workloads, so this commit switches things to a rw lock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:05 -04:00
Chris Mason
013f1b12f4 Btrfs: make sure the async caching thread advances the key
The async caching thread can end up looping forever if a given
search puts it at the last key in a leaf.  It will end up calling
btrfs_next_leaf and then checking if it needs to politely drop
the read semaphore.

Most of the time this looping isn't noticed because it is able to
make progress the next time around.  But, during log replay,
we wait on the async caching thread to finish, and the async thread
is waiting on the commit, and no progress is really made.

The fix used here is to copy the key out of the next leaf,
that way our search lands there properly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-31 14:57:55 -04:00
Chris Mason
f36f3042ea Btrfs: be more polite in the async caching threads
The semaphore used by the async caching threads can prevent a
transaction commit, which can make the FS appear to stall.  This
releases the semaphore more often when a transaction commit is
in progress.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-30 10:14:46 -04:00
Yan Zheng
276e680d19 Btrfs: preserve commit_root for async caching
The async block group caching code uses the commit_root pointer
to get a stable version of the extent allocation tree for scanning.
This copy of the tree root isn't going to change and it significantly
reduces the complexity of the scanning code.

During a commit, we have a loop where we update the extent allocation
tree root.  We need to loop because updating the root pointer in
the tree of tree roots may allocate blocks which may change the
extent allocation tree.

Right now the commit_root pointer is changed inside this loop.  It
is more correct to change the commit_root pointer only after all the
looping is done.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-30 09:40:40 -04:00
Yan Zheng
f25784b35f Btrfs: Fix async caching interaction with unmount
- don't stop the caching thread until btrfs_commit_super return.

- if caching is interrupted by umount, set last to (u64)-1.
  otherwise the un-scanned range of block group will be considered
  as free extent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28 08:41:57 -04:00
Josef Bacik
68b38550dd Btrfs: change how we unpin extents
We are racy with async block caching and unpinning extents.  This patch makes
things much less complicated by only unpinning the extent if the block group is
cached.  We check the block_group->cached var under the block_group->lock spin
lock.  If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back.  If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent.  This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.

One thing that needed to be changed was btrfs_free_super_mirror_extents.  Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory.  So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-27 13:57:01 -04:00
Chris Mason
283bb1979f Btrfs: clear all space_info->full after removing a block group
Btrfs allocates individual extents from block groups, and each
block group has a specific type.  It may hold metadata, data
mirrored or striped etc.

When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r)
we free block groups.  Once a block group is freed, the space it was
using on the device may be available for use by new block groups.

btrfs_remove_block_group was clearing the flag that said
'our devices are full, don't even try to allocate new block groups',
but it was only clearing that flag for a specific type of block group.

This commit clears the full flag for all of the types of block groups,
making it much more likely that we'll be able to balance space when
the drive is close to full.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 16:30:55 -04:00
Josef Bacik
817d52f8db Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner.  Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation.  If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching.  This is how I tested
the speedup from this

mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo

Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.

Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group.  This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.

I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached.  This drastically reduces the amount of time it takes to
cache the block groups.  Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.

This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 09:23:39 -04:00
Josef Bacik
9630308170 Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space.  As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area.  Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.

This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache.  Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram.  The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.

Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space.  This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap.  This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.

I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree.  I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree.  This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.

I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff.  I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues.  I've not seen
any performance regressions in any of my tests.

Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems.  Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 09:23:30 -04:00
Yan Zheng
4a8c9a62d7 Btrfs: make sure all dirty blocks are written at commit time
Write dirty block groups may allocate new block, and so may add new delayed
back ref. btrfs_run_delayed_refs may make some block groups dirty.

commit_cowonly_roots does not handle the recursion properly, and some dirty
blocks can be left unwritten at commit time. This patch moves
btrfs_run_delayed_refs into the loop that writes dirty block groups, and makes
the code not break out of the loop until there are no dirty block groups or
delayed back refs.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 10:07:05 -04:00
Hu Tao
68f5a38c3e Btrfs: fix error message formatting
Make an error msg look nicer by inserting a space between number and word.

Signed-off-by: Hu Tao <hu.taoo@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:55:45 -04:00
Yan Zheng
2c47e605a9 Btrfs: update backrefs while dropping snapshot
The new backref format has restriction on type of backref item.  If a tree
block isn't referenced by its owner tree, full backrefs must be used for the
pointers in it. When a tree block loses its owner tree's reference, backrefs
for the pointers in it should be updated to full backrefs. Current
btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for
general use.

This patch adds backrefs update code to btrfs_drop_snapshot.  It isn't a
problem in the restricted form btrfs_drop_snapshot is used today, but for
general snapshot deletion this update is required.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:41:17 -04:00
Yan Zheng
85d4198e40 Btrfs: check duplicate backrefs for both data and metadata
lookup_inline_extent_backref only checks for duplicate backref for data
extents. It assumes backrefs for tree block never conflict.

This patch makes lookup_inline_extent_backref check for duplicate backrefs
for both data and tree block, so that we can detect potential bug earlier.
This is a safety check, strictly speaking it is not required.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 08:51:34 -04:00
David Woodhouse
163e783e6a Btrfs: remove crc32c.h and use libcrc32c directly.
There's no need to preserve this abstraction; it used to let us use
hardware crc32c support directly, but libcrc32c is already doing that for us
through the crypto API -- so we're already using the Intel crc32c
acceleration where appropriate.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:53 -04:00
Chris Mason
451d7585a8 Btrfs: add mount -o ssd_spread to spread allocations out
Some SSDs perform best when reusing block numbers often, while
others perform much better when clustering strictly allocates
big chunks of unused space.

The default mount -o ssd will find rough groupings of blocks
where there are a bunch of free blocks that might have some
allocated blocks mixed in.

mount -o ssd_spread will make sure there are no allocated blocks
mixed in.  It should perform better on lower end SSDs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Yan Zheng
5d4f98a28c Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one.  At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root.  This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces.  But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Chris Mason
44fb551163 Btrfs: Fix oops and use after free during space balancing
The btrfs allocator uses list_for_each to walk the available block
groups when searching for free blocks.  It starts off with a hint
to help find the best block group for a given allocation.

The hint is resolved into a block group, but we don't properly check
to make sure the block group we find isn't in the middle of being
freed due to filesystem shrinking or balancing.  If it is being
freed, the list pointers in it are bogus and can't be trusted.  But,
the code happily goes along and uses them in the list_for_each loop,
leading to all kinds of fun.

The fix used here is to check to make sure the block group we find really
is on the list before we use it.  list_del_init is used when removing
it from the list, so we can do a proper check.

The allocation clustering code has a similar bug where it will trust
the block group in the current free space cluster.  If our allocation
flags have changed (going from single spindle dup to raid1 for example)
because the drives in the FS have changed, we're not allowed to use
the old block group any more.

The fix used here is to check the current cluster against the
current allocation flags.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 15:41:27 -04:00
Sankar P
9f55684c2d Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
Signed-off-by: Sankar P <sankar.curiosity@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Joel Becker
21380931eb Btrfs: Fix a bunch of printk() warnings.
Just happened to notice a bunch of %llu vs u64 warnings.  Here's a patch
to cast them all.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:49 -04:00
Josef Bacik
97e728d435 Btrfs: try to keep a healthy ratio of metadata vs data block groups
This patch makes the chunk allocator keep a good ratio of metadata vs data
block groups.  By default for every 8 data block groups, we'll allocate 1
metadata chunk, or about 12% of the disk will be allocated for metadata.  This
can be changed by specifying the metadata_ratio mount option.

This is simply the number of data block groups that have to be allocated to
force a metadata chunk allocation.  By making sure we allocate metadata chunks
more often, we are less likely to get into situations where the whole disk
has been allocated as data block groups.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:02 -04:00
Chris Mason
fa9c0d795f Btrfs: rework allocation clustering
Because btrfs is copy-on-write, we end up picking new locations for
blocks very often.  This makes it fairly difficult to maintain perfect
read patterns over time, but we can at least do some optimizations
for writes.

This is done today by remembering the last place we allocated and
trying to find a free space hole big enough to hold more than just one
allocation.  The end result is that we tend to write sequentially to
the drive.

This happens all the time for metadata and it happens for data
when mounted -o ssd.  But, the way we record it is fairly racey
and it tends to fragment the free space over time because we are trying
to allocate fairly large areas at once.

This commit gets rid of the races by adding a free space cluster object
with dedicated locking to make sure that only one process at a time
is out replacing the cluster.

The free space fragmentation is somewhat solved by allowing a cluster
to be comprised of smaller free space extents.  This part definitely
adds some CPU time to the cluster allocations, but it allows the allocator
to consume the small holes left behind by cow.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-03 09:47:43 -04:00
Josef Bacik
04018de5d4 Btrfs: kill the pinned_mutex
This patch removes the pinned_mutex.  The extent io map has an internal tree
lock that protects the tree itself, and since we only copy the extent io map
when we are committing the transaction we don't need it there.  We also don't
need it when caching the block group since searching through the tree is also
protected by the internal map spin lock.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-04-03 10:14:18 -04:00
Josef Bacik
6226cb0a5e Btrfs: kill the block group alloc mutex
This patch removes the block group alloc mutex used to protect the free space
tree for allocations and replaces it with a spin lock which is used only to
protect the free space rb tree.  This means we only take the lock when we are
directly manipulating the tree, which makes us a touch faster with
multi-threaded workloads.

This patch also gets rid of btrfs_find_free_space and replaces it with
btrfs_find_space_for_alloc, which takes the number of bytes you want to
allocate, and empty_size, which is used to indicate how much free space should
be at the end of the allocation.

It will return an offset for the allocator to use.  If we don't end up using it
we _must_ call btrfs_add_free_space to put it back.  This is the tradeoff to
kill the alloc_mutex, since we need to make sure nobody else comes along and
takes our space.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-04-03 10:14:18 -04:00
Josef Bacik
2552d17e32 Btrfs: clean up find_free_extent
I've replaced the strange looping constructs with a list_for_each_entry on
space_info->block_groups.  If we have a hint we just jump into the loop with
the block group and start looking for space.  If we don't find anything we
start at the beginning and start looking.  We never come out of the loop with a
ref on the block_group _unless_ we found space to use, then we drop it after we
set the trans block_group.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-04-03 10:14:19 -04:00
Josef Bacik
70cb074345 Btrfs: free space cache cleanups
This patch cleans up the free space cache code a bit.  It better documents the
idiosyncrasies of tree_search_offset and makes the code make a bit more sense.
I took out the info allocation at the start of __btrfs_add_free_space and put it
where it makes more sense.  This was left over cruft from when alloc_mutex
existed.  Also all of the re-searches we do to make sure we inserted properly.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-04-03 10:14:19 -04:00
Chris Mason
d57e62b897 Btrfs: try to free metadata pages when we free btree blocks
COW means we cycle though blocks fairly quickly, and once we
free an extent on disk, it doesn't make much sense to keep the pages around.

This commit tries to immediately free the page when we free the extent,
which lowers our memory footprint significantly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-31 14:27:58 -04:00
Chris Mason
12fcfd22fe Btrfs: tree logging unlink/rename fixes
The tree logging code allows individual files or directories to be logged
without including operations on other files and directories in the FS.
It tries to commit the minimal set of changes to disk in order to
fsync the single file or directory that was sent to fsync or O_SYNC.

The tree logging code was allowing files and directories to be unlinked
if they were part of a rename operation where only one directory
in the rename was in the fsync log.  This patch adds a few new rules
to the tree logging.

1) on rename or unlink, if the inode being unlinked isn't in the fsync
log, we must force a full commit before doing an fsync of the directory
where the unlink was done.  The commit isn't done during the unlink,
but it is forced the next time we try to log the parent directory.

Solution: record transid of last unlink/rename per directory when the
directory wasn't already logged.  For renames this is only done when
renaming to a different directory.

mkdir foo/some_dir
normal commit
rename foo/some_dir foo2/some_dir
mkdir foo/some_dir
fsync foo/some_dir/some_file

The fsync above will unlink the original some_dir without recording
it in its new location (foo2).  After a crash, some_dir will be gone
unless the fsync of some_file forces a full commit

2) we must log any new names for any file or dir that is in the fsync
log.  This way we make sure not to lose files that are unlinked during
the same transaction.

2a) we must log any new names for any file or dir during rename
when the directory they are being removed from was logged.

2a is actually the more important variant.  Without the extra logging
a crash might unlink the old name without recreating the new one

3) after a crash, we must go through any directories with a link count
of zero and redo the rm -rf

mkdir f1/foo
normal commit
rm -rf f1/foo
fsync(f1)

The directory f1 was fully removed from the FS, but fsync was never
called on f1, only its parent dir.  After a crash the rm -rf must
be replayed.  This must be able to recurse down the entire
directory tree.  The inode link count fixup code takes care of the
ugly details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:52 -04:00
Chris Mason
b9473439d3 Btrfs: leave btree locks spinning more often
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying.  This may require a kmalloc and it
was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.

This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer.  Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.

This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.

Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:28 -04:00
Chris Mason
b7ec40d784 Btrfs: reduce stalls during transaction commit
To avoid deadlocks and reduce latencies during some critical operations, some
transaction writers are allowed to jump into the running transaction and make
it run a little longer, while others sit around and wait for the commit to
finish.

This is a bit unfair, especially when the callers that jump in do a bunch
of IO that makes all the others procs on the box wait.  This commit
reduces the stalls this produces by pre-reading file extent pointers
during btrfs_finish_ordered_io before the transaction is joined.

It also tunes the drop_snapshot code to politely wait for transactions
that have started writing out their delayed refs to finish.  This avoids
new delayed refs being flooded into the queue while we're trying to
close off the transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:26 -04:00
Chris Mason
c3e69d58e8 Btrfs: process the delayed reference queue in clusters
The delayed reference queue maintains pending operations that need to
be done to the extent allocation tree.  These are processed by
finding records in the tree that are not currently being processed one at
a time.

This is slow because it uses lots of time searching through the rbtree
and because it creates lock contention on the extent allocation tree
when lots of different procs are running delayed refs at the same time.

This commit changes things to grab a cluster of refs for processing,
using a cursor into the rbtree as the starting point of the next search.
This way we walk smoothly through the rbtree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:26 -04:00
Chris Mason
1887be66dc Btrfs: try to cleanup delayed refs while freeing extents
When extents are freed, it is likely that we've removed the last
delayed reference update for the extent.  This checks the delayed
ref tree when things are freed, and if no ref updates area left it
immediately processes the delayed ref.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:26 -04:00
Chris Mason
56bec294de Btrfs: do extent allocation and reference count updates in the background
The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem.  For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.

If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.

These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot.  btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.

This commit adds an rbtree of pending reference count updates and extent
allocations.  The tree is ordered by byte number of the extent and byte number
of the parent for the back reference.  The tree allows us to:

1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.

#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.

These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction.  Throttling is
implemented to help keep the queue of backrefs at a reasonable size.

Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.

Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:25 -04:00
Chris Mason
4184ea7f90 Btrfs: Fix locking around adding new space_info
Storage allocated to different raid levels in btrfs is tracked by
a btrfs_space_info structure, and all of the current space_infos are
collected into a list_head.

Most filesystems have 3 or 4 of these structs total, and the list is
only changed when new raid levels are added or at unmount time.

This commit adds rcu locking on the list head, and properly frees
things at unmount time.  It also clears the space_info->full flag
whenever new space is added to the FS.

The locking for the space info list goes like this:

reads: protected by rcu_read_lock()
writes: protected by the chunk_mutex

At unmount time we don't need special locking because all the readers
are gone.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-10 12:39:20 -04:00
Chris Mason
b9447ef80b Btrfs: fix spinlock assertions on UP systems
btrfs_tree_locked was being used to make sure a given extent_buffer was
properly locked in a few places.  But, it wasn't correct for UP compiled
kernels.

This switches it to using assert_spin_locked instead, and renames it to
btrfs_assert_tree_locked to better reflect how it was really being used.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-09 11:45:38 -04:00
Josef Bacik
4e06bdd6cb Btrfs: try committing transaction before returning ENOSPC
This fixes a problem where we could return -ENOSPC when we may actually have
plenty of space, the space is just pinned.  Instead of returning -ENOSPC
immediately, commit the transaction first and then try and do the allocation
again.

This patch also does chunk allocation for metadata if we pass the 80%
threshold for metadata space.  This will help with stack usage since the chunk
allocation will happen early on, instead of when the allocation is happening.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-20 10:59:53 -05:00
Josef Bacik
6a63209fc0 Btrfs: add better -ENOSPC handling
This is a step in the direction of better -ENOSPC handling.  Instead of
checking the global bytes counter we check the space_info bytes counters to
make sure we have enough space.

If we don't we go ahead and try to allocate a new chunk, and then if that fails
we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
bytes_delalloc and bytes_may_use.

bytes_delalloc account for extents we've actually setup for delalloc and will
be allocated at some point down the line. 

bytes_may_use is to keep track of how many bytes we may use for delalloc at
some point.  When we actually set the extent_bit for the delalloc bytes we
subtract the reserved bytes from the bytes_may_use counter.  This keeps us from
not actually being able to allocate space for any delalloc bytes.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-20 11:00:09 -05:00
Yan Zheng
2456242530 Btrfs: hold trans_mutex when using btrfs_record_root_in_trans
btrfs_record_root_in_trans needs the trans_mutex held to make sure two
callers don't race to setup the root in a given transaction.  This adds
it to all the places that were missing it.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-02-12 14:14:53 -05:00
Chris Mason
4008c04a07 Btrfs: make a lockdep class for the extent buffer locks
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block.  But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.

The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.

btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
gets upset about bad lock orderin.

The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:09:45 -05:00
Chris Mason
536ac8ae86 Btrfs: use larger metadata clusters in ssd mode
Larger metadata clusters can significantly improve writeback performance
on ssd drives with large erasure blocks.  The larger clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle.

On spinning media, lager metadata clusters end up spreading out the
metadata more over time, which makes fsck slower, so we don't want this
to be the default.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:41:38 -05:00
Josef Bacik
eb09967089 Btrfs: make sure all pending extent operations are complete
Theres a slight problem with finish_current_insert, if we set all to 1 and then
go through and don't actually skip any of the extents on the pending list, we
could exit right after we've added new extents.

This is a problem because by inserting the new extents we could have gotten new
COW's to happen and such, so we may have some pending updates to do or even
more inserts to do after that.

So this patch will only exit if we have never skipped any of the extents in the
pending list, and we have no extents to insert, this will make sure that all of
the pending work is truly done before we return.  I've been running with this
patch for a few days with all of my other testing and have not seen issues.
Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-12 09:27:38 -05:00
Chris Mason
806638bce9 Btrfs: Fix memory leak in cache_drop_leaf_ref
The code wasn't doing a kfree on the sorted array

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-05 09:08:14 -05:00
Chris Mason
bd56b30205 Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks
Every transaction in btrfs creates a new snapshot, and then schedules the
snapshot from the last transaction for deletion.  Snapshot deletion
works by walking down the btree and dropping the reference counts
on each btree block during the walk.

If if a given leaf or node has a reference count greater than one,
the reference count is decremented and the subtree pointed to by that
node is ignored.

If the reference count is one, walking continues down into that node
or leaf, and the references of everything it points to are decremented.

The old code would try to work in small pieces, walking down the tree
until it found the lowest leaf or node to free and then returning.  This
was very friendly to the rest of the FS because it didn't have a huge
impact on other operations.

But it wouldn't always keep up with the rate that new commits added new
snapshots for deletion, and it wasn't very optimal for the extent
allocation tree because it wasn't finding leaves that were close together
on disk and processing them at the same time.

This changes things to walk down to a level 1 node and then process it
in bulk.  All the leaf pointers are sorted and the leaves are dropped
in order based on their extent number.

The extent allocation tree and commit code are now fast enough for
this kind of bulk processing to work without slowing the rest of the FS
down.  Overall it does less IO and is better able to keep up with
snapshot deletions under high load.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:27:02 -05:00
Chris Mason
b4ce94de9b Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.

So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.

This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.

We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.

The basic idea is:

btrfs_tree_lock() returns with the spin lock held

btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock.  The buffer is
still considered locked by all of the btrfs code.

If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.

Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time.  So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.

btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.

btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.

ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:25:08 -05:00
Chris Mason
b7a9f29fcf Btrfs: sort references by byte number during btrfs_inc_ref
When a block goes through cow, we update the reference counts of
everything that block points to.  The internal pointers of the block
can be in just about any order, and it is likely to have clusters of
things that are close together and clusters of things that are not.

To help reduce the seeks that come with updating all of these reference
counts, sort them by byte number before actual updates are done.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:23:45 -05:00
Yan Zheng
7237f18336 Btrfs: fix tree logs parallel sync
To improve performance, btrfs_sync_log merges tree log sync
requests. But it wrongly merges sync requests for different
tree logs. If multiple tree logs are synced at the same time,
only one of them actually gets synced.

This patch has following changes to fix the bug:

Move most tree log related fields in btrfs_fs_info to
btrfs_root. This allows merging sync requests separately
for each tree log.

Don't insert root item into the log root tree immediately
after log tree is allocated. Root item for log tree is
inserted when log tree get synced for the first time. This
allows syncing the log root tree without first syncing all
log trees.

At tree-log sync, btrfs_sync_log first sync the log tree;
then updates corresponding root item in the log root tree;
sync the log root tree; then update the super block.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 12:54:03 -05:00
Yan Zheng
86288a198d Btrfs: fix stop searching test in replace_one_extent
replace_one_extent searches tree leaves for references to a given extent. It
stops searching if it goes beyond the last possible position.

The last possible position is computed by adding the starting offset of a found
file extent to the full size of the extent. The code uses physical size of the
extent as the full size. This is incorrect when compression is used.

The fix is get the full size from ram_bytes field of file extent item.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Huang Weiyi
653249ff9a Btrfs: remove duplicated #include
Removed duplicated #include "compat.h"in
fs/btrfs/extent-tree.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng
5a7be515b1 Btrfs: Fix infinite loop in btrfs_extent_post_op
btrfs_extent_post_op calls finish_current_insert and del_pending_extents. They
both may enter infinite loops.

finish_current_insert enters infinite loop if it only finds some backrefs to
update.  The fix is to check for pending backref updates before restarting the
loop.

The infinite loop in del_pending_extents is due to a the skipped variable
not being properly reset before looping around.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng
3dfdb9348a Btrfs: fix locking issue in btrfs_remove_block_group
We should hold the block_group_cache_lock while modifying the
block groups red-black tree. Thank you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Qinghuang Feng
c6e308713a Btrfs: simplify iteration codes
Merge list_for_each* and list_entry to list_for_each_entry*

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:59:08 -05:00
Huang Weiyi
7eaebe7d50 Btrfs: removed unused #include <version.h>'s
Removed unused #include <version.h>'s in btrfs

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng
07d400a6df Btrfs: tree logging checksum fixes
This patch contains following things.

1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE.  This
struct is kmalloced so we want to keep it reasonable.

2) Replace copy_extent_csums by btrfs_lookup_csums_range.  This was
duplicated code in tree-log.c

3) Remove replay_one_csum. csum items are replayed at the same time as
   replaying file extents. This guarantees we only replay useful csums.

4) nbytes accounting fix.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-06 11:42:00 -05:00
Chris Mason
9ca03b997f Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-06 09:38:55 -05:00
Chris Mason
d397712bcc Btrfs: Fix checkpatch.pl warnings
There were many, most are fixed now.  struct-funcs.c generates some warnings
but these are bogus.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-05 21:25:51 -05:00
Liu Hui
1f3c79a28c Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.

There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.

This patch change discard behavior as follows:
1, For the extents which need to be free at once,
   we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
   when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
   FTL will destroy file system data.
2009-01-05 15:57:51 -05:00
Yan Zheng
1f80e4db0f Btrfs: set EXTENT_BOUNDARY bit before marking extent delalloc.
There is a race in relocate_inode_pages, it happens when
find_delalloc_range finds the delalloc extent before the
boundary bit is set. Thank you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-19 10:59:04 -05:00
Yan Zheng
34bf63c4dd Btrfs: properly update block accounting for metadata
This adds the missing block accounting code to finish_current_insert and makes
block accounting for root item properly protected by the delalloc spin lock.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-19 10:58:46 -05:00
Chris Mason
dcbdd4dcb9 Btrfs: delete checksum items before marking blocks free
Btrfs maintains a cache of blocks available for allocation in ram.  The
code that frees extents was marking the extents free and then deleting
the checksum items.

This meant it was possible the extent would be reallocated before the
checksum item was actually deleted, leading to races and other
problems as the checksums were updated for the newly allocated extent.

The fix is to delete the checksum before marking the extent free.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-16 13:51:01 -05:00
Chris Mason
75eff68ea6 Btrfs: Don't use spin*lock_irq for the delalloc lock
The delalloc lock doesn't need to have irqs disabled, nobody that
changes the number of delalloc bytes in the FS is running with irqs off.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-15 15:54:40 -05:00
Yan Zheng
17d217fe97 Btrfs: fix nodatasum handling in balancing code
Checksums on data can be disabled by mount option, so it's
possible some data extents don't have checksums or have
invalid checksums. This causes trouble for data relocation.
This patch contains following things to make data relocation
work.

1) make nodatasum/nodatacow mount option only affects new
files. Checksums and COW on data are only controlled by the
inode flags.

2) check the existence of checksum in the nodatacow checker.
If checksums exist, force COW the data extent. This ensure that
checksum for a given block is either valid or does not exist.

3) update data relocation code to properly handle the case
of checksum missing.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-12 10:03:38 -05:00
Yan Zheng
e4404d6e8d Btrfs: shared seed device
This patch makes seed device possible to be shared by
multiple mounted file systems. The sharing is achieved
by cloning seed device's btrfs_fs_devices structure.
Thanks you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-12 10:03:26 -05:00
Yan Zheng
d2fb3437e4 Btrfs: fix leaking block group on balance
The block group structs are referenced in many different
places, and it's not safe to free while balancing.  So, those block
group structs were simply leaked instead.

This patch replaces the block group pointer in the inode with the starting byte
offset of the block group and adds reference counting to the block group
struct.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-11 16:30:39 -05:00
Yan Zheng
0403e47ee2 Btrfs: Add checking of csum tree in balancing code
This updates the space balancing code for the
new checksum format.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-10 20:32:51 -05:00
Chris Mason
459931eca5 Btrfs: Delete csum items when freeing extents
This finishes off the new checksumming code by removing csum items
for extents that are no longer in use.

The trick is doing it without racing because a single csum item may
hold csums for more than one extent.  Extra checks are added to
btrfs_csum_file_blocks to make sure that we are using the correct
csum item after dropping locks.

A new btrfs_split_item is added to split a single csum item so it
can be split without dropping the leaf lock.  This is used to
remove csum bytes from the middle of an item.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-10 09:10:46 -05:00
Yan Zheng
a512bbf855 Btrfs: superblock duplication
This patch implements superblock duplication. Superblocks
are stored at offset 16K, 64M and 256G on every devices.
Spaces used by superblocks are preserved by the allocator,
which uses a reverse mapping function to find the logical
addresses that correspond to superblocks. Thank you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-08 16:46:26 -05:00
Christoph Hellwig
b2950863c6 Btrfs: make things static and include the right headers
Shut up various sparse warnings about symbols that should be either
static or have their declarations in scope.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-12-02 09:54:17 -05:00
Josef Bacik
ea6a478ed9 Btrfs: Fix for lockdep warnings with alloc_mutex and pinned_mutex
This the lockdep complaint by having a different mutex to gaurd caching the
block group, so you don't end up with this backwards dependancy.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-20 12:16:16 -05:00
Chris Mason
4b4e25f2a6 Btrfs: compat code fixes
The btrfs git kernel trees is used to build a standalone tree for
compiling against older kernels.  This commit makes the standalone tree
work with 2.6.27

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-20 10:22:27 -05:00
Chris Mason
15916de835 Btrfs: Fixes for 2.6.28-rc API changes
* open/close_bdev_excl -> open/close_bdev_exclusive
* blkdev_issue_discard takes a GFP mask now
* Fix blkdev_issue_discard usage now that it is enabled

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-19 21:17:22 -05:00
Josef Bacik
07103a3cdb Btrfs: fix free space accounting when unpinning extents
This patch fixes what I hope is the last early ENOSPC bug left.  I did not know
that pinned extents would merge into one big extent when inserted on to the
pinned extent tree, so I was adding free space to a block group that could
possibly span multiple block groups.

This is a big issue because first that space doesn't exist in that block group,
and second we won't actually use that space because there are a bunch of other
checks to make sure we're allocating within the constraints of the block group.

This patch fixes the problem by adding the btrfs_add_free_space to
btrfs_update_pinned_extents which makes sure we are adding the appropriate
amount of free space to the appropriate block group.  Thanks much to Lee Trager
for running my myriad of debug patches to help me track this problem down.
Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-19 15:17:55 -05:00
Liu Hui
b4eec2ca11 Btrfs: Some fixes for batching extent insert.
In insert_extents(), when ret==1 and last is not zero, it should
check if the current inserted item is the last item in this batching
inserts. If so, it should just break from loop. If not, 'cur =
insert_list->next' will make no sense because the list is empty now,
and 'op' will point to an unexpectable place.

There are also some trivial fixs in this patch including one comment
typo error and deleting two redundant lines.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-18 11:30:10 -05:00
Josef Bacik
4ce4cb526f Btrfs: Add some debugging around the ENOSPC bugs
Some people are still reporting problems with early enospc.  This
will help narrown down the cause.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-17 21:12:00 -05:00
Josef Bacik
e3e469f86e Btrfs: fix free space leak
In my batch delete/update/insert patch I introduced a free space leak.  The
extent that we do the original search on in free_extents is never pinned, so we
always update the block saying that it has free space, but the free space never
actually gets added to the free space tree, since op->del will always be 0 and
it's never actually added to the pinned extents tree.

This patch fixes this problem by making sure we call pin_down_bytes on the
pending extent op and set op->del to the return value of pin_down_bytes so
update_block_group is called with the right value.  This seems to fix the case
where we were getting ENOSPC when there was plenty of space available.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-17 21:11:49 -05:00
Yan Zheng
2b82032c34 Btrfs: Seed device support
Seed device is a special btrfs with SEEDING super flag
set and can only be mounted in read-only mode. Seed
devices allow people to create new btrfs on top of it.

The new FS contains the same contents as the seed device,
but it can be mounted in read-write mode.

This patch does the following:

1) split code in btrfs_alloc_chunk into two parts. The first part does makes
the newly allocated chunk usable, but does not do any operation that modifies
the chunk tree. The second part does the the chunk tree modifications. This
division is for the bootstrap step of adding storage to the seed device.

2) Update device management code to handle seed device.
The basic idea is: For an FS grown from seed devices, its
seed devices are put into a list. Seed devices are
opened on demand at mounting time. If any seed device is
missing or has been changed, btrfs kernel module will
refuse to mount the FS.

3) make btrfs_find_block_group not return NULL when all
block groups are read-only.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-17 21:11:30 -05:00
Yan Zheng
c146afad2c Btrfs: mount ro and remount support
This patch adds mount ro and remount support. The main
changes in patch are: adding btrfs_remount and related
helper function; splitting the transaction related code
out of close_ctree into btrfs_commit_super; updating
allocator to properly handle read only block group.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-12 14:34:12 -05:00
Josef Bacik
f3465ca44e Btrfs: batch extent inserts/updates/deletions on the extent root
While profiling the allocator I noticed a good amount of time was being spent in
finish_current_insert and del_pending_extents, and as the filesystem filled up
more and more time was being spent in those functions.  This patch aims to try
and reduce that problem.  This happens two ways

1) track if we tried to delete an extent that we are going to update or insert.
Once we get into finish_current_insert we discard any of the extents that were
marked for deletion.  This saves us from doing unnecessary work almost every
time finish_current_insert runs.

2) Batch insertion/updates/deletions.  Instead of doing a btrfs_search_slot for
each individual extent and doing the needed operation, we instead keep the leaf
around and see if there is anything else we can do on that leaf.  On the insert
case I introduced a btrfs_insert_some_items, which will take an array of keys
with an array of data_sizes and try and squeeze in as many of those keys as
possible, and then return how many keys it was able to insert.  In the update
case we search for an extent ref, update the ref and then loop through the leaf
to see if any of the other refs we are looking to update are on that leaf, and
then once we are done we release the path and search for the next ref we need to
update.  And finally for the deletion we try and delete the extent+ref in pairs,
so we will try to find extent+ref pairs next to the extent we are trying to free
and free them in bulk if possible.

This along with the other cluster fix that Chris pushed out a bit ago helps make
the allocator preform more uniformly as it fills up the disk.  There is still a
slight drop as we fill up the disk since we start having to stick new blocks in
odd places which results in more COW's than on a empty fs, but the drop is not
nearly as severe as it was before.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-12 14:19:50 -05:00
Chris Mason
2ed6d66408 Btrfs: Fix handling of space info full during allocations
When we fail to allocate a new block group, we should still do the
checks to make sure allocations try again with the minimum requested
allocation size.

This also fixes a deadlock that come from a missed down_read in
the chunk allocation failure handling.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-13 09:59:33 -05:00
Chris Mason
8a1413a296 Btrfs: empty_size allocation fixes again
The allocator wasn't catching all of the cases where it needed to do
extra loops because the check to enforce them wasn't happening early
enough.

When the allocator decided to increase the size of the allocation
for metadata clustering, it wasn't always setting the empty_size to
include the extra (optional) bytes.  This also fixes the empty_size field
to be correct.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 16:13:54 -05:00
Chris Mason
f5a31e1667 Btrfs: Try harder while searching for free space
The loop searching for free space would exit out too soon when
metadata clustering was trying to allocate a large extent.  This makes
sure a full scan of the free space is done searching for only the
minimum extent size requested by the higher layers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 11:47:09 -05:00
Chris Mason
5b7c3fcc46 Btrfs: Don't substract too much from the allocation target (avoid wrapping)
When metadata allocation clustering has to fall back to unclustered
allocs because large free areas could not be found, it was sometimes
substracting too much from the total bytes to allocate.  This would
make it wrap below zero.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 07:26:33 -05:00
Chris Mason
42e70e7a2f Btrfs: Fix more false enospc errors and an oops from empty clustering
In comes cases the empty cluster was added twice to the total number of
bytes the allocator was trying to find.

With empty clustering on, the hint byte was sometimes outside of the
block group.  Add an extra goto to find the correct block group.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-07 18:17:11 -05:00
Chris Mason
4366211ccd Btfs: More metadata allocator optimizations
This lowers the empty cluster target for metadata allocations.  The lower
target makes it easier to do allocations and still seems to perform well.

It also fixes the allocator loop to drop the empty cluster when things
start getting difficult, avoiding false enospc warnings.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-07 09:06:11 -05:00
Chris Mason
3b7885bf96 Btrfs: enforce metadata allocation clustering
The allocator uses the last allocation as a starting point for metadata
allocations, and tries to allocate in clusters of at least 256k.

If the search for a free block fails to find the expected block, this patch
forces a new cluster to be found in the free list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-06 21:48:27 -05:00
Chris Mason
771ed689d2 Btrfs: Optimize compressed writeback and reads
When reading compressed extents, try to put pages into the page cache
for any pages covered by the compressed extent that readpages didn't already
preload.

Add an async work queue to handle transformations at delayed allocation processing
time.  Right now this is just compression.  The workflow is:

1) Find offsets in the file marked for delayed allocation
2) Lock the pages
3) Lock the state bits
4) Call the async delalloc code

The async delalloc code clears the state lock bits and delalloc bits.  It is
important this happens before the range goes into the work queue because
otherwise it might deadlock with other work queue items that try to lock
those extent bits.

The file pages are compressed, and if the compression doesn't work the
pages are written back directly.

An ordered work queue is used to make sure the inodes are written in the same
order that pdflush or writepages sent them down.

This changes extent_write_cache_pages to let the writepage function
update the wbc nr_written count.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-06 22:02:51 -05:00
Yan Zheng
d899e05215 Btrfs: Add fallocate support v2
This patch updates btrfs-progs for fallocate support.

fallocate is a little different in Btrfs because we need to tell the
COW system that a given preallocated extent doesn't need to be
cow'd as long as there are no snapshots of it.  This leverages the
-o nodatacow checks.
 
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-30 14:25:28 -04:00
Yan Zheng
80ff385665 Btrfs: update nodatacow code v2
This patch simplifies the nodatacow checker. If all references
were created after the latest snapshot, then we can avoid COW
safely. This patch also updates run_delalloc_nocow to do more
fine-grained checking.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-30 14:20:02 -04:00
Yan Zheng
6643558db2 Btrfs: Fix bookend extent race v2
When dropping middle part of an extent, btrfs_drop_extents truncates
the extent at first, then inserts a bookend extent.

Since truncation and insertion can't be done atomically, there is a small
period that the bookend extent isn't in the tree. This causes problem for
functions that search the tree for file extent item. The way to fix this is
lock the range of the bookend extent before truncation.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-30 14:19:50 -04:00
Chris Mason
87ef2bb46b Btrfs: prevent looping forever in finish_current_insert and del_pending_extents
finish_current_insert and del_pending_extents process extent tree modifications
that build up while we are changing the extent tree.  It is a confusing
bit of code that prevents recursion.

Both functions run through a list of pending operations and both funcs
add to the list of pending operations.  If you have two procs in either
one of them, they can end up looping forever making more work for each other.

This patch makes them walk forward through the list of pending changes instead
of always trying to process the entire list.  At transaction commit
time, we catch any changes that were left over.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-30 11:23:27 -04:00
Yan Zheng
84234f3a1f Btrfs: Add root tree pointer transaction ids
This patch adds transaction IDs to root tree pointers.
Transaction IDs in tree pointers are compared with the
generation numbers in block headers when reading root
blocks of trees. This can detect some types of IO errors.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-29 14:49:05 -04:00
Josef Bacik
2517920135 Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.

There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.

The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same.  Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.

Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one.  I have tested this heavily and it does
not appear to break anything.  This has to be applied on top of my
find_free_extent redo patch.

I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-29 14:49:05 -04:00
Josef Bacik
80eb234af0 Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had.  It only happens with Metadata writes, and happens _very_
infrequently.  What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start.  We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP.  We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.

If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of.  This patch completely kills find_free_space and
rolls it into find_free_extent.  I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.

The basic flow is this:  We have the variable loop which is 0, meaning we are
in the hint phase.  We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of.  If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.

This is also where we add the empty_cluster to total_needed.  At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups.  If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good.  If not we start
over at space_info->block_groups and loop through again, with loop == 2.  If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.

Also I've made a groups_sem to handle the group list for the space_info.  This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time.  Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-29 14:49:05 -04:00
Yan Zheng
f82d02d9d8 Btrfs: Improve space balancing code
This patch improves the space balancing code to keep more sharing
of tree blocks. The only case that breaks sharing of tree blocks is
data extents get fragmented during balancing. The main changes in
this patch are:

Add a 'drop sub-tree' function. This solves the problem in old code
that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block.

Remove relocation mapping tree. Relocation mappings are stored in
struct btrfs_ref_path and updated dynamically during walking up/down
the reference path. This reduces CPU usage and simplifies code.

This patch also fixes a bug. Root items for reloc trees should be
updated in btrfs_free_reloc_root.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-29 14:49:05 -04:00
Chris Mason
c8b978188c Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents.  It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress.  Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler.  This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now.  This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field.  Neither the encryption or the
'other' field are currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k.  This is a
software only limit, the disk format supports u64 sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k.  This is a software only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data.  This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread.  This makes it tricky to
spread the compression load across all the cpus on the box.  We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
Yan Zheng
5b84e8d6ee Btrfs: Fix leaf reference cache miss
Due to the optimization for truncate, tree leaves only containing
checksum items can be deleted without being COW'ed first. This causes
reference cache misses. The way to fix the miss is create cache
entries for tree leaves only contain checksum.

This patch also fixes a -EEXIST issue in shared reference cache.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-09 11:46:19 -04:00
Yan Zheng
3bb1a1bc42 Btrfs: Remove offset field from struct btrfs_extent_ref
The offset field in struct btrfs_extent_ref records the position
inside file that file extent is referenced by. In the new back
reference system, tree leaves holding references to file extent
are recorded explicitly. We can scan these tree leaves very quickly, so the
offset field is not required.

This patch also makes the back reference system check the objectid
when extents are in deleting.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-09 11:46:24 -04:00
Yan Zheng
a76a3cd40c Btrfs: Count space allocated to file in bytes
This patch makes btrfs count space allocated to file in bytes instead
of 512 byte sectors.

Everything else in btrfs uses a byte count instead of sector sizes or
blocks sizes, so this fits better.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-09 11:46:29 -04:00
Chris Mason
30c43e2444 Btrfs: remove last_log_alloc allocator optimization
The tree logging code was trying to separate tree log allocations
from normal metadata allocations to improve writeback patterns during
an fsync.

But, the code was not effective and ended up just mixing tree log
blocks with regular metadata.  That seems to be working fairly well,
so the last_log_alloc code can be removed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-03 12:24:01 -04:00
Josef Bacik
cf74982385 Btrfs: fix deadlock between alloc_mutex/chunk_mutex
This fixes a deadlock that happens between the alloc_mutex and chunk_mutex.
Process A comes in, decides to do a do_chunk_alloc, which takes the
chunk_mutex, and is holding the alloc_mutex because the only way you get to
do_chunk_alloc is by holding the alloc_mutex.  btrfs_alloc_chunk does its thing
and goes to insert a new item, which results in a cow of the block.

We get into del_pending_extents from there, where if we need to be rescheduled
we drop the alloc_mutex and schedule.  At this point process B comes in to do
an allocation and gets the alloc_mutex, and because process A did not do the
chunk allocation completely it thinks its a good time to do a chunk allocation
as well, and hangs on the chunk_mutex.

Process A wakes up and tries to take the alloc_mutex and cannot.  The way to
fix this is do a mutex_trylock() on chunk_mutex.  If we return 0 we didn't get
the lock, and if this is just a "hey it may be a good time to allocate a chunk"
then we just exit.  If we are trying to force an allocation then we reschedule
and keep trying to acquire the chunk_mutex.  If once we acquire it the space is
already full then we can just exit, otherwise we can continue with the chunk
allocation.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-01 19:11:18 -04:00
Chris Mason
75ccf47d13 Btrfs: fix multi-device code to use raid policies set by mkfs
When reading in block groups, a global mask of the available raid policies
should be adjusted based on the types of block groups found on disk.  This
global mask is then used to decide which raid policy to use for new
block groups.

The recent allocator changes dropped the call that updated the global
mask, making all the block groups allocated at run time single striped
onto a single drive.

This also fixes the async worker threads to set any thread that uses
the requeue mechanism as busy.  This allows us to avoid blocking
on get_request_wait for the async bio submission threads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-30 19:36:34 -04:00
Josef Bacik
45b8c9a8b1 Btrfs: fix seekiness due to finding the wrong block group
This patch fixes a problem where we end up seeking too much when *last_ptr is
valid.  This happens because btrfs_lookup_first_block_group only returns a
block group that starts on or after the given search start, so if the
search_start is in the middle of a block group it will return the block group
after the given search_start, which is suboptimal.

This patch fixes that by doing a btrfs_lookup_block_group, which will return
the block group that contains the given search start.  If we fail to find a
block group, we fall back on btrfs_lookup_first_block_group so we can find the
next block group, not sure if this is absolutely needed, but better safe than
sorry.

Also if we can't find the block group that we need, or it happens to not be of
the right type, we need to add empty_cluster since *last_ptr could point to a
mismatched block group, which means we need to start over with empty_cluster
added to total needed.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-30 14:40:06 -04:00
Zheng Yan
1a40e23b95 Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format.  Before, btrfs-vol -b would break any COW links
on data blocks or metadata.  This was slow and caused the amount
of space used to explode if a large number of snapshots were present.

The new code can keeps the sharing of all data extents and
most of the tree blocks.

To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.

To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).

To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:09:34 -04:00
Zheng Yan
e465768938 Btrfs: Add shared reference cache
Btrfs has a cache of reference counts in leaves, allowing it to
avoid reading tree leaves while deleting snapshots.  To reduce
contention with multiple subvolumes, this cache is private to each
subvolume.

This patch adds shared reference cache support. The new space
balancing code plays with multiple subvols at the same time, So
the old per-subvol reference cache is not well suited.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:04:53 -04:00
Zheng Yan
e856981384 Btrfs: allocator fixes for space balancing update
* Reserved extent accounting:  reserved extents have been
allocated in the rbtrees that track free space but have not
been allocated on disk.  They were never properly accounted for
in the past, making it hard to know how much space was really free.

* btrfs_find_block_group used to return NULL for block groups that
had been removed by the space balancing code.  This made it hard
to account for space during the final stages of a balance run.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:05:48 -04:00
Chris Mason
4434c33c7f Btrfs: fix sleep with spinlock held during unmount
The code to free block groups needs to drop the space info spin lock
before calling btrfs_remove_free_space_cache (which can schedule).

This is safe because at unmount time, nobody else is going to play
with the block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 15:41:59 -04:00
Zheng Yan
31840ae1a6 Btrfs: Full back reference support
This patch makes the back reference system to explicit record the
location of parent node for all types of extents. The location of
parent node is placed into the offset field of backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
1c2308f8e7 Add check for tree-log roots in btrfs_alloc_reserved_extents
Tree log blocks are only reserved, and should not ever get fully
allocated on disk.  This check makes sure they stay out of the
extent tree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Josef Bacik
0f9dd46cda Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size.  The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing.  If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible.  When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.

2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset.  also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.

3) cleaned up the allocation code to make it a little easier to read and a
little less complicated.  Basically there are 3 steps, first look from our
provided hint.  If we couldn't find from that given hint, start back at our
original search start and look for space from there.  If that fails try to
allocate space if we can and start looking again.  If not we're screwed and need
to start over again.

4) small fixes.  there were some issues in volumes.c where we wouldn't allocate
the rest of the disk.  fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space.  Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space.  Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups.  The alloc_hint has fixed
this slight degredation and made things semi-normal.

There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space.  This only happens with metadata
allocations, and only when we are almost full.  So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall.  I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Josef Bacik
ef8bbdfe7e Btrfs: fix cache_block_group error handling
cache block group had a few bugs in the error handling code,
this makes sure paths get properly released and the correct return value
goes out.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
d0c803c404 Btrfs: Record dirty pages tree-log pages in an extent_io tree
This is the same way the transaction code makes sure that all the
other tree blocks are safely on disk.  There's an extent_io tree
for each root, and any blocks allocated to the tree logs are
recorded in that tree.

At tree-log sync, the extent_io tree is walked to flush down the
dirty pages and wait for them.

The main benefit is less time spent walking the tree log and skipping
clean pages, and getting sequential IO down to the drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
d00aff0013 Btrfs: Optimize tree log block allocations
Since tree log blocks get freed every transaction, they never really
need to be written to disk.  This skips the step where we update
metadata to record they were allocated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
4bef084857 Btrfs: Tree logging fixes
* Pin down data blocks to prevent them from being reallocated like so:

trans 1: allocate file extent
trans 2: free file extent
trans 3: free file extent during old snapshot deletion
trans 3: allocate file extent to new file
trans 3: fsync new file

Before the tree logging code, this was legal because the fsync
would commit the transation that did the final data extent free
and the transaction that allocated the extent to the new file
at the same time.

With the tree logging code, the tree log subtransaction can commit
before the transaction that freed the extent.  If we crash,
we're left with two different files using the extent.

* Don't wait in start_transaction if log replay is going on.  This
avoids deadlocks from iput while we're cleaning up link counts in the
replay code.

* Don't deadlock in replay_one_name by trying to read an inode off
the disk while holding paths for the directory

* Hold the buffer lock while we mark a buffer as written.  This
closes a race where someone is changing a buffer while we write it.
They are supposed to mark it dirty again after they change it, but
this violates the cow rules.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason
e02119d5a7 Btrfs: Add a write ahead tree log to optimize synchronous operations
File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree.  There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.

After a crash, items are copied out of the log tree and back into the
subvolume.  See tree-log.c for all the details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
David Woodhouse
21af804c07 Btrfs: Discard sector data in __free_extent()
Date: Tue, 12 Aug 2008 14:13:26 +0100
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Yan Zheng
7ea394f119 Btrfs: Fix nodatacow for the new data=ordered mode
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
ea8c281947 Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
d7a029a89e Btrfs: Don't corrupt ram in shrink_extent_tree, leak it instead
Far from the perfect fix, but these structs are small.  TODO for the
next release.  The block group cache structs are referenced in many
different places, and it isn't safe to just free them while resizing.

A real fix will be a larger change to the allocator so that it doesn't
have to carry about the block group cache structs to find good places
to search for free blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
2dd3e67b1e Btrfs: More throttle tuning
* Make walk_down_tree wake up throttled tasks more often
* Make walk_down_tree call cond_resched during long loops
* As the size of the ref cache grows, wait longer in throttle
* Get rid of the reada code in walk_down_tree, the leaves don't get
  read anymore, thanks to the ref cache.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
65b51a009e btrfs_search_slot: reduce lock contention by cowing in two stages
A btree block cow has two parts, the first is to allocate a destination
block and the second is to copy the old bock over.

The first part needs locks in the extent allocation tree, and may need to
do IO.  This changeset splits that into a separate function that can be
called without any tree locks held.

btrfs_search_slot is changed to drop its path and start over if it has
to COW a contended block.  This often means that many writers will
pre-alloc a new destination for a the same contended block, but they
cache their prealloc for later use on lower levels in the tree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
18e35e0ab3 Btrfs: Throttle less often waiting for snapshots to delete
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
f87f057b49 Btrfs: Improve and cleanup locking done by walk_down_tree
While dropping snapshots, walk_down_tree does most of the work of checking
reference counts and limiting tree traversal to just the blocks that
we are freeing.

It dropped and held the allocation mutex in strange and confusing ways,
this commit changes it to only hold the mutex while actually freeing a block.

The rest of the checks around reference counts should be safe without the lock
because we only allow one process in btrfs_drop_snapshot at a time.  Other
processes dropping reference counts should not drop it to 1 because
their tree roots already have an extra ref on the block.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason
37d1aeee39 Btrfs: Throttle tuning
This avoids waiting for transactions with pages locked by breaking out
the code to wait for the current transaction to close into a function
called by btrfs_throttle.

It also lowers the limits for where we start throttling.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
47ac14fa0f Btrfs: Add missing hunk from Yan Zheng's cache reclaim patch
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan
bcc63abbf3 Btrfs: implement memory reclaim for leaf reference cache
The memory reclaiming issue happens when snapshot exists. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
the patch makes all dead roots of a given snapshot linked together in order of
create time. After a old snapshot was completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan Zheng
f321e49103 Btrfs: Update and fix mount -o nodatacow
To check whether a given file extent is referenced by multiple snapshots, the
checker walks down the fs tree through dead root and checks all tree blocks in
the path.

We can easily detect whether a given tree block is directly referenced by other
snapshot. We can also detect any indirect reference from other snapshot by
checking reference's generation. The checker can always detect multiple
references, but can't reliably detect cases of single reference. So btrfs may
do file data cow even there is only one reference.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
ab78c84de1 Btrfs: Throttle operations if the reference cache gets too large
A large reference cache is directly related to a lot of work pending
for the cleaner thread.  This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.

Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
017e5369eb Btrfs: Leaf reference cache update
This changes the reference cache to make a single cache per root
instead of one cache per transaction, and to key by the byte number
of the disk block instead of the keys inside.

This makes it much less likely to have cache misses if a snapshot
or something has an extra reference on a higher node or a leaf while
the first transaction that added the leaf into the cache is dropping.

Some throttling is added to functions that free blocks heavily so they
wait for old transactions to drop.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan Zheng
31153d8128 Btrfs: Add a leaf reference cache
Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.

This creates a cache so that IO isn't required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan
974e35a82d Btrfs: Properly release lock in pin_down_bytes
When buffer isn't uptodate, pin_down_bytes may leave the tree locked
after it returns.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Josef Bacik
8e8a1e31f2 Btrfs: Fix a few functions that exit without stopping their transaction
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
3eaa288527 Btrfs: Fix the defragmention code and the block relocation code for data=ordered
Before setting an extent to delalloc, the code needs to wait for
pending ordered extents.

Also, the relocation code needs to wait for ordered IO before scanning
the block group again.  This is because the extents are not removed
until the IO for the new extents is finished

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
c286ac48ed Btrfs: alloc_mutex latency reduction
This releases the alloc_mutex in a few places that hold it for over long
operations.  btrfs_lookup_block_group is changed so that it doesn't need
the mutex at all.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
e34a5b4f77 Btrfs: Add some conditional schedules near the alloc_mutex
This helps prevent stalls, especially while the snapshot cleaner is
running hard

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
a61e6f29dc Btrfs: Use a mutex in the extent buffer for tree block locking
This replaces the use of the page cache lock bit for locking, which wasn't
suitable for block size < page size and couldn't be used recursively.

The mutexes alone don't fix either problem, but they are the first step.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
4a09675279 Btrfs: Data ordered fixes
* In btrfs_delete_inode, wait for ordered extents after calling
truncate_inode_pages.  This is much faster, and more correct

* Properly clear our the PageChecked bit everywhere we redirty the page.

* Change the writepage fixup handler to lock the page range and check to
see if an ordered extent had been inserted since the improperly dirtied
page was discovered

* Wait for ordered extents outside the transaction.  This isn't required
for locking rules but does improve transaction latencies

* Reduce contention on the alloc_mutex by dropping it while incrementing
refs on a node/leaf and while dropping refs on a leaf.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
54641bd17d Btrfs: Force caching of metadata block groups on mount to avoid deadlock
This is a temporary change to avoid deadlocks until the extent tree locking
is fixed up.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason
ee6e6504e1 Add a per-inode lock around btrfs_drop_extents
btrfs_drop_extents is always called with a range lock held on the inode.
But, it may operate on extents outside that range as it drops and splits
them.

This patch adds a per-inode mutex that is held while calling
btrfs_drop_extents and while inserting new extents into the tree.  It
prevents races from two procs working against adjacent ranges in the tree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
e6dcd2dc9c Btrfs: New data=ordered implementation
The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated an inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, The previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
7d9eb12c87 Btrfs: Add locking around volume management (device add/remove/balance)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
3f157a2fd2 Btrfs: Online btree defragmentation fixes
The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.

This also disables the automatic btree defrag, it wasn't scaling well to
huge filesystems.  The auto-defrag needs to be done differently.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason
079899c238 Btrfs: Change find_extent_buffer to use TestSetPageLocked
This makes it possible for callers to check for extent_buffers in cache
without deadlocking against any btree locks held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
e7a84565bc Btrfs: Add btree locking to the tree defragmentation code
The online btree defragger is simplified and rewritten to use
standard btree searches instead of a walk up / down mechanism.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
a74a4b97b6 Btrfs: Replace the transaction work queue with kthreads
This creates one kthread for commits and one kthread for
deleting old snapshots.  All the work queues are removed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
333db94cdd Btrfs: Fix snapshot deletion to release the alloc_mutex much more often.
This lowers the impact of snapshot deletion on the rest of the FS.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
5cd57b2cbb Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it
Allocations may need to read in block groups from the extent allocation tree,
which will require a tree search and take locks on the extent allocation
tree.  But, those locks might already be held in other places, leading
to deadlocks.

Since the alloc_mutex serializes everything right now, it is safe to
skip the btree locking while caching block groups.  A better fix will be
to either create a recursive lock or find a way to back off existing
locks while caching block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
051e1b9f74 Drop locks in btrfs_search_slot when reading a tree block.
One lock per btree block can make for significant congestion if everyone
has to wait for IO at the high levels of the btree.  This drops
locks held by a path when doing reads during a tree search.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
a213501153 Btrfs: Replace the big fs_mutex with a collection of other locks
Extent alloctions are still protected by a large alloc_mutex.
Objectid allocations are covered by a objectid mutex
Other btree operations are protected by a lock on individual btree nodes

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
925baeddc5 Btrfs: Start btree concurrency work.
The allocation trees and the chunk trees are serialized via their own
dedicated mutexes.  This means allocation location is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree.  Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked.  Releasing or freeing the path drops any locks held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
0ef3e66b67 Btrfs: Allocator fix variety pack
* Force chunk allocation when find_free_extent has to do a full scan
* Record the max key at the start of defrag so it doesn't run forever
* Block groups might not be contiguous, make a forward search for the
  next block group in extent-tree.c
* Get rid of extra checks for total fs size
* Fix relocate_one_reference to avoid relocating the same file data block
  twice when referenced by an older transaction
* Use the open device count when allocating chunks so that we don't
  try to allocate from devices that don't exist

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
1259ab75c6 Btrfs: Handle write errors on raid1 and raid10
When duplicate copies exist, writes are allowed to fail to one of those
copies.  This changeset includes a few changes that allow the FS to
continue even when some IOs fail.

It also adds verification of the parent generation number for btree blocks.
This generation is stored in the pointer to a block, and it ensures
that missed writes to are detected.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
ca7a79ad8d Btrfs: Pass down the expected generation number when reading tree blocks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
323da79c9f Btrfs: Chunk relocation fine tuning, and add a few printks to show progress
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason
bbaf549e0c Btrfs: A number of nodatacow fixes
Once part of a delalloc request fails the cow checks, just cow the
entire range

It is possible for the back references to all be from the same root,
but still have snapshots against an extent.  The checks are now more strict,
forcing cow any time there are multiple refs against the data extent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
a68d5933a0 Btrfs: Update nodatacow mode to support cloned single files and resizing
Before, nodatacow only checked to make sure multiple roots didn't have
references on a single extent.  This check makes sure that multiple
inodes don't have references.

nodatacow needed an extra check to see if the block group was currently
readonly.  This way cows forced by the chunk relocation code are honored.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
bf4ef67924 Btrfs: Properly find the root for snapshotted blocks during chunk relocation
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
a061fc8da7 Btrfs: Add support for online device removal
This required a few structural changes to the code that manages bdev pointers:

The VFS super block now gets an anon-bdev instead of a pointer to the
lowest bdev.  This allows us to avoid swapping the super block bdev pointer
around at run time.

The code to read in the super block no longer goes through the extent
buffer interface.  Things got ugly keeping the mapping constant.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
a236aed14c Btrfs: Deal with failed writes in mirrored configurations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
ec44a35cbe Btrfs: Add balance ioctl to restripe the chunks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
8e7bf94fd5 Btrfs: Do more optimal file RA during shrinking and defrag
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
3bf3d9e9c2 Btrfs: Avoid recursive chunk allocations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
8f18cf1339 Btrfs: Make the resizer work based on shrinking and growing devices
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
bce4eae986 Btrfs: Fix balance_level to free the middle block if there is room in the left one
balance level starts by trying to empty the middle block, and then
pushes from the right to the middle.  This might empty the right block
and leave a small number of pointers in the middle.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
3c12ac7205 Btrfs: Simplify device selection for mirrored reads
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason
3b951516ed Btrfs: Use the extent map cache to find the logical disk block during data retries
The data read retry code needs to find the logical disk block before it
can resubmit new bios.  But, finding this block isn't allowed to take
the fs_mutex because that will deadlock with a number of different callers.

This changes the retry code to use the extent map cache instead, but
that requires the extent map cache to have the extent we're looking for.
This is a problem because btrfs_drop_extent_cache just drops the entire
extent instead of the little tiny part it is invalidating.

The bulk of the code in this patch changes btrfs_drop_extent_cache to
invalidate only a portion of the extent cache, and changes btrfs_get_extent
to deal with the results.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
699122f559 Btrfs: Don't wait on tree block writeback before freeing them anymore
This isn't required anymore because we don't reallocate blocks that
have already been written in this transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
321aecc656 Btrfs: Add RAID10 support
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
e17cade25f Btrfs: Add chunk uuids and update multi-device back references
Block headers now store the chunk tree uuid

Chunk items records the device uuid for each stripes

Device extent items record better back refs to the chunk tree

Block groups record better back refs to the chunk tree

The chunk tree format has also changed.  The objectid of BTRFS_CHUNK_ITEM_KEY
used to be the logical offset of the chunk.  Now it is a chunk tree id,
with the logical offset being stored in the offset field of the key.

This allows a single chunk tree to record multiple logical address spaces,
upping the number of bytes indexed by a chunk tree from 2^64 to
2^128.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
98d20f67cf Add a min size parameter to btrfs_alloc_extent
On huge machines, delayed allocation may try to allocate massive extents.
This change allows btrfs_alloc_extent to return something smaller than
the caller asked for, and the data allocation routines will loop over
the allocations until it fills the whole delayed alloc.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Miguel
a5eb62e345 Btrfs: Endianess bug fix for v0.13 with kernels
Fix for a endianess BUG when using btrfs v0.13 with kernels older than 2.6.23

Problem:

Has of v0.13, btrfs-progs is using crc32c.c equivalent to the one found on
linux-2.6.23/lib/libcrc32c.c Since crc32c_le() changed in linux-2.6.23, when
running btrfs v0.13 with older kernels we have a missmatch between the versions
of crc32c_le() from btrfs-progs and libcrc32c in the kernel.  This missmatch
causes a bug when using btrfs on big endian machines.

Solution:
btrfs_crc32c() macro that when compiling for kernels older than 2.6.23, does
endianess conversion to parameters and return value of crc32c().
This endianess conversion nullifies the differences in implementation
of crc32c_le().
If kernel 2.6.23 or better, it calls crc32c().

Signed-off-by: Miguel Sousa Filipe <miguel.filipe@gmail.com>
---

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
ce9adaa5a7 Btrfs: Do metadata checksums for reads via a workqueue
Before, metadata checksumming was done by the callers of read_tree_block,
which would set EXTENT_CSUM bits in the extent tree to show that a given
range of pages was already checksummed and didn't need to be verified
again.

But, those bits could go away via try_to_releasepage, and the end
result was bogus checksum failures on pages that never left the cache.

The new code validates checksums when the page is read.  It is a little
tricky because metadata blocks can span pages and a single read may
end up going via multiple bios.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
d18a2c4475 Btrfs: Fix allocation profile init
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
6bc34676c0 Btrfs: Don't allow written blocks from this transaction to be reallocated
When a block is freed, it can be immediately reused if it is from
the current transaction.  But, an extra check is required to make sure
the block had not been written yet.  If it were reused after being written,
the transid in the block header might match the transid of the
next time the block was allocated.

The parent node records the transaction ID of the block it is pointing to,
and this is used as part of validating the block on reads.  So, there
can only be one version of a block per transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
611f0e00a2 Btrfs: Add support for duplicate blocks on a single spindle
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
8790d502e4 Btrfs: Add support for mirroring across drives
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
0999df54f8 Btrfs: Verify checksums on tree blocks found without read_tree_block
Checksums were only verified by btrfs_read_tree_block, which meant the
functions to probe the page cache for blocks were not validating checksums.
Normally this is fine because the buffers will only be in cache if they
have already been validated.

But, there is a window while the buffer is being read from disk where
it could be up to date in the cache but not yet verified.  This patch
makes sure all buffers go through checksum verification before they
are used.

This is safer, and it prevents modification of buffers before they go
through the csum code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
ecbe2402cb Btrfs: Keep fs_mutex during reads done by snapshot deletion
There was an optimization to drop the fs_mutex when doing snapshot deletion
reads, but this can lead to false positives on checksumming errors.  Keep
the lock for now.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
593060d756 Btrfs: Implement raid0 when multiple devices are present
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
239b14b32d Btrfs: Bring back mount -o ssd optimizations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
a9218f6b00 Add /dev/btrfs-control for device scanning ioctls
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
7d1660d411 Btrfs: Bring back find_free_extent CPU usage optimizations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
6324fbf334 Btrfs: Dynamic chunk and block group allocation
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason
0b86a832a1 Btrfs: Add support for multiple devices per filesystem
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
7f93bf8d27 Match the extent tree code to btrfs-progs for multi-device merging
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
952fccac50 Btrfs: Remove extent back refs in batches, and avoid duplicate searches
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
d7fc640e6f Btrfs: Allocator improvements
Reduce CPU time searching for free blocks by optimizing find_first_extent_bit

Fix find_free_extent to make better use of the last_alloc hint.  Before it
was often finding blocks just before the hint.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
9afbb0b752 Btrfs: Disable tree defrag in SSD mode
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
5d196fc15d Btrfs: Use 2MB as the empty_size for clustered allocations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
068fe39fa1 Btrfs: Add checks for last byte in disk to allocator grouping
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
f594706643 Btrfs: Add debugging for block group update failure
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
60cde612c8 Btrfs: Use last_alloc optimizations for metadata, even without -o ssd
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
21a4989d26 Btrfs: Hash in the offset and owner for file extent backref keys
This makes searches for backrefs and backref insertion much more efficient
when there are many backrefs for a single extent

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
47e4bb988c Btrfs: Insert extent record and the first backref in a single balance
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
4529ba495c Btrfs: Add data block hints to SSD mode too
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
291d673e6a Btrfs: Do delalloc accounting via hooks in the extent_state code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason
bea495e5b4 Btrfs: Tune readahead during defrag to avoid reading too much at once
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
d1310b2e0c Btrfs: Split the extent_map code into two parts
There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.

The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller.  This allows a few performance
optimizations and is easier to use.

A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
e18e4809b1 Btrfs: Add mount -o ssd, which includes optimizations for seek free storage
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
55c69072d6 Btrfs: Fix extent_buffer usage when nodesize != leafsize
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
c31f8830f0 Btrfs: online shrinking fixes
While shrinking the FS, the allocation functions need to make sure
they don't try to allocate bytes past the end of the FS.

nodatacow needed an extra check to force cows when the existing extents are
past the end of the FS.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
b0331a4c4c Btrfs: Disable btree reada during extent backref lookups.
This reada is generally not effective.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
dc17ff8f11 Btrfs: Add data=ordered support
This forces file data extents down the disk along with the metadata that
references them.  The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
725c8463ea Btrfs: resizer: don't hold the fs_mutex for long periods of time
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason
1372f8e609 Properly call btrfs_search_slot while shrinking
The shrinking code used btrfs_next_leaf to find the next item, but
this does not cow the blocks it touches.  This fix calls search_slot after
finding the next item to do appropriate cow and balancing.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Yan
73e48b277a Btrfs: Properly handle overlapping extent in shrink_extent_tree
The patch fixes the overlapping extent issue in shrink_extent_tree.
It checks whether there is an overlapping extent by using
find_previous_extent. If there is an overlapping extent, it setups
key.objectid and cur_byte properly.

---

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Yan
d548ee5182 Btrfs: Add a helper that finds previous extent item
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
bd09835d9a count_snapshots: Properly update the leaf pointer after btrfs_next_leaf
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
98ed51747b Btrfs: Force inlining off in a few places to save stack usage
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
f9ef6604ac Btrfs: 32 bit compile fixes for the resizer and enospc checks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
4313b3994d Btrfs: Reduce stack usage in the resizer, fix 32 bit compiles
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
56b453c92f Btrfs: Explicitly send a root objectid to count_snapshots_in_path
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
8f662a76c6 Btrfs: Add readahead to the online shrinker, and a mount -o alloc_start= for testing
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
e52ec0eb62 Btrfs: Fix NULL block groups on reading the inode
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
edbd8d4efe Btrfs: Support for online FS resize (grow and shrink)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
be20aa9dba Btrfs: Add mount option to turn off data cow
A number of workloads do not require copy on write data or checksumming.
mount -o nodatasum to disable checksums and -o nodatacow to disable
both copy on write and checksumming.

In nodatacow mode, copy on write is still performed when a given extent
is under snapshot.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
f6dbff55d7 Btrfs: Reorder extent back refs to differentiate btree blocks from file data
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
6caab489c5 Fix btrfs_inc_ref to add backref hints
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
70b043f0c7 Btrfs: Extra NULL block group checks in find_free_extent
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
d8d5f3e16d Btrfs: Add lowest key information to back refs for extent tree blocks as well.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
7bb86316c3 Btrfs: Add back pointers from extents to the btree or file referencing them
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
74493f7a59 Btrfs: Implement generation numbers in block pointers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
1a2b2ac78a Btrfs: Fix extent allocation for btree blocks as the disk fills
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
87ee04eb0f Btrfs: Add simple stripe size parameter
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
00f5c795fc btrfs_drop_extents: make sure the item is getting smaller before truncate
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
015a739c7c Btrfs: Handle writeback under high memory pressure better
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason
0e4de58432 Btrfs: Add check for null block group to find_search_start
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Yan
5cf664263b Btrfs: Off by one fixes for extent-tree.c
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan
5e5745dcaf Btrfs: Add full_scan parameter to find_search_start
This patch adds a new parameter 'full_scan' to 'find_search_start',
thereby 'find_search_start' can know whether 'find_free_extent' is in
full scan phrase. I feel that 'find_search_start' should skip calling
'btrfs_find_block_group' when 'find_free_extent' is in full scan
phrase. In my test on a 2GB volume, Oops occurs when space usage is
about 76%. After apply the patch,  Oops occurs when space usage is
near 100%.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan
324ae4df00 Btrfs: Add block group pinned accounting back
This patch adds a helper function 'update_pinned_extents' to
extent-tree.c. The usage of the helper function is similar to
'update_block_group',  the last parameter of the function indicates
pin vs unpin.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason
257d0ce36f Btrfs: Allow large data extents in a single file to span into metadata block groups
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason
f84a8b362d Btrfs: Optimize allocations as we need to mix data and metadata into one group
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan
c549228ff6 Btrfs: Properly update free space cache in __free_extent
When pin_down_bytes decides not to pin a block because it was from the
current transaction, make sure the in memory cache of free extents is updated

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan
b97f9203b4 Btrfs: Fix typo and memory leak in extent-tree.c
This patch fixes a typo in update_block_group and memory leak in
btrfs_free_block_groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Jens Axboe
ae2f5411c4 btrfs: 32-bit type problems
An assorted set of casts to get rid of the warnings on 32-bit archs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason
19c00ddcc3 Btrfs: Add back metadata checksumming
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
0f82731fc5 Breakout BTRFS_SETGET_FUNCS into a separate C file, the inlines were too big.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
4dc119046d Btrfs: Add an extent buffer LRU to reduce radix tree hits
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
e19caa5f0e Btrfs: Fix allocation routines to avoid intermixing data and metadata allocations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
6b80053d02 Btrfs: Add back the online defragging code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
db94535db7 Btrfs: Allow tree blocks larger than the page size
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
1a5bc167f6 Btrfs: Change the remaining radix trees used by extent-tree.c to extent_map trees
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
96b5179d0d Btrfs: Stop using radix trees for the block group cache
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
f510cfecfc Btrfs: Fix extent_buffer and extent_state leaks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
5f39d397df Btrfs: Create extent_buffer interface for large blocksizes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
cf67582bb2 Btrfs: Fix duplicate ENOSPC checks in find_free_extent
find_free_extent would fail to wrap around to the start of the drive because
it was doing the enospc case checking twice in some cases, causing it
to return -ENOSPC early.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason
d3c2fdcf7b Btrfs: Use balance_dirty_pages_nr on btree blocks
btrfs_btree_balance_dirty is changed to pass the number of pages dirtied
for more accurate dirty throttling.  This lets the VM make better decisions
about when to force some writeback.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:00:48 -04:00
Yan
7d7d6068be Btrfs: Fix cache_block_group to catch holes at the start of the group
Cache block group was overly complex and missed free blocks at the very start
of the group.  This patch simplifies things significantly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-14 16:15:28 -04:00
Yan
e9fe395e47 Btrfs: Fix oopsen in extent_tree.c during enospc
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-29 09:11:44 -04:00
Josef Bacik
58176a9604 Btrfs: Add per-root block accounting and sysfs entries
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-29 15:47:34 -04:00
Chris Mason
b888db2bd7 Btrfs: Add delayed allocation to the extent based page tree code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-27 16:49:44 -04:00
Chris Mason
2cc58cf24f Btrfs: Do more extensive readahead during tree searches
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-27 16:49:44 -04:00
Chris Mason
f2183bde1a Btrfs: Add BH_Defrag to mark buffers that are in need of defragging
This allows the tree walking code to defrag only the newly allocated
buffers, it seems to be a good balance between perfect defragging and the
performance hit of repeatedly reallocating blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-10 14:42:37 -04:00
Chris Mason
e9d0b13b5b Btrfs: Btree defrag on the extent-mapping tree as well
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-10 14:06:19 -04:00
Chris Mason
409eb95d7f Btrfs: Further reduce the concurrency penalty of defrag and drop_snapshot
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-08 20:17:12 -04:00
Chris Mason
26b8003f10 Btrfs: Replace extent tree preallocation code with some bit radix magic.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-08 20:17:12 -04:00
Chris Mason
f4468e94c8 Btrfs: Let some locks go during defrag and snapshot dropping
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-08 10:08:58 -04:00
Chris Mason
6702ed490c Btrfs: Add run time btree defrag, and an ioctl to force btree defrag
This adds two types of btree defrag, a run time form that tries to
defrag recently allocated blocks in the btree when they are still in ram,
and an ioctl that forces defrag of all btree blocks.

File data blocks are not defragged yet, but this can make a huge difference
in sequential btree reads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-07 16:15:09 -04:00
Chris Mason
3c69faecb8 Btrfs: Fold some btree readahead routines into something more generic.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-07 15:52:22 -04:00
Chris Mason
9f3a742736 Btrfs: Do snapshot deletion in smaller chunks.
Before, snapshot deletion was a single atomic unit.  This caused considerable
lock contention and required an unbounded amount of space.  Now,
the drop_progress field in the root item is used to indicate how far along
snapshot deletion is, and to resume where it left off.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-07 15:52:19 -04:00
Zach Brown
ec6b910fb3 Btrfs: trivial include fixups
Almost none of the files including module.h need to do so,
remove them.

Include sched.h in extent-tree.c to silence a warning about cond_resched()
being undeclared.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-07-11 10:00:37 -04:00
Chris Mason
ccd467d60e Btrfs: crash recovery fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-28 15:57:36 -04:00
Chris Mason
f2654de42a Btrfs: Allow find_free_extent callers to pass in an exclusion range
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-26 12:20:46 -04:00
Chris Mason
4b52dff6d3 Btrfs: Fix super block updates during transaction commit
The super block written during commit was not consistent with the state of
the trees.  This change adds an in-memory copy of the super so that we can
make sure to write out consistent data during a commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-26 10:06:50 -04:00
Chris Mason
54aa1f4dfd Btrfs: Audit callers and return codes to make sure -ENOSPC gets up the stack
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-22 14:16:25 -04:00
Chris Mason
e011599b0f Btrfs: reada while dropping snapshots
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-19 16:23:05 -04:00
Chris Mason
85e55b13e4 Btrfs: cache the extent tree preallocation
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-19 15:50:51 -04:00
Chris Mason
8c2383c3dd Subject: Rework btrfs_file_write to only allocate while page locks are held
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-18 09:57:58 -04:00
Aneesh
f1ace244c8 btrfs: Code cleanup
Attaching below is some of the code cleanups that i came across while
reading the code.

a) alloc_path already calls init_path.
b) Mention that btrfs_inode is the in memory copy.Ext4 have ext4_inode_info as
the in memory copy ext4_inode as the disk copy

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-13 16:18:26 -04:00
Chris Mason
6cbd557078 Btrfs: add GPLv2
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12 09:07:21 -04:00
Chris Mason
5af3981c18 Btrfs: printk fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12 07:50:13 -04:00
Chris Mason
84f54cfa78 Btrfs: 64 bit div fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12 07:43:08 -04:00
Chris Mason
5276aedab0 Btrfs: fix oops after block group lookup
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-11 21:33:38 -04:00
Chris Mason
fabb568183 Btrfs: d_type optimization
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-07 22:13:21 -04:00
Chris Mason
fbdc762b4e Btrfs: use a separate flag for search_start vs a hint in find_free_extent
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-30 10:22:12 -04:00
Chris Mason
1e2677e000 Btrfs: block group switching
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-29 16:52:18 -04:00
Chris Mason
3a68637562 Btrfs: sparse files!
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-24 13:35:57 -04:00
Chris Mason
de428b63b1 Btrfs: allocator optimizations, truncate readahead
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-18 13:28:27 -04:00
Chris Mason
8d7be552a7 Btrfs: fix check_node and check_leaf to use less cpu
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-10 11:24:42 -04:00
Chris Mason
e37c9e6921 Btrfs: many allocator fixes, pretty solid
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-09 20:13:14 -04:00
Chris Mason
3e1ad54fe2 Btrfs: allocator and tuning
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-07 20:03:49 -04:00
Chris Mason
be74417553 Btrfs: more allocator enhancements
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-06 10:15:01 -04:00
Chris Mason
be08c1b9f8 Btrfs: early metadata/data split
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-03 09:06:49 -04:00
Chris Mason
35b7e47610 Btrfs: fix page cache memory leak
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-02 15:53:43 -04:00
Chris Mason
090d18753c Btrfs: directory readahead
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-01 08:53:32 -04:00
Chris Mason
31f3c99b73 Btrfs: allocator improvements, inode block groups
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-30 15:25:45 -04:00
Chris Mason
308535a05e Btrfs: prealloc more blocks for the extent map
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-28 15:17:08 -04:00
Chris Mason
7c4452b9a6 Btrfs: smarter transaction writeback
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-28 09:29:35 -04:00
Chris Mason
06a2f9fa4c Btrfs: try to drop dead cow pages from ram
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-28 08:48:10 -04:00
Chris Mason
28b8bb9e00 Btrfs: allocator tweaks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-27 11:42:05 -04:00
Chris Mason
cd1bc4653d Btrfs: more block allocator work
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-27 10:08:34 -04:00
Chris Mason
9078a3e1e4 Btrfs: start of block group code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-26 16:46:15 -04:00
Chris Mason
f2458e1d8c Btrfs: change around extent-tree prealloc
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-25 15:52:25 -04:00
Chris Mason
c62a1920ce Btrfs: get rid of the extent_item type field
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-24 12:07:39 -04:00
Chris Mason
5d0c3e60fe Btrfs: fix extent owner/type setting on extent tree blocks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-23 17:01:05 -04:00
Chris Mason
4d77567309 Btrfs: add owner and type fields to the extents aand block headers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-20 20:23:12 -04:00
Chris Mason
236454dfff Btrfs: many file_write fixes, inline data
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-19 13:37:44 -04:00
Chris Mason
a429e51371 Btrfs: working file_write, reorganized key flags
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-18 16:15:28 -04:00
Chris Mason
b18c668581 Btrfs: progress on file_write
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-17 13:26:50 -04:00
Chris Mason
7eccb903a8 Btrfs: create a logical->phsyical block number mapping scheme
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-11 15:53:25 -04:00
Chris Mason
d0dbc6245c Btrfs: drop owner and parentid
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-10 12:36:36 -04:00
Chris Mason
c5739bba52 Btrfs: snapshot progress
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-10 09:27:04 -04:00
Chris Mason
5f26f772e5 Btrfs: more inode indexed directory work
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-05 10:38:44 -04:00
Chris Mason
b1a4d96509 Btrfs: tweak the inode-map and free extent search starts on cold mount
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-04 15:27:52 -04:00
Chris Mason
2da566edd8 Btrfs: csum_verify_file_block locking fix
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-02 15:43:21 -04:00
Chris Mason
5caf2a0029 Btrfs: dynamic allocation of path struct
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-02 11:20:42 -04:00
Chris Mason
2c90e5d658 Btrfs: still corruption hunting
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-04-02 10:50:19 -04:00
Chris Mason
d602557953 Btrfs: corruption hunt continues
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-30 14:27:56 -04:00
Chris Mason
d98237b3ed Btrfs: use a btree inode instead of sb_getblk
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-28 13:57:48 -04:00
Chris Mason
f4b9aa8d3b btrfs_truncate
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-27 11:05:53 -04:00
Chris Mason
6407bf6d7c Btrfs: reference counts on data extents
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-27 06:33:00 -04:00
Chris Mason
dee26a9f7a btrfs_get_block, file read/write
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-26 16:00:06 -04:00
Chris Mason
8ef97622ca Btrfs: add a radix back bit tree
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-26 10:15:30 -04:00
Chris Mason
78fae27ebf Btrfs: leak fixes, pinning fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-25 11:35:08 -04:00
Chris Mason
d561c025ee Btrfs: very minimal locking
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-23 19:47:49 -04:00
Chris Mason
df2ce34c88 Btrfs: properly set new buffers for new blocks up to date
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-23 11:00:45 -04:00
Chris Mason
d571976292 btrfs_create, btrfs_write_super, btrfs_sync_fs
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-23 10:01:08 -04:00
Chris Mason
e20d96d64f Mountable btrfs, with readdir
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-22 12:13:20 -04:00
Chris Mason
2e635a2783 Btrfs: initial move to kernel module land
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-21 11:12:56 -04:00
Chris Mason
1261ec42b3 Btrfs: Better block record keeping, real mkfs
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-20 20:35:03 -04:00
Chris Mason
9f5fae2fe6 Btrfs: Add inode map, and the start of file extent items
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-20 14:38:32 -04:00
Chris Mason
e089f05c18 Btrfs: transaction handles everywhere
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-16 16:20:31 -04:00
Chris Mason
88fd146c27 Btrfs: pin freed blocks from the FS tree too
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-16 08:56:18 -04:00
Chris Mason
62e2749e03 Btrfs: Use a chunk of the key flags to record the item type.
Add (untested and simple) directory item code
Fix comp_keys to use the new key ordering
Add btrfs_insert_empty_item

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-15 12:56:47 -04:00
Chris Mason
123abc88c9 Btrfs: variable block size support
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-14 14:14:43 -04:00
Chris Mason
4beb1b8b75 Btrfs: add leaf data casting helper
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-14 10:31:29 -04:00
Chris Mason
710874947a Btrfs: properly reset block cache on free
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-14 09:20:39 -04:00
Chris Mason
3768f3689f Btrfs: Change the super to point to a tree of trees to enable persistent snapshots
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-13 16:47:54 -04:00
Chris Mason
9aca1d5132 Btrfs: make some funcs static
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-13 11:09:37 -04:00
Chris Mason
234b63a091 rename funcs and structs to btrfs
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-13 10:46:10 -04:00
Chris Mason
cf27e1eec0 Btrfs: struct extent_item endian
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-13 09:49:06 -04:00
Chris Mason
1d4f8a0c1e Btrfs: node->blockptrs endian fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-13 09:28:32 -04:00
Chris Mason
0783fcfc4d Btrfs: struct item endian fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-12 20:12:07 -04:00
Chris Mason
e2fa7227cd Btrfs: struct key endian fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-12 16:22:34 -04:00
Chris Mason
7518a238ea Btrfs: get/set for struct header fields
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-12 12:01:18 -04:00
Chris Mason
83e15a28e0 fix leak in btrfs_drop_snapshot
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-12 09:03:27 -04:00
Chris Mason
20524f0226 Btrfs: recursion free-first pass
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-10 06:35:47 -05:00
Chris Mason
0579da4280 Btrfs: Fixup last found extent caching
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-07 16:15:30 -05:00
Chris Mason
037e639048 Btrfs: get rid of add recursion
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-07 11:50:24 -05:00
Chris Mason
a28ec19775 Btrfs: Fixup reference counting on cows
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-06 20:08:01 -05:00
Chris Mason
02217ed299 Btrfs: early reference counting
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-02 16:08:05 -05:00
Chris Mason
f0930a37f1 Btrfs: Fix extent code to use merge during delete
Remove implicit commit in del_item and insert_item
Add implicit commit to close()
Add commit op in random-test

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-03-02 09:47:58 -05:00
Chris Mason
0f70abe2b3 Btrfs: more return code checking
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-02-28 16:46:22 -05:00
Chris Mason
aa5d6bed25 Btrfs: return code checking
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-02-28 16:35:06 -05:00
Chris Mason
7cf75962ac Btrfs: u64 cleanups
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-02-26 10:55:01 -05:00
Chris Mason
fec577fb7f Btrfs: Add fsx-style randomized tree tester
Add debug-tree command to print the tree
Add extent-tree.c to the repo
Comment ctree.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-02-26 10:40:21 -05:00