2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

282 Commits

Author SHA1 Message Date
Christoph Hellwig
11f75b3bba xfs: update got in xfs_bmap_shift_update_extent
This way the caller gets the proper updated extent returned in got.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:28 -07:00
Christoph Hellwig
bf8062800a xfs: remove xfs_bmse_shift_one
Instead do the actual left and right shift work in the callers, and just
keep a helper to update the bmap and rmap btrees as well as the in-core
extent list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:28 -07:00
Christoph Hellwig
ecfea3f0c8 xfs: split xfs_bmap_shift_extents
Have a separate helper for insert vs collapse, as this prepares us for
simplifying the code in the next patches.

Also changed the done output argument to a bool intead of int for both
new functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:27 -07:00
Christoph Hellwig
6b18af0dfd xfs: remove XFS_BMAP_MAX_SHIFT_EXTENTS
The define was always set to 1, which means looping until we reach is
was dead code from the start.

Also remove an initialization of next_fsb for the done case that doesn't
fit the new code flow - it was never checked by the caller in the done
case to start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:27 -07:00
Christoph Hellwig
42b67dc6ff xfs: remove the never fully implemented UUID fork format
Remove the dead code dealing with the UUID fork format that was never
implemented in Linux (and neither in IRIX as far as I know).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:27 -07:00
Christoph Hellwig
e8e0e170e2 xfs: remove XFS_BMAP_TRACE_EXLIST
Instead of looping over all extents in some debug-only helper just
insert trace points into the loops that already exist in the calling
functions.

Also split the xfs_extlist trace point into one each for reading and
writing extents from disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:27 -07:00
Christoph Hellwig
ca5d8e5b7b xfs: move pre/post-bmap tracing into xfs_iext_update_extent
xfs_iext_update_extent already has basically all the information needed
to centralize the bmap pre/post tracing.  We just need to pass inode +
bmap state instead of the inode fork pointer to get all trace annotations.

In addition to covering all the existing trace points this gives us
tracing coverage for the extent shifting operations for free.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:27 -07:00
Christoph Hellwig
d138604fb1 xfs: remove post-bmap tracing in xfs_bmap_local_to_extents
Now that we use xfs_iext_insert this is already covered by the tracing
in that function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:27 -07:00
Christoph Hellwig
35e62da55f xfs: make better use of the 'state' variable in xfs_bmap_del_extent_real
We already have all the information about the fork a=D1=95 well as additional
tracing information, so pass that to xfs_iext_remove().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:26 -07:00
Christoph Hellwig
060ea65b39 xfs: add a xfs_bmap_fork_to_state helper
This creates the right initial bmap state from the passed in inode
fork enum.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:26 -07:00
Darrick J. Wong
f135761a73 xfs: refactor btree pointer checks
Refactor the btree pointer checks so that we can call them from the
scrub code without logging errors to dmesg.  Preserve the existing error
reporting for regular operations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-10-26 15:38:23 -07:00
Christoph Hellwig
b5cfbc2282 xfs: replace xfs_bmbt_lookup_ge with xfs_bmbt_lookup_first
We only use xfs_bmbt_lookup_ge to look up the first bmap record in an
inode, so replace xfs_bmbt_lookup_ge with a special purpose helper that
is a bit more descriptive.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:22 -07:00
Christoph Hellwig
e16cf9b03c xfs: pass a struct xfs_bmbt_irec to xfs_bmbt_lookup_eq
Now that we've massaged the callers into the right form we can always
pass the actual extent record instead of the individual fields.

As an additional benefit the btree cursor will now be prepoulated with
the correct extent state instead of having to fix it up later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:22 -07:00
Christoph Hellwig
a67d00a555 xfs: pass a struct xfs_bmbt_irec to xfs_bmbt_update
Now that we've massaged the callers into the right form we can always
pass the actual extent record instead of the individual fields.

With that xfs_bmbt_disk_set_allf can go away, and xfs_bmbt_disk_set_all
can be merged into the former implementation of xfs_bmbt_disk_set_allf.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:22 -07:00
Christoph Hellwig
79fa6143a9 xfs: refactor xfs_bmap_add_extent_unwritten_real
Use xfs_iext_get_extent to find, and xfs_iext_update_extent to update
entries in the in-core extent list.  This isolates the function from
the detailed layout of the extent list, and generally makes the code
a lot more readable.

Also get rid of the oldext and newext variables as using the extent
records is a lot more descriptive.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:22 -07:00
Christoph Hellwig
ca1862b083 xfs: refactor delalloc accounting in xfs_bmap_add_extent_delay_real
Account for all changes to the delalloc reservation in da_new, and use a
single call xfs_mod_fdblocks to reserve/free blocks, including always
checking for an error.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:21 -07:00
Christoph Hellwig
4dcb886987 xfs: refactor xfs_bmap_add_extent_delay_real
Use xfs_iext_get_extent to find, and xfs_iext_update_extent to update
entries in the in-core extent list.  This isolates the function from
the detailed layout of the extent list, and generally makes the code
a lot more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:21 -07:00
Christoph Hellwig
1abb9e5532 xfs: refactor xfs_bmap_add_extent_hole_real
Use xfs_iext_update_extent to update entries in the in-core extent list.
This isolates the function from the detailed layout of the extent list,
and generally makes the code a lot more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:21 -07:00
Christoph Hellwig
3ffc18ecd3 xfs: refactor xfs_bmap_add_extent_hole_delay
Use xfs_iext_get_extent to find, and xfs_iext_update_extent to update
entries in the in-core extent list.  This isolates the function from
the detailed layout of the extent list, and generally makes the code
a lot more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:21 -07:00
Christoph Hellwig
48fd52b16d xfs: refactor xfs_del_extent_real
Use xfs_iext_update_extent to update entries in the in-core extent list.
This isolates the function from the detailed layout of the extent list,
and generally makes the code a lot more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:21 -07:00
Christoph Hellwig
491f6f8abf xfs: use the state defines in xfs_bmap_del_extent_real
Use the same defines as the other extent add and delete helpers, which
both improves code readability and trace point output.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:21 -07:00
Christoph Hellwig
0173c689ff xfs: use correct state defines in xfs_bmap_del_extent_{cow,delay}
Use the _FILLING values to match the usage in the xfs_bmap_add_extent_*
helpers.  No change in behavior, just better naming in the code and
tracepoint output.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:21 -07:00
Christoph Hellwig
1b24b633aa xfs: move some more code into xfs_bmap_del_extent_real
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:21 -07:00
Christoph Hellwig
e1d7553faf xfs: use xfs_bmap_del_extent_delay for the data fork as well
And remove the delalloc code from xfs_bmap_del_extent, which gets renamed
to xfs_bmap_del_extent_real to fit the naming scheme used by the other
xfs_bmap_{add,del}_extent_* routines.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:20 -07:00
Christoph Hellwig
8280f6ed46 xfs: rename bno to end in __xfs_bunmapi
Rename the bno variable that's used as the end of the range in
__xfs_bunmapi to end, which better describes it.  Additionally change
the start variable which takes the initial value of bno to be the
function parameter itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:20 -07:00
Christoph Hellwig
b213d69293 xfs: don't set XFS_BTCUR_BPRV_WASDEL in xfs_bunmapi
The XFS_BTCUR_BPRV_WASDEL flag is supposed to indicate that we are
converting a delayed allocation to a real one, which isn't the case
in xfs_bunmapi.  Setting it could theoretically lead to misaccounting
here, but it's unlikely that we ever hit it in practice.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:20 -07:00
Christoph Hellwig
e3f0f7563e xfs: use xfs_iext_get_extent instead of open coding it
This avoids exposure to details of the extent list implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:20 -07:00
Christoph Hellwig
5e422f5e4f xfs: fix incorrect extent state in xfs_bmap_add_extent_unwritten_real
There was one spot in xfs_bmap_add_extent_unwritten_real that didn't use the
passed in new extent state but always converted to normal, leading to wrong
behavior when converting from normal to unwritten.

Only found by code inspection, it seems like this code path to move partial
extent from written to unwritten while merging it with the next extent is
rarely exercised.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-26 15:38:20 -07:00
Brian Foster
40214d128e xfs: trim writepage mapping to within eof
The writeback rework in commit fbcc025613 ("xfs: Introduce
writeback context for writepages") introduced a subtle change in
behavior with regard to the block mapping used across the
->writepages() sequence. The previous xfs_cluster_write() code would
only flush pages up to EOF at the time of the writepage, thus
ensuring that any pages due to file-extending writes would be
handled on a separate cycle and with a new, updated block mapping.

The updated code establishes a block mapping in xfs_writepage_map()
that could extend beyond EOF if the file has post-eof preallocation.
Because we now use the generic writeback infrastructure and pass the
cached mapping to each writepage call, there is no implicit EOF
limit in place. If eofblocks trimming occurs during ->writepages(),
any post-eof portion of the cached mapping becomes invalid. The
eofblocks code has no means to serialize against writeback because
there are no pages associated with post-eof blocks. Therefore if an
eofblocks trim occurs and is followed by a file-extending buffered
write, not only has the mapping become invalid, but we could end up
writing a page to disk based on the invalid mapping.

Consider the following sequence of events:

- A buffered write creates a delalloc extent and post-eof
  speculative preallocation.
- Writeback starts and on the first writepage cycle, the delalloc
  extent is converted to real blocks (including the post-eof blocks)
  and the mapping is cached.
- The file is closed and xfs_release() trims post-eof blocks. The
  cached writeback mapping is now invalid.
- Another buffered write appends the file with a delalloc extent.
- The concurrent writeback cycle picks up the just written page
  because the writeback range end is LLONG_MAX. xfs_writepage_map()
  attributes it to the (now invalid) cached mapping and writes the
  data to an incorrect location on disk (and where the file offset is
  still backed by a delalloc extent).

This problem is reproduced by xfstests test generic/464, which
triggers racing writes, appends, open/closes and writeback requests.

To address this problem, trim the mapping used during writeback to
within EOF when the mapping is validated. This ensures the mapping
is revalidated for any pages encountered beyond EOF as of the time
the current mapping was cached or last validated.

Reported-by: Eryu Guan <eguan@redhat.com>
Diagnosed-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-16 12:26:50 -07:00
Thomas Meyer
749f24f33e xfs: Fix bool initialization/comparison
Bool initializations should use true and false. Bool tests don't need
comparisons.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11 10:21:06 -07:00
Darrick J. Wong
5e5c943c1f xfs: revert "xfs: factor rmap btree size into the indlen calculations"
In commit fd26a88093 we added a worst case estimate for rmapbt blocks
needed to satisfy the block mapping request.  Since then, we added the
ability to reserve enough space in each AG such that we should never run
out of blocks to grow the rmapbt, which makes this calculation
unnecessary.  Revert the commit because it makes the extra delalloc
indlen accounting unnecessary and incorrect.

Reported-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-26 10:55:20 -07:00
Darrick J. Wong
7bf7a193a9 xfs: fix compiler warnings
Fix up all the compiler warnings that have crept in.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-09-02 08:22:19 -07:00
Darrick J. Wong
4cc1ee5e65 xfs: simplify the rmap code in xfs_bmse_merge
In Christoph's patch to refactor xfs_bmse_merge, the updated rmap code
does more work than it needs to (because map-extent auto-merges
records).  Remove the unnecessary unmap and save ourselves a deferred
op.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-09-01 13:08:26 -07:00
Christoph Hellwig
4c35445b59 xfs: use xfs_iext_*_extent helpers in xfs_bmap_split_extent_at
This abstracts the function away from details of the low-level extent
list implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:25 -07:00
Christoph Hellwig
4da6b514ea xfs: use xfs_iext_*_extent helpers in xfs_bmap_shift_extents
This abstracts the function away from details of the low-level extent
list implementation.

Note that it seems like the previous implementation of rmap for
the merge case was completely broken, but it no seems appear to
trigger that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:25 -07:00
Christoph Hellwig
05b7c8ab2b xfs: move some code around inside xfs_bmap_shift_extents
For the first right move we need to look up next_fsb.  That means
our last fsb that contains next_fsb must also be the current extent,
so take advantage of that by moving the code around a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:25 -07:00
Christoph Hellwig
f2285c148c xfs: use xfs_iext_get_extent in xfs_bmap_first_unused
Use the bmap abstraction instead of open-coding bmbt details here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
50bb44c286 xfs: switch xfs_bmap_local_to_extents to use xfs_iext_insert
Use the helper instead of open coding it, to provide a better abstraction
for the scalable extent list work.  This also gets an additional assert
and trace point for free.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
67e4e69cb2 xfs: add a xfs_iext_update_extent helper
This helper is used to update an extent record based on the extent index,
and can be used to provide a level of abstractions between callers that
want to modify in-core extent records and the details of the extent list
implementation.

Also switch all users of the xfs_bmbt_set_all(xfs_iext_get_ext(...))
pattern to this new helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
8ad7c629b1 xfs: remove the ip argument to xfs_defer_finish
And instead require callers to explicitly join the inode using
xfs_defer_ijoin.  Also consolidate the defer error handling in
a few places using a goto label.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
882d8785fb xfs: rename xfs_defer_join to xfs_defer_ijoin
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
5b094d6dac xfs: fix multi-AG deadlock in xfs_bunmapi
Just like in the allocator we must avoid touching multiple AGs out of
order when freeing blocks, as freeing still locks the AGF and can cause
the same AB-BA deadlocks as in the allocation path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-26 08:20:03 -07:00
Darrick J. Wong
4c1a67bd36 xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_write
We must initialize the firstfsb parameter to _bmapi_write so that it
doesn't incorrectly treat stack garbage as a restriction on which AGs
it can search for free space.

Fixes-coverity-id: 1402025
Fixes-coverity-id: 1415167
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-07-20 14:42:33 -07:00
Darrick J. Wong
9e24cfd044 xfs: remove unneeded parameter from XFS_TEST_ERROR
Since we moved the injected error frequency controls to the mountpoint,
we can get rid of the last argument to XFS_TEST_ERROR.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-27 18:23:20 -07:00
Darrick J. Wong
e1a4e37cc7 xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent
In a pathological scenario where we are trying to bunmapi a single
extent in which every other block is shared, it's possible that trying
to unmap the entire large extent in a single transaction can generate so
many EFIs that we overflow the transaction reservation.

Therefore, use a heuristic to guess at the number of blocks we can
safely unmap from a reflink file's data fork in an single transaction.
This should prevent problems such as the log head slamming into the tail
and ASSERTs that trigger because we've exceeded the transaction
reservation.

Note that since bunmapi can fail to unmap the entire range, we must also
teach the deferred unmap code to roll into a new transaction whenever we
get low on reservation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: random edits, all bugs are my fault]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-19 08:59:10 -07:00
Darrick J. Wong
6e747506dd xfs: fix warnings about unused stack variables
Reduce stack usage and get rid of compiler warnings by eliminating
unused variables.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-05-16 09:24:36 -07:00
Brian Foster
0daaecacb8 xfs: fix indlen accounting error on partial delalloc conversion
The delalloc -> real block conversion path uses an incorrect
calculation in the case where the middle part of a delalloc extent
is being converted. This is documented as a rare situation because
XFS generally attempts to maximize contiguity by converting as much
of a delalloc extent as possible.

If this situation does occur, the indlen reservation for the two new
delalloc extents left behind by the conversion of the middle range
is calculated and compared with the original reservation. If more
blocks are required, the delta is allocated from the global block
pool. This delta value can be characterized as the difference
between the new total requirement (temp + temp2) and the currently
available reservation minus those blocks that have already been
allocated (startblockval(PREV.br_startblock) - allocated).

The problem is that the current code does not account for previously
allocated blocks correctly. It subtracts the current allocation
count from the (new - old) delta rather than the old indlen
reservation. This means that more indlen blocks than have been
allocated end up stashed in the remaining extents and free space
accounting is broken as a result.

Fix up the calculation to subtract the allocated block count from
the original extent indlen and thus correctly allocate the
reservation delta based on the difference between the new total
requirement and the unused blocks from the original reservation.
Also remove a bogus assert that contradicts the fact that the new
indlen reservation can be larger than the original indlen
reservation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-16 09:24:35 -07:00
Christoph Hellwig
0c1d9e4a61 xfs: simplify validation of the unwritten extent bit
XFS only supports the unwritten extent bit in the data fork, and only if
the file system has a version 5 superblock or the unwritten extent
feature bit.

We currently have two routines that validate the invariant:
xfs_check_nostate_extents which return -EFSCORRUPTED when it's not met,
and xfs_validate_extent that triggers and assert in debug build.

Both of them iterate over all extents of an inode fork when called,
which isn't very efficient.

This patch instead adds a new helper that verifies the invariant one
extent at a time, and calls it from the places where we iterate over
all extents to converted them from or two the in-memory format.  The
callers then return -EFSCORRUPTED when reading invalid extents from
disk, or trigger an assert when writing them to disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Eric Sandeen
4f1adf3373 xfs: more do_div cleanups
On some architectures do_div does the pointer compare
trick to make sure that we've sent it an unsigned 64-bit
number.  (Why unsigned?  I don't know.)

Fix up the few places that squawk about this; in
xfs_bmap_wants_extents() we just used a bare int64_t so change
that to unsigned.

In xfs_adjust_extent_unmap_boundaries() all we wanted was the
mod, and we have an xfs-specific function to handle that w/o
side effects, which includes proper casting for do_div.

In xfs_daddr_to_ag[b]no, we were using the wrong type anyway;
XFS_BB_TO_FSBT returns a block in the filesystem, so use
xfs_rfsblock_t not xfs_daddr_t, and gain the unsignedness
from that type as a bonus.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Christoph Hellwig
7590632a33 xfs: remove bmap block allocation retries
Now that reflink operations don't set the firstblock value we don't
need the workarounds for non-NULL firstblock values without a prior
allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Christoph Hellwig
bf8eadbacb xfs: remove xfs_bmap_remap_alloc
The main thing that xfs_bmap_remap_alloc does is fixing the AGFL, similar
to what we do in the space allocator.  But the reflink code doesn't touch
the allocation btree unlike the normal space allocator, so we couldn't
care less about the state of the AGFL.

So remove xfs_bmap_remap_alloc and just handle the di_nblocks update in
the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Christoph Hellwig
6ebd5a4413 xfs: introduce xfs_bmapi_remap
Add a new helper to be used for reflink extent list additions instead of
funneling them through xfs_bmapi_write and overloading the firstblock
member in struct xfs_bmalloca and struct xfs_alloc_args.

With some small changes to xfs_bmap_remap_alloc this also means we do
not need a xfs_bmalloca structure for this case at all.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Christoph Hellwig
6d04558f9f xfs: pass individual arguments to xfs_bmap_add_extent_hole_real
For the reflink case we'd much rather pass the required arguments than
faking up a struct xfs_bmalloca.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Christoph Hellwig
39e07daa46 xfs: remove attr fork handling in xfs_bmap_finish_one
We never do COW operations for the attr fork, so don't pretend we handle
them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Christoph Hellwig
52813fb13f xfs: fix integer truncation in xfs_bmap_remap_alloc
bno should be a xfs_fsblock_t, which is 64-bit wides instead of a
xfs_aglock_t, which truncates the value to 32 bits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Christoph Hellwig
2fcc319d24 xfs: try any AG when allocating the first btree block when reflinking
When a reflink operation causes the bmap code to allocate a btree block
we're currently doing single-AG allocations due to having ->firstblock
set and then try any higher AG due a little reflink quirk we've put in
when adding the reflink code.  But given that we do not have a minleft
reservation of any kind in this AG we can still not have any space in
the same or higher AG even if the file system has enough free space.
To fix this use a XFS_ALLOCTYPE_FIRST_AG allocation in this fall back
path instead.

[And yes, we need to redo this properly instead of piling hacks over
 hacks.  I'm working on that, but it's not going to be a small series.
 In the meantime this fixes the customer reported issue]

Also add a warning for failing allocations to make it easier to debug.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-03-08 10:38:53 -08:00
Brian Foster
f65e6fad29 xfs: use iomap new flag for newly allocated delalloc blocks
Commit fa7f138 ("xfs: clear delalloc and cache on buffered write
failure") fixed one regression in the iomap error handling code and
exposed another. The fundamental problem is that if a buffered write
is a rewrite of preexisting delalloc blocks and the write fails, the
failure handling code can punch out preexisting blocks with valid
file data.

This was reproduced directly by sub-block writes in the LTP
kernel/syscalls/write/write03 test. A first 100 byte write allocates
a single block in a file. A subsequent 100 byte write fails and
punches out the block, including the data successfully written by
the previous write.

To address this problem, update the ->iomap_begin() handler to
distinguish newly allocated delalloc blocks from preexisting
delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the
->iomap_end() handler to decide when a failed or short write should
punch out delalloc blocks.

This introduces the subtle requirement that ->iomap_begin() should
never combine newly allocated delalloc blocks with existing blocks
in the resulting iomap descriptor. This can occur when a new
delalloc reservation merges with a neighboring extent that is part
of the current write, for example. Therefore, drop the
post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and
just return the record inserted into the fork. This ensures only new
blocks are returned and thus that preexisting delalloc blocks are
always handled as "found" blocks and not punched out on a failed
rewrite.

Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-03-08 09:58:08 -08:00
Christoph Hellwig
410d17f67e xfs: tune down agno asserts in the bmap code
In various places we currently assert that xfs_bmap_btalloc allocates
from the same as the firstblock value passed in, unless it's either
NULLAGNO or the dop_low flag is set.  But the reflink code does not
fully follow this convention as it passes in firstblock purely as
a hint for the allocator without actually having previous allocations
in the transaction, and without having a minleft check on the current
AG, leading to the assert firing on a very full and heavily used
file system.  As even the reflink code only allocates from equal or
higher AGs for now we can simply the check to always allow for equal
or higher AGs.

Note that we need to eventually split the two meanings of the firstblock
value.  At that point we can also allow the reflink code to allocate
from any AG instead of limiting it in any way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:20:39 -08:00
Brian Foster
75d65361cf xfs: split indlen reservations fairly when under reserved
Certain workoads that punch holes into speculative preallocation can
cause delalloc indirect reservation splits when the delalloc extent is
split in two. If further splits occur, an already short-handed extent
can be split into two in a manner that leaves zero indirect blocks for
one of the two new extents. This occurs because the shortage is large
enough that the xfs_bmap_split_indlen() algorithm completely drains the
requested indlen of one of the extents before it honors the existing
reservation.

This ultimately results in a warning from xfs_bmap_del_extent(). This
has been observed during file copies of large, sparse files using 'cp
--sparse=always.'

To avoid this problem, update xfs_bmap_split_indlen() to explicitly
apply the reservation shortage fairly between both extents. This smooths
out the overall indlen shortage and defers the situation where we end up
with a delalloc extent with zero indlen reservation to extreme
circumstances.

Reported-by: Patrick Dung <mpatdung@gmail.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:20:12 -08:00
Brian Foster
0e339ef855 xfs: handle indlen shortage on delalloc extent merge
When a delalloc extent is created, it can be merged with pre-existing,
contiguous, delalloc extents. When this occurs,
xfs_bmap_add_extent_hole_delay() merges the extents along with the
associated indirect block reservations. The expectation here is that the
combined worst case indlen reservation is always less than or equal to
the indlen reservation for the individual extents.

This is not always the case, however, as existing extents can less than
the expected indlen reservation if the extent was previously split due
to a hole punch. If a new extent merges with such an extent, the total
indlen requirement may be larger than the sum of the indlen reservations
held by both extents.

xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
reservation is always available and assigns it to the merged extent
without consideration for the indlen held by the pre-existing extent. As
a result, the subsequent xfs_mod_fdblocks() call can attempt an
unintentional allocation rather than a free (indicated by an ASSERT()
failure). Further, if the allocation happens to fail in this context,
the failure goes unhandled and creates a filesystem wide block
accounting inconsistency.

Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
indlen reservation assigned to the merged extent to the sum of the
indlen reservations held by each of the individual extents.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-16 17:20:12 -08:00
Christoph Hellwig
a14234c72b xfs: go straight to real allocations for direct I/O COW writes
When we allocate COW fork blocks for direct I/O writes we currently first
create a delayed allocation, and then convert it to a real allocation
once we've got the delayed one.

As there is no good reason for that this patch instead makes use call
xfs_bmapi_write from the COW allocation path.  The only interesting bits
are a few tweaks the low-level allocator to allow for this, most notably
the need to remove the call to xfs_bmap_extsize_align for the cowextsize
in xfs_bmap_btalloc - for the existing convert case it's a no-op, but
for the direct allocation case it would blow up our block reservation
way beyond what we reserved for the transaction.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-06 17:47:47 -08:00
Darrick J. Wong
05a630d76b xfs: allow unwritten extents in the CoW fork
In the data fork, we only allow extents to perform the following state
transitions:

delay -> real <-> unwritten

There's no way to move directly from a delalloc reservation to an
/unwritten/ allocated extent.  However, for the CoW fork we want to be
able to do the following to each extent:

delalloc -> unwritten -> written -> remapped to data fork

This will help us to avoid a race in the speculative CoW preallocation
code between a first thread that is allocating a CoW extent and a second
thread that is remapping part of a file after a write.  In order to do
this, however, we need two things: first, we have to be able to
transition from da to unwritten, and second the function that converts
between real and unwritten has to be made aware of the cow fork.  Do
both of those things.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-02-02 15:14:01 -08:00
Darrick J. Wong
d5a91baeb6 xfs: filter out obviously bad btree pointers
Don't let anybody load an obviously bad btree pointer.  Since the values
come from disk, we must return an error, not just ASSERT.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2017-02-02 15:13:58 -08:00
Eric Sandeen
b6f41e4482 xfs: remove boilerplate around xfs_btree_init_block
Now that xfs_btree_init_block_int is able to determine crc
status from the passed-in mp, we can determine the proper
magic as well if we are given a btree number, rather than
an explicit magic value.

Change xfs_btree_init_block[_int] callers to pass in the
btree number, and let xfs_btree_init_block_int use the
xfs_magics array via the xfs_btree_magic macro to determine
which magic value is needed.  This makes all of the
if (crc) / else stanzas identical, and the if/else can be
removed, leading to a single, common init_block call.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-30 16:32:24 -08:00
Eric Sandeen
f88ae46b09 xfs: glean crc status from mp not flags in xfs_btree_init_block_int
xfs_btree_init_block_int() can determine whether crcs are
in effect without the passed-in XFS_BTREE_CRC_BLOCKS flag;
the mp argument allows us to determine this from the
superblock.  Remove the flag from callers, and use
xfs_sb_version_hascrc(&mp->m_sb) internally instead.

This removes one difference between the if & else cases
in the callers.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-30 16:32:24 -08:00
Christoph Hellwig
493611ebd6 xfs: extsize hints are not unlikely in xfs_bmap_btalloc
With COW files they are the hotpath, just like for files with the
extent size hint attribute.  We really shouldn't micro-manage anything
but failure cases with unlikely.

Additionally Arnd Bergmann recently reported that one of these two
unlikely annotations causes link failures together with an upcoming
kernel instrumentation patch, so let's get rid of it ASAP.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-25 08:59:43 -08:00
Christoph Hellwig
d2b3964a07 xfs: fix COW writeback race
Due to the way how xfs_iomap_write_allocate tries to convert the whole
found extents from delalloc to real space we can run into a race
condition with multiple threads doing writes to this same extent.
For the non-COW case that is harmless as the only thing that can happen
is that we call xfs_bmapi_write on an extent that has already been
converted to a real allocation.  For COW writes where we move the extent
from the COW to the data fork after I/O completion the race is, however,
not quite as harmless.  In the worst case we are now calling
xfs_bmapi_write on a region that contains hole in the COW work, which
will trip up an assert in debug builds or lead to file system corruption
in non-debug builds.  This seems to be reproducible with workloads of
small O_DSYNC write, although so far I've not managed to come up with
a with an isolated reproducer.

The fix for the issue is relatively simple:  tell xfs_bmapi_write
that we are only asked to convert delayed allocations and skip holes
in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-23 10:55:07 -08:00
Christoph Hellwig
255c516278 xfs: fix bogus minleft manipulations
We can't just set minleft to 0 when we're low on space - that's exactly
what we need minleft for: to protect space in the AG for btree block
allocations when we are low on free space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-09 13:36:36 -08:00
Dave Chinner
a444d72e60 Merge branch 'xfs-4.10-misc-fixes-3' into for-next 2016-12-07 17:42:30 +11:00
Darrick J. Wong
0f352f8ee8 xfs: error out if trying to add attrs and anextents > 0
We shouldn't assert if somehow we end up trying to add an attr fork to
an inode that apparently already has attr extents because this is an
indication of on-disk corruption.  Instead, return an error code to
userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:38:11 +11:00
Darrick J. Wong
356a322522 xfs: complain if we don't get nextents bmap records
When reading into memory all extents of a btree-format inode fork,
complain if the number of extents we find is not the same as the number
of extents reported in the inode core.  This is needed to stop an IO
action from accessing the garbage areas of the in-core fork.

[dchinner: removed redundant assert]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:36:56 +11:00
Eric Sandeen
c44a1f2262 xfs: handle cow fork in xfs_bmap_trace_exlist
By inspection, xfs_bmap_trace_exlist isn't handling cow forks,
and will trace the data fork instead.

Fix this by setting state appropriately if whichfork
== XFS_COW_FORK.

()___()
< @ @ >
 |   |
 {o_o}
  (|)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:32:00 +11:00
Eric Sandeen
7710517fc3 xfs: pass state not whichfork to trace_xfs_extlist
When xfs_bmap_trace_exlist called trace_xfs_extlist,
it sent in the "whichfork" var instead of the bmap "state"
as expected (even though state was already set up for this
purpose).

As a result, the xfs_bmap_class in tracing code used
"whichfork" not state in xfs_iext_state_to_fork(), and got
the wrong ifork pointer.  It all goes downhill from
there, including an ASSERT when ifp_bytes is empty
by the time it reaches xfs_iext_get_ext():

XFS: Assertion failed: idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:31:50 +11:00
Dave Chinner
b7b26110ed Merge branch 'xfs-4.10-misc-fixes-2' into for-next 2016-11-28 15:06:03 +11:00
Brian Foster
974ae922ef xfs: track preallocation separately in xfs_bmapi_reserve_delalloc()
Speculative preallocation is currently processed entirely by the callers
of xfs_bmapi_reserve_delalloc(). The caller determines how much
preallocation to include, adjusts the extent length and passes down the
resulting request.

While this works fine for post-eof speculative preallocation, it is not
as reliable for COW fork preallocation. COW fork preallocation is
implemented via the cowextszhint, which aligns the start offset as well
as the length of the extent. Further, it is difficult for the caller to
accurately identify when preallocation occurs because the returned
extent could have been merged with neighboring extents in the fork.

To simplify this situation and facilitate further COW fork preallocation
enhancements, update xfs_bmapi_reserve_delalloc() to take a separate
preallocation parameter to incorporate into the allocation request. The
preallocation blocks value is tacked onto the end of the request and
adjusted to accommodate neighboring extents and extent size limits.
Since xfs_bmapi_reserve_delalloc() now knows precisely how much
preallocation was included in the allocation, it can also tag the inodes
appropriately to support preallocation reclaim.

Note that xfs_bmapi_reserve_delalloc() callers are not yet updated to
use the preallocation mechanism. This patch should not change behavior
outside of correctly tagging reflink inodes when start offset
preallocation occurs (which the caller does not handle correctly).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-28 14:57:42 +11:00
Darrick J. Wong
fd26a88093 xfs: factor rmap btree size into the indlen calculations
When we're estimating the amount of space it's going to take to satisfy
a delalloc reservation, we need to include the space that we might need
to grow the rmapbt.  This helps us to avoid running out of space later
when _iomap_write_allocate needs more space than we reserved.  Eryu Guan
observed this happening on generic/224 when sunit/swidth were set.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-28 14:57:42 +11:00
Dave Chinner
ed24bee6f2 Merge branch 'xfs-4.10-extent-lookup' into for-next 2016-11-24 11:41:59 +11:00
Christoph Hellwig
0e8d630ba0 xfs: remove NULLEXTNUM
We only ever set a field to this constant for an impossible to reach
error case in xfs_bmap_search_extents.  That functions has been removed,
so we can remove the constant as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:40:32 +11:00
Christoph Hellwig
6edc977f77 xfs: remove xfs_bmap_search_extents
Now that all users are gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:40:14 +11:00
Christoph Hellwig
65c5f41978 xfs: remove prev argument to xfs_bmapi_reserve_delalloc
We can easily lookup the previous extent for the cases where we need it,
which saves the callers from looking it up for us later in the series.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:44 +11:00
Christoph Hellwig
7efc794561 xfs: use new extent lookup helpers in __xfs_bunmapi
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:44 +11:00
Christoph Hellwig
2d58f6ef79 xfs: use new extent lookup helpers in xfs_bmapi_write
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:43 +11:00
Christoph Hellwig
334f3423d6 xfs: use new extent lookup helpers in xfs_bmapi_read
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:43 +11:00
Christoph Hellwig
86685f7ba5 xfs: cleanup xfs_bmap_last_before
Rewrite the function using xfs_iext_lookup_extent and xfs_iext_get_extent,
and massage the flow into something easily understandable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:38 +11:00
Dave Chinner
0fc204e2eb Merge branch 'xfs-4.10-misc-fixes-1' into for-next 2016-11-10 10:29:43 +11:00
Eric Sandeen
5d829300be xfs: provide helper for counting extents from if_bytes
The open-coded pattern:

ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)

is all over the xfs code; provide a new helper
xfs_iext_count(ifp) to count the number of inline extents
in an inode fork.

[dchinner: pick up several missed conversions]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 12:59:42 +11:00
Darrick J. Wong
4fd29ec472 xfs: check return value of _trans_reserve_quota_nblks
Check the return value of xfs_trans_reserve_quota_nblks for errors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:59:26 +11:00
Christoph Hellwig
64e6428ddd xfs: remove xfs_bunmapi_cow
Since no one uses it anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:54:59 +11:00
Christoph Hellwig
fa5c836ca8 xfs: refactor xfs_bunmapi_cow
Split out two helpers for deleting delayed or real extents from the COW fork.
This allows to call them directly from xfs_reflink_cow_end_io once that
function is refactored to iterate the extent tree.  It will also allow
to reuse the delalloc deletion from xfs_bunmapi in the future.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:54:14 +11:00
Darrick J. Wong
0a0af28cad xfs: add xfs_trim_extent
This helpers allows to trim an extent to a subset of it's original range
while making sure the block numbers in it remain valid,

In the future xfs_trim_extent and xfs_bmapi_trim_map should probably be
merged in some form.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: split from a previous patch from Darrick, moved around and added
 support for "raw" delayed extents"]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:51:50 +11:00
Eric Sandeen
fe23759eaf xfs: remove pointless error goto in xfs_bmap_remap_alloc
The commit:

f65306ea xfs: map an inode's offset to an exact physical block

added a pointless error0: target; remove it.

Addresses-Coverity-Id: 1373865
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:44:53 +11:00
Colin Ian King
1d55a4bfd0 xfs: remove redundant assignment of ifp
Remove redundant ifp = ifp statement, it does nothing. Found with
static analysis by CoverityScan.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:40:55 +11:00
Darrick J. Wong
90e2056d76 xfs: try other AGs to allocate a BMBT block
Prior to the introduction of reflink, allocating a block and mapping
it into a file was performed in a single transaction with a single
block reservation, and the allocator was supposed to find enough
blocks to allocate the extent and any BMBT blocks that might be
necessary (unless we're low on space).

However, due to the way copy on write works, allocation and mapping
have been split into two transactions, which means that we must be
able to handle the case where we allocate an extent for CoW but that
AG runs out of free space before the blocks can be mapped into a file,
and the mapping requires a new BMBT block.  When this happens, look in
one of the other AGs for a BMBT block instead of taking the FS down.

The same applies to the functions that convert a data fork to extents
and later btree format.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:28 -07:00
Darrick J. Wong
f7ca352272 xfs: create a separate cow extent size hint for the allocator
Create a per-inode extent size allocator hint for copy-on-write.  This
hint is separate from the existing extent size hint so that CoW can
take advantage of the fragmentation-reducing properties of extent size
hints without disabling delalloc for regular writes.

The extent size hint that's fed to the allocator during a copy on
write operation is the greater of the cowextsize and regular extsize
hint.

During reflink, if we're sharing the entire source file to the entire
destination file and the destination file doesn't already have a
cowextsize hint, propagate the source file's cowextsize hint to the
destination file.

Furthermore, zero the bulkstat buffer prior to setting the fields
so that we don't copy kernel memory contents into userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:26 -07:00
Darrick J. Wong
174edb0e46 xfs: store in-progress CoW allocations in the refcount btree
Due to the way the CoW algorithm in XFS works, there's an interval
during which blocks allocated to handle a CoW can be lost -- if the FS
goes down after the blocks are allocated but before the block
remapping takes place.  This is exacerbated by the cowextsz hint --
allocated reservations can sit around for a while, waiting to get
used.

Since the refcount btree doesn't normally store records with refcount
of 1, we can use it to record these in-progress extents.  In-progress
blocks cannot be shared because they're not user-visible, so there
shouldn't be any conflicts with other programs.  This is a better
solution than holding EFIs during writeback because (a) EFIs can't be
relogged currently, (b) even if they could, EFIs are bound by
available log space, which puts an unnecessary upper bound on how much
CoW we can have in flight, and (c) we already have a mechanism to
track blocks.

At mount time, read the refcount records and free anything we find
with a refcount of 1 because those were in-progress when the FS went
down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:05 -07:00
Darrick J. Wong
4862cfe825 xfs: support removing extents from CoW fork
Create a helper method to remove extents from the CoW fork without
any of the side effects (rmapbt/bmbt updates) of the regular extent
deletion routine.  We'll eventually use this to clear out the CoW fork
during ioend processing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 13:55:40 -07:00
Darrick J. Wong
60b4984fc3 xfs: support allocating delayed extents in CoW fork
Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
allocation extents in the CoW fork to real allocations, and wire this
up all the way back to xfs_iomap_write_allocate().  In a subsequent
patch, we'll modify the writepage handler to call this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:41 -07:00
Darrick J. Wong
be51f8119c xfs: support bmapping delalloc extents in the CoW fork
Allow the creation of delayed allocation extents in the CoW fork.  In
a subsequent patch we'll wire up iomap_begin to actually do this via
reflink helper functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:40 -07:00
Darrick J. Wong
3993baeb3c xfs: introduce the CoW fork
Introduce a new in-core fork for storing copy-on-write delalloc
reservations and allocated extents that are in the process of being
written out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:40 -07:00
Darrick J. Wong
4453593be6 xfs: return work remaining at the end of a bunmapi operation
Return the range of file blocks that bunmapi didn't free.  This hint
is used by CoW and reflink to figure out what part of an extent
actually got freed so that it can set up the appropriate atomic
remapping of just the freed range.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:39 -07:00
Darrick J. Wong
9f3afb57d5 xfs: implement deferred bmbt map/unmap operations
Implement deferred versions of the inode block map/unmap functions.
These will be used in subsequent patches to make reflink operations
atomic.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 11:05:44 -07:00
Darrick J. Wong
4847acf868 xfs: pass bmapi flags through to bmap_del_extent
Pass BMAPI_ flags from bunmapi into bmap_del_extent and extend
BMAPI_REMAP (which means "don't touch the allocator or the quota
accounting") to apply to bunmapi as well.  This will be used to
implement the unmap operation, which will be used by swapext.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 11:05:44 -07:00
Darrick J. Wong
f65306ea52 xfs: map an inode's offset to an exact physical block
Teach the bmap routine to know how to map a range of file blocks to a
specific range of physical blocks, instead of simply allocating fresh
blocks.  This enables reflink to map a file to blocks that are already
in use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 11:05:44 -07:00
Darrick J. Wong
62aab20f08 xfs: adjust refcount when unmapping file blocks
When we're unmapping blocks from a reflinked file, decrease the
refcount of the affected blocks and free the extents that are no
longer in use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:23 -07:00
Dave Chinner
155cd433b5 Merge branch 'xfs-4.9-log-recovery-fixes' into for-next 2016-10-03 09:56:28 +11:00
Dave Chinner
a89b3f97bb Merge branch 'xfs-4.9-delalloc-rework' into for-next 2016-10-03 09:52:51 +11:00
Dave Chinner
292378edcb xfs: remote attribute blocks aren't really userdata
When adding a new remote attribute, we write the attribute to the
new extent before the allocation transaction is committed. This
means we cannot reuse busy extents as that violates crash
consistency semantics. Hence we currently treat remote attribute
extent allocation like userdata because it has the same overwrite
ordering constraints as userdata.

Unfortunately, this also allows the allocator to incorrectly apply
extent size hints to the remote attribute extent allocation. This
results in interesting failures, such as transaction block
reservation overruns and in-memory inode attribute fork corruption.

To fix this, we need to separate the busy extent reuse configuration
from the userdata configuration. This changes the definition of
XFS_BMAPI_METADATA slightly - it now means that allocation is
metadata and reuse of busy extents is acceptible due to the metadata
ordering semantics of the journal. If this flag is not set, it
means the allocation is that has unordered data writeback, and hence
busy extent reuse is not allowed. It no longer implies the
allocation is for user data, just that the data write will not be
strictly ordered. This matches the semantics for both user data
and remote attribute block allocation.

As such, This patch changes the "userdata" field to a "datatype"
field, and adds a "no busy reuse" flag to the field.
When we detect an unordered data extent allocation, we immediately set
the no reuse flag. We then set the "user data" flags based on the
inode fork we are allocating the extent to. Hence we only set
userdata flags on data fork allocations now and consider attribute
fork remote extents to be an unordered metadata extent.

The result is that remote attribute extents now have the expected
allocation semantics, and the data fork allocation behaviour is
completely unchanged.

It should be noted that there may be other ways to fix this (e.g.
use ordered metadata buffers for the remote attribute extent data
write) but they are more invasive and difficult to validate both
from a design and implementation POV. Hence this patch takes the
simple, obvious route to fixing the problem...

Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 08:21:28 +10:00
Christoph Hellwig
51446f5ba4 xfs: rewrite and optimize the delalloc write path
Currently xfs_iomap_write_delay does up to lookups in the inode
extent tree, which is rather costly especially with the new iomap
based write path and small write sizes.

But it turns out that the low-level xfs_bmap_search_extents gives us
all the information we need in the regular delalloc buffered write
path:

 - it will return us an extent covering the block we are looking up
   if it exists.  In that case we can simply return that extent to
   the caller and are done
 - it will tell us if we are beyoned the last current allocated
   block with an eof return parameter.  In that case we can create a
   delalloc reservation and use the also returned information about
   the last extent in the file as the hint to size our delalloc
   reservation.
 - it can tell us that we are writing into a hole, but that there is
   an extent beyoned this hole.  In this case we can create a
   delalloc reservation that covers the requested size (possible
   capped to the next existing allocation).

All that can be done in one single routine instead of bouncing up
and down a few layers.  This reduced the CPU overhead of the block
mapping routines and also simplified the code a lot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:10:21 +10:00
Darrick J. Wong
3fd129b63f xfs: set up per-AG free space reservations
One unfortunate quirk of the reference count and reverse mapping
btrees -- they can expand in size when blocks are written to *other*
allocation groups if, say, one large extent becomes a lot of tiny
extents.  Since we don't want to start throwing errors in the middle
of CoWing, we need to reserve some blocks to handle future expansion.
The transaction block reservation counters aren't sufficient here
because we have to have a reserve of blocks in every AG, not just
somewhere in the filesystem.

Therefore, create two per-AG block reservation pools.  One feeds the
AGFL so that rmapbt expansion always succeeds, and the other feeds all
other metadata so that refcountbt expansion never fails.

Use the count of how many reserved blocks we need to have on hand to
create a virtual reservation in the AG.  Through selective clamping of
the maximum length of allocation requests and of the length of the
longest free extent, we can make it look like there's less free space
in the AG unless the reservation owner is asking for blocks.

In other words, play some accounting tricks in-core to make sure that
we always have blocks available.  On the plus side, there's nothing to
clean up if we crash, which is contrast to the strategy that the rough
draft used (actually removing extents from the freespace btrees).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 10:30:52 +10:00
Darrick J. Wong
9c19464469 xfs: propagate bmap updates to rmapbt
When we map, unmap, or convert an extent in a file's data or attr
fork, schedule a respective update in the rmapbt.  Previous versions
of this patch required a 1:1 correspondence between bmap and rmap,
but this is no longer true as we now have ability to make interval
queries against the rmapbt.

We use the deferred operations code to handle redo operations
atomically and deadlock free.  This plumbs in all five rmap actions
(map, unmap, convert extent, alloc, free); we'll use the first three
now for file data, and reflink will want the last two.  We also add
an error injection site to test log recovery.

Finally, we need to fix the bmap shift extent code to adjust the
rmaps correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 12:16:05 +10:00
Darrick J. Wong
525488520a xfs: rmap btree requires more reserved free space
Originally-From: Dave Chinner <dchinner@redhat.com>

The rmap btree is allocated from the AGFL, which means we have to
ensure ENOSPC is reported to userspace before we run out of free
space in each AG. The last allocation in an AG can cause a full
height rmap btree split, and that means we have to reserve at least
this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
Update the various space calculation functions to handle this.

Also, because the macros are now executing conditional code and are
called quite frequently, convert them to functions that initialise
variables in the struct xfs_mount, use the new variables everywhere
and document the calculations better.

[darrick.wong@oracle.com: don't reserve blocks if !rmap]
[dchinner@redhat.com: update m_ag_max_usable after growfs]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:38:24 +10:00
Darrick J. Wong
340785cca1 xfs: add owner field to extent allocation and freeing
For the rmap btree to work, we have to feed the extent owner
information to the the allocation and freeing functions. This
information is what will end up in the rmap btree that tracks
allocated extents. While we technically don't need the owner
information when freeing extents, passing it allows us to validate
that the extent we are removing from the rmap btree actually
belonged to the owner we expected it to belong to.

We also define a special set of owner values for internal metadata
that would otherwise have no owner. This allows us to tell the
difference between metadata owned by different per-ag btrees, as
well as static fs metadata (e.g. AG headers) and internal journal
blocks.

There are also a couple of special cases we need to take care of -
during EFI recovery, we don't actually know who the original owner
was, so we need to pass a wildcard to indicate that we aren't
checking the owner for validity. We also need special handling in
growfs, as we "free" the space in the last AG when extending it, but
because it's new space it has no actual owner...

While touching the xfs_bmap_add_free() function, re-order the
parameters to put the struct xfs_mount first.

Extend the owner field to include both the owner type and some sort
of index within the owner.  The index field will be used to support
reverse mappings when reflink is enabled.

When we're freeing extents from an EFI, we don't have the owner
information available (rmap updates have their own redo items).
xfs_free_extent therefore doesn't need to do an rmap update. Make
sure that the log replay code signals this correctly.

This is based upon a patch originally from Dave Chinner. It has been
extended to add more owner information with the intent of helping
recovery operations when things go wrong (e.g. offset of user data
block in a file).

[dchinner: de-shout the xfs_rmap_*_owner helpers]
[darrick: minor style fixes suggested by Christoph Hellwig]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:33:42 +10:00
Darrick J. Wong
ba9e780246 xfs: add tracepoints and error injection for deferred extent freeing
Add a couple of tracepoints for the deferred extent free operation and
a site for injecting errors while finishing the operation.  This makes
it easier to debug deferred ops and test log redo.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:26:33 +10:00
Darrick J. Wong
2c3234d1ef xfs: rename flist/free_list to dfops
Mechanical change of flist/free_list to dfops, since they're now
deferred ops, not just a freeing list.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:19:29 +10:00
Darrick J. Wong
310a75a3c6 xfs: change xfs_bmap_{finish,cancel,init,free} -> xfs_defer_*
Drop the compatibility shims that we were using to integrate the new
deferred operation mechanism into the existing code.  No new code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:18:10 +10:00
Darrick J. Wong
3ab78df2a5 xfs: rework xfs_bmap_free callers to use xfs_defer_ops
Restructure everything that used xfs_bmap_free to use xfs_defer_ops
instead.  For now we'll just remove the old symbols and play some
cpp magic to make it work; in the next patch we'll actually rename
everything.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:15:38 +10:00
Darrick J. Wong
f4a0660de3 xfs: fix locking of the rt bitmap/summary inodes
When we're deleting realtime extents, we need to lock the summary
inode in case we need to update the summary info to prevent an assert
on the rsumip inode lock on a debug kernel.  While we're at it, fix
the locking annotations so that we avoid triggering lockdep warnings.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 11:00:42 +10:00
Darrick J. Wong
e66a4c678e xfs: convert list of extents to free into a regular list
In struct xfs_bmap_free, convert the open-coded free extent list to
a regular list, then use list_sort to sort it prior to processing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 11:53:28 +10:00
Darrick J. Wong
59bad075bd xfs: rearrange xfs_bmap_add_free parameters
This is already in xfsprogs' libxfs, so port it to the kernel.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 11:53:28 +10:00
Linus Torvalds
0b9210c9c8 xfs: update for 4.7-rc1
Changes in this update:
 o fixes for mount line parsing, sparse warnings, read-only compat
   feature remount behaviour
 o allow fast path symlink lookups for inline symlinks.
 o attribute listing cleanups
 o writeback goes direct to bios rather than indirecting through
   bufferheads
 o transaction allocation cleanup
 o optimised kmem_realloc
 o added configurable error handling for metadata write errors,
   changed default error handling behaviour from "retry forever" to
   "retry until unmount then fail"
 o fixed several inode cluster writeback lookup vs reclaim race
   conditions
 o fixed inode cluster writeback checking wrong inode after lookup
 o fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
 o cleaned up inode reclaim tagging
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXRo8LAAoJEK3oKUf0dfodxLgP+wQMd46i/nCncr6jMcdoVXfL
 rE6cL1LJWWVOWzax/bmdlV1VNXqqW7n0ABAVMqikzqSEd+fBQe/HOkdBeVUywu7o
 bmqgNxuofMqHaiYhiTvUijHLHWLFxIgd/jNT7L5oGRzZdmP260VGf3EPipN7aA9U
 Y3KVhFQCqohpeIUeSV4Z/eIDdHN5LyUI1s+7zMLquHKCWyO4aO4GBX8YlyXdRRVe
 cwCZb4TBryS0PBCIra31MZ5wBRwLx8PBXqcNsnTQSR5Uu+WeuwxofXz5q3kdmNOU
 XGTWiabQbcvaC4twrzqnErHEX41PAs43tWxsI/qJH49QIFdfOYM+t8ERhEa2Q3DW
 Ihl+Q/2qiOuZZterG/t5MrxhybrmQhEFVJT6Ib8b/CnwpRm+K8kWTead1YJL8Xzd
 F9k8e57BCgTbDA7jWxWDbp7eQ1/4KglBD4sefFPpsuFgO882mmo5GmymALGjmitw
 JH1v3HL3PeTkQoHfcay8ZM/zNjX643CXHwCWYEOAgD+e77TPiOs/cHLZaXbrBkLK
 PpSJNfYiBe31eeSOEGsxivMapLpus+cHZyK3uR+XU+naJhjOdaBDTTo8RsAD9jS5
 C/dzxc4l7o+gYT+UjV5KtyfEeVwtGo5mtR9XozPbNDjNQor8Vo6NQMZXMXpFYDZI
 2XgzVNpkEf/74kexdEzV
 =0tYo
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
 "A pretty average collection of fixes, cleanups and improvements in
  this request.

  Summary:
   - fixes for mount line parsing, sparse warnings, read-only compat
     feature remount behaviour
   - allow fast path symlink lookups for inline symlinks.
   - attribute listing cleanups
   - writeback goes direct to bios rather than indirecting through
     bufferheads
   - transaction allocation cleanup
   - optimised kmem_realloc
   - added configurable error handling for metadata write errors,
     changed default error handling behaviour from "retry forever" to
     "retry until unmount then fail"
   - fixed several inode cluster writeback lookup vs reclaim race
     conditions
   - fixed inode cluster writeback checking wrong inode after lookup
   - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
   - cleaned up inode reclaim tagging"

* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
  xfs: fix warning in xfs_finish_page_writeback for non-debug builds
  xfs: move reclaim tagging functions
  xfs: simplify inode reclaim tagging interfaces
  xfs: rename variables in xfs_iflush_cluster for clarity
  xfs: xfs_iflush_cluster has range issues
  xfs: mark reclaimed inodes invalid earlier
  xfs: xfs_inode_free() isn't RCU safe
  xfs: optimise xfs_iext_destroy
  xfs: skip stale inodes in xfs_iflush_cluster
  xfs: fix inode validity check in xfs_iflush_cluster
  xfs: xfs_iflush_cluster fails to abort on error
  xfs: remove xfs_fs_evict_inode()
  xfs: add "fail at unmount" error handling configuration
  xfs: add configuration handlers for specific errors
  xfs: add configuration of error failure speed
  xfs: introduce table-based init for error behaviors
  xfs: add configurable error support to metadata buffers
  xfs: introduce metadata IO error class
  xfs: configurable error behavior via sysfs
  xfs: buffer ->bi_end_io function requires irq-safe lock
  ...
2016-05-26 10:13:40 -07:00
Christoph Hellwig
253f4911f2 xfs: better xfs_trans_alloc interface
Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
that returns a transaction with all the required log and block reservations,
and which allows passing transaction flags directly to avoid the cumbersome
_xfs_trans_alloc interface.

While we're at it we also get rid of the transaction type argument that has
been superflous since we stopped supporting the non-CIL logging mode.  The
guts of it will be removed in another patch.

[dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 09:19:55 +10:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Dave Chinner
2cdb958aba Merge branch 'xfs-misc-fixes-4.6-4' into for-next 2016-03-15 11:44:35 +11:00
Brian Foster
d34999c97a xfs: borrow indirect blocks from freed extent when available
xfs_bmap_del_extent() handles extent removal from the in-core and
on-disk extent lists. When removing a delalloc range, it updates the
indirect block reservation appropriately based on the removal. It
currently enforces that the new indirect block reservation is less than
or equal to the original. This is normally the case in all situations
except for in certain cases when the removed range creates a hole in a
single delalloc extent, thus splitting a single delalloc extent in two.

It is possible with small enough extents to split an indlen==1 extent
into two such slightly smaller extents. This leaves one extent with 0
indirect blocks and leads to assert failures in other areas (e.g.,
xfs_bunmapi() if the extent happens to be removed).

Update the indlen distribution code to steal blocks from the deleted
extent, if necessary, to satisfy the worst case total indirect
reservation for the new extents. This is safe as the caller does not
update the fdblocks counters until the extent is removed. Blocks stolen
in this manner simply remain accounted as allocated, having ownership
transferred from the data extent to an indirect reservation.

As a precaution, fall back to the original reservation algorithm if the
new indlen requirement is not met and warn if we end up with extents
without any reservation at all to detect this more easily in the future.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 11:42:47 +11:00
Brian Foster
a9bd24ac2b xfs: refactor delalloc indlen reservation split into helper
The delayed allocation indirect reservation splitting code is not
sufficient in some cases where a delalloc extent is split in two. In
preparation for enhancements to this code, refactor the current indlen
distribution algorithm into a new helper function.

[dchinner: rename temp, temp2 variables]

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 11:42:46 +11:00
Brian Foster
b2706a05ba xfs: update freeblocks counter after extent deletion
xfs_bunmapi() currently updates the fdblocks counter, unreserves quota,
etc. before the extent is deleted by xfs_bmap_del_extent(). The function
has problems dividing up the indirect reserved blocks for scenarios
where a single delalloc extent is split in two. Particularly, there
aren't always enough blocks reserved for multiple extents in a single
extent reservation.

The solution to this problem is to allow the extent removal code to
steal from the deleted extent to meet indirect reservation requirements.
Move the block of code in xfs_bmapi() that updates the fdblocks counter
to after the call to xfs_bmap_del_extent() to allow the codepath to
update the extent record before the free blocks are accounted. Also,
reshuffle the code slightly so the delalloc accounting occurs near the
xfs_bmap_del_extent() call to provide context for the comments.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 11:42:46 +11:00
Dave Chinner
ab9d1e4f7b Merge branch 'xfs-misc-fixes-4.6-3' into for-next 2016-03-09 08:18:30 +11:00
Luis de Bethencourt
a5fd276bdc xfs: remove impossible condition
bp_release is set to 0 just before the breakpoint of the for loop before
the conditional check (in line 458). The other breakpoint is a goto that
skips the dead code.

Addresses-Coverity-Id: 102338

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-09 08:17:56 +11:00
Dave Chinner
3c1a79f5ff Merge branch 'xfs-misc-fixes-4.6-2' into for-next 2016-03-07 09:34:54 +11:00
Christoph Hellwig
a7e5d03ba8 xfs: remove xfs_trans_get_block_res
Just use the t_blk_res field directly instead of obsfucating the reference
by a macro.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:58:21 +11:00
Dave Chinner
c19b3b05ae xfs: mode di_mode to vfs inode
Move the di_mode value from the xfs_icdinode to the VFS inode, reducing
the xfs_icdinode byte another 2 bytes and collapsing another 2 byte hole
in the structure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 16:54:58 +11:00
Eric Sandeen
f6106efae5 xfs: eliminate committed arg from xfs_bmap_finish
Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
associated comments were replicated several times across
the attribute code, all dealing with what to do if the
transaction was or wasn't committed.

And in that replicated code, an ASSERT() test of an
uninitialized variable occurs in several locations:

	error = xfs_attr_thing(&args);
	if (!error) {
		error = xfs_bmap_finish(&args.trans, args.flist,
					&committed);
	}
	if (error) {
		ASSERT(committed);

If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
never set "committed", and then test it in the ASSERT.

Fix this up by moving the committed state internal to xfs_bmap_finish,
and add a new inode argument.  If an inode is passed in, it is passed
through to __xfs_trans_roll() and joined to the transaction there if
the transaction was committed.

xfs_qm_dqalloc() was a little unique in that it called bjoin rather
than ijoin, but as Dave points out we can detect the committed state
but checking whether (*tpp != tp).

Addresses-Coverity-Id: 102360
Addresses-Coverity-Id: 102361
Addresses-Coverity-Id: 102363
Addresses-Coverity-Id: 102364
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-11 11:34:01 +11:00
Dave Chinner
e35438196c xfs: bmapbt checking on debug kernels too expensive
For large sparse or fragmented files, checking every single entry in
the bmapbt on every operation is prohibitively expensive. Especially
as such checks rarely discover problems during normal operations on
high extent coutn files. Our regression tests don't tend to exercise
files with hundreds of thousands to millions of extents, so mostly
this isn't noticed.

However, trying to run things like xfs_mdrestore of large filesystem
dumps on a debug kernel quickly becomes impossible as the CPU is
completely burnt up repeatedly walking the sparse file bmapbt that
is generated for every allocation that is made.

Hence, if the file has more than 10,000 extents, just don't bother
with walking the tree to check it exhaustively. The btree code has
checks that ensure that the newly inserted/removed/modified record
is correctly ordered, so the entrie tree walk in thses cases has
limited additional value.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-08 11:28:49 +11:00
Darrick J. Wong
6d3eb1eca0 libxfs: use a convenience variable instead of open-coding the fork
Use a convenience variable instead of open-coding the inode fork.
This isn't really needed for now, but will become important when we
add the copy-on-write fork later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04 16:12:42 +11:00
Eric Sandeen
f1f96c4946 xfs: get mp from bma->ip in xfs_bmap code
In my earlier commit

  c29aad4 xfs: pass mp to XFS_WANT_CORRUPTED_GOTO

I added some local mp variables with code which indicates that
mp might be NULL.  Coverity doesn't like this now, because the
updated per-fs XFS_STATS macros dereference mp.

I don't think this is actually a problem; from what I can tell,
we cannot get to these functions with a null bma->tp, so my NULL
check was probably pointless.  Still, it's not super obvious.

So switch this code to get mp from the inode on the xfs_bmalloca
structure, with no conditional, because the functions are already
using bmap->ip directly.

Addresses-Coverity-Id: 1339552
Addresses-Coverity-Id: 1339553
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04 16:10:42 +11:00
Dave Chinner
264e89ad34 Merge branch 'xfs-dax-updates' into for-next 2015-11-03 13:28:41 +11:00
Dave Chinner
3fbbbea34b xfs: introduce BMAPI_ZERO for allocating zeroed extents
To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.

Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.

While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can per-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:27:22 +11:00
Dave Chinner
316433beda Merge branch 'xfs-logging-fixes' into for-next 2015-10-12 18:37:58 +11:00
Bill O'Donnell
ff6d6af235 xfs: per-filesystem stats counter implementation
This patch modifies the stats counting macros and the callers
to those macros to properly increment, decrement, and add-to
the xfs stats counts. The counts for global and per-fs stats
are correctly advanced, and cleared by writing a "1" to the
corresponding clear file.

global counts: /sys/fs/xfs/stats/stats
per-fs counts: /sys/fs/xfs/sda*/stats/stats

global clear:  /sys/fs/xfs/stats/stats_clear
per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear

[dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
 stats structures and macros. ]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 18:21:22 +11:00
Brian Foster
b7cdc66be5 xfs: log local to remote symlink conversions correctly on v5 supers
A local format symlink inode is converted to extent format when an
extended attribute is set on an inode as part of the attribute fork
creation. This means a block is allocated, the local symlink target name
is copied to the block and the block is logged. Currently,
xfs_bmap_local_to_extents() handles logging the remote block data based
on the size of the data fork prior to the conversion. This is not
correct on v5 superblock filesystems, which add an additional header to
remote symlink blocks that is nonexistent in local format inodes.

As a result, the full length of the remote symlink block content is not
logged. This can lead to corruption should a crash occur and log
recovery replay this transaction.

Since a callout is already used to initialize the new remote symlink
block, update the local-to-extents conversion mechanism to make the
callout also responsible for logging the block. It is already required
to set the log buffer type and format the block appropriately based on
the superblock version. This ensures the remote symlink is always logged
correctly. Note that xfs_bmap_local_to_extents() is only called for
symlinks so there are no other callouts that require modification.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 15:40:24 +11:00
Brian Foster
d4a97a0422 xfs: add missing bmap cancel calls in error paths
If a failure occurs after the bmap free list is populated and before
xfs_bmap_finish() completes successfully (which returns a partial
list on failure), the bmap free list must be cancelled. Otherwise,
the extent items on the list are never freed and a memory leak
occurs.

Several random error paths throughout the code suffer this problem.
Fix these up such that xfs_bmap_cancel() is always called on error.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-19 10:01:40 +10:00
Dave Chinner
3d238b7e0e Merge branch 'xfs-freelist-cleanup' into for-next 2015-06-23 08:48:43 +10:00
Dave Chinner
496817b4be xfs: clean up XFS_MIN_FREELIST macros
We no longer calculate the minimum freelist size from the on-disk
AGF, so we don't need the macros used for this. That means the
nested macros can be cleaned up, and turn this into an actual
function so the logic is clear and concise. This will make it much
easier to add support for the rmap btree when the time comes.

This also gets rid of the XFS_AG_MAXLEVELS macro used by these
freelist macros as it is simply a wrapper around a single variable.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 10:13:30 +10:00
Dave Chinner
50adbcb4c4 xfs: xfs_alloc_fix_freelist() can use incore perag structures
At the moment, xfs_alloc_fix_freelist() uses a mix of per-ag based
access and agf buffer  based access to freelist and space usage
information. However, once the AGF buffer is locked inside this
function, it is guaranteed that both the in-memory and on-disk
values are identical. xfs_alloc_fix_freelist() doesn't modify the
values in the structures directly, so it is a read-only user of the
infomration, and hence can use the per-ag structure exclusively for
determining what it should do.

This opens up an avenue for cleaning up a lot of duplicated logic
whose only difference is the structure it gets the data from, and in
doing so removes a lot of needless byte swapping overhead when
fixing up the free list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 10:04:31 +10:00
Dave Chinner
4ea7976616 Merge branch 'xfs-commit-cleanup' into for-next
Conflicts:
	fs/xfs/xfs_attr_inactive.c
2015-06-04 13:55:48 +10:00
Christoph Hellwig
70393313dd xfs: saner xfs_trans_commit interface
The flags argument to xfs_trans_commit is not useful for most callers, as
a commit of a transaction without a permanent log reservation must pass
0 here, and all callers for a transaction with a permanent log reservation
except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES.  So remove
the flags argument from the public xfs_trans_commit interfaces, and
introduce low-level __xfs_trans_commit variant just for xfs_trans_roll
that regrants a log reservation instead of releasing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 13:48:08 +10:00
Christoph Hellwig
4906e21545 xfs: remove the flags argument to xfs_trans_cancel
xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and
XFS_TRANS_ABORT.  Both of them are a direct product of the transaction
state, and can be deducted:

 - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled,
   and XFS_TRANS_ABORT is a noop for a transaction that is not dirty.
 - any transaction with a permanent log reservation needs
   XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing
   XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent
   log reservation is invalid.

So just remove the flags argument and do the right thing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 13:47:56 +10:00
Dave Chinner
e01c025fbd Merge branch 'xfs-misc-fixes-for-4.2' into for-next 2015-06-01 10:50:18 +10:00
Brian Foster
2e588a46aa xfs: always log the inode on unwritten extent conversion
The fsync() requirements for crash consistency on XFS are to flush file
data and force any in-core inode updates to the log. We currently check
whether the inode is pinned to identify whether the log needs to be
forced, since a non-zero pin count generally represents an inode that
has transactions awaiting a flush to the on-disk log.

This is not sufficient in all cases, however. Reports of xfstests test
generic/311 failures on ppc64/s390x hosts have identified failures to
fsync outstanding inode modifications due to the inode not being pinned
at the time of the fsync. This occurs because certain bmap updates can
complete by logging bmapbt buffers but without ever dirtying (and thus
pinning) the core inode. The following is a specific incarnation of this
problem:

$ mount $dev /mnt -o noatime,nobarrier
$ for i in $(seq 0 2 31); do \
        xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
	done
$ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
	hexdump /mnt/file; \
	./xfstests-dev/src/godown /mnt
...
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0014000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f8000
$ umount /mnt; mount ...
$ hexdump /mnt/file
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f8000

In short, the unwritten extent conversion for the last write is lost
despite the fact that an fsync executed before the filesystem was
shutdown. Note that this is impossible to reproduce on v5 supers due to
unconditional time callbacks for di_changecount and highly difficult to
reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
reduces timer granularity enough to increase the odds that time updates
are skipped and allows this to reproduce within a handful of attempts.

To deal with this problem, unconditionally log the core in the unwritten
extent conversion path. Fix up logflags after the extent conversion to
keep the extent update code consistent with the other extent update
helpers. This fixup is not necessary for the other (hole, delay) extent
helpers because they execute in the block allocation codepath, which
already logs the inode for other reasons (e.g., for di_nblocks).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-01 07:15:23 +10:00
Dave Chinner
6dea405eee xfs: extent size hints can round up extents past MAXEXTLEN
This results in BMBT corruption, as seen by this test:

# mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
....
# mount /dev/vdc /mnt/scratch
# xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo

which results in this failure on a debug kernel:

XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
....
Call Trace:
 [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100
 [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20
 [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120
 [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70
 [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70
 [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0
 [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170
 [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400
 [<ffffffff81ddcc29>] ? down_write+0x29/0x60
 [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310
 [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120
 [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570
 [<ffffffff811cd680>] vfs_fallocate+0x140/0x260
 [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80
 [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17

The tracepoint that indicates the extent that triggered the assert
failure is:

xfs_iext_insert:   idx 0 offset 0 block 16777224 count 2097152 flag 1

Clearly indicating that the extent length is greater than MAXEXTLEN,
which is 2097151. A prior trace point shows the allocation was an
exact size match and that a length greater than MAXEXTLEN was asked
for:

xfs_alloc_size_done:  agno 1 agbno 8 minlen 2097152 maxlen 2097152
					    ^^^^^^^        ^^^^^^^

We don't see this problem with extent size hints through the IO path
because we can't do single IOs large enough to trigger MAXEXTLEN
allocation. fallocate(), OTOH, is not limited in it's allocation
sizes and so needs help here.

The issue is that the extent size hint alignment is rounding up the
extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking
into account extent size hints when calculating the maximum extent
length to allocate. xfs_bmapi_reserve_delalloc() is already doing
this, but direct extent allocation is not.

Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is
wrong, and it works only because delayed allocation extents are not
limited in size to MAXEXTLEN in the in-core extent tree. hence this
calculation does not work for direct allocation, and the delalloc
code needs fixing. This may, in fact be the underlying bug that
occassionally causes transaction overruns in delayed allocation
extent conversion, so now we know it's wrong we should fix it, too.
Many thanks to Brian Foster for finding this problem during review
of this patch.

Hence the fix, after much code reading, is to allow
xfs_bmap_extsize_align() to align partial extents when full
alignment would extend the alignment past MAXEXTLEN. We can safely
do this because all callers have higher layer allocation loops that
already handle short allocations, and so will simply run another
allocation to cover the remainder of the requested allocation range
that we ignored during alignment. The advantage of this approach is
that it also removes the need for callers to do anything other than
limit their requests to MAXEXTLEN - they don't really need to be
aware of extent size hints at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 07:40:06 +10:00