Default attr fork offset is based on inode size, so is a fixed
geometry parameter of the inode. Move it to the xfs_ino_geometry
structure and stop calculating it on every call to
xfs_default_attroffset().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
If we allocate quota inodes in the process of mounting a filesystem but
then decide to abort the mount, it's possible that the quota inodes are
sitting around pinned by the log. Now that inode reclaim relies on the
AIL to flush inodes, we have to force the log and push the AIL in
between releasing the quota inodes and kicking off reclaim to tear down
all the incore inodes. Do this by extracting the bits we need from the
unmount path and reusing them. As an added bonus, failed writes during
a failed mount will not retry forever now.
This was originally found during a fuzz test of metadata directories
(xfs/1546), but the actual symptom was that reclaim hung up on the quota
inodes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Split the block preallocation garbage collection work into per-AG work
items so that we can take advantage of parallelization.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Shorten the names of the two functions that start and stop block
preallocation garbage collection and move them up to the other blockgc
functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_quiesce_attr() is now a wrapper for xfs_log_clean(). Remove it
and call xfs_log_clean() directly.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
xfs_log_sbcount() calls xfs_sync_sb() to sync superblock counters to
disk when lazy superblock accounting is enabled. This occurs on
unmount, freeze, and read-only (re)mount and ensures the final
values are calculated and persisted to disk before each form of
quiesce completes.
Now that log covering occurs in all of these contexts and uses the
same xfs_sync_sb() mechanism to update log state, there is no need
to log the superblock separately for any reason. Update the log
quiesce path to sync the superblock at least once for any mount
where lazy superblock accounting is enabled. If the log is already
covered, it will remain in the covered state. Otherwise, the next
sync as part of the normal covering sequence will carry the
associated superblock update with it. Remove xfs_log_sbcount() now
that it is no longer needed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
xfs_log_sbcount() syncs the superblock specifically to accumulate
the in-core percpu superblock counters and commit them to disk. This
is required to maintain filesystem consistency across quiesce
(freeze, read-only mount/remount) or unmount when lazy superblock
accounting is enabled because individual transactions do not update
the superblock directly.
This mechanism works as expected for writable mounts, but
xfs_log_sbcount() skips the update for read-only mounts. Read-only
mounts otherwise still allow log recovery and write out an unmount
record during log quiesce. If a read-only mount performs log
recovery, it can modify the in-core superblock counters and write an
unmount record when the filesystem unmounts without ever syncing the
in-core counters. This leaves the filesystem with a clean log but in
an inconsistent state with regard to lazy sb counters.
Update xfs_log_sbcount() to use the same logic
xfs_log_unmount_write() uses to determine when to write an unmount
record. This ensures that lazy accounting is always synced before
the log is cleaned. Refactor this logic into a new helper to
distinguish between a writable filesystem and a writable log.
Specifically, the log is writable unless the filesystem is mounted
with the norecovery mount option, the underlying log device is
read-only, or the filesystem is shutdown. Drop the freeze state
check because the update is already allowed during the freezing
process and no context calls this function on an already frozen fs.
Also, retain the shutdown check in xfs_log_unmount_write() to catch
the case where the preceding log force might have triggered a
shutdown.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
xfs_wait_buftarg() is vaguely named and somewhat overloaded. Its
primary purpose is to reclaim all buffers from the provided buffer
target LRU. In preparation to refactor xfs_wait_buftarg() into
serialization and LRU draining components, rename the function and
associated helpers to something more descriptive. This patch has no
functional changes with the minor exception of renaming a
tracepoint.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In xfs_initialize_perag(), if kmem_zalloc(), xfs_buf_hash_init(), or
radix_tree_preload() failed, the returned value 'error' is not set
accordingly.
Reported-as-fixing: 8b26c5825e ("xfs: handle ENOMEM correctly during initialisation of perag structures")
Fixes: 9b24717979 ("xfs: cache unlinked pointers in an rhashtable")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Merge xfs_getsb into its only caller, and clean that one up a little bit
as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
With the recent rework of the inode cluster flushing, we no longer
ever wait on the the inode flush "lock". It was never a lock in the
first place, just a completion to allow callers to wait for inode IO
to complete. We now never wait for flush completion as all inode
flushing is non-blocking. Hence we can get rid of all the iflock
infrastructure and instead just set and check a state flag.
Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
xfs_iflock_nowait() test-and-set operations on that flag, and
replace all the xfs_ifunlock() calls to clear operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove kmem_realloc() function and convert its users to use MM API
directly (krealloc())
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Clean up xfs_reclaim_inodes() callers. Most callers want blocking
behaviour, so just make the existing SYNC_WAIT behaviour the
default.
For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
directly because we just want optimistic clean inode reclaim to be
done in the background.
For xfs_quiesce_attr() we can just remove the inode reclaim calls as
they are a historic relic that was required to flush dirty inodes
that contained unlogged changes. We now log all changes to the
inodes, so the sync AIL push from xfs_log_quiesce() called by
xfs_quiesce_attr() will do all the required inode writeback for
freeze.
Seeing as we now want to loop until all reclaimable inodes have been
reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
were skipped. This is much more reliable and will always loop until
all reclaimable inodes are reclaimed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inode reclaim will still throttle direct reclaim on the per-ag
reclaim locks. This is no longer necessary as reclaim can run
non-blocking now. Hence we can remove these locks so that we don't
arbitrarily block reclaimers just because there are more direct
reclaimers than there are AGs.
This can result in multiple reclaimers working on the same range of
an AG, but this doesn't cause any apparent issues. Optimising the
spread of concurrent reclaimers for best efficiency can be done in a
future patchset.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Shaokun Zhang reported that XFS was using substantial CPU time in
percpu_count_sum() when running a single threaded benchmark on
a high CPU count (128p) machine from xfs_mod_ifree(). The issue
is that the filesystem is empty when the benchmark runs, so inode
allocation is running with a very low inode free count.
With the percpu counter batching, this means comparisons when the
counter is less that 128 * 256 = 32768 use the slow path of adding
up all the counters across the CPUs, and this is expensive on high
CPU count machines.
The summing in xfs_mod_ifree() is only used to fire an assert if an
underrun occurs. The error is ignored by the higher level code.
Hence this is really just debug code and we don't need to run it
on production kernels, nor do we need such debug checks to return
error values just to trigger an assert.
Finally, xfs_mod_icount/xfs_mod_ifree are only called from
xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
directly call the percpu_counter_add/percpu_counter_compare
functions. The compare functions are now run only on debug builds as
they are internal to ASSERT() checks and so only compiled in when
ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).
Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There are a couple places where we directly call printk_once() and one
of them doesn't follow the standard xfs subsystem printk format as a
result.
#define printk_once variants to go with our existing printk_ratelimited
#defines so we can do one-shot printks in a consistent manner.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Just dereference bp->b_addr directly and make the code a little
simpler and more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit
and swidth values could cause xfs_repair to fail loudly. The problem
here is that repair calculates the where mkfs should have allocated the
root inode, based on the superblock geometry. The allocation decisions
depend on sunit, which means that we really can't go updating sunit if
it would lead to a subsequent repair failure on an otherwise correct
filesystem.
Port from xfs_repair some code that computes the location of the root
inode and teach mount to skip the ondisk update if it would cause
problems for repair. Along the way we'll update the documentation,
provide a function for computing the minimum AGFL size instead of
open-coding it, and cut down some indenting in the mount code.
Note that we allow the mount to proceed (and new allocations will
reflect this new geometry) because we've never screened this kind of
thing before. We'll have to wait for a new future incompat feature to
enforce correct behavior, alas.
Note that the geometry reporting always uses the superblock values, not
the incore ones, so that is what xfs_info and xfs_growfs will report.
[1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d
Reported-by: Alex Lyakas <alex@zadara.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
If the administrator provided a sunit= mount option, we need to validate
the raw parameter, convert the mount option units (512b blocks) into the
internal unit (fs blocks), and then validate that the (now cooked)
parameter doesn't screw anything up on disk. The incore inode geometry
computation can depend on the new sunit option, but a subsequent patch
will make validating the cooked value depends on the computed inode
geometry, so break the sunit update into two steps.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Convert the last of the open coded corruption check and report idioms to
use the XFS_IS_CORRUPT macro.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Eliminate struct xfs_mount field m_fsname by using the super block s_id
field directly.
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rework xfs_parseargs to fill out the default value and then parse the
option directly into the mount structure, similar to what we do for
other updates, and open code the now trivial updates based on on the
on-disk superblock directly into xfs_mountfs.
Note that this change rejects the allocsize=0 mount option that has been
documented as invalid for a long time instead of just ignoring it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the allocsize name to match the mount option and usage instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
m_readio_blocks is entirely unused, and m_readio_blocks is only used in
xfs_stat_blksize in a max statements that is a no-op as it always has
the same value as m_writeio_log.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The -o wsync allocsize overwrite overwrite was part of a special hack
for NFSv2 servers in IRIX and has no real purpose in modern Linux, so
remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If the CONFIG_BUG is enabled, BUG is executed and then system is crashed.
However, the bailout for mount is no longer proceeding.
Using WARN_ON_ONCE rather than BUG can prevent this situation.
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Since no caller is using KM_NOSLEEP and no callee branches on KM_SLEEP,
we can remove KM_NOSLEEP and replace KM_SLEEP with 0.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There are many, many xfs header files which are included but
unneeded (or included twice) in the xfs code, so remove them.
nb: xfs_linux.h includes about 9 headers for everyone, so those
explicit includes get removed by this. I'm not sure what the
preference is, but if we wanted explicit includes everywhere,
a followup patch could remove those xfs_*.h includes from
xfs_linux.h and move them into the files that need them.
Or it could be left as-is.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The flags value is always passed as 0 so remove the argument.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
inode_cluster_size is supposed to represent the size (in bytes) of an
inode cluster buffer. We avoid having to handle multiple clusters per
filesystem block on filesystems with large blocks by openly rounding
this value up to 1 FSB when necessary. However, we never reset
inode_cluster_size to reflect this new rounded value, which adds to the
potential for mistakes in calculating geometries.
Fix this by setting inode_cluster_size to reflect the rounded-up size if
needed, and special-case the few places in the sparse inodes code where
we actually need the smaller value to validate on-disk metadata.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Migrate all of the inode geometry setup code from xfs_mount.c into a
single libxfs function that we can share with xfsprogs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Separate the inode geometry information into a distinct structure.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
"reclaim" is used throughout the icache code to mean reclamation of
incore inode structures. It's also used for two helper functions that
toggle background deletion of speculative preallocations. Separate
the second of the two uses to make things less confusing.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a percpu counter to track the number of blocks directly reserved for
delayed allocations on the data device. This counter (in contrast to
i_delayed_blks) does not track allocated CoW staging extents or anything
going on with the realtime device. It will be used in the upcoming
summary counter scrub function to check the free block counts without
having to freeze the filesystem or walk all the inodes to find the
delayed allocations.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If we know the filesystem metadata isn't healthy during unmount, we want
to encourage the administrator to run xfs_repair right away. We can't
do this if BAD_SUMMARY will cause an unclean log unmount to force
summary recalculation, so turn it off if the fs is bad.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Replace the BAD_SUMMARY mount flag with calls to the equivalent health
tracking code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add the necessary in-core metadata fields to keep track of which parts
of the filesystem have been observed and which parts were observed to be
unhealthy, and print a warning at unmount time if we have unfixed
problems.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Use a rhashtable to cache the unlinked list incore. This should speed
up unlinked processing considerably when there are a lot of inodes on
the unlinked list because iunlink_remove no longer has to traverse an
entire bucket list to find which inode points to the one being removed.
The incore list structure records "X.next_unlinked = Y" relations, with
the rhashtable using Y to index the records. This makes finding the
inode X that points to a inode Y very quick. If our cache fails to find
anything we can always fall back on the old method.
FWIW this drastically reduces the amount of time it takes to remove
inodes from the unlinked list. I wrote a program to open a lot of
O_TMPFILE files and then close them in the same order, which takes
a very long time if we have to traverse the unlinked lists. With the
ptach, I see:
+ /d/t/tmpfile/tmpfile
Opened 193531 files in 6.33s.
Closed 193531 files in 5.86s
real 0m12.192s
user 0m0.064s
sys 0m11.619s
+ cd /
+ umount /mnt
real 0m0.050s
user 0m0.004s
sys 0m0.030s
And without the patch:
+ /d/t/tmpfile/tmpfile
Opened 193588 files in 6.35s.
Closed 193588 files in 751.61s
real 12m38.853s
user 0m0.084s
sys 12m34.470s
+ cd /
+ umount /mnt
real 0m0.086s
user 0m0.000s
sys 0m0.060s
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Store the inode cluster alignment information in units of inodes and
blocks in the mount data so that we don't have to keep recalculating
them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Store the number of inodes and blocks per inode cluster in the mount
data so that we don't have to keep recalculating them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Since the sb write verifier trips on bad icounts, we should also force a
mount time recalculation of the summary counters if the icount is bad.
This helps us avoid blowing up at freeze/unmount time when the bad
counter gets written back out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Move the per-AG busy extent tree initialization to the per-ag structure
initialization since we don't want online repair to leak the old tree.
We only deconstruct the tree at unmount time, so this should be safe.
This also enables us to eliminate the commented out initialization in
the xfsprogs libxfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Replace the IRELE macro with a proper function so that we can do proper
typechecking and so that we can stop open-coding iput in scrub, which
means that we'll be able to ftrace inode lifetimes going through scrub
correctly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Use the "bad summary count" mount flag from the previous patch to skip
writing the unmount record to force log recovery at the next mount,
which will recalculate the summary counters for us.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Filippo Giunchedi complained that xfs doesn't even perform basic sanity
checks of the fs summary counters at mount time. Therefore, recalculate
the summary counters from the AGFs after log recovery if the counts were
bad (or we had to recover the fs). Enhance the recalculation routine to
fail the mount entirely if the new values are also obviously incorrect.
We use a mount state flag to record the "bad summary count" state so
that the (subsequent) online fsck patches can detect subtlely incorrect
counts and set the flag; clear it userspace asks for a repair; or force
a recalculation at the next mount if nobody fixes it by unmount time.
Reported-by: Filippo Giunchedi <fgiunchedi@wikimedia.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Remove the verbose license text from XFS files and replace them
with SPDX tags. This does not change the license of any of the code,
merely refers to the common, up-to-date license files in LICENSES/
This change was mostly scripted. fs/xfs/Makefile and
fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
and modified by the following command:
for f in `git grep -l "GNU General" fs/xfs/` ; do
echo $f
cat $f | awk -f hdr.awk > $f.new
mv -f $f.new $f
done
And the hdr.awk script that did the modification (including
detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
is as follows:
$ cat hdr.awk
BEGIN {
hdr = 1.0
tag = "GPL-2.0"
str = ""
}
/^ \* This program is free software/ {
hdr = 2.0;
next
}
/any later version./ {
tag = "GPL-2.0+"
next
}
/^ \*\// {
if (hdr > 0.0) {
print "// SPDX-License-Identifier: " tag
print str
print $0
str=""
hdr = 0.0
next
}
print $0
next
}
/^ \* / {
if (hdr > 1.0)
next
if (hdr > 0.0) {
if (str != "")
str = str "\n"
str = str $0
next
}
print $0
next
}
/^ \*/ {
if (hdr > 0.0)
next
print $0
next
}
// {
if (hdr > 0.0) {
if (str != "")
str = str "\n"
str = str $0
next
}
print $0
}
END { }
$
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When looking up the root inode at mount time, we don't actually do
any verification to check that the inode is allocated and accounted
for correctly in the INOBT. Make the checks on the root inode more
robust by making it an untrusted lookup. This forces the inode
lookup to use the inode btree to verify the inode is allocated
and mapped correctly to disk. This will also have the effect of
catching a significant number of AGI/INOBT related corruptions in
AG 0 at mount time.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rebuilding the reverse-mapping tree requires us to quiesce all inodes in
the filesystem, so we must stop background reclamation of post-EOF and
CoW prealloc blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Most of the generic data structures embedded in xfs_mount are
dynamically initialized immediately after mp is allocated. A few
fields are left out and initialized during the xfs_mountfs()
sequence, after mp has been attached to the superblock.
To clean this up and help prevent premature access of associated
fields, refactor xfs_mount allocation and all dependent init calls
into a new helper. This self-documents that all low level data
structures (i.e., locks, trees, etc.) should be initialized before
xfs_mount is attached to the superblock.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dmevmask structure member is a dmapi leftover; it's
set here and there but never actually used. Remove it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The mutex pag_ici_reclaim_lock of xfs_perag_t structure is initialized in
xfs_initialize_perag. If happen errors in xfs_initialize_perag, or free
resources in xfs_free_perag, wo need to destroy the mutex before free
perag.
Signed-off-by: Xiongwei Song <sxwjean@me.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When mounting fails, we must force-reclaim inodes (and disable delayed
reclaim) /after/ the realtime and quota control have let go of the
realtime and quota inodes. Without this, we corrupt the timer list and
cause other weird problems.
Found by xfs/376 fuzzing u3.bmbt[0].lastoff on an rmap filesystem to
force a bogus post-eof extent reclaim that causes the fs to go down.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we fail a mount on account of cow recovery errors, it's possible that
a previous quotacheck left some dquots in memory. The bailout clause of
xfs_mountfs forgets to purge these, and so we leak them. Fix that.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Way back when we established inode block-map redo log items, it was
discovered that we needed to prevent the VFS from evicting inodes during
log recovery because any given inode might be have bmap redo items to
replay even if the inode has no link count and is ultimately deleted,
and any eviction of an unlinked inode causes the inode to be truncated
and freed too early.
To make this possible, we set MS_ACTIVE so that inodes would not be torn
down immediately upon release. Unfortunately, this also results in the
quota inodes not being released at all if a later part of the mount
process should fail, because we never reclaim the inodes. So, set
MS_ACTIVE right before we do the last part of log recovery and clear it
immediately after we finish the log recovery so that everything
will be torn down properly if we abort the mount.
Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
- Avoid quotacheck deadlocks
- Fix transaction overflows when bunmapping fragmented files
- Refactor directory readahead
- Allow admin to configure if ASSERT is fatal
- Improve transaction usage detail logging during overflows
- Minor cleanups
- Don't leak log items when the log shuts down
- Remove double-underscore typedefs
- Various preparation for online scrubbing
- Introduce new error injection configuration sysfs knobs
- Refactor dq_get_next to use extent map directly
- Fix problems with iterating the page cache for unwritten data
- Implement SEEK_{HOLE,DATA} via iomap
- Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA
- Don't use MAXPATHLEN to check on-disk symlink target lengths
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJZYDw4AAoJEPh/dxk0SrTr2IMP/3JLeygIDtKBBVRPvlCmEXQC
j8w1C/ntn46zZKQ8l14fAFV4HV2d+KJWf8+yDuPuGdMXJfPeKZf95otYhnSx/9Th
MvCH7Nzg63yjEGqXpBkfIVr/GT0KTx28lxiqNViChr7XiXWookgf3SSLINO+vU4J
L2jgLqieJfijiHTBs4qGCQPDwSXVoSOi5XCCQWDYQrXz6DI5UEJc70U53WkH4tRu
RctOgp1lralwEO0PhfomD3m/Gk94taE/4ZpX/j/5Y4tvH/yh5aY3/KTCLm6+mYT3
rgMpmg5hmm+UiCTNoTnQ5RxzGZWCfI1I9FZ3HqDsbhmFtaWh32ti0dEEDYsF8Opj
ARnTty3cRx41LH9dULrVWdwW105AHgwEz8/OZlG0JOca9qzj9GKERMg/hpHINAKN
TrBlkweg86LWZDy23udZJ/v35svNqSFsqL1yV8j5dXyBi+Yi2CGfU27zbBwnj4Jk
047l+OuRbBnEOUULqJTEVBY3euoclwl/yQrW2m409s7vPGkGQBLuFCsDKQdnvJ/A
D7frZqH8XypwnhFOkKybUnBkn4P7vZ2sEuCIZMsrH5k/ys8XyEkaBaOurjvMBOKA
vLIMD6RXDWrFbOoovfK/stEM6/UFoQkgMhBe7vB9EXk1AjM8NYyWZgp5BkHtytC7
qa6GRjtGefhc67hbwXJd
=/GZI
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS updates from Darrick Wong:
"Here are some changes for you for 4.13. For the most part it's fixes
for bugs and deadlock problems, and preparation for online fsck in
some future merge window.
- Avoid quotacheck deadlocks
- Fix transaction overflows when bunmapping fragmented files
- Refactor directory readahead
- Allow admin to configure if ASSERT is fatal
- Improve transaction usage detail logging during overflows
- Minor cleanups
- Don't leak log items when the log shuts down
- Remove double-underscore typedefs
- Various preparation for online scrubbing
- Introduce new error injection configuration sysfs knobs
- Refactor dq_get_next to use extent map directly
- Fix problems with iterating the page cache for unwritten data
- Implement SEEK_{HOLE,DATA} via iomap
- Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA
- Don't use MAXPATHLEN to check on-disk symlink target lengths"
* tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
xfs: don't crash on unexpected holes in dir/attr btrees
xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
xfs: fix contiguous dquot chunk iteration livelock
xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
vfs: Add iomap_seek_hole and iomap_seek_data helpers
vfs: Add page_cache_seek_hole_data helper
xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
xfs: Check for m_errortag initialization in xfs_errortag_test
xfs: grab dquots without taking the ilock
xfs: fix semicolon.cocci warnings
xfs: Don't clear SGID when inheriting ACLs
xfs: free cowblocks and retry on buffered write ENOSPC
xfs: replace log_badcrc_factor knob with error injection tag
xfs: convert drop_writes to use the errortag mechanism
xfs: remove unneeded parameter from XFS_TEST_ERROR
xfs: expose errortag knobs via sysfs
xfs: make errortag a per-mountpoint structure
xfs: free uncommitted transactions during log recovery
xfs: don't allow bmap on rt files
...
Pull percpu updates from Tejun Heo:
"These are the percpu changes for the v4.13-rc1 merge window. There are
a couple visibility related changes - tracepoints and allocator stats
through debugfs, along with __ro_after_init markings and a cosmetic
rename in percpu_counter.
Please note that the simple O(#elements_in_the_chunk) area allocator
used by percpu allocator is again showing scalability issues,
primarily with bpf allocating and freeing large number of counters.
Dennis is working on the replacement allocator and the percpu
allocator will be seeing increased churns in the coming cycles"
* 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: fix static checker warnings in pcpu_destroy_chunk
percpu: fix early calls for spinlock in pcpu_stats
percpu: resolve err may not be initialized in pcpu_alloc
percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
percpu: add tracepoint support for percpu memory
percpu: expose statistics about percpu memory via debugfs
percpu: migrate percpu data structures to internal header
percpu: add missing lockdep_assert_held to func pcpu_free_area
mark most percpu globals as __ro_after_init
Remove the xfs_etest structure in favor of a per-mountpoint structure.
This will give us the flexibility to set as many error injection points
as we want, and later enable us to set up sysfs knobs to set the trigger
frequency as we wish. This comes at a cost of higher memory use, but
unti we hit 1024 injection points (we're at 29) or a lot of mounts this
shouldn't be a huge issue.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt safe due to explicit calls to preempt_disable. Given
how __ prefix is used in percpu related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add. In terms of context-safety,
they're equivalent. The only difference is that the __ version takes
a batch parameter.
Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.
This patch doesn't cause any functional changes.
tj: Minor updates to patch description for clarity. Cosmetic
indentation updates.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>
This is a purely mechanical patch that removes the private
__{u,}int{8,16,32,64}_t typedefs in favor of using the system
{u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform
the transformation and fix the resulting whitespace and indentation
errors:
s/typedef\t__uint8_t/typedef __uint8_t\t/g
s/typedef\t__uint/typedef __uint/g
s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
s/__uint8_t\t/__uint8_t\t\t/g
s/__uint/uint/g
s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
s/__int/int/g
/^typedef.*int[0-9]*_t;$/d
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers. More to come..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Use the common helper uuid_is_null() and remove the xfs specific
helper uuid_is_nil().
The common helper does not check for the NULL pointer value as
xfs helper did, but xfs code never calls the helper with a pointer
that can be NULL.
Conform comments and warning strings to use the term 'null uuid'
instead of 'nil uuid', because this is the terminology used by
lib/uuid.c and its users. It is also the terminology used in
userspace by libuuid and xfsprogs.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[hch: remove now unused uuid.[ch]]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Opencode uuid_getnodeuniq in the only caller, and directly decode
the uuid_t representation instead of using a structure cast for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Copy the uuid of the filesystem to struct super_block s_uuid field,
as several other filesystems already do. Copy regardless of the nouuid
mount option, because other filesystems also do not guaranty uniqueness
of the s_uuid field in super_block struct.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When block size is larger than inode cluster size, the call to
XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
would have set xfs_sb->sb_inoalignmt to 0. Hence in
xfs_set_inoalignment(), xfs_mount->m_inoalign_mask gets initialized to
-1 instead of 0. However, xfs_mount->m_sinoalign would get correctly
intialized to 0 because for every positive value of xfs_mount->m_dalign,
the condition "!(mp->m_dalign & mp->m_inoalign_mask)" would evaluate to
false.
Also, xfs_imap() worked fine even with xfs_mount->m_inoalign_mask having
-1 as the value because blks_per_cluster variable would have the value 1
and hence we would never have a need to use xfs_mount->m_inoalign_mask
to compute the inode chunk's agbno and offset within the chunk.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead we submit the discard requests and use another workqueue to
release the extents from the extent busy list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently we force the log and simply try again if we hit a busy extent,
but especially with online discard enabled it might take a while after
the log force for the busy extents to disappear, and we might have
already completed our second pass.
So instead we add a new waitqueue and a generation counter to the pag
structure so that we can do wakeups once we've removed busy extents,
and we replace the single retry with an unconditional one - after
all we hold the AGF buffer lock, so no other allocations or frees
can be racing with us in this AG.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If pag cannot be allocated, the current error exit path will trip
a null pointer deference error when calling xfs_buf_hash_destroy
with a null pag. Fix this by adding a new error exit labels and
jumping to those accordingly, avoiding the hash destroy and
unnecessary kmem_free on pag.
Up to three things need to be properly unwound:
1) pag memory allocation
2) xfs_buf_hash_init
3) radix_tree_insert
For any given iteration through the loop, any of the above which
succeed must be unwound for /this/ pag, and then all prior
initialized pags must be unwound.
Addresses-Coverity-Id: 1397628 ("Dereference after null check")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
On filesystems with a lot of metadata and in metadata intensive workloads
xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
the CPU time is spent on CPU cache misses while traversing the rbtree.
As the buffer cache does not need any kind of ordering, but fast lookups
a hashtable is the natural data structure to use. The rhashtable
infrastructure provides a self-scaling hashtable implementation and
allows lookups to proceed while the table is going through a resize
operation.
This reduces the CPU-time spent for the lookups to 1/3 even for small
filesystems with a relatively small number of cached buffers, with
possibly much larger gains on higher loaded filesystems.
[dchinner: reduce minimum hash size to an acceptable size for large
filesystems with many AGs with no active use.]
[dchinner: remove stale rbtree asserts.]
[dchinner: use xfs_buf_map for compare function argument.]
[dchinner: make functions static.]
[dchinner: remove redundant comments.]
Signed-off-by: Lucas Stach <dev@lynxeye.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
As part of the inode block map intent log item recovery process, we
had to set the IRECOVERY flag to prevent an unlinked inode from
being truncated during the first iput call. This required us to set
MS_ACTIVE so that iput puts the inode on the lru instead of
immediately evicting the inode.
Unfortunately, if the mount fails later on, the inodes that have
been loaded (root dir and realtime) actually need to be evicted
since we're aborting the mount. If we don't clear MS_ACTIVE in the
failure step, those inodes are not evicted and therefore leak. The
leak was found by running xfs/130 and rmmoding xfs immediately after
the test.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Trim CoW reservations made on behalf of a cowextsz hint if they get too
old or we run low on quota, so long as we don't have dirty data awaiting
writeback or directio operations in progress.
Garbage collection of the cowextsize extents are kept separate from
prealloc extent reaping because setting the CoW prealloc lifetime to a
(much) higher value than the regular prealloc extent lifetime has been
useful for combatting CoW fragmentation on VM hosts where the VMs
experience bursty write behaviors and we can keep the utilization ratios
low enough that we don't start to run out of space. IOWs, it benefits
us to keep the CoW fork reservations around for as long as we can unless
we run out of blocks or hit inode reclaim.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
To gracefully handle the situation where a CoW operation turns a
single refcount extent into a lot of tiny ones and then run out of
space when a tree split has to happen, use the per-AG reserved block
pool to pre-allocate all the space we'll ever need for a maximal
btree. For a 4K block size, this only costs an overhead of 0.3% of
available disk space.
When reflink is enabled, we have an unfortunate problem with rmap --
since we can share a block billions of times, this means that the
reverse mapping btree can expand basically infinitely. When an AG is
so full that there are no free blocks with which to expand the rmapbt,
the filesystem will shut down hard.
This is rather annoying to the user, so use the AG reservation code to
reserve a "reasonable" amount of space for rmap. We'll prevent
reflinks and CoW operations if we think we're getting close to
exhausting an AG's free space rather than shutting down, but this
permanent reservation should be enough for "most" users. Hopefully.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch@lst.de: ensure that we invalidate the freed btree buffer]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Due to the way the CoW algorithm in XFS works, there's an interval
during which blocks allocated to handle a CoW can be lost -- if the FS
goes down after the blocks are allocated but before the block
remapping takes place. This is exacerbated by the cowextsz hint --
allocated reservations can sit around for a while, waiting to get
used.
Since the refcount btree doesn't normally store records with refcount
of 1, we can use it to record these in-progress extents. In-progress
blocks cannot be shared because they're not user-visible, so there
shouldn't be any conflicts with other programs. This is a better
solution than holding EFIs during writeback because (a) EFIs can't be
relogged currently, (b) even if they could, EFIs are bound by
available log space, which puts an unnecessary upper bound on how much
CoW we can have in flight, and (c) we already have a mechanism to
track blocks.
At mount time, read the refcount records and free anything we find
with a refcount of 1 because those were in-progress when the FS went
down.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Log recovery will iget an inode to replay BUI items and iput the inode
when it's done. Unfortunately, if the inode was unlinked, the iput
will see that i_nlink == 0 and decide to truncate & free the inode,
which prevents us from replaying subsequent BUIs. We can't skip the
BUIs because we have to replay all the redo items to ensure that
atomic operations complete.
Since unlinked inode recovery will reap the inode anyway, we can
safely introduce a new inode flag to indicate that an inode is in this
'unlinked recovery' state and should not be auto-reaped in the
drop_inode path.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Start constructing the refcount btree implementation by establishing
the on-disk format and everything needed to read, write, and
manipulate the refcount btree blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Recently we've had a number of reports where log recovery on a v5
filesystem has reported corruptions that looked to be caused by
recovery being re-run over the top of an already-recovered
metadata. This has uncovered a bug in recovery (fixed elsewhere)
but the vector that caused this was largely unknown.
A kdump test started tripping over this problem - the system
would be crashed, the kdump kernel and environment would boot and
dump the kernel core image, and then the system would reboot. After
reboot, the root filesystem was triggering log recovery and
corruptions were being detected. The metadumps indicated the above
log recovery issue.
What is happening is that the kdump kernel and environment is
mounting the root device read-only to find the binaries needed to do
it's work. The result of this is that it is running log recovery.
However, because there were unlinked files and EFIs to be processed
by recovery, the completion of phase 1 of log recovery could not
mark the log clean. And because it's a read-only mount, the unmount
process does not write records to the log to mark it clean, either.
Hence on the next mount of the filesystem, log recovery was run
again across all the metadata that had already been recovered and
this is what triggered corruption warnings.
To avoid this problem, we need to ensure that a read-only mount
always updates the log when it completes the second phase of
recovery. We already handle this sort of issue with rw->ro remount
transitions, so the solution is as simple as quiescing the
filesystem at the appropriate time during the mount process. This
results in the log being marked clean so the mount behaviour
recorded in the logs on repeated RO mounts will change (i.e. log
recovery will no longer be run on every mount until a RW mount is
done). This is a user visible change in behaviour, but it is
harmless.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Originally-From: Dave Chinner <dchinner@redhat.com>
The rmap btree is allocated from the AGFL, which means we have to
ensure ENOSPC is reported to userspace before we run out of free
space in each AG. The last allocation in an AG can cause a full
height rmap btree split, and that means we have to reserve at least
this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
Update the various space calculation functions to handle this.
Also, because the macros are now executing conditional code and are
called quite frequently, convert them to functions that initialise
variables in the struct xfs_mount, use the new variables everywhere
and document the calculations better.
[darrick.wong@oracle.com: don't reserve blocks if !rmap]
[dchinner@redhat.com: update m_ag_max_usable after growfs]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Originally-From: Dave Chinner <dchinner@redhat.com>
Now we have all the surrounding call infrastructure in place, we can
start filling out the rmap btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate rmap btree blocks. This prepares the way for adding the
btree operations implementation.
[darrick: record owner and offset info in rmap btree]
[darrick: fork, bmbt and unwritten state in rmap btree]
[darrick: flags are a separate field in xfs_rmap_irec]
[darrick: calculate maxlevels separately]
[darrick: move the 'unwritten' bit into unused parts of rm_offset]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Originally-From: Dave Chinner <dchinner@redhat.com>
XFS reserves a small amount of space in each AG for the minimum
number of free blocks needed for operation. Adding the rmap btree
increases the number of reserved blocks, but it also increases the
complexity of the calculation as the free inode btree is optional
(like the rmbt).
Rather than calculate the prealloc blocks every time we need to
check it, add a function to calculate it at mount time and store it
in the struct xfs_mount, and convert the XFS_PREALLOC_BLOCKS macro
just to use the xfs-mount variable directly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Restructure everything that used xfs_bmap_free to use xfs_defer_ops
instead. For now we'll just remove the old symbols and play some
cpp magic to make it work; in the next patch we'll actually rename
everything.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The upcoming buftarg I/O accounting mechanism maintains a count of
all buffers that have undergone I/O in the current hold-release
cycle. Certain buffers associated with core infrastructure (e.g.,
the xfs_mount superblock buffer, log buffers) are never released,
however. This means that accounting I/O submission on such buffers
elevates the buftarg count indefinitely and could lead to lockup on
unmount.
Define a new buffer flag to explicitly exclude buffers from buftarg
I/O accounting. Set the flag on the superblock and associated log
buffers.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Changes in this update:
o fixes for mount line parsing, sparse warnings, read-only compat
feature remount behaviour
o allow fast path symlink lookups for inline symlinks.
o attribute listing cleanups
o writeback goes direct to bios rather than indirecting through
bufferheads
o transaction allocation cleanup
o optimised kmem_realloc
o added configurable error handling for metadata write errors,
changed default error handling behaviour from "retry forever" to
"retry until unmount then fail"
o fixed several inode cluster writeback lookup vs reclaim race
conditions
o fixed inode cluster writeback checking wrong inode after lookup
o fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
o cleaned up inode reclaim tagging
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXRo8LAAoJEK3oKUf0dfodxLgP+wQMd46i/nCncr6jMcdoVXfL
rE6cL1LJWWVOWzax/bmdlV1VNXqqW7n0ABAVMqikzqSEd+fBQe/HOkdBeVUywu7o
bmqgNxuofMqHaiYhiTvUijHLHWLFxIgd/jNT7L5oGRzZdmP260VGf3EPipN7aA9U
Y3KVhFQCqohpeIUeSV4Z/eIDdHN5LyUI1s+7zMLquHKCWyO4aO4GBX8YlyXdRRVe
cwCZb4TBryS0PBCIra31MZ5wBRwLx8PBXqcNsnTQSR5Uu+WeuwxofXz5q3kdmNOU
XGTWiabQbcvaC4twrzqnErHEX41PAs43tWxsI/qJH49QIFdfOYM+t8ERhEa2Q3DW
Ihl+Q/2qiOuZZterG/t5MrxhybrmQhEFVJT6Ib8b/CnwpRm+K8kWTead1YJL8Xzd
F9k8e57BCgTbDA7jWxWDbp7eQ1/4KglBD4sefFPpsuFgO882mmo5GmymALGjmitw
JH1v3HL3PeTkQoHfcay8ZM/zNjX643CXHwCWYEOAgD+e77TPiOs/cHLZaXbrBkLK
PpSJNfYiBe31eeSOEGsxivMapLpus+cHZyK3uR+XU+naJhjOdaBDTTo8RsAD9jS5
C/dzxc4l7o+gYT+UjV5KtyfEeVwtGo5mtR9XozPbNDjNQor8Vo6NQMZXMXpFYDZI
2XgzVNpkEf/74kexdEzV
=0tYo
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"A pretty average collection of fixes, cleanups and improvements in
this request.
Summary:
- fixes for mount line parsing, sparse warnings, read-only compat
feature remount behaviour
- allow fast path symlink lookups for inline symlinks.
- attribute listing cleanups
- writeback goes direct to bios rather than indirecting through
bufferheads
- transaction allocation cleanup
- optimised kmem_realloc
- added configurable error handling for metadata write errors,
changed default error handling behaviour from "retry forever" to
"retry until unmount then fail"
- fixed several inode cluster writeback lookup vs reclaim race
conditions
- fixed inode cluster writeback checking wrong inode after lookup
- fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
- cleaned up inode reclaim tagging"
* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
xfs: fix warning in xfs_finish_page_writeback for non-debug builds
xfs: move reclaim tagging functions
xfs: simplify inode reclaim tagging interfaces
xfs: rename variables in xfs_iflush_cluster for clarity
xfs: xfs_iflush_cluster has range issues
xfs: mark reclaimed inodes invalid earlier
xfs: xfs_inode_free() isn't RCU safe
xfs: optimise xfs_iext_destroy
xfs: skip stale inodes in xfs_iflush_cluster
xfs: fix inode validity check in xfs_iflush_cluster
xfs: xfs_iflush_cluster fails to abort on error
xfs: remove xfs_fs_evict_inode()
xfs: add "fail at unmount" error handling configuration
xfs: add configuration handlers for specific errors
xfs: add configuration of error failure speed
xfs: introduce table-based init for error behaviors
xfs: add configurable error support to metadata buffers
xfs: introduce metadata IO error class
xfs: configurable error behavior via sysfs
xfs: buffer ->bi_end_io function requires irq-safe lock
...
If we take "retry forever" literally on metadata IO errors, we can
hang at unmount, once it retries those writes forever. This is the
default behavior, unfortunately.
Add an error configuration option for this behavior and default it
to "fail" so that an unmount will trigger actuall errors, a shutdown
and allow the unmount to succeed. It will be noisy, though, as it
will log the errors and shutdown that occurs.
To fix this, we need to mark the filesystem as being in the process
of unmounting. Do this with a mount flag that is added at the
appropriate time (i.e. before the blocking AIL sync). We also need
to add this flag if mount fails after the initial phase of log
recovery has been run.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We need to be able to change the way XFS behaviours in error
conditions depending on the type of underlying storage. This is
necessary for handling non-traditional block devices with extended
error cases, such as thin provisioned devices that can return ENOSPC
as an IO error.
Introduce the basic sysfs infrastructure needed to define and
configure error behaviours. This is done to be generic enough to
extend to configuring behaviour in other error conditions, such as
ENOMEM, which also has different desired behaviours according to
machine configuration.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Use krealloc to implement our realloc function. This helps to avoid
new allocations if we are still in the slab bucket. At least for the
bmap btree root that's actually the common case.
This also allows removing the now unused oldsize argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
inode32/inode64 allocator behavior with respect to mount, remount
and growfs is a little tricky.
The inode32 mount option should only enable the inode32 allocator
heuristics if the filesystem is large enough for 64-bit inodes to
exist. Today, it has this behavior on the initial mount, but a
remount with inode32 unconditionally changes the allocation
heuristics, even for a small fs.
Also, an inode32 mounted small filesystem should transition to the
inode32 allocator if the filesystem is subsequently grown to a
sufficient size. Today that does not happen.
This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a
single new function, and moves the "is the maximum inode number big
enough to matter" test into that function, so it doesn't rely on the
caller to get it right - which remount did not do, previously.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
They only set/clear/check a flag, no need for obfuscating this
with a macro.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Move the di_mode value from the xfs_icdinode to the VFS inode, reducing
the xfs_icdinode byte another 2 bytes and collapsing another 2 byte hole
in the structure.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Don't leak the UUID table when the module is unloaded.
(Found with kmemleak.)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This patch implements per-filesystem stats objects in sysfs. It
depends on the application of the previous patch series that
develops the infrastructure to support both xfs global stats and
xfs per-fs stats in sysfs.
Stats objects are instantiated when an xfs filesystem is mounted
and deleted on unmount. With this patch, the stats directory is
created and populated with the familiar stats and stats_clear files.
Example:
/sys/fs/xfs/sda9/stats/stats
/sys/fs/xfs/sda9/stats/stats_clear
With this patch, the individual counts within the new per-fs
stats file(s) remain at zero. Functions that use the the macros
to increment, decrement, and add-to the per-fs stats counts will
be covered in a separate new patch to follow this one. Note that
the counts within the global stats file (/sys/fs/xfs/stats/stats)
advance normally and can be cleared as it was prior to this patch.
[dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so
it is down before/after any path that uses the per-mount stats. ]
Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The root inode is read as part of the xfs_mountfs() sequence and the
reference is dropped in the event of failure after we grab the
inode. The reference drop doesn't necessarily free the inode,
however. It marks it for reclaim and potentially kicks off the
reclaim workqueue. The workqueue is destroyed further up the error
path, which means we are subject to crash if the workqueue job runs
after this point or a memory leak which is identified if the
xfs_inode_zone is destroyed (e.g., on module removal). Both of these
outcomes are reproducible via manual instrumentation of a mount
error after the root inode xfs_iget() call in xfs_mountfs().
Update the xfs_mountfs() error path to cancel any potential reclaim
work items and to run a synchronous inode reclaim if the root inode
is marked for reclaim. This ensures that no jobs remain on the queue
before it is destroyed and that the root inode is freed before the
reclaim mechanism is torn down.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Log recovery occurs in two phases at mount time. In the first phase,
EFIs and EFDs are processed and potentially cancelled out. EFIs without
EFD objects are inserted into the AIL for processing and recovery in the
second phase. xfs_mountfs() runs various other operations between the
phases and is thus subject to failure. If failure occurs after the first
phase but before the second, pending EFIs sit on the AIL, pin it and
cause the mount to hang.
Update the mount sequence to ensure that pending EFIs are cancelled in
the event of failure. Add a recovery cancellation mechanism to iterate
the AIL and cancel all EFI items when requested. Plumb cancellation
support through the log mount finish helper and update xfs_mountfs() to
invoke cancellation in the event of failure after recovery has started.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The sparse inode chunks feature uses the helper function to enable the
allocation of sparse inode chunks. The incompatible feature bit is set
on disk at mkfs time to prevent mount from unsupported kernels.
Also, enforce the inode alignment requirements required for sparse inode
chunks at mount time. When enabled, full inode chunks (and all inode
record) alignment is increased from cluster size to inode chunk size.
Sparse inode alignment must match the cluster size of the fs. Both
superblock alignment fields are set as such by mkfs when sparse inode
support is enabled.
Finally, warn that sparse inode chunks is an experimental feature until
further notice.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Because the counters use a custom batch size, the comparison
functions need to be aware of that batch size otherwise the
comparison does not work correctly. This leads to ASSERT failures
on generic/027 like this:
XFS: Assertion failed: 0, file: fs/xfs/xfs_mount.c, line: 1099
------------[ cut here ]------------
....
Call Trace:
[<ffffffff81522a39>] xfs_mod_icount+0x99/0xc0
[<ffffffff815285cb>] xfs_trans_unreserve_and_mod_sb+0x28b/0x5b0
[<ffffffff8152f941>] xfs_log_commit_cil+0x321/0x580
[<ffffffff81528e17>] xfs_trans_commit+0xb7/0x260
[<ffffffff81503d4d>] xfs_bmap_finish+0xcd/0x1b0
[<ffffffff8151da41>] xfs_inactive_ifree+0x1e1/0x250
[<ffffffff8151dbe0>] xfs_inactive+0x130/0x200
[<ffffffff81523a21>] xfs_fs_evict_inode+0x91/0xf0
[<ffffffff811f3958>] evict+0xb8/0x190
[<ffffffff811f433b>] iput+0x18b/0x1f0
[<ffffffff811e8853>] do_unlinkat+0x1f3/0x320
[<ffffffff811d548a>] ? filp_close+0x5a/0x80
[<ffffffff811e999b>] SyS_unlinkat+0x1b/0x40
[<ffffffff81e0892e>] system_call_fastpath+0x12/0x71
This is a regression introduced by commit 501ab32 ("xfs: use generic
percpu counters for inode counter").
This patch fixes the same problem for both the inode counter and the
free block counter in the superblocks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>