We're about to start using bch_validate_flags for superblock section
validation - it's no longer bkey specific.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Eliminating bch2_dev_bkey_exists() uses and replacing them with proper
checks; this one was unnecessary since the caller already has it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When building with clang's -Wincompatible-function-pointer-types-strict
(a warning designed to catch potential kCFI failures at build time),
there are several warnings along the lines of:
fs/bcachefs/bkey_methods.c:118:2: error: incompatible function pointer types initializing 'int (*)(struct btree_trans *, enum btree_id, unsigned int, struct bkey_s_c, struct bkey_s, enum btree_iter_update_trigger_flags)' with an expression of type 'int (struct btree_trans *, enum btree_id, unsigned int, struct bkey_s_c, struct bkey_s, unsigned int)' [-Werror,-Wincompatible-function-pointer-types-strict]
118 | BCH_BKEY_TYPES()
| ^~~~~~~~~~~~~~~~
fs/bcachefs/bcachefs_format.h:394:2: note: expanded from macro 'BCH_BKEY_TYPES'
394 | x(inode, 8) \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
fs/bcachefs/bkey_methods.c:117:41: note: expanded from macro 'x'
117 | #define x(name, nr) [KEY_TYPE_##name] = bch2_bkey_ops_##name,
| ^~~~~~~~~~~~~~~~~~~~
<scratch space>:277:1: note: expanded from here
277 | bch2_bkey_ops_inode
| ^~~~~~~~~~~~~~~~~~~
fs/bcachefs/inode.h:26:13: note: expanded from macro 'bch2_bkey_ops_inode'
26 | .trigger = bch2_trigger_inode, \
| ^~~~~~~~~~~~~~~~~~
There are several functions that did not have their flags parameter
converted to 'enum btree_iter_update_trigger_flags' in the recent
unification, which will cause kCFI failures at runtime because the
types, while ABI compatible (hence no warning from the non-strict
version of this warning), do not match exactly.
Fix up these functions (as well as a few other obvious functions that
should have it, even if there are no warnings currently) to resolve the
warnings and potential kCFI runtime failures.
Fixes: 31e4ef3280c8 ("bcachefs: iter/update/trigger/str_hash flag cleanup")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is a nice cleanup - and we've also been having problems with
kthread creation in the mount path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're about to add new asserts for btree_trans locking consistency, and
part of that requires that aren't using the btree_trans while it's
unlocked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Combine iter/update/trigger/str_hash flags into a single enum, and
x-macroize them for a to_text() function later.
These flags are all for a specific iter/key/update context, so it makes
sense to group them together - iter/update/trigger flags were already
given distinct bits, this cleans up and unifies that handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the discard worker, we were failing to validate the bucket state -
meaning a corrupt needs_discard btree could cause us to discard a bucket
that we shouldn't.
If check_alloc_info hasn't run yet we just want to bail out, otherwise
it's a filesystem inconsistent error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Nested transaction restart handling is typically best avoided; when the
inner context handles a transaction restart it invalidates the outer
transaction context, so we need to make sure to return a
transaction_restart_nested error.
This code wasn't doing that, and hit the assertion in
for_each_btree_key() that checks for that via trans->restart_count.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we've got the errors_silent mechanism, we don't have to check
if the reconstruct_alloc option is set all over the place.
Also - users no longer have to explicitly select fsck and fix_errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Buckets usually can't be discarded until the transaction that made them
empty has been committed in the journal.
Tracing has indicated that we're queuing the discard worker excessively,
only for it to skip over many buckets that are still waiting on a
journal commit, discarding only one or two buckets per iteration.
We want to switch to only queuing the discard worker after a journal
flush write, but there's an important optimization we need to preserve:
if a bucket becomes empty and it was never committed in the journal
while it was in use, we want to discard it and reuse it right away -
since overwriting it before the previous writes are flushed from the
device cache eans those writes only cost bus bandwidth.
So, this patch implements a fast path for buckets that can be discarded
right away. We need new locking between the two discard workers; the new
list of buckets being discarded provides that locking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_trigger_alloc() kicks off certain tasks on bucket state changes;
e.g. triggering the bucket discard worker and the invalidate worker.
We've observed the discard worker running too often - most runs it
doesn't do any work, according to the tracepoint - so clearly, we're
kicking it off too often.
This adds an explicit statechange() macro to make these checks more
precise.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Some (bad) devices can have really terrible discard latency; we don't
want them blocking memory reclaim and causing warnings.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When issuing discards, we may need to flush the journal if there's too
many buckets that can't be discarded until a journal flush.
But the heuristic was bad; we should be comparing the number of buckets
that need to flushes against the number of free buckets, not the number
of buckets we saw.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for disk space accounting rewrite: we're going to want to use
a single callback for both of our current triggers, so we need to change
them to have the same type signature first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
for_each_btree_key() handles transaction restarts, like
for_each_btree_key2(), but only calls bch2_trans_begin() after a
transaction restart - for_each_btree_key2() wraps every loop iteration
in a transaction.
The for_each_btree_key() behaviour is problematic when it leads to
holding the SRCU lock that prevents key cache reclaim for an unbounded
amount of time - there's no real need to keep it around.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This code was somewhat convoluted - because originally bch2_lru_set()
could modify the LRU index if there was a collision.
That's no longer the case, so the "create LRU entry" path has no reason
to update the alloc key, so we can separate the handling of the two fsck
errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces bch2_bucket_sectors() and bch2_bucket_sectors_dirty(),
prep work for separately accounting stripe sectors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_update_cached_sectors_list() is closer to how the new disk space
accounting works, called from trans_mark().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The bucket_offset field of bch_backpointer is a 40-bit bitfield, but the
bch2_backpointer_swab() helper uses swab32. This leads to inconsistency
when an on-disk fs is accessed from an opposite endian machine.
As it turns out, we already have an internal swab40() helper that is
used from the bch_alloc_v4 swab callback. Lift it into the backpointers
header file and use it consistently in both places.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A simple test to populate a filesystem on one CPU architecture and
fsck on an arch of the opposite byte order produces errors related
to the fragmentation LRU. This occurs because the 64-bit
fragmentation_lru field is not byte-order swapped when reads detect
that the on-disk/bset key values were written in opposite byte-order
of the current CPU.
Update the bch2_alloc_v4 swab callback to handle fragmentation_lru
as is done for other multi-byte fields. This doesn't affect existing
filesystems when accessed by CPUs of the same endianness because the
->swab() callback is only called when the bset flags indicate an
endianness mismatch between the CPU and on-disk data.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds a superblock error counter for every distinct fsck
error; this means that when analyzing filesystems out in the wild we'll
be able to see what sorts of inconsistencies are being found and repair,
and hence what bugs to look for.
Errors validating bkeys are not yet considered distinct fsck errors, but
this patch adds a new helper, bkey_fsck_err(), in order to add distinct
error types for them as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we can run with unknown btree IDs, we can't directly index btree
IDs into fixed size arrays.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_dev_resize() was never updated for the allocator rewrite with
persistent freelists, and it wasn't noticed because the tests weren't
running fsck - oops.
Fix this by running bch2_dev_freespace_init() for the new buckets.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
members_v2 has dynamically resizable entries so that we can extend
bch_member. The members can no longer be accessed with simple array
indexing Instead members_v2_get is used to find a member's exact
location within the array and returns a copy of that member.
Alternatively member_v2_get_mut retrieves a mutable point to a member.
Signed-off-by: Hunter Shaffer <huntershaffer182456@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.
But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When building bcachefs for 32-bit ARM, there is a compiler warning in
bch2_bucket_gens_invalid() due to use of an incorrect format specifier:
fs/bcachefs/alloc_background.c:530:10: error: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Werror,-Wformat]
529 | prt_printf(err, "bad val size (%lu != %zu)",
| ~~~
| %zu
530 | bkey_val_bytes(k.k), sizeof(struct bch_bucket_gens));
| ^~~~~~~~~~~~~~~~~~~
fs/bcachefs/util.h:223:54: note: expanded from macro 'prt_printf'
223 | #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__)
| ^~~~~~~~~~~
On 64-bit architectures, size_t is 'unsigned long', so there is no
warning when using %lu but on 32-bit architectures, size_t is 'unsigned
int'. Use '%zu', the format specifier for 'size_t', to eliminate the
warning.
Fixes: 4be0d766a7e9 ("bcachefs: bucket_gens btree")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When building bcachefs for 32-bit ARM, there is a compiler warning in
bch2_alloc_v4_invalid() due to use of an incorrect format specifier:
fs/bcachefs/alloc_background.c:246:30: error: format specifies type 'unsigned long' but the argument has type 'unsigned int' [-Werror,-Wformat]
245 | prt_printf(err, "bad val size (%u > %lu)",
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| %u
246 | alloc_v4_u64s(a.v), bkey_val_u64s(k.k));
| ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
fs/bcachefs/bkey.h:58:27: note: expanded from macro 'bkey_val_u64s'
58 | #define bkey_val_u64s(_k) ((_k)->u64s - BKEY_U64s)
| ^
fs/bcachefs/util.h:223:54: note: expanded from macro 'prt_printf'
223 | #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__)
| ^~~~~~~~~~~
This expression is of type 'size_t'. On 64-bit architectures, size_t is
'unsigned long', so there is no warning when using %lu but on 32-bit
architectures, size_t is 'unsigned int'. Use '%zu', the format specifier
for 'size_t' to eliminate the warning.
Fixes: 11be8e8db283 ("bcachefs: New on disk format: Backpointers")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we set bucket data type to BCH_DATA_stripe based on the data
pointer, not just the stripe pointer, it doesn't make sense to check for
no stripe in the .key_invalid method - this is a situation that
shouldn't happen, but our other fsck/repair code handles it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
clang had a few more warnings about enum conversion, and also didn't
like the opts.c initializer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recovery and fsck have many different passes/jobs to do, which always
run in the same order - but not all of them run all the time. Some are
for fsck, some for unclean shutdown, some for version upgrades.
This adds some new structure: a defined list of recovery passes that we
can run in a loop, as well as consolidating the log messages.
The main benefit is consolidating the "should run this recovery pass"
logic, as well as cleaning up the "this recovery pass has finished"
state; instead of having a bunch of ad-hoc state bits in c->flags, we've
now got c->curr_recovery_pass.
By consolidating the "should run this recovery pass" logic, in the
future on disk format upgrades will be able to say "upgrading to this
version requires x passes to run", instead of forcing all of fsck to
run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This folds bch2_bucket_gens_read() into bch2_alloc_read(), doing the
version check there.
This is prep work for enumarating all recovery passes: we need some
cleanup first to make calling all the recovery passes consistent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Version upgrades are not atomic operations: when we do a version upgrade
we need to update the superblock before we start using new features, and
then when the upgrade completes we need to update the superblock again.
This adds a new superblock field so we can detect and handle incomplete
version upgrades.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As part of the forward compatibility patch series, we need to allow for
new key types without complaining loudly when running an old version.
This patch changes the flags parameter of bkey_invalid to an enum, and
adds a new flag to indicate we're being called from the transaction
commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have journal watermarks and alloc watermarks unified,
BTREE_INSERT_USE_RESERVE is redundant and can be deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Excessive inlining may (on some versions of gcc?) cause excessive stack
usage; this turns off some inlining in bch2_check_alloc_info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add two new helpers for printing error messages with __func__ and
bch2_err_str():
- bch_err_fn
- bch_err_msg
Also kill the old error strings in the recovery path, which were causing
us to incorrectly report memory allocation failures - they're not needed
anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We weren't correctly checking the freespace btree - it's an extents
btree, which means we need to iterate over each bucket in a freespace
extent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the recent bkey_ops.min_val_size addition, bkey values are
automatically extended to the size of the current version.
The check in bch2_alloc_v4_invalid() needs to be updated to take this
into account.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's safe to call bch2_trans_update with a k/v pair where the value
hasn't been filled out, as long as the key part has been and the value
is filled out by transaction commit time.
This patch folds the bch2_trans_update() call into bch2_bkey_make_mut(),
eliminating a bit of boilerplate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- bch2_bkey_get_mut() now handles types increasing in size, allocating
a buffer for the type's current size when necessary
- bch2_bkey_make_mut_typed()
- bch2_bkey_get_mut() now initializes the iterator, like
bch2_bkey_get_iter()
Also, refactor so that most of the code is in functions - now macros are
only used for wrappers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce new helpers for a common pattern:
bch2_trans_iter_init();
bch2_btree_iter_peek_slot();
- bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of
the correct type
- bch2_bkey_get_val_typed() copies the val out of the btree to a
(typically stack allocated) variable; it handles the case where the
value in the btree is smaller than the current version of the type,
zeroing out the remainder.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck removes bucket_gens keys for devices that do not exist in the
volume (i.e., if the device was removed). In 'fsck -n' mode, the
associated fsck_err_on() wrapper returns false to skip the key
removal. This proceeds on to the rest of the function, which
eventually segfaults on a NULL bch_dev because the device does not
exist.
Update bch2_check_bucket_gens_key() to skip out of the rest of the
function when the associated device does not exist, regardless of
running fsck in check or repair mode.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In __bch2_alloc_to_v4_mut(), we overrun the buffer we allocate if the
alloc key had backpointers stored in it (which we no longer support).
Fix this with a max() call.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't store backpointers in alloc keys anymore, since we gained the
btree write buffer.
This patch drops support for backpointers in alloc keys, and revs the on
disk format version so that we know a fsck is required.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It appears freespace init can still take awhile, and we've had a report
or two of it getting stuck - let's have it print out where it's at every
10 seconds.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A workqueue resource deadlock has been observed when running fsck
on a filesystem with a full/stuck journal. fsck is not currently
able to repair the fs due to fairly rapid emergency shutdown, but
rather than exit gracefully the fsck process hangs during the
shutdown sequence. Fortunately this is easily recoverable from
userspace, but the root cause involves code shared between the
kernel and userspace and so should be addressed.
The deadlock scenario involves the main task in the bch2_fs_stop()
-> bch2_fs_read_only() path waiting on write references to drain
with the fs state lock held. A bch2_read_only_work() workqueue task
is scheduled on the system_long_wq, blocked on the state lock.
Finally, various other write ref holding workqueue tasks are
scheduled to run on the same workqueue and must complete in order to
release references that the initial task is waiting on.
To avoid this problem, we can split the dependent workqueue tasks
across different workqueues. It's a bit of a waste to create a
dedicated wq for the read-only worker, but there are several tasks
throughout the fs that follow the pattern of acquiring a write
reference and then scheduling to the system wq. Use a local wq
for such tasks to break the subtle dependency between these and the
read-only worker.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This implements a new shutdown path for erasure coding, which is needed
for the upcoming BCH_WRITE_WAIT_FOR_EC write path.
The process is:
- Cancel new stripes being built up
- Close out/cancel open buckets on write points or the partial list
that are for stripes
- Shutdown rebalance/copygc
- Then wait for in flight new stripes to finish
With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill
up before they complete; the new ec shutdown path is needed for shutting
down copygc/rebalance without deadlocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Like bch2_trans_mark_bucket(), we shouldn't be incrementing a bucket gen
while it's still open - erasure coding was hitting this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have much more efficient updates to the LRU btree, this
patch adds a new LRU that indexes buckets by fragmentation.
This means copygc no longer has to scan every bucket to find buckets
that need to be evacuated.
Changes:
- A new field in bch_alloc_v4, fragmentation_lru - this corresponds to
the bucket's position in the fragmentation LRU. We add a new field
for this instead of calculating it as needed because we may make the
fragmentation LRU optional; this field indicates whether a bucket is
on the fragmentation LRU.
Also, zoned devices will introduce variable bucket sizes; explicitly
recording the LRU position will be safer for them.
- A new copygc path for using the fragmentation LRU instead of
scanning every bucket and building up an in-memory heap.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch is prep work for the following patch.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Make sure to check for lru entries that point to buckets that don't
exist as well as buckets in the wrong state, and improve the error
message we print out.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch changes how the LRU index works:
Instead of using KEY_TYPE_lru where the bucket the lru entry points to
is part of the value, this switches to KEY_TYPE_set and encoding the
bucket we refer to in the low bits of the key.
This means that we no longer have to check for collisions when inserting
LRU entries. We'll be making using of this in the next patch, which adds
a btree write buffer - a pure write buffer for btree updates, where
updates are appended to a simple array and then periodically sorted and
batch inserted.
This is a new on disk format version, and a forced upgrade.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To improve mount times, add a btree for just bucket gens, 256 of them
per key: this means we'll have to scan drastically less metadata at
startup.
This adds
- trigger for keeping it in sync with the all btree
- initialization code, for filesystems from previous versions
- new path for reading bucket gens
- new fsck code
And a new on disk format version.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>