2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

6477 Commits

Author SHA1 Message Date
Bryan Gurney
cc7a7fb3b6 dm dust: change ret to r in dust_map_read and dust_map
In the dust_map_read() and dust_map() functions, change the
return code variable "ret" to "r", to match the convention of the
other device-mapper targets.

Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:56:08 -05:00
Bryan Gurney
6ec1be5015 dm dust: change result vars to r
Change the "result" variables to "r" in dust_status() and
dust_message().

Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:55:31 -05:00
Mikulas Patocka
26b924b93c dm cache: replace spin_lock_irqsave with spin_lock_irq
If we are in a place where it is known that interrupts are enabled,
functions spin_lock_irq/spin_unlock_irq should be used instead of
spin_lock_irqsave/spin_unlock_irqrestore.

spin_lock_irq and spin_unlock_irq are faster because they don't need to
push and pop the flags register.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:53:04 -05:00
Mikulas Patocka
235bc86160 dm bio prison: replace spin_lock_irqsave with spin_lock_irq
Replace spin_lock_irqsave/irqrestore with spin_lock_irq/spin_unlock_irq.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:53:03 -05:00
Mikulas Patocka
8e0c9dacc3 dm thin: replace spin_lock_irqsave with spin_lock_irq
If we are in a place where it is known that interrupts are enabled,
functions spin_lock_irq/spin_unlock_irq should be used instead of
spin_lock_irqsave/spin_unlock_irqrestore.

spin_lock_irq and spin_unlock_irq are faster because they don't need to
push and pop the flags register.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:38:26 -05:00
Nikos Tsironis
52c67d416b dm clone: add bucket_lock_irq/bucket_unlock_irq helpers
Introduce bucket_lock_irq() and bucket_unlock_irq() helpers and use them
in places where it is known that interrupts are enabled.

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:36:34 -05:00
Mikulas Patocka
6ca43ed837 dm clone: replace spin_lock_irqsave with spin_lock_irq
If we are in a place where it is known that interrupts are enabled,
functions spin_lock_irq/spin_unlock_irq should be used instead of
spin_lock_irqsave/spin_unlock_irqrestore.

spin_lock_irq and spin_unlock_irq are faster because they don't need to
push and pop the flags register.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:35:42 -05:00
Maged Mokhtar
c1005322ff dm writecache: handle REQ_FUA
Call writecache_flush() on REQ_FUA in writecache_map().

Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Maged Mokhtar <mmokhtar@petasan.org>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:21:40 -05:00
Mikulas Patocka
8dd85873a0 dm writecache: fix uninitialized variable warning
This fixes coverity warning CID 1454301.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:11:44 -05:00
Gustavo A. R. Silva
8adeac3be0 dm stripe: use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct stripe_c {
        ...
        struct stripe stripe[0];
};

In this case alloc_context() and dm_array_too_big() are removed and
replaced by the direct use of the struct_size() helper in kmalloc().

Notice that open-coded form is prone to type mistakes.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:09:59 -05:00
Heinz Mauelshagen
53be73a5d7 dm raid: streamline rs_get_progress() and its raid_status() caller side
Pass already deciphered state into rs_get_progress, simplify recovery offset
definition and combine two st_resync, st_reshape conditionals into one as is
already the case with st_check and st_repair.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:04:37 -05:00
Heinz Mauelshagen
f9f3ee9130 dm raid: simplify rs_setup_recovery call chain
rs_setup_recovery() sets the starting recovery offset.

Drop superfluous rs_setup_recovery() and replace with __rs_setup_recovery().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:02:52 -05:00
Heinz Mauelshagen
99273d9e6e dm raid: to ensure resynchronization, perform raid set grow in preresume
This fixes a flaw causing raid set extensions not to be synchronized
in case the MD bitmap resize required additional pages to be allocated.

Also share resize code in the raid constructor between
new size changes and those occuring during recovery.

Bump the target version to define the change and document
it in Documentation/admin-guide/device-mapper/dm-raid.rst.

Reported-by: Steve D <steved424@gmail.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:02:26 -05:00
Heinz Mauelshagen
22c992e1a8 dm raid: change rs_set_dev_and_array_sectors API and callers
Add a size argument to rs_set_dev_and_array_sectors as prerequisite
to fixing grown device resynchronization not occuring when new MD
bitmap pages have to be allocated as a result of the extension in
a follwup patch.

Also avoid code duplication by using rs_set_rdev_sectors
in the aforementioned function.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:02:00 -05:00
Mike Snitzer
6ba01df72b dm table: do not allow request-based DM to stack on partitions
Partitioned request-based devices cannot be used as underlying devices
for request-based DM because no partition offsets are added to each
incoming request.  As such, until now, stacking on partitioned devices
would _always_ result in data corruption (e.g. wiping the partition
table, writing to other partitions, etc).  Fix this by disallowing
request-based stacking on partitions.

While at it, since all .request_fn support has been removed from block
core, remove legacy dm-table code that differentiated between blk-mq and
.request_fn request-based.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 11:22:52 -05:00
Yufen Yu
6a5cb53aaa md: no longer compare spare disk superblock events in super_load
We have a test case as follow:

  mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
	--assume-clean --bitmap=internal
  mdadm -S /dev/md1
  mdadm -A /dev/md1 /dev/sd[b-c] --run --force

  mdadm --zero /dev/sda
  mdadm /dev/md1 -a /dev/sda

  echo offline > /sys/block/sdc/device/state
  echo offline > /sys/block/sdb/device/state
  sleep 5
  mdadm -S /dev/md1

  echo running > /sys/block/sdb/device/state
  echo running > /sys/block/sdc/device/state
  mdadm -A /dev/md1 /dev/sd[a-c] --run --force

When we readd /dev/sda to the array, it started to do recovery.
After offline the other two disks in md1, the recovery have
been interrupted and superblock update info cannot be written
to the offline disks. While the spare disk (/dev/sda) can continue
to update superblock info.

After stopping the array and assemble it, we found the array
run fail, with the follow kernel message:

[  172.986064] md: kicking non-fresh sdb from array!
[  173.004210] md: kicking non-fresh sdc from array!
[  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[  173.022406] md1: failed to create bitmap (-5)
[  173.023466] md: md1 stopped.

Since both sdb and sdc have the value of 'sb->events' smaller than
that in sda, they have been kicked from the array. However, the only
remained disk sda is in 'spare' state before stop and it cannot be
added to conf->mirrors[] array. In the end, raid array assemble
and run fail.

In fact, we can use the older disk sdb or sdc to assemble the array.
That means we should not choose the 'spare' disk as the fresh disk in
analyze_sbs().

To fix the problem, we do not compare superblock events when it is
a spare disk, as same as validate_super.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-24 15:22:40 -07:00
David Jeffery
775d78319f md: improve handling of bio with REQ_PREFLUSH in md_flush_request()
If pers->make_request fails in md_flush_request(), the bio is lost. To
fix this, pass back a bool to indicate if the original make_request call
should continue to handle the I/O and instead of assuming the flush logic
will push it to completion.

Convert md_flush_request to return a bool and no longer calls the raid
driver's make_request function.  If the return is true, then the md flush
logic has or will complete the bio and the md make_request call is done.
If false, then the md make_request function needs to keep processing like
it is a normal bio. Let the original call to md_handle_request handle any
need to retry sending the bio to the raid driver's make_request function
should it be needed.

Also mark md_flush_request and the make_request function pointer as
__must_check to issue warnings should these critical return values be
ignored.

Fixes: 2bc13b83e6 ("md: batch flush requests.")
Cc: stable@vger.kernel.org # # v4.19+
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-24 15:22:40 -07:00
Guoqing Jiang
fadcbd2901 md/bitmap: avoid race window between md_bitmap_resize and bitmap_file_clear_bit
We need to move "spin_lock_irq(&bitmap->counts.lock)" before unmap previous
storage, otherwise panic like belows could happen as follows.

[  902.353802] sdl: detected capacity change from 1077936128 to 3221225472
[  902.616948] general protection fault: 0000 [#1] SMP
[snip]
[  902.618588] CPU: 12 PID: 33698 Comm: md0_raid1 Tainted: G           O    4.14.144-1-pserver #4.14.144-1.1~deb10
[  902.618870] Hardware name: Supermicro SBA-7142G-T4/BHQGE, BIOS 3.00       10/24/2012
[  902.619120] task: ffff9ae1860fc600 task.stack: ffffb52e4c704000
[  902.619301] RIP: 0010:bitmap_file_clear_bit+0x90/0xd0 [md_mod]
[  902.619464] RSP: 0018:ffffb52e4c707d28 EFLAGS: 00010087
[  902.619626] RAX: ffe8008b0d061000 RBX: ffff9ad078c87300 RCX: 0000000000000000
[  902.619792] RDX: ffff9ad986341868 RSI: 0000000000000803 RDI: ffff9ad078c87300
[  902.619986] RBP: ffff9ad0ed7a8000 R08: 0000000000000000 R09: 0000000000000000
[  902.620154] R10: ffffb52e4c707ec0 R11: ffff9ad987d1ed44 R12: ffff9ad0ed7a8360
[  902.620320] R13: 0000000000000003 R14: 0000000000060000 R15: 0000000000000800
[  902.620487] FS:  0000000000000000(0000) GS:ffff9ad987d00000(0000) knlGS:0000000000000000
[  902.620738] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  902.620901] CR2: 000055ff12aecec0 CR3: 0000001005207000 CR4: 00000000000406e0
[  902.621068] Call Trace:
[  902.621256]  bitmap_daemon_work+0x2dd/0x360 [md_mod]
[  902.621429]  ? find_pers+0x70/0x70 [md_mod]
[  902.621597]  md_check_recovery+0x51/0x540 [md_mod]
[  902.621762]  raid1d+0x5c/0xeb0 [raid1]
[  902.621939]  ? try_to_del_timer_sync+0x4d/0x80
[  902.622102]  ? del_timer_sync+0x35/0x40
[  902.622265]  ? schedule_timeout+0x177/0x360
[  902.622453]  ? call_timer_fn+0x130/0x130
[  902.622623]  ? find_pers+0x70/0x70 [md_mod]
[  902.622794]  ? md_thread+0x94/0x150 [md_mod]
[  902.622959]  md_thread+0x94/0x150 [md_mod]
[  902.623121]  ? wait_woken+0x80/0x80
[  902.623280]  kthread+0x119/0x130
[  902.623437]  ? kthread_create_on_node+0x60/0x60
[  902.623600]  ret_from_fork+0x22/0x40
[  902.624225] RIP: bitmap_file_clear_bit+0x90/0xd0 [md_mod] RSP: ffffb52e4c707d28

Because mdadm was running on another cpu to do resize, so bitmap_resize was
called to replace bitmap as below shows.

PID: 38801  TASK: ffff9ad074a90e00  CPU: 0   COMMAND: "mdadm"
   [exception RIP: queued_spin_lock_slowpath+56]
   [snip]
-- <NMI exception stack> --
 #5 [ffffb52e60f17c58] queued_spin_lock_slowpath at ffffffff9c0b27b8
 #6 [ffffb52e60f17c58] bitmap_resize at ffffffffc0399877 [md_mod]
 #7 [ffffb52e60f17d30] raid1_resize at ffffffffc0285bf9 [raid1]
 #8 [ffffb52e60f17d50] update_size at ffffffffc038a31a [md_mod]
 #9 [ffffb52e60f17d70] md_ioctl at ffffffffc0395ca4 [md_mod]

And the procedure to keep resize bitmap safe is allocate new storage
space, then quiesce, copy bits, replace bitmap, and re-start.

However the daemon (bitmap_daemon_work) could happen even the array is
quiesced, which means when bitmap_file_clear_bit is triggered by raid1d,
then it thinks it should be fine to access store->filemap since
counts->lock is held, but resize could change the storage without the
protection of the lock.

Cc: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-24 15:22:40 -07:00
Dan Carpenter
e3fc3f3d09 md/raid0: Fix an error message in raid0_make_request()
The first argument to WARN() is supposed to be a condition.  The
original code will just print the mdname() instead of the full warning
message.

Fixes: c84a1372df ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-24 15:22:40 -07:00
Linus Torvalds
d418d07005 for-linus-2019-10-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2qbF0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptsuEADEKL8pta74uy50pl0t8l9fZ++U+wdIeEIW
 9uumpOEPnI2GpkG1sOyKWK6tl8InQLw6pAquP9MoT2BHXqFHk7NIgtvk67lwQeoc
 dRwklVfvOLAdnKzyfODqE9Fh9BgczZIuOLzgdtNqrPKqgJfFRCwN94Kj/r2tYuy7
 v+riK3A49u12dOLtjU6ciNgZ0m1iUX9s0+PFYVUXtJHU/1OYToQaKP+sgWiue0Ca
 VJP/L4MLYD0a7tfd92WAK7xWLsYWTDw1Gg20hXH/tV+IIDQ5+OXhu2s6PuqI7c0y
 cZqWHQHBDkZMQvT8+V+YqZtEa+xwVCom51prJEPasmdq3fGx+2sDC1HQiySao1ML
 wfFxZvFvY9fm6M7p2xsSNEcOmamrx1aLLyNSbjIvAqLUDYJWWS56BHsKyTU5Z+Jp
 RA9dpq8iR6ISaIAcFf0IB0pJSv1HEeHyo/ixlALqezBFJaMdhWy/M+dEbWKtix9M
 s19ozcpe+omN9+O0anlLtzKNgj2Xnjiwuu8mhVcqn6uG/p6GUOup+lNvTW/fig3I
 JBH8kObjYXL181V9rYVqFutnuqcf2HYqMvV2vzAmg4LYnPVUmU7HMj8zEpxc4N+f
 Evd77j0wXmY9S+4JERxaqQZuvKBEIkvM1rkk3N4NbNghfa7QL4aW+I9cWtuelPC2
 E+DK7if0Gg==
 =rvkw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Keith that address deadlocks, double resets,
   memory leaks, and other regression.

 - Fixup elv_support_iosched() for bio based devices (Damien)

 - Fixup for the ahci PCS quirk (Dan)

 - Socket O_NONBLOCK handling fix for io_uring (me)

 - Timeout sequence io_uring fixes (yangerkun)

 - MD warning fix for parameter default_layout (Song)

 - blkcg activation fixes (Tejun)

 - blk-rq-qos node deletion fix (Tejun)

* tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block:
  nvme-pci: Set the prp2 correctly when using more than 4k page
  io_uring: fix logic error in io_timeout
  io_uring: fix up O_NONBLOCK handling for sockets
  md/raid0: fix warning message for parameter default_layout
  libata/ahci: Fix PCS quirk application
  blk-rq-qos: fix first node deletion of rq_qos_del()
  blkcg: Fix multiple bugs in blkcg_activate_policy()
  io_uring: consider the overflow of sequence for timeout req
  nvme-tcp: fix possible leakage during error flow
  nvmet-loop: fix possible leakage during error flow
  block: Fix elv_support_iosched()
  nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
  nvme: Wait for reset state when required
  nvme: Prevent resets during paused controller state
  nvme: Restart request timers in resetting state
  nvme: Remove ADMIN_ONLY state
  nvme-pci: Free tagset if no IO queues
  nvme: retain split access workaround for capability reads
  nvme: fix possible deadlock when nvme_update_formats fails
2019-10-18 22:29:36 -04:00
Mikulas Patocka
13bd677a47 dm cache: fix bugs when a GFP_NOWAIT allocation fails
GFP_NOWAIT allocation can fail anytime - it doesn't wait for memory being
available and it fails if the mempool is exhausted and there is not enough
memory.

If we go down this path:
  map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT)
we can see that map_bio() doesn't check the return value of mg_start(),
and the bio is leaked.

If we go down this path:
  map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell ->
  dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) ->
  mg_lock_writes -> mg_complete
the bio is ended with an error - it is unacceptable because it could
cause filesystem corruption if the machine ran out of memory
temporarily.

Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly
wait until memory becomes available. mempool_alloc with GFP_NOIO can't
fail, so remove the code paths that deal with allocation failure.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-17 11:13:50 -04:00
Song Liu
3874d73e06 md/raid0: fix warning message for parameter default_layout
The message should match the parameter, i.e. raid0.default_layout.

Fixes: c84a1372df ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Cc: NeilBrown <neilb@suse.de>
Reported-by: Ivan Topolsky <doktor.yak@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-16 09:43:02 -07:00
Mikulas Patocka
b21555786f dm snapshot: rework COW throttling to fix deadlock
Commit 721b1d98fb ("dm snapshot: Fix excessive memory usage and
workqueue stalls") introduced a semaphore to limit the maximum number of
in-flight kcopyd (COW) jobs.

The implementation of this throttling mechanism is prone to a deadlock:

1. One or more threads write to the origin device causing COW, which is
   performed by kcopyd.

2. At some point some of these threads might reach the s->cow_count
   semaphore limit and block in down(&s->cow_count), holding a read lock
   on _origins_lock.

3. Someone tries to acquire a write lock on _origins_lock, e.g.,
   snapshot_ctr(), which blocks because the threads at step (2) already
   hold a read lock on it.

4. A COW operation completes and kcopyd runs dm-snapshot's completion
   callback, which ends up calling pending_complete().
   pending_complete() tries to resubmit any deferred origin bios. This
   requires acquiring a read lock on _origins_lock, which blocks.

   This happens because the read-write semaphore implementation gives
   priority to writers, meaning that as soon as a writer tries to enter
   the critical section, no readers will be allowed in, until all
   writers have completed their work.

   So, pending_complete() waits for the writer at step (3) to acquire
   and release the lock. This writer waits for the readers at step (2)
   to release the read lock and those readers wait for
   pending_complete() (the kcopyd thread) to signal the s->cow_count
   semaphore: DEADLOCK.

The above was thoroughly analyzed and documented by Nikos Tsironis as
part of his initial proposal for fixing this deadlock, see:
https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html

Fix this deadlock by reworking COW throttling so that it waits without
holding any locks. Add a variable 'in_progress' that counts how many
kcopyd jobs are running. A function wait_for_in_progress() will sleep if
'in_progress' is over the limit. It drops _origins_lock in order to
avoid the deadlock.

Reported-by: Guruswamy Basavaiah <guru2018@gmail.com>
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Tested-by: Nikos Tsironis <ntsironis@arrikto.com>
Fixes: 721b1d98fb ("dm snapshot: Fix excessive memory usage and workqueue stalls")
Cc: stable@vger.kernel.org # v5.0+
Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-10 09:46:05 -04:00
Mikulas Patocka
a2f83e8b0c dm snapshot: introduce account_start_copy() and account_end_copy()
This simple refactoring moves code for modifying the semaphore cow_count
into separate functions to prepare for changes that will extend these
methods to provide for a more sophisticated mechanism for COW
throttling.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-10 09:45:48 -04:00
YueHaibing
0a005856d3 dm clone: Make __hash_find static
drivers/md/dm-clone-target.c:594:34: warning:
 symbol '__hash_find' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-08 14:04:54 -04:00
Linus Torvalds
2e959dd87a for-5.4/post-2019-09-24
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2J8xQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpujgD/94s9GGKN8JShxCpT0YNuWyyFF5gNlaimQU
 RSGAwnv2YUgEGNSUOPpcaj5FAYhTfYzbqoHlE+jytA2U5KXTOhc5Z85QV+TY4HPs
 I03xczYuYD/uX0QuF00zU2+6eV3lETELPiBARbfEQdHfm72iwurweHzlh4dfhbxW
 P7UA/cKixXWF2CH9wg5347Ll93nD24f2pi8BUyLJi/xpdlaRrN11Ii8AzNlRmq52
 VRxURuogl98W89F6EV2VhPGFgUEYHY2Ot7II2OqqV+jmjHDQW9y5hximzINOqkxs
 bQwo5J+WrDSPoqwl8+db2k7QQjAl1XKDAHmCwz+7J/BoOgZj8/M1FMBwzita+5x+
 UqxEYe7k+2G3w2zuhBrq03BypU8pwqFep/QI0cCCPaHs4J5QnkVOScEqd6iV/C3T
 FPvMvqDf7MrElghj4Qa2IZlh/CgqmLG5NUEz8E40cXkdiP+E+eK9ZY2Uwx2XhBrm
 7Gl+SpG5DxWqqJeRNVWjFwM4p5L+01NtwDbTjZ1rsf+mCW5cNsy/L9B4UpPz4HxW
 coAs0y/Ce+ZhCopIXZ4jLDBoTG9yoVg8EcyfaHKD2Zz0mUFxa2xm+LvXKeT49qqx
 xuodpKD3fiuM7h9Xgv+cDsmn8Rr8gSeXEGV7qzpudmkxbp6IVg/yG5hC/dM921GR
 EVrRtUIwdw==
 =aAPP
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "Some later additions that weren't quite done for the first pull
  request, and also a few fixes that have arrived since.

  This contains:

   - Kill silly pktcdvd warning on attempting to register a non-scsi
     passthrough device (me)

   - Use symbolic constants for the block t10 protection types, and
     switch to handling it in core rather than in the drivers (Max)

   - libahci platform missing node put fix (Nishka)

   - Small series of fixes for BFQ (Paolo)

   - Fix possible nbd crash (Xiubo)"

* tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block:
  block: drop device references in bsg_queue_rq()
  block: t10-pi: fix -Wswitch warning
  pktcdvd: remove warning on attempting to register non-passthrough dev
  ata: libahci_platform: Add of_node_put() before loop exit
  nbd: fix possible page fault for nbd disk
  nbd: rename the runtime flags as NBD_RT_ prefixed
  block, bfq: push up injection only after setting service time
  block, bfq: increase update frequency of inject limit
  block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1
  block, bfq: update inject limit only after injection occurred
  block: centralize PI remapping logic to the block layer
  block: use symbolic constants for t10_pi type
2019-09-24 16:31:50 -07:00
Linus Torvalds
3e414b5bd2 - crypto and DM crypt advances that allow the crypto API to reclaim
implementation details that do not belong in DM crypt.  The wrapper
   template for ESSIV generation that was factored out will also be used
   by fscrypt in the future.
 
 - Add root hash pkcs#7 signature verification to the DM verity target.
 
 - Add a new "clone" DM target that allows for efficient remote
   replication of a device.
 
 - Enhance DM bufio's cache to be tailored to each client based on use.
   Clients that make heavy use of the cache get more of it, and those
   that use less have reduced cache usage.
 
 - Add a new DM_GET_TARGET_VERSION ioctl to allow userspace to query the
   version number of a DM target (even if the associated module isn't yet
   loaded).
 
 - Fix invalid memory access in DM zoned target.
 
 - Fix the max_discard_sectors limit advertised by the DM raid target; it
   was mistakenly storing the limit in bytes rather than sectors.
 
 - Small optimizations and cleanups in DM writecache target.
 
 - Various fixes and cleanups in DM core, DM raid1 and space map portion
   of DM persistent data library.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl2D7ycTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWp9QCACwTkVGzPGMCbAaCVlCACo8B5JyY4OO
 FNxucqUlt1MHKuBbzJd4XwNGlLg68xjMUKVPYPlgina7TaDl+wvlTbHchaJS8nak
 x1zyhDSywy0F9f6HHiXJi/vshmAfa0xnIM6fQXVPM346S6xf9u7hqOJQMCrdvY92
 w4FhuW9nVt5xizo8iC/3LzoWbhrWncT7dyZUZtG3/tmglhkEK7QwctlgQxcD7tXg
 H1lhntQzHzpxQAVBefWWdw7ubuDd6XCHuQMaxRhyR++c62P3eKDR8ck9hhd3hZKv
 E481gtxcsjKuYLxwULjqFJZaNFitWFNMJ7gppQyKRqCzn2zlGAL6npl8
 =m6zD
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - crypto and DM crypt advances that allow the crypto API to reclaim
   implementation details that do not belong in DM crypt. The wrapper
   template for ESSIV generation that was factored out will also be used
   by fscrypt in the future.

 - Add root hash pkcs#7 signature verification to the DM verity target.

 - Add a new "clone" DM target that allows for efficient remote
   replication of a device.

 - Enhance DM bufio's cache to be tailored to each client based on use.
   Clients that make heavy use of the cache get more of it, and those
   that use less have reduced cache usage.

 - Add a new DM_GET_TARGET_VERSION ioctl to allow userspace to query the
   version number of a DM target (even if the associated module isn't
   yet loaded).

 - Fix invalid memory access in DM zoned target.

 - Fix the max_discard_sectors limit advertised by the DM raid target;
   it was mistakenly storing the limit in bytes rather than sectors.

 - Small optimizations and cleanups in DM writecache target.

 - Various fixes and cleanups in DM core, DM raid1 and space map portion
   of DM persistent data library.

* tag 'for-5.4/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (22 commits)
  dm: introduce DM_GET_TARGET_VERSION
  dm bufio: introduce a global cache replacement
  dm bufio: remove old-style buffer cleanup
  dm bufio: introduce a global queue
  dm bufio: refactor adjust_total_allocated
  dm bufio: call adjust_total_allocated from __link_buffer and __unlink_buffer
  dm: add clone target
  dm raid: fix updating of max_discard_sectors limit
  dm writecache: skip writecache_wait for pmem mode
  dm stats: use struct_size() helper
  dm crypt: omit parsing of the encapsulated cipher
  dm crypt: switch to ESSIV crypto API template
  crypto: essiv - create wrapper template for ESSIV generation
  dm space map common: remove check for impossible sm_find_free() return value
  dm raid1: use struct_size() with kzalloc()
  dm writecache: optimize performance by sorting the blocks for writeback_all
  dm writecache: add unlikely for getting two block with same LBA
  dm writecache: remove unused member pointer in writeback_struct
  dm zoned: fix invalid memory access
  dm verity: add root hash pkcs#7 signature verification
  ...
2019-09-21 10:40:37 -07:00
Max Gurtovoy
54d4e6ab91 block: centralize PI remapping logic to the block layer
Currently t10_pi_prepare/t10_pi_complete functions are called during the
NVMe and SCSi layers command preparetion/completion, but their actual
place should be the block layer since T10-PI is a general data integrity
feature that is used by block storage protocols. Introduce .prepare_fn
and .complete_fn callbacks within the integrity profile that each type
can implement according to its needs.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>

Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY
isn't defined in the config.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17 20:03:49 -06:00
Linus Torvalds
7ad67ca553 for-5.4/block-2019-09-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1/no0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmo9EACFXMbdNmEEUMyRSdOkVLlr7ZlTyQi1tLpB
 YESDPxdBfybzpi0qa8JSaysGIfvSkSjmSAqBqrWPmASOSOL6CK4bbA4fTYbgPplk
 XeHUdgGiG34oCQUn8Xil5reYaTm7I6LQWnWTpVa5fIhAyUYaGJL+987ykoGmpQmB
 Dvf3YSc+8H0RTp9PCMVd6UCGPkZbVlLImGad3PF5ULvTEaE4RCXC2aiAgh0p1l5A
 J2CkRZ+/mio3zN2O4YN7VdPGfr1Wo1iZ834xbIGLegv1miHXagFk7jwTcC7zIt5t
 oSnJnqIg3iCe7SpWt4Bkzw/zy/2UqaspifbCMgw8vychlViVRUHFO5h85Yboo7kQ
 OMLEQPcwjm6dTHv5h1iXF9LW1O7NoiYmmgvApU9uOo1HUrl1X7PZ3JEfUsVHxkOO
 T4D5igf0Krsl1eAbiwEUQzy7vFZ8PlRHqrHgK+fkyotzHu1BJR7OQkYygEfGFOB/
 EfMxplGDpmibYGuWCwDX2bPAmLV3SPUQENReHrfPJRDt5TD1UkFpVGv/PLLhbr0p
 cLYI78DKpDSigBpVMmwq5nTYpnex33eyDTTA8C0sakcsdzdmU5qv30y3wm4nTiep
 f6gZo6IMXwRg/rCgVVrd9SKQAr/8wEzVlsDW3qyi2pVT8sHIgm0tFv7paihXGdDV
 xsKgmTrQQQ==
 =Qt+h
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Two NVMe pull requests:
     - ana log parse fix from Anton
     - nvme quirks support for Apple devices from Ben
     - fix missing bio completion tracing for multipath stack devices
       from Hannes and Mikhail
     - IP TOS settings for nvme rdma and tcp transports from Israel
     - rq_dma_dir cleanups from Israel
     - tracing for Get LBA Status command from Minwoo
     - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
     - Some consolidation between the fabrics transports for handling
       the CAP register
     - reset race with ns scanning fix for fabrics (move fabrics
       commands to a dedicated request queue with a different lifetime
       from the admin request queue)."
     - controller reset and namespace scan races fixes
     - nvme discovery log change uevent support
     - naming improvements from Keith
     - multiple discovery controllers reject fix from James
     - some regular cleanups from various people

 - Series fixing (and re-fixing) null_blk debug printing and nr_devices
   checks (André)

 - A few pull requests from Song, with fixes from Andy, Guoqing,
   Guilherme, Neil, Nigel, and Yufen.

 - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

 - Bio merge handling unification (Christoph)

 - Pick default elevator correctly for devices with special needs
   (Damien)

 - Block stats fixes (Hou)

 - Timeout and support devices nbd fixes (Mike)

 - Series fixing races around elevator switching and device add/remove
   (Ming)

 - sed-opal cleanups (Revanth)

 - Per device weight support for BFQ (Fam)

 - Support for blk-iocost, a new model that can properly account cost of
   IO workloads. (Tejun)

 - blk-cgroup writeback fixes (Tejun)

 - paride queue init fixes (zhengbin)

 - blk_set_runtime_active() cleanup (Stanley)

 - Block segment mapping optimizations (Bart)

 - lightnvm fixes (Hans/Minwoo/YueHaibing)

 - Various little fixes and cleanups

* tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
  null_blk: format pr_* logs with pr_fmt
  null_blk: match the type of parameter nr_devices
  null_blk: do not fail the module load with zero devices
  block: also check RQF_STATS in blk_mq_need_time_stamp()
  block: make rq sector size accessible for block stats
  bfq: Fix bfq linkage error
  raid5: use bio_end_sector in r5_next_bio
  raid5: remove STRIPE_OPS_REQ_PENDING
  md: add feature flag MD_FEATURE_RAID0_LAYOUT
  md/raid0: avoid RAID0 data corruption due to layout confusion.
  raid5: don't set STRIPE_HANDLE to stripe which is in batch list
  raid5: don't increment read_errors on EILSEQ return
  nvmet: fix a wrong error status returned in error log page
  nvme: send discovery log page change events to userspace
  nvme: add uevent variables for controller devices
  nvme: enable aen regardless of the presence of I/O queues
  nvme-fabrics: allow discovery subsystems accept a kato
  nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
  nvme: Remove redundant assignment of cq vector
  nvme: Assign subsys instance from first ctrl
  ...
2019-09-17 16:57:47 -07:00
Mikulas Patocka
afa179eb60 dm: introduce DM_GET_TARGET_VERSION
This commit introduces a new ioctl DM_GET_TARGET_VERSION. It will load a
target that is specified in the "name" entry in the parameter structure
and return its version.

This functionality is intended to be used by cryptsetup, so that it can
query kernel capabilities before activating the device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-16 10:18:01 -04:00
Mikulas Patocka
6e913b28cd dm bufio: introduce a global cache replacement
This commit introduces a global cache replacement (instead of per-client
cleanup).

If one bufio client uses the cache heavily and another client is not using
it, we want to let the first client use most of the cache. The old
algorithm would partition the cache equally betwen the clients and that is
sub-optimal.

For cache replacement, we use the clock algorithm because it doesn't
require taking any lock when the buffer is accessed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13 17:00:21 -04:00
Guoqing Jiang
067df25c83 raid5: use bio_end_sector in r5_next_bio
Actually, we calculate bio's end sector here, so use the common
way for the purpose.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:14:43 -07:00
Guoqing Jiang
feb9bf9849 raid5: remove STRIPE_OPS_REQ_PENDING
This stripe state is not used anymore after commit 51acbcec6c
("md: remove CONFIG_MULTICORE_RAID456"), so remove the obsoleted
state.

gjiang@nb01257:~/md$ grep STRIPE_OPS_REQ_PENDING drivers/md/ -r
drivers/md/raid5.c:					  (1 << STRIPE_OPS_REQ_PENDING) |
drivers/md/raid5.h:	STRIPE_OPS_REQ_PENDING,

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:14:39 -07:00
NeilBrown
33f2c35a54 md: add feature flag MD_FEATURE_RAID0_LAYOUT
Due to a bug introduced in Linux 3.14 we cannot determine the
correctly layout for a multi-zone RAID0 array - there are two
possibilities.

It is possible to tell the kernel which to chose using a module
parameter, but this can be clumsy to use.  It would be best if
the choice were recorded in the metadata.
So add a feature flag for this purpose.
If it is set, then the 'layout' field of the superblock is used
to determine which layout to use.

If this flag is not set, then mddev->layout gets set to -1,
which causes the module parameter to be required.

Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:10:06 -07:00
NeilBrown
c84a1372df md/raid0: avoid RAID0 data corruption due to layout confusion.
If the drives in a RAID0 are not all the same size, the array is
divided into zones.
The first zone covers all drives, to the size of the smallest.
The second zone covers all drives larger than the smallest, up to
the size of the second smallest - etc.

A change in Linux 3.14 unintentionally changed the layout for the
second and subsequent zones.  All the correct data is still stored, but
each chunk may be assigned to a different device than in pre-3.14 kernels.
This can lead to data corruption.

It is not possible to determine what layout to use - it depends which
kernel the data was written by.
So we add a module parameter to allow the old (0) or new (1) layout to be
specified, and refused to assemble an affected array if that parameter is
not set.

Fixes: 20d0189b10 ("block: Introduce new bio_split()")
cc: stable@vger.kernel.org (3.14+)
Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:10:05 -07:00
Guoqing Jiang
6ce220dd2f raid5: don't set STRIPE_HANDLE to stripe which is in batch list
If stripe in batch list is set with STRIPE_HANDLE flag, then the stripe
could be set with STRIPE_ACTIVE by the handle_stripe function. And if
error happens to the batch_head at the same time, break_stripe_batch_list
is called, then below warning could happen (the same report in [1]), it
means a member of batch list was set with STRIPE_ACTIVE.

[7028915.431770] stripe state: 2001
[7028915.431815] ------------[ cut here ]------------
[7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456]
[...]
[7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G           O    4.14.86-1-storage #4.14.86-1.2~deb9
[7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
[7028915.431888] Workqueue: raid5wq raid5_do_work [raid456]
[7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000
[7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456]
[7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286
[7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000
[7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458
[7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4
[7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00
[7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001
[7028915.431906] FS:  0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000
[7028915.431907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0
[7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[7028915.431910] Call Trace:
[7028915.431923]  handle_stripe+0x8e7/0x2020 [raid456]
[7028915.431930]  ? __wake_up_common_lock+0x89/0xc0
[7028915.431935]  handle_active_stripes.isra.58+0x35f/0x560 [raid456]
[7028915.431939]  raid5_do_work+0xc6/0x1f0 [raid456]

Also commit 59fc630b8b ("RAID5: batch adjacent full stripe write")
said "If a stripe is added to batch list, then only the first stripe
of the list should be put to handle_list and run handle_stripe."

So don't set STRIPE_HANDLE to stripe which is already in batch list,
otherwise the stripe could be put to handle_list and run handle_stripe,
then the above warning could be triggered.

[1]. https://www.spinics.net/lists/raid/msg62552.html

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:10:05 -07:00
Nigel Croxon
b76b4715eb raid5: don't increment read_errors on EILSEQ return
While MD continues to count read errors returned by the lower layer.
If those errors are -EILSEQ, instead of -EIO, it should NOT increase
the read_errors count.

When RAID6 is set up on dm-integrity target that detects massive
corruption, the leg will be ejected from the array.  Even if the
issue is correctable with a sector re-write and the array has
necessary redundancy to correct it.

The leg is ejected because it runs up the rdev->read_errors beyond
conf->max_nr_stripes.  The return status in dm-drypt when there is
a data integrity error is -EILSEQ (BLK_STS_PROTECTION).

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:10:05 -07:00
Mikulas Patocka
b132ff3332 dm bufio: remove old-style buffer cleanup
Remove code that cleans up buffers if the cache size grows over the limit.

The next commit will introduce a new global cleanup.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13 10:43:48 -04:00
Mikulas Patocka
af53badc0c dm bufio: introduce a global queue
Rename param_spinlock to global_spinlock and introduce a global queue of
all used buffers.  The queue will be used in the following commits.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13 10:43:27 -04:00
Mikulas Patocka
d0a328a385 dm bufio: refactor adjust_total_allocated
Refactor adjust_total_allocated() so that it takes a bool argument
indicating if it should add or subtract the buffer size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13 10:43:07 -04:00
Mikulas Patocka
26d2ef0cd0 dm bufio: call adjust_total_allocated from __link_buffer and __unlink_buffer
Move the call to adjust_total_allocated() to __link_buffer() and
__unlink_buffer() so that only used buffers are counted.  Reserved
buffers are not.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13 10:42:27 -04:00
Nikos Tsironis
7431b7835f dm: add clone target
Add the dm-clone target, which allows cloning of arbitrary block
devices.

dm-clone produces a one-to-one copy of an existing, read-only source
device into a writable destination device: It presents a virtual block
device which makes all data appear immediately, and redirects reads and
writes accordingly.

The main use case of dm-clone is to clone a potentially remote,
high-latency, read-only, archival-type block device into a writable,
fast, primary-type device for fast, low-latency I/O. The cloned device
is visible/mountable immediately and the copy of the source device to
the destination device happens in the background, in parallel with user
I/O.

When the cloning completes, the dm-clone table can be removed altogether
and be replaced, e.g., by a linear table, mapping directly to the
destination device.

For further information and examples of how to use dm-clone, please read
Documentation/admin-guide/device-mapper/dm-clone.rst

Suggested-by: Vangelis Koukis <vkoukis@arrikto.com>
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-12 09:32:31 -04:00
Ming Lei
c8156fc77d dm raid: fix updating of max_discard_sectors limit
Unit of 'chunk_size' is byte, instead of sector, so fix it by setting
the queue_limits' max_discard_sectors to rs->md.chunk_sectors.  Also,
rename chunk_size to chunk_size_bytes.

Without this fix, too big max_discard_sectors is applied on the request
queue of dm-raid, finally raid code has to split the bio again.

This re-split done by raid causes the following nested clone_endio:

1) one big bio 'A' is submitted to dm queue, and served as the original
bio

2) one new bio 'B' is cloned from the original bio 'A', and .map()
is run on this bio of 'B', and B's original bio points to 'A'

3) raid code sees that 'B' is too big, and split 'B' and re-submit
the remainded part of 'B' to dm-raid queue via generic_make_request().

4) now dm will handle 'B' as new original bio, then allocate a new
clone bio of 'C' and run .map() on 'C'. Meantime C's original bio
points to 'B'.

5) suppose now 'C' is completed by raid directly, then the following
clone_endio() is called recursively:

	clone_endio(C)
		->clone_endio(B)		#B is original bio of 'C'
			->bio_endio(A)

'A' can be big enough to make hundreds of nested clone_endio(), then
stack can be corrupted easily.

Fixes: 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-11 16:18:23 -04:00
Damien Le Moal
737eb78e82 block: Delay default elevator initialization
When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
the only information known about the device is the number of hardware
queues as the block device scan by the device driver is not completed
yet for most drivers. The device type and elevator required features
are not set yet, preventing to correctly select the default elevator
most suitable for the device.

This currently affects all multi-queue zoned block devices which default
to the "none" elevator instead of the required "mq-deadline" elevator.
These drives currently include host-managed SMR disks connected to a
smartpqi HBA and null_blk block devices with zoned mode enabled.
Upcoming NVMe Zoned Namespace devices will also be affected.

Fix this by adding the boolean elevator_init argument to
blk_mq_init_allocated_queue() to control the execution of
elevator_init_mq(). Two cases exist:
1) elevator_init = false is used for calls to
   blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
   case, a call to elevator_init_mq() is added to __device_add_disk(),
   resulting in the delayed initialization of the queue elevator
   after the device driver finished probing the device information. This
   effectively allows elevator_init_mq() access to more information
   about the device.
2) elevator_init = true preserves the current behavior of initializing
   the elevator directly from blk_mq_init_allocated_queue(). This case
   is used for the special request based DM devices where the device
   gendisk is created before the queue initialization and device
   information (e.g. queue limits) is already known when the queue
   initialization is executed.

Additionally, to make sure that the elevator initialization is never
done while requests are in-flight (there should be none when the device
driver calls device_add_disk()), freeze and quiesce the device request
queue before calling blk_mq_init_sched() in elevator_init_mq().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-05 19:52:34 -06:00
Huaisheng Ye
6d1959138c dm writecache: skip writecache_wait for pmem mode
The array bio_in_progress[2] only have chance to be increased and
decreased with ssd mode. For pmem mode, they are not involved at all.
So skip writecache_wait_for_ios in writecache_flush for pmem.

Suggested-by: Doris Yu <tyu1@lenovo.com>
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-05 13:22:05 -04:00
Gustavo A. R. Silva
fb16c799b8 dm stats: use struct_size() helper
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct dm_stat {
	...
        struct dm_stat_shared stat_shared[0];
};

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

So, replace the following form:

sizeof(struct dm_stat) + (size_t)n_entries * sizeof(struct dm_stat_shared)

with:

struct_size(s, stat_shared, n_entries)

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-04 09:39:22 -04:00
Guoqing Jiang
b0f01ecf29 md/raid5: use bio_end_sector to calculate last_sector
Use the common way to get last_sector.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03 14:52:38 -07:00
Yufen Yu
07f1a6850c md/raid1: fail run raid1 array when active disk less than one
When run test case:
  mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
  mdadm -S /dev/md1
  mdadm -A /dev/md1 /dev/sd[b-c] --run --force

  mdadm --zero /dev/sda
  mdadm /dev/md1 -a /dev/sda

  echo offline > /sys/block/sdc/device/state
  echo offline > /sys/block/sdb/device/state
  sleep 5
  mdadm -S /dev/md1

  echo running > /sys/block/sdb/device/state
  echo running > /sys/block/sdc/device/state
  mdadm -A /dev/md1 /dev/sd[a-c] --run --force

mdadm run fail with kernel message as follow:
[  172.986064] md: kicking non-fresh sdb from array!
[  173.004210] md: kicking non-fresh sdc from array!
[  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[  173.022406] md1: failed to create bitmap (-5)

In fact, when active disk in raid1 array less than one, we
need to return fail in raid1_run().

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03 14:52:03 -07:00
Guilherme G. Piccoli
62f7b1989c md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone
Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.

In other words, no -EIO is returned and writes (except direct ones) appear
normal. Meaning the user might think the wrote data is correctly stored in
the array, but instead garbage was written given that raid0 does stripping
(and so, it requires all its members to be working in order to not corrupt
data). For md/linear, writes to the available members will work fine, but
if the writes go to the missing member(s), it'll cause a file corruption
situation, whereas the portion of the writes to the missing devices aren't
written effectively.

This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.

A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written in 'array_state' as it just shows
one or more members of the array are missing but acts like 'clean', it
wouldn't make sense to write it.

With this patch, the filesystem reacts much faster to the event of missing
array member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
in the disk and after a machine reboot, e2fsck shows some severe fs errors
that demand fixing. This patch was tested in ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.

Cc: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03 14:49:28 -07:00
Ard Biesheuvel
b1d1e29639 dm crypt: omit parsing of the encapsulated cipher
Only the ESSIV IV generation mode used to use cc->cipher so it could
instantiate the bare cipher used to encrypt the IV. However, this is
now taken care of by the ESSIV template, and so no users of cc->cipher
remain. So remove it altogether.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-03 16:46:16 -04:00
Ard Biesheuvel
a1a262b66e dm crypt: switch to ESSIV crypto API template
Replace the explicit ESSIV handling in the dm-crypt driver with calls
into the crypto API, which now possesses the capability to perform
this processing within the crypto subsystem.

Note that we reorder the AEAD cipher_api string parsing with the TFM
instantiation: this is needed because cipher_api is mangled by the
ESSIV handling, and throws off the parsing of "authenc(" otherwise.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-03 16:45:54 -04:00
Kent Overstreet
a22a9602b8 closures: fix a race on wakeup from closure_sync
The race was when a thread using closure_sync() notices cl->s->done == 1
before the thread calling closure_put() calls wake_up_process(). Then,
it's possible for that thread to return and exit just before
wake_up_process() is called - so we're trying to wake up a process that
no longer exists.

rcu_read_lock() is sufficient to protect against this, as there's an rcu
barrier somewhere in the process teardown path.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-03 08:08:31 -06:00
Dan Carpenter
d66c9920c0 bcache: Fix an error code in bch_dump_read()
The copy_to_user() function returns the number of bytes remaining to be
copied, but the intention here was to return -EFAULT if the copy fails.

Fixes: cafe563591 ("bcache: A block layer cache")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-03 08:08:29 -06:00
Shile Zhang
d55a4ae9e1 bcache: add cond_resched() in __bch_cache_cmp()
Read /sys/fs/bcache/<uuid>/cacheN/priority_stats can take very long
time with huge cache after long run.

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Tested-by: Heitor Alves de Siqueira <halves@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-03 08:08:28 -06:00
Nigel Croxon
0009fad033 raid5 improve too many read errors msg by adding limits
Often limits can be changed by admin. When discussing such things
it helps if you can provide "self-sustained" facts. Also
sometimes the admin thinks he changed a limit, but it did not
take effect for some reason or he changed the wrong thing.

V3: Only pr_warn when Faulty is 0.
V2: Add read_errors value to pr_warn.

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-27 12:36:37 -07:00
NeilBrown
9d4b45d6af md: don't report active array_state until after revalidate_disk() completes.
Until revalidate_disk() has completed, the size of a new md array will
appear to be zero.
So we shouldn't report, through array_state, that the array is active
until that time.
udev rules check array_state to see if the array is ready.  As soon as
it appear to be zero, fsck can be run.  If it find the size to be
zero, it will fail.

So add a new flag to provide an interlock between do_md_run() and
array_state_show().  This flag is set while do_md_run() is active and
it prevents array_state_show() from reporting that the array is
active.

Before do_md_run() is called, ->pers will be NULL so array is
definitely not active.
After do_md_run() is called, revalidate_disk() will have run and the
array will be completely ready.

We also move various sysfs_notify*() calls out of md_run() into
do_md_run() after MD_NOT_READY is cleared.  This ensure the
information is ready before the notification is sent.

Prior to v4.12, array_state_show() was called with the
mddev->reconfig_mutex held, which provided exclusion with do_md_run().

Note that MD_NOT_READY cleared twice.  This is deliberate to cover
both success and error paths with minimal noise.

Fixes: b7b17c9b67 ("md: remove mddev_lock() from md_attr_show()")
Cc: stable@vger.kernel.org (v4.12++)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-27 12:36:37 -07:00
NeilBrown
480523feae md: only call set_in_sync() when it is expected to succeed.
Since commit 4ad23a9764 ("MD: use per-cpu counter for
writes_pending"), set_in_sync() is substantially more expensive: it
can wait for a full RCU grace period which can be 10s of milliseconds.

So we should only call it when the cost is justified.

md_check_recovery() currently calls set_in_sync() every time it finds
anything to do (on non-external active arrays).  For an array
performing resync or recovery, this will be quite often.
Each call will introduce a delay to the md thread, which can noticeable
affect IO submission latency.

In md_check_recovery() we only need to call set_in_sync() if
'safemode' was non-zero at entry, meaning that there has been not
recent IO.  So we save this "safemode was nonzero" state, and only
call set_in_sync() if it was non-zero.

This measurably reduces mean and maximum IO submission latency during
resync/recovery.

Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Fixes: 4ad23a9764 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (v4.12+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-27 12:36:36 -07:00
ZhangXiaoxu
c1499a044d dm space map common: remove check for impossible sm_find_free() return value
The function sm_find_free() just returns -ENOSPC and 0.
So remove lone caller's check for some other error.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 15:39:53 -04:00
Gustavo A. R. Silva
bcd676542c dm raid1: use struct_size() with kzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct mirror_set {
	...
        struct mirror mirror[0];
};

size = sizeof(struct mirror_set) + count * sizeof(struct mirror);
instance = kzalloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, mirror, count), GFP_KERNEL)

Notice that, in this case, variable len is not necessary, hence it
is removed.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 11:05:32 -04:00
Huaisheng Ye
5229b4896e dm writecache: optimize performance by sorting the blocks for writeback_all
During the process of writeback, the blocks, which have been placed in wbl.list
for writeback soon, are partially ordered for the contiguous ones.

When writeback_all has been set, for most cases, also by default, there will be
a lot of blocks in pmem need to writeback at the same time.
For this case, we could optimize the performance by sorting all blocks in
wbl.list. writecache_writeback doesn't need to get blocks from the tail of
wc->lru, whereas from the first rb_node from the rb_tree.

The benefit is that, writecache_writeback doesn't need to have any cost to sort
the blocks, because of all blocks are incremental originally in rb_tree.
There will be a writecache_flush when writeback_all begins to work, that will
eliminate duplicate blocks in cache by committed/uncommitted.

Testing platform: Thinksystem SR630 with persistent memory.
The cache comes from pmem, which has 1006MB size. The origin device is HDD, 2GB
of which for using.

Testing steps:
 1) dmsetup create mycache --table '0 4194304 writecache p /dev/sdb1 /dev/pmem4  4096 0'
 2) fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
 -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
 3) time dmsetup message /dev/mapper/mycache 0 flush

Here is the results below,
With the patch:
 # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
 -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
   iops        : min= 1582, max=199470, avg=5305.94, stdev=21273.44, samples=197
 # time dmsetup message /dev/mapper/mycache 0 flush
real	0m44.020s
user	0m0.002s
sys	0m0.003s

Without the patch:
 # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
 -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
   iops        : min= 1202, max=197650, avg=4968.67, stdev=20480.17, samples=211
 # time dmsetup message /dev/mapper/mycache 0 flush
real	1m39.221s
user	0m0.001s
sys	0m0.003s

I also have checked the data accuracy with this patch by making EXT4 filesystem
on mycache, then mount it for checking md5 of files on that.
The test result is positive, with this patch it could save more than half of time
when writeback_all.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 10:59:00 -04:00
Huaisheng Ye
62421b3880 dm writecache: add unlikely for getting two block with same LBA
In function writecache_writeback, entries g and f has same original
sector only happens at entry f has been committed, but entry g has
NOT yet.

The probability of this happening is very low in the following
256 blocks at most of entry e.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 10:54:41 -04:00
Huaisheng Ye
58912dbce6 dm writecache: remove unused member pointer in writeback_struct
The stucture member pointer page in writeback_struct never has been
used actually. Remove it.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 10:54:15 -04:00
Mikulas Patocka
0c8e9c2d66 dm zoned: fix invalid memory access
Commit 75d66ffb48 ("dm zoned: properly
handle backing device failure") triggers a coverity warning:

*** CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
/drivers/md/dm-zoned-target.c: 137 in dmz_submit_bio()
131             clone->bi_private = bioctx;
132
133             bio_advance(bio, clone->bi_iter.bi_size);
134
135             refcount_inc(&bioctx->ref);
136             generic_make_request(clone);
>>>     CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
>>>     Dereferencing freed pointer "clone".
137             if (clone->bi_status == BLK_STS_IOERR)
138                     return -EIO;
139
140             if (bio_op(bio) == REQ_OP_WRITE && dmz_is_seq(zone))
141                     zone->wp_block += nr_blocks;
142

The "clone" bio may be processed and freed before the check
"clone->bi_status == BLK_STS_IOERR" - so this check can access invalid
memory.

Fixes: 75d66ffb48 ("dm zoned: properly handle backing device failure")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 10:33:58 -04:00
Jaskaran Khurana
88cd3e6cfa dm verity: add root hash pkcs#7 signature verification
The verification is to support cases where the root hash is not secured
by Trusted Boot, UEFI Secureboot or similar technologies.

One of the use cases for this is for dm-verity volumes mounted after
boot, the root hash provided during the creation of the dm-verity volume
has to be secure and thus in-kernel validation implemented here will be
used before we trust the root hash and allow the block device to be
created.

The signature being provided for verification must verify the root hash
and must be trusted by the builtin keyring for verification to succeed.

The hash is added as a key of type "user" and the description is passed
to the kernel so it can look it up and use it for verification.

Adds CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG which can be turned on if root
hash verification is needed.

Kernel commandline dm_verity module parameter 'require_signatures' will
indicate whether to force root hash signature verification (for all dm
verity volumes).

Signed-off-by: Jaskaran Khurana <jaskarankhurana@linux.microsoft.com>
Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:13:14 -04:00
Ard Biesheuvel
39d13a1ac4 dm crypt: reuse eboiv skcipher for IV generation
Instead of instantiating a separate cipher to perform the encryption
needed to produce the IV, reuse the skcipher used for the block data
and invoke it one additional time for each block to encrypt a zero
vector and use the output as the IV.

For CBC mode, this is equivalent to using the bare block cipher, but
without the risk of ending up with a non-time invariant implementation
of AES when the skcipher itself is time variant (e.g., arm64 without
Crypto Extensions has a NEON based time invariant implementation of
cbc(aes) but no time invariant implementation of the core cipher other
than aes-ti, which is not enabled by default).

This approach is a compromise between dm-crypt API flexibility and
reducing dependence on parts of the crypto API that should not usually
be exposed to other subsystems, such as the bare cipher API.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:13:13 -04:00
Mikulas Patocka
123d87d553 dm: make dm_table_find_target return NULL
Currently, if we pass too high sector number to dm_table_find_target, it
returns zeroed dm_target structure and callers test if the structure is
zeroed with the macro dm_target_is_valid.

However, returning NULL is common practice to indicate errors.

This patch refactors the dm code, so that dm_table_find_target returns
NULL and its callers test the returned value for NULL. The macro
dm_target_is_valid is deleted. In alloc_targets, we no longer allocate an
extra zeroed target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:13:12 -04:00
Mikulas Patocka
1cfd5d3399 dm table: fix invalid memory accesses with too high sector number
If the sector number is too high, dm_table_find_target() should return a
pointer to a zeroed dm_target structure (the caller should test it with
dm_target_is_valid).

However, for some table sizes, the code in dm_table_find_target() that
performs btree lookup will access out of bound memory structures.

Fix this bug by testing the sector number at the beginning of
dm_table_find_target(). Also, add an "inline" keyword to the function
dm_table_get_size() because this is a hot path.

Fixes: 512875bd96 ("dm: table detect io beyond device")
Cc: stable@vger.kernel.org
Reported-by: Zhang Tao <kontais@zoho.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:11:42 -04:00
ZhangXiaoxu
ae148243d3 dm space map metadata: fix missing store of apply_bops() return value
In commit 6096d91af0 ("dm space map metadata: fix occasional leak
of a metadata block on resize"), we refactor the commit logic to a new
function 'apply_bops'.  But when that logic was replaced in out() the
return value was not stored.  This may lead out() returning a wrong
value to the caller.

Fixes: 6096d91af0 ("dm space map metadata: fix occasional leak of a metadata block on resize")
Cc: stable@vger.kernel.org
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-22 16:11:24 -04:00
ZhangXiaoxu
e4f9d60138 dm btree: fix order of block initialization in btree_split_beneath
When btree_split_beneath() splits a node to two new children, it will
allocate two blocks: left and right.  If right block's allocation
failed, the left block will be unlocked and marked dirty.  If this
happened, the left block'ss content is zero, because it wasn't
initialized with the btree struct before the attempot to allocate the
right block.  Upon return, when flushing the left block to disk, the
validator will fail when check this block.  Then a BUG_ON is raised.

Fix this by completely initializing the left block before allocating and
initializing the right block.

Fixes: 4dcb8b57df ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path")
Cc: stable@vger.kernel.org
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-22 16:11:23 -04:00
Wenwen Wang
dc1a3e8e0c dm raid: add missing cleanup in raid_ctr()
If rs_prepare_reshape() fails, no cleanup is executed, leading to
leak of the raid_set structure allocated at the beginning of
raid_ctr(). To fix this issue, go to the label 'bad' if the error
occurs.

Fixes: 11e4723206 ("dm raid: stop keeping raid set frozen altogether")
Cc: stable@vger.kernel.org
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-21 11:47:05 -04:00
Dan Carpenter
e0702d90b7 dm zoned: fix potential NULL dereference in dmz_do_reclaim()
This function is supposed to return error pointers so it matches the
dmz_get_rnd_zone_for_reclaim() function.  The current code could lead to
a NULL dereference in dmz_do_reclaim()

Fixes: b234c6d7a7 ("dm zoned: improve error handling in reclaim")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-21 11:29:30 -04:00
Bryan Gurney
08c04c84a5 dm dust: use dust block size for badblocklist index
Change the "frontend" dust_remove_block, dust_add_block, and
dust_query_block functions to store the "dust block number", instead
of the sector number corresponding to the "dust block number".

For the "backend" functions dust_map_read and dust_map_write,
right-shift by sect_per_block_shift.  This fixes the inability to
emulate failure beyond the first sector of each "dust block" (for
devices with a "dust block size" larger than 512 bytes).

Fixes: e4f3fabd67 ("dm: add dust target")
Cc: stable@vger.kernel.org
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-21 11:27:17 -04:00
Mikulas Patocka
5729b6e5a1 dm integrity: fix a crash due to BUG_ON in __journal_read_write()
Fix a crash that was introduced by the commit 724376a04d. The crash is
reported here: https://gitlab.com/cryptsetup/cryptsetup/issues/468

When reading from the integrity device, the function
dm_integrity_map_continue calls find_journal_node to find out if the
location to read is present in the journal. Then, it calculates how many
sectors are consecutively stored in the journal. Then, it locks the range
with add_new_range and wait_and_add_new_range.

The problem is that during wait_and_add_new_range, we hold no locks (we
don't hold ic->endio_wait.lock and we don't hold a range lock), so the
journal may change arbitrarily while wait_and_add_new_range sleeps.

The code then goes to __journal_read_write and hits
BUG_ON(journal_entry_get_sector(je) != logical_sector); because the
journal has changed.

In order to fix this bug, we need to re-check the journal location after
wait_and_add_new_range. We restrict the length to one block in order to
not complicate the code too much.

Fixes: 724376a04d ("dm integrity: implement fair range locks")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 16:01:57 -04:00
Dmitry Fomichev
ad1bd578bd dm zoned: fix a few typos
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:43 -04:00
Dmitry Fomichev
bae9a0aa33 dm zoned: add SPDX license identifiers
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:42 -04:00
Dmitry Fomichev
75d66ffb48 dm zoned: properly handle backing device failure
dm-zoned is observed to lock up or livelock in case of hardware
failure or some misconfiguration of the backing zoned device.

This patch adds a new dm-zoned target function that checks the status of
the backing device. If the request queue of the backing device is found
to be in dying state or the SCSI backing device enters offline state,
the health check code sets a dm-zoned target flag prompting all further
incoming I/O to be rejected. In order to detect backing device failures
timely, this new function is called in the request mapping path, at the
beginning of every reclaim run and before performing any metadata I/O.

The proper way out of this situation is to do

dmsetup remove <dm-zoned target>

and recreate the target when the problem with the backing device
is resolved.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:42 -04:00
Dmitry Fomichev
d7428c5011 dm zoned: improve error handling in i/o map code
Some errors are ignored in the I/O path during queueing chunks
for processing by chunk works. Since at least these errors are
transient in nature, it should be possible to retry the failed
incoming commands.

The fix -

Errors that can happen while queueing chunks are carried upwards
to the main mapping function and it now returns DM_MAPIO_REQUEUE
for any incoming requests that can not be properly queued.

Error logging/debug messages are added where needed.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:41 -04:00
Dmitry Fomichev
b234c6d7a7 dm zoned: improve error handling in reclaim
There are several places in reclaim code where errors are not
propagated to the main function, dmz_reclaim(). This function
is responsible for unlocking zones that might be still locked
at the end of any failed reclaim iterations. As the result,
some device zones may be left permanently locked for reclaim,
degrading target's capability to reclaim zones.

This patch fixes these issues as follows -

Make sure that dmz_reclaim_buf(), dmz_reclaim_seq_data() and
dmz_reclaim_rnd_data() return error codes to the caller.

dmz_reclaim() function is renamed to dmz_do_reclaim() to avoid
clashing with "struct dmz_reclaim" and is modified to return the
error to the caller.

dmz_get_zone_for_reclaim() now returns an error instead of NULL
pointer and reclaim code checks for that error.

Error logging/debug messages are added where necessary.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:40 -04:00
Dmitry Fomichev
d1fef41465 dm kcopyd: always complete failed jobs
This patch fixes a problem in dm-kcopyd that may leave jobs in
complete queue indefinitely in the event of backing storage failure.

This behavior has been observed while running 100% write file fio
workload against an XFS volume created on top of a dm-zoned target
device. If the underlying storage of dm-zoned goes to offline state
under I/O, kcopyd sometimes never issues the end copy callback and
dm-zoned reclaim work hangs indefinitely waiting for that completion.

This behavior was traced down to the error handling code in
process_jobs() function that places the failed job to complete_jobs
queue, but doesn't wake up the job handler. In case of backing device
failure, all outstanding jobs may end up going to complete_jobs queue
via this code path and then stay there forever because there are no
more successful I/O jobs to wake up the job handler.

This patch adds a wake() call to always wake up kcopyd job wait queue
for all I/O jobs that fail before dm_io() gets called for that job.

The patch also sets the write error status in all sub jobs that are
failed because their master job has failed.

Fixes: b73c67c2cb ("dm kcopyd: add sequential write feature")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:39 -04:00
Mikulas Patocka
cf3591ef83 Revert "dm bufio: fix deadlock with loop device"
Revert the commit bd293d071f. The proper
fix has been made available with commit d0a255e795 ("loop: set
PF_MEMALLOC_NOIO for the worker thread").

Note that the fix offered by commit bd293d071f doesn't really prevent
the deadlock from occuring - if we look at the stacktrace reported by
Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex -
i.e. it has already successfully taken the mutex. Changing the mutex
from mutex_lock to mutex_trylock won't help with deadlocks that happen
afterwards.

PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
   #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
   #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
   #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
   #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
   #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
   #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
   #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
   #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
   #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
   #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
  #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
  #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
  #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
  #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
  #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: bd293d071f ("dm bufio: fix deadlock with loop device")
Depends-on: d0a255e795 ("loop: set PF_MEMALLOC_NOIO for the worker thread")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:39 -04:00
Linus Torvalds
50e73a4a41 for-linus-20190809
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1NmNIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpouEEADA1e2HH9AO8QGWBVSeFi5ZdRnEXsJUfNGD
 d576M9empGI/hkbui0yY2tuodnMwmhUj641GTRC2dzl7RRyGcTMmqtGYPcyczlpU
 +6Yil3+XVNYTXRpUsKWs32H9aubNZl/L3rCcJImGvMgLW5YtEjAZJFIFIzXWWwDJ
 aZpTtnOH1+D3HH6HT35xk+aytSYZ7LsZ7X3LI9ZumKOUd2HJZGUkfWLXxgSuuUh/
 /WaBEQ9xzDNARmfx9Qb/2wSAE7XOInupPr86fI9dnmXHZ8rwhsvHRIZEvNIIqF5Z
 KzbF+rGJZ2bizKFpVFdlrIIyfBleFlQGFzYnGrs/+47zAl/CGicqmuAhGcOdGQXd
 wyj6G66iBcBIQC9hsYhPtglyQSk95tzsPZLZa7/1TlhEKl3mpor4w30NrLz/P5oy
 gdIivDhKP7aRFzBWw/2O4TOn3HhGxnZWVni6icOHE/pBQLBW12Ulc1nT0SiJ8RAt
 PYhMCFwz0Xc7pYuEfGZKwNSCIrx4i7f+spWEAt0Fget/KaB993zm+K4OzGbfBv6r
 FrZ5g/2MTVBJITmJVN2LDysF4EVEOzTULX+bCQOOWC7wmE/poCyNWj8gCpunIZsY
 V5DAB+cbEw/OLfpdiogXvCcxrbukKHV8AVdMdLYB5yEAuPN0JMqcunMjECxpNAha
 1wsv86ljvw==
 =Au30
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190809' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Revert of a bcache patch that caused an oops for some (Coly)

 - ata rb532 unused warning fix (Gustavo)

 - AoE kernel crash fix (He)

 - Error handling fixup for blkdev_get() (Jan)

 - libata read/write translation and SFF PIO fix (me)

 - Use after free and error handling fix for O_DIRECT fragments. There's
   still a nowait + sync oddity in there, we'll nail that start next
   week. If all else fails, I'll queue a revert of the NOWAIT change.
   (me)

 - Loop GFP_KERNEL -> GFP_NOIO deadlock fix (Mikulas)

 - Two BFQ regression fixes that caused crashes (Paolo)

* tag 'for-linus-20190809' of git://git.kernel.dk/linux-block:
  bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"
  loop: set PF_MEMALLOC_NOIO for the worker thread
  bdev: Fixup error handling in blkdev_get()
  block, bfq: handle NULL return value by bfq_init_rq()
  block, bfq: move update of waker and woken list to queue freeing
  block, bfq: reset last_completed_rq_bfqq if the pointed queue is freed
  block: aoe: Fix kernel crash due to atomic sleep when exiting
  libata: add SG safety checks in SFF pio transfers
  libata: have ata_scsi_rw_xlat() fail invalid passthrough requests
  block: fix O_DIRECT error handling for bio fragments
  ata: rb532_cf: Fix unused variable warning in rb532_pata_driver_probe
2019-08-09 09:28:18 -07:00
Coly Li
20621fedb2 bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"
This reverts commit 89e0341af0.

In drivers/md/bcache/sysfs.c:bch_snprint_string_list(), NULL pointer at
the end of list is necessary. Remove the NULL from last element of each
lists will cause the following panic,

[ 4340.455652] bcache: register_cache() registered cache device nvme0n1
[ 4340.464603] bcache: register_bdev() registered backing device sdk
[ 4421.587335] bcache: bch_cached_dev_run() cached dev sdk is running already
[ 4421.587348] bcache: bch_cached_dev_attach() Caching sdk as bcache0 on set 354e1d46-d99f-4d8b-870b-078b80dc88a6
[ 5139.247950] general protection fault: 0000 [#1] SMP NOPTI
[ 5139.247970] CPU: 9 PID: 5896 Comm: cat Not tainted 4.12.14-95.29-default #1 SLE12-SP4
[ 5139.247988] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/18/2019
[ 5139.248006] task: ffff888fb25c0b00 task.stack: ffff9bbacc704000
[ 5139.248021] RIP: 0010:string+0x21/0x70
[ 5139.248030] RSP: 0018:ffff9bbacc707bf0 EFLAGS: 00010286
[ 5139.248043] RAX: ffffffffa7e432e3 RBX: ffff8881c20da02a RCX: ffff0a00ffffff04
[ 5139.248058] RDX: 3f00656863616362 RSI: ffff8881c20db000 RDI: ffffffffffffffff
[ 5139.248075] RBP: ffff8881c20db000 R08: 0000000000000000 R09: ffff8881c20da02a
[ 5139.248090] R10: 0000000000000004 R11: 0000000000000000 R12: ffff9bbacc707c48
[ 5139.248104] R13: 0000000000000fd6 R14: ffffffffc0665855 R15: ffffffffc0665855
[ 5139.248119] FS:  00007faf253b8700(0000) GS:ffff88903f840000(0000) knlGS:0000000000000000
[ 5139.248137] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5139.248149] CR2: 00007faf25395008 CR3: 0000000f72150006 CR4: 00000000007606e0
[ 5139.248164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5139.248179] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5139.248193] PKRU: 55555554
[ 5139.248200] Call Trace:
[ 5139.248210]  vsnprintf+0x1fb/0x510
[ 5139.248221]  snprintf+0x39/0x40
[ 5139.248238]  bch_snprint_string_list.constprop.15+0x5b/0x90 [bcache]
[ 5139.248256]  __bch_cached_dev_show+0x44d/0x5f0 [bcache]
[ 5139.248270]  ? __alloc_pages_nodemask+0xb2/0x210
[ 5139.248284]  bch_cached_dev_show+0x2c/0x50 [bcache]
[ 5139.248297]  sysfs_kf_seq_show+0xbb/0x190
[ 5139.248308]  seq_read+0xfc/0x3c0
[ 5139.248317]  __vfs_read+0x26/0x140
[ 5139.248327]  vfs_read+0x87/0x130
[ 5139.248336]  SyS_read+0x42/0x90
[ 5139.248346]  do_syscall_64+0x74/0x160
[ 5139.248358]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 5139.248370] RIP: 0033:0x7faf24eea370
[ 5139.248379] RSP: 002b:00007fff82d03f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 5139.248395] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007faf24eea370
[ 5139.248411] RDX: 0000000000020000 RSI: 00007faf25396000 RDI: 0000000000000003
[ 5139.248426] RBP: 00007faf25396000 R08: 00000000ffffffff R09: 0000000000000000
[ 5139.248441] R10: 000000007c9d4d41 R11: 0000000000000246 R12: 00007faf25396000
[ 5139.248456] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[ 5139.248892] Code: ff ff ff 0f 1f 80 00 00 00 00 49 89 f9 48 89 cf 48 c7 c0 e3 32 e4 a7 48 c1 ff 30 48 81 fa ff 0f 00 00 48 0f 46 d0 48 85 ff 74 45 <44> 0f b6 02 48 8d 42 01 45 84 c0 74 38 48 01 fa 4c 89 cf eb 0e

The simplest way to fix is to revert commit 89e0341af0 ("bcache: use
sysfs_match_string() instead of __sysfs_match_string()").

This bug was introduced in Linux v5.2, so this fix only applies to
Linux v5.2 is enough for stable tree maintainer.

Fixes: 89e0341af0 ("bcache: use sysfs_match_string() instead of __sysfs_match_string()")
Cc: stable@vger.kernel.org
Cc: Alexandru Ardelean <alexandru.ardelean@analog.com>
Reported-by: Peifeng Lin <pflin@suse.com>
Acked-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-09 07:37:33 -06:00
Hou Tao
449808a254 raid1: factor out a common routine to handle the completion of sync write
It's just code clean-up.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Guoqing Jiang
0d8ed0e9bf md: don't call spare_active in md_reap_sync_thread if all member devices can't work
When add one disk to array, the md_reap_sync_thread is responsible
to activate the spare and set In_sync flag for the new member in
spare_active().

But if raid1 has one member disk A, and disk B is added to the array.
Then we offline A before all the datas are synchronized from A to B,
obviously B doesn't have the latest data as A, but B is still marked
with In_sync flag.

So let's not call spare_active under the condition, otherwise B is
still showed with 'U' state which is not correct.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Guoqing Jiang
062f5b2ae1 md: don't set In_sync if array is frozen
When a disk is added to array, the following path is called in mdadm.

Manage_subdevs -> sysfs_freeze_array
               -> Manage_add
               -> sysfs_set_str(&info, NULL, "sync_action","idle")

Then from kernel side, Manage_add invokes the path (add_new_disk ->
validate_super = super_1_validate) to set In_sync flag.

Since In_sync means "device is in_sync with rest of array", and the new
added disk need to resync thread to help the synchronization of data.
And md_reap_sync_thread would call spare_active to set In_sync for the
new added disk finally. So don't set In_sync if array is in frozen.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Guoqing Jiang
9a567843f7 md: allow last device to be forcibly removed from RAID1/RAID10.
When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed.  This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
This in general this maximises access to data.

However the current implementation also stops an admin from removing
the last device by direct action.  This is rarely useful, but in many
case is not harmful and can make automation easier by removing special
cases.

Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.

So add 'fail_last_dev' member to 'struct mddev', then we can bypasses
the 'last disk' checks for RAID1 and RAID10, and control the behavior
per array by change sysfs node.

Signed-off-by: NeilBrown <neilb@suse.de>
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Andy Shevchenko
cf89160793 md: Convert to use int_pow()
Instead of linear approach to calculate power of 10, use generic int_pow()
which does it better.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Yufen Yu
7cee6d4e60 md/raid10: end bio when the device faulty
Just like raid1, we do not queue write error bio to retry write
and acknowlege badblocks, when the device is faulty.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Yufen Yu
eeba6809d8 md/raid1: end bio when the device faulty
When write bio return error, it would be added to conf->retry_list
and wait for raid1d thread to retry write and acknowledge badblocks.

In narrow_write_error(), the error bio will be split in the unit of
badblock shift (such as one sector) and raid1d thread issues them
one by one. Until all of the splited bio has finished, raid1d thread
can go on processing other things, which is time consuming.

But, there is a scene for error handling that is not necessary.
When the device has been set faulty, flush_bio_list() may end
bios in pending_bio_list with error status. Since these bios
has not been issued to the device actually, error handlding to
retry write and acknowledge badblocks make no sense.

Even without that scene, when the device is faulty, badblocks info
can not be written out to the device. Thus, we also no need to
handle the error IO.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Xiao Ni
143f6e733b md/raid6: Set R5_ReadError when there is read failure on parity disk
7471fb77ce ("md/raid6: Fix anomily when recovering a single device in
RAID6.") avoids rereading P when it can be computed from other members.
However, this misses the chance to re-write the right data to P. This
patch sets R5_ReadError if the re-read fails.

Also, when re-read is skipped, we also missed the chance to reset
rdev->read_errors to 0. It can fail the disk when there are many read
errors on P member disk (other disks don't have read error)

V2: upper layer read request don't read parity/Q data. So there is no
need to consider such situation.

This is Reported-by: kbuild test robot <lkp@intel.com>

Fixes: 7471fb77ce ("md/raid6: Fix anomily when recovering a single device in RAID6.")
Cc: <stable@vger.kernel.org> #4.4+
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Hou Tao
4675719d0f raid1: use an int as the return value of raise_barrier()
Using a sector_t as the return value is misleading, because
raise_barrier() only return 0 or -EINTR.

Also add comments for the return values of raise_barrier().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Ming Lei
226b4fc75c blk-mq: add callback of .cleanup_rq
SCSI maintains its own driver private data hooked off of each SCSI
request, and the pridate data won't be freed after scsi_queue_rq()
returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
(e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
fully dispatched them, due to a lower level SCSI driver's resource
limitation identified in scsi_queue_rq(). Currently SCSI's per-request
private data is leaked when the upper layer driver (dm-rq) frees and
then retries these requests in response to BLK_STS_RESOURCE or
BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().

This usecase is so specialized that it doesn't warrant training an
existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
account for freeing its driver private data -- doing so would add an
extra branch for handling a special case that all other consumers of
SCSI (and blk-mq) won't ever need to worry about.

So the most pragmatic way forward is to delegate freeing SCSI driver
private data to the upper layer driver (dm-rq).  Do so by adding
new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
from dm-rq.  A following commit will implement the .cleanup_rq() hook
in scsi_mq_ops.

Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: <stable@vger.kernel.org>
Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-04 21:42:06 -06:00
Mike Snitzer
9c50a98f55 dm table: fix various whitespace issues with recent DAX code
Also, rename device_synchronous to device_dax_synchronous.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-30 18:59:24 -04:00
Pankaj Gupta
5348deb138 dm table: fix dax_dev NULL dereference in device_synchronous()
If a device doesn't support DAX its 'dax_dev' is NULL.  Fix
device_synchronous() to first check if dax_dev is NULL before
dereferencing it.

Fixes: 2e9ee0955d ("dm: enable synchronous dax")
Reported-by: jencce.kernel@gmail.com
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-30 18:58:54 -04:00
Linus Torvalds
0441281965 for-linus-20190726
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl07DGAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplf5EADOOvOdsz9N/Iw8ZHHHJXCqKR26zZv75G1z
 0h1PGC7p0JZQbYFo0Zo7mjiRBGlg6tlXc2d4Gyl94XJKDwjeYTcFDvbvERdYa+MH
 d2RiFkAfR967Ri4fb+FP5L3mYOQdMJ/zk0xCDHLv/DcxeFLa5a9EJS1+vBSR+AcB
 0JpJWuHypGqGmbTaL0z9q2pmx0mgA1ERlWQtkMLrsEr2Vqg/rrjGwe2bGFY00lXc
 vKtFkpfugKc4zVAPSzC1YZgojfDDpGNEA4QMtxMsEH4hqyMpHhrtUedNY5QrjC0B
 p9h6aPXXYr2KhGP0grrEytzaYUOzK2crK5h+q+1vu6nOgx2EgmnLM9tBu/LuRH1j
 uUzKJOa3/AE+bU7uZEsaUerTBsHrgEBa1x8G92obYRnjgW3aCD2CaSbjjBhNxTZ4
 1dXyr0DTHFXZmfcfWja5tO26JTPzjwVOrwiRyU0S727UsdVJupoHiYLr5fwaDfgn
 /Du2I/XWvFtflm5i0ND0sdcX1yRlFiGZ9e45z1QFaFmcteKKWzRBDlC6mQzI/lw3
 oc583mhDR3tRtJxow+wn6AuMUehFRh8wj0UhL/MEMjLW8GiqXU5aRtanT+22Xz4L
 saNDQieeEnV7raMYXMP0qIhkJtrNASmJQos+MOJAEGOWcS2ePIUUio2kSXie+071
 BphJd2RamQ==
 =HIzH
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Several io_uring fixes/improvements:
     - Blocking fix for O_DIRECT (me)
     - Latter page slowness for registered buffers (me)
     - Fix poll hang under certain conditions (me)
     - Defer sequence check fix for wrapped rings (Zhengyuan)
     - Mismatch in async inc/dec accounting (Zhengyuan)
     - Memory ordering issue that could cause stall (Zhengyuan)
      - Track sequential defer in bytes, not pages (Zhengyuan)

 - NVMe pull request from Christoph

 - Set of hang fixes for wbt (Josef)

 - Redundant error message kill for libahci (Ding)

 - Remove unused blk_mq_sched_started_request() and related ops (Marcos)

 - drbd dynamic alloc shash descriptor to reduce stack use (Arnd)

 - blkcg ->pd_stat() non-debug print (Tejun)

 - bcache memory leak fix (Wei)

 - Comment fix (Akinobu)

 - BFQ perf regression fix (Paolo)

* tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
  io_uring: ensure ->list is initialized for poll commands
  Revert "nvme-pci: don't create a read hctx mapping without read queues"
  nvme: fix multipath crash when ANA is deactivated
  nvme: fix memory leak caused by incorrect subsystem free
  nvme: ignore subnqn for ADATA SX6000LNP
  drbd: dynamically allocate shash descriptor
  block: blk-mq: Remove blk_mq_sched_started_request and started_request
  bcache: fix possible memory leak in bch_cached_dev_run()
  io_uring: track io length in async_list based on bytes
  io_uring: don't use iov_iter_advance() for fixed buffers
  block: properly handle IOCB_NOWAIT for async O_DIRECT IO
  blk-mq: allow REQ_NOWAIT to return an error inline
  io_uring: add a memory barrier before atomic_read
  rq-qos: use a mb for got_token
  rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
  rq-qos: don't reset has_sleepers on spurious wakeups
  rq-qos: fix missed wake-ups in rq_qos_throttle
  wait: add wq_has_single_sleeper helper
  block, bfq: check also in-flight I/O in dispatch plugging
  block: fix sysfs module parameters directory path in comment
  ...
2019-07-26 10:32:12 -07:00
Wei Yongjun
5d9e06d60e bcache: fix possible memory leak in bch_cached_dev_run()
memory malloced in bch_cached_dev_run() and should be freed before
leaving from the error handling cases, otherwise it will cause
memory leak.

Fixes: 0b13efecf5 ("bcache: add return value check to bch_cached_dev_run()")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-22 08:15:17 -06:00
Linus Torvalds
3bfe1fc467 - Fix zone state management race in DM zoned target by eliminating
the unnecessary DMZ_ACTIVE state.
 
 - A couple fixes for issues the DM snapshot target's optional discard
   support added during first week of the 5.3 merge.
 
 - Increase default size of outstanding IO that is allowed for a each
   dm-kcopyd client and introduce tunable to allow user adjust.
 
 - Update DM core to use printk ratelimiting functions rather than
   duplicate them and in doing so fix an issue where DMDEBUG_LIMIT() rate
   limited KERN_DEBUG messages had excessive "callbacks suppressed"
   messages.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl0w2IsTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWgPeCACMtDQrHqMTOT7OPDRxSJjZixefzL32
 lFi31mjjEb7GoxiS3dBepdJmQiUwROINdGLIGTfBAlH05b/8fgFgE6iCGZ9uzad4
 0PNe9q7pbtfQDLXx+mVMjEdK6P/ilmVFXCW8VQpAAeUFL+dwXYHHIbmQZ/rahOz5
 8nn6wGBQ/LRRcbV0hBHNXQymIXPxMweMxO3usSuKbfhe7JjRwslThGbZ4KVwjCwl
 sLl5mEWXwTKUemGXsXFbCbtH/rnZpbaiAkBedT0oV8g8atRBeQyj0vk48htincj7
 Uv6xGjJGXuqUkcvQnTx3C1fk3lH5xJb5MTL3WEN0g6fOmyJd1sMd0gd/
 =4Rpg
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull more device mapper updates from Mike Snitzer:

 - Fix zone state management race in DM zoned target by eliminating the
   unnecessary DMZ_ACTIVE state.

 - A couple fixes for issues the DM snapshot target's optional discard
   support added during first week of the 5.3 merge.

 - Increase default size of outstanding IO that is allowed for a each
   dm-kcopyd client and introduce tunable to allow user adjust.

 - Update DM core to use printk ratelimiting functions rather than
   duplicate them and in doing so fix an issue where DMDEBUG_LIMIT()
   rate limited KERN_DEBUG messages had excessive "callbacks suppressed"
   messages.

* tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: use printk ratelimiting functions
  dm kcopyd: Increase default sub-job size to 512KB
  dm snapshot: fix oversights in optional discard support
  dm zoned: fix zone state management race
2019-07-18 14:49:33 -07:00
Linus Torvalds
f8c3500cd1 - virtio_pmem: The new virtio_pmem facility introduces a paravirtualized
persistent memory device that allows a guest VM to use DAX mechanisms to
   access a host-file with host-page-cache. It arranges for MAP_SYNC to
   be disabled and instead triggers a host fsync() when a 'write-cache
   flush' command is sent to the virtual disk device.
 
 - Miscellaneous small fixups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdMHwpAAoJEB7SkWpmfYgCUYoP/3vcgYBAaXNksyALF0iowPoP
 z4J0KoaOA1CzRFEQtCWUQa84CWj+XoSewwSeyrIkqKQvx/gghXblK+GVjVzBn0BD
 hmmiKr8af4DdxfzYdEXJp65cCpIiVMaJiGr20Aj9ObwvWJb4QZbz9q7hnPt6KgiI
 jVND3BpP3OERb4ZFcibdmJT5foKooMcXVG6+luVe+hc1+ZZQxJBsBaqie4brQIFq
 j59NX3HfHH2fr1vVwnVH0CO4tgbgYg9wZ2EivGu6wBWvORjrr7KiSSbOYP68EBtd
 lUoNps+vQtGnfXGwNzAjp1wuknrQYYh4/KMKjep7hiZD39rgyvBpbHbyynKzQCWV
 REe8cXr/nwphsENvBAUBiqY999EWVIxdT2iaVaSA6K/31JQAC5AFyxVK/P2Ke1SK
 rvePZ++iLQ1o4phTxQPNlVUqF9jOrFVVICGwMDqaqSkOsD9YKQdFClfOF/1ntlDz
 V0bs+Y0Pe8AJCd9ESep4X+vHAWRRIb4EQIuwLaX8RJoY+r1fGye9RPthpYYzvXKp
 DI2iJztFO3anzj2i9htNPUFIaiUmIhzEvG32O2If2yc5FL02hMpHPoFx6vHhe6s3
 f8OJ+olsJK+/IIrV8+DHqYvhzylOYIhmRTvIxIxaNDPHkhR1i2RDQ6KKK1YZmsr8
 MjAZ+Ym0GadDivs+wcM6
 =uAMG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "Primarily just the virtio_pmem driver:

   - virtio_pmem

     The new virtio_pmem facility introduces a paravirtualized
     persistent memory device that allows a guest VM to use DAX
     mechanisms to access a host-file with host-page-cache. It arranges
     for MAP_SYNC to be disabled and instead triggers a host fsync()
     when a 'write-cache flush' command is sent to the virtual disk
     device.

   - Miscellaneous small fixups"

* tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  virtio_pmem: fix sparse warning
  xfs: disable map_sync for async flush
  ext4: disable map_sync for async flush
  dax: check synchronous mapping is supported
  dm: enable synchronous dax
  libnvdimm: add dax_dev sync flag
  virtio-pmem: Add virtio pmem driver
  libnvdimm: nd_region flush callback support
  libnvdimm, namespace: Drop uuid_t implementation detail
2019-07-18 10:52:08 -07:00
Nikos Tsironis
c663e04097 dm kcopyd: Increase default sub-job size to 512KB
Currently, kcopyd has a sub-job size of 64KB and a maximum number of 8
sub-jobs. As a result, for any kcopyd job, we have a maximum of 512KB of
I/O in flight.

This upper limit to the amount of in-flight I/O under-utilizes fast
devices and results in decreased throughput, e.g., when writing to a
snapshotted thin LV with I/O size less than the pool's block size (so
COW is performed using kcopyd).

Increase kcopyd's default sub-job size to 512KB, so we have a maximum of
4MB of I/O in flight for each kcopyd job. This results in an up to 96%
improvement of bandwidth when writing to a snapshotted thin LV, with I/O
sizes less than the pool's block size.

Also, add dm_mod.kcopyd_subjob_size_kb module parameter to allow users
to fine tune the sub-job size of kcopyd. The default value of this
parameter is 512KB and the maximum allowed value is 1024KB.

We evaluate the performance impact of the change by running the
snap_breaking_throughput benchmark, from the device mapper test suite
[1].

The benchmark:

  1. Creates a 1G thin LV
  2. Provisions the thin LV
  3. Takes a snapshot of the thin LV
  4. Writes to the thin LV with:

      dd if=/dev/zero of=/dev/vg/thin_lv oflag=direct bs=<I/O size>

Running this benchmark with various thin pool block sizes and dd I/O
sizes (all combinations triggering the use of kcopyd) we get the
following results:

+-----------------+-------------+------------------+-----------------+
| Pool block size | dd I/O size | BW before (MB/s) | BW after (MB/s) |
+-----------------+-------------+------------------+-----------------+
|       1 MB      |      256 KB |       242        |       280       |
|       1 MB      |      512 KB |       238        |       295       |
|                 |             |                  |                 |
|       2 MB      |      256 KB |       238        |       354       |
|       2 MB      |      512 KB |       241        |       380       |
|       2 MB      |        1 MB |       245        |       394       |
|                 |             |                  |                 |
|       4 MB      |      256 KB |       248        |       412       |
|       4 MB      |      512 KB |       234        |       432       |
|       4 MB      |        1 MB |       251        |       474       |
|       4 MB      |        2 MB |       257        |       504       |
|                 |             |                  |                 |
|       8 MB      |      256 KB |       239        |       420       |
|       8 MB      |      512 KB |       256        |       431       |
|       8 MB      |        1 MB |       264        |       467       |
|       8 MB      |        2 MB |       264        |       502       |
|       8 MB      |        4 MB |       281        |       537       |
+-----------------+-------------+------------------+-----------------+

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:24:16 -04:00
Mike Snitzer
3ee25485ba dm snapshot: fix oversights in optional discard support
__find_snapshots_sharing_cow() should always be used with _origins_lock
held so fix snapshot_io_hints() accordingly.  Also, once a snapshot is
being merged discards must not be allowed -- otherwise incorrect or
duplicate work will be performed.

Fixes: 2e6023850e ("dm snapshot: add optional discard support features")
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:12:30 -04:00
Damien Le Moal
3b8cafdd54 dm zoned: fix zone state management race
dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
backend device is being actively read or written and so cannot be
reclaimed. This flag is set as long as the zone atomic reference
counter is not 0. When this atomic is decremented and reaches 0 (e.g.
on BIO completion), the active flag is cleared and set again whenever
the zone is reused and BIO issued with the atomic counter incremented.
These 2 operations (atomic inc/dec and flag set/clear) are however not
always executed atomically under the target metadata mutex lock and
this causes the warning:

WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));

in dmz_deactivate_zone() to be displayed. This problem is regularly
triggered with xfstests generic/209, generic/300, generic/451 and
xfs/077 with XFS being used as the file system on the dm-zoned target
device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
generic/300 trigger the warning with ext4 use.

This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
and managing the "ACTIVE" state by directly looking at the reference
counter value. To do so, the functions dmz_activate_zone() and
dmz_deactivate_zone() are changed to inline functions respectively
calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
is changed to an inline function calling atomic_read().

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:03:48 -04:00
Linus Torvalds
c309b6f242 docs conversion for v5.3-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAl0tpocACgkQCF8+vY7k
 4RWoxA//b/fmDXP3WPzrjjSmpyB9ml0/epKzPbT5S2j0lftqKBmet29k+PCjVrTx
 Nq2QauehY9ug5h8UMVUCmzPr95F0tSIGRoqk1vrn7z0K3q6k1SHrtvqbY1Bgb2Uk
 Qvh2YFU4fQLJg8WAbExCjxCdbdmBKQVGKTwCtM+tP5OMxwAFOmQrjGaUaKCKIIA2
 7Wzrx8CpSji+bJ3uK/d36c+4M9oDly5eaxBhoboL3BI0y+GqwiSASGwTO7BxrPOg
 0wq5IZHnqS8+bprT9xQdDOqf+UOY9U1cxE/+sqsHxblfUEx9gfLy/R+FLmJn+SS9
 Z3yLy4SqVHQMpWBjEAGodohikF60PAuTdymSC11jqFaKCUxWrIZg5xO+0blMrxPF
 7vYIexutCkaBMHBlNaNsHIqB7B/2FGGKoN7QW64hwvwJCGvF7OmJcV+R4bROGvh4
 nFuis9/Nm66Fq7I3aw37ThyZ0aWZdaQ0QJTH9ksxU/ZCz2hhMNYu/rXggrDvkS4U
 nr77ZT5Gd7nj4b110zf8+99uiGiinY6hTfzPAuTCLBhaxwrv4/xDHAhpwdEB5T4j
 8gOkxV8c0XWtL7sKqhGJvs/RRe2za0Y9XH6fyxsYfWcfuLjEvug8ouXMad9gxFWH
 DL3WnKJEMGLScei2wux4kGOwEbkR1bUf2cHJfh3GpCB/y8vgLOc=
 =smxY
 -----END PGP SIGNATURE-----

Merge tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull rst conversion of docs from Mauro Carvalho Chehab:
 "As agreed with Jon, I'm sending this big series directly to you, c/c
  him, as this series required a special care, in order to avoid
  conflicts with other trees"

* tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
  docs: kbuild: fix build with pdf and fix some minor issues
  docs: block: fix pdf output
  docs: arm: fix a breakage with pdf output
  docs: don't use nested tables
  docs: gpio: add sysfs interface to the admin-guide
  docs: locking: add it to the main index
  docs: add some directories to the main documentation index
  docs: add SPDX tags to new index files
  docs: add a memory-devices subdir to driver-api
  docs: phy: place documentation under driver-api
  docs: serial: move it to the driver-api
  docs: driver-api: add remaining converted dirs to it
  docs: driver-api: add xilinx driver API documentation
  docs: driver-api: add a series of orphaned documents
  docs: admin-guide: add a series of orphaned documents
  docs: cgroup-v1: add it to the admin-guide book
  docs: aoe: add it to the driver-api book
  docs: add some documentation dirs to the driver-api book
  docs: driver-model: move it to the driver-api book
  docs: lp855x-driver.rst: add it to the driver-api book
  ...
2019-07-16 12:21:41 -07:00
Linus Torvalds
9637d51734 for-linus-20190715
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0s1ZEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiCEEACE9H/pXoegTTWIVPVajMlsa19UHIeilk4N
 GI7oKSiirQEMZnAOmrEzgB4/0zyYQsVypys0gZlYUD3GJVsXDT3zzjNXL5NpVg/O
 nqwSGWMHBSjWkLbaM40Pb2QLXsYgveptNL+9PtxrgtoYPoT5/+TyrJMFrRfi72EK
 WFeNDKOu6aJxpJ26JSsckJ0gluKeeEpRoEqsgHGIwaMIGHQf+b+ikk7tel5FAIgA
 uDwwD+Oxsdgh/ChsXL0d90GkcbcSp6GQ7GybxVmw/tPijx6mpeIY72xY3Zx+t8zF
 b71UNk6NmCKjOPO/6fiuYKKTYw+KhzlyEKO0j675HKfx2AhchEwKw0irp4yUlydA
 zxWYmz4U7iRgktJtymv3J4FEQQ3S6d1EnuQkQNX1LwiOsEsfzhkWi+7jy7KFhZoJ
 AqtYzqnOXvLx92q0vloj06HtK6zo+I/MINldy0+qn9lq0N0VF+dctyztAHLsF7P6
 pUtS6i7l1JSFKAmMhC31sIj5TImaehM2e/TWMUPEDZaO96oKCmQwOF1oiloc6vlW
 h4xWsxP/9zOFcWNyPzy6Vo3JUXWRvFA7K+jV3Hsukw6rVHiNCGVYGSlTv8Roi5b7
 I4ggu9R2JOGyku7UIlL50IRxEyjAp11LaO8yHhcCnRB65rmyBuNMQNcfOsfxpZ5Y
 1mtSNhm5TQ==
 =g8xI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190715' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "A later pull request with some followup items. I had some vacation
  coming up to the merge window, so certain things items were delayed a
  bit. This pull request also contains fixes that came in within the
  last few days of the merge window, which I didn't want to push right
  before sending you a pull request.

  This contains:

   - NVMe pull request, mostly fixes, but also a few minor items on the
     feature side that were timing constrained (Christoph et al)

   - Report zones fixes (Damien)

   - Removal of dead code (Damien)

   - Turn on cgroup psi memstall (Josef)

   - block cgroup MAINTAINERS entry (Konstantin)

   - Flush init fix (Josef)

   - blk-throttle low iops timing fix (Konstantin)

   - nbd resize fixes (Mike)

   - nbd 0 blocksize crash fix (Xiubo)

   - block integrity error leak fix (Wenwen)

   - blk-cgroup writeback and priority inheritance fixes (Tejun)"

* tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
  MAINTAINERS: add entry for block io cgroup
  null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
  block: Limit zone array allocation size
  sd_zbc: Fix report zones buffer allocation
  block: Kill gfp_t argument of blkdev_report_zones()
  block: Allow mapping of vmalloc-ed buffers
  block/bio-integrity: fix a memory leak bug
  nvme: fix NULL deref for fabrics options
  nbd: add netlink reconfigure resize support
  nbd: fix crash when the blksize is zero
  block: Disable write plugging for zoned block devices
  block: Fix elevator name declaration
  block: Remove unused definitions
  nvme: fix regression upon hot device removal and insertion
  blk-throttle: fix zero wait time for iops throttled group
  block: Fix potential overflow in blk_report_zones()
  blkcg: implement REQ_CGROUP_PUNT
  blkcg, writeback: Implement wbc_blkcg_css()
  blkcg, writeback: Add wbc->no_cgroup_owner
  blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
  ...
2019-07-15 21:20:52 -07:00
Mauro Carvalho Chehab
6cf2a73cb2 docs: device-mapper: move it to the admin-guide
The DM support describes lots of aspects related to mapped
disk partitions from the userspace PoV.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-15 11:03:01 -03:00
Linus Torvalds
a1240cf74e Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
 "This includes changes to let percpu_ref release the backing percpu
  memory earlier after it has been switched to atomic in cases where the
  percpu ref is not revived.

  This will help recycle percpu memory earlier in cases where the
  refcounts are pinned for prolonged periods of time"

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT
  md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
  io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
  percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag
2019-07-14 16:17:18 -07:00
Linus Torvalds
2260840592 - Add encrypted byte-offset initialization vector (eboiv) to DM crypt.
- Add optional discard features to DM snapshot which allow freeing space
   from a DM device whose free space was exhausted.
 
 - Various small improvements to use struct_size() and kzalloc().
 
 - Fix to check if DM thin metadata is in fail_io mode before attempting
   to update the superblock to set the needs_check flag.  Otherwise the
   DM thin-pool can hang.
 
 - Fix DM bufio shrinker's potential for ABBA recursion deadlock with DM
   thin provisioning on loop usecase.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl0o0/YTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWiG3B/wI9//FMbHHd9KboFdDQpBNGKaYEIa+
 ZQCPRghzvODBW416yujC1xlIA4bdYyVcQ1wPqCqCDJhXndaDUpMzyRxnPTI0zm4U
 PTZNmWuXO3SmSv7QuHgaCuMIWXIvyOcGLHEb5wqWZJMZ+t4Hf14RrwWQ19d98/hO
 ff7MO70h8sAlFb8lMv6Mxa/TU8O7FoE3EBssfNOF8kHfdFNZnvrOSTvBRhmFTXPQ
 P5RsgTC2KSo8bt5lqqpcMa3XTolx+CE3Dww1SaOFNU+jM4P6n6HUTHeNDcLuyYSc
 JlaV19nFMrarTwzVbyJJqiJwlZzlH/J5arplytg5TldE37EPcl8lHuaU
 =2oWT
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Add encrypted byte-offset initialization vector (eboiv) to DM crypt.

 - Add optional discard features to DM snapshot which allow freeing
   space from a DM device whose free space was exhausted.

 - Various small improvements to use struct_size() and kzalloc().

 - Fix to check if DM thin metadata is in fail_io mode before attempting
   to update the superblock to set the needs_check flag. Otherwise the
   DM thin-pool can hang.

 - Fix DM bufio shrinker's potential for ABBA recursion deadlock with DM
   thin provisioning on loop usecase.

* tag 'for-5.3/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm bufio: fix deadlock with loop device
  dm snapshot: add optional discard support features
  dm crypt: implement eboiv - encrypted byte-offset initialization vector
  dm crypt: remove obsolete comment about plumb IV
  dm crypt: wipe private IV struct after key invalid flag is set
  dm integrity: use kzalloc() instead of kmalloc() + memset()
  dm: update stale comment in end_clone_bio()
  dm log writes: fix incorrect comment about the logged sequence example
  dm log writes: use struct_size() to calculate size of pending_block
  dm crypt: use struct_size() when allocating encryption context
  dm integrity: always set version on superblock update
  dm thin metadata: check if in fail_io mode when setting needs_check
2019-07-13 15:24:31 -07:00
Junxiao Bi
bd293d071f dm bufio: fix deadlock with loop device
When thin-volume is built on loop device, if available memory is low,
the following deadlock can be triggered:

One process P1 allocates memory with GFP_FS flag, direct alloc fails,
memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
IO to complete in __try_evict_buffer().

But this IO may never complete if issued to an underlying loop device
that forwards it using direct-IO, which allocates memory using
GFP_KERNEL (see: do_blockdev_direct_IO()).  If allocation fails, memory
reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
will be invoked, and since the mutex is already held by P1 the loop
thread will hang, and IO will never complete.  Resulting in ABBA
deadlock.

Cc: stable@vger.kernel.org
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-12 09:59:37 -04:00
Mike Snitzer
2e6023850e dm snapshot: add optional discard support features
discard_zeroes_cow - a discard issued to the snapshot device that maps
to entire chunks to will zero the corresponding exception(s) in the
snapshot's exception store.

discard_passdown_origin - a discard to the snapshot device is passed down
to the snapshot-origin's underlying device.  This doesn't cause copy-out
to the snapshot exception store because the snapshot-origin target is
bypassed.

The discard_passdown_origin feature depends on the discard_zeroes_cow
feature being enabled.

When these 2 features are enabled they allow a temporarily read-only
device that has completely exhausted its free space to recover space.
To do so dm-snapshot provides temporary buffer to accommodate writes
that the temporarily read-only device cannot handle yet.  Once the upper
layer frees space (e.g. fstrim to XFS) the discards issued to the
dm-snapshot target will be issued to underlying read-only device whose
free space was exhausted.  In addition those discards will also cause
zeroes to be written to the snapshot exception store if corresponding
exceptions exist.  If the underlying origin device provides
deduplication for zero blocks then if/when the snapshot is merged backed
to the origin those blocks will become unused.  Once the origin has
gained adequate space, merging the snapshot back to the thinly
provisioned device will permit continued use of that device without the
temporary space provided by the snapshot.

Requested-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-12 09:59:36 -04:00
Damien Le Moal
bd976e5272 block: Kill gfp_t argument of blkdev_report_zones()
Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
preparation of using vmalloc() for large report buffer and zone array
allocations used by this function, remove its "gfp_t gfp_mask" argument
and rely on the caller context to use memalloc_noio_save/restore() where
necessary (block layer zone revalidation and dm-zoned I/O error path).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-11 20:04:37 -06:00
Linus Torvalds
028db3e290 Revert "Merge tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs"
This reverts merge 0f75ef6a9c (and thus
effectively commits

   7a1ade8475 ("keys: Provide KEYCTL_GRANT_PERMISSION")
   2e12256b9a ("keys: Replace uid/gid/perm permissions checking with an ACL")

that the merge brought in).

It turns out that it breaks booting with an encrypted volume, and Eric
biggers reports that it also breaks the fscrypt tests [1] and loading of
in-kernel X.509 certificates [2].

The root cause of all the breakage is likely the same, but David Howells
is off email so rather than try to work it out it's getting reverted in
order to not impact the rest of the merge window.

 [1] https://lore.kernel.org/lkml/20190710011559.GA7973@sol.localdomain/
 [2] https://lore.kernel.org/lkml/20190710013225.GB7973@sol.localdomain/

Link: https://lore.kernel.org/lkml/CAHk-=wjxoeMJfeBahnWH=9zShKp2bsVy527vo3_y8HfOdhwAAw@mail.gmail.com/
Reported-by: Eric Biggers <ebiggers@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-10 18:43:43 -07:00
Linus Torvalds
e9a83bd232 It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro.  These create more
    than the usual number of simple but annoying merge conflicts with other
    trees, unfortunately.  He has a lot more of these waiting on the wings
    that, I think, will go to you directly later on.
 
  - A new document on how to use merges and rebases in kernel repos, and one
    on Spectre vulnerabilities.
 
  - Various improvements to the build system, including automatic markup of
    function() references because some people, for reasons I will never
    understand, were of the opinion that :c:func:``function()`` is
    unattractive and not fun to type.
 
  - We now recommend using sphinx 1.7, but still support back to 1.4.
 
  - Lots of smaller improvements, warning fixes, typo fixes, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl0krAEPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Yg98H/AuLqO9LpOgUjF4LhyjxGPdzJkY9RExSJ7km
 gznyreLCZgFaJR+AY6YDsd4Jw6OJlPbu1YM/Qo3C3WrZVFVhgL/s2ebvBgCo50A8
 raAFd8jTf4/mGCHnAqRotAPQ3mETJUk315B66lBJ6Oc+YdpRhwXWq8ZW2bJxInFF
 3HDvoFgMf0KhLuMHUkkL0u3fxH1iA+KvDu8diPbJYFjOdOWENz/CV8wqdVkXRSEW
 DJxIq89h/7d+hIG3d1I7Nw+gibGsAdjSjKv4eRKauZs4Aoxd1Gpl62z0JNk6aT3m
 dtq4joLdwScydonXROD/Twn2jsu4xYTrPwVzChomElMowW/ZBBY=
 =D0eO
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.3' of git://git.lwn.net/linux

Pull Documentation updates from Jonathan Corbet:
 "It's been a relatively busy cycle for docs:

   - A fair pile of RST conversions, many from Mauro. These create more
     than the usual number of simple but annoying merge conflicts with
     other trees, unfortunately. He has a lot more of these waiting on
     the wings that, I think, will go to you directly later on.

   - A new document on how to use merges and rebases in kernel repos,
     and one on Spectre vulnerabilities.

   - Various improvements to the build system, including automatic
     markup of function() references because some people, for reasons I
     will never understand, were of the opinion that
     :c:func:``function()`` is unattractive and not fun to type.

   - We now recommend using sphinx 1.7, but still support back to 1.4.

   - Lots of smaller improvements, warning fixes, typo fixes, etc"

* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
  docs: automarkup.py: ignore exceptions when seeking for xrefs
  docs: Move binderfs to admin-guide
  Disable Sphinx SmartyPants in HTML output
  doc: RCU callback locks need only _bh, not necessarily _irq
  docs: format kernel-parameters -- as code
  Doc : doc-guide : Fix a typo
  platform: x86: get rid of a non-existent document
  Add the RCU docs to the core-api manual
  Documentation: RCU: Add TOC tree hooks
  Documentation: RCU: Rename txt files to rst
  Documentation: RCU: Convert RCU UP systems to reST
  Documentation: RCU: Convert RCU linked list to reST
  Documentation: RCU: Convert RCU basic concepts to reST
  docs: filesystems: Remove uneeded .rst extension on toctables
  scripts/sphinx-pre-install: fix out-of-tree build
  docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
  Documentation: PGP: update for newer HW devices
  Documentation: Add section about CPU vulnerabilities for Spectre
  Documentation: platform: Delete x86-laptop-drivers.txt
  docs: Note that :c:func: should no longer be used
  ...
2019-07-09 12:34:26 -07:00
Milan Broz
b9411d73bd dm crypt: implement eboiv - encrypted byte-offset initialization vector
This IV is used in some BitLocker devices with CBC encryption mode.

IV is encrypted little-endian byte-offset (with the same key and cipher
as the volume).

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:23 -04:00
Milan Broz
6028a7a5a3 dm crypt: remove obsolete comment about plumb IV
The URL is no longer valid and the comment is obsolete anyway
(the plumb IV was never used).

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:22 -04:00
Milan Broz
4a52ffc7ca dm crypt: wipe private IV struct after key invalid flag is set
If a private IV wipe function fails, the code does not set the key
invalid flag.  To fix this, move code to after the flag is set to
prevent the device from resuming in an inconsistent state.

Also, this allows using of a randomized key in private wipe function
(to be used in a following commit).

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:21 -04:00
Fuqian Huang
131670c262 dm integrity: use kzalloc() instead of kmalloc() + memset()
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:20 -04:00
Pavel Begunkov
d370ad23a5 dm: update stale comment in end_clone_bio()
Since commit a1ce35fa49 ("block: remove dead elevator
code") blk_end_request() has been replaced with blk_mq_end_request().
So update comment to reference blk_mq_end_request() accordingly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:20 -04:00
Qu Wenruo
7537dad791 dm log writes: fix incorrect comment about the logged sequence example
dm-log-writes records the sequence of completion, not submission, thus
for the following sequence (W=write, C=complete):

  Wa,Wb,Wc,Cc,Ca,FLUSH,FUAd,Cb,CFLUSH,CFUAd

The logged results in log device should be:
  c,a,b,flush,fua

Fix the comment to give a better example.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:13:33 -04:00
Zhengyuan Liu
d4e6e83651 dm log writes: use struct_size() to calculate size of pending_block
Use struct_size() to avoid open-coded equivalent that is prone to a type
mistake.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:08:41 -04:00
Zhengyuan Liu
9c81c99b24 dm crypt: use struct_size() when allocating encryption context
Use struct_size() to avoid open-coded equivalent that is prone to a type
mistake.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:08:40 -04:00
Milan Broz
5f1c56b34e dm integrity: always set version on superblock update
The new integrity bitmap mode uses the dirty flag.  The dirty flag
should not be set in older superblock versions.

The current code sets it unconditionally, even if the superblock
was already formatted without bitmap in older system.

Fix this by moving the version check to one common place and check
version on every superblock write.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 13:46:01 -04:00
Linus Torvalds
3b99107f0e for-5.3/block-20190708
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0jrIMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptlFD/9CNsBX+Aap2lO6wKNr6QISwNAK76GMzEay
 s4LSY2kGkXvzv8i89mCuY+8UVNI8WH2/22WnU+8CBAJOjWyFQMsIwH/mrq0oZWRD
 J6STJE8rTr6Fc2MvJUWryp/xdBh3+eDIsAdIZVHVAkIzqYPBnpIAwEIeIw8t0xsm
 v9ngpQ3WD6ep8tOj9pnG1DGKFg1CmukZCC/Y4CQV1vZtmm2I935zUwNV/TB+Egfx
 G8JSC0cSV02LMK88HCnA6MnC/XSUC0qgfXbnmP+TpKlgjVX+P/fuB3oIYcZEu2Rk
 3YBpIkhsQytKYbF42KRLsmBH72u6oB9G+tNZTgB1STUDrZqdtD9xwX1rjDlY0ZzP
 EUDnk48jl/cxbs+VZrHoE2TcNonLiymV7Kb92juHXdIYmKFQStprGcQUbMaTkMfB
 6BYrYLifWx0leu1JJ1i7qhNmug94BYCSCxcRmH0p6kPazPcY9LXNmDWMfMuBPZT7
 z79VLZnHF2wNXJyT1cBluwRYYJRT4osWZ3XUaBWFKDgf1qyvXJfrN/4zmgkEIyW7
 ivXC+KLlGkhntDlWo2pLKbbyOIKY1HmU6aROaI11k5Zyh0ixKB7tHKavK39l+NOo
 YB41+4l6VEpQEyxyRk8tO0sbHpKaKB+evVIK3tTwbY+Q0qTExErxjfWUtOgRWhjx
 iXJssPRo4w==
 =VSYT
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "This is the main block updates for 5.3. Nothing earth shattering or
  major in here, just fixes, additions, and improvements all over the
  map. This contains:

   - Series of documentation fixes (Bart)

   - Optimization of the blk-mq ctx get/put (Bart)

   - null_blk removal race condition fix (Bob)

   - req/bio_op() cleanups (Chaitanya)

   - Series cleaning up the segment accounting, and request/bio mapping
     (Christoph)

   - Series cleaning up the page getting/putting for bios (Christoph)

   - block cgroup cleanups and moving it to where it is used (Christoph)

   - block cgroup fixes (Tejun)

   - Series of fixes and improvements to bcache, most notably a write
     deadlock fix (Coly)

   - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

   - Series of improvements and fixes to BFQ (Douglas, Paolo)

   - debugfs_create() return value check removal for drbd (Greg)

   - Use struct_size(), where appropriate (Gustavo)

   - Two lighnvm fixes (Heiner, Geert)

   - MD fixes, including a read balance and corruption fix (Guoqing,
     Marcos, Xiao, Yufen)

   - block opal shadow mbr additions (Jonas, Revanth)

   - sbitmap compare-and-exhange improvemnts (Pavel)

   - Fix for potential bio->bi_size overflow (Ming)

   - NVMe pull requests:
       - improved PCIe suspent support (Keith Busch)
       - error injection support for the admin queue (Akinobu Mita)
       - Fibre Channel discovery improvements (James Smart)
       - tracing improvements including nvmetc tracing support (Minwoo Im)
       - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
         Kulkarni)"

   - Various little fixes and improvements to drivers and core"

* tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
  blk-iolatency: fix STS_AGAIN handling
  block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
  blk-mq: simplify blk_mq_make_request()
  blk-mq: remove blk_mq_put_ctx()
  sbitmap: Replace cmpxchg with xchg
  block: fix .bi_size overflow
  block: sed-opal: check size of shadow mbr
  block: sed-opal: ioctl for writing to shadow mbr
  block: sed-opal: add ioctl for done-mark of shadow mbr
  block: never take page references for ITER_BVEC
  direct-io: use bio_release_pages in dio_bio_complete
  block_dev: use bio_release_pages in bio_unmap_user
  block_dev: use bio_release_pages in blkdev_bio_end_io
  iomap: use bio_release_pages in iomap_dio_bio_end_io
  block: use bio_release_pages in bio_map_user_iov
  block: use bio_release_pages in bio_unmap_user
  block: optionally mark pages dirty in bio_release_pages
  block: move the BIO_NO_PAGE_REF check into bio_release_pages
  block: skd_main.c: Remove call to memset after dma_alloc_coherent
  block: mtip32xx: Remove call to memset after dma_alloc_coherent
  ...
2019-07-09 10:45:06 -07:00
Linus Torvalds
0f75ef6a9c Keyrings ACL
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXRyyVvu3V2unywtrAQL3xQ//eifjlELkRAPm2EReWwwahdM+9QL/0bAy
 e8eAzP9EaphQGUhpIzM9Y7Cx+a8XW2xACljY8hEFGyxXhDMoLa35oSoJOeay6vQt
 QcgWnDYsET8Z7HOsFCP3ZQqlbbqfsB6CbIKtZoEkZ8ib7eXpYcy1qTydu7wqrl4A
 AaJalAhlUKKUx9hkGGJTh2xvgmxgSJkxx3cNEWJQ2uGgY/ustBpqqT4iwFDsgA/q
 fcYTQFfNQBsC8/SmvQgxJSc+reUdQdp0z1vd8qjpSdFFcTq1qOtK0qDdz1Bbyl24
 hAxvNM1KKav83C8aF7oHhEwLrkD+XiYKixdEiCJJp+A2i+vy2v8JnfgtFTpTgLNK
 5xu2VmaiWmee9SLCiDIBKE4Ghtkr8DQ/5cKFCwthT8GXgQUtdsdwAaT3bWdCNfRm
 DqgU/AyyXhoHXrUM25tPeF3hZuDn2yy6b1TbKA9GCpu5TtznZIHju40Px/XMIpQH
 8d6s/pg+u/SnkhjYWaTvTcvsQ2FB/vZY/UzAVyosnoMBkVfL4UtAHGbb8FBVj1nf
 Dv5VjSjl4vFjgOr3jygEAeD2cJ7L6jyKbtC/jo4dnOmPrSRShIjvfSU04L3z7FZS
 XFjMmGb2Jj8a7vAGFmsJdwmIXZ1uoTwX56DbpNL88eCgZWFPGKU7TisdIWAmJj8U
 N9wholjHJgw=
 =E3bF
 -----END PGP SIGNATURE-----

Merge tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull keyring ACL support from David Howells:
 "This changes the permissions model used by keys and keyrings to be
  based on an internal ACL by the following means:

   - Replace the permissions mask internally with an ACL that contains a
     list of ACEs, each with a specific subject with a permissions mask.
     Potted default ACLs are available for new keys and keyrings.

     ACE subjects can be macroised to indicate the UID and GID specified
     on the key (which remain). Future commits will be able to add
     additional subject types, such as specific UIDs or domain
     tags/namespaces.

     Also split a number of permissions to give finer control. Examples
     include splitting the revocation permit from the change-attributes
     permit, thereby allowing someone to be granted permission to revoke
     a key without allowing them to change the owner; also the ability
     to join a keyring is split from the ability to link to it, thereby
     stopping a process accessing a keyring by joining it and thus
     acquiring use of possessor permits.

   - Provide a keyctl to allow the granting or denial of one or more
     permits to a specific subject. Direct access to the ACL is not
     granted, and the ACL cannot be viewed"

* tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Provide KEYCTL_GRANT_PERMISSION
  keys: Replace uid/gid/perm permissions checking with an ACL
2019-07-08 19:56:57 -07:00
Pankaj Gupta
2e9ee0955d dm: enable synchronous dax
This patch sets dax device 'DAXDEV_SYNC' flag if all the target
devices of device mapper support synchrononous DAX. If device
mapper consists of both synchronous and asynchronous dax devices,
we don't set 'DAXDEV_SYNC' flag.

'dm_table_supports_dax' is refactored to pass 'iterate_devices_fn'
as argument so that the callers can pass the appropriate functions.

Suggested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-05 15:19:10 -07:00
Pankaj Gupta
fefc1d97fa libnvdimm: add dax_dev sync flag
This patch adds 'DAXDEV_SYNC' flag which is set
for nd_region doing synchronous flush. This later
is used to disable MAP_SYNC functionality for
ext4 & xfs filesystem for devices don't support
synchronous flush.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-05 15:19:10 -07:00
Mike Snitzer
54fa16ee53 dm thin metadata: check if in fail_io mode when setting needs_check
Check if in fail_io mode at start of dm_pool_metadata_set_needs_check().
Otherwise dm_pool_metadata_set_needs_check()'s superblock_lock() can
crash in dm_bm_write_lock() while accessing the block manager object
that was previously destroyed as part of a failed
dm_pool_abort_metadata() that ultimately set fail_io to begin with.

Also, update DMERR() message to more accurately describe
superblock_lock() failure.

Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-02 15:50:08 -04:00
Jens Axboe
5be1f9d82f Linux 5.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
 A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
 v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
 aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
 s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
 VTS5zps=
 =huBY
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc6' into for-5.3/block

Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
fix. Otherwise we end up having conflicts with future patches between
for-5.3/block and master that touch this area. In particular, it makes
the bio_full() fix hard to backport to stable.

* tag 'v5.2-rc6': (482 commits)
  Linux 5.2-rc6
  Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
  Bluetooth: Fix regression with minimum encryption key size alignment
  tcp: refine memory limit test in tcp_fragment()
  x86/vdso: Prevent segfaults due to hoisted vclock reads
  SUNRPC: Fix a credential refcount leak
  Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
  net :sunrpc :clnt :Fix xps refcount imbalance on the error path
  NFS4: Only set creation opendata if O_CREAT
  ARM: 8867/1: vdso: pass --be8 to linker if necessary
  KVM: nVMX: reorganize initial steps of vmx_set_nested_state
  KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
  habanalabs: use u64_to_user_ptr() for reading user pointers
  nfsd: replace Jeff by Chuck as nfsd co-maintainer
  inet: clear num_timeout reqsk_alloc()
  PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
  net: mvpp2: debugfs: Add pmap to fs dump
  ipv6: Default fib6_type to RTN_UNICAST when not set
  net: hns3: Fix inconsistent indenting
  net/af_iucv: always register net_device notifier
  ...
2019-07-01 08:16:08 -06:00
Coly Li
dff90d58a1 bcache: add reclaimed_journal_buckets to struct cache_set
Now we have counters for how many times jouranl is reclaimed, how many
times cached dirty btree nodes are flushed, but we don't know how many
jouranl buckets are really reclaimed.

This patch adds reclaimed_journal_buckets into struct cache_set, this
is an increasing only counter, to tell how many journal buckets are
reclaimed since cache set runs. From all these three counters (reclaim,
reclaimed_journal_buckets, flush_write), we can have idea how well
current journal space reclaim code works.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:18 -06:00
Coly Li
91be66e131 bcache: performance improvement for btree_flush_write()
This patch improves performance for btree_flush_write() in following
ways,
- Use another spinlock journal.flush_write_lock to replace the very
  hot journal.lock. We don't have to use journal.lock here, selecting
  candidate btree nodes takes a lot of time, hold journal.lock here will
  block other jouranling threads and drop the overall I/O performance.
- Only select flushing btree node from c->btree_cache list. When the
  machine has a large system memory, mca cache may have a huge number of
  cached btree nodes. Iterating all the cached nodes will take a lot
  of CPU time, and most of the nodes on c->btree_cache_freeable and
  c->btree_cache_freed lists are cleared and have need to flush. So only
  travel mca list c->btree_cache to select flushing btree node should be
  enough for most of the cases.
- Don't iterate whole c->btree_cache list, only reversely select first
  BTREE_FLUSH_NR btree nodes to flush. Iterate all btree nodes from
  c->btree_cache and select the oldest journal pin btree nodes consumes
  huge number of CPU cycles if the list is huge (push and pop a node
  into/out of a heap is expensive). The last several dirty btree nodes
  on the tail of c->btree_cache list are earlest allocated and cached
  btree nodes, they are relative to the oldest journal pin btree nodes.
  Therefore only flushing BTREE_FLUSH_NR btree nodes from tail of
  c->btree_cache probably includes the oldest journal pin btree nodes.

In my testing, the above change decreases 50%+ CPU consumption when
journal space is full. Some times IOPS drops to 0 for 5-8 seconds,
comparing blocking I/O for 120+ seconds in previous code, this is much
better. Maybe there is room to improve in future, but at this momment
the fix looks fine and performs well in my testing.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:18 -06:00
Coly Li
50a260e859 bcache: fix race in btree_flush_write()
There is a race between mca_reap(), btree_node_free() and journal code
btree_flush_write(), which results very rare and strange deadlock or
panic and are very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() one btree
node with oldest journal pin is selected, then it is flushed to cache
device, the select-and-flush is a two steps operation. Between these two
steps, there are something may happen inside the race window,
- The selected btree node was reaped by mca_reap() and allocated to
  other requesters for other btree node.
- The slected btree node was selected, flushed and released by mca
  shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, firstly
b->write_lock is held by mutex_lock(). If the race happens and the
memory of selected btree node is allocated to other btree node, if that
btree node's write_lock is held already, a deadlock very probably
happens here. A worse case is the memory of the selected btree node is
released, then all references to this btree node (e.g. b->write_lock)
will trigger NULL pointer deference panic.

This race was introduced in commit cafe563591 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d5 ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.

Such race is not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
device assembled by 3 NVMe SSDs as backing device, this race can be
observed around every 10,000 times btree_flush_write() gets called. Both
deadlock and kernel panic all happened as aftermath of the race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after btree nodes
flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree node
without BTREE_NODE_journal_flush flag, such race is avoided.

Once corner case should be noticed, that is btree_node_free(). It might
be called in some error handling code path. For example the following
code piece from btree_split(),
        2149 err_free2:
        2150         bkey_put(b->c, &n2->key);
        2151         btree_node_free(n2);
        2152         rw_unlock(true, n2);
        2153 err_free1:
        2154         bkey_put(b->c, &n1->key);
        2155         btree_node_free(n1);
        2156         rw_unlock(true, n1);
At line 2151 and 2155, the btree node n2 and n1 are released without
mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such error handling path,
and the selected btree node has BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry again. In this case this btree node won't
be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
and free the btree node memory.

Fixes: cafe563591 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:18 -06:00
Coly Li
d91ce7574d bcache: remove retry_flush_write from struct cache_set
In struct cache_set, retry_flush_write is added for commit c4dc2497d5
("bcache: fix high CPU occupancy during journal") which is reverted in
previous patch.

Now it is useless anymore, and this patch removes it from bcache code.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:18 -06:00
Coly Li
41508bb7d4 bcache: add comments for mutex_lock(&b->write_lock)
When accessing or modifying BTREE_NODE_dirty bit, it is not always
necessary to acquire b->write_lock. In bch_btree_cache_free() and
mca_reap() acquiring b->write_lock is necessary, and this patch adds
comments to explain why mutex_lock(&b->write_lock) is necessary for
checking or clearing BTREE_NODE_dirty bit there.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
e5ec5f4765 bcache: only clear BTREE_NODE_dirty bit when it is set
In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
always set no matter btree node is dirty or not. The code looks like
this,
	if (btree_node_dirty(b))
		btree_complete_write(b, btree_current_write(b));
	clear_bit(BTREE_NODE_dirty, &b->flags);

Indeed if btree_node_dirty(b) returns false, it means BTREE_NODE_dirty
bit is cleared, then it is unnecessary to clear the bit again.

This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
true (the bit is set), to save a few CPU cycles.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
249a5f6da5 bcache: Revert "bcache: fix high CPU occupancy during journal"
This reverts commit c4dc2497d5.

This patch enlarges a race between normal btree flush code path and
flush_btree_write(), which causes deadlock when journal space is
exhausted. Reverts this patch makes the race window from 128 btree
nodes to only 1 btree nodes.

Fixes: c4dc2497d5 ("bcache: fix high CPU occupancy during journal")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Tang Junhui <tang.junhui.linux@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
ba82c1ac16 bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free"
This reverts commit 6268dc2c47.

This patch depends on commit c4dc2497d5 ("bcache: fix high CPU
occupancy during journal") which is reverted in previous patch. So
revert this one too.

Fixes: 6268dc2c47 ("bcache: free heap cache_set->flush_btree in bch_journal_free")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
1df3877ff6 bcache: shrink btree node cache after bch_btree_check()
When cache set starts, bch_btree_check() will check all bkeys on cache
device by calculating the checksum. This operation will consume a huge
number of system memory if there are a lot of data cached. Since bcache
uses its own mca cache to maintain all its read-in btree nodes, and only
releases the cache space when system memory manage code starts to shrink
caches. Then before memory manager code to call the mca cache shrinker
callback, bcache mca cache will compete memory resource with user space
application, which may have nagive effect to performance of user space
workloads (e.g. data base, or I/O service of distributed storage node).

This patch tries to call bcache mca shrinker routine to proactively
release mca cache memory, to decrease the memory pressure of system and
avoid negative effort of the overall system I/O performance.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
a231f07a5f bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket()
In journal_read_bucket() when setting ja->seq[bucket_index], there might
be potential case that a later non-maximum overwrites a better sequence
number to ja->seq[bucket_index]. This patch adds a check to make sure
that ja->seq[bucket_index] will be only set a new value if it is bigger
then current value.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
2464b69314 bcache: add code comments for journal_read_bucket()
This patch adds more code comments in journal_read_bucket(), this is an
effort to make the code to be more understandable.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
7e865eba00 bcache: fix potential deadlock in cached_def_free()
When enable lockdep and reboot system with a writeback mode bcache
device, the following potential deadlock warning is reported by lockdep
engine.

[  101.536569][  T401] kworker/2:2/401 is trying to acquire lock:
[  101.538575][  T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[  101.542054][  T401]
[  101.542054][  T401] but task is already holding lock:
[  101.544587][  T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[  101.548386][  T401]
[  101.548386][  T401] which lock already depends on the new lock.
[  101.548386][  T401]
[  101.551874][  T401]
[  101.551874][  T401] the existing dependency chain (in reverse order) is:
[  101.555000][  T401]
[  101.555000][  T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[  101.557860][  T401]        process_one_work+0x277/0x640
[  101.559661][  T401]        worker_thread+0x39/0x3f0
[  101.561340][  T401]        kthread+0x125/0x140
[  101.562963][  T401]        ret_from_fork+0x3a/0x50
[  101.564718][  T401]
[  101.564718][  T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[  101.567701][  T401]        lock_acquire+0xb4/0x1c0
[  101.569651][  T401]        flush_workqueue+0xae/0x4c0
[  101.571494][  T401]        drain_workqueue+0xa9/0x180
[  101.573234][  T401]        destroy_workqueue+0x17/0x250
[  101.575109][  T401]        cached_dev_free+0x44/0x120 [bcache]
[  101.577304][  T401]        process_one_work+0x2a4/0x640
[  101.579357][  T401]        worker_thread+0x39/0x3f0
[  101.581055][  T401]        kthread+0x125/0x140
[  101.582709][  T401]        ret_from_fork+0x3a/0x50
[  101.584592][  T401]
[  101.584592][  T401] other info that might help us debug this:
[  101.584592][  T401]
[  101.588355][  T401]  Possible unsafe locking scenario:
[  101.588355][  T401]
[  101.590974][  T401]        CPU0                    CPU1
[  101.592889][  T401]        ----                    ----
[  101.594743][  T401]   lock((work_completion)(&cl->work)#2);
[  101.596785][  T401]                                lock((wq_completion)bcache_writeback_wq);
[  101.600072][  T401]                                lock((work_completion)(&cl->work)#2);
[  101.602971][  T401]   lock((wq_completion)bcache_writeback_wq);
[  101.605255][  T401]
[  101.605255][  T401]  *** DEADLOCK ***
[  101.605255][  T401]
[  101.608310][  T401] 2 locks held by kworker/2:2/401:
[  101.610208][  T401]  #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[  101.613709][  T401]  #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[  101.617480][  T401]
[  101.617480][  T401] stack backtrace:
[  101.619539][  T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
[  101.623225][  T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  101.627210][  T401] Workqueue: events cached_dev_free [bcache]
[  101.629239][  T401] Call Trace:
[  101.630360][  T401]  dump_stack+0x85/0xcb
[  101.631777][  T401]  print_circular_bug+0x19a/0x1f0
[  101.633485][  T401]  __lock_acquire+0x16cd/0x1850
[  101.635184][  T401]  ? __lock_acquire+0x6a8/0x1850
[  101.636863][  T401]  ? lock_acquire+0xb4/0x1c0
[  101.638421][  T401]  ? find_held_lock+0x34/0xa0
[  101.640015][  T401]  lock_acquire+0xb4/0x1c0
[  101.641513][  T401]  ? flush_workqueue+0x87/0x4c0
[  101.643248][  T401]  flush_workqueue+0xae/0x4c0
[  101.644832][  T401]  ? flush_workqueue+0x87/0x4c0
[  101.646476][  T401]  ? drain_workqueue+0xa9/0x180
[  101.648303][  T401]  drain_workqueue+0xa9/0x180
[  101.649867][  T401]  destroy_workqueue+0x17/0x250
[  101.651503][  T401]  cached_dev_free+0x44/0x120 [bcache]
[  101.653328][  T401]  process_one_work+0x2a4/0x640
[  101.655029][  T401]  worker_thread+0x39/0x3f0
[  101.656693][  T401]  ? process_one_work+0x640/0x640
[  101.658501][  T401]  kthread+0x125/0x140
[  101.660012][  T401]  ? kthread_create_worker_on_cpu+0x70/0x70
[  101.661985][  T401]  ret_from_fork+0x3a/0x50
[  101.691318][  T401] bcache: bcache_device_free() bcache0 stopped

Here is how the above potential deadlock may happen in reboot/shutdown
code path,
1) bcache_reboot() is called firstly in the reboot/shutdown code path,
   then in bcache_reboot(), bcache_device_stop() is called.
2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->falgs, then call
   closure_queue(&d->cl) to invoke cached_dev_flush(). And in turn
   cached_dev_flush() calls cached_dev_free() via closure_at()
3) In cached_dev_free(), after stopped writebach kthread
   dc->writeback_thread, the kwork dc->writeback_write_wq is stopping by
   destroy_workqueue().
4) Inside destroy_workqueue(), drain_workqueue() is called. Inside
   drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map
   is acquired by lock_map_acquire() in flush_workqueue(). After the
   lock acquired the rest part of flush_workqueue() just wait for the
   workqueue to complete.
5) Now we look back at writeback thread routine bch_writeback_thread(),
   in the main while-loop, write_dirty() is called via continue_at() in
   read_dirty_submit(), which is called via continue_at() in while-loop
   level called function read_dirty(). Inside write_dirty() it may be
   re-called on workqueeu dc->writeback_write_wq via continue_at().
   It means when the writeback kthread is stopped in cached_dev_free()
   there might be still one kworker queued on dc->writeback_write_wq
   to execute write_dirty() again.
6) Now this kworker is scheduled on dc->writeback_write_wq to run by
   process_one_work() (which is called by worker_thread()). Before
   calling the kwork routine, wq->lockdep_map is acquired.
7) But wq->lockdep_map is acquired already in step 4), so a A-A lock
   (lockdep terminology) scenario happens.

Indeed on multiple cores syatem, the above deadlock is very rare to
happen, just as the code comments in process_one_work() says,
2263     * AFAICT there is no possible deadlock scenario between the
2264     * flush_work() and complete() primitives (except for
	   single-threaded
2265     * workqueues), so hiding them isn't a problem.

But it is still good to fix such lockdep warning, even no one running
bcache on single core system.

The fix is simple. This patch solves the above potential deadlock by,
- Do not destroy workqueue dc->writeback_write_wq in cached_dev_free().
- Flush and destroy dc->writeback_write_wq in writebach kthread routine
  bch_writeback_thread(), where after quit the thread main while-loop
  and before cached_dev_put() is called.

By this fix, dc->writeback_write_wq will be stopped and destroy before
the writeback kthread stopped, so the chance for a A-A locking on
wq->lockdep_map is disappeared, such A-A deadlock won't happen
any more.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
80265d8dfd bcache: acquire bch_register_lock later in cached_dev_free()
When enable lockdep engine, a lockdep warning can be observed when
reboot or shutdown system,

[ 3142.764557][    T1] bcache: bcache_reboot() Stopping all devices:
[ 3142.776265][ T2649]
[ 3142.777159][ T2649] ======================================================
[ 3142.780039][ T2649] WARNING: possible circular locking dependency detected
[ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G        W
[ 3142.785684][ T2649] ------------------------------------------------------
[ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock:
[ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[ 3142.794678][ T2649]
[ 3142.794678][ T2649] but task is already holding lock:
[ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.801462][ T2649]
[ 3142.801462][ T2649] which lock already depends on the new lock.
[ 3142.801462][ T2649]
[ 3142.805277][ T2649]
[ 3142.805277][ T2649] the existing dependency chain (in reverse order) is:
[ 3142.808902][ T2649]
[ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}:
[ 3142.812396][ T2649]        __mutex_lock+0x7a/0x9d0
[ 3142.814184][ T2649]        cached_dev_free+0x17/0x120 [bcache]
[ 3142.816415][ T2649]        process_one_work+0x2a4/0x640
[ 3142.818413][ T2649]        worker_thread+0x39/0x3f0
[ 3142.820276][ T2649]        kthread+0x125/0x140
[ 3142.822061][ T2649]        ret_from_fork+0x3a/0x50
[ 3142.823965][ T2649]
[ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[ 3142.827244][ T2649]        process_one_work+0x277/0x640
[ 3142.829160][ T2649]        worker_thread+0x39/0x3f0
[ 3142.830958][ T2649]        kthread+0x125/0x140
[ 3142.832674][ T2649]        ret_from_fork+0x3a/0x50
[ 3142.834915][ T2649]
[ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[ 3142.838121][ T2649]        lock_acquire+0xb4/0x1c0
[ 3142.840025][ T2649]        flush_workqueue+0xae/0x4c0
[ 3142.842035][ T2649]        drain_workqueue+0xa9/0x180
[ 3142.844042][ T2649]        destroy_workqueue+0x17/0x250
[ 3142.846142][ T2649]        cached_dev_free+0x52/0x120 [bcache]
[ 3142.848530][ T2649]        process_one_work+0x2a4/0x640
[ 3142.850663][ T2649]        worker_thread+0x39/0x3f0
[ 3142.852464][ T2649]        kthread+0x125/0x140
[ 3142.854106][ T2649]        ret_from_fork+0x3a/0x50
[ 3142.855880][ T2649]
[ 3142.855880][ T2649] other info that might help us debug this:
[ 3142.855880][ T2649]
[ 3142.859663][ T2649] Chain exists of:
[ 3142.859663][ T2649]   (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock
[ 3142.859663][ T2649]
[ 3142.865424][ T2649]  Possible unsafe locking scenario:
[ 3142.865424][ T2649]
[ 3142.868022][ T2649]        CPU0                    CPU1
[ 3142.869885][ T2649]        ----                    ----
[ 3142.871751][ T2649]   lock(&bch_register_lock);
[ 3142.873379][ T2649]                                lock((work_completion)(&cl->work)#2);
[ 3142.876399][ T2649]                                lock(&bch_register_lock);
[ 3142.879727][ T2649]   lock((wq_completion)bcache_writeback_wq);
[ 3142.882064][ T2649]
[ 3142.882064][ T2649]  *** DEADLOCK ***
[ 3142.882064][ T2649]
[ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649:
[ 3142.887245][ T2649]  #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.890815][ T2649]  #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.894884][ T2649]  #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.898797][ T2649]
[ 3142.898797][ T2649] stack backtrace:
[ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
[ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache]
[ 3142.911422][ T2649] Call Trace:
[ 3142.912656][ T2649]  dump_stack+0x85/0xcb
[ 3142.914181][ T2649]  print_circular_bug+0x19a/0x1f0
[ 3142.916193][ T2649]  __lock_acquire+0x16cd/0x1850
[ 3142.917936][ T2649]  ? __lock_acquire+0x6a8/0x1850
[ 3142.919704][ T2649]  ? lock_acquire+0xb4/0x1c0
[ 3142.921335][ T2649]  ? find_held_lock+0x34/0xa0
[ 3142.923052][ T2649]  lock_acquire+0xb4/0x1c0
[ 3142.924635][ T2649]  ? flush_workqueue+0x87/0x4c0
[ 3142.926375][ T2649]  flush_workqueue+0xae/0x4c0
[ 3142.928047][ T2649]  ? flush_workqueue+0x87/0x4c0
[ 3142.929824][ T2649]  ? drain_workqueue+0xa9/0x180
[ 3142.931686][ T2649]  drain_workqueue+0xa9/0x180
[ 3142.933534][ T2649]  destroy_workqueue+0x17/0x250
[ 3142.935787][ T2649]  cached_dev_free+0x52/0x120 [bcache]
[ 3142.937795][ T2649]  process_one_work+0x2a4/0x640
[ 3142.939803][ T2649]  worker_thread+0x39/0x3f0
[ 3142.941487][ T2649]  ? process_one_work+0x640/0x640
[ 3142.943389][ T2649]  kthread+0x125/0x140
[ 3142.944894][ T2649]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 3142.947744][ T2649]  ret_from_fork+0x3a/0x50
[ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped

Here is how the deadlock happens.
1) bcache_reboot() calls bcache_device_stop(), then inside
   bcache_device_stop() BCACHE_DEV_CLOSING bit is set on d->flags.
   Then closure_queue(&d->cl) is called to invoke cached_dev_flush().
2) In cached_dev_flush(), cached_dev_free() is called by continu_at().
3) In cached_dev_free(), when stopping the writeback kthread of the
   cached device by kthread_stop(), dc->writeback_thread will be waken
   up to quite the kthread while-loop, then cached_dev_put() is called
   in bch_writeback_thread().
4) Calling cached_dev_put() in writeback kthread may drop dc->count to
   0, then dc->detach kworker is scheduled, which is initialized as
   cached_dev_detach_finish().
5) Inside cached_dev_detach_finish(), the last line of code is to call
   closure_put(&dc->disk.cl), which drops the last reference counter of
   closrure dc->disk.cl, then the callback cached_dev_flush() gets
   called.
Now cached_dev_flush() is called for second time in the code path, the
first time is in step 2). And again bch_register_lock will be acquired
again, and a A-A lock (lockdep terminology) is happening.

The root cause of the above A-A lock is in cached_dev_free(), mutex
bch_register_lock is held before stopping writeback kthread and other
kworkers. Fortunately now we have variable 'bcache_is_reboot', which may
prevent device registration or unregistration during reboot/shutdown
time, so it is unncessary to hold bch_register_lock such early now.

This is how this patch fixes the reboot/shutdown time A-A lock issue:
After moving mutex_lock(&bch_register_lock) to a later location where
before atomic_read(&dc->running) in cached_dev_free(), such A-A lock
problem can be solved without any reboot time registration race.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
97ba3b816e bcache: acquire bch_register_lock later in cached_dev_detach_finish()
Now there is variable bcache_is_reboot to prevent device register or
unregister during reboot, it is unncessary to still hold mutex lock
bch_register_lock before stopping writeback_rate_update kworker and
writeback kthread. And if the stopping kworker or kthread holding
bch_register_lock inside their routine (we used to have such problem
in writeback thread, thanks to Junhui Wang fixed it), it is very easy
to introduce deadlock during reboot/shutdown procedure.

Therefore in this patch, the location to acquire bch_register_lock is
moved to the location before calling calc_cached_dev_sectors(). Which
is later then original location in cached_dev_detach_finish().

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
a59ff6ccc2 bcache: avoid a deadlock in bcache_reboot()
It is quite frequently to observe deadlock in bcache_reboot() happens
and hang the system reboot process. The reason is, in bcache_reboot()
when calling bch_cache_set_stop() and bcache_device_stop() the mutex
bch_register_lock is held. But in the process to stop cache set and
bcache device, bch_register_lock will be acquired again. If this mutex
is held here, deadlock will happen inside the stopping process. The
aftermath of the deadlock is, whole system reboot gets hung.

The fix is to avoid holding bch_register_lock for the following loops
in bcache_reboot(),
       list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
                bch_cache_set_stop(c);

        list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
                bcache_device_stop(&dc->disk);

A module range variable 'bcache_is_reboot' is added, it sets to true
in bcache_reboot(). In register_bcache(), if bcache_is_reboot is checked
to be true, reject the registration by returning -EBUSY immediately.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
5c2a634cbf bcache: stop writeback kthread and kworker when bch_cached_dev_run() failed
In bch_cached_dev_attach() after bch_cached_dev_writeback_start()
called, the wrireback kthread and writeback rate update kworker of the
cached device are created, if the following bch_cached_dev_run()
failed, bch_cached_dev_attach() will return with -ENOMEM without
stopping the writeback related kthread and kworker.

This patch stops writeback kthread and writeback rate update kworker
before returning -ENOMEM if bch_cached_dev_run() returns error.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
f54d801dda bcache: destroy dc->writeback_write_wq if failed to create dc->writeback_thread
Commit 9baf30972b ("bcache: fix for gc and write-back race") added a
new work queue dc->writeback_write_wq, but forgot to destroy it in the
error condition when creating dc->writeback_thread failed.

This patch destroys dc->writeback_write_wq if kthread_create() returns
error pointer to dc->writeback_thread, then a memory leak is avoided.

Fixes: 9baf30972b ("bcache: fix for gc and write-back race")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
5461999848 bcache: fix mistaken sysfs entry for io_error counter
In bch_cached_dev_files[] from driver/md/bcache/sysfs.c, sysfs_errors is
incorrectly inserted in. The correct entry should be sysfs_io_errors.

This patch fixes the problem and now I/O errors of cached device can be
read from /sys/block/bcache<N>/bcache/io_errors.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
0c277e211a bcache: add pendings_cleanup to stop pending bcache device
If a bcache device is in dirty state and its cache set is not
registered, this bcache device will not appear in /dev/bcache<N>,
and there is no way to stop it or remove the bcache kernel module.

This is an as-designed behavior, but sometimes people has to reboot
whole system to release or stop the pending backing device.

This sysfs interface may remove such pending bcache devices when
write anything into the sysfs file manually.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
944a4f340a bcache: make bset_search_tree() be more understandable
The purpose of following code in bset_search_tree() is to avoid a branch
instruction,
 994         if (likely(f->exponent != 127))
 995                 n = j * 2 + (((unsigned int)
 996                               (f->mantissa -
 997                                bfloat_mantissa(search, f))) >> 31);
 998         else
 999                 n = (bkey_cmp(tree_to_bkey(t, j), search) > 0)
1000                         ? j * 2
1001                         : j * 2 + 1;

This piece of code is not very clear to understand, even when I tried to
add code comment for it, I made mistake. This patch removes the implict
bit operation and uses explicit branch to calculate next location in
binary tree search.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
68a53c95a0 bcache: remove "XXX:" comment line from run_cache_set()
In previous bcache patches for Linux v5.2, the failure code path of
run_cache_set() is tested and fixed. So now the following comment
line can be removed from run_cache_set(),
	/* XXX: test this, it's broken */

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
e0faa3d7f7 bcache: improve error message in bch_cached_dev_run()
This patch adds more error message in bch_cached_dev_run() to indicate
the exact reason why an error value is returned. Please notice when
printing out the "is running already" message, pr_info() is used here,
because in this case also -EBUSY is returned, the bcache device can
continue to attach to the cache devince and run, so it won't be an
error level message in kernel message.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
633bb2ce60 bcache: add more error message in bch_cached_dev_attach()
This patch adds more error message for attaching cached device, this is
helpful to debug code failure during bache device start up.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
4b6efb4bdb bcache: more detailed error message to bcache_device_link()
This patch adds more accurate error message for specific
ssyfs_create_link() call, to help debugging failure during
bcache device start tup.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
383ff2183a bcache: check CACHE_SET_IO_DISABLE bit in bch_journal()
When too many I/O errors happen on cache set and CACHE_SET_IO_DISABLE
bit is set, bch_journal() may continue to work because the journaling
bkey might be still in write set yet. The caller of bch_journal() may
believe the journal still work but the truth is in-memory journal write
set won't be written into cache device any more. This behavior may
introduce potential inconsistent metadata status.

This patch checks CACHE_SET_IO_DISABLE bit at the head of bch_journal(),
if the bit is set, bch_journal() returns NULL immediately to notice
caller to know journal does not work.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
e775339e1a bcache: check CACHE_SET_IO_DISABLE in allocator code
If CACHE_SET_IO_DISABLE of a cache set flag is set by too many I/O
errors, currently allocator routines can still continue allocate
space which may introduce inconsistent metadata state.

This patch checkes CACHE_SET_IO_DISABLE bit in following allocator
routines,
- bch_bucket_alloc()
- __bch_bucket_alloc_set()
Once CACHE_SET_IO_DISABLE is set on cache set, the allocator routines
may reject allocation request earlier to avoid potential inconsistent
metadata.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
bd9026c8a7 bcache: remove unncessary code in bch_btree_keys_init()
Function bch_btree_keys_init() initializes b->set[].size and
b->set[].data to zero. As the code comments indicates, these code indeed
is unncessary, because both struct btree_keys and struct bset_tree are
nested embedded into struct btree, when struct btree is filled with 0
bits by kzalloc() in mca_bucket_alloc(), b->set[].size and
b->set[].data are initialized to 0 (a.k.a NULL) already.

This patch removes the redundant code, and add comments in
bch_btree_keys_init() and mca_bucket_alloc() to explain why it's safe.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
0b13efecf5 bcache: add return value check to bch_cached_dev_run()
This patch adds return value check to bch_cached_dev_run(), now if there
is error happens inside bch_cached_dev_run(), it can be catched.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Alexandru Ardelean
89e0341af0 bcache: use sysfs_match_string() instead of __sysfs_match_string()
The arrays (of strings) that are passed to __sysfs_match_string() are
static, so use sysfs_match_string() which does an implicit ARRAY_SIZE()
over these arrays.

Functionally, this doesn't change anything.
The change is more cosmetic.

It only shrinks the static arrays by 1 byte each.

Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
f960facb39 bcache: remove unnecessary prefetch() in bset_search_tree()
In function bset_search_tree(), when p >= t->size, t->tree[0] will be
prefetched by the following code piece,
 974                 unsigned int p = n << 4;
 975
 976                 p &= ((int) (p - t->size)) >> 31;
 977
 978                 prefetch(&t->tree[p]);

The purpose of the above code is to avoid a branch instruction, but
when p >= t->size, prefetch(&t->tree[0]) has no positive performance
contribution at all. This patch avoids the unncessary prefetch by only
calling prefetch() when p < t->size.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
08ec1e6282 bcache: add io error counting in write_bdev_super_endio()
When backing device super block is written by bch_write_bdev_super(),
the bio complete callback write_bdev_super_endio() simply ignores I/O
status. Indeed such write request also contribute to backing device
health status if the request failed.

This patch checkes bio->bi_status in write_bdev_super_endio(), if there
is error, bch_count_backing_io_errors() will be called to count an I/O
error to dc->io_errors.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
578df99b1b bcache: ignore read-ahead request failure on backing device
When md raid device (e.g. raid456) is used as backing device, read-ahead
requests on a degrading and recovering md raid device might be failured
immediately by md raid code, but indeed this md raid array can still be
read or write for normal I/O requests. Therefore such failed read-ahead
request are not real hardware failure. Further more, after degrading and
recovering accomplished, read-ahead requests will be handled by md raid
array again.

For such condition, I/O failures of read-ahead requests don't indicate
real health status (because normal I/O still be served), they should not
be counted into I/O error counter dc->io_errors.

Since there is no simple way to detect whether the backing divice is a
md raid device, this patch simply ignores I/O failures for read-ahead
bios on backing device, to avoid bogus backing device failure on a
degrading md raid array.

Suggested-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
e6dcbd3e6c bcache: avoid flushing btree node in cache_set_flush() if io disabled
When cache_set_flush() is called for too many I/O errors detected on
cache device and the cache set is retiring, inside the function it
doesn't make sense to flushing cached btree nodes from c->btree_cache
because CACHE_SET_IO_DISABLE is set on c->flags already and all I/Os
onto cache device will be rejected.

This patch checks in cache_set_flush() that whether CACHE_SET_IO_DISABLE
is set. If yes, then avoids to flush the cached btree nodes to reduce
more time and make cache set retiring more faster.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
695277f16b Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()"
This reverts commit 6147305c73.

Although this patch helps the failed bcache device to stop faster when
too many I/O errors detected on corresponding cached device, setting
CACHE_SET_IO_DISABLE bit to cache set c->flags was not a good idea. This
operation will disable all I/Os on cache set, which means other attached
bcache devices won't work neither.

Without this patch, the failed bcache device can also be stopped
eventually if internal I/O accomplished (e.g. writeback). Therefore here
I revert it.

Fixes: 6147305c73 ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
Reported-by: Yong Li <mr.liyong@qq.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
0ae49cb7aa bcache: fix return value error in bch_journal_read()
When everything is OK in bch_journal_read(), finally the return value
is returned by,
	return ret;
which assumes ret will be 0 here. This assumption is wrong when all
journal buckets as are full and filled with valid journal entries. In
such cache the last location referencess read_bucket() sets 'ret' to
1, which means new jset added into jset list. The jset list is list
'journal' in caller run_cache_set().

Return 1 to run_cache_set() means something wrong and the cache set
won't start, but indeed everything is OK.

This patch changes the line at end of bch_journal_read() to directly
return 0 since everything if verything is good. Then a bogus error
is fixed.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
b387e9b586 bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush()
When system memory is in heavy pressure, bch_gc_thread_start() from
run_cache_set() may fail due to out of memory. In such condition,
c->gc_thread is assigned to -ENOMEM, not NULL pointer. Then in following
failure code path bch_cache_set_error(), when cache_set_flush() gets
called, the code piece to stop c->gc_thread is broken,
         if (!IS_ERR_OR_NULL(c->gc_thread))
                 kthread_stop(c->gc_thread);

And KASAN catches such NULL pointer deference problem, with the warning
information:

[  561.207881] ==================================================================
[  561.207900] BUG: KASAN: null-ptr-deref in kthread_stop+0x3b/0x440
[  561.207904] Write of size 4 at addr 000000000000001c by task kworker/15:1/313

[  561.207913] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G        W         5.0.0-vanilla+ #3
[  561.207916] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[  561.207935] Workqueue: events cache_set_flush [bcache]
[  561.207940] Call Trace:
[  561.207948]  dump_stack+0x9a/0xeb
[  561.207955]  ? kthread_stop+0x3b/0x440
[  561.207960]  ? kthread_stop+0x3b/0x440
[  561.207965]  kasan_report+0x176/0x192
[  561.207973]  ? kthread_stop+0x3b/0x440
[  561.207981]  kthread_stop+0x3b/0x440
[  561.207995]  cache_set_flush+0xd4/0x6d0 [bcache]
[  561.208008]  process_one_work+0x856/0x1620
[  561.208015]  ? find_held_lock+0x39/0x1d0
[  561.208028]  ? drain_workqueue+0x380/0x380
[  561.208048]  worker_thread+0x87/0xb80
[  561.208058]  ? __kthread_parkme+0xb6/0x180
[  561.208067]  ? process_one_work+0x1620/0x1620
[  561.208072]  kthread+0x326/0x3e0
[  561.208079]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  561.208090]  ret_from_fork+0x3a/0x50
[  561.208110] ==================================================================
[  561.208113] Disabling lock debugging due to kernel taint
[  561.208115] irq event stamp: 11800231
[  561.208126] hardirqs last  enabled at (11800231): [<ffffffff83008538>] do_syscall_64+0x18/0x410
[  561.208127] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[  561.208129] #PF error: [WRITE]
[  561.312253] hardirqs last disabled at (11800230): [<ffffffff830052ff>] trace_hardirqs_off_thunk+0x1a/0x1c
[  561.312259] softirqs last  enabled at (11799832): [<ffffffff850005c7>] __do_softirq+0x5c7/0x8c3
[  561.405975] PGD 0 P4D 0
[  561.442494] softirqs last disabled at (11799821): [<ffffffff831add2c>] irq_exit+0x1ac/0x1e0
[  561.791359] Oops: 0002 [#1] SMP KASAN NOPTI
[  561.791362] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G    B   W         5.0.0-vanilla+ #3
[  561.791363] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[  561.791371] Workqueue: events cache_set_flush [bcache]
[  561.791374] RIP: 0010:kthread_stop+0x3b/0x440
[  561.791376] Code: 00 00 65 8b 05 26 d5 e0 7c 89 c0 48 0f a3 05 ec aa df 02 0f 82 dc 02 00 00 4c 8d 63 20 be 04 00 00 00 4c 89 e7 e8 65 c5 53 00 <f0> ff 43 20 48 8d 7b 24 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
[  561.791377] RSP: 0018:ffff88872fc8fd10 EFLAGS: 00010286
[  561.838895] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838916] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838934] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838948] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838966] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838979] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838996] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  563.067028] RAX: 0000000000000000 RBX: fffffffffffffffc RCX: ffffffff832dd314
[  563.067030] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000297
[  563.067032] RBP: ffff88872fc8fe88 R08: fffffbfff0b8213d R09: fffffbfff0b8213d
[  563.067034] R10: 0000000000000001 R11: fffffbfff0b8213c R12: 000000000000001c
[  563.408618] R13: ffff88dc61cc0f68 R14: ffff888102b94900 R15: ffff88dc61cc0f68
[  563.408620] FS:  0000000000000000(0000) GS:ffff888f7dc00000(0000) knlGS:0000000000000000
[  563.408622] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  563.408623] CR2: 000000000000001c CR3: 0000000f48a1a004 CR4: 00000000007606e0
[  563.408625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  563.408627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  563.904795] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  563.915796] PKRU: 55555554
[  563.915797] Call Trace:
[  563.915807]  cache_set_flush+0xd4/0x6d0 [bcache]
[  563.915812]  process_one_work+0x856/0x1620
[  564.001226] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  564.033563]  ? find_held_lock+0x39/0x1d0
[  564.033567]  ? drain_workqueue+0x380/0x380
[  564.033574]  worker_thread+0x87/0xb80
[  564.062823] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  564.118042]  ? __kthread_parkme+0xb6/0x180
[  564.118046]  ? process_one_work+0x1620/0x1620
[  564.118048]  kthread+0x326/0x3e0
[  564.118050]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  564.167066] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  564.252441]  ret_from_fork+0x3a/0x50
[  564.252447] Modules linked in: msr rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib i40iw configfs iw_cm ib_cm libiscsi scsi_transport_iscsi mlx4_ib ib_uverbs mlx4_en ib_core nls_iso8859_1 nls_cp437 vfat fat intel_rapl skx_edac x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ses raid0 aesni_intel cdc_ether enclosure usbnet ipmi_ssif joydev aes_x86_64 i40e scsi_transport_sas mii bcache md_mod crypto_simd mei_me ioatdma crc64 ptp cryptd pcspkr i2c_i801 mlx4_core glue_helper pps_core mei lpc_ich dca wmi ipmi_si ipmi_devintf nd_pmem dax_pmem nd_btt ipmi_msghandler device_dax pcc_cpufreq button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect xhci_pci sysimgblt fb_sys_fops xhci_hcd ttm megaraid_sas drm usbcore nfit libnvdimm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[  564.299390] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  564.348360] CR2: 000000000000001c
[  564.348362] ---[ end trace b7f0e5cc7b2103b0 ]---

Therefore, it is not enough to only check whether c->gc_thread is NULL,
we should use IS_ERR_OR_NULL() to check both NULL pointer and error
value.

This patch changes the above buggy code piece in this way,
         if (!IS_ERR_OR_NULL(c->gc_thread))
                 kthread_stop(c->gc_thread);

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:13 -06:00
Coly Li
141df8bb5d bcache: don't set max writeback rate if gc is running
When gc is running, user space I/O processes may wait inside
bcache code, so no new I/O coming. Indeed this is not a real idle
time, maximum writeback rate should not be set in such situation.
Otherwise a faster writeback thread may compete locks with gc thread
and makes garbage collection slower, which results a longer I/O
freeze period.

This patch checks c->gc_mark_valid in set_at_max_writeback_rate(). If
c->gc_mark_valid is 0 (gc running), set_at_max_writeback_rate() returns
false, then update_writeback_rate() will not set writeback rate to
maximum value even c->idle_counter reaches an idle threshold.

Now writeback thread won't interfere gc thread performance.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:13 -06:00
Linus Torvalds
65ee21eb63 - Fix incorrect uses of kstrndup and DM logging macros in DM's early
init code.
 
 - Fix DM log-writes target's handling of super block sectors so updates
   are made in order through use of completion.
 
 - Fix DM core's argument splitting code to avoid undefined behaviour
   reported as a side-effect of UBSAN analysis on ppc64le.
 
 - Fix DM verity target to limit the amount of error messages that can
   result from a corrupt block being found.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl0VBp4THHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWlLKB/9wnB1FWhVIwt5j0e3YpqToThptT22N
 sWO/vBKtIR8nbbohXeam7mgh15+D61SpY+8jdkgAlIQonPzCwFrgSWdtpbbveYw6
 gFWAdd013R+piGX40l0EztwAQ9IawwFl7JP+fBevxFQdGAVT6SpaJfBWH0XcIBbb
 udEtoeW0dVp06SFsokGq6TPSRCdxwMh9JpRplgtLmz0yhfsWeFivjw2AMIen5QqB
 Z0zcJnL8UMRdJq0EEqjQ3z9CH/Th8Nryxibo8YLnADPBfv7RyR2+wCjrbode8LAj
 LZHjHikxzt39F9PpJIPvVuvNMzrBdQPDF61O5MK7dOMl49GfIoz2ybvH
 =oEWd
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix incorrect uses of kstrndup and DM logging macros in DM's early
   init code.

 - Fix DM log-writes target's handling of super block sectors so updates
   are made in order through use of completion.

 - Fix DM core's argument splitting code to avoid undefined behaviour
   reported as a side-effect of UBSAN analysis on ppc64le.

 - Fix DM verity target to limit the amount of error messages that can
   result from a corrupt block being found.

* tag 'for-5.2/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm verity: use message limit for data block corruption message
  dm table: don't copy from a NULL pointer in realloc_argv()
  dm log writes: make sure super sector log updates are written in order
  dm init: remove trailing newline from calls to DMERR() and DMINFO()
  dm init: fix incorrect uses of kstrndup()
2019-06-28 08:48:21 +08:00
David Howells
2e12256b9a keys: Replace uid/gid/perm permissions checking with an ACL
Replace the uid/gid/perm permissions checking on a key with an ACL to allow
the SETATTR and SEARCH permissions to be split.  This will also allow a
greater range of subjects to represented.

============
WHY DO THIS?
============

The problem is that SETATTR and SEARCH cover a slew of actions, not all of
which should be grouped together.

For SETATTR, this includes actions that are about controlling access to a
key:

 (1) Changing a key's ownership.

 (2) Changing a key's security information.

 (3) Setting a keyring's restriction.

And actions that are about managing a key's lifetime:

 (4) Setting an expiry time.

 (5) Revoking a key.

and (proposed) managing a key as part of a cache:

 (6) Invalidating a key.

Managing a key's lifetime doesn't really have anything to do with
controlling access to that key.

Expiry time is awkward since it's more about the lifetime of the content
and so, in some ways goes better with WRITE permission.  It can, however,
be set unconditionally by a process with an appropriate authorisation token
for instantiating a key, and can also be set by the key type driver when a
key is instantiated, so lumping it with the access-controlling actions is
probably okay.

As for SEARCH permission, that currently covers:

 (1) Finding keys in a keyring tree during a search.

 (2) Permitting keyrings to be joined.

 (3) Invalidation.

But these don't really belong together either, since these actions really
need to be controlled separately.

Finally, there are number of special cases to do with granting the
administrator special rights to invalidate or clear keys that I would like
to handle with the ACL rather than key flags and special checks.


===============
WHAT IS CHANGED
===============

The SETATTR permission is split to create two new permissions:

 (1) SET_SECURITY - which allows the key's owner, group and ACL to be
     changed and a restriction to be placed on a keyring.

 (2) REVOKE - which allows a key to be revoked.

The SEARCH permission is split to create:

 (1) SEARCH - which allows a keyring to be search and a key to be found.

 (2) JOIN - which allows a keyring to be joined as a session keyring.

 (3) INVAL - which allows a key to be invalidated.

The WRITE permission is also split to create:

 (1) WRITE - which allows a key's content to be altered and links to be
     added, removed and replaced in a keyring.

 (2) CLEAR - which allows a keyring to be cleared completely.  This is
     split out to make it possible to give just this to an administrator.

 (3) REVOKE - see above.


Keys acquire ACLs which consist of a series of ACEs, and all that apply are
unioned together.  An ACE specifies a subject, such as:

 (*) Possessor - permitted to anyone who 'possesses' a key
 (*) Owner - permitted to the key owner
 (*) Group - permitted to the key group
 (*) Everyone - permitted to everyone

Note that 'Other' has been replaced with 'Everyone' on the assumption that
you wouldn't grant a permit to 'Other' that you wouldn't also grant to
everyone else.

Further subjects may be made available by later patches.

The ACE also specifies a permissions mask.  The set of permissions is now:

	VIEW		Can view the key metadata
	READ		Can read the key content
	WRITE		Can update/modify the key content
	SEARCH		Can find the key by searching/requesting
	LINK		Can make a link to the key
	SET_SECURITY	Can change owner, ACL, expiry
	INVAL		Can invalidate
	REVOKE		Can revoke
	JOIN		Can join this keyring
	CLEAR		Can clear this keyring


The KEYCTL_SETPERM function is then deprecated.

The KEYCTL_SET_TIMEOUT function then is permitted if SET_SECURITY is set,
or if the caller has a valid instantiation auth token.

The KEYCTL_INVALIDATE function then requires INVAL.

The KEYCTL_REVOKE function then requires REVOKE.

The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an
existing keyring.

The JOIN permission is enabled by default for session keyrings and manually
created keyrings only.


======================
BACKWARD COMPATIBILITY
======================

To maintain backward compatibility, KEYCTL_SETPERM will translate the
permissions mask it is given into a new ACL for a key - unless
KEYCTL_SET_ACL has been called on that key, in which case an error will be
returned.

It will convert possessor, owner, group and other permissions into separate
ACEs, if each portion of the mask is non-zero.

SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY.  WRITE
permission turns on WRITE, REVOKE and, if a keyring, CLEAR.  JOIN is turned
on if a keyring is being altered.

The KEYCTL_DESCRIBE function translates the ACL back into a permissions
mask to return depending on possessor, owner, group and everyone ACEs.

It will make the following mappings:

 (1) INVAL, JOIN -> SEARCH

 (2) SET_SECURITY -> SETATTR

 (3) REVOKE -> WRITE if SETATTR isn't already set

 (4) CLEAR -> WRITE

Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
the value set with KEYCTL_SETATTR.


=======
TESTING
=======

This passes the keyutils testsuite for all but a couple of tests:

 (1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now
     returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed
     if the type doesn't have ->read().  You still can't actually read the
     key.

 (2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't
     work as Other has been replaced with Everyone in the ACL.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-06-27 23:03:07 +01:00
Dan Carpenter
16d4b74654 md/raid1: Fix a warning message in remove_wb()
The WARN_ON() macro doesn't take an error message, it just takes a
condition.  I've changed this to use WARN(1, "...") instead.

Fixes: 3e148a3209 ("md/raid1: fix potential data inconsistency issue with write behind device")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-26 12:18:48 -07:00
Milan Broz
2eba4e640b dm verity: use message limit for data block corruption message
DM verity should also use DMERR_LIMIT to limit repeat data block
corruption messages.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-06-25 14:09:14 -04:00
Jerome Marchand
a065192655 dm table: don't copy from a NULL pointer in realloc_argv()
For the first call to realloc_argv() in dm_split_args(), old_argv is
NULL and size is zero. Then memcpy is called, with the NULL old_argv
as the source argument and a zero size argument. AFAIK, this is
undefined behavior and generates the following warning when compiled
with UBSAN on ppc64le:

In file included from ./arch/powerpc/include/asm/paca.h:19,
                 from ./arch/powerpc/include/asm/current.h:16,
                 from ./include/linux/sched.h:12,
                 from ./include/linux/kthread.h:6,
                 from drivers/md/dm-core.h:12,
                 from drivers/md/dm-table.c:8:
In function 'memcpy',
    inlined from 'realloc_argv' at drivers/md/dm-table.c:565:3,
    inlined from 'dm_split_args' at drivers/md/dm-table.c:588:9:
./include/linux/string.h:345:9: error: argument 2 null where non-null expected [-Werror=nonnull]
  return __builtin_memcpy(p, q, size);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/dm-table.c: In function 'dm_split_args':
./include/linux/string.h:345:9: note: in a call to built-in function '__builtin_memcpy'

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-06-25 14:09:13 -04:00
zhangyi (F)
211ad4b733 dm log writes: make sure super sector log updates are written in order
Currently, although we submit super bios in order (and super.nr_entries
is incremented by each logged entry), submit_bio() is async so each
super sector may not be written to log device in order and then the
final nr_entries may be smaller than it should be.

This problem can be reproduced by the xfstests generic/455 with ext4:

  QA output created by 455
 -Silence is golden
 +mark 'end' does not exist

Fix this by serializing submission of super sectors to make sure each
is written to the log disk in order.

Fixes: 0e9cebe724 ("dm: add log writes target")
Cc: stable@vger.kernel.org
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-06-25 14:09:13 -04:00
Stephen Boyd
10c9c8e7c0 dm init: remove trailing newline from calls to DMERR() and DMINFO()
These printing macros already add a trailing newline, so having another
one here just makes for blank lines when these prints are enabled.
Remove these needless newlines.

Fixes: 6bbc923dfc ("dm: add support to directly boot to a mapped device")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-06-25 13:43:09 -04:00
Gen Zhang
dec7e6494e dm init: fix incorrect uses of kstrndup()
Fix 2 kstrndup() calls with incorrect argument order.

Fixes: 6bbc923dfc ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org # v5.1
Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-06-25 13:34:52 -04:00
Guoqing Jiang
d494549ac8 md: add bitmap_abort label in md_run
Now, there are two places need to consider about
the failure of destroy bitmap, so move the common
part between bitmap_abort and abort label.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
617b194a13 md-bitmap: create and destroy wb_info_pool with the change of bitmap
The write-behind attribute is part of bitmap, since bitmap
can be added/removed dynamically with the following.

1. mdadm --grow /dev/md0 --bitmap=none
2. mdadm --grow /dev/md0 --bitmap=internal --write-behind

So we need to destroy wb_info_pool in md_bitmap_destroy,
and create the pool before load bitmap.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
10c92fca63 md-bitmap: create and destroy wb_info_pool with the change of backlog
Since we can enable write-behind mode by write backlog node,
so create wb_info_pool if the mode is just enabled, also call
call md_bitmap_update_sb to make user aware the write-behind
mode is enabled. Conversely, wb_info_pool should be destroyed
when write-behind mode is disabled.

Beside above, it is better to update bitmap sb if we change
the number of max_write_behind.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
963c555e75 md: introduce mddev_create/destroy_wb_pool for the change of member device
Previously, we called rdev_init_wb to avoid potential data
inconsistency when array is created.

Now, we need to call the function and create mempool if a
device is added or just be flaged as "writemostly". So
mddev_create_wb_pool is introduced and called accordingly.
And for safety reason, we mark implicit GFP_NOIO allocation
scope for create mempool during mddev_suspend/mddev_resume.

And mempool should be removed conversely after remove a
member device or its's "writemostly" flag, which is done
by call mddev_destroy_wb_pool.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
3e148a3209 md/raid1: fix potential data inconsistency issue with write behind device
For write-behind mode, we think write IO is complete once it has
reached all the non-writemostly devices. It works fine for single
queue devices.

But for multiqueue device, if there are lots of IOs come from upper
layer, then the write-behind device could issue those IOs to different
queues, depends on the each queue's delay, so there is no guarantee
that those IOs can arrive in order.

To address the issue, we need to check the collision among write
behind IOs, we can only continue without collision, otherwise wait
for the completion of previous collisioned IO.

And WBCollision is introduced for multiqueue device which is worked
under write-behind mode.

But this patch doesn't handle below cases which could have the data
inconsistency issue as well, these cases will be handled in later
patches.

1. modify max_write_behind by write backlog node.
2. add or remove array's bitmap dynamically.
3. the change of member disk.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:35:59 -07:00
Christoph Hellwig
14ccb66b3f block: remove the bi_phys_segments field in struct bio
We only need the number of segments in the blk-mq submission path.
Remove the field from struct bio, and return it from a variant of
blk_queue_split instead of that it can passed as an argument to
those functions that need the value.

This also means we stop recounting segments except for cloning
and partial segments.

To keep the number of arguments in this how path down remove
pointless struct request_queue arguments from any of the functions
that had it and grew a nr_segs argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:29:22 -06:00
Mariusz Tkaczyk
9642fa73d0 md: fix for divide error in status_resync
Stopping external metadata arrays during resync/recovery causes
retries, loop of interrupting and starting reconstruction, until it
hit at good moment to stop completely. While these retries
curr_mark_cnt can be small- especially on HDD drives, so subtraction
result can be smaller than 0. However it is casted to uint without
checking. As a result of it the status bar in /proc/mdstat while stopping
is strange (it jumps between 0% and 99%).

The real problem occurs here after commit 72deb455b5 ("block: remove
CONFIG_LBDAF"). Sector_div() macro has been changed, now the
divisor is casted to uint32. For db = -8 the divisior(db/32-1) becomes 0.

Check if db value can be really counted and replace these macro by
div64_u64() inline.

Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-18 08:02:25 -07:00
Guoqing Jiang
e9eeba28a1 md/raid10: read balance chooses idlest disk for SSD
Andy reported that raid10 array with SSD disks has poor
read performance. Compared with raid1, RAID-1 can be 3x
faster than RAID-10 sometimes [1].

The thing is that raid10 chooses the low distance disk
for read request, however, the approach doesn't work
well for SSD device since it doesn't have spindle like
HDD, we should just read from the SSD which has less
pending IO like commit 9dedf60313 ("md/raid1: read
balance chooses idlest disk for SSD").

So this commit selects the idlest SSD disk for read if
array has none rotational disk, otherwise, read_balance
uses the previous distance priority algorithm. With the
change, the performance of raid10 gets increased largely
per Andy's test [2].

[1]. https://marc.info/?l=linux-raid&m=155915890004761&w=2
[2]. https://marc.info/?l=linux-raid&m=155990654223786&w=2

Tested-by: Andy Smith <andy@strugglers.net>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:35 -06:00
Marcos Paulo de Souza
c7afa8034b md: raid1-10: Unify r{1,10}bio_pool_free
Avoiding duplicated code, since they just execute a kfree.

Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:35 -06:00
Gustavo A. R. Silva
8cf05a7841 md: raid10: Use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
   int stuff;
   struct boo entry[];
};

instance = kmalloc(size, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Yufen Yu
ebfeb444fa md/raid1: get rid of extra blank line and space
This patch get rid of extra blank line and space, and
add necessary space for code.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Yufen Yu
e5b521ee9b md: fix spelling typo and add necessary space
This patch fix a spelling typo and add necessary space for code.
In addition, the patch get rid of the unnecessary 'if'.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Marcos Paulo de Souza
168b305b0c md: md.c: Return -ENODEV when mddev is NULL in rdev_attr_show
Commit c42d324099
("md: return -ENODEV if rdev has no mddev assigned") changed
rdev_attr_store to return -ENODEV when rdev->mddev is NULL, now do the
same to rdev_attr_show.

Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Xiao Ni
d9771f5ec4 raid5-cache: Need to do start() part job after adding journal device
commit d5d885fd51 ("md: introduce new personality funciton start()")
splits the init job to two parts. The first part run() does the jobs that
do not require the md threads. The second part start() does the jobs that
require the md threads.

Now it just does run() in adding new journal device. It needs to do the
second part start() too.

Fixes: d5d885fd51 ("md: introduce new personality funciton start()")
Cc: stable@vger.kernel.org #v4.9+
Reported-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Marcos Paulo de Souza
3f677f9c99 drivers: md: Unify common definitions of raid1 and raid10
These definitions are being moved to raid1-10.c.

Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Mauro Carvalho Chehab
f0ba43774c docs: convert docs to ReST and rename to *.rst
The conversion is actually:
  - add blank lines and indentation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-14 14:21:04 -06:00
Coly Li
1f0ffa6734 bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached
When people set a writeback percent via sysfs file,
  /sys/block/bcache<N>/bcache/writeback_percent
current code directly sets BCACHE_DEV_WB_RUNNING to dc->disk.flags
and schedules kworker dc->writeback_rate_update.

If there is no cache set attached to, the writeback kernel thread is
not running indeed, running dc->writeback_rate_update does not make
sense and may cause NULL pointer deference when reference cache set
pointer inside update_writeback_rate().

This patch checks whether the cache set point (dc->disk.c) is NULL in
sysfs interface handler, and only set BCACHE_DEV_WB_RUNNING and
schedule dc->writeback_rate_update when dc->disk.c is not NULL (it
means the cache device is attached to a cache set).

This problem might be introduced from initial bcache commit, but
commit 3fd47bfe55 ("bcache: stop dc->writeback_rate_update properly")
changes part of the original code piece, so I add 'Fixes: 3fd47bfe55b0'
to indicate from which commit this patch can be applied.

Fixes: 3fd47bfe55 ("bcache: stop dc->writeback_rate_update properly")
Reported-by: Bjørn Forsman <bjorn.forsman@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Bjørn Forsman <bjorn.forsman@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-13 03:09:15 -06:00
Coly Li
31b90956b1 bcache: fix stack corruption by PRECEDING_KEY()
Recently people report bcache code compiled with gcc9 is broken, one of
the buggy behavior I observe is that two adjacent 4KB I/Os should merge
into one but they don't. Finally it turns out to be a stack corruption
caused by macro PRECEDING_KEY().

See how PRECEDING_KEY() is defined in bset.h,
437 #define PRECEDING_KEY(_k)                                       \
438 ({                                                              \
439         struct bkey *_ret = NULL;                               \
440                                                                 \
441         if (KEY_INODE(_k) || KEY_OFFSET(_k)) {                  \
442                 _ret = &KEY(KEY_INODE(_k), KEY_OFFSET(_k), 0);  \
443                                                                 \
444                 if (!_ret->low)                                 \
445                         _ret->high--;                           \
446                 _ret->low--;                                    \
447         }                                                       \
448                                                                 \
449         _ret;                                                   \
450 })

At line 442, _ret points to address of a on-stack variable combined by
KEY(), the life range of this on-stack variable is in line 442-446,
once _ret is returned to bch_btree_insert_key(), the returned address
points to an invalid stack address and this address is overwritten in
the following called bch_btree_iter_init(). Then argument 'search' of
bch_btree_iter_init() points to some address inside stackframe of
bch_btree_iter_init(), exact address depends on how the compiler
allocates stack space. Now the stack is corrupted.

Fixes: 0eacac2203 ("bcache: PRECEDING_KEY()")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Rolf Fokkens <rolf@rolffokkens.nl>
Reviewed-by: Pierre JUHEN <pierre.juhen@orange.fr>
Tested-by: Shenghui Wang <shhuiw@foxmail.com>
Tested-by: Pierre JUHEN <pierre.juhen@orange.fr>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-13 03:09:14 -06:00
Thomas Gleixner
55716d2643 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428
Based on 1 normalized pattern(s):

  this file is released under the gplv2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 68 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:16 +02:00
Thomas Gleixner
2025cf9e19 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 263 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.208660670@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:37 +02:00
Thomas Gleixner
1a59d1b8e0 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not write to the free software foundation inc
  59 temple place suite 330 boston ma 02111 1307 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1334 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:35 -07:00
Thomas Gleixner
2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Linus Torvalds
b2ad81363f libnvdimm fixes v5.2-rc2
- Fix a regression that disabled device-mapper dax support
 
 - Remove unnecessary hardened-user-copy overhead (>30%) for dax
   read(2)/write(2).
 
 - Fix some compilation warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc6WWQAAoJEB7SkWpmfYgCpVwP/0Vfq/3ChbH5T7s4x2MkpLX+
 metYwCyzPJK32mVMbAmizGWEBn8Np+eZcU7jvKYpDXJLWdbUUz4oZD04RYmgkYp7
 SHmjn9VdpfMSziWUx6zrrbyAtBq04x7GT7IIkCzlGIuNVCYqXBnRSVGz06tDFEEd
 pU9HtZr32C425pdFK5D4sorJED2JKG7CwLPdSVHayuyHmg7jp78T7U5Y31WgOhSw
 +JF6UwQIJ+UPg30PYBPG32Zmh8E7Fv/AaYF3JGbp4xRS+B/xbakZhJtYuBzWRjlp
 BlwUg9nUaVgEnjE9KpTcJk8VlXDz6ZjpYXXdY4Hv5g+PPWm5kdZBhPYjaymrtI3o
 7DjtKmNd4F5qhU06oTXtFoBbgoiOBM7fOqsyVZ6tsNguVojlt8lnUvkTKqvznw4n
 K4TGzi0Zgu511umMumF1Q/d0BlNXz+gptcC4qwuEUyQa7sEPSWSfcC66SvY/Y5ym
 VGG4roO3Jz6p3JniuFEXakifzU57vPPv7OxGD3d0PKUSDHVU5yPjWRpJju8wJeVW
 DmTZ+SBo2Q/YP9vDlULPqxGJNkP31SaRg/9PnB8W1z2yqyuA+Pjv+Qjt1X618PFq
 1c2+ufeJoOb1Zc3k6Jw1bovilpb2GDW+4QucC3J0/zFtK00PYcGyyqo3jWlUgINf
 QWPgwBIW/yFcb7xOazFS
 =nko1
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:

 - Fix a regression that disabled device-mapper dax support

 - Remove unnecessary hardened-user-copy overhead (>30%) for dax
   read(2)/write(2).

 - Fix some compilation warnings.

* tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
  dax: Arrange for dax_supported check to span multiple devices
  libnvdimm: Fix compilation warnings with W=1
2019-05-25 10:11:23 -07:00
Linus Torvalds
86c2f5d653 SPDX update for 5.2-rc2, round 2
Here is another set of reviewed patches that adds SPDX tags to different
 kernel files, based on a set of rules that are being used to parse the
 comments to try to determine that the license of the file is
 "GPL-2.0-or-later".  Only the "obvious" versions of these matches are
 included here, a number of "non-obvious" variants of text have been
 found but those have been postponed for later review and analysis.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOgmlw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yk4rACfRqxGOGVLR/t6E9dDzOZRAdEz/mYAoJLZmziY
 0YlSSSPtP5HI6JDh65Ng
 =HXQb
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pule more SPDX updates from Greg KH:
 "Here is another set of reviewed patches that adds SPDX tags to
  different kernel files, based on a set of rules that are being used to
  parse the comments to try to determine that the license of the file is
  "GPL-2.0-or-later".

  Only the "obvious" versions of these matches are included here, a
  number of "non-obvious" variants of text have been found but those
  have been postponed for later review and analysis.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers"

* tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (85 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 125
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 123
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 122
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 121
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 120
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 119
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 116
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 113
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 112
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 111
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 110
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 106
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 105
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 103
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 102
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 101
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98
  ...
2019-05-24 14:31:58 -07:00
Thomas Gleixner
af1a8899d2 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 or at your option any
  later version you should have received a copy of the gnu general
  public license for example usr src linux copying if not write to the
  free software foundation inc 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 20 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:27:13 +02:00
Thomas Gleixner
8d7c56d08f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 45
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 or at your option any
  later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 11 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.370933192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:27:12 +02:00
Linus Torvalds
86f9e56d08 - Fix a particularly glaring oversight in a DM core commit from 5.1 that
doesn't properly trim special IOs (e.g. discards) relative to
   corresponding target's max_io_len_target_boundary().
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAlzkicQTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWuRcCADKc4zew1EAhcHjwanfWoQ2d+5ONdxv
 Ir1yiQX3hqK96G1Vu4IROsxWPDgNBI+60HwWI0z3/nMdv4F9a9Tfowl9fcKgZNey
 jGrzP7x1+9GPtJY0BIpFUQ895qC75wXdd0c6HjmM5IaN0DRrv783ZMMBNhS6yx84
 2vDThiEpGiDKUXHTP9b1khUbZzvkpTmlk293UlkFDftgejfW5mq1FqbjLfACMNVI
 Nh7D6A7MjFrEKHjvNbiGgeFn93iW1+XqGbsFobuCV8Z4A0ImD85H78Lesri2qrRG
 nxGfTZMyq++SGSOV/JTzZ4k/qkZfkrDyljYiPaTgZpCi1mZYQ6wH8+7u
 =oKZ5
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/dm-fix-1' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fix from Mike Snitzer:
 "Fix a particularly glaring oversight in a DM core commit from 5.1 that
  doesn't properly trim special IOs (e.g. discards) relative to
  corresponding target's max_io_len_target_boundary()"

* tag 'for-5.2/dm-fix-1' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: make sure to obey max_io_len_target_boundary
2019-05-22 08:10:35 -07:00
Michael Lass
51b86f9a8d dm: make sure to obey max_io_len_target_boundary
Commit 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM
target interface") incorrectly removed code from
__send_changing_extent_only() that is required to impose a per-target IO
boundary on IO that exceeds max_io_len_target_boundary().  Otherwise
"special" IO (e.g. DISCARD, WRITE SAME, WRITE ZEROES) can write beyond
where allowed.

Fix this by restoring the max_io_len_target_boundary() limit in
__send_changing_extent_only()

Fixes: 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Michael Lass <bevan@bi-co.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-21 19:15:20 -04:00
Thomas Gleixner
ec8f24b7fa treewide: Add SPDX license identifier - Makefile/Kconfig
Add SPDX license identifiers to all Make/Kconfig files which:

 - Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:46 +02:00
Thomas Gleixner
09c434b8a0 treewide: Add SPDX license identifier for more missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have MODULE_LICENCE("GPL*") inside which was used in the initial
   scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Thomas Gleixner
457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Dan Williams
7bf7eac8d6 dax: Arrange for dax_supported check to span multiple devices
Pankaj reports that starting with commit ad428cdb52 "dax: Check the
end of the block-device capacity with dax_direct_access()" device-mapper
no longer allows dax operation. This results from the stricter checks in
__bdev_dax_supported() that validate that the start and end of a
block-device map to the same 'pagemap' instance.

Teach the dax-core and device-mapper to validate the 'pagemap' on a
per-target basis. This is accomplished by refactoring the
bdev_dax_supported() internals into generic_fsdax_supported() which
takes a sector range to validate. Consequently generic_fsdax_supported()
is suitable to be used in a device-mapper ->iterate_devices() callback.
A new ->dax_supported() operation is added to allow composite devices to
split and route upper-level bdev_dax_supported() requests.

Fixes: ad428cdb52 ("dax: Check the end of the block-device...")
Cc: <stable@vger.kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-05-20 15:02:08 -07:00
Linus Torvalds
311f71281f - Improve DM snapshot target's scalability by using finer grained
locking.  Requires some list_bl interface improvements.
 
 - Add ability for DM integrity to use a bitmap mode, that tracks regions
   where data and metadata are out of sync, instead of using a journal.
 
 - Improve DM thin provisioning target to not write metadata changes to
   disk if the thin-pool and associated thin devices are merely
   activated but not used.  This avoids metadata corruption due to
   concurrent activation of thin devices across different OS instances
   (e.g. split brain scenarios, which ultimately would be avoided if
   proper device filters were used -- but not having proper filtering has
   proven a very common configuration mistake)
 
 - Fix missing call to path selector type->end_io in DM multipath.  This
   fixes reported performance problems due to inaccurate path selector IO
   accounting causing an imbalance of IO (e.g. avoiding issuing IO to
   particular path due to it seemingly being heavily used).
 
 - Fix bug in DM cache metadata's loading of its discard bitset that
   could lead to all cache blocks being discarded if the very first cache
   block was discarded (thankfully in practice the first cache block is
   generally in use; be it FS superblock, partition table, disk label,
   etc).
 
 - Add testing-only DM dust target which simulates a device that has
   failing sectors and/or read failures.
 
 - Fix a DM init error path reference count hang that caused boot hangs
   if user supplied malformed input on kernel commandline.
 
 - Fix a couple issues with DM crypt target's logging being overly
   verbose or lacking context.
 
 - Various other small fixes to DM init, DM multipath, DM zoned, and DM
   crypt.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAlzdcCgTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWsxZB/9idHl8LmwwL1JzBfi/XX7bWxwqDQLo
 j1b3ycQ14AKVau4VCkmgDuRIfMDuU6PIAVvsMeVbF3aCE0fZ7zbEV1qHefbtJuCL
 MMm//KbrhIT8oMKYUWtlOj7XI9MT6ErFzfActBZ6UF6r21m1N3bohhVGN7kvCnJm
 wgmSlnz/m2GLKK8gQx+OisnAh0nlje3PIdIYPu7uWN6t0FF2XRz3UwWTuyw7lYhC
 Rx2J+sOIL02CtadhHKLMCG8OutRXWP01cBSohUVJIMGihWfbe6aqvhG5afbqb4bG
 UQrXl477ry5zyQ4fAU2JKZ+8qFvc1FoLLknKrZQu+uYPRokUPw/AwiL7
 =mOH3
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Improve DM snapshot target's scalability by using finer grained
   locking. Requires some list_bl interface improvements.

 - Add ability for DM integrity to use a bitmap mode, that tracks
   regions where data and metadata are out of sync, instead of using a
   journal.

 - Improve DM thin provisioning target to not write metadata changes to
   disk if the thin-pool and associated thin devices are merely
   activated but not used. This avoids metadata corruption due to
   concurrent activation of thin devices across different OS instances
   (e.g. split brain scenarios, which ultimately would be avoided if
   proper device filters were used -- but not having proper filtering
   has proven a very common configuration mistake)

 - Fix missing call to path selector type->end_io in DM multipath. This
   fixes reported performance problems due to inaccurate path selector
   IO accounting causing an imbalance of IO (e.g. avoiding issuing IO to
   particular path due to it seemingly being heavily used).

 - Fix bug in DM cache metadata's loading of its discard bitset that
   could lead to all cache blocks being discarded if the very first
   cache block was discarded (thankfully in practice the first cache
   block is generally in use; be it FS superblock, partition table, disk
   label, etc).

 - Add testing-only DM dust target which simulates a device that has
   failing sectors and/or read failures.

 - Fix a DM init error path reference count hang that caused boot hangs
   if user supplied malformed input on kernel commandline.

 - Fix a couple issues with DM crypt target's logging being overly
   verbose or lacking context.

 - Various other small fixes to DM init, DM multipath, DM zoned, and DM
   crypt.

* tag 'for-5.2/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (42 commits)
  dm: fix a couple brace coding style issues
  dm crypt: print device name in integrity error message
  dm crypt: move detailed message into debug level
  dm ioctl: fix hang in early create error condition
  dm integrity: whitespace, coding style and dead code cleanup
  dm integrity: implement synchronous mode for reboot handling
  dm integrity: handle machine reboot in bitmap mode
  dm integrity: add a bitmap mode
  dm integrity: introduce a function add_new_range_and_wait()
  dm integrity: allow large ranges to be described
  dm ingerity: pass size to dm_integrity_alloc_page_list()
  dm integrity: introduce rw_journal_sectors()
  dm integrity: update documentation
  dm integrity: don't report unused options
  dm integrity: don't check null pointer before kvfree and vfree
  dm integrity: correctly calculate the size of metadata area
  dm dust: Make dm_dust_init and dm_dust_exit static
  dm dust: remove redundant unsigned comparison to less than zero
  dm mpath: always free attached_handler_name in parse_path()
  dm init: fix max devices/targets checks
  ...
2019-05-16 15:55:48 -07:00
Sheetal Singala
8454fca4f5 dm: fix a couple brace coding style issues
Signed-off-by: Sheetal Singala <2396sheetal@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-16 10:09:21 -04:00
Milan Broz
f710126cfc dm crypt: print device name in integrity error message
This message should better identify the DM device with the integrity
failure.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-16 10:09:20 -04:00
Milan Broz
7a1cd7238f dm crypt: move detailed message into debug level
The information about tag size should not be printed without debug info
set. Also print device major:minor in the error message to identify the
device instance.

Also use rate limiting and debug level for info about used crypto API
implementaton.  This is important because during online reencryption
the existing message saturates syslog (because we are moving hotzone
across the whole device).

Cc: stable@vger.kernel.org
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-16 10:09:20 -04:00
Helen Koike
0f41fcf788 dm ioctl: fix hang in early create error condition
The dm_early_create() function (which deals with "dm-mod.create=" kernel
command line option) calls dm_hash_insert() who gets an extra reference
to the md object.

In case of failure, this reference wasn't being released, causing
dm_destroy() to hang, thus hanging the whole boot process.

Fix this by calling __hash_remove() in the error path.

Fixes: 6bbc923dfc ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-16 09:52:06 -04:00
Mike Snitzer
05d6909ea9 dm integrity: whitespace, coding style and dead code cleanup
Just some things that stood out like a sore thumb.
Also, converted some printk(KERN_CRIT, ...) to DMCRIT(...)

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-09 16:00:31 -04:00
Roman Gushchin
ddde2af747 md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
Percpu reference counters should now be initialized with the
PERCPU_REF_ALLOW_REINIT in order to allow switching them to the
percpu mode from the atomic mode.
To make percpu_ref_switch_to_percpu() call in set_in_sync()
succeed,let's initialize percpu refcounters with the
PERCU_REF_ALLOW_REINIT flag.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-05-09 10:50:59 -07:00
Mikulas Patocka
482714932e dm integrity: implement synchronous mode for reboot handling
Unfortunatelly, there may be bios coming even after the reboot notifier
was called.  We don't want these bios to make the bitmap dirty again.

To address this, implement a synchronous mode - when a bio is about to
be terminated, we clean the bitmap and terminate the bio after the clean
operation succeeds.  This obviously slows down bio processing, but it
makes sure that when all bios are finished, the bitmap will be clean.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-08 13:41:59 -04:00
Mikulas Patocka
1f5a77591b dm integrity: handle machine reboot in bitmap mode
When in bitmap mode the bitmap must be cleared when rebooting.  This
commit adds the reboot hook.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-08 13:41:58 -04:00
Mikulas Patocka
468dfca38b dm integrity: add a bitmap mode
Introduce an alternate mode of operation where dm-integrity uses a
bitmap instead of a journal. If a bit in the bitmap is 1, the
corresponding region's data and integrity tags are not synchronized - if
the machine crashes, the unsynchronized regions will be recalculated.
The bitmap mode is faster than the journal mode, because we don't have
to write the data twice, but it is also less reliable, because if data
corruption happens when the machine crashes, it may not be detected.

Benchmark results for an SSD connected to a SATA300 port, when doing
large linear writes with dd:

buffered I/O:
        raw device throughput - 245MB/s
        dm-integrity with journaling - 120MB/s
        dm-integrity with bitmap - 238MB/s

direct I/O with 1MB block size:
        raw device throughput - 248MB/s
        dm-integrity with journaling - 123MB/s
        dm-integrity with bitmap - 223MB/s

For more info see dm-integrity in Documentation/device-mapper/

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-08 13:41:58 -04:00
Mikulas Patocka
8b3bbd490d dm integrity: introduce a function add_new_range_and_wait()
Introduce a function add_new_range_and_wait() in order to avoid
repetitive code.  It will be used in the following commit.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-08 13:40:28 -04:00
Linus Torvalds
67a2422239 for-5.2/block-20190507
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzR0AAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpo0MD/47D1kBK9rGzkAwIz1Jkh1Qy/ITVaDJzmHJ
 UP5uncQsgKFLKMR1LbRcrWtmk2MwFDNULGbteHFeCYE1ypCrTgpWSp5+SJluKd1Q
 hma9krLSAXO9QiSaZ4jafshXFIZxz6IjakOW8c9LrT80Ze47yh7AxiLwDafcp/Jj
 x6NW790qB7ENDtfarDkZk14NCS8HGLRHO5B21LB+hT0Kfbh0XZaLzJdj7Mck1wPA
 VT8hL9mPuA++AjF7Ra4kUjwSakgmajTa3nS2fpkwTYdztQfas7x5Jiv7FWxrrelb
 qbabkNkWKepcHAPEiZR7o53TyfCucGeSK/jG+dsJ9KhNp26kl1ci3frl5T6PfVMP
 SPPDjsKIHs+dqFrU9y5rSGhLJqewTs96hHthnLGxyF67+5sRb5+YIy+dcqgiyc/b
 TUVyjCD6r0cO2q4v9VhwnhOyeBUA9Rwbu8nl7JV5Q45uG7qI4BC39l1jfubMNDPO
 GLNGUUzb6ER7z6lYINjRSF2Jhejsx8SR9P7jhpb1Q7k/VvDDxO1T4FpwvqWFz9+s
 Gn+s6//+cA6LL+42eZkQjvwF2CUNE7TaVT8zdb+s5HP1RQkZToqUnsQCGeRTrFni
 RqWXfW9o9+awYRp431417oMdX/LvLGq9+ZtifRk9DqDcowXevTaf0W2RpplWSuiX
 RcCuPeLAVg==
 =Ot0g
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Nothing major in this series, just fixes and improvements all over the
  map. This contains:

   - Series of fixes for sed-opal (David, Jonas)

   - Fixes and performance tweaks for BFQ (via Paolo)

   - Set of fixes for bcache (via Coly)

   - Set of fixes for md (via Song)

   - Enabling multi-page for passthrough requests (Ming)

   - Queue release fix series (Ming)

   - Device notification improvements (Martin)

   - Propagate underlying device rotational status in loop (Holger)

   - Removal of mtip32xx trim support, which has been disabled for years
     (Christoph)

   - Improvement and cleanup of nvme command handling (Christoph)

   - Add block SPDX tags (Christoph)

   - Cleanup/hardening of bio/bvec iteration (Christoph)

   - A few NVMe pull requests (Christoph)

   - Removal of CONFIG_LBDAF (Christoph)

   - Various little fixes here and there"

* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
  block: fix mismerge in bvec_advance
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  blk-mq: always free hctx after request queue is freed
  blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  block: fix function name in comment
  nvmet: protect discovery change log event list iteration
  nvme: mark nvme_core_init and nvme_core_exit static
  nvme: move command size checks to the core
  nvme-fabrics: check more command sizes
  nvme-pci: check more command sizes
  nvme-pci: remove an unneeded variable initialization
  nvme-pci: unquiesce admin queue on shutdown
  nvme-pci: shutdown on timeout during deletion
  nvme-pci: fix psdt field for single segment sgls
  nvme-multipath: don't print ANA group state by default
  nvme-multipath: split bios with the ns_head bio_set before submitting
  ...
2019-05-07 18:14:36 -07:00
Mikulas Patocka
4f43446ddf dm integrity: allow large ranges to be described
Change n_sectors data type from unsigned to sector_t.  Following commits
will need to lock large ranges.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:12 -04:00
Mikulas Patocka
d5027e0345 dm ingerity: pass size to dm_integrity_alloc_page_list()
Pass size to dm_integrity_alloc_page_list().  This is needed so
following commits can pass a size that is different from
ic->journal_pages.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:12 -04:00
Mikulas Patocka
981e8a980d dm integrity: introduce rw_journal_sectors()
Introduce a function rw_journal_sectors() that takes sector and length
as its arguments instead of a section and the number of sections.

This functions will be used in further patches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:11 -04:00
Mikulas Patocka
88ad5d1eb1 dm integrity: update documentation
Update documentation with the "meta_device" parameter and flags.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:10 -04:00
Mikulas Patocka
893e3c395b dm integrity: don't report unused options
If we are not journaling, don't report journaling options in the table
status.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:09 -04:00
Mikulas Patocka
97abfde17a dm integrity: don't check null pointer before kvfree and vfree
The functions kfree, vfree and kvfree do nothing if we pass a NULL
pointer to them.  So we don't need to test the pointer for NULL.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:08 -04:00
Mikulas Patocka
30bba430dd dm integrity: correctly calculate the size of metadata area
When we use separate devices for data and metadata, dm-integrity would
incorrectly calculate the size of the metadata device as if it had
512-byte block size - and it would refuse activation with larger block
size and smaller metadata device.

Fix this so that it takes actual block size into account, which fixes
the following reported issue:
https://gitlab.com/cryptsetup/cryptsetup/issues/450

Fixes: 356d9d52e1 ("dm integrity: allow separate metadata device")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:08 -04:00
YueHaibing
9ccce5a0fb dm dust: Make dm_dust_init and dm_dust_exit static
Fix sparse warnings:

drivers/md/dm-dust.c:495:12: warning: symbol 'dm_dust_init' was not declared. Should it be static?
drivers/md/dm-dust.c:505:13: warning: symbol 'dm_dust_exit' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:07 -04:00
Colin Ian King
cacddeab56 dm dust: remove redundant unsigned comparison to less than zero
Variable block is an unsigned long long hence the less than zero
comparison is always false, hence it is redundant and can be removed.

Addresses-Coverity: ("Unsigned compared against 0")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:06 -04:00
Linus Torvalds
81ff5d2cba Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "API:
   - Add support for AEAD in simd
   - Add fuzz testing to testmgr
   - Add panic_on_fail module parameter to testmgr
   - Use per-CPU struct instead multiple variables in scompress
   - Change verify API for akcipher

  Algorithms:
   - Convert x86 AEAD algorithms over to simd
   - Forbid 2-key 3DES in FIPS mode
   - Add EC-RDSA (GOST 34.10) algorithm

  Drivers:
   - Set output IV with ctr-aes in crypto4xx
   - Set output IV in rockchip
   - Fix potential length overflow with hashing in sun4i-ss
   - Fix computation error with ctr in vmx
   - Add SM4 protected keys support in ccree
   - Remove long-broken mxc-scc driver
   - Add rfc4106(gcm(aes)) cipher support in cavium/nitrox"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (179 commits)
  crypto: ccree - use a proper le32 type for le32 val
  crypto: ccree - remove set but not used variable 'du_size'
  crypto: ccree - Make cc_sec_disable static
  crypto: ccree - fix spelling mistake "protedcted" -> "protected"
  crypto: caam/qi2 - generate hash keys in-place
  crypto: caam/qi2 - fix DMA mapping of stack memory
  crypto: caam/qi2 - fix zero-length buffer DMA mapping
  crypto: stm32/cryp - update to return iv_out
  crypto: stm32/cryp - remove request mutex protection
  crypto: stm32/cryp - add weak key check for DES
  crypto: atmel - remove set but not used variable 'alg_name'
  crypto: picoxcell - Use dev_get_drvdata()
  crypto: crypto4xx - get rid of redundant using_sd variable
  crypto: crypto4xx - use sync skcipher for fallback
  crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues
  crypto: crypto4xx - fix ctr-aes missing output IV
  crypto: ecrdsa - select ASN1 and OID_REGISTRY for EC-RDSA
  crypto: ux500 - use ccflags-y instead of CFLAGS_<basename>.o
  crypto: ccree - handle tee fips error during power management resume
  crypto: ccree - add function to handle cryptocell tee fips error
  ...
2019-05-06 20:15:06 -07:00
Jens Axboe
2d5abb9a1e bcache: make is_discard_enabled() static
It's not used outside this file.

Fixes: 631207314d ("bcache: fix failure in journal relplay")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-01 06:34:09 -06:00
Martin Wilck
940bc47178 dm mpath: always free attached_handler_name in parse_path()
Commit b592211c33 ("dm mpath: fix attached_handler_name leak and
dangling hw_handler_name pointer") fixed a memory leak for the case
where setup_scsi_dh() returns failure. But setup_scsi_dh may return
success and not "use" attached_handler_name if the
retain_attached_hwhandler flag is not set on the map. As setup_scsi_sh
properly "steals" the pointer by nullifying it, freeing it
unconditionally in parse_path() is safe.

Fixes: b592211c33 ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer")
Cc: stable@vger.kernel.org
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-30 16:51:30 -04:00
Helen Koike
8e890c1ab1 dm init: fix max devices/targets checks
dm-init should allow up to DM_MAX_{DEVICES,TARGETS} for devices/targets,
and not DM_MAX_{DEVICES,TARGETS} - 1.

Fix the checks and also fix the error message when the number of devices
is surpassed.

Fixes: 6bbc923dfc ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-30 16:51:23 -04:00
Bryan Gurney
e4f3fabd67 dm: add dust target
Add the dm-dust target, which simulates the behavior of bad sectors
at arbitrary locations, and the ability to enable the emulation of
the read failures at an arbitrary time.

This target behaves similarly to a linear target.  At a given time,
the user can send a message to the target to start failing read
requests on specific blocks.  When the failure behavior is enabled,
reads of blocks configured "bad" will fail with EIO.

Writes of blocks configured "bad" will result in the following:

1. Remove the block from the "bad block list".
2. Successfully complete the write.

After this point, the block will successfully contain the written
data, and will service reads and writes normally.  This emulates the
behavior of a "remapped sector" on a hard disk drive.

dm-dust provides logging of which blocks have been added or removed
to the "bad block list", as well as logging when a block has been
removed from the bad block list.  These messages can be used
alongside the messages from the driver using a dm-dust device to
analyze the driver's behavior when a read fails at a given time.

(This logging can be reduced via a "quiet" mode, if desired.)

NOTE: If the block size is larger than 512 bytes, only the first sector
of each "dust block" is detected.  Placing a limiting layer above a dust
target, to limit the minimum I/O size to the dust block size, will
ensure proper emulation of the given large block size.

Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Co-developed-by: Joe Shimkus <jshimkus@redhat.com>
Co-developed-by: John Dorminy <jdorminy@redhat.com>
Co-developed-by: John Pittman <jpittman@redhat.com>
Co-developed-by: Thomas Jaskiewicz <tjaskiew@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-30 16:37:19 -04:00
Christoph Hellwig
2b070cfe58 block: remove the i argument to bio_for_each_segment_all
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:13 -06:00
Christoph Hellwig
f936b06ae5 bcache: clean up do_btree_node_write a bit
Use a variable containing the buffer address instead of the to be
removed integer iterator from bio_for_each_segment_all.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:11 -06:00
Coly Li
cdca22bcbc bcache: remove redundant LIST_HEAD(journal) from run_cache_set()
Commit 95f18c9d13 ("bcache: avoid potential memleak of list of
journal_replay(s) in the CACHE_SYNC branch of run_cache_set") forgets
to remove the original define of LIST_HEAD(journal), which makes
the change no take effect. This patch removes redundant variable
LIST_HEAD(journal) from run_cache_set(), to make Shenghui's fix
working.

Fixes: 95f18c9d13 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set")
Reported-by: Juha Aatrokoski <juha.aatrokoski@aalto.fi>
Cc: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 08:20:46 -06:00
Thomas Gleixner
be9c52ed84 dm persistent data: Simplify stack trace handling
Replace the indirection through struct stack_trace with an invocation of
the storage array based interface. This results in less storage space and
indirection.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.533968922@linutronix.de
2019-04-29 12:37:52 +02:00
Thomas Gleixner
741b58f3e2 dm bufio: Simplify stack trace retrieval
Replace the indirection through struct stack_trace with an invocation of
the storage array based interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.446326191@linutronix.de
2019-04-29 12:37:52 +02:00
Mikulas Patocka
f8011d3344 dm writecache: avoid unnecessary lookups in writecache_find_entry()
This is a small optimization in writecache_find_entry().

If we go past the condition "if (unlikely(!node))", we can be certain that
there is no entry in the tree that has the block equal to the "block"
variable.

Consequently, we can return the next entry directly, we don't need to go
to the second part of the function that finds the entry with lowest or
highest seq number that matches the "block" variable.

Also, add some whitespace and cleanup needless braces.

Suggested-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26 11:48:03 -04:00
Huaisheng Ye
08a8e80462 dm writecache: remove unused member page_offset in writeback_struct
The stucture member page_offset in writeback_struct never has been
used actually. Remove it.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26 11:32:50 -04:00
Mikulas Patocka
81bc6d150a dm delay: fix a crash when invalid device is specified
When the target line contains an invalid device, delay_ctr() will call
delay_dtr() with NULL workqueue.  Attempting to destroy the NULL
workqueue causes a crash.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26 11:29:32 -04:00
Peng Wang
514cf4f881 dm: only initialize md->dax_dev if CONFIG_DAX_DRIVER is enabled
md->dax_dev defaults to NULL and there is no need to initialize it
if CONFIG_DAX_DRIVER is disabled.

Signed-off-by: Peng Wang <rocking@whu.edu.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26 11:28:17 -04:00
Yufen Yu
5de719e3d0 dm mpath: fix missing call of path selector type->end_io
After commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via
blk_insert_cloned_request feedback"), map_request() will requeue the tio
when issued clone request return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.

Thus, if device driver status is error, a tio may be requeued multiple
times until the return value is not DM_MAPIO_REQUEUE.  That means
type->start_io may be called multiple times, while type->end_io is only
called when IO complete.

In fact, even without commit 396eaf21ee, setup_clone() failure can
also cause tio requeue and associated missed call to type->end_io.

The service-time path selector selects path based on in_flight_size,
which is increased by st_start_io() and decreased by st_end_io().
Missed calls to st_end_io() can lead to in_flight_size count error and
will cause the selector to make the wrong choice.  In addition,
queue-length path selector will also be affected.

To fix the problem, call type->end_io in ->release_clone_rq before tio
requeue.  map_info is passed to ->release_clone_rq() for map_request()
error path that result in requeue.

Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Cc: stable@vger.kernl.org
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-25 15:38:52 -04:00
Eric Biggers
877b5691f2 crypto: shash - remove shash_desc::flags
The flags field in 'struct shash_desc' never actually does anything.
The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP.
However, no shash algorithm ever sleeps, making this flag a no-op.

With this being the case, inevitably some users who can't sleep wrongly
pass MAY_SLEEP.  These would all need to be fixed if any shash algorithm
actually started sleeping.  For example, the shash_ahash_*() functions,
which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP
from the ahash API to the shash API.  However, the shash functions are
called under kmap_atomic(), so actually they're assumed to never sleep.

Even if it turns out that some users do need preemption points while
hashing large buffers, we could easily provide a helper function
crypto_shash_update_large() which divides the data into smaller chunks
and calls crypto_shash_update() and cond_resched() for each chunk.  It's
not necessary to have a flag in 'struct shash_desc', nor is it necessary
to make individual shash algorithms aware of this at all.

Therefore, remove shash_desc::flags, and document that the
crypto_shash_*() functions can be called from any context.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-04-25 15:38:12 +08:00
Shenghui Wang
95f18c9d13 bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set
In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used
to collect journal_replay(s) and filled by bch_journal_read().

If all goes well, bch_journal_replay() will release the list of
jounal_replay(s) at the end of the branch.

If something goes wrong, code flow will jump to the label "err:" and leave
the list unreleased.

This patch will release the list of journal_replay(s) in the case of
error detected.

v1 -> v2:
* Move the release code to the location after label 'err:' to
  simply the change.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:29 -06:00
Shenghui Wang
f16277ca20 bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce branch of btree_gc_coalesce
Elements of keylist should be accessed before the list is freed.
Move bch_keylist_free() calling after the while loop to avoid wrong
content accessed.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:29 -06:00
Tang Junhui
631207314d bcache: fix failure in journal relplay
journal replay failed with messages:
Sep 10 19:10:43 ceph kernel: bcache: error on
bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
2057493-2057567 missing! (replaying 2057493-2076601), disabling
caching

The reason is in journal_reclaim(), when discard is enabled, we send
discard command and reclaim those journal buckets whose seq is old
than the last_seq_now, but before we write a journal with last_seq_now,
the machine is restarted, so the journal with the last_seq_now is not
written to the journal bucket, and the last_seq_wrote in the newest
journal is old than last_seq_now which we expect to be, so when we doing
replay, journals from last_seq_wrote to last_seq_now are missing.

It's hard to write a journal immediately after journal_reclaim(),
and it harmless if those missed journal are caused by discarding
since those journals are already wrote to btree node. So, if miss
seqs are started from the beginning journal, we treat it as normal,
and only print a message to show the miss journal, and point out
it maybe caused by discarding.

Patch v2 add a judgement condition to ignore the missed journal
only when discard enabled as Coly suggested.

(Coly Li: rebase the patch with other changes in bch_journal_replay())

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Tested-by: Dennis Schridde <devurandom@gmx.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
eb8cbb6df3 bcache: improve bcache_reboot()
This patch tries to release mutex bch_register_lock early, to give
chance to stop cache set and bcache device early.

This patch also expends time out of stopping all bcache device from
2 seconds to 10 seconds, because stopping writeback rate update worker
may delay for 5 seconds, 2 seconds is not enough.

After this patch applied, stopping bcache devices during system reboot
or shutdown is very hard to be observed any more.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
63d63b51d7 bcache: add comments for closure_fn to be called in closure_queue()
Add code comments to explain which call back function might be called
for the closure_queue(). This is an effort to make code to be more
understandable for readers.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
bb6d355c2a bcache: Add comments for blkdev_put() in registration code path
Add comments to explain why in register_bcache() blkdev_put() won't
be called in two location. Add comments to explain why blkdev_put()
must be called in register_cache() when cache_alloc() failed.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
88c12d42d2 bcache: add error check for calling register_bdev()
This patch adds return value to register_bdev(). Then if failure happens
inside register_bdev(), its caller register_bcache() may detect and
handle the failure more properly.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
68d10e6979 bcache: return error immediately in bch_journal_replay()
When failure happens inside bch_journal_replay(), calling
cache_set_err_on() and handling the failure in async way is not a good
idea. Because after bch_journal_replay() returns, registering code will
continue to execute following steps, and unregistering code triggered
by cache_set_err_on() is running in same time. First it is unnecessary
to handle failure and unregister cache set in an async way, second there
might be potential race condition to run register and unregister code
for same cache set.

So in this patch, if failure happens in bch_journal_replay(), we don't
call cache_set_err_on(), and just print out the same error message to
kernel message buffer, then return -EIO immediately caller. Then caller
can detect such failure and handle it in synchrnozied way.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
2d17456eb1 bcache: add comments for kobj release callback routine
Bcache has several routines to release resources in implicit way, they
are called when the associated kobj released. This patch adds code
comments to notice when and which release callback will be called,
- When dc->disk.kobj released:
  void bch_cached_dev_release(struct kobject *kobj)
- When d->kobj released:
  void bch_flash_dev_release(struct kobject *kobj)
- When c->kobj released:
  void bch_cache_set_release(struct kobject *kobj)
- When ca->kobj released
  void bch_cache_release(struct kobject *kobj)

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
ce3e4cfb59 bcache: add failure check to run_cache_set() for journal replay
Currently run_cache_set() has no return value, if there is failure in
bch_journal_replay(), the caller of run_cache_set() has no idea about
such failure and just continue to execute following code after
run_cache_set().  The internal failure is triggered inside
bch_journal_replay() and being handled in async way. This behavior is
inefficient, while failure handling inside bch_journal_replay(), cache
register code is still running to start the cache set. Registering and
unregistering code running as same time may introduce some rare race
condition, and make the code to be more hard to be understood.

This patch adds return value to run_cache_set(), and returns -EIO if
bch_journal_rreplay() fails. Then caller of run_cache_set() may detect
such failure and stop registering code flow immedidately inside
register_cache_set().

If journal replay fails, run_cache_set() can report error immediately
to register_cache_set(). This patch makes the failure handling for
bch_journal_replay() be in synchronized way, easier to understand and
debug, and avoid poetential race condition for register-and-unregister
in same time.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
1bee2addc0 bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()
In journal_reclaim() ja->cur_idx of each cache will be update to
reclaim available journal buckets. Variable 'int n' is used to count how
many cache is successfully reclaimed, then n is set to c->journal.key
by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
loop will write the jset data onto each cache.

The problem is, if all jouranl buckets on each cache is full, the
following code in journal_reclaim(),

529 for_each_cache(ca, c, iter) {
530       struct journal_device *ja = &ca->journal;
531       unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533       /* No space available on this device */
534       if (next == ja->discard_idx)
535               continue;
536
537       ja->cur_idx = next;
538       k->ptr[n++] = MAKE_PTR(0,
539                         bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540                         ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);

If there is no available bucket to reclaim, the if() condition at line
534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS()
will set KEY_PTRS field of c->journal.key to 0.

Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in
journal_write_unlocked() the journal data is written in following loop,

649	for (i = 0; i < KEY_PTRS(k); i++) {
650-671		submit journal data to cache device
672	}

If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data
won't be written to cache device here. If system crahed or rebooted
before bkeys of the lost journal entries written into btree nodes, data
corruption will be reported during bcache reload after rebooting the
system.

Indeed there is only one cache in a cache set, there is no need to set
KEY_PTRS field in journal_reclaim() at all. But in order to keep the
for_each_cache() logic consistent for now, this patch fixes the above
problem by not setting 0 KEY_PTRS of journal key, if there is no bucket
available to reclaim.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Coly Li
14215ee01f bcache: move definition of 'int ret' out of macro read_bucket()
'int ret' is defined as a local variable inside macro read_bucket().
Since this macro is called multiple times, and following patches will
use a 'int ret' variable in bch_journal_read(), this patch moves
definition of 'int ret' from macro read_bucket() to range of function
bch_journal_read().

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Liang Chen
a4b732a248 bcache: fix a race between cache register and cacheset unregister
There is a race between cache device register and cache set unregister.
For an already registered cache device, register_bcache will call
bch_is_open to iterate through all cachesets and check every cache
there. The race occurs if cache_set_free executes at the same time and
clears the caches right before ca is dereferenced in bch_is_open_cache.
To close the race, let's make sure the clean up work is protected by
the bch_register_lock as well.

This issue can be reproduced as follows,
while true; do echo /dev/XXX> /sys/fs/bcache/register ; done&
while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done &

and results in the following oops,

[  +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998
[  +0.000457] #PF error: [normal kernel read fault]
[  +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0
[  +0.000388] Oops: 0000 [#1] SMP PTI
[  +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6
[  +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
[  +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache]
[  +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d
[  +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202
[  +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000
[  +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001
[  +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a
[  +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0
[  +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620
[  +0.000384] FS:  00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000
[  +0.000420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0
[  +0.000837] Call Trace:
[  +0.000682]  ? _cond_resched+0x10/0x20
[  +0.000691]  ? __kmalloc+0x131/0x1b0
[  +0.000710]  kernfs_fop_write+0xfa/0x170
[  +0.000733]  __vfs_write+0x2e/0x190
[  +0.000688]  ? inode_security+0x10/0x30
[  +0.000698]  ? selinux_file_permission+0xd2/0x120
[  +0.000752]  ? security_file_permission+0x2b/0x100
[  +0.000753]  vfs_write+0xa8/0x1a0
[  +0.000676]  ksys_write+0x4d/0xb0
[  +0.000699]  do_syscall_64+0x3a/0xf0
[  +0.000692]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
George Spelvin
3a3947271c bcache: Clean up bch_get_congested()
There are a few nits in this function.  They could in theory all
be separate patches, but that's probably taking small commits
too far.

1) I added a brief comment saying what it does.

2) I like to declare pointer parameters "const" where possible
   for documentation reasons.

3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming
weight of a 32-bit random number (giving a random integer with
mean 16 and variance 8).  Passing by reference in a 64-bit variable
is silly; just use hweight32().

4) Its helper function fract_exp_two is unnecessarily tangled.
Gcc can optimize the multiply by (1 << x) to a shift, but it can
be written in a much more straightforward way at the cost of one
more bit of internal precision.  Some analysis reveals that this
bit is always available.

This shrinks the object code for fract_exp_two(x, 6) from 23 bytes:

0000000000000000 <foo1>:
   0:   89 f9                   mov    %edi,%ecx
   2:   c1 e9 06                shr    $0x6,%ecx
   5:   b8 01 00 00 00          mov    $0x1,%eax
   a:   d3 e0                   shl    %cl,%eax
   c:   83 e7 3f                and    $0x3f,%edi
   f:   d3 e7                   shl    %cl,%edi
  11:   c1 ef 06                shr    $0x6,%edi
  14:   01 f8                   add    %edi,%eax
  16:   c3                      retq

To 19:

0000000000000017 <foo2>:
  17:   89 f8                   mov    %edi,%eax
  19:   83 e0 3f                and    $0x3f,%eax
  1c:   83 c0 40                add    $0x40,%eax
  1f:   89 f9                   mov    %edi,%ecx
  21:   c1 e9 06                shr    $0x6,%ecx
  24:   d3 e0                   shl    %cl,%eax
  26:   c1 e8 06                shr    $0x6,%eax
  29:   c3                      retq

(Verified with 0 <= frac_bits <= 8, 0 <= x < 16<<frac_bits;
both versions produce the same output.)

5) And finally, the call to bch_get_congested() in check_should_bypass()
is separated from the use of the value by multiple tests which
could moot the need to compute it.  Move the computation down to
where it's needed.  This also saves a local register to hold the
computed value.

Signed-off-by: George Spelvin <lkml@sdf.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Geliang Tang
792732d985 bcache: use kmemdup_nul for CACHED_LABEL buffer
This patch uses kmemdup_nul to create a NUL-terminated string from
dc->sb.label. This is better than open coding it.

With this, we can move env[2] initialization into env[] array to make
code more elegant.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Arnd Bergmann
78d4eb8ad9 bcache: avoid clang -Wunintialized warning
clang has identified a code path in which it thinks a
variable may be unused:

drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false
      [-Werror,-Wsometimes-uninitialized]
                        fifo_pop(&ca->free_inc, bucket);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
 #define fifo_pop(fifo, i)       fifo_pop_front(fifo, (i))
                                ^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front'
        if (_r) {                                                       \
            ^~
drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here
                        allocator_wait(ca, bch_allocator_push(ca, bucket));
                                                                  ^~~~~~
drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait'
                if (cond)                                               \
                    ^~~~
drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true
                        fifo_pop(&ca->free_inc, bucket);
                        ^
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
 #define fifo_pop(fifo, i)       fifo_pop_front(fifo, (i))
                                ^
drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front'
        if (_r) {                                                       \
        ^
drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning
                        long bucket;
                                   ^

This cannot happen in practice because we only enter the loop
if there is at least one element in the list.

Slightly rearranging the code makes this clearer to both the
reader and the compiler, which avoids the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Guoju Fang
4e0c04ec3a bcache: fix inaccurate result of unused buckets
To get the amount of unused buckets in sysfs_priority_stats, the code
count the buckets which GC_SECTORS_USED is zero. It's correct and should
not be overwritten by the count of buckets which prio is zero.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Guoju Fang
1568ee7e3c bcache: fix crashes stopping bcache device before read miss done
The bio from upper layer is considered completed when bio_complete()
returns. In most scenarios bio_complete() is called in search_free(),
but when read miss happens, the bio_compete() is called when backing
device reading completed, while the struct search is still in use until
cache inserting finished.

If someone stops the bcache device just then, the device may be closed
and released, but after cache inserting finished the struct search will
access a freed struct cached_dev.

This patch add the reference of bcache device before bio_complete() when
read miss happens, and put it after the search is not used.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Mike Snitzer
873f258bec dm thin metadata: do not write metadata if no changes occurred
Otherwise, just activating a thin-pool and thin device and then
deactivating them will cause the thin-pool metadata to be changed
(e.g. superblock written) -- even without any metadata being changed.

Add 'in_service' flag to struct dm_pool_metadata and set it in
pmd_write_lock() because all on-disk metadata changes must take a write
lock of pmd->root_lock.  Once 'in_service' is set it is never cleared.
__commit_transaction() will return 0 if 'in_service' is not set.
dm_pool_commit_metadata() is updated to use __pmd_write_lock() so that
it isn't the sole reason for putting a thin-pool in service.

Also fix dm_pool_commit_metadata() to open the next transaction if the
return from __commit_transaction() is 0.  Not seeing why the early
return ever made since for a return of 0 given that dm-io's async_io(),
as used by bufio, always returns 0.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:34 -04:00
Mike Snitzer
6a1b1ddc6a dm thin metadata: add wrappers for managing write locking of metadata
No functional change, but this prepares to hook off of pmd_write_lock()
with additional functionality (as provided in next commit).

Suggested-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:34 -04:00
Mike Snitzer
a1ed4d9e93 dm thin metadata: check __commit_transaction()'s return
Fix __reserve_metadata_snap() to return early if __commit_transaction()
fails.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:33 -04:00
Mike Snitzer
c6e086e0c9 dm space map common: zero entire ll_disk
Otherwise, memory that is allocated (and potentially not previously
zeroed) will get written to disk as part of the space maps.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:32 -04:00
Huaisheng Ye
84420b1e5d dm writecache: add unlikely for returned value of rb_next/prev
In functions writecache_discard() and writecache_find_entry() there is a
high probablity that the pointer of structure rb_node won't equal NULL.
Add unlikely for the pointer node NULL.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:31 -04:00
Huaisheng Ye
09f2d65630 dm writecache: remove needless dereferences in __writecache_writeback_pmem()
bio is already available so there is no need to access it in terms of
the wb pointer.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:31 -04:00
Nikos Tsironis
3f1637f210 dm snapshot: Use fine-grained locking scheme
Substitute the global locking scheme with a fine grained one, employing
the read-write semaphore and the scalable exception tables with
per-bucket locks introduced by the previous two commits.

Summarizing, we now use a read-write semaphore to protect the mostly
read fields of the snapshot structure, e.g., valid, active, etc., and
per-bucket bit spinlocks to protect accesses to the complete and pending
exception tables.

Finally, we use an extra spinlock (pe_allocation_lock) to serialize the
allocation of new exceptions by the exception store. This allocation is
really fast, so the extra spinlock doesn't hurt the performance.

This scheme allows dm-snapshot to scale better, resulting in increased
IOPS and reduced latency.

Following are some benchmark results using the null_blk device:

  modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
   queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1

* Benchmark fio_origin_randwrite_throughput_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to origin device, IO
  engine libaio):

  +--------------+-------------+------------+
  | # of workers | IOPS Before | IOPS After |
  +--------------+-------------+------------+
  |      1       |    57708    |   66421    |
  |      2       |    63415    |   77589    |
  |      4       |    67276    |   98839    |
  |      8       |    60564    |   109258   |
  +--------------+-------------+------------+

* Benchmark fio_origin_randwrite_latency_N, from the device mapper test
  suite [1] (direct IO, random 4K writes to origin device, IO engine
  psync):

  +--------------+-----------------------+----------------------+
  | # of workers | Latency (usec) Before | Latency (usec) After |
  +--------------+-----------------------+----------------------+
  |      1       |         16.25         |        13.27         |
  |      2       |         31.65         |        25.08         |
  |      4       |         55.28         |        41.08         |
  |      8       |         121.47        |        74.44         |
  +--------------+-----------------------+----------------------+

* Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to snapshot device, IO
  engine libaio):

  +--------------+-------------+------------+
  | # of workers | IOPS Before | IOPS After |
  +--------------+-------------+------------+
  |      1       |    72593    |   84938    |
  |      2       |    97379    |   134973   |
  |      4       |    90610    |   143077   |
  |      8       |    90537    |   180085   |
  +--------------+-------------+------------+

* Benchmark fio_snapshot_randwrite_latency_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to snapshot device, IO
  engine psync):

  +--------------+-----------------------+----------------------+
  | # of workers | Latency (usec) Before | Latency (usec) After |
  +--------------+-----------------------+----------------------+
  |      1       |         12.53         |         10.6         |
  |      2       |         19.78         |        14.89         |
  |      4       |         40.37         |        23.47         |
  |      8       |         89.32         |        48.48         |
  +--------------+-----------------------+----------------------+

[1] https://github.com/jthornber/device-mapper-test-suite

Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:30 -04:00
Nikos Tsironis
f79ae415b6 dm snapshot: Make exception tables scalable
Use list_bl to implement the exception hash tables' buckets. This change
permits concurrent access, to distinct buckets, by multiple threads.

Also, implement helper functions to lock and unlock the exception tables
based on the chunk number of the exception at hand.

We retain the global locking, by means of down_write(), which is
replaced by the next commit.

Still, we must acquire the per-bucket spinlocks when accessing the hash
tables, since list_bl does not allow modification on unlocked lists.

Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:29 -04:00
Nikos Tsironis
4ad8d880b6 dm snapshot: Replace mutex with rw semaphore
dm-snapshot uses a single mutex to serialize every access to the
snapshot state. This includes all accesses to the complete and pending
exception tables, which occur at every origin write, every snapshot
read/write and every exception completion.

The lock statistics indicate that this mutex is a bottleneck (average
wait time ~480 usecs for 8 processes doing random 4K writes to the
origin device) preventing dm-snapshot to scale as the number of threads
doing IO increases.

The major contention points are __origin_write()/snapshot_map() and
pending_complete(), i.e., the submission and completion of pending
exceptions.

Replace this mutex with a rw semaphore.

We essentially revert commit ae1093be5a ("dm snapshot: use mutex
instead of rw_semaphore") and together with the next two patches we
substitute the single mutex with a fine-grained locking scheme, where we
use a read-write semaphore to protect the mostly read fields of the
snapshot structure, e.g., valid, active, etc., and per-bucket bit
spinlocks to protect accesses to the complete and pending exception
tables.

Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:28 -04:00
Nikos Tsironis
65fc7c3704 dm snapshot: Don't sleep holding the snapshot lock
When completing a pending exception, pending_complete() waits for all
conflicting reads to drain, before inserting the final, completed
exception. Conflicting reads are snapshot reads redirected to the
origin, because the relevant chunk is not remapped to the COW device the
moment we receive the read.

The completed exception must be inserted into the exception table after
all conflicting reads drain to ensure snapshot reads don't return
corrupted data. This is required because inserting the completed
exception into the exception table signals that the relevant chunk is
remapped and both origin writes and snapshot merging will now overwrite
the chunk in origin.

This wait is done holding the snapshot lock to ensure that
pending_complete() doesn't starve if new snapshot reads keep coming for
this chunk.

In preparation for the next commit, where we use a spinlock instead of a
mutex to protect the exception tables, we remove the need for holding
the lock while waiting for conflicting reads to drain.

We achieve this in two steps:

1. pending_complete() inserts the completed exception before waiting for
   conflicting reads to drain and removes the pending exception after
   all conflicting reads drain.

   This ensures that new snapshot reads will be redirected to the COW
   device, instead of the origin, and thus pending_complete() will not
   starve. Moreover, we use the existence of both a completed and
   a pending exception to signify that the COW is done but there are
   conflicting reads in flight.

2. In __origin_write() we check first if there is a pending exception
   and then if there is a completed exception. If there is a pending
   exception any submitted BIO is delayed on the pe->origin_bios list and
   DM_MAPIO_SUBMITTED is returned. This ensures that neither writes to the
   origin nor snapshot merging can overwrite the origin chunk, until all
   conflicting reads drain, and thus snapshot reads will not return
   corrupted data.

Summarizing, we now have the following possible combinations of pending
and completed exceptions for a chunk, along with their meaning:

A. No exceptions exist: The chunk has not been remapped yet.
B. Only a pending exception exists: The chunk is currently being copied
   to the COW device.
C. Both a pending and a completed exception exist: COW for this chunk
   has completed but there are snapshot reads in flight which had been
   redirected to the origin before the chunk was remapped.
D. Only the completed exception exists: COW has been completed and there
   are no conflicting reads in flight.

Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:27 -04:00
Nikos Tsironis
e28adc3bf3 dm cache metadata: Fix loading discard bitset
Add missing dm_bitset_cursor_next() to properly advance the bitset
cursor.

Otherwise, the discarded state of all blocks is set according to the
discarded state of the first block.

Fixes: ae4a46a1f6 ("dm cache metadata: use bitset cursor api to load discard bitset")
Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:25 -04:00
Damien Le Moal
7aedf75ff7 dm zoned: Fix zone report handling
The function blkdev_report_zones() returns success even if no zone
information is reported (empty report). Empty zone reports can only
happen if the report start sector passed exceeds the device capacity.
The conditions for this to happen are either a bug in the caller code,
or, a change in the device that forced the low level driver to change
the device capacity to a value that is lower than the report start
sector. This situation includes a failed disk revalidation resulting in
the disk capacity being changed to 0.

If this change happens while dm-zoned is in its initialization phase
executing dmz_init_zones(), this function may enter an infinite loop
and hang the system. To avoid this, add a check to disallow empty zone
reports and bail out early. Also fix the function dmz_update_zone() to
make sure that the report for the requested zone was correctly obtained.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Shaun Tancheff <shaun@tancheff.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:17:58 -04:00
Dan Carpenter
a3839bc635 dm zoned: Silence a static checker warning
My static checker complains about this line from dmz_get_zoned_device()

	aligned_capacity = dev->capacity & ~(blk_queue_zone_sectors(q) - 1);

The problem is that "aligned_capacity" and "dev->capacity" are sector_t
type (which is a u64 under most configs) but blk_queue_zone_sectors(q)
returns a u32 so the higher 32 bits in aligned_capacity are cleared to
zero.  This patch adds a cast to address the issue.

Fixes: 114e025968 ("dm zoned: ignore last smaller runt zone")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:16:01 -04:00
Christoph Hellwig
c13b5487d9 dm crypt: fix endianness annotations around org_sector_of_dmreq
The sector used here is a little endian value, so use the right
type for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:16:01 -04:00
Nigel Croxon
b2176a1dfb md/raid: raid5 preserve the writeback action after the parity check
The problem is that any 'uptodate' vs 'disks' check is not precise
in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the
device that might try to kick off writes and then skip the action.
Better to prevent the raid driver from taking unexpected action *and* keep
the system alive vs killing the machine with BUG_ON.

Note: fixed warning reported by kbuild test robot <lkp@intel.com>

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-16 14:33:18 -07:00
Song Liu
a25d8c327b Revert "Don't jump to compute_result state from check_result state"
This reverts commit 4f4fd7c579.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Nigel Croxon <ncroxon@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-16 09:35:23 -07:00
Pawel Baldysiak
c42d324099 md: return -ENODEV if rdev has no mddev assigned
Mdadm expects that setting drive as faulty will fail with -EBUSY only if
this operation will cause RAID to be failed. If this happens, it will
try to stop the array. Currently -EBUSY might also be returned if rdev
is in the middle of the removal process - for example there is a race
with mdmon that already requested the drive to be failed/removed.

If rdev does not contain mddev, return -ENODEV instead, so the caller
can distinguish between those two cases and behave accordingly.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-16 09:31:21 -07:00
Jens Axboe
d7ba866759 Linux 5.1-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlyzsYgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGMw0H/ir42KJiABBKSETD
 0d38qXVclAI/123zl8EkSfDrBKOsuIpXUDxzKeoDMhMkiurMpK6bbEOTPJAQMZJe
 nEYpq/bZQi+vO8Q/pMMpaC3ExlIRosd0JAR7TyDUh5ZAeeMuDNzmvMk/DPxXPbNt
 0P1FWePDa7908ajCOW1T8ZrB9Ak8boo7TKkF3LBb00ks1mEkyp/l74MKOHdu+HYn
 XIwncX/Jotl4BrKdNC2f/NXYLYk6MrJDGug8TxuHgIqiMWhhrcSqbxU1ri7iqFXB
 cBYdFo6ZJ8CWHux8/5LY5CMjSqEtzKha2Ohuhy3MMu1RsICyFLQtHnxHJ1ytLSBt
 DOPcDQ0=
 =CEUD
 -----END PGP SIGNATURE-----

Merge tag 'v5.1-rc5' into for-5.2/block

Pull in v5.1-rc5 to resolve two conflicts. One is in BFQ, in just
a comment, and is trivial. The other one is a conflict due to a
later fix in the bio multi-page work, and needs a bit more care.

* tag 'v5.1-rc5': (476 commits)
  Linux 5.1-rc5
  fs: prevent page refcount overflow in pipe_buf_get
  mm: prevent get_user_pages() from overflowing page refcount
  mm: add 'try_get_page()' helper function
  mm: make page ref count overflow check tighter and more explicit
  clk: imx: Fix PLL_1416X not rounding rates
  clk: mediatek: fix clk-gate flag setting
  arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
  iommu/amd: Set exclusion range correctly
  clang-format: Update with the latest for_each macro list
  perf/core: Fix perf_event_disable_inatomic() race
  block: fix the return errno for direct IO
  Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
  NFSv4.1 fix incorrect return value in copy_file_range
  xprtrdma: Fix helper that drains the transport
  NFS: Fix handling of reply page vector
  NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
  dma-debug: only skip one stackframe entry
  platform/x86: pmc_atom: Drop __initconst on dmi table
  nvmet: fix discover log page when offsets are used
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-15 08:14:19 -06:00
Christoph Hellwig
efcd487c69 md: add __acquires/__releases annotations to handle_active_stripes
This tells sparse that we release and reacquire the device_lock and
avoids a warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:09 -07:00
Christoph Hellwig
368ecade05 md: add __acquires/__releases annotations to (un)lock_two_stripes
This tells sparse that we acquire/release the two stripe locks and
avoids a warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:09 -07:00
Christoph Hellwig
2b598ee54a md: mark md_cluster_mod static
Sparse complains that it has no external declaration, and it turns out
that it is never even used outside of md.c.  So just mark it static
and drop the export.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:09 -07:00
Christoph Hellwig
ae50640beb md: use correct type in super_1_sync
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:09 -07:00
Christoph Hellwig
00485d0942 md: use correct type in super_1_load
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:08 -07:00
Christoph Hellwig
c35403f82c md: use correct types in md_bitmap_print_sb
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:08 -07:00
Christoph Hellwig
ed4d0a4ea1 md: add a missing endianness conversion in check_sb_changes
The on-disk value is little endian and we need to convert it to
native endian before storing the value in the in-core structure.

Fixes: 7564beda19 ("md-cluster/raid10: support add disk under grow mode")
Cc: <stable@vger.kernel.org> # 4.20+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:08 -07:00
Yufen Yu
ee37e62191 md: add mddev->pers to avoid potential NULL pointer dereference
When doing re-add, we need to ensure rdev->mddev->pers is not NULL,
which can avoid potential NULL pointer derefence in fallowing
add_bound_rdev().

Fixes: a6da4ef85c ("md: re-add a failed disk")
Cc: Xiao Ni <xni@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: <stable@vger.kernel.org> # 4.4+
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:08 -07:00
Christoph Hellwig
72deb455b5 block: remove CONFIG_LBDAF
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures.  These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time.  Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06 10:48:35 -06:00
Mikulas Patocka
4ed319c6ac dm integrity: fix deadlock with overlapping I/O
dm-integrity will deadlock if overlapping I/O is issued to it, the bug
was introduced by commit 724376a04d ("dm integrity: implement fair
range locks").  Users rarely use overlapping I/O so this bug went
undetected until now.

Fix this bug by correcting, likely cut-n-paste, typos in
ranges_overlap() and also remove a flawed ranges_overlap() check in
remove_range_unlocked().  This condition could leave unprocessed bios
hanging on wait_list forever.

Cc: stable@vger.kernel.org # v4.19+
Fixes: 724376a04d ("dm integrity: implement fair range locks")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-05 18:49:08 -04:00
Mike Snitzer
bcb44433bb dm: disable DISCARD if the underlying storage no longer supports it
Storage devices which report supporting discard commands like
WRITE_SAME_16 with unmap, but reject discard commands sent to the
storage device.  This is a clear storage firmware bug but it doesn't
change the fact that should a program cause discards to be sent to a
multipath device layered on this buggy storage, all paths can end up
failed at the same time from the discards, causing possible I/O loss.

The first discard to a path will fail with Illegal Request, Invalid
field in cdb, e.g.:
 kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
 kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
 kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
 kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
 kernel: blk_update_request: critical target error, dev sdfn, sector 10487808

The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
device disables its support for discard on this path, and because of the
BLK_STS_TARGET error multipath fails the discard without failing any
path or retrying down a different path.  But subsequent discards can
cause path failures.  Any discards sent to the path which already failed
a discard ends up failing with EIO from blk_cloned_rq_check_limits with
an "over max size limit" error since the discard limit was set to 0 by
the sd driver for the path.  As the error is EIO, this now fails the
path and multipath tries to send the discard down the next path.  This
cycle continues as discards are sent until all paths fail.

Fix this by training DM core to disable DISCARD if the underlying
storage already did so.

Also, fix branching in dm_done() and clone_endio() to reflect the
mutually exclussive nature of the IO operations in question.

Cc: stable@vger.kernel.org
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-04 15:33:59 -04:00
Ilya Dryomov
eb40c0acdc dm table: propagate BDI_CAP_STABLE_WRITES to fix sporadic checksum errors
Some devices don't use blk_integrity but still want stable pages
because they do their own checksumming.  Examples include rbd and iSCSI
when data digests are negotiated.  Stacking DM (and thus LVM) on top of
these devices results in sporadic checksum errors.

Set BDI_CAP_STABLE_WRITES if any underlying device has it set.

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01 16:26:02 -04:00
Mikulas Patocka
75ae193626 dm: revert 8f50e35815 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
The limit was already incorporated to dm-crypt with commit 4e870e948f
("dm crypt: fix error with too large bios"), so we don't need to apply
it globally to all targets. The quantity BIO_MAX_PAGES * PAGE_SIZE is
wrong anyway because the variable ti->max_io_len it is supposed to be in
the units of 512-byte sectors not in bytes.

Reduction of the limit to 1048576 sectors could even cause data
corruption in rare cases - suppose that we have a dm-striped device with
stripe size 768MiB. The target will call dm_set_target_max_io_len with
the value 1572864. The buggy code would reduce it to 1048576. Now, the
dm-core will errorneously split the bios on 1048576-sector boundary
insetad of 1572864-sector boundary and pass these stripe-crossing bios
to the striped target.

Cc: stable@vger.kernel.org # v4.16+
Fixes: 8f50e35815 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01 16:20:36 -04:00
Andi Kleen
93fc91675a dm init: fix const confusion for dm_allowed_targets array
A non const pointer to const cannot be marked initconst.
Mark the array actually const.

Fixes: 6bbc923dfc dm: add support to directly boot to a mapped device
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01 16:16:37 -04:00
YueHaibing
5efedc9b62 dm integrity: make dm_integrity_init and dm_integrity_exit static
Fix sparse warnings:

drivers/md/dm-integrity.c:3619:12: warning:
 symbol 'dm_integrity_init' was not declared. Should it be static?
drivers/md/dm-integrity.c:3638:6: warning:
 symbol 'dm_integrity_exit' was not declared. Should it be static?

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01 16:16:36 -04:00
Mikulas Patocka
0d74e6a3b6 dm integrity: change memcmp to strncmp in dm_integrity_ctr
If the string opt_string is small, the function memcmp can access bytes
that are beyond the terminating nul character. In theory, it could cause
segfault, if opt_string were located just below some unmapped memory.

Change from memcmp to strncmp so that we don't read bytes beyond the end
of the string.

Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01 16:11:11 -04:00
NeilBrown
2bc13b83e6 md: batch flush requests.
Currently if many flush requests are submitted to an md device is quick
succession, they are serialized and can take a long to process them all.
We don't really need to call flush all those times - a single flush call
can satisfy all requests submitted before it started.
So keep track of when the current flush started and when it finished,
allow any pending flush that was requested before the flush started
to complete without waiting any more.

Test results from Xiao:

Test is done on a raid10 device which is created by 4 SSDs. The tool is
dbench.

1. The latest linux stable kernel
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    768    10.509    78.305
  Flush                  2078376     0.013    10.094
  Close                  21787697     0.019    18.821
  LockX                    96580     0.007     3.184
  Mkdir                      384     0.008     0.062
  Rename                 1255883     0.191    23.534
  ReadX                  46495589     0.020    14.230
  WriteX                 14790591     7.123    60.706
  Unlink                 5989118     0.440    54.551
  UnlockX                  96580     0.005     2.736
  FIND_FIRST             10393845     0.042    12.079
  SET_FILE_INFORMATION   2415558     0.129    10.088
  QUERY_FILE_INFORMATION 4711725     0.005     8.462
  QUERY_PATH_INFORMATION 26883327     0.032    21.715
  QUERY_FS_INFORMATION   4929409     0.010     8.238
  NTCreateX              29660080     0.100    53.268

Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
max_latency=60.712 ms

2. With patch1 "Revert "MD: fix lock contention for flush bios""
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    256     8.326    36.761
  Flush                   693291     3.974   180.269
  Close                  7266404     0.009    36.929
  LockX                    32160     0.006     0.840
  Mkdir                      128     0.008     0.021
  Rename                  418755     0.063    29.945
  ReadX                  15498708     0.007     7.216
  WriteX                 4932310    22.482   267.928
  Unlink                 1997557     0.109    47.553
  UnlockX                  32160     0.004     1.110
  FIND_FIRST             3465791     0.036     7.320
  SET_FILE_INFORMATION    805825     0.015     1.561
  QUERY_FILE_INFORMATION 1570950     0.005     2.403
  QUERY_PATH_INFORMATION 8965483     0.013    14.277
  QUERY_FS_INFORMATION   1643626     0.009     3.314
  NTCreateX              9892174     0.061    41.278

Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
max_latency=267.939 m

3. With patch1 and patch2
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    768     9.570    54.588
  Flush                  2061354     0.666    15.102
  Close                  21604811     0.012    25.697
  LockX                    95770     0.007     1.424
  Mkdir                      384     0.008     0.053
  Rename                 1245411     0.096    12.263
  ReadX                  46103198     0.011    12.116
  WriteX                 14667988     7.375    60.069
  Unlink                 5938936     0.173    30.905
  UnlockX                  95770     0.005     4.147
  FIND_FIRST             10306407     0.041    11.715
  SET_FILE_INFORMATION   2395987     0.048     7.640
  QUERY_FILE_INFORMATION 4672371     0.005     9.291
  QUERY_PATH_INFORMATION 26656735     0.018    19.719
  QUERY_FS_INFORMATION   4887940     0.010     7.654
  NTCreateX              29410811     0.059    28.551

Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
max_latency=60.075 ms

Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01 12:11:48 -06:00
NeilBrown
4bc034d353 Revert "MD: fix lock contention for flush bios"
This reverts commit 5a409b4f56.

This patch has two problems.

1/ it make multiple calls to submit_bio() from inside a make_request_fn.
 The bios thus submitted will be queued on current->bio_list and not
 submitted immediately.  As the bios are allocated from a mempool,
 this can theoretically result in a deadlock - all the pool of requests
 could be in various ->bio_list queues and a subsequent mempool_alloc
 could block waiting for one of them to be released.

2/ It aims to handle a case when there are many concurrent flush requests.
  It handles this by submitting many requests in parallel - all of which
  are identical and so most of which do nothing useful.
  It would be more efficient to just send one lower-level request, but
  allow that to satisfy multiple upper-level requests.

Fixes: 5a409b4f56 ("MD: fix lock contention for flush bios")
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01 12:11:48 -06:00
Nigel Croxon
4f4fd7c579 Don't jump to compute_result state from check_result state
Changing state from check_state_check_result to
check_state_compute_result not only is unsafe but also doesn't
appear to serve a valid purpose.  A raid6 check should only be
pushing out extra writes if doing repair and a mis-match occurs.
The stripe dev management will already try and do repair writes
for failing sectors.

This patch makes the raid6 check_state_check_result handling
work more like raid5's.  If somehow too many failures for a
check, just quit the check operation for the stripe.  When any
checks pass, don't try and use check_state_compute_result for
a purpose it isn't needed for and is unsafe for.  Just mark the
stripe as in sync for passing its parity checks and let the
stripe dev read/write code and the bad blocks list do their
job handling I/O errors.

Repro steps from Xiao:

These are the steps to reproduce this problem:
1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c
2. insmod scsi_debug.ko dev_size_mb=11000  max_luns=1 num_tgts=1
3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
sde is the disk created by scsi_debug
4. echo "2" >/sys/module/scsi_debug/parameters/opts
5. raid-check

It panic:
[ 4854.730899] md: data-check of RAID array md127
[ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current]
[ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error
[ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00
[ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0
[ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current]
[ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error
[ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00
[ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000
[ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current]
[ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error
[ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00
[ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000
[ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current]
[ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error
[ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00
[ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0
[ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1).
[ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1).
[ 4854.906201] ------------[ cut here ]------------
[ 4854.907341] kernel BUG at drivers/md/raid5.c:4190!

raid5.c:4190 above is this BUG_ON:

    handle_parity_checks6()
        ...
        BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */

Cc: <stable@vger.kernel.org> # v3.16+
OriginalAuthor: David Jeffery <djeffery@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
Tested-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: David Jeffy <djeffery@redhat.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01 12:11:48 -06:00
Linus Torvalds
11efae3506 for-5.1/block-post-20190315
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyL124QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptsxD/42slmoE5TC3vwXcgMBEilrjIHCns6O4Leo
 0r8Awdwil8QkVDphfAWsgkTBjRPUNKv4cCg2kG4VEzAy62YSutUWPeqJZwLOpGDI
 kji9XI6WLqwQ/VhDFwEln9G+xWDUQxds5PZDomlzLpjiNqkFArwwsPFnJbshH4fB
 U6kZrhVSLfvJHIJmC9H4RIWuTEwUH1yFSvzzMqDOOyvRon2g/A2YlHb2KhSCaJPq
 1b0jbhyR0GVP0EH1FdeKvNYFZfvXXSPAbxDN1CEtW/Lq8WxXeoaCj390tC+gL7yQ
 WWHntvUoVU/weWudbT3tVsYgpI91KfPM5OuWTDGod6lFwHrI5X91Pao3KYUGPb9d
 cwvNBOlkNqR1ENZOGTgxLeKwiwV7G1DIjvsaijRQJhGy4Uw4RkM/YEct9JHxWBIF
 x4ZuSVUVZ5Y3zNPC945iJ6Z5feOz/UO9bQL00oimu0c0JhAp++3pHWAFJEMQ8q1a
 0IRifkeUyhf0p9CIVPDnUzmNgSBglFkAVTPVAWySBVDU+v0/GoNcYwTzPq4cgPrF
 UJEIlx+RdDpKKmCqBvKjtx4w7BC1lCebL/1ZJrbARNO42djt8xeuyvKw0t+MYVTZ
 UsvLX72tXwUIbj0IZZGuz+8uSGD4ddDs8+x486FN4oaCPf36FUnnkOZZkhjV/KQA
 vsZNrNNZpw==
 =qBae
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block

Pull more block layer changes from Jens Axboe:
 "This is a collection of both stragglers, and fixes that came in after
  I finalized the initial pull. This contains:

   - An MD pull request from Song, with a few minor fixes

   - Set of NVMe patches via Christoph

   - Pull request from Konrad, with a few fixes for xen/blkback

   - pblk fix IO calculation fix (Javier)

   - Segment calculation fix for pass-through (Ming)

   - Fallthrough annotation for blkcg (Mathieu)"

* tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits)
  blkcg: annotate implicit fall through
  nvme-tcp: support C2HData with SUCCESS flag
  nvmet: ignore EOPNOTSUPP for discard
  nvme: add proper write zeroes setup for the multipath device
  nvme: add proper discard setup for the multipath device
  nvme: remove nvme_ns_config_oncs
  nvme: disable Write Zeroes for qemu controllers
  nvmet-fc: bring Disconnect into compliance with FC-NVME spec
  nvmet-fc: fix issues with targetport assoc_list list walking
  nvme-fc: reject reconnect if io queue count is reduced to zero
  nvme-fc: fix numa_node when dev is null
  nvme-fc: use nr_phys_segments to determine existence of sgl
  nvme-loop: init nvmet_ctrl fatal_err_work when allocate
  nvme: update comment to make the code easier to read
  nvme: put ns_head ref if namespace fails allocation
  nvme-trace: fix cdw10 buffer overrun
  nvme: don't warn on block content change effects
  nvme: add get-feature to admin cmds tracer
  md: Fix failed allocation of md_register_thread
  It's wrong to add len to sector_nr in raid10 reshape twice
  ...
2019-03-16 12:36:39 -07:00
Aditya Pakki
e406f12dde md: Fix failed allocation of md_register_thread
mddev->sync_thread can be set to NULL on kzalloc failure downstream.
The patch checks for such a scenario and frees allocated resources.

Committer node:

Added similar fix to raid5.c, as suggested by Guoqing.

Cc: stable@vger.kernel.org # v3.16+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-03-12 10:15:18 -07:00
Xiao Ni
b761dcf121 It's wrong to add len to sector_nr in raid10 reshape twice
In reshape_request it already adds len to sector_nr already. It's wrong to add len to
sector_nr again after adding pages to bio. If there is bad block it can't copy one chunk
at a time, it needs to goto read_more. Now the sector_nr is wrong. It can cause data
corruption.

Cc: stable@vger.kernel.org # v3.16+
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-03-12 10:15:18 -07:00
Mariusz Dabrowski
a596d08677 raid5: set write hint for PPL
When the Partial Parity Log is enabled, circular buffer is used to store
PPL data. Each write to RAID device causes overwrite of data in this buffer
so some write_hint can be set to those request to help drives handle
garbage collection. This patch adds new sysfs attribute which can be used
to specify which write_hint should be assigned to PPL.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-03-12 10:15:18 -07:00
Kent Overstreet
b330e6a49d md: convert to kvmalloc
The code really just wants a big flat buffer, so just do that.

Link: http://lkml.kernel.org/r/20181217131929.11727-3-kent.overstreet@gmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Linus Torvalds
6cdc577a18 - Update bio-based DM core to always call blk_queue_split() and update
DM targets to properly advertise discard limits that blk_queue_split()
   looks at when dtermining to split discard.  Whereby allowing DM core's
   own 'split_discard_bios' to be removed.
 
 - Improve DM cache target to provide support for discard passdown to the
   origin device.
 
 - Introduce support to directly boot to a DM mapped device from init by
   using dm-mod.create= module param.  This eliminates the need for an
   elaborate initramfs that is otherwise needed to create DM devices.
   This feature's implementation has been worked on for quite some time
   (got up to v12) and is of particular interest to Android and other
   more embedded platforms (e.g. ARM).
 
 - Rate limit errors from the DM integrity target that were identified as
   the cause for recent NMI hangs due to console limitations.
 
 - Add sanity checks for user input to thin-pool and external snapshot
   creation.
 
 - Remove some unused leftover kmem caches from when old .request_fn
   request-based support was removed.
 
 - Various small cleanups and fixes to targets (e.g. typos, needless
   unlikely() annotations, use struct_size(), remove needless
   .direct_access method from dm-snapshot)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJcgT+7AAoJEMUj8QotnQNaAsUIAIxsO5y6+7UruZzZxpyYBA34
 yBLnZ9SICxESteu4R9lWT4LnFbrdwDDSSCeQ1dFt5/vx54T4qISN/O3lv9e//BeJ
 BxFXtu7wB485l28uojBZeb+9APTaoihfEokcfDqZnaf26XtY0t/M+yRP7U86eGcC
 zsX9fOEmJ3cpWtpai07tbHNDjIrr1kIWcFuU2+xGO/wn+Up8uLd85exi7e3cqDs6
 VC+YJ/10/2keqFQvse3w3TBMjduwpb7SlDa2z/SorYaStVHzgwRSSjWYkSM/eDRA
 OkSeRQ3Rnwc+Vad2R8J7unnZlMd4kALjGuzbyafWnitE+C+n0aJFDKqjIwNbKcw=
 =GKp5
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Update bio-based DM core to always call blk_queue_split() and update
   DM targets to properly advertise discard limits that
   blk_queue_split() looks at when dtermining to split discard. Whereby
   allowing DM core's own 'split_discard_bios' to be removed.

 - Improve DM cache target to provide support for discard passdown to
   the origin device.

 - Introduce support to directly boot to a DM mapped device from init by
   using dm-mod.create= module param. This eliminates the need for an
   elaborate initramfs that is otherwise needed to create DM devices.

   This feature's implementation has been worked on for quite some time
   (got up to v12) and is of particular interest to Android and other
   more embedded platforms (e.g. ARM).

 - Rate limit errors from the DM integrity target that were identified
   as the cause for recent NMI hangs due to console limitations.

 - Add sanity checks for user input to thin-pool and external snapshot
   creation.

 - Remove some unused leftover kmem caches from when old .request_fn
   request-based support was removed.

 - Various small cleanups and fixes to targets (e.g. typos, needless
   unlikely() annotations, use struct_size(), remove needless
   .direct_access method from dm-snapshot)

* tag 'for-5.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm integrity: limit the rate of error messages
  dm snapshot: don't define direct_access if we don't support it
  dm cache: add support for discard passdown to the origin device
  dm writecache: fix typo in name for writeback_wq
  dm: add support to directly boot to a mapped device
  dm thin: add sanity checks to thin-pool and external snapshot creation
  dm block manager: remove redundant unlikely annotation
  dm verity fec: remove redundant unlikely annotation
  dm integrity: remove redundant unlikely annotation
  dm: always call blk_queue_split() in dm_process_bio()
  dm: fix to_sector() for 32bit
  dm switch: use struct_size() in kzalloc()
  dm: remove unused _rq_tio_cache and _rq_cache
  dm: eliminate 'split_discard_bios' flag from DM target interface
  dm: update dm_process_bio() to split bio if in ->make_request_fn()
2019-03-09 17:40:27 -08:00
Linus Torvalds
80201fe175 for-5.1/block-20190302
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlx63XIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp2vEACfrrQsap7R+Av28mmXpmXi2FPa3g5Tev1t
 yYjK2qHvhlMZjPTYw3hCmbYdDDczlF7PEgSE2x2DjdcsYapb8Fy1lZ2X16c7ztBR
 HD/t9b5AVSQsczZzKgv3RqsNtTnjzS5V0A8XH8FAP2QRgiwDMwSN6G0FP0JBLbE/
 ZgxQrH1Iy1F33Wz4hI3Z7dEghKPZrH1IlegkZCEu47q9SlWS76qUetSy2GEtchOl
 3Lgu54mQZyVdI5/QZf9DyMDLF6dIz3tYU2qhuo01AHjGRCC72v86p8sIiXcUr94Q
 8pbegJhJ/g8KBol9Qhv3+pWG/QUAZwi/ZwasTkK+MJ4klRXfOrznxPubW1z6t9Vn
 QRo39Po5SqqP0QWAscDxCFjESIQlWlKa+LZurJL7DJDCUGrSgzTpnVwFqKwc5zTP
 HJa5MT2tEeL2TfUYRYCfh0ZV0elINdHA1y1klDBh38drh4EWr2gW8xdseGYXqRjh
 fLgEpoF7VQ8kTvxKN+E4jZXkcZmoLmefp0ZyAbblS6IawpPVC7kXM9Fdn2OU8f2c
 fjVjvSiqxfeN6dnpfeLDRbbN9894HwgP/LPropJOQ7KmjCorQq5zMDkAvoh3tElq
 qwluRqdBJpWT/F05KweY+XVW8OawIycmUWqt6JrVNoIDAK31auHQv47kR0VA4OvE
 DRVVhYpocw==
 =VBaU
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "Not a huge amount of changes in this round, the biggest one is that we
  finally have Mings multi-page bvec support merged. Apart from that,
  this pull request contains:

   - Small series that avoids quiescing the queue for sysfs changes that
     match what we currently have (Aleksei)

   - Series of bcache fixes (via Coly)

   - Series of lightnvm fixes (via Mathias)

   - NVMe pull request from Christoph. Nothing major, just SPDX/license
     cleanups, RR mp policy (Hannes), and little fixes (Bart,
     Chaitanya).

   - BFQ series (Paolo)

   - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
     for the fast path (Jianchao)

   - fops->iopoll() added for async IO polling, this is a feature that
     the upcoming io_uring interface will use (Christoph, me)

   - Partition scan loop fixes (Dongli)

   - mtip32xx conversion from managed resource API (Christoph)

   - cdrom registration race fix (Guenter)

   - MD pull from Song, two minor fixes.

   - Various documentation fixes (Marcos)

   - Multi-page bvec feature. This brings a lot of nice improvements
     with it, like more efficient splitting, larger IOs can be supported
     without growing the bvec table size, and so on. (Ming)

   - Various little fixes to core and drivers"

* tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
  block: fix updating bio's front segment size
  block: Replace function name in string with __func__
  nbd: propagate genlmsg_reply return code
  floppy: remove set but not used variable 'q'
  null_blk: fix checking for REQ_FUA
  block: fix NULL pointer dereference in register_disk
  fs: fix guard_bio_eod to check for real EOD errors
  blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
  block: optimize bvec iteration in bvec_iter_advance
  block: introduce mp_bvec_for_each_page() for iterating over page
  block: optimize blk_bio_segment_split for single-page bvec
  block: optimize __blk_segment_map_sg() for single-page bvec
  block: introduce bvec_nth_page()
  iomap: wire up the iopoll method
  block: add bio_set_polled() helper
  block: wire up block device iopoll method
  fs: add an iopoll method to struct file_operations
  loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
  loop: do not print warn message if partition scan is successful
  block: bounce: make sure that bvec table is updated
  ...
2019-03-08 14:12:17 -08:00
Mikulas Patocka
2255574468 dm integrity: limit the rate of error messages
When using dm-integrity underneath md-raid, some tests with raid
auto-correction trigger large amounts of integrity failures - and all
these failures print an error message. These messages can bring the
system to a halt if the system is using serial console.

Fix this by limiting the rate of error messages - it improves the speed
of raid recovery and avoids the hang.

Fixes: 7eada909bf ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-06 09:03:00 -05:00
Mikulas Patocka
c439ca69d5 dm snapshot: don't define direct_access if we don't support it
Don't define a direct_access function that fails, dm_dax_direct_access
already fails with -EIO if the pointer is zero;

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:52 -05:00
Mike Snitzer
de7180ff90 dm cache: add support for discard passdown to the origin device
DM cache now defaults to passing discards down to the origin device.
User may disable this using the "no_discard_passdown" feature when
creating the cache device.

If the cache's underlying origin device doesn't support discards then
passdown is disabled (with warning).  Similarly, if the underlying
origin device's max_discard_sectors is less than a cache block discard
passdown will be disabled (this is required because sizing of the cache
internal discard bitset depends on it).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:52 -05:00
Huaisheng Ye
f87e033b3b dm writecache: fix typo in name for writeback_wq
The workqueue's name should be "writecache-writeback" instead of
"writecache-writeabck".

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:51 -05:00
Helen Koike
6bbc923dfc dm: add support to directly boot to a mapped device
Add a "create" module parameter, which allows device-mapper targets to
be configured at boot time. This enables early use of DM targets in the
boot process (as the root device or otherwise) without the need of an
initramfs.

The syntax used in the boot param is based on the concise format from
the dmsetup tool to follow the rule of least surprise:

	dmsetup table --concise /dev/mapper/lroot

Which is:
	dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]

Where,
	<name>		::= The device name.
	<uuid>		::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
	<minor>		::= The device minor number | ""
	<flags>		::= "ro" | "rw"
	<table>		::= <start_sector> <num_sectors> <target_type> <target_args>
	<target_type>	::= "verity" | "linear" | ...

For example, the following could be added in the boot parameters:
dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0

Only the targets that were tested are allowed and the ones that don't
change any block device when the device is create as read-only. For
example, mirror and cache targets are not allowed. The rationale behind
this is that if the user makes a mistake, choosing the wrong device to
be the mirror or the cache can corrupt data.

The only targets initially allowed are:
* crypt
* delay
* linear
* snapshot-origin
* striped
* verity

Co-developed-by: Will Drewry <wad@chromium.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:50 -05:00
Jason Cai (Xiang Feng)
70de2cbda8 dm thin: add sanity checks to thin-pool and external snapshot creation
Invoking dm_get_device() twice on the same device path with different
modes is dangerous.  Because in that case, upgrade_mode() will alloc a
new 'dm_dev' and free the old one, which may be referenced by a previous
caller.  Dereferencing the dangling pointer will trigger kernel NULL
pointer dereference.

The following two cases can reproduce this issue.  Actually, they are
invalid setups that must be disallowed, e.g.:

1. Creating a thin-pool with read_only mode, and the same device as
both metadata and data.

dmsetup create thinp --table \
    "0 41943040 thin-pool /dev/vdb /dev/vdb 128 0 1 read_only"

BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
...
Call Trace:
 new_read+0xfb/0x110 [dm_bufio]
 dm_bm_read_lock+0x43/0x190 [dm_persistent_data]
 ? kmem_cache_alloc_trace+0x15c/0x1e0
 __create_persistent_data_objects+0x65/0x3e0 [dm_thin_pool]
 dm_pool_metadata_open+0x8c/0xf0 [dm_thin_pool]
 pool_ctr.cold.79+0x213/0x913 [dm_thin_pool]
 ? realloc_argv+0x50/0x70 [dm_mod]
 dm_table_add_target+0x14e/0x330 [dm_mod]
 table_load+0x122/0x2e0 [dm_mod]
 ? dev_status+0x40/0x40 [dm_mod]
 ctl_ioctl+0x1aa/0x3e0 [dm_mod]
 dm_ctl_ioctl+0xa/0x10 [dm_mod]
 do_vfs_ioctl+0xa2/0x600
 ? handle_mm_fault+0xda/0x200
 ? __do_page_fault+0x26c/0x4f0
 ksys_ioctl+0x60/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x55/0x150
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

2. Creating a external snapshot using the same thin-pool device.

dmsetup create thinp --table \
    "0 41943040 thin-pool /dev/vdc /dev/vdb 128 0 2 ignore_discard"
dmsetup message /dev/mapper/thinp 0 "create_thin 0"
dmsetup create snap --table \
            "0 204800 thin /dev/mapper/thinp 0 /dev/mapper/thinp"

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
...
Call Trace:
? __alloc_pages_nodemask+0x13c/0x2e0
retrieve_status+0xa5/0x1f0 [dm_mod]
? dm_get_live_or_inactive_table.isra.7+0x20/0x20 [dm_mod]
 table_status+0x61/0xa0 [dm_mod]
 ctl_ioctl+0x1aa/0x3e0 [dm_mod]
 dm_ctl_ioctl+0xa/0x10 [dm_mod]
 do_vfs_ioctl+0xa2/0x600
 ksys_ioctl+0x60/0x90
 ? ksys_write+0x4f/0xb0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x55/0x150
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:49 -05:00
Chengguang Xu
5941c621dc dm block manager: remove redundant unlikely annotation
unlikely has already included in IS_ERR(),
so just remove redundant unlikely annotation.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:48 -05:00
Chengguang Xu
821b40da4d dm verity fec: remove redundant unlikely annotation
unlikely has already included in IS_ERR(),
so just remove redundant unlikely annotation.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:48 -05:00
Chengguang Xu
5e3d0e3706 dm integrity: remove redundant unlikely annotation
unlikely has already included in IS_ERR(),
so just remove redundant unlikely annotation.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:47 -05:00
Mike Snitzer
effd58c95f dm: always call blk_queue_split() in dm_process_bio()
Do not just call blk_queue_split() if the bio is_abnormal_io().

Fixes: 568c73a355 ("dm: update dm_process_bio() to split bio if in ->make_request_fn()")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:32 -05:00
Gustavo A. R. Silva
d2832376b6 dm switch: use struct_size() in kzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:48:51 -05:00
Mike Snitzer
e689fbab3d dm: remove unused _rq_tio_cache and _rq_cache
Also move dm_rq_target_io structure definition from dm-rq.h to dm-rq.c

Fixes: 6a23e05c2f ("dm: remove legacy request-based IO path")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:48:50 -05:00
Mike Snitzer
61697a6abd dm: eliminate 'split_discard_bios' flag from DM target interface
There is no need to have DM core split discards on behalf of a DM target
now that blk_queue_split() handles splitting discards based on the
queue_limits.  A DM target just needs to set max_discard_sectors,
discard_granularity, etc, in queue_limits.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-20 23:24:55 -05:00
Mike Snitzer
568c73a355 dm: update dm_process_bio() to split bio if in ->make_request_fn()
Must call blk_queue_split() otherwise queue_limits for abnormal requests
(e.g. discard, writesame, etc) won't be imposed.

In addition, add dm_queue_split() to simplify DM specific splitting that
is needed for targets that impose ti->max_io_len.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-19 12:25:00 -05:00
Linus Torvalds
24f0a48743 for-linus-20190215
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxm7pAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpl6JEACM5qHp7HEf7muuLKDUoX16G2eDOjacVxbL
 q1kqyHNvrYD/aGo+8vcshCef6xno9fL1akIxTyaTcMwYJUk9JSMicsVimxC1OvI6
 a5ZiWItX2L8Nh/heJe+FtutWbrT+Nd+3Q8DqI+U0YkRnjnXaRVgLFtBmjLOxBrqJ
 Ps/VepB4GaxA0oWdPbhos/N3wa42uFy3ixdv3Kv6WmHdqraB9uagt8PwwUti3WzQ
 uxWL6J+JOBSDha8l3fp68Okib1bm/6Nmmc9l8Yz1eFwf+Y+gVgw7wPQxkUD/XaFW
 bDJGwp3NawK07EanIAIzfXUEGfLvgeRJBEP3OGwV/TAiHX5q9zQo/tbM6x8j4aT9
 zGlwU/EnwFixgbRW/hOT5Ox4usBlfB1j0ZiNmgUm8QphHrELFnc35Kd+PR/KONNX
 sI6ZiifEAMR+4S99kTZ5YjHUqcUVm9ndd8iQGW9mvM6vt3o1L6QKeOeEKBMlhMcx
 V+JtViC50ojidYc82kEtQFY9OKRkc5x3k1wBsH49LGMT+fvEwETallOXHTarQKrv
 QAZNN1NINkMmrL5bgBXFqf0qpOy4xHnhis5AilUHNZwa4G8iAe8oqz/2eUCydiV1
 Ogx20a8T1ifeSkI2NXrwnBjVzqnfiO9wOb9py98BiLR6k59x3GYtbCdGtpIXfSFv
 hG79KKoz3Q==
 =8mjO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190215' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Ensure we insert into the hctx dispatch list, if a request is marked
   as DONTPREP (Jianchao)

 - NVMe pull request, single missing unlock on error fix (Keith)

 - MD pull request, single fix for a potentially data corrupting issue
   (Nate)

 - Floppy check_events regression fix (Yufen)

* tag 'for-linus-20190215' of git://git.kernel.dk/linux-block:
  md/raid1: don't clear bitmap bits on interrupted recovery.
  floppy: check_events callback should not return a negative number
  nvme-pci: add missing unlock for reset error
  blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
2019-02-15 09:12:28 -08:00
Jens Axboe
6fb845f0e7 Linux 5.0-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
 yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
 A6Ktjw==
 =scY4
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc6' into for-5.1/block

Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.

* tag 'v5.0-rc6': (525 commits)
  Linux 5.0-rc6
  x86/mm: Make set_pmd_at() paravirt aware
  MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
  blk-mq: remove duplicated definition of blk_mq_freeze_queue
  Blk-iolatency: warn on negative inflight IO counter
  blk-iolatency: fix IO hang due to negative inflight counter
  MAINTAINERS: unify reference to xen-devel list
  x86/mm/cpa: Fix set_mce_nospec()
  futex: Handle early deadlock return correctly
  futex: Fix barrier comment
  net: dsa: b53: Fix for failure when irq is not defined in dt
  blktrace: Show requests without sector
  mips: cm: reprime error cause
  mips: loongson64: remove unreachable(), fix loongson_poweroff().
  sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
  geneve: should not call rt6_lookup() when ipv6 was disabled
  KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
  KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
  kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
  signal: Better detection of synchronous signals
  ...
2019-02-15 08:43:59 -07:00
Ming Lei
56d18f62f5 block: kill BLK_MQ_F_SG_MERGE
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:12 -07:00
Ming Lei
2705c93742 block: kill QUEUE_FLAG_NO_SG_MERGE
Since bdced438ac ("block: setup bi_phys_segments after splitting"),
physical segment number is mainly figured out in blk_queue_split() for
fast path, and the flag of BIO_SEG_VALID is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.

Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
is set in blk_queue_split().

For another user of blk_recalc_rq_segments():

- run in partial completion branch of blk_update_request, which is an unusual case

- run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed
since dm-rq is the only user.

Multi-page bvec is enabled now, not doing S/G merging is rather pointless with the
current setup of the I/O path, as it isn't going to save you a significant amount
of cycles.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:12 -07:00
Ming Lei
6dc4f100c1 block: allow bio_for_each_segment_all() to iterate over multi-page bvec
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.

Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Ming Lei
2e1f4f4d24 bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()
bch_bio_alloc_pages() is always called on one new bio, so it is safe
to access the bvec table directly. Given it is the only kind of this
case, open code the bvec table access since bio_for_each_segment_all()
will be changed to support for iterating over multipage bvec.

Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Nikos Tsironis
4ae280b4ee dm thin: fix bug where bio that overwrites thin block ignores FUA
When provisioning a new data block for a virtual block, either because
the block was previously unallocated or because we are breaking sharing,
if the whole block of data is being overwritten the bio that triggered
the provisioning is issued immediately, skipping copying or zeroing of
the data block.

When this bio completes the new mapping is inserted in to the pool's
metadata by process_prepared_mapping(), where the bio completion is
signaled to the upper layers.

This completion is signaled without first committing the metadata.  If
the bio in question has the REQ_FUA flag set and the system crashes
right after its completion and before the next metadata commit, then the
write is lost despite the REQ_FUA flag requiring that I/O completion for
this request must only be signaled after the data has been committed to
non-volatile storage.

Fix this by deferring the completion of overwrite bios, with the REQ_FUA
flag set, until after the metadata has been committed.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-14 19:02:29 -05:00
Nate Dailey
dfcc34c99f md/raid1: don't clear bitmap bits on interrupted recovery.
sync_request_write no longer submits writes to a Faulty device. This has
the unfortunate side effect that bitmap bits can be incorrectly cleared
if a recovery is interrupted (previously, end_sync_write would have
prevented this). This means the next recovery may not copy everything
it should, potentially corrupting data.

Add a function for doing the proper md_bitmap_end_sync, called from
end_sync_write and the Faulty case in sync_request_write.

backport note to 4.14: s/md_bitmap_end_sync/bitmap_end_sync
Cc: stable@vger.kernel.org 4.14+
Fixes: 0c9d5b127f ("md/raid1: avoid reusing a resync bio after error handling.")
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-02-12 14:06:58 -08:00
Mikulas Patocka
ff0c129d3b dm crypt: don't overallocate the integrity tag space
bio_sectors() returns the value in the units of 512-byte sectors (no
matter what the real sector size of the device).  dm-crypt multiplies
bio_sectors() by on_disk_tag_size to calculate the space allocated for
integrity tags.  If dm-crypt is running with sector size larger than
512b, it allocates more data than is needed.

Device Mapper trims the extra space when passing the bio to
dm-integrity, so this bug didn't result in any visible misbehavior.
But it must be fixed to avoid wasteful memory allocation for the block
integrity payload.

Fixes: ef43aa3806 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Cc: stable@vger.kernel.org # 4.12+
Reported-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-11 12:06:48 -05:00
Coly Li
dc7292a5bc bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata
In 'commit 752f66a75a ("bcache: use REQ_PRIO to indicate bio for
metadata")' REQ_META is replaced by REQ_PRIO to indicate metadata bio.
This assumption is not always correct, e.g. XFS uses REQ_META to mark
metadata bio other than REQ_PRIO. This is why Nix noticed that bcache
does not cache metadata for XFS after the above commit.

Thanks to Dave Chinner, he explains the difference between REQ_META and
REQ_PRIO from view of file system developer. Here I quote part of his
explanation from mailing list,
   REQ_META is used for metadata. REQ_PRIO is used to communicate to
   the lower layers that the submitter considers this IO to be more
   important that non REQ_PRIO IO and so dispatch should be expedited.

   IOWs, if the filesystem considers metadata IO to be more important
   that user data IO, then it will use REQ_PRIO | REQ_META rather than
   just REQ_META.

Then it seems bios with REQ_META or REQ_PRIO should both be cached for
performance optimation, because they are all probably low I/O latency
demand by upper layer (e.g. file system).

So in this patch, when we want to decide whether to bypass the cache,
REQ_META and REQ_PRIO are both checked. Then both metadata and
high priority I/O requests will be handled properly.

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Andre Noll <maan@tuebingen.mpg.de>
Tested-by: Nix <nix@esperi.org.uk>
Cc: stable@vger.kernel.org
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:33 -07:00
Coly Li
a91fbda49f bcache: fix input overflow to cache set sysfs file io_error_halflife
Cache set sysfs entry io_error_halflife is used to set c->error_decay.
c->error_decay is in type unsigned int, and it is converted by
strtoul_or_return(), therefore overflow to c->error_decay is possible
for a large input value.

This patch fixes the overflow by using strtoul_safe_clamp() to convert
input string to an unsigned long value in range [0, UINT_MAX], then
divides by 88 and set it to c->error_decay.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:33 -07:00
Coly Li
b15008403b bcache: fix input overflow to cache set io_error_limit
c->error_limit is in type unsigned int, it is set via cache set sysfs
file io_error_limit. Inside the bcache code, input string is converted
by strtoul_or_return() and set the converted value to c->error_limit.

Because the converted value is unsigned long, and c->error_limit is
unsigned int, if the input is large enought, overflow will happen to
c->error_limit.

This patch uses sysfs_strtoul_clamp() to convert input string, and set
the range in [0, UINT_MAX] to avoid the potential overflow.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
453745fbbe bcache: fix input overflow to journal_delay_ms
c->journal_delay_ms is in type unsigned short, it is set via sysfs
interface and converted by sysfs_strtoul() from input string to
unsigned short value. Therefore overflow to unsigned short might be
happen when the converted value exceed USHRT_MAX. e.g. writing
65536 into sysfs file journal_delay_ms, c->journal_delay_ms is set to
0.

This patch uses sysfs_strtoul_clamp() to convert the input string and
limit value range in [0, USHRT_MAX], to avoid the input overflow.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
dab71b2db9 bcache: fix input overflow to writeback_rate_minimum
dc->writeback_rate_minimum is type unsigned integer variable, it is set
via sysfs interface, and converte from input string to unsigned integer
by d_strtoul_nonzero(). When the converted input value is larger than
UINT_MAX, overflow to unsigned integer happens.

This patch fixes the overflow by using sysfs_strotoul_clamp() to
convert input string and limit the value in range [1, UINT_MAX], then
the overflow can be avoided.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
5b5fd3c94e bcache: fix potential div-zero error of writeback_rate_p_term_inverse
Current code already uses d_strtoul_nonzero() to convert input string
to an unsigned integer, to make sure writeback_rate_p_term_inverse
won't be zero value. But overflow may happen when converting input
string to an unsigned integer value by d_strtoul_nonzero(), then
dc->writeback_rate_p_term_inverse can still be set to 0 even if the
sysfs file input value is not zero, e.g. 4294967296 (a.k.a UINT_MAX+1).

If dc->writeback_rate_p_term_inverse is set to 0, it might cause a
dev-zero error in following code from __update_writeback_rate(),
	int64_t proportional_scaled =
		div_s64(error, dc->writeback_rate_p_term_inverse);

This patch replaces d_strtoul_nonzero() by sysfs_strtoul_clamp() and
limit the value range in [1, UINT_MAX]. Then the unsigned integer
overflow and dev-zero error can be avoided.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
c3b75a2199 bcache: fix potential div-zero error of writeback_rate_i_term_inverse
dc->writeback_rate_i_term_inverse can be set via sysfs interface. It is
in type unsigned int, and convert from input string by d_strtoul(). The
problem is d_strtoul() does not check valid range of the input, if
4294967296 is written into sysfs file writeback_rate_i_term_inverse,
an overflow of unsigned integer will happen and value 0 is set to
dc->writeback_rate_i_term_inverse.

In writeback.c:__update_writeback_rate(), there are following lines of
code,
      integral_scaled = div_s64(dc->writeback_rate_integral,
                      dc->writeback_rate_i_term_inverse);
If dc->writeback_rate_i_term_inverse is set to 0 via sysfs interface,
a div-zero error might be triggered in the above code.

Therefore we need to add a range limitation in the sysfs interface,
this is what this patch does, use sysfs_stroul_clamp() to replace
d_strtoul() and restrict the input range in [1, UINT_MAX].

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
369d21a73a bcache: fix input overflow to writeback_delay
Sysfs file writeback_delay is used to configure dc->writeback_delay
which is type unsigned int. But bcache code uses sysfs_strtoul() to
convert the input string, therefore it might be overflowed if the input
value is too large. E.g. input value is 4294967296 but indeed 0 is
set to dc->writeback_delay.

This patch uses sysfs_strtoul_clamp() to convert the input string and
set the result value range in [0, UINT_MAX] to avoid such unsigned
integer overflow.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
f5c0b95d2e bcache: use sysfs_strtoul_bool() to set bit-field variables
When setting bcache parameters via sysfs, there are some variables are
defined as bit-field value. Current bcache code in sysfs.c uses either
d_strtoul() or sysfs_strtoul() to convert the input string to unsigned
integer value and set it to the corresponded bit-field value.

The problem is, the bit-field value only takes the lowest bit of the
converted value. If input is 2, the expected value (like bool value)
of the bit-field value should be 1, but indeed it is 0.

The following sysfs files for bit-field variables have such problem,
	bypass_torture_test,	for dc->bypass_torture_test
	writeback_metadata,	for dc->writeback_metadata
	writeback_running,	for dc->writeback_running
	verify,			for c->verify
	key_merging_disabled,	for c->key_merging_disabled
	gc_always_rewrite,	for c->gc_always_rewrite
	btree_shrinker_disabled,for c->shrinker_disabled
	copy_gc_enabled,	for c->copy_gc_enabled

This patch uses sysfs_strtoul_bool() to set such bit-field variables,
then if the converted value is non-zero, the bit-field variables will
be set to 1, like setting a bool value like expensive_debug_checks.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
e4db37fb69 bcache: add sysfs_strtoul_bool() for setting bit-field variables
When setting bool values via sysfs interface, e.g. writeback_metadata,
if writing 1 into writeback_metadata file, dc->writeback_metadata is
set to 1, but if writing 2 into the file, dc->writeback_metadata is
0. This is misleading, a better result should be 1 for all non-zero
input value.

It is because dc->writeback_metadata is a bit-field variable, and
current code simply use d_strtoul() to convert a string into integer
and takes the lowest bit value. To fix such error, we need a routine
to convert the input string into unsigned integer, and set target
variable to 1 if the converted integer is non-zero.

This patch introduces a new macro called sysfs_strtoul_bool(), it can
be used to convert input string into bool value, we can use it to set
bool value for bit-field vairables.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
8c27a3953e bcache: fix input overflow to sequential_cutoff
People may set sequential_cutoff of a cached device via sysfs file,
but current code does not check input value overflow. E.g. if value
4294967295 (UINT_MAX) is written to file sequential_cutoff, its value
is 4GB, but if 4294967296 (UINT_MAX + 1) is written into, its value
will be 0. This is an unexpected behavior.

This patch replaces d_strtoi_h() by sysfs_strtoul_clamp() to convert
input string to unsigned integer value, and limit its range in
[0, UINT_MAX]. Then the input overflow can be fixed.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
f54478c6e2 bcache: fix input integer overflow of congested threshold
Cache set congested threshold values congested_read_threshold_us and
congested_write_threshold_us can be set via sysfs interface. These
two values are 'unsigned int' type, but sysfs interface uses strtoul
to convert input string. So if people input a large number like
9999999999, the value indeed set is 1410065407, which is not expected
behavior.

This patch replaces sysfs_strtoul() by sysfs_strtoul_clamp() when
convert input string to unsigned int value, and set value range in
[0, UINT_MAX], to avoid the above integer overflow errors.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Coly Li
596b5a5dd1 bcache: improve sysfs_strtoul_clamp()
Currently sysfs_strtoul_clamp() is defined as,
 82 #define sysfs_strtoul_clamp(file, var, min, max)                   \
 83 do {                                                               \
 84         if (attr == &sysfs_ ## file)                               \
 85                 return strtoul_safe_clamp(buf, var, min, max)      \
 86                         ?: (ssize_t) size;                         \
 87 } while (0)

The problem is, if bit width of var is less then unsigned long, min and
max may not protect var from integer overflow, because overflow happens
in strtoul_safe_clamp() before checking min and max.

To fix such overflow in sysfs_strtoul_clamp(), to make min and max take
effect, this patch adds an unsigned long variable, and uses it to macro
strtoul_safe_clamp() to convert an unsigned long value in range defined
by [min, max]. Then assign this value to var. By this method, if bit
width of var is less than unsigned long, integer overflow won't happen
before min and max are checking.

Now sysfs_strtoul_clamp() can properly handle smaller data type like
unsigned int, of cause min and max should be defined in range of
unsigned int too.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Tang Junhui
58ac323084 bcache: treat stale && dirty keys as bad keys
Stale && dirty keys can be produced in the follow way:
After writeback in write_dirty_finish(), dirty keys k1 will
replace by clean keys k2
==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
==>btree_insert_fn(struct btree_op *b_op, struct btree *b)
==>static int bch_btree_insert_node(struct btree *b,
       struct btree_op *op,
       struct keylist *insert_keys,
       atomic_t *journal_ref,
Then two steps:
A) update k1 to k2 in btree node memory;
   bch_btree_insert_keys(b, op, insert_keys, replace_key)
B) Write the bset(contains k2) to cache disk by a 30s delay work
   bch_btree_leaf_dirty(b, journal_ref).
But before the 30s delay work write the bset to cache device,
these things happened:
A) GC works, and reclaim the bucket k2 point to;
B) Allocator works, and invalidate the bucket k2 point to,
   and increase the gen of the bucket, and place it into free_inc
   fifo;
C) Until now, the 30s delay work still does not finish work,
   so in the disk, the key still is k1, it is dirty and stale
   (its gen is smaller than the gen of the bucket). and then the
   machine power off suddenly happens;
D) When the machine power on again, after the btree reconstruction,
   the stale dirty key appear.

In bch_extent_bad(), when expensive_debug_checks is off, it would
treat the dirty key as good even it is stale keys, and it would
cause bellow probelms:
A) In read_dirty() it would cause machine crash:
   BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
B) It could be worse when reads hits stale dirty keys, it would
   read old incorrect data.

This patch tolerate the existence of these stale && dirty keys,
and treat them as bad key in bch_extent_bad().

(Coly Li: fix indent which was modified by sender's email client)

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Colin Ian King
e8cf978dff bcache: fix indentation issue, remove tabs on a hunk of code
There is a hunk of code that is indented one level too deep, fix this
by removing the extra tabs.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Coly Li
d4610456cf bcache: export backing_dev_uuid via sysfs
When there are multiple bcache devices, after a reboot the name of
bcache devices may change (e.g. current /dev/bcache1 was /dev/bcache0
before reboot). Therefore we need the backing device UUID (sb.uuid) to
identify each bcache device.

Backing device uuid can be found by program bcache-super-show, but
directly exporting backing_dev_uuid by sysfs file
/sys/block/bcache<?>/bcache/backing_dev_uuid is a much simpler method.

With backing_dev_uuid, and partition uuids from /dev/disk/by-partuuid/,
now we can identify each bcache device and its partitions conveniently.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Coly Li
926d19465b bcache: export backing_dev_name via sysfs
This patch export dc->backing_dev_name to sysfs file
/sys/block/bcache<?>/bcache/backing_dev_name, then people or user space
tools may know the backing device name of this bcache device.

Of cause it can be done by parsing sysfs links, but this method can be
much simpler to find the link between bcache device and backing device.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Coly Li
83ff9318c4 bcache: not use hard coded memset size in bch_cache_accounting_clear()
In stats.c:bch_cache_accounting_clear(), a hard coded number '7' is
used in memset(). It is because in struct cache_stats, there are 7
atomic_t type members. This is not good when new members added into
struct stats, the hard coded number will only clear part of memory.

This patch replaces 'sizeof(unsigned long) * 7' by more generic
'sizeof(struct cache_stats))', to avoid potential error if new
member added into struct cache_stats.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Daniel Axtens
9951379b0c bcache: never writeback a discard operation
Some users see panics like the following when performing fstrim on a
bcached volume:

[  529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  530.183928] #PF error: [normal kernel read fault]
[  530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
[  530.750887] Oops: 0000 [#1] SMP PTI
[  530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
[  531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
[  531.693137] RIP: 0010:blk_queue_split+0x148/0x620
[  531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
[  532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
[  533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
[  533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
[  533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
[  534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
[  534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
[  534.833319] FS:  00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
[  535.224098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
[  535.851759] Call Trace:
[  535.970308]  ? mempool_alloc_slab+0x15/0x20
[  536.174152]  ? bch_data_insert+0x42/0xd0 [bcache]
[  536.403399]  blk_mq_make_request+0x97/0x4f0
[  536.607036]  generic_make_request+0x1e2/0x410
[  536.819164]  submit_bio+0x73/0x150
[  536.980168]  ? submit_bio+0x73/0x150
[  537.149731]  ? bio_associate_blkg_from_css+0x3b/0x60
[  537.391595]  ? _cond_resched+0x1a/0x50
[  537.573774]  submit_bio_wait+0x59/0x90
[  537.756105]  blkdev_issue_discard+0x80/0xd0
[  537.959590]  ext4_trim_fs+0x4a9/0x9e0
[  538.137636]  ? ext4_trim_fs+0x4a9/0x9e0
[  538.324087]  ext4_ioctl+0xea4/0x1530
[  538.497712]  ? _copy_to_user+0x2a/0x40
[  538.679632]  do_vfs_ioctl+0xa6/0x600
[  538.853127]  ? __do_sys_newfstat+0x44/0x70
[  539.051951]  ksys_ioctl+0x6d/0x80
[  539.212785]  __x64_sys_ioctl+0x1a/0x20
[  539.394918]  do_syscall_64+0x5a/0x110
[  539.568674]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

We have observed it where both:
1) LVM/devmapper is involved (bcache backing device is LVM volume) and
2) writeback cache is involved (bcache cache_mode is writeback)

On one machine, we can reliably reproduce it with:

 # echo writeback > /sys/block/bcache0/bcache/cache_mode
   (not sure whether above line is required)
 # mount /dev/bcache0 /test
 # for i in {0..10}; do
	file="$(mktemp /test/zero.XXX)"
	dd if=/dev/zero of="$file" bs=1M count=256
	sync
	rm $file
    done
  # fstrim -v /test

Observing this with tracepoints on, we see the following writes:

fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4260112 + 196352 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4456464 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4718608 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5324816 + 180224 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5505040 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5767184 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 6373392 + 180224 hit 1 bypass 0
<crash>

Note the final one has different hit/bypass flags.

This is because in should_writeback(), we were hitting a case where
the partial stripe condition was returning true and so
should_writeback() was returning true early.

If that hadn't been the case, it would have hit the would_skip test, and
as would_skip == s->iop.bypass == true, should_writeback() would have
returned false.

Looking at the git history from 'commit 72c270612b ("bcache: Write out
full stripes")', it looks like the idea was to optimise for raid5/6:

       * If a stripe is already dirty, force writes to that stripe to
	 writeback mode - to help build up full stripes of dirty data

To fix this issue, make sure that should_writeback() on a discard op
never returns true.

More details of debugging:
https://www.spinics.net/lists/linux-bcache/msg06996.html

Previous reports:
 - https://bugzilla.kernel.org/show_bug.cgi?id=201051
 - https://bugzilla.kernel.org/show_bug.cgi?id=196103
 - https://www.spinics.net/lists/linux-bcache/msg06885.html

(Coly Li: minor modification to follow maximum 75 chars per line rule)

Cc: Kent Overstreet <koverstreet@google.com>
Cc: stable@vger.kernel.org
Fixes: 72c270612b ("bcache: Write out full stripes")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Mike Snitzer
fa8db4948f dm: don't use bio_trim() afterall
bio_trim() has an early return, which makes it _not_ idempotent, if the
offset is 0 and the bio's bi_size already matches the requested size.
Prior to DM, all users of bio_trim() were fine with this.  But DM has
exposed the fact that bio_trim()'s early return is incompatible with a
cloned bio whose integrity payload must be trimmed via
bio_integrity_trim().

Fix this by reverting DM back to doing the equivalent of bio_trim() but
in an idempotent manner (so bio_integrity_trim is always performed).

Follow-on work is needed to assess what benefit bio_trim()'s early
return is providing to its existing callers.

Reported-by: Milan Broz <gmazyland@gmail.com>
Fixes: 57c36519e4 ("dm: fix clone_bio() to trigger blk_recount_segments()")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-06 17:24:37 -05:00
Mikulas Patocka
645efa84f6 dm: add memory barrier before waitqueue_active
Block core changes to switch bio-based IO accounting to be percpu had a
side-effect of altering DM core to now rely on calling waitqueue_active
(in both bio-based and request-based) to check if another task is in
dm_wait_for_completion().

A memory barrier is needed before calling waitqueue_active().  DM core
doesn't piggyback on a preceding memory barrier so it must explicitly
use its own.

For more details on why using waitqueue_active() without a preceding
barrier is unsafe, please see the comment before the waitqueue_active()
definition in include/linux/wait.h.

Add the missing memory barrier by switching to using wq_has_sleeper().

Fixes: 6f75723190 ("dm: remove the pending IO accounting")
Fixes: c4576aed8d ("dm: fix request-based dm's use of dm_wait_for_completion")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-06 17:24:37 -05:00
Yufen Yu
ebda52fa1b raid1: simplify raid1_error function
Remove redundance set_bit and let code simplify.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-02-04 10:37:11 -08:00
Gustavo A. R. Silva
f1e5b6239b md-linear: use struct_size() in kzalloc()
One of the more common cases of allocation size calculations is finding the
size of a structure that has a zero-sized array at the end, along with memory
for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-02-04 10:37:11 -08:00
Alexei Naberezhnov
483cbbeddd md/raid5: fix 'out of memory' during raid cache recovery
This fixes the case when md array assembly fails because of raid cache recovery
unable to allocate a stripe, despite attempts to replay stripes and increase
cache size. This happens because stripes released by r5c_recovery_replay_stripes
and raid5_set_cache_size don't become available for allocation immediately.
Released stripes first are placed on conf->released_stripes list and require
md thread to merge them on conf->inactive_list before they can be allocated.

Patch allows final allocation attempt during cache recovery to wait for
new stripes to become availabe for allocation.

Cc: linux-raid@vger.kernel.org
Cc: Shaohua Li <shli@kernel.org>
Cc: linux-stable <stable@vger.kernel.org> # 4.10+
Fixes: b4c625c673 ("md/r5cache: r5cache recovery: part 1")
Signed-off-by: Alexei Naberezhnov <anaberezhnov@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-01-28 11:44:40 -08:00
Linus Torvalds
cffd425b90 - Fix DM crypt's parsing of extended IV arguments.
- Fix DM thinp's discard passdown to properly account for extra
   reference that is taken to guard against reallocating a block before a
   discard has been issued.
 
 - Fix bio-based DM's redundant IO accounting that was occurring for bios
   that must be split due to the nature of the DM target (e.g. dm-stripe,
   dm-thinp, etc).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJcShLOAAoJEMUj8QotnQNahKwIAKtIHVmexvytppeapY0JkW+r
 eAVvfAmy/wnAir1P4jAM5/vIFydwdiOy9XmJxoB5yZHr7iG7Yhf+N58sN0Bk8SIH
 whUmt7MI8bMYkXhJNrEhiWbDDnEc55C7xdAqAYGDH7j2lJEkdnR6hdL5fDnMnCyx
 AxbQhAIXOWUCaDZKHQ8VQquq0EDg/LElCaGPWMJ71YTjgoNWV4MvQWqLqk/KJEyb
 JfYR+5yaAVlqoysIb+gQtn4+UNA5oU8jSYe1rqrdfZ1fYx5ZYhlhp4gHM3+tD4gL
 4zEOPD0iP1hrGc+4ZgWmFT2fZVHcUHcUb8+UnkH52t5QFhnghnTQSnRsaWviqAg=
 =ew3d
 -----END PGP SIGNATURE-----

Merge tag 'for-5.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM crypt's parsing of extended IV arguments.

 - Fix DM thinp's discard passdown to properly account for extra
   reference that is taken to guard against reallocating a block before
   a discard has been issued.

 - Fix bio-based DM's redundant IO accounting that was occurring for
   bios that must be split due to the nature of the DM target (e.g.
   dm-stripe, dm-thinp, etc).

* tag 'for-5.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: add missing trace_block_split() to __split_and_process_bio()
  dm: fix dm_wq_work() to only use __split_and_process_bio() if appropriate
  dm: fix redundant IO accounting for bios that need splitting
  dm: fix clone_bio() to trigger blk_recount_segments()
  dm thin: fix passdown_double_checking_shared_status()
  dm crypt: fix parsing of extended IV arguments
2019-01-25 09:07:18 +13:00
Mike Snitzer
075c18c3e1 dm: add missing trace_block_split() to __split_and_process_bio()
Provides useful context about bio splits in blktrace.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-22 13:41:48 -05:00
Mike Snitzer
6548c7c538 dm: fix dm_wq_work() to only use __split_and_process_bio() if appropriate
Otherwise targets that don't support/expect IO splitting could resubmit
bios using code paths with unnecessary IO splitting complexity.

Depends-on: 24113d4878 ("dm: avoid indirect call in __dm_make_request")
Fixes: 978e51ba38 ("dm: optimize bio-based NVMe IO submission")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-22 13:41:40 -05:00
Mike Snitzer
a1e1cb72d9 dm: fix redundant IO accounting for bios that need splitting
The risk of redundant IO accounting was not taken into consideration
when commit 18a25da843 ("dm: ensure bio submission follows a
depth-first tree walk") introduced IO splitting in terms of recursion
via generic_make_request().

Fix this by subtracting the split bio's payload from the IO stats that
were already accounted for by start_io_acct() upon dm_make_request()
entry.  This repeat oscillation of the IO accounting, up then down,
isn't ideal but refactoring DM core's IO splitting to pre-split bios
_before_ they are accounted turned out to be an excessive amount of
change that will need a full development cycle to refine and verify.

Before this fix:

  /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so
  bios are split on 32k boundaries.

  # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
    	--iodepth=1 --ioengine=libaio --direct=1 --refill_buffers

  with debugging added:
  [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
  [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
  [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
  ...

  16M written yet 136M (278528 * 512b) accounted:
  # cat /sys/block/dm-2/stat | awk '{ print $7 }'
  278528

After this fix:

  16M written and 16M (32768 * 512b) accounted:
  # cat /sys/block/dm-2/stat | awk '{ print $7 }'
  32768

Fixes: 18a25da843 ("dm: ensure bio submission follows a depth-first tree walk")
Cc: stable@vger.kernel.org # 4.16+
Reported-by: Bryan Gurney <bgurney@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-21 11:29:27 -05:00
Mike Snitzer
57c36519e4 dm: fix clone_bio() to trigger blk_recount_segments()
DM's clone_bio() now benefits from using bio_trim() by fixing the fact
that clone_bio() wasn't clearing BIO_SEG_VALID like bio_trim() does;
which triggers blk_recount_segments() via bio_phys_segments().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-21 11:29:18 -05:00
Joe Thornber
d445bd9cec dm thin: fix passdown_double_checking_shared_status()
Commit 00a0ea33b4 ("dm thin: do not queue freed thin mapping for next
stage processing") changed process_prepared_discard_passdown_pt1() to
increment all the blocks being discarded until after the passdown had
completed to avoid them being prematurely reused.

IO issued to a thin device that breaks sharing with a snapshot, followed
by a discard issued to snapshot(s) that previously shared the block(s),
results in passdown_double_checking_shared_status() being called to
iterate through the blocks double checking their reference count is zero
and issuing the passdown if so.  So a side effect of commit 00a0ea33b4
is passdown_double_checking_shared_status() was broken.

Fix this by checking if the block reference count is greater than 1.
Also, rename dm_pool_block_is_used() to dm_pool_block_is_shared().

Fixes: 00a0ea33b4 ("dm thin: do not queue freed thin mapping for next stage processing")
Cc: stable@vger.kernel.org # 4.9+
Reported-by: ryan.p.norwood@gmail.com
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-15 16:10:41 -05:00
Marcos Paulo de Souza
6251691a92 md: Make bio_alloc_mddev use bio_alloc_bioset
bio_alloc_bioset returns a bio pointer or NULL, so we can avoid storing
the returned data into a new variable.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Acked-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-14 06:31:56 -07:00
Milan Broz
1856b9f7bc dm crypt: fix parsing of extended IV arguments
The dm-crypt cipher specification in a mapping table is defined as:
  cipher[:keycount]-chainmode-ivmode[:ivopts]
or (new crypt API format):
  capi:cipher_api_spec-ivmode[:ivopts]

For ESSIV, the parameter includes hash specification, for example:
aes-cbc-essiv:sha256

The implementation expected that additional IV option to never include
another dash '-' character.

But, with SHA3, there are names like sha3-256; so the mapping table
parser fails:

dmsetup create test --table "0 8 crypt aes-cbc-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"
  or (new crypt API format)
dmsetup create test --table "0 8 crypt capi:cbc(aes)-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"

  device-mapper: crypt: Ignoring unexpected additional cipher options
  device-mapper: table: 253:0: crypt: Error creating IV
  device-mapper: ioctl: error adding target to table

Fix the dm-crypt constructor to ignore additional dash in IV options and
also remove a bogus warning (that is ignored anyway).

Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-10 15:17:33 -05:00
Jens Axboe
dc629c211c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md into for-linus
Pull the pending 4.21 changes for md from Shaohua.

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md: fix raid10 hang issue caused by barrier
  raid10: refactor common wait code from regular read/write request
  md: remvoe redundant condition check
  lib/raid6: add option to skip algo benchmarking
  lib/raid6: sort algos in rough performance order
  lib/raid6: check for assembler SSSE3 support
  lib/raid6: avoid __attribute_const__ redefinition
  lib/raid6: add missing include for raid6test
  md: remove set but not used variable 'bi_rdev'
2019-01-03 08:21:02 -07:00
Linus Torvalds
f346b0becb Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - large KASAN update to use arm's "software tag-based mode"

 - a few misc things

 - sh updates

 - ocfs2 updates

 - just about all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits)
  kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
  memcg, oom: notify on oom killer invocation from the charge path
  mm, swap: fix swapoff with KSM pages
  include/linux/gfp.h: fix typo
  mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
  hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  memory_hotplug: add missing newlines to debugging output
  mm: remove __hugepage_set_anon_rmap()
  include/linux/vmstat.h: remove unused page state adjustment macro
  mm/page_alloc.c: allow error injection
  mm: migrate: drop unused argument of migrate_page_move_mapping()
  blkdev: avoid migration stalls for blkdev pages
  mm: migrate: provide buffer_migrate_page_norefs()
  mm: migrate: move migrate_page_lock_buffers()
  mm: migrate: lock buffers before migrate_page_move_mapping()
  mm: migration: factor out code to compute expected number of page references
  mm, page_alloc: enable pcpu_drain with zone capability
  kmemleak: add config to select auto scan
  mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
  ...
2018-12-28 16:55:46 -08:00
Linus Torvalds
4ed7bdc1eb - Eliminate a couple indirect calls from bio-based DM core.
- Fix DM to allow reads that exceed readahead limits by setting io_pages
   in the backing_dev_info.
 
 - A couple code cleanups in request-based DM.
 
 - Fix various DM targets to check for device sector overflow if
   CONFIG_LBDAF is not set.
 
 - Use u64 instead of sector_t to store iv_offset in DM crypt; sector_t
   isn't large enough on 32bit when CONFIG_LBDAF is not set.
 
 - Performance fixes to DM's kcopyd and the snapshot target focused on
   limiting memory use and workqueue stalls.
 
 - Fix typos in the integrity and writecache targets.
 
 - Log which algorithm is used for dm-crypt's encryption and
   dm-integrity's hashing.
 
 - Fix false -EBUSY errors in DM raid target's handling of check/repair
   messages.
 
 - Fix DM flakey target's corrupt_bio_byte feature to reliably corrupt
   the Nth byte in a bio's payload.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJcHu6MAAoJEMUj8QotnQNar9gIAKyJ5hH0ik3mlLhAyPcMkRU+
 MPvjTAC8V5V33VY8YqKD/TcDHY8yaGTPKSK0SAnB8Jv2tLz8mp0H18syJbMvNcFc
 MS/xWYLAZiXf5xJ4eJpo8ayDvU6DAkKtSbG7VHJ8gnx/WV8bW1b90OCIq8GN+jHo
 DPp6RDzy3LvonoZ9OPognQKUDXK1crdPDiv1Wzw9Zjd4JlruIaGgJOc1RLB7YdVK
 u5c7UD7cT+MCROEzJrctAZa5oRVhuemd83HPvwE2akRFvCSNtc7aIVtNfJ79nqL7
 19oUDfsuzf2zkiJLzO0oU4OrJ+lUGDJf/sPXmU1qlPyKdUqYKFxT+KHvBnhAQXI=
 =pCs8
 -----END PGP SIGNATURE-----

Merge tag 'for-4.21/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Eliminate a couple indirect calls from bio-based DM core.

 - Fix DM to allow reads that exceed readahead limits by setting
   io_pages in the backing_dev_info.

 - A couple code cleanups in request-based DM.

 - Fix various DM targets to check for device sector overflow if
   CONFIG_LBDAF is not set.

 - Use u64 instead of sector_t to store iv_offset in DM crypt; sector_t
   isn't large enough on 32bit when CONFIG_LBDAF is not set.

 - Performance fixes to DM's kcopyd and the snapshot target focused on
   limiting memory use and workqueue stalls.

 - Fix typos in the integrity and writecache targets.

 - Log which algorithm is used for dm-crypt's encryption and
   dm-integrity's hashing.

 - Fix false -EBUSY errors in DM raid target's handling of check/repair
   messages.

 - Fix DM flakey target's corrupt_bio_byte feature to reliably corrupt
   the Nth byte in a bio's payload.

* tag 'for-4.21/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: do not allow readahead to limit IO size
  dm raid: fix false -EBUSY when handling check/repair message
  dm rq: cleanup leftover code from recently removed q->mq_ops branching
  dm verity: log the hash algorithm implementation
  dm crypt: log the encryption algorithm implementation
  dm integrity: fix spelling mistake in workqueue name
  dm flakey: Properly corrupt multi-page bios.
  dm: Check for device sector overflow if CONFIG_LBDAF is not set
  dm crypt: use u64 instead of sector_t to store iv_offset
  dm kcopyd: Fix bug causing workqueue stalls
  dm snapshot: Fix excessive memory usage and workqueue stalls
  dm bufio: update comment in dm-bufio.c
  dm writecache: fix typo in error msg for creating writecache_flush_thread
  dm: remove indirect calls from __send_changing_extent_only()
  dm mpath: only flush workqueue when needed
  dm rq: remove unused arguments from rq_completed()
  dm: avoid indirect call in __dm_make_request
2018-12-28 15:02:26 -08:00
Linus Torvalds
0e9da3fbf7 for-4.21/block-20181221
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwb7R8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjiID/97oDjMhNT7rwpuMbHw855h62j1hEN/m+N3
 FI0uxivYoYZLD+eJRnMcBwHlKjrCX8iJQAcv9ffI3ThtFW7dnZT3atUacaZVR/Dt
 IrxdymdBP3qsmuaId5NYBug7rJ+AiqFJKjEvCcSPu5X397J4I3SEbzhfvYLJ/aZX
 16o0HJlVVIrcbmq1IP4HwiIIOaKXvPaw04L4z4fpeynRSWG7EAi8NLSnhlR4Rxbb
 BTiMkCTsjRCFdyO6da4fvNQKWmPGPa3bJkYy3qR99cvJCeIbQjRyCloQlWNJRRgi
 3eJpCHVxqFmN0/+DNTJVQEEr4H8o0AVucrLVct1Jc4pessenkpoUniP8vELqwlng
 Z2VHLkhTfCEmvFlk82grrYdNvGATRsrbswt/PlP4T7rBfr1IpDk8kXDWF59EL2dy
 ly35Sk3wJGHBl8qa+vEPXOAnaWdqJXuVGpwB4ifOIatOls8mOxwfZjiRc7x05/fC
 1O4rR2IfLwRqwoYHs0AJ+h6ohOSn1mkGezl2Tch1VSFcJUOHmuYvraTaUi6hblpA
 SslaAoEhO39hRBL0HsvsMeqVWM9uzqvFkLDCfNPdiA81H1258CIbo4vF8z6czCIS
 eeXnTJxVhPVbZgb3a1a93SPwM6KIDZFoIijyd+NqjpU94thlnhYD0QEcKJIKH7os
 2p4aHs6ktw==
 =TRdW
 -----END PGP SIGNATURE-----

Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "This is the main pull request for block/storage for 4.21.

  Larger than usual, it was a busy round with lots of goodies queued up.
  Most notable is the removal of the old IO stack, which has been a long
  time coming. No new features for a while, everything coming in this
  week has all been fixes for things that were previously merged.

  This contains:

   - Use atomic counters instead of semaphores for mtip32xx (Arnd)

   - Cleanup of the mtip32xx request setup (Christoph)

   - Fix for circular locking dependency in loop (Jan, Tetsuo)

   - bcache (Coly, Guoju, Shenghui)
      * Optimizations for writeback caching
      * Various fixes and improvements

   - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
      * host and target support for NVMe over TCP
      * Error log page support
      * Support for separate read/write/poll queues
      * Much improved polling
      * discard OOM fallback
      * Tracepoint improvements

   - lightnvm (Hans, Hua, Igor, Matias, Javier)
      * Igor added packed metadata to pblk. Now drives without metadata
        per LBA can be used as well.
      * Fix from Geert on uninitialized value on chunk metadata reads.
      * Fixes from Hans and Javier to pblk recovery and write path.
      * Fix from Hua Su to fix a race condition in the pblk recovery
        code.
      * Scan optimization added to pblk recovery from Zhoujie.
      * Small geometry cleanup from me.

   - Conversion of the last few drivers that used the legacy path to
     blk-mq (me)

   - Removal of legacy IO path in SCSI (me, Christoph)

   - Removal of legacy IO stack and schedulers (me)

   - Support for much better polling, now without interrupts at all.
     blk-mq adds support for multiple queue maps, which enables us to
     have a map per type. This in turn enables nvme to have separate
     completion queues for polling, which can then be interrupt-less.
     Also means we're ready for async polled IO, which is hopefully
     coming in the next release.

   - Killing of (now) unused block exports (Christoph)

   - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

   - Support for zoned testing with null_blk (Masato)

   - sx8 conversion to per-host tag sets (Christoph)

   - IO priority improvements (Damien)

   - mq-deadline zoned fix (Damien)

   - Ref count blkcg series (Dennis)

   - Lots of blk-mq improvements and speedups (me)

   - sbitmap scalability improvements (me)

   - Make core inflight IO accounting per-cpu (Mikulas)

   - Export timeout setting in sysfs (Weiping)

   - Cleanup the direct issue path (Jianchao)

   - Export blk-wbt internals in block debugfs for easier debugging
     (Ming)

   - Lots of other fixes and improvements"

* tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
  kyber: use sbitmap add_wait_queue/list_del wait helpers
  sbitmap: add helpers for add/del wait queue handling
  block: save irq state in blkg_lookup_create()
  dm: don't reuse bio for flushes
  nvme-pci: trace SQ status on completions
  nvme-rdma: implement polling queue map
  nvme-fabrics: allow user to pass in nr_poll_queues
  nvme-fabrics: allow nvmf_connect_io_queue to poll
  nvme-core: optionally poll sync commands
  block: make request_to_qc_t public
  nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
  nvme-tcp: fix endianess annotations
  nvmet-tcp: fix endianess annotations
  nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
  nvme-pci: only set nr_maps to 2 if poll queues are supported
  nvmet: use a macro for default error location
  nvmet: fix comparison of a u16 with -1
  blk-mq: enable IO poll if .nr_queues of type poll > 0
  blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
  blk-mq: skip zero-queue maps in blk_mq_map_swqueue
  ...
2018-12-28 13:19:59 -08:00
Arun KS
ca79b0c211 mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things.  It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Linus Torvalds
b71acb0e37 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add 1472-byte test to tcrypt for IPsec
   - Reintroduced crypto stats interface with numerous changes
   - Support incremental algorithm dumps

  Algorithms:
   - Add xchacha12/20
   - Add nhpoly1305
   - Add adiantum
   - Add streebog hash
   - Mark cts(cbc(aes)) as FIPS allowed

  Drivers:
   - Improve performance of arm64/chacha20
   - Improve performance of x86/chacha20
   - Add NEON-accelerated nhpoly1305
   - Add SSE2 accelerated nhpoly1305
   - Add AVX2 accelerated nhpoly1305
   - Add support for 192/256-bit keys in gcmaes AVX
   - Add SG support in gcmaes AVX
   - ESN for inline IPsec tx in chcr
   - Add support for CryptoCell 703 in ccree
   - Add support for CryptoCell 713 in ccree
   - Add SM4 support in ccree
   - Add SM3 support in ccree
   - Add support for chacha20 in caam/qi2
   - Add support for chacha20 + poly1305 in caam/jr
   - Add support for chacha20 + poly1305 in caam/qi2
   - Add AEAD cipher support in cavium/nitrox"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (130 commits)
  crypto: skcipher - remove remnants of internal IV generators
  crypto: cavium/nitrox - Fix build with !CONFIG_DEBUG_FS
  crypto: salsa20-generic - don't unnecessarily use atomic walk
  crypto: skcipher - add might_sleep() to skcipher_walk_virt()
  crypto: x86/chacha - avoid sleeping under kernel_fpu_begin()
  crypto: cavium/nitrox - Added AEAD cipher support
  crypto: mxc-scc - fix build warnings on ARM64
  crypto: api - document missing stats member
  crypto: user - remove unused dump functions
  crypto: chelsio - Fix wrong error counter increments
  crypto: chelsio - Reset counters on cxgb4 Detach
  crypto: chelsio - Handle PCI shutdown event
  crypto: chelsio - cleanup:send addr as value in function argument
  crypto: chelsio - Use same value for both channel in single WR
  crypto: chelsio - Swap location of AAD and IV sent in WR
  crypto: chelsio - remove set but not used variable 'kctx_len'
  crypto: ux500 - Use proper enum in hash_set_dma_transfer
  crypto: ux500 - Use proper enum in cryp_set_dma_transfer
  crypto: aesni - Add scatter/gather avx stubs, and use them in C
  crypto: aesni - Introduce partial block macro
  ..
2018-12-27 13:53:32 -08:00
Guoqing Jiang
e820d55cb9 md: fix raid10 hang issue caused by barrier
When both regular IO and resync IO happen at the same time,
and if we also need to split regular. Then we can see tasks
hang due to barrier.

1. resync thread
[ 1463.757205] INFO: task md1_resync:5215 blocked for more than 480 seconds.
[ 1463.757207]       Not tainted 4.19.5-1-default #1
[ 1463.757209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1463.757212] md1_resync      D    0  5215      2 0x80000000
[ 1463.757216] Call Trace:
[ 1463.757223]  ? __schedule+0x29a/0x880
[ 1463.757231]  ? raise_barrier+0x8d/0x140 [raid10]
[ 1463.757236]  schedule+0x78/0x110
[ 1463.757243]  raise_barrier+0x8d/0x140 [raid10]
[ 1463.757248]  ? wait_woken+0x80/0x80
[ 1463.757257]  raid10_sync_request+0x1f6/0x1e30 [raid10]
[ 1463.757265]  ? _raw_spin_unlock_irq+0x22/0x40
[ 1463.757284]  ? is_mddev_idle+0x125/0x137 [md_mod]
[ 1463.757302]  md_do_sync.cold.78+0x404/0x969 [md_mod]
[ 1463.757311]  ? wait_woken+0x80/0x80
[ 1463.757336]  ? md_rdev_init+0xb0/0xb0 [md_mod]
[ 1463.757351]  md_thread+0xe9/0x140 [md_mod]
[ 1463.757358]  ? _raw_spin_unlock_irqrestore+0x2e/0x60
[ 1463.757364]  ? __kthread_parkme+0x4c/0x70
[ 1463.757369]  kthread+0x112/0x130
[ 1463.757374]  ? kthread_create_worker_on_cpu+0x40/0x40
[ 1463.757380]  ret_from_fork+0x3a/0x50

2. regular IO
[ 1463.760679] INFO: task kworker/0:8:5367 blocked for more than 480 seconds.
[ 1463.760683]       Not tainted 4.19.5-1-default #1
[ 1463.760684] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1463.760687] kworker/0:8     D    0  5367      2 0x80000000
[ 1463.760718] Workqueue: md submit_flushes [md_mod]
[ 1463.760721] Call Trace:
[ 1463.760731]  ? __schedule+0x29a/0x880
[ 1463.760741]  ? wait_barrier+0xdd/0x170 [raid10]
[ 1463.760746]  schedule+0x78/0x110
[ 1463.760753]  wait_barrier+0xdd/0x170 [raid10]
[ 1463.760761]  ? wait_woken+0x80/0x80
[ 1463.760768]  raid10_write_request+0xf2/0x900 [raid10]
[ 1463.760774]  ? wait_woken+0x80/0x80
[ 1463.760778]  ? mempool_alloc+0x55/0x160
[ 1463.760795]  ? md_write_start+0xa9/0x270 [md_mod]
[ 1463.760801]  ? try_to_wake_up+0x44/0x470
[ 1463.760810]  raid10_make_request+0xc1/0x120 [raid10]
[ 1463.760816]  ? wait_woken+0x80/0x80
[ 1463.760831]  md_handle_request+0x121/0x190 [md_mod]
[ 1463.760851]  md_make_request+0x78/0x190 [md_mod]
[ 1463.760860]  generic_make_request+0x1c6/0x470
[ 1463.760870]  raid10_write_request+0x77a/0x900 [raid10]
[ 1463.760875]  ? wait_woken+0x80/0x80
[ 1463.760879]  ? mempool_alloc+0x55/0x160
[ 1463.760895]  ? md_write_start+0xa9/0x270 [md_mod]
[ 1463.760904]  raid10_make_request+0xc1/0x120 [raid10]
[ 1463.760910]  ? wait_woken+0x80/0x80
[ 1463.760926]  md_handle_request+0x121/0x190 [md_mod]
[ 1463.760931]  ? _raw_spin_unlock_irq+0x22/0x40
[ 1463.760936]  ? finish_task_switch+0x74/0x260
[ 1463.760954]  submit_flushes+0x21/0x40 [md_mod]

So resync io is waiting for regular write io to complete to
decrease nr_pending (conf->barrier++ is called before waiting).
The regular write io splits another bio after call wait_barrier
which call nr_pending++, then the splitted bio would continue
with raid10_write_request -> wait_barrier, so the splitted bio
has to wait for barrier to be zero, then deadlock happens as
follows.

	resync io		regular io

	raise_barrier
				wait_barrier
				generic_make_request
				wait_barrier

To resolve the issue, we need to call allow_barrier to decrease
nr_pending before generic_make_request since regular IO is not
issued to underlying devices, and wait_barrier is called again
to ensure no internal IO happening.

Fixes: fc9977dd06 ("md/raid10: simplify the splitting of requests.")
Reported-and-tested-by: Siniša Bandin <sinisa@4net.rs>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-12-20 08:53:24 -08:00
Guoqing Jiang
caea3c47ad raid10: refactor common wait code from regular read/write request
Both raid10_read_request and raid10_write_request share
the same code at the beginning of them, so introduce
regular_request_wait to clean up code, and call it in
both request functions.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-12-20 08:53:24 -08:00
Chengguang Xu
37b22c2894 md: remvoe redundant condition check
mempool_destroy() can handle NULL pointer correctly,
so there is no need to check NULL pointer before calling
mempool_destroy().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-12-20 08:53:24 -08:00
Yue Haibing
f91389c8d2 md: remove set but not used variable 'bi_rdev'
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/md/md.c: In function 'md_integrity_add_rdev':
drivers/md/md.c:2149:24: warning:
 variable 'bi_rdev' set but not used [-Wunused-but-set-variable]

It not used any more after commit
  1501efadc5 ("md/raid: only permit hot-add of compatible integrity profiles")

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-12-20 08:53:23 -08:00
Jens Axboe
dbe3ece128 dm: don't reuse bio for flushes
DM currently has a statically allocated bio that it uses to issue empty
flushes. It doesn't submit this bio, it just uses it for maintaining
state while setting up clones. Multiple users can access this bio at the
same time. This wasn't previously an issue, even if it was a bit iffy,
but with the blkg associations it can become one.

We setup the blkg association, then clone bio's and submit, then remove
the blkg assocation again. But since we can have multiple tasks doing
this at the same time, against multiple blkg's, then we can either lose
references to a blkg, or put it twice. The latter causes complaints on
the percpu ref being <= 0 when released, and can cause use-after-free as
well. Ming reports that xfstest generic/475 triggers this:

------------[ cut here ]------------
percpu ref (blkg_release) <= 0 (0) after switching to atomic
WARNING: CPU: 13 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x2c9/0x4a0

Switch to just using an on-stack bio for this, and get rid of the
embedded bio.

Fixes: 5cdf2e3fea ("blkcg: associate blkg when associating a device")
Reported-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-19 09:13:34 -07:00
Jaegeuk Kim
c6d6e9b0f6 dm: do not allow readahead to limit IO size
Update DM to set the bdi's io_pages.  This fixes reads to be capped at
the device's max request size (even if user's read IO exceeds the
established readahead setting).

Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Cc: stable@vger.kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 14:23:41 -05:00
Heinz Mauelshagen
74694bcbdf dm raid: fix false -EBUSY when handling check/repair message
Sending a check/repair message infrequently leads to -EBUSY instead of
properly identifying an active resync.  This occurs because
raid_message() is testing recovery bits in a racy way.

Fix by calling decipher_sync_action() from raid_message() to properly
identify the idle state of the RAID device.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 13:48:35 -05:00
Mike Snitzer
34743bfdde dm rq: cleanup leftover code from recently removed q->mq_ops branching
When commit 6a23e05c2f ("dm: remove legacy request-based IO path")
removed some q->mq_ops branching from map_request() it left in place a
goto that was only needed if that branching (and conditional 'r'
assignment) existed.  Now that the branching is gone map_request()'s
goto can be removed too.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:27 -05:00
Eric Biggers
bbf6a56692 dm verity: log the hash algorithm implementation
Log the hash algorithm's driver name when a dm-verity target is created.
This will help people determine whether the expected implementation is
being used.  It can make an enormous difference; e.g., SHA-256 on ARM
can be 8x faster with the crypto extensions than without.  It can also
be useful to know if an implementation using an external crypto
accelerator is being used instead of a software implementation.

Example message:

[   35.281945] device-mapper: verity: sha256 using implementation "sha256-ce"

We've already found the similar message in fs/crypto/keyinfo.c to be
very useful.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:27 -05:00
Eric Biggers
af331ebae7 dm crypt: log the encryption algorithm implementation
Log the encryption algorithm's driver name when a dm-crypt target is
created.  This will help people determine whether the expected
implementation is being used.  In some cases we've seen people do
benchmarks and reject using encryption for performance reasons, when in
fact they used a much slower implementation than was possible on the
hardware.  It can make an enormous difference; e.g., AES-XTS on ARM can
be over 10x faster with the crypto extensions than without.  It can also
be useful to know if an implementation using an external crypto
accelerator is being used instead of a software implementation.

Example message:

[   29.307629] device-mapper: crypt: xts(aes) using implementation "xts-aes-ce"

We've already found the similar message in fs/crypto/keyinfo.c to be
very useful.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:27 -05:00
Colin Ian King
e8c2566f83 dm integrity: fix spelling mistake in workqueue name
Rename the workqueue from dm-intergrity-recalc to dm-integrity-recalc.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:27 -05:00
Sweet Tea
a00f5276e2 dm flakey: Properly corrupt multi-page bios.
The flakey target is documented to be able to corrupt the Nth byte in
a bio, but does not corrupt byte indices after the first biovec in the
bio. Change the corrupting function to actually corrupt the Nth byte
no matter in which biovec that index falls.

A test device generating two-page bios, atop a flakey device configured
to corrupt a byte index on the second page, verified both the failure
to corrupt before this patch and the expected corruption after this
change.

Signed-off-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:27 -05:00
Milan Broz
ef87bfc24f dm: Check for device sector overflow if CONFIG_LBDAF is not set
Reference to a device in device-mapper table contains offset in sectors.

If the sector_t is 32bit integer (CONFIG_LBDAF is not set), then
several device-mapper targets can overflow this offset and validity
check is then performed on a wrong offset and a wrong table is activated.

See for example (on 32bit without CONFIG_LBDAF) this overflow:

  # dmsetup create test --table "0 2048 linear /dev/sdg 4294967297"
  # dmsetup table test
  0 2048 linear 8:96 1

This patch adds explicit check for overflow if the offset is sector_t type.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
AliOS system security
8d683dcd65 dm crypt: use u64 instead of sector_t to store iv_offset
The iv_offset in the mapping table of crypt target is a 64bit number when
IV algorithm is plain64, plain64be, essiv or benbi. It will be assigned to
iv_offset of struct crypt_config, cc_sector of struct convert_context and
iv_sector of struct dm_crypt_request. These structures members are defined
as a sector_t. But sector_t is 32bit when CONFIG_LBDAF is not set in 32bit
kernel. In this situation sector_t is not big enough to store the 64bit
iv_offset.

Here is a reproducer.
Prepare test image and device (loop is automatically allocated by cryptsetup):

  # dd if=/dev/zero of=tst.img bs=1M count=1
  # echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \
  --skip 500000000000000000 tst.img test

On 32bit system (use IV offset value that overflows to 64bit; CONFIG_LBDAF if off)
and device checksum is wrong:

  # dmsetup table test --showkeys
  0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0

  # sha256sum /dev/mapper/test
  533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test

On 64bit system (and on 32bit system with the patch), table and checksum is now correct:

  # dmsetup table test --showkeys
  0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0

  # sha256sum /dev/mapper/test
  5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test

Signed-off-by: AliOS system security <alios_sys_security@linux.alibaba.com>
Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
Nikos Tsironis
d7e6b8dfc7 dm kcopyd: Fix bug causing workqueue stalls
When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
submitting copy jobs with a source size of 0, the jobs are pushed
directly to the complete_jobs list, which could be under processing by
the kcopyd thread. As a result, the kcopyd thread can continue running
completed jobs indefinitely, without releasing the CPU, as long as
someone keeps submitting new completed jobs through the aforementioned
paths. Processing of work items, queued for execution on the same CPU as
the currently running kcopyd thread, is thus stalled for excessive
amounts of time, hurting performance.

Running the following test, from the device mapper test suite [1],

  dmtest run --suite snapshot -n parallel_io_to_many_snaps_N

, with 8 active snapshots, we get, in dmesg, messages like the
following:

[68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
[68899.949282] Showing busy workqueues and worker pools:
[68899.949288] workqueue events: flags=0x0
[68899.949295]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[68899.949306]     pending: vmstat_shepherd, cache_reap
[68899.949331] workqueue mm_percpu_wq: flags=0x8
[68899.949337]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949345]     pending: vmstat_update
[68899.949387] workqueue dm_bufio_cache: flags=0x8
[68899.949392]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[68899.949400]     pending: work_fn [dm_bufio]
[68899.949423] workqueue kcopyd: flags=0x8
[68899.949429]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949437]     pending: do_work [dm_mod]
[68899.949452] workqueue kcopyd: flags=0x8
[68899.949458]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[68899.949466]     in-flight: 13:do_work [dm_mod]
[68899.949474]     pending: do_work [dm_mod]
[68899.949487] workqueue kcopyd: flags=0x8
[68899.949493]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949501]     pending: do_work [dm_mod]
[68899.949515] workqueue kcopyd: flags=0x8
[68899.949521]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949529]     pending: do_work [dm_mod]
[68899.949541] workqueue kcopyd: flags=0x8
[68899.949547]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949555]     pending: do_work [dm_mod]
[68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084

Fix this by splitting the complete_jobs list into two parts: A user
facing part, named callback_jobs, and one used internally by kcopyd,
retaining the name complete_jobs. dm_kcopyd_do_callback() and
dispatch_job() now push their jobs to the callback_jobs list, which is
spliced to the complete_jobs list once, every time the kcopyd thread
wakes up. This prevents kcopyd from hogging the CPU indefinitely and
causing workqueue stalls.

Re-running the aforementioned test:

  * Workqueue stalls are eliminated
  * The maximum writing time among all targets is reduced from 09m37.10s
    to 06m04.85s and the total run time of the test is reduced from
    10m43.591s to 7m19.199s

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
Nikos Tsironis
721b1d98fb dm snapshot: Fix excessive memory usage and workqueue stalls
kcopyd has no upper limit to the number of jobs one can allocate and
issue. Under certain workloads this can lead to excessive memory usage
and workqueue stalls. For example, when creating multiple dm-snapshot
targets with a 4K chunk size and then writing to the origin through the
page cache. Syncing the page cache causes a large number of BIOs to be
issued to the dm-snapshot origin target, which itself issues an even
larger (because of the BIO splitting taking place) number of kcopyd
jobs.

Running the following test, from the device mapper test suite [1],

  dmtest run --suite snapshot -n many_snapshots_of_same_volume_N

, with 8 active snapshots, results in the kcopyd job slab cache growing
to 10G. Depending on the available system RAM this can lead to the OOM
killer killing user processes:

[463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
              nodemask=(null), order=1, oom_score_adj=0
[463.492894] kthreadd cpuset=/ mems_allowed=0
[463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
[463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[463.492952] Call Trace:
[463.492964]  dump_stack+0x7d/0xbb
[463.492973]  dump_header+0x6b/0x2fc
[463.492987]  ? lockdep_hardirqs_on+0xee/0x190
[463.493012]  oom_kill_process+0x302/0x370
[463.493021]  out_of_memory+0x113/0x560
[463.493030]  __alloc_pages_slowpath+0xf40/0x1020
[463.493055]  __alloc_pages_nodemask+0x348/0x3c0
[463.493067]  cache_grow_begin+0x81/0x8b0
[463.493072]  ? cache_grow_begin+0x874/0x8b0
[463.493078]  fallback_alloc+0x1e4/0x280
[463.493092]  kmem_cache_alloc_node+0xd6/0x370
[463.493098]  ? copy_process.part.31+0x1c5/0x20d0
[463.493105]  copy_process.part.31+0x1c5/0x20d0
[463.493115]  ? __lock_acquire+0x3cc/0x1550
[463.493121]  ? __switch_to_asm+0x34/0x70
[463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
[463.493135]  ? finish_task_switch+0x90/0x280
[463.493165]  _do_fork+0xe0/0x6d0
[463.493191]  ? kthreadd+0x19f/0x220
[463.493233]  kernel_thread+0x25/0x30
[463.493235]  kthreadd+0x1bf/0x220
[463.493242]  ? kthread_create_on_cpu+0x90/0x90
[463.493248]  ret_from_fork+0x3a/0x50
[463.493279] Mem-Info:
[463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
[463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
[463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
[463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
[463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
[463.493285]  free:33623 free_pcp:2392 free_cma:0
...
[463.493489] Unreclaimable slab info:
[463.493513] Name                      Used          Total
[463.493522] bio-6                   1028KB       1028KB
[463.493525] bio-5                   1028KB       1028KB
[463.493528] dm_snap_pending_exception     236783KB     243789KB
[463.493531] dm_exception              41KB         42KB
[463.493534] bio-4                   1216KB       1216KB
[463.493537] bio-3                 439396KB     439396KB
[463.493539] kcopyd_job           6973427KB    6973427KB
...
[463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
[463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
[463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Moreover, issuing a large number of kcopyd jobs results in kcopyd
hogging the CPU, while processing them. As a result, processing of work
items, queued for execution on the same CPU as the currently running
kcopyd thread, is stalled for long periods of time, hurting performance.
Running the aforementioned test we get, in dmesg, messages like the
following:

[67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
[67501.195586] Showing busy workqueues and worker pools:
[67501.195591] workqueue events: flags=0x0
[67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195611]     pending: cache_reap
[67501.195641] workqueue mm_percpu_wq: flags=0x8
[67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195656]     pending: vmstat_update
[67501.195682] workqueue kblockd: flags=0x18
[67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[67501.195698]     pending: blk_timeout_work
[67501.195753] workqueue kcopyd: flags=0x8
[67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195768]     pending: do_work [dm_mod]
[67501.195802] workqueue kcopyd: flags=0x8
[67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195817]     pending: do_work [dm_mod]
[67501.195834] workqueue kcopyd: flags=0x8
[67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195848]     pending: do_work [dm_mod]
[67501.195881] workqueue kcopyd: flags=0x8
[67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195896]     pending: do_work [dm_mod]
[67501.195920] workqueue kcopyd: flags=0x8
[67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
[67501.195935]     in-flight: 67:do_work [dm_mod]
[67501.195945]     pending: do_work [dm_mod]
[67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765

The root cause for these issues is the way dm-snapshot uses kcopyd. In
particular, the lack of an explicit or implicit limit to the maximum
number of in-flight COW jobs. The merging path is not affected because
it implicitly limits the in-flight kcopyd jobs to one.

Fix these issues by using a semaphore to limit the maximum number of
in-flight kcopyd jobs. We grab the semaphore before allocating a new
kcopyd job in start_copy() and start_full_bio() and release it after the
job finishes in copy_callback().

The initial semaphore value is configurable through a module parameter,
to allow fine tuning the maximum number of in-flight COW jobs. Setting
this parameter to zero initializes the semaphore to INT_MAX.

A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
value was decided experimentally as a trade-off between memory
consumption, stalling the kernel's workqueues and maintaining a high
enough throughput.

Re-running the aforementioned test:

  * Workqueue stalls are eliminated
  * kcopyd's job slab cache uses a maximum of 130MB
  * The time taken by the test to write to the snapshot-origin target is
    reduced from 05m20.48s to 03m26.38s

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
Shenghui Wang
ef9923739e dm bufio: update comment in dm-bufio.c
* Hashtable has been replaced by rbtree to manage buffers.
  Update the comment.
* Fix typo in the comment for dm_bufio_issue_flush

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
Shenghui Wang
e8ea141a0f dm writecache: fix typo in error msg for creating writecache_flush_thread
The error msg should be "flush thread" instead of "endio thread"
for writecache_flush_thread.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
Mike Snitzer
53b4716870 dm: remove indirect calls from __send_changing_extent_only()
No need to be so fancy.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
wuzhouhui
935fcc56ab dm mpath: only flush workqueue when needed
The workqueues are shared by many multipath devices, only flush whole
workqueue when necessary.  Otherwise, we just flush works as needed.

Signed-off-by: wuzhouhui <wuzhouhui14@mails.ucas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:25 -05:00
Mike Snitzer
2adc5c559a dm rq: remove unused arguments from rq_completed()
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:25 -05:00
Mikulas Patocka
24113d4878 dm: avoid indirect call in __dm_make_request
Indirect calls are inefficient because of retpolines that are used for
spectre workaround. This patch replaces an indirect call with a condition
(that can be predicted by the branch predictor).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:25 -05:00
Jens Axboe
3c94d83cb3 blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
There's a single user of this function, dm, and dm just wants
to check if IO is inflight, not that it's just allocated.

This fixes a hang with srp/002 in blktests with dm, where it tries
to suspend but waits for inflight IO to finish first. As it checks
for just allocated requests, this fails.

Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-17 21:31:42 -07:00
Guoju Fang
e78bd0d26f bcache: print number of keys in trace_bcache_journal_write
Sometimes flush journal may be very frequent, so it's useful to dump
number of keys every time write journal.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Coly Li
cc38ca7ed5 bcache: set writeback_percent in a flexible range
Because CUTOFF_WRITEBACK is defined as 40, so before the changes of
dynamic cutoff writeback values, writeback_percent is limited to [0,
CUTOFF_WRITEBACK]. Any value larger than CUTOFF_WRITEBACK will be fixed
up to 40.

Now cutof writeback limit is a dynamic value bch_cutoff_writeback, so
the range of writeback_percent can be a more flexible range as [0,
bch_cutoff_writeback]. The flexibility is, it can be expended to a
larger or smaller range than [0, 40], depends on how value
bch_cutoff_writeback is specified.

The default value is still strongly recommended to most of users for
most of workloads. But for people who want to do research on bcache
writeback perforamnce tuning, they may have chance to specify more
flexible writeback_percent in range [0, 70].

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Coly Li
9aaf516546 bcache: make cutoff_writeback and cutoff_writeback_sync tunable
Currently the cutoff writeback and cutoff writeback sync thresholds are
defined by CUTOFF_WRITEBACK (40) and CUTOFF_WRITEBACK_SYNC (70) as
static values. Most of time these they work fine, but when people want
to do research on bcache writeback mode performance tuning, there is no
chance to modify the soft and hard cutoff writeback values.

This patch introduces two module parameters bch_cutoff_writeback_sync
and bch_cutoff_writeback which permit people to tune the values when
loading bcache.ko. If they are not specified by module loading, current
values CUTOFF_WRITEBACK_SYNC and CUTOFF_WRITEBACK will be used as
default and nothing changes.

When people want to tune this two values,
- cutoff_writeback can be set in range [1, 70]
- cutoff_writeback_sync can be set in range [1, 90]
- cutoff_writeback always <= cutoff_writeback_sync

The default values are strongly recommended to most of users for most of
workloads. Anyway, if people wants to take their own risk to do research
on new writeback cutoff tuning for their own workload, now they can make
it.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Coly Li
009673d02f bcache: add MODULE_DESCRIPTION information
This patch moves MODULE_AUTHOR and MODULE_LICENSE to end of super.c, and
add MODULE_DESCRIPTION("Bcache: a Linux block layer cache").

This is preparation for adding module parameters.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Coly Li
7a671d8ef8 bcache: option to automatically run gc thread after writeback
The option gc_after_writeback is disabled by default, because garbage
collection will discard SSD data which drops cached data.

Echo 1 into /sys/fs/bcache/<UUID>/internal/gc_after_writeback will
enable this option, which wakes up gc thread when writeback accomplished
and all cached data is clean.

This option is helpful for people who cares writing performance more. In
heavy writing workload, all cached data can be clean only happens when
writeback thread cleans all cached data in I/O idle time. In such
situation a following gc running may help to shrink bcache B+ tree and
discard more clean data, which may be helpful for future writing
requests.

If you are not sure whether this is helpful for your own workload,
please leave it as disabled by default.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Coly Li
cb07ad6368 bcache: introduce force_wake_up_gc()
Garbage collection thread starts to work when c->sectors_to_gc is
negative value, otherwise nothing will happen even the gc thread is
woken up by wake_up_gc().

force_wake_up_gc() sets c->sectors_to_gc to -1 before calling
wake_up_gc(), then gc thread may have chance to run if no one else sets
c->sectors_to_gc to a positive value before gc_should_run().

This routine can be called where the gc thread is woken up and required
to run in force.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
f383ae300c bcache: cannot set writeback_running via sysfs if no writeback kthread created
"echo 1 > writeback_running" marks writeback_running even if no
writeback kthread created as "d_strtoul(writeback_running)" will simply
set dc-> writeback_running without checking the existence of
dc->writeback_thread.

Add check for setting writeback_running via sysfs: if no writeback
kthread available, reject setting to 1.

v2 -> v3:
  * Make message on wrong assignment more clear.
  * Print name of bcache device instead of name of backing device.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
79b791466e bcache: do not mark writeback_running too early
A fresh backing device is not attached to any cache_set, and
has no writeback kthread created until first attached to some
cache_set.

But bch_cached_dev_writeback_init run
"
	dc->writeback_running		= true;
	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING,
			&dc->disk.flags));
"
for any newly formatted backing devices.

For a fresh standalone backing device, we can get something like
following even if no writeback kthread created:
------------------------
/sys/block/bcache0/bcache# cat writeback_running
1
/sys/block/bcache0/bcache# cat writeback_rate_debug
rate:		512.0k/sec
dirty:		0.0k
target:		0.0k
proportional:	0.0k
integral:	0.0k
change:		0.0k/sec
next io:	-15427384ms

The none ZERO fields are misleading as no alive writeback kthread yet.

Set dc->writeback_running false as no writeback thread created in
bch_cached_dev_writeback_init().

We have writeback thread created and woken up in bch_cached_dev_writeback
_start(). Set dc->writeback_running true before bch_writeback_queue()
called, as a writeback thread will check if dc->writeback_running is true
before writing back dirty data, and hung if false detected.

After the change, we can get the following output for a fresh standalone
backing device:
-----------------------
/sys/block/bcache0/bcache$ cat writeback_running
0
/sys/block/bcache0/bcache# cat writeback_rate_debug
rate:		0.0k/sec
dirty:		0.0k
target:		0.0k
proportional:	0.0k
integral:	0.0k
change:		0.0k/sec
next io:	0ms

v1 -> v2:
  Set dc->writeback_running before bch_writeback_queue() called,

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
4e361e020e bcache: update comment in sysfs.c
We have struct cached_dev allocated by kzalloc in register_bcache(),
which initializes all the fields of cached_dev with 0s. And commit
ce4c3e19e5 ("bcache: Replace bch_read_string_list() by
__sysfs_match_string()") has remove the string "default".

Update the comment.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
3db4d0783e bcache: update comment for bch_data_insert
commit 220bb38c21 ("bcache: Break up struct search") introduced
changes to struct search and s->iop. bypass/bio are fields of struct
data_insert_op now. Update the comment.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
ae17102316 bcache: do not check if debug dentry is ERR or NULL explicitly on remove
debugfs_remove and debugfs_remove_recursive will check if the dentry
pointer is NULL or ERR, and will do nothing in that case.

Remove the check in cache_set_free and bch_debug_init.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
d2f96f487f bcache: add comment for cache_set->fill_iter
We have the following define for btree iterator:
	struct btree_iter {
		size_t size, used;
	#ifdef CONFIG_BCACHE_DEBUG
		struct btree_keys *b;
	#endif
		struct btree_iter_set {
			struct bkey *k, *end;
		} data[MAX_BSETS];
	};

We can see that the length of data[] field is static MAX_BSETS, which is
defined as 4 currently.

But a btree node on disk could have too many bsets for an iterator to fit
on the stack - maybe far more that MAX_BSETS. Have to dynamically allocate
space to host more btree_iter_sets.

bch_cache_set_alloc() will make sure the pool cache_set->fill_iter can
allocate an iterator equipped with enough room that can host
	(sb.bucket_size / sb.block_size)
btree_iter_sets, which is more than static MAX_BSETS.

bch_btree_node_read_done() will use that pool to allocate one iterator, to
host many bsets in one btree node.

Add more comment around cache_set->fill_iter to make code less confusing.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Mike Snitzer
2af6c0703d dm thin: bump target version
Decoupled version bump from commit f6c367585d ("dm thin: send event
about thin-pool state change _after_ making it") because version bumps
just create conflicts when backporting to the stable trees.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-12 09:39:54 -05:00
Mike Snitzer
f6c367585d dm thin: send event about thin-pool state change _after_ making it
Sending a DM event before a thin-pool state change is about to happen is
a bug.  It wasn't realized until it became clear that userspace response
to the event raced with the actual state change that the event was
meant to notify about.

Fix this by first updating internal thin-pool state to reflect what the
DM event is being issued about.  This fixes a long-standing racey/buggy
userspace device-mapper-test-suite 'resize_io' test that would get an
event but not find the state it was looking for -- so it would just go
on to hang because no other events caused the test to reevaluate the
thin-pool's state.

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-11 15:19:26 -05:00
Mike Snitzer
c4576aed8d dm: fix request-based dm's use of dm_wait_for_completion
The md->wait waitqueue is used by both bio-based and request-based DM.
Commit dbd3bbd291 ("dm rq: leverage blk_mq_queue_busy() to check for
outstanding IO") lost sight of the requirement that
dm_wait_for_completion() must work with all types of DM devices.

Fix md_in_flight() to call the blk-mq or bio-based method accordingly.

Fixes: dbd3bbd291 ("dm rq: leverage blk_mq_queue_busy() to check for outstanding IO")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 07:40:02 -07:00
Jens Axboe
b7934ba414 dm: fix inflight IO check
After switching to percpu inflight counters, the inflight check
is totally buggy. It's perfectly valid for some counters to be
non-zero while having a total inflight IO count of 0, that's how
these kinds of counters work (inc on one CPU, dec on another).
Fix the md_in_flight() check to sum all counters before returning
a false positive, potentially.

While at it, remove the inflight read for IO completion. We don't
need it, just wake anyone that's waiting for the IO count to drop
to zero. The caller needs to re-check that value anyway when woken,
which it does.

Fixes: 6f75723190 ("dm: remove the pending IO accounting")
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 18:10:34 -07:00
Mikulas Patocka
6f75723190 dm: remove the pending IO accounting
Remove the "pending" atomic counters, that duplicate block-core's
in_flight counters, and update md_in_flight() to look at percpu
in_flight counters.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:38 -07:00
Mike Snitzer
112f158f66 block: stop passing 'cpu' to all percpu stats methods
All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to allow of them.  Just call
smp_processor_id() as needed.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Mike Snitzer
dbd3bbd291 dm rq: leverage blk_mq_queue_busy() to check for outstanding IO
Now that request-based dm-multipath only supports blk-mq, make use of
the newly introduced blk_mq_queue_busy() to check for outstanding IO --
rather than (ab)using the block core's in_flight counters.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Mikulas Patocka
80a787ba38 dm: dont rewrite dm_disk(md)->part0.in_flight
generic_start_io_acct and generic_end_io_acct already update the variable
in_flight using atomic operations, so we don't have to overwrite them
again.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Dennis Zhou
db6638d7d1 blkcg: remove bio->bi_css and instead use bio->bi_blkg
Prior patches ensured that any bio that interacts with a request_queue
is properly associated with a blkg. This makes bio->bi_css unnecessary
as blkg maintains a reference to blkcg already.

This removes the bio field bi_css and transfers corresponding uses to
access via bi_blkg.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Dennis Zhou
892ad71f62 dm: set the static flush bio device on demand
The next patch changes the macro bio_set_dev() to associate a bio with a
blkg based on the device set. However, dm creates a static bio to be
used as the basis for cloning empty flush bios on creation. The
bio_set_dev() call in alloc_dev() will cause problems with the next
patch adding association to bio_set_dev() because the call is before the
bdev is associated with a gendisk (bd_disk is %NULL). To get around
this, set the device on the static bio every time and use that to clone
to the other bios.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Damien Le Moal
d57f9da890 dm zoned: Fix target BIO completion handling
struct bioctx includes the ref refcount_t to track the number of I/O
fragments used to process a target BIO as well as ensure that the zone
of the BIO is kept in the active state throughout the lifetime of the
BIO. However, since decrementing of this reference count is done in the
target .end_io method, the function bio_endio() must be called multiple
times for read and write target BIOs, which causes problems with the
value of the __bi_remaining struct bio field for chained BIOs (e.g. the
clone BIO passed by dm core is large and splits into fragments by the
block layer), resulting in incorrect values and inconsistencies with the
BIO_CHAIN flag setting. This is turn triggers the BUG_ON() call:

BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);

in bio_remaining_done() called from bio_endio().

Fix this ensuring that bio_endio() is called only once for any target
BIO by always using internal clone BIOs for processing any read or
write target BIO. This allows reference counting using the target BIO
context counter to trigger the target BIO completion bio_endio() call
once all data, metadata and other zone work triggered by the BIO
complete.

Overall, this simplifies the code too as the target .end_io becomes
unnecessary and differences between read and write BIO issuing and
completion processing disappear.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-07 16:04:31 -05:00
Mike Snitzer
89f5fa4747 dm: call blk_queue_split() to impose device limits on bios
Otherwise the incoming bios, of various types, won't be shaped based on
the DM device's advertised limits.

Depends-on: af67c31fba ("blk: remove bio_set arg from blk_queue_split()")
Fixes: 744889b7cb ("block: don't deal with discard limit in blkdev_issue_discard()")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-07 16:04:22 -05:00
Mike Snitzer
687cf4412a dm cache metadata: verify cache has blocks in blocks_are_clean_separate_dirty()
Otherwise dm_bitset_cursor_begin() return -ENODATA.  Other calls to
dm_bitset_cursor_begin() have similar negative checks.

Fixes inability to create a cache in passthrough mode (even though doing
so makes no sense).

Fixes: 0d963b6e65 ("dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty")
Cc: stable@vger.kernel.org
Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-07 16:04:09 -05:00
Eric Biggers
3d234b3313 crypto: drop mask=CRYPTO_ALG_ASYNC from 'shash' tfm allocations
'shash' algorithms are always synchronous, so passing CRYPTO_ALG_ASYNC
in the mask to crypto_alloc_shash() has no effect.  Many users therefore
already don't pass it, but some still do.  This inconsistency can cause
confusion, especially since the way the 'mask' argument works is
somewhat counterintuitive.

Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-11-20 14:26:55 +08:00
Eric Biggers
1ad0f1603a crypto: drop mask=CRYPTO_ALG_ASYNC from 'cipher' tfm allocations
'cipher' algorithms (single block ciphers) are always synchronous, so
passing CRYPTO_ALG_ASYNC in the mask to crypto_alloc_cipher() has no
effect.  Many users therefore already don't pass it, but some still do.
This inconsistency can cause confusion, especially since the way the
'mask' argument works is somewhat counterintuitive.

Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-11-20 14:26:55 +08:00
Jens Axboe
344e9ffcbd block: add queue_is_mq() helper
Various spots check for q->mq_ops being non-NULL, but provide
a helper to do this instead.

Where the ->mq_ops != NULL check is redundant, remove it.

Since mq == rq-based now that legacy is gone, get rid of the
queue_is_rq_based() and just use queue_is_mq() everywhere.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-16 08:34:06 -07:00
Christoph Hellwig
6d46964230 block: remove the lock argument to blk_alloc_queue_node
With the legacy request path gone there is no real need to override the
queue_lock.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-15 12:13:35 -07:00
Linus Torvalds
5f21585384 for-linus-20181102
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvchGgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj/1D/4kEQx4ncnFoZk8QshHV1L++rH3BbcLjQDd
 Wbh9ZSIQdI/gHTzS6bE7x3YfcbpMWPMO3+jFawdfRiFTEjlF8vQ+mnJ+Btb3z4D6
 mGEeFGVhHExlp2a0x/Ma8YWVNlMB7BE8Tq73bZEVMY+9lbpmDW/vp7Sfa87LBDKQ
 ZmY+My+VdHN7qLtQ7t3W/HtpbU+kcXMMd3ICjK4i+ofXy6mynk4+oQ2jwyXc5L86
 UCJCsTsSRr3CgbnkW/uprHo0XHk8i7O/4C3oR+x4pAIxCCa9g+vmw0EO9fvi/2iQ
 qe8jKdm7Y09xu/TiPBa7iz45tdh0cNMJKo3OezmSF9Np+r69KL5C/U4GRPKN3Iwm
 keoqn14ScABkYMSe4ys1AdEgKD6bNUaW3r/lJxTH2oUR23mjnCLp7c4WD/G+MlbB
 CzoakQyCHTZmDFLr2Kc8bkjmpil2T2UFfmLIDAu30LWIYeSGpiIO/V+g1foJMF2f
 06ERltNvgX1BJjoh4NSWySLEf1ZtkUU60NeATRol6gwhnIyLrHsgfm6OEhqlW/7x
 Xc1BWyzX7K6c3Dskk/u5aSRyXOyRC9KkMt3/2XexeDNHkte9yMH0IgSvopPBuER8
 +iPvPjNp7ychTKZB3zpSnlqGgePTjbufIEBtO3OyUmDZKjUqxahtxkQfmPhoclu+
 XdR4ArcqNg==
 =0zM4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20181102' of git://git.kernel.dk/linux-block

Pull block layer fixes from Jens Axboe:
 "The biggest part of this pull request is the revert of the blkcg
  cleanup series. It had one fix earlier for a stacked device issue, but
  another one was reported. Rather than play whack-a-mole with this,
  revert the entire series and try again for the next kernel release.

  Apart from that, only small fixes/changes.

  Summary:

   - Indentation fixup for mtip32xx (Colin Ian King)

   - The blkcg cleanup series revert (Dennis Zhou)

   - Two NVMe fixes. One fixing a regression in the nvme request
     initialization in this merge window, causing nvme-fc to not work.
     The other is a suspend/resume p2p resource issue (James, Keith)

   - Fix sg discard merge, allowing us to merge in cases where we didn't
     before (Jianchao Wang)

   - Call rq_qos_exit() after the queue is frozen, preventing a hang
     (Ming)

   - Fix brd queue setup, fixing an oops if we fail setting up all
     devices (Ming)"

* tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
  nvme-pci: fix conflicting p2p resource adds
  nvme-fc: fix request private initialization
  blkcg: revert blkcg cleanups series
  block: brd: associate with queue until adding disk
  block: call rq_qos_exit() after queue is frozen
  mtip32xx: clean an indentation issue, remove extraneous tabs
  block: fix the DISCARD request merge
2018-11-02 11:25:48 -07:00
Dennis Zhou
b5f2954d30 blkcg: revert blkcg cleanups series
This reverts a series committed earlier due to null pointer exception
bug report in [1]. It seems there are edge case interactions that I did
not consider and will need some time to understand what causes the
adverse interactions.

The original series can be found in [2] with a follow up series in [3].

[1] https://www.spinics.net/lists/cgroups/msg20719.html
[2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

This reverts the following commits:
d459d853c2, b2c3fa5467, 101246ec02, b3b9f24f5f, e2b0989954,
f0fcb3ec89, c839e7a03f, bdc2491708, 74b7c02a9b, 5bf9a1f3b4,
a7b39b4e96, 07b05bcc32, 49f4c2dc2b, 27e6fa996c

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-01 19:59:53 -06:00
Linus Torvalds
7abe849315 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull md updates from Shaohua Li:
 "This mainly improves raid10 cluster and fixes some bugs:

   - raid10 cluster improvements from Guoqing

   - Memory leak fixes from Jack and Xiao

   - raid10 hang fix from Alex

   - raid5 block faulty device fix from Mariusz

   - metadata update fix from Neil

   - Invalid disk role fix from Me

   - Other clearnups"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  MD: Memory leak when flush bio size is zero
  md: fix memleak for mempool
  md-cluster: remove suspend_info
  md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted
  md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during reshaping stage
  md-cluster/raid10: don't call remove_and_add_spares during reshaping stage
  md-cluster/raid10: call update_size in md_reap_sync_thread
  md-cluster: introduce resync_info_get interface for sanity check
  md-cluster/raid10: support add disk under grow mode
  md-cluster/raid10: resize all the bitmaps before start reshape
  MD: fix invalid stored role for a disk - try2
  md/bitmap: use mddev_suspend/resume instead of ->quiesce()
  md: remove redundant code that is no longer reachable
  md: allow metadata updates while suspending an array - fix
  MD: fix invalid stored role for a disk
  md/raid10: Fix raid10 replace hang when new added disk faulty
  raid5: block failing device if raid will be failed
2018-10-26 13:00:44 -07:00
Linus Torvalds
71f4d95b23 - Biggest change this cycle is to remove support for the legacy IO path
(.request_fn) from request-based DM.  Jens has already started
   preparing for complete removal of the legacy IO path in 4.21 but this
   earlier removal of support from DM has been coordinated with Jens (as
   evidenced by the commit being attributed to him).  Making
   request-based DM exclussively blk-mq only cleans up that portion of DM
   core quite nicely.
 
 - Convert the thinp and zoned targets over to using refcount_t where
   applicable.
 
 - A couple fixes to the DM zoned target for refcounting and other races
   buried in the implementation of metadata block creation and use.
 
 - Small cleanups to remove redundant unlikely() around a couple
   WARN_ON_ONCE().
 
 - Simplify how dm-ioctl copies from userspace, eliminating some
   potential for a malicious user trying to change the executed ioctl
   after its processing has begun.
 
 - Tweaked DM crypt target to use the DM device name when naming the
   various workqueues created for a particular DM crypt device (makes the
   N workqueues for a DM crypt device more easily understood and enhances
   user's accounting capabilities at a glance via "ps")
 
 - Small fixup to remove dead branch in DM writecache's memory_entry().
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJb0zCmAAoJEMUj8QotnQNaUKwIAKFC3cuTtQyh3LomSbT4uAr5
 6apBSVcxVhqU+isriW3HmBPkO4HGyqMWjX5oQGHrOj0YK0i1H65Nq3qH9ATaiSHn
 awdo8A4YmClF5Mojc51UebXIH0IfnGSOKH/FHNhQzT3jAdn+vYinMSZ28JwFPgKW
 DsVOSM1dlJZBWRXhQNpyCjVl9Xb3rRUOnkfG0endyMfOsnoxKurhwSkXoStzCdQn
 O5ubt1XT3wMKoI1k9QWjfrBU1NtZZYD+kQ6EfkYXfL9RNhhZjwzO/eNtqT3jnKsq
 qbcd8/0JIUttPf7+F0URG9mbMbebfJGqNAaJWcnlRbCHmgUBBGVWsnl8MTWOLkw=
 =MdDN
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - The biggest change this cycle is to remove support for the legacy IO
   path (.request_fn) from request-based DM.

   Jens has already started preparing for complete removal of the legacy
   IO path in 4.21 but this earlier removal of support from DM has been
   coordinated with Jens (as evidenced by the commit being attributed to
   him).

   Making request-based DM exclussively blk-mq only cleans up that
   portion of DM core quite nicely.

 - Convert the thinp and zoned targets over to using refcount_t where
   applicable.

 - A couple fixes to the DM zoned target for refcounting and other races
   buried in the implementation of metadata block creation and use.

 - Small cleanups to remove redundant unlikely() around a couple
   WARN_ON_ONCE().

 - Simplify how dm-ioctl copies from userspace, eliminating some
   potential for a malicious user trying to change the executed ioctl
   after its processing has begun.

 - Tweaked DM crypt target to use the DM device name when naming the
   various workqueues created for a particular DM crypt device (makes
   the N workqueues for a DM crypt device more easily understood and
   enhances user's accounting capabilities at a glance via "ps")

 - Small fixup to remove dead branch in DM writecache's memory_entry().

* tag 'for-4.20/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm writecache: remove disabled code in memory_entry()
  dm zoned: fix various dmz_get_mblock() issues
  dm zoned: fix metadata block ref counting
  dm raid: avoid bitmap with raid4/5/6 journal device
  dm crypt: make workqueue names device-specific
  dm: add dm_table_device_name()
  dm ioctl: harden copy_params()'s copy_from_user() from malicious users
  dm: remove unnecessary unlikely() around WARN_ON_ONCE()
  dm zoned: target: use refcount_t for dm zoned reference counters
  dm thin: use refcount_t for thin_c reference counting
  dm table: require that request-based DM be layered on blk-mq devices
  dm: rename DM_TYPE_MQ_REQUEST_BASED to DM_TYPE_REQUEST_BASED
  dm: remove legacy request-based IO path
2018-10-26 12:57:38 -07:00
Linus Torvalds
6080ad3a99 for-linus-20181026
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvTOPAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptS3D/9jUjcOiMWYwHW6UbLGlVJiW9igzl6xSKpL
 QSxVCyxKoiaoMobO5dg/BJSq9grQAeOAzv0cjGS6VUuUtfRzMlar73PqZzeaF3oE
 oXeg1goxd9Qjs5AgJ5m3ZM+PLIn7g1zh5TP9vKKGeV4USh4tTlLgYEm5Su3sJDLb
 +6RSCkzYahSCouV2ugHPscsroJ/xzbT5vpMvSflbpe5iYNwzRcn81Z/c6rz7nw9A
 IlnYaf9EdP33uyP1j5Elc2Q1q5DmmxePJlrCUc8VyQdVSMAFZBWZfnkAsdeG7i5V
 YLtNm8tfmGNBZQ0Izp/VwGf9ZMwy13Z2JLBYK/qUh2EGGaWzk7FxiPR9+Gl2MHCD
 iRL7ujDb8yJGL8IEz6xiLcZEHZtUW1VLKPA/YiVtBrg9i6533KEboSQ77CyPchxc
 JJaowruaw/TmaL0MdLYQ3MHVYolWjljyv36YqJcN4oeaGuWfOSotjKKSnIg6FAh3
 Co8cDdcrGdD3yUYGx/7NR+/ejfWDnlCMJ/MtWmC0SMiJbL0pQgjIhXdyt1ciJB5Z
 ezHHrGahbmwNg27AuVHlVsjD8jmA2Z017NnS4zNHOdwNAuhE1LWtBOGU8I5oytJH
 eH0UYu2TDjhPGZ7TjEGZ0Xo6brz6IgPOaUIxrUj73hnbSunxdxo2jODFiiiMrs7Z
 6fxyHKMPuQ==
 =NFEB
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20181026' of git://git.kernel.dk/linux-block

Pull more block layer updates from Jens Axboe:

 - Set of patches improving support for zoned devices. This was ready
   before the merge window, but I was late in picking it up and hence it
   missed the original pull request (Damien, Christoph)

 - libata no link power management quirk addition for a Samsung drive
   (Diego Viola)

 - Fix for a performance regression in BFQ that went into this merge
   window (Federico Motta)

 - Fix for a missing dma mask setting return value check (Gustavo)

 - Typo in the gdrom queue failure case (me)

 - NULL pointer deref fix for xen-blkfront (Vasilis Liaskovitis)

 - Fixing the get_rq trace point placement in blk-mq (Xiaoguang Wang)

 - Removal of a set-but-not-read variable in cdrom (zhong jiang)

* tag 'for-linus-20181026' of git://git.kernel.dk/linux-block:
  libata: Apply NOLPM quirk for SAMSUNG MZ7TD256HAFV-000L9
  block, bfq: fix asymmetric scenarios detection
  gdrom: fix mistake in assignment of error
  blk-mq: place trace_block_getrq() in correct place
  block: Introduce blk_revalidate_disk_zones()
  block: add a report_zones method
  block: Expose queue nr_zones in sysfs
  block: Improve zone reset execution
  block: Introduce BLKGETNRZONES ioctl
  block: Introduce BLKGETZONESZ ioctl
  block: Limit allocation of zone descriptors for report zones
  block: Introduce blkdev_nr_zones() helper
  scsi: sd_zbc: Fix sd_zbc_check_zones() error checks
  scsi: sd_zbc: Reduce boot device scan and revalidate time
  scsi: sd_zbc: Rearrange code
  cdrom: remove set but not used variable 'tocuse'
  skd: fix unchecked return values
  xen/blkfront: avoid NULL blkfront_info dereference on device removal
2018-10-26 12:43:13 -07:00
Linus Torvalds
62606c224d Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Remove VLA usage
   - Add cryptostat user-space interface
   - Add notifier for new crypto algorithms

  Algorithms:
   - Add OFB mode
   - Remove speck

  Drivers:
   - Remove x86/sha*-mb as they are buggy
   - Remove pcbc(aes) from x86/aesni
   - Improve performance of arm/ghash-ce by up to 85%
   - Implement CTS-CBC in arm64/aes-blk, faster by up to 50%
   - Remove PMULL based arm64/crc32 driver
   - Use PMULL in arm64/crct10dif
   - Add aes-ctr support in s5p-sss
   - Add caam/qi2 driver

  Others:
   - Pick better transform if one becomes available in crc-t10dif"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (124 commits)
  crypto: chelsio - Update ntx queue received from cxgb4
  crypto: ccree - avoid implicit enum conversion
  crypto: caam - add SPDX license identifier to all files
  crypto: caam/qi - simplify CGR allocation, freeing
  crypto: mxs-dcp - make symbols 'sha1_null_hash' and 'sha256_null_hash' static
  crypto: arm64/aes-blk - ensure XTS mask is always loaded
  crypto: testmgr - fix sizeof() on COMP_BUF_SIZE
  crypto: chtls - remove set but not used variable 'csk'
  crypto: axis - fix platform_no_drv_owner.cocci warnings
  crypto: x86/aes-ni - fix build error following fpu template removal
  crypto: arm64/aes - fix handling sub-block CTS-CBC inputs
  crypto: caam/qi2 - avoid double export
  crypto: mxs-dcp - Fix AES issues
  crypto: mxs-dcp - Fix SHA null hashes and output length
  crypto: mxs-dcp - Implement sha import/export
  crypto: aegis/generic - fix for big endian systems
  crypto: morus/generic - fix for big endian systems
  crypto: lrw - fix rebase error after out of bounds fix
  crypto: cavium/nitrox - use pci_alloc_irq_vectors() while enabling MSI-X.
  crypto: cavium/nitrox - NITROX command queue changes.
  ...
2018-10-25 16:43:35 -07:00
Damien Le Moal
bf50545696 block: Introduce blk_revalidate_disk_zones()
Drivers exposing zoned block devices have to initialize and maintain
correctness (i.e. revalidate) of the device zone bitmaps attached to
the device request queue (seq_zones_bitmap and seq_zones_wlock).

To simplify coding this, introduce a generic helper function
blk_revalidate_disk_zones() suitable for most (and likely all) cases.
This new function always update the seq_zones_bitmap and seq_zones_wlock
bitmaps as well as the queue nr_zones field when called for a disk
using a request based queue. For a disk using a BIO based queue, only
the number of zones is updated since these queues do not have
schedulers and so do not need the zone bitmaps.

With this change, the zone bitmap initialization code in sd_zbc.c can be
replaced with a call to this function in sd_zbc_read_zones(), which is
called from the disk revalidate block operation method.

A call to blk_revalidate_disk_zones() is also added to the null_blk
driver for devices created with the zoned mode enabled.

Finally, to ensure that zoned devices created with dm-linear or
dm-flakey expose the correct number of zones through sysfs, a call to
blk_revalidate_disk_zones() is added to dm_table_set_restrictions().

The zone bitmaps allocated and initialized with
blk_revalidate_disk_zones() are freed automatically from
__blk_release_queue() using the block internal function
blk_queue_free_zone_bitmaps().

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-25 11:17:40 -06:00
Christoph Hellwig
e76239a374 block: add a report_zones method
Dispatching a report zones command through the request queue is a major
pain due to the command reply payload rewriting necessary. Given that
blkdev_report_zones() is executing everything synchronously, implement
report zones as a block device file operation instead, allowing major
simplification of the code in many places.

sd, null-blk, dm-linear and dm-flakey being the only block device
drivers supporting exposing zoned block devices, these drivers are
modified to provide the device side implementation of the
report_zones() block device file operation.

For device mappers, a new report_zones() target type operation is
defined so that the upper block layer calls blkdev_report_zones() can
be propagated down to the underlying devices of the dm targets.
Implementation for this new operation is added to the dm-linear and
dm-flakey targets.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Damien]
* Changed method block_device argument to gendisk
* Various bug fixes and improvements
* Added support for null_blk, dm-linear and dm-flakey.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-25 11:17:40 -06:00
Damien Le Moal
a91e138022 block: Introduce blkdev_nr_zones() helper
Introduce the blkdev_nr_zones() helper function to get the total
number of zones of a zoned block device. This number is always 0 for a
regular block device (q->limits.zoned == BLK_ZONED_NONE case).

Replace hard-coded number of zones calculation in dmz_get_zoned_device()
with a call to this helper.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-25 11:17:40 -06:00
Linus Torvalds
6ab9e09238 for-4.20/block-20181021
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww
 3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00
 VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5
 WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY
 lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d
 5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH
 TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm
 C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie
 WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5
 wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m
 jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+
 J0oh0FHBDg==
 =LRTT
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "This is the main pull request for block changes for 4.20. This
  contains:

   - Series enabling runtime PM for blk-mq (Bart).

   - Two pull requests from Christoph for NVMe, with items such as;
      - Better AEN tracking
      - Multipath improvements
      - RDMA fixes
      - Rework of FC for target removal
      - Fixes for issues identified by static checkers
      - Fabric cleanups, as prep for TCP transport
      - Various cleanups and bug fixes

   - Block merging cleanups (Christoph)

   - Conversion of drivers to generic DMA mapping API (Christoph)

   - Series fixing ref count issues with blkcg (Dennis)

   - Series improving BFQ heuristics (Paolo, et al)

   - Series improving heuristics for the Kyber IO scheduler (Omar)

   - Removal of dangerous bio_rewind_iter() API (Ming)

   - Apply single queue IPI redirection logic to blk-mq (Ming)

   - Set of fixes and improvements for bcache (Coly et al)

   - Series closing a hotplug race with sysfs group attributes (Hannes)

   - Set of patches for lightnvm:
      - pblk trace support (Hans)
      - SPDX license header update (Javier)
      - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
        specs behind a common core interface. (Javier, Matias)
      - Enable pblk to use a common interface to retrieve chunk metadata
        (Matias)
      - Bug fixes (Various)

   - Set of fixes and updates to the blk IO latency target (Josef)

   - blk-mq queue number updates fixes (Jianchao)

   - Convert a bunch of drivers from the old legacy IO interface to
     blk-mq. This will conclude with the removal of the legacy IO
     interface itself in 4.21, with the rest of the drivers (me, Omar)

   - Removal of the DAC960 driver. The SCSI tree will introduce two
     replacement drivers for this (Hannes)"

* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
  block: setup bounce bio_sets properly
  blkcg: reassociate bios when make_request() is called recursively
  blkcg: fix edge case for blk_get_rl() under memory pressure
  nvme-fabrics: move controller options matching to fabrics
  nvme-rdma: always have a valid trsvcid
  mtip32xx: fully switch to the generic DMA API
  rsxx: switch to the generic DMA API
  umem: switch to the generic DMA API
  sx8: switch to the generic DMA API
  sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
  skd: switch to the generic DMA API
  ubd: remove use of blk_rq_map_sg
  nvme-pci: remove duplicate check
  drivers/block: Remove DAC960 driver
  nvme-pci: fix hot removal during error handling
  nvmet-fcloop: suppress a compiler warning
  nvme-core: make implicit seed truncation explicit
  nvmet-fc: fix kernel-doc headers
  nvme-fc: rework the request initialization code
  nvme-fc: introduce struct nvme_fcp_op_w_sgl
  ...
2018-10-22 17:46:08 +01:00
Xiao Ni
af9b926de9 MD: Memory leak when flush bio size is zero
flush_pool is leaked when flush bio size is zero

Fixes: 5a409b4f56 ("MD: fix lock contention for flush bios")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-22 09:15:26 -07:00
Jack Wang
6aaa58c994 md: fix memleak for mempool
I noticed kmemleak report memory leak when run create/stop
md in a loop, backtrace:
[<000000001ca975e7>] mempool_create_node+0x86/0xd0
[<0000000095576bcd>] md_run+0x1057/0x1410 [md_mod]
[<000000007b45c5fc>] do_md_run+0x15/0x130 [md_mod]
[<000000001ede9ec0>] md_ioctl+0x1f49/0x25d0 [md_mod]
[<000000004142cacf>] blkdev_ioctl+0x680/0xd00

The root cause is we alloc mddev->flush_pool and
mddev->flush_bio_pool in md_run, but from do_md_stop
will not call into md_stop but __md_stop, move the
mempool_destroy to __md_stop fixes the problem for me.

The bug was introduced in 5a409b4f56, the fixes should go to
4.18+

Fixes: 5a409b4f56 ("MD: fix lock contention for flush bios")
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-22 09:12:38 -07:00
Mike Snitzer
da4ad3a23a dm writecache: remove disabled code in memory_entry()
This dead branch was missed during review.  It only makes memory_entry()
more inefficient due to needless call to is_power_of_2(), etc.

Reported-by: shenghui <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-22 10:59:52 -04:00
Damien Le Moal
3d4e738311 dm zoned: fix various dmz_get_mblock() issues
dmz_fetch_mblock() called from dmz_get_mblock() has a race since the
allocation of the new metadata block descriptor and its insertion in
the cache rbtree with the READING state is not atomic. Two different
contexts requesting the same block may end up each adding two different
descriptors of the same block to the cache.

Another problem for this function is that the BIO for processing the
block read is allocated after the metadata block descriptor is inserted
in the cache rbtree. If the BIO allocation fails, the metadata block
descriptor is freed without first being removed from the rbtree.

Fix the first problem by checking again if the requested block is not in
the cache right before inserting the newly allocated descriptor,
atomically under the mblk_lock spinlock. The second problem is fixed by
simply allocating the BIO before inserting the new block in the cache.

Finally, since dmz_fetch_mblock() also increments a block reference
counter, rename the function to dmz_get_mblock_slow(). To be symmetric
and clear, also rename dmz_lookup_mblock() to dmz_get_mblock_fast() and
increment the block reference counter directly in that function rather
than in dmz_get_mblock().

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 15:17:03 -04:00
Damien Le Moal
33c2865f8d dm zoned: fix metadata block ref counting
Since the ref field of struct dmz_mblock is always used with the
spinlock of struct dmz_metadata locked, there is no need to use an
atomic_t type. Change the type of the ref field to an unsigne
integer.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 15:15:23 -04:00
Heinz Mauelshagen
d857ad75ed dm raid: avoid bitmap with raid4/5/6 journal device
With raid4/5/6, journal device and write intent bitmap are mutually exclusive.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 15:13:48 -04:00
Guoqing Jiang
ea89238c0a md-cluster: remove suspend_info
Previously, we allow multiple nodes can resync device, but we
had changed it to only support one node can do resync at one
time, but suspend_info is still used.

Now, let's remove the structure and use suspend_lo/hi to record
the range.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:41:25 -07:00
Guoqing Jiang
cb9ee15431 md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted
We need to continue the reshaping if it was interrupted in
original node. So original node should call resync_bitmap
in case reshaping is aborted.

Then BITMAP_NEEDS_SYNC message is broadcasted to other nodes,
node which continues the reshaping should restart reshape from
mddev->reshape_position instead of from the first beginning.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:40:48 -07:00
Guoqing Jiang
cbce6863b6 md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during reshaping stage
When reshape is happening in one node, other nodes could receive
lots of RESYNCING messages, so md_bitmap_sync_with_cluster is called.

Since the resyncing window is typically small in these RESYNCING
messages, so WARN is always triggered, so we should not call the
func when reshape is happening.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:40:01 -07:00
Guoqing Jiang
ca1e98e04a md-cluster/raid10: don't call remove_and_add_spares during reshaping stage
remove_and_add_spares is not needed if reshape is
happening in another node, because raid10_add_disk
called inside raid10_start_reshape would handle the
role changes of disk. Plus, remove_and_add_spares
can't deal with the role change due to reshape.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:39:10 -07:00
Guoqing Jiang
aefb2e5fc2 md-cluster/raid10: call update_size in md_reap_sync_thread
We need to change the capacity in all nodes after one node
finishs reshape. And as we did before, we can't change the
capacity directly in md_do_sync, instead, the capacity should
be only changed in update_size or received CHANGE_CAPACITY
msg.

So master node calls update_size after completes reshape in
md_reap_sync_thread, but we need to skip ops->update_size if
MD_CLOSING is set since reshaping could not be finish.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:38:06 -07:00
Guoqing Jiang
5ebaf80bc8 md-cluster: introduce resync_info_get interface for sanity check
Since the resync region from suspend_info means one node
is reshaping this area, so the position of reshape_progress
should be included in the area.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:36:35 -07:00
Guoqing Jiang
7564beda19 md-cluster/raid10: support add disk under grow mode
For clustered raid10 scenario, we need to let all the nodes
know about that a new disk is added to the array, and the
reshape caused by add new member just need to be happened in
one node, but other nodes should know about the change.

Since reshape means read data from somewhere (which is already
used by array) and write data to unused region. Obviously, it
is awful if one node is reading data from address while another
node is writing to the same address. Considering we have
implemented suspend writes in the resyncing area, so we can
just broadcast the reading address to other nodes to avoid the
trouble.

For master node, it would call reshape_request then update sb
during the reshape period. To avoid above trouble, we call
resync_info_update to send RESYNC message in reshape_request.

Then from slave node's view, it receives two type messages:
1. RESYNCING message
Slave node add the address (where master node reading data from)
to suspend list.

2. METADATA_UPDATED message
Once slave nodes know the reshaping is started in master node,
it is time to update reshape position and call start_reshape to
follow master node's step. After reshape is done, only reshape
position is need to be updated, so the majority task of reshaping
is happened on the master node.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:34:56 -07:00
Guoqing Jiang
afd7562860 md-cluster/raid10: resize all the bitmaps before start reshape
To support add disk under grow mode, we need to resize
all the bitmaps of each node before reshape, so that we
can ensure all nodes have the same view of the bitmap of
the clustered raid.

So after the master node resized the bitmap, it broadcast
a message to other slave nodes, and it checks the size of
each bitmap are same or not by compare pages. We can only
continue the reshaping after all nodes update the bitmap
to the same size (by checking the pages), otherwise revert
bitmap size to previous value.

The resize_bitmaps interface and BITMAP_RESIZE message are
introduced in md-cluster.c for the purpose.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:30:58 -07:00
Michał Mirosław
ed0302e830 dm crypt: make workqueue names device-specific
Make cpu-usage debugging easier by naming workqueues per device.

Example ps output:

root       413  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd_io/253:0]
root       414  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd/253:0]
root       415  0.0  0.0      0     0 ?        S    paź02   1:10  [dmcrypt_write/253:0]
root       465  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd_io/253:2]
root       466  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd/253:2]
root       467  0.0  0.0      0     0 ?        S    paź02   2:06  [dmcrypt_write/253:2]
root     15359  0.2  0.0      0     0 ?        I<   19:43   0:25  [kworker/u17:8-kcryptd/253:0]
root     16563  0.2  0.0      0     0 ?        I<   20:10   0:18  [kworker/u17:0-kcryptd/253:2]
root     23205  0.1  0.0      0     0 ?        I<   21:21   0:04  [kworker/u17:4-kcryptd/253:0]
root     13383  0.1  0.0      0     0 ?        I<   21:32   0:02  [kworker/u17:2-kcryptd/253:2]
root      2610  0.1  0.0      0     0 ?        I<   21:42   0:01  [kworker/u17:12-kcryptd/253:2]
root     20124  0.1  0.0      0     0 ?        I<   21:56   0:01  [kworker/u17:1-kcryptd/253:2]

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 12:17:17 -04:00
Michał Mirosław
f349b0a3e1 dm: add dm_table_device_name()
Add a shortcut for dm_device_name(dm_table_get_md(t)).

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 12:10:02 -04:00
Wenwen Wang
800a7340ab dm ioctl: harden copy_params()'s copy_from_user() from malicious users
In copy_params(), the struct 'dm_ioctl' is first copied from the user
space buffer 'user' to 'param_kernel' and the field 'data_size' is
checked against 'minimum_data_size' (size of 'struct dm_ioctl' payload
up to its 'data' member).  If the check fails, an error code EINVAL will be
returned.  Otherwise, param_kernel->data_size is used to do a second copy,
which copies from the same user-space buffer to 'dmi'.  After the second
copy, only 'dmi->data_size' is checked against 'param_kernel->data_size'.
Given that the buffer 'user' resides in the user space, a malicious
user-space process can race to change the content in the buffer between
the two copies.  This way, the attacker can inject inconsistent data
into 'dmi' (versus previously validated 'param_kernel').

Fix redundant copying of 'minimum_data_size' from user-space buffer by
using the first copy stored in 'param_kernel'.  Also remove the
'data_size' check after the second copy because it is now unnecessary.

Cc: stable@vger.kernel.org
Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 11:54:07 -04:00
Igor Stoppa
bab5d98884 dm: remove unnecessary unlikely() around WARN_ON_ONCE()
WARN_ON() already contains an unlikely(), so it's not necessary to
wrap it into another.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-16 14:34:59 -04:00
John Pittman
092b564876 dm zoned: target: use refcount_t for dm zoned reference counters
The API surrounding refcount_t should be used in place of atomic_t
when variables are being used as reference counters.  This API can
prevent issues such as counter overflows and use-after-free
conditions.  Within the dm zoned target stack, the atomic_t API is
used for bioctx->ref and cw->refcount.  Change these to use
refcount_t, avoiding the issues mentioned.

Signed-off-by: John Pittman <jpittman@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-16 14:27:38 -04:00
John Pittman
22d4c291f5 dm thin: use refcount_t for thin_c reference counting
The API surrounding refcount_t should be used in place of atomic_t
when variables are being used as reference counters.  It can
potentially prevent reference counter overflows and use-after-free
conditions.  In the dm thin layer, one such example is tc->refcount.
Change this from the atomic_t API to the refcount_t API to prevent
mentioned conditions.

Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-16 14:27:03 -04:00
Shaohua Li
9e753ba9b9 MD: fix invalid stored role for a disk - try2
Commit d595567dc4 (MD: fix invalid stored role for a disk) broke linear
hotadd. Let's only fix the role for disks in raid1/10.
Based on Guoqing's original patch.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-14 17:05:07 -07:00
Mike Snitzer
cef6f55a9f dm table: require that request-based DM be layered on blk-mq devices
Now that request-based DM (multipath) is blk-mq only: this restriction
is required while the legacy request-based IO path still exists.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-11 17:51:13 -04:00
Greg Kroah-Hartman
834d3cd294 Fix open-coded multiplication arguments to allocators
- Fixes several new open-coded multiplications added in the 4.19 merge window.
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlu/fokWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsB/EACgKV77Sad5Luyr3rCmUtGcQ7az
 yLIrqvGcxC55ZEoZwHmSjxiN+5X2kDF6SEFrebvDKFSbiRoC0a1IWRC4pWTpBhTs
 +i1qHVTlOrwBZFTwOn2uklvgkkUfjatG/6zWc7l/Ye070Hekk0SnbMozlggCOJRm
 yKglXaBx9MKmj/T60Vpfve4ubBLM0zSuRPlsBON2qUUp2YTHbEqHOoYawfSK4RuF
 y2hzZc5A0/F7TionkHjrkdEJ8jRkwii2x4iM9KSdhNRxBT0lZkk3xpD6PjRaXCzt
 N2BMU17kftI5498QyKHXdTYCuVPqTpm+Z3d/q+YTbjdpXre1xcZU06ZT9Bqa+LwB
 pRaN4eqd7nLFKvCQYnUp0GuDj5pxd3Xz2dpC0IkaliEM8xYad1+NZRq7SkRJYOpM
 /y05GRdln9ULJF/pet5IS6LtXY+FSn4z+9e+ztVIPQ/kJUqvmyKfWPpdp6TPtwjC
 vb9cbKD7LRPoBfrY0efPXe4aixCwmc4Ob4kljCZtkyrpV+iImYQn9XqTblU7sbHa
 Om8FxGxdX7Xu9HUoT7uHeb8ZNg1g0/XWAEhs7pY22fzHT14T+0fYRz8njmlrw3ed
 dRdzydOxkJMcCVKLitoiw2X1yNRRHtGbXq/UhrHMNbEkOzf73/3fYZK68849FaEK
 1oFOX/N/OI5kp7pNAQ==
 =NS8/
 -----END PGP SIGNATURE-----

Merge tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Kees writes:
  "Fix open-coded multiplication arguments to allocators

   - Fixes several new open-coded multiplications added in the 4.19
     merge window."

* tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Replace more open-coded allocation size multiplications
2018-10-11 19:10:30 +02:00
Mike Snitzer
953923c09f dm: rename DM_TYPE_MQ_REQUEST_BASED to DM_TYPE_REQUEST_BASED
Now that request-based DM is only using blk-mq, there is no need to
differentiate between legacy "rq" and new "mq".  We're back to a single
request-based DM -- and there was much rejoicing!

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-11 11:36:09 -04:00
Jens Axboe
6a23e05c2f dm: remove legacy request-based IO path
dm supports both, and since we're killing off the legacy path in
general, get rid of it in dm.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-11 11:36:09 -04:00
Damien Le Moal
118aa47c70 dm linear: fix linear_end_io conditional definition
The dm-linear target is independent of the dm-zoned target. For code
requiring support for zoned block devices, use CONFIG_BLK_DEV_ZONED
instead of CONFIG_DM_ZONED.

While at it, similarly to dm linear, also enable the DM_TARGET_ZONED_HM
feature in dm-flakey only if CONFIG_BLK_DEV_ZONED is defined.

Fixes: beb9caac21 ("dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled")
Fixes: 0be12c1c7f ("dm linear: add support for zoned block devices")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-10 23:22:24 -04:00
Jack Wang
f8f83d8ffe md/bitmap: use mddev_suspend/resume instead of ->quiesce()
After 9e1cc0a545 ("md: use mddev_suspend/resume instead of ->quiesce()")
We still have similar left in bitmap functions.

Replace quiesce() with mddev_suspend/resume.

Also move md_bitmap_create out of mddev_suspend. and move mddev_resume
after md_bitmap_destroy. as we did in set_bitmap_file.

Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-10 11:03:34 -07:00
Colin Ian King
116d99adf5 md: remove redundant code that is no longer reachable
And earlier commit removed the error label to two statements that
are now never reachable.  Since this code is now dead code, remove it.

Detected by CoverityScan, CID#1462409 ("Structurally dead code")

Fixes: d5d885fd51 ("md: introduce new personality funciton start()")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-10 10:45:15 -07:00
Mike Snitzer
beb9caac21 dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled
It is best to avoid any extra overhead associated with bio completion.
DM core will indirectly call a DM target's .end_io if it is defined.
In the case of DM linear, there is no need to do so (for every bio that
completes) if CONFIG_DM_ZONED is not enabled.

Avoiding an extra indirect call for every bio completion is very
important for ensuring DM linear doesn't incur more overhead that
further widens the performance gap between dm-linear and raw block
devices.

Fixes: 0be12c1c7f ("dm linear: add support for zoned block devices")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-10 12:11:09 -04:00
Damien Le Moal
9864cd5dc5 dm: fix report zone remapping to account for partition offset
If dm-linear or dm-flakey are layered on top of a partition of a zoned
block device, remapping of the start sector and write pointer position
of the zones reported by a report zones BIO must be modified to account
for the target table entry mapping (start offset within the device and
entry mapping with the dm device).  If the target's backing device is a
partition of a whole disk, the start sector on the physical device of
the partition must also be accounted for when modifying the zone
information.  However, dm_remap_zone_report() was not considering this
last case, resulting in incorrect zone information remapping with
targets using disk partitions.

Fix this by calculating the target backing device start sector using
the position of the completed report zones BIO and the unchanged
position and size of the original report zone BIO. With this value
calculated, the start sector and write pointer position of the target
zones can be correctly remapped.

Fixes: 10999307c1 ("dm: introduce dm_remap_zone_report()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-09 14:20:13 -04:00
Shenghui Wang
c7cd55504a dm cache: destroy migration_cache if cache target registration failed
Commit 7e6358d244 ("dm: fix various targets to dm_register_target
after module __init resources created") inadvertently introduced this
bug when it moved dm_register_target() after the call to KMEM_CACHE().

Fixes: 7e6358d244 ("dm: fix various targets to dm_register_target after module __init resources created")
Cc: stable@vger.kernel.org
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-09 13:53:03 -04:00
Dongbo Cao
3a646fd776 bcache: panic fix for making cache device
when the nbuckets of cache device is smaller than 1024, making cache
device will trigger BUG_ON in kernel, add a condition to avoid this.

Reported-by: nitroxis <n@nxs.re>
Signed-off-by: Dongbo Cao <cdbdyx@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:59 -06:00
Dongbo Cao
f6027bca9e bcache: split combined if-condition code into separate ones
Split the combined '||' statements in if() check, to make the code easier
for debug.

Signed-off-by: Dongbo Cao <cdbdyx@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:57 -06:00
Shenghui Wang
8792099f9a bcache: use MAX_CACHES_PER_SET instead of magic number 8 in __bch_bucket_alloc_set
Current cache_set has MAX_CACHES_PER_SET caches most, and the macro
is used for
"
	struct cache *cache_by_alloc[MAX_CACHES_PER_SET];
"
in the define of struct cache_set.

Use MAX_CACHES_PER_SET instead of magic number 8 in
__bch_bucket_alloc_set.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:56 -06:00
Coly Li
149d0efada bcache: replace hard coded number with BUCKET_GC_GEN_MAX
In extents.c:bch_extent_bad(), number 96 is used as parameter to call
btree_bug_on(). The purpose is to check whether stale gen value exceeds
BUCKET_GC_GEN_MAX, so it is better to use macro BUCKET_GC_GEN_MAX to
make the code more understandable.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:55 -06:00
Dongbo Cao
91bafdf081 bcache: remove useless parameter of bch_debug_init()
Parameter "struct kobject *kobj" in bch_debug_init() is useless,
remove it in this patch.

Signed-off-by: Dongbo Cao <cdbdyx@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:53 -06:00
Shenghui Wang
3fd3c5c02b bcache: remove unused bch_passthrough_cache
struct kmem_cache *bch_passthrough_cache is not used in
bcache code. Remove it.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:52 -06:00
Shenghui Wang
46010141da bcache: recal cached_dev_sectors on detach
Recal cached_dev_sectors on cached_dev detached, as recal done on
cached_dev attached.

Update the cached_dev_sectors before bcache_device_detach called
as bcache_device_detach will set bcache_device->c to NULL.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:50 -06:00
Tang Junhui
2d6cb6edd2 bcache: fix miss key refill->end in writeback
refill->end record the last key of writeback, for example, at the first
time, keys (1,128K) to (1,1024K) are flush to the backend device, but
the end key (1,1024K) is not included, since the bellow code:
	if (bkey_cmp(k, refill->end) >= 0) {
		ret = MAP_DONE;
		goto out;
	}
And in the next time when we refill writeback keybuf again, we searched
key start from (1,1024K), and got a key bigger than it, so the key
(1,1024K) missed.
This patch modify the above code, and let the end key to be included to
the writeback key buffer.

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:48 -06:00
Ben Peddell
7567c2a2ad bcache: Populate writeback_rate_minimum attribute
Forgot to include the maintainers with my first email.

Somewhere between Michael Lyle's original
"bcache: PI controller for writeback rate V2" patch dated 07 Sep 2017
and 1d316e6 bcache: implement PI controller for writeback rate,
the mapping of the writeback_rate_minimum attribute was dropped.

Re-add the missing sysfs writeback_rate_minimum attribute mapping to
"allow the user to specify a minimum rate at which dirty blocks are
retired."

Fixes: 1d316e6 ("bcache: implement PI controller for writeback rate")
Signed-off-by: Ben Peddell <klightspeed@killerwolves.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:46 -06:00
Tang Junhui
2e17a262a2 bcache: correct dirty data statistics
When bcache device is clean, dirty keys may still exist after
journal replay, so we need to count these dirty keys even
device in clean status, otherwise after writeback, the amount
of dirty data would be incorrect.

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:45 -06:00
Coly Li
4516da427f bcache: fix typo in code comments of closure_return_with_destructor()
The code comments of closure_return_with_destructor() in closure.h makrs
function name as closure_return(). This patch fixes this type with the
correct name - closure_return_with_destructor.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:43 -06:00
Tang Junhui
dd0c91793b bcache: fix ioctl in flash device
When doing ioctl in flash device, it will call ioctl_dev() in super.c,
then we should not to get cached device since flash only device has
no backend device. This patch just move the jugement dc->io_disable
to cached_dev_ioctl() to make ioctl in flash device correctly.

Fixes: 0f0709e6bf ("bcache: stop bcache device when backing device is offline")
Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:42 -06:00
Coly Li
752f66a75a bcache: use REQ_PRIO to indicate bio for metadata
In cached_dev_cache_miss() and check_should_bypass(), REQ_META is used
to check whether a bio is for metadata request. REQ_META is used for
blktrace, the correct REQ_ flag should be REQ_PRIO. This flag means the
bio should be prior to other bio, and frequently be used to indicate
metadata io in file system code.

This patch replaces REQ_META with correct flag REQ_PRIO.

CC Adam Manzanares because he explains to me what REQ_PRIO is for.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:40 -06:00
Tang Junhui
502b291568 bcache: trace missed reading by cache_missed
Missed reading IOs are identified by s->cache_missed, not the
s->cache_miss, so in trace_bcache_read() using trace_bcache_read
to identify whether the IO is missed or not.

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:39 -06:00
Shenghui Wang
7a55948d38 bcache: account size of buckets used in uuid write to ca->meta_sectors_written
UUIDs are considered as metadata. __uuid_write should add the number
of buckets (in sectors) written to disk to ca->meta_sectors_written.
Currently only 1 bucket is used in uuid write.

Steps to test:
1) create a fresh backing device and a fresh cache device separately.
   The backing device didn't attach to any cache set.
2) cd /sys/block/<cache device>/bcache
   cat metadata_written      // record the output value
   cat bucket_size
3) attach the backing device to cache set
4) cat metadata_written
   The output value is almost the same as the value in step 2
   before the change.
   After the change, the value is bigger about 1 bucket size.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Reviewed-by: Tang Junhui <tang.junhui.linux@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:37 -06:00
Kees Cook
329e098939 treewide: Replace more open-coded allocation size multiplications
As done treewide earlier, this catches several more open-coded
allocation size calculations that were added to the kernel during the
merge window. This performs the following mechanical transformations
using Coccinelle:

	kvmalloc(a * b, ...) -> kvmalloc_array(a, b, ...)
	kvzalloc(a * b, ...) -> kvcalloc(a, b, ...)
	devm_kzalloc(..., a * b, ...) -> devm_kcalloc(..., a, b, ...)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-10-05 18:06:30 -07:00
Greg Kroah-Hartman
b98d6cb80b - Fix a DM thinp __udivdi3 undefined on 32-bit bug introduced during
4.19 merge window.
 
 - Fix leak and dangling pointer in DM multipath's scsi_dh related code.
 
 - A couple stable@ fixes for DM cache's resize support.
 
 - A DM raid fix to remove "const" from decipher_sync_action()'s return
   type.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJbt7RQAAoJEMUj8QotnQNac2MIAK0p2Ndy7ErItA+qIlmBodsO
 juWGdnhBXTHQ9qoeIuamqA66g7LZ8oxnaiIQX8tqzt+GyzxgcjH+9C+Nn7qf8VLF
 4SkbI8/8CxBKmsKv5G5OI2tCfdfTXop9rOTg+3tF6BcTxXzPGeY+5mImsMN4KI/S
 1+hnq2xMtScdTDLyN0qKxs0+e8YdP7IJn2QAscLVNik0wdWn+7Hfp7Hg7tXcupyF
 WQ1dkY8kaGCUid167jAkCmY4etXHIfAb81EWxFtGb/GNGlFmc57t7HGa5mVaO3qv
 QXgp9u27q7v/wqzDHO68+mzAfiBHMFmJHa4uRg7gR1LrE9/0y3CDWC3zMYDL3Es=
 =Av3x
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Mike writes:
  "device mapper fixes

   - Fix a DM thinp __udivdi3 undefined on 32-bit bug introduced during
     4.19 merge window.

   - Fix leak and dangling pointer in DM multipath's scsi_dh related code.

   - A couple stable@ fixes for DM cache's resize support.

   - A DM raid fix to remove "const" from decipher_sync_action()'s return
     type."

* tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: fix resize crash if user doesn't reload cache table
  dm cache metadata: ignore hints array being too small during resize
  dm raid: remove bogus const from decipher_sync_action() return type
  dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer
  dm thin metadata: fix __udivdi3 undefined on 32-bit
2018-10-05 16:09:56 -07:00
Mike Snitzer
5d07384a66 dm cache: fix resize crash if user doesn't reload cache table
A reload of the cache's DM table is needed during resize because
otherwise a crash will occur when attempting to access smq policy
entries associated with the portion of the cache that was recently
extended.

The reason is cache-size based data structures in the policy will not be
resized, the only way to safely extend the cache is to allow for a
proper cache policy initialization that occurs when the cache table is
loaded.  For example the smq policy's space_init(), init_allocator(),
calc_hotspot_params() must be sized based on the extended cache size.

The fix for this is to disallow cache resizes of this pattern:
1) suspend "cache" target's device
2) resize the fast device used for the cache
3) resume "cache" target's device

Instead, the last step must be a full reload of the cache's DM table.

Fixes: 66a636356 ("dm cache: add stochastic-multi-queue (smq) policy")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-04 15:20:52 -04:00
Joe Thornber
4561ffca88 dm cache metadata: ignore hints array being too small during resize
Commit fd2fa9541 ("dm cache metadata: save in-core policy_hint_size to
on-disk superblock") enabled previously written policy hints to be
used after a cache is reactivated.  But in doing so the cache
metadata's hint array was left exposed to out of bounds access because
on resize the metadata's on-disk hint array wasn't ever extended.

Fix this by ignoring that there are no on-disk hints associated with the
newly added cache blocks.  An expanded on-disk hint array is later
rewritten upon the next clean shutdown of the cache.

Fixes: fd2fa9541 ("dm cache metadata: save in-core policy_hint_size to on-disk superblock")
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-04 15:20:51 -04:00
NeilBrown
059421e041 md: allow metadata updates while suspending an array - fix
Commit 35bfc52187 ("md: allow metadata update while suspending.")
added support for allowing md_check_recovery() to still perform
metadata updates while the array is entering the 'suspended' state.
This is needed to allow the processes of entering the state to
complete.

Unfortunately, the patch doesn't really work.  The test for
"mddev->suspended" at the start of md_check_recovery() means that the
function doesn't try to do anything at all while entering suspend.

This patch moves the code of updating the metadata while suspending to
*before* the test on mddev->suspended.

Reported-by: Jeff Mahoney <jeffm@suse.com>
Fixes: 35bfc52187 ("md: allow metadata update while suspending.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-03 09:52:08 -07:00
Shaohua Li
d595567dc4 MD: fix invalid stored role for a disk
If we change the number of array's device after device is removed from array,
then add the device back to array, we can see that device is added as active
role instead of spare which we expected.

Please see the below link for details:
https://marc.info/?l=linux-raid&m=153736982015076&w=2

This is caused by that we prefer to use device's previous role which is
recorded by saved_raid_disk, but we should respect the new number of
conf->raid_disks since it could be changed after device is removed.

Reported-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Tested-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-01 18:36:36 -07:00
Jens Axboe
c0aac682fa This is the 4.19-rc6 release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
 aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
 8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
 VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
 j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
 IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
 gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
 UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
 D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
 IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
 aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
 C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
 =dMl4
 -----END PGP SIGNATURE-----

Merge tag 'v4.19-rc6' into for-4.20/block

Merge -rc6 in, for two reasons:

1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
   they aren't in the 4.20 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v4.19-rc6': (780 commits)
  Linux 4.19-rc6
  MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
  cpufreq: qcom-kryo: Fix section annotations
  perf/core: Add sanity check to deal with pinned event failure
  xen/blkfront: correct purging of persistent grants
  Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
  selftests/powerpc: Fix Makefiles for headers_install change
  blk-mq: I/O and timer unplugs are inverted in blktrace
  dax: Fix deadlock in dax_lock_mapping_entry()
  x86/boot: Fix kexec booting failure in the SEV bit detection code
  bcache: add separate workqueue for journal_write to avoid deadlock
  drm/amd/display: Fix Edid emulation for linux
  drm/amd/display: Fix Vega10 lightup on S3 resume
  drm/amdgpu: Fix vce work queue was not cancelled when suspend
  Revert "drm/panel: Add device_link from panel device to DRM device"
  xen/blkfront: When purging persistent grants, keep them in the buffer
  clocksource/drivers/timer-atmel-pit: Properly handle error cases
  block: fix deadline elevator drain for zoned block devices
  ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
  drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-01 08:58:57 -06:00
Greg Kroah-Hartman
291d0e5d81 for-linus-20180929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAluv6bgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprAFD/9YEm5/YGX8ypqepUISfr7MSspZCl/r0GVv
 Oyu14jfTe18ji9mnFzA+g0y/KbNgdyOYstOp9l4V2Pxmt0Hq3KHjiqdbGbICUcDz
 uPbwnXWW3Pem6l6JOmT86n14c5irkmB/XezlBEE2n7cVReCOwkycr8VNUkXax/Mu
 Wuv1nAX2+uNBGSg0g4H2Y5Dk0fxmQcyKKkVfsz1xa9T2G5sB7gi8XU3+mqXT77hC
 BN7aaB306g+gNwGuHp4V6r9eUmSilRHq53qYTKRD8Vtbe2VeVlsmnU8LjFGRuxcN
 UZOuEO5MftIt32epi8hEQwWVxoZlaHv5qTjAHjiM77H7+kZGVK7Xv/ZrJHoRRQcI
 vIrNKZX0wUtlsC/MmdCcYdqzxgyMJYNc7+Y13W2M/GgXamrOVcjaYWGaxDWfLlIN
 jLkFrBK+9XRnvh5o0yKmoL/LXFJ4vXc3T9cvaYN/KTJUhYcfBEDuvfJTPKjbrWkc
 iv6ORaLh9hbtUmIJO2yo0ZtLo9vxhegJK1NP6bICo0fJ9iiOrIQpxLiEegWA0ITb
 85ot2Iepao5wqNnobSUBdlIKgIt1hECaNVCb3wUvIM7KYZilYP5nYsHg28x9+ZPq
 CJZRzYmMBaH+yKo9FkO/+rxGPg6R9Gri5QZRTFAVUcW02I407A8+nlMwQMEBZxqb
 80h8OO/ACw==
 =8c0k
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180929' of git://git.kernel.dk/linux-block

Jens writes:
  "Block fixes for 4.19-rc6

   A set of fixes that should go into this release. This pull request
   contains:

   - A fix (hopefully) for the persistent grants for xen-blkfront. A
     previous fix from this series wasn't complete, hence reverted, and
     this one should hopefully be it. (Boris Ostrovsky)

   - Fix for an elevator drain warning with SMR devices, which is
     triggered when you switch schedulers (Damien)

   - bcache deadlock fix (Guoju Fang)

   - Fix for the block unplug tracepoint, which has had the
     timer/explicit flag reverted since 4.11 (Ilya)

   - Fix a regression in this series where the blk-mq timeout hook is
     invoked with the RCU read lock held, hence preventing it from
     blocking (Keith)

   - NVMe pull from Christoph, with a single multipath fix (Susobhan Dey)"

* tag 'for-linus-20180929' of git://git.kernel.dk/linux-block:
  xen/blkfront: correct purging of persistent grants
  Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
  blk-mq: I/O and timer unplugs are inverted in blktrace
  bcache: add separate workqueue for journal_write to avoid deadlock
  xen/blkfront: When purging persistent grants, keep them in the buffer
  block: fix deadline elevator drain for zoned block devices
  blk-mq: Allow blocking queue tag iter callbacks
  nvme: properly propagate errors in nvme_mpath_init
2018-09-29 14:52:14 -07:00
Alex Wu
ee37d7314a md/raid10: Fix raid10 replace hang when new added disk faulty
[Symptom]

Resync thread hang when new added disk faulty during replacing.

[Root Cause]

In raid10_sync_request(), we expect to issue a bio with callback
end_sync_read(), and a bio with callback end_sync_write().

In normal situation, we will add resyncing sectors into
mddev->recovery_active when raid10_sync_request() returned, and sub
resynced sectors from mddev->recovery_active when end_sync_write()
calls end_sync_request().

If new added disk, which are replacing the old disk, is set faulty,
there is a race condition:
    1. In the first rcu protected section, resync thread did not detect
       that mreplace is set faulty and pass the condition.
    2. In the second rcu protected section, mreplace is set faulty.
    3. But, resync thread will prepare the read object first, and then
       check the write condition.
    4. It will find that mreplace is set faulty and do not have to
       prepare write object.
This cause we add resync sectors but never sub it.

[How to Reproduce]

This issue can be easily reproduced by the following steps:
    mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd]
    mdadm /dev/md0 -a /dev/sde
    mdadm /dev/md0 --replace /dev/sdd
    sleep 1
    mdadm /dev/md0 -f /dev/sde

[How to Fix]

This issue can be fixed by using local variables to record the result
of test conditions. Once the conditions are satisfied, we can make sure
that we need to issue a bio for read and a bio for write.

Previous 'commit 24afd80d99 ("md/raid10: handle recovery of
replacement devices.")' will also check whether bio is NULL, but leave
the comment saying that it is a pointless test. So we remove this dummy
check.

Reported-by: Alex Chen <alexchen@synology.com>
Reviewed-by: Allen Peng <allenpeng@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Alex Wu <alexwu@synology.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-09-28 11:42:47 -07:00
Mariusz Tkaczyk
fb73b357fb raid5: block failing device if raid will be failed
Currently there is an inconsistency for failing the member drives
for arrays with different RAID levels. For RAID456 - there is a possibility
to fail all of the devices. However - for other RAID levels - kernel blocks
removing the member drive, if the operation results in array's FAIL state
(EBUSY is returned). For example - removing last drive from RAID1 is not
possible.
This kind of blocker was never implemented for raid456 and we cannot see
the reason why.

We had tested following patch and did not observe any regression, so do you
have any comments/reasons for current approach, or we can send the proper
patch for this?

Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-09-28 11:13:15 -07:00
Guoju Fang
0f843e65d9 bcache: add separate workqueue for journal_write to avoid deadlock
After write SSD completed, bcache schedules journal_write work to
system_wq, which is a public workqueue in system, without WQ_MEM_RECLAIM
flag. system_wq is also a bound wq, and there may be no idle kworker on
current processor. Creating a new kworker may unfortunately need to
reclaim memory first, by shrinking cache and slab used by vfs, which
depends on bcache device. That's a deadlock.

This patch create a new workqueue for journal_write with WQ_MEM_RECLAIM
flag. It's rescuer thread will work to avoid the deadlock.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 09:47:01 -06:00
Dennis Zhou (Facebook)
c839e7a03f blkcg: remove bio->bi_css and instead use bio->bi_blkg
Prior patches ensured that all bios are now associated with some blkg.
This now makes bio->bi_css unnecessary as blkg maintains a reference to
the blkcg already.

This patch removes the field bi_css and transfers corresponding uses to
access via bi_blkg.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:13 -06:00
Geert Uytterhoeven
0328ba9040 dm raid: remove bogus const from decipher_sync_action() return type
With gcc-4.1.2:

    drivers/md/dm-raid.c:3357: warning: type qualifiers ignored on function return type

Remove the "const" keyword to fix this.

Fixes: 36a240a706 ("dm raid: fix RAID leg rebuild errors")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-17 22:46:50 -04:00
Mike Snitzer
b592211c33 dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer
Commit e8f74a0f00 ("dm mpath: eliminate need to use
scsi_device_from_queue") introduced 2 regressions:
1) memory leak occurs if attached_handler_name is not assigned to
   m->hw_handler_name
2) m->hw_handler_name can become a dangling pointer if the
   RETAIN_ATTACHED_HW_HANDLER flag is set and scsi_dh_attach() returns
   -EBUSY.

Fix both of these by clearing 'attached_handler_name' pointer passed to
setup_scsi_dh() after it is assigned to m->hw_handler_name.  And if
setup_scsi_dh() doesn't consume 'attached_handler_name' parse_path()
will kfree() it.

Fixes: e8f74a0f00 ("dm mpath: eliminate need to use scsi_device_from_queue")
Cc: stable@vger.kernel.org # 4.16+
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-17 22:46:49 -04:00
Mike Snitzer
013ad04390 dm thin metadata: fix __udivdi3 undefined on 32-bit
sector_div() is only viable for use with sector_t.
dm_block_t is typedef'd to uint64_t -- so use div_u64() instead.

Fixes: 3ab918281 ("dm thin metadata: try to avoid ever aborting transactions")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-17 11:49:34 -04:00
Kees Cook
6d39a1241e dm: Remove VLA usage from hashes
In the quest to remove all stack VLA usage from the kernel[1], this uses
the new HASH_MAX_DIGESTSIZE from the crypto layer to allocate the upper
bounds on stack usage.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-14 14:08:52 +08:00
Linus Torvalds
a0efc03b79 - DM verity fix for crash due to using vmalloc'd buffers with the
asynchronous crypto hadsh API.
 
 - Fix to both DM crypt and DM integrity targets to discontinue using
   CRYPTO_TFM_REQ_MAY_SLEEP because its use of GFP_KERNEL can lead to
   deadlock by recursing back into a filesystem.
 
 - Various DM raid fixes related to reshape and rebuild races.
 
 - Fix for DM thin-provisioning to avoid data corruption that was a
   side-effect of needing to abort DM thin metadata transaction due to
   running out of metadata space.  Fix is to reserve a small amount of
   metadata space so that once it is used the DM thin-pool can finish its
   active transaction before switching to read-only mode.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJbmq1sAAoJEMUj8QotnQNa9hwIANBzgefmIoBA8hCqlS7Nzogh
 moHAbaUONYxAVksJ5s4Divpa4RE8jSZ8AiVJizbLj94XxqnpA5x4pmhtUs3OdWxq
 4FW8fZvb2L1IMeshajMsO+rKrL6y8uwYnRI5zSJRKxhV2frj1VZ4ioMEXMGrwqjO
 jwNYFWglX5G3eIohUIDxgXki/WkacndQlsy2T6Utx0idz50usOt1pNON/GPhRsXd
 yWbvxQWuaZlm6+TF1ygsrXr1nMwRwDw0pdjY9sRQ3hHmuYQ24Wgkwf5Ggi5US7VZ
 iwgK8dppR8zWqwRyxRYahJvQvwcnC18mewFAmuDQfvPXqvUHtuhWaxzE4D9c3L0=
 =yIoD
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - DM verity fix for crash due to using vmalloc'd buffers with the
   asynchronous crypto hadsh API.

 - Fix to both DM crypt and DM integrity targets to discontinue using
   CRYPTO_TFM_REQ_MAY_SLEEP because its use of GFP_KERNEL can lead to
   deadlock by recursing back into a filesystem.

 - Various DM raid fixes related to reshape and rebuild races.

 - Fix for DM thin-provisioning to avoid data corruption that was a
   side-effect of needing to abort DM thin metadata transaction due to
   running out of metadata space. Fix is to reserve a small amount of
   metadata space so that once it is used the DM thin-pool can finish
   its active transaction before switching to read-only mode.

* tag 'for-4.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm thin metadata: try to avoid ever aborting transactions
  dm raid: bump target version, update comments and documentation
  dm raid: fix RAID leg rebuild errors
  dm raid: fix rebuild of specific devices by updating superblock
  dm raid: fix stripe adding reshape deadlock
  dm raid: fix reshape race on small devices
  dm: disable CRYPTO_TFM_REQ_MAY_SLEEP to fix a GFP_KERNEL recursion deadlock
  dm verity: fix crash on bufio buffer that was allocated with vmalloc
2018-09-13 19:12:55 -10:00
Joe Thornber
3ab9182816 dm thin metadata: try to avoid ever aborting transactions
Committing a transaction can consume some metadata of it's own, we now
reserve a small amount of metadata to cover this.  Free metadata
reported by the kernel will not include this reserve.

If any of the reserve has been used after a commit we enter a new
internal state PM_OUT_OF_METADATA_SPACE.  This is reported as
PM_READ_ONLY, so no userland changes are needed.  If the metadata
device is resized the pool will move back to PM_WRITE.

These changes mean we never need to abort and rollback a transaction due
to running out of metadata space.  This is particularly important
because there have been a handful of reports of data corruption against
DM thin-provisioning that can all be attributed to the thin-pool having
ran out of metadata space.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-10 17:03:18 -04:00
Heinz Mauelshagen
5380c05b68 dm raid: bump target version, update comments and documentation
Bump target version to reflect the documented fixes are available.
Also fix some code comments (typos and clarity).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 17:07:58 -04:00
Heinz Mauelshagen
36a240a706 dm raid: fix RAID leg rebuild errors
On fast devices such as NVMe, a flaw in rs_get_progress() results in
false target status output when userspace lvm2 requests leg rebuilds
(symptom of the failure is device health chars 'aaaaaaaa' instead of
expected 'aAaAAAAA' causing lvm2 to fail).

The correct sync action state definitions already exist in
decipher_sync_action() so fix rs_get_progress() to use it.

Change decipher_sync_action() to return an enum rather than a string for
the sync states and call it from rs_get_progress().  Introduce
sync_str() to translate from enum to the string that is needed by
raid_status().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 17:07:56 -04:00
Heinz Mauelshagen
c44a5ee803 dm raid: fix rebuild of specific devices by updating superblock
Update superblock when particular devices are requested via rebuild
(e.g. lvconvert --replace ...) to avoid spurious failure with the "New
device injected into existing raid set without 'delta_disks' or
'rebuild' parameter specified" error message.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 17:07:56 -04:00
Heinz Mauelshagen
644e2537fd dm raid: fix stripe adding reshape deadlock
When initiating a stripe adding reshape, a deadlock between
md_stop_writes() waiting for the sync thread to stop and the running
sync thread waiting for inactive stripes occurs (this frequently happens
on single-core but rarely on multi-core systems).

Fix this deadlock by setting MD_RECOVERY_WAIT to have the main MD
resynchronization thread worker (md_do_sync()) bail out when initiating
the reshape via constructor arguments.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 17:07:50 -04:00
Heinz Mauelshagen
38b0bd0cda dm raid: fix reshape race on small devices
Loading a new mapping table, the dm-raid target's constructor
retrieves the volatile reshaping state from the raid superblocks.

When the new table is activated in a following resume, the actual
reshape position is retrieved.  The reshape driven by the previous
mapping can already have finished on small and/or fast devices thus
updating raid superblocks about the new raid layout.

This causes the actual array state (e.g. stripe size reshape finished)
to be inconsistent with the one in the new mapping, causing hangs with
left behind devices.

This race does not occur with usual raid device sizes but with small
ones (e.g. those created by the lvm2 test suite).

Fix by no longer transferring stale/inconsistent raid_set state during
preresume.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 14:11:00 -04:00
Mikulas Patocka
432061b3da dm: disable CRYPTO_TFM_REQ_MAY_SLEEP to fix a GFP_KERNEL recursion deadlock
There's a XFS on dm-crypt deadlock, recursing back to itself due to the
crypto subsystems use of GFP_KERNEL, reported here:
https://bugzilla.kernel.org/show_bug.cgi?id=200835

* dm-crypt calls crypt_convert in xts mode
* init_crypt from xts.c calls kmalloc(GFP_KERNEL)
* kmalloc(GFP_KERNEL) recurses into the XFS filesystem, the filesystem
	tries to submit some bios and wait for them, causing a deadlock

Fix this by updating both the DM crypt and integrity targets to no
longer use the CRYPTO_TFM_REQ_MAY_SLEEP flag, which will change the
crypto allocations from GFP_KERNEL to GFP_ATOMIC, therefore they can't
recurse into a filesystem.  A GFP_ATOMIC allocation can fail, but
init_crypt() in xts.c handles the allocation failure gracefully - it
will fall back to preallocated buffer if the allocation fails.

The crypto API maintainer says that the crypto API only needs to
allocate memory when dealing with unaligned buffers and therefore
turning CRYPTO_TFM_REQ_MAY_SLEEP off is safe (see this discussion:
https://www.redhat.com/archives/dm-devel/2018-August/msg00195.html )

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 13:31:09 -04:00
Mikulas Patocka
e4b069e094 dm verity: fix crash on bufio buffer that was allocated with vmalloc
Since commit d1ac3ff008 ("dm verity: switch to using asynchronous hash
crypto API") dm-verity uses asynchronous crypto calls for verification,
so that it can use hardware with asynchronous processing of crypto
operations.

These asynchronous calls don't support vmalloc memory, but the buffer data
can be allocated with vmalloc if dm-bufio is short of memory and uses a
reserved buffer that was preallocated in dm_bufio_client_create().

Fix verity_hash_update() so that it deals with vmalloc'd memory
correctly.

Reported-by: "Xiao, Jin" <jin.xiao@intel.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: d1ac3ff008 ("dm verity: switch to using asynchronous hash crypto API")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-04 11:25:25 -04:00
Guoqing Jiang
41a9504112 md-cluster: release RESYNC lock after the last resync message
All the RESYNC messages are sent with resync lock held, the only
exception is resync_finish which releases resync_lockres before
send the last resync message, this should be changed as well.
Otherwise, we can see deadlock issue as follows:

clustermd2-gqjiang2:~ # cat /proc/mdstat
Personalities : [raid10] [raid1]
md0 : active raid1 sdg[0] sdf[1]
      134144 blocks super 1.2 [2/2] [UU]
      [===================>.]  resync = 99.6% (134144/134144) finish=0.0min speed=26K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
clustermd2-gqjiang2:~ # ps aux|grep md|grep D
root     20497  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_raid1]
clustermd2-gqjiang2:~ # cat /proc/20497/stack
[<ffffffffc05ff51e>] dlm_lock_sync+0x8e/0xc0 [md_cluster]
[<ffffffffc05ff7e8>] __sendmsg+0x98/0x130 [md_cluster]
[<ffffffffc05ff900>] sendmsg+0x20/0x30 [md_cluster]
[<ffffffffc05ffc35>] resync_info_update+0xb5/0xc0 [md_cluster]
[<ffffffffc0593e84>] md_reap_sync_thread+0x134/0x170 [md_mod]
[<ffffffffc059514c>] md_check_recovery+0x28c/0x510 [md_mod]
[<ffffffffc060c882>] raid1d+0x42/0x800 [raid1]
[<ffffffffc058ab61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9a0a5b3f>] kthread+0xff/0x140
[<ffffffff9a800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

clustermd-gqjiang1:~ # ps aux|grep md|grep D
root     20531  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_raid1]
root     20537  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_cluster_rec]
root     20676  0.0  0.0      0     0 ?        D    16:01   0:00 [md0_resync]
clustermd-gqjiang1:~ # cat /proc/mdstat
Personalities : [raid10] [raid1]
md0 : active raid1 sdf[1] sdg[0]
      134144 blocks super 1.2 [2/2] [UU]
      [===================>.]  resync = 97.3% (131072/134144) finish=8076.8min speed=0K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
clustermd-gqjiang1:~ # cat /proc/20531/stack
[<ffffffffc080974d>] metadata_update_start+0xcd/0xd0 [md_cluster]
[<ffffffffc079c897>] md_update_sb.part.61+0x97/0x820 [md_mod]
[<ffffffffc079f15b>] md_check_recovery+0x29b/0x510 [md_mod]
[<ffffffffc0816882>] raid1d+0x42/0x800 [raid1]
[<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9e0a5b3f>] kthread+0xff/0x140
[<ffffffff9e800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
clustermd-gqjiang1:~ # cat /proc/20537/stack
[<ffffffffc0813222>] freeze_array+0xf2/0x140 [raid1]
[<ffffffffc080a56e>] recv_daemon+0x41e/0x580 [md_cluster]
[<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9e0a5b3f>] kthread+0xff/0x140
[<ffffffff9e800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
clustermd-gqjiang1:~ # cat /proc/20676/stack
[<ffffffffc080951e>] dlm_lock_sync+0x8e/0xc0 [md_cluster]
[<ffffffffc080957f>] lock_token+0x2f/0xa0 [md_cluster]
[<ffffffffc0809622>] lock_comm+0x32/0x90 [md_cluster]
[<ffffffffc08098f5>] sendmsg+0x15/0x30 [md_cluster]
[<ffffffffc0809c0a>] resync_info_update+0x8a/0xc0 [md_cluster]
[<ffffffffc08130ba>] raid1_sync_request+0xa9a/0xb10 [raid1]
[<ffffffffc079b8ea>] md_do_sync+0xbaa/0xf90 [md_mod]
[<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9e0a5b3f>] kthread+0xff/0x140
[<ffffffff9e800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-31 17:38:10 -07:00
Xiao Ni
1d0ffd2642 RAID10 BUG_ON in raise_barrier when force is true and conf->barrier is 0
In raid10 reshape_request it gets max_sectors in read_balance. If the underlayer disks
have bad blocks, the max_sectors is less than last. It will call goto read_more many
times. It calls raise_barrier(conf, sectors_done != 0) every time. In this condition
sectors_done is not 0. So the value passed to the argument force of raise_barrier is
true.

In raise_barrier it checks conf->barrier when force is true. If force is true and
conf->barrier is 0, it panic. In this case reshape_request submits bio to under layer
disks. And in the callback function of the bio it calls lower_barrier. If the bio
finishes before calling raise_barrier again, it can trigger the BUG_ON.

Add one pair of raise_barrier/lower_barrier to fix this bug.

Signed-off-by: Xiao Ni <xni@redhat.com>
Suggested-by: Neil Brown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-31 17:38:10 -07:00
Shaohua Li
e254de6bcf md/raid5-cache: disable reshape completely
We don't support reshape yet if an array supports log device. Previously we
determine the fact by checking ->log. However, ->log could be NULL after a log
device is removed, but the array is still marked to support log device. Don't
allow reshape in this case too. User can disable log device support by setting
'consistency_policy' to 'resync' then do reshape.

Reported-by: Xiao Ni <xni@redhat.com>
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-31 17:38:09 -07:00
Linus Torvalds
828bf6e904 libnvdimm-for-4.19_misc
Collection of misc libnvdimm patches for 4.19 submission
 * Adding support to read locked nvdimm capacity.
 
 * Change test code to make DSM failure code injection an override.
 
 * Add support for calculate maximum contiguous area for namespace.
 
 * Add support for queueing a short ARS when there is on going ARS for
   nvdimm.
 
 * Allow NULL to be passed in to ->direct_access() for kaddr and
   pfn params.
 
 * Improve smart injection support for nvdimm emulation testing.
 
 * Fix test code that supports for emulating controller temperature.
 
 * Fix hang on error before devm_memremap_pages()
 
 * Fix a bug that causes user memory corruption when data returned
   to user for ars_status.
 
 * Maintainer updates for Ross Zwisler emails and adding Jan Kara to fsdax.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9uUIACgkQYGjFFmlT
 OErL+xAAgWHSGs8w98VtYA9kLDeTYEXutq93wJZQoBu/FMAXuuU3hYmQYnOQU87h
 KKYmfDkeusaih1R3IX7mzlegnnzSfQ6MraNSV76M43noJHbRTunknCPZH6ebp4fo
 b/eljvWlZF/idM+7YcsnoFMnHSRj2pjJGXmKQDlKedHD+KMxpmk6zEl2s5Y0zvPU
 4U7UQLtk3D5IIpLNsLEmxge32BfvNf5IzoSO1aZp7Eqk0+U5Tq3Sq/Tjmd+J0RKt
 6WH5yA6NqXQgBh+ayHsYU8YX62RqnbKQZXqVxD35OH64zJEUefnP1fpt9pmaZ9eL
 43BPMkpM09eLAikO2ET3/3c2k6h3h9ttz1sH8t/hiroCtfmxs3XgskY06hxpKjZV
 EbN+BUmut5Mr+zzYitRr3dbK2aHPVU9IbU7jUw/1Tz23rq3kU5iI7SHHv1b/eWup
 1Cr77Z1M6HB8VBhjnJ+R607sbRrnKQUOV7fGzAaIskyUOTWsEvIgTh/6MRiaj9MD
 5HXIgc/0y9E+G93s7MsUWwzpB7J6E7EGoybST2SKPtqwtDMPsBNeWRjyA9quBCoN
 u1s+e+lWHYutqRW0eisDTTlq3nJwPijSx1nnzhJxw9s1EkCXz3f7KRZhyH1C79Co
 7wjiuvKQ79e/HI/oXvGmTnv5lbLEpWYyJ3U3KIFfoUqugeyhr0k=
 =5p2n
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dave Jiang:
 "Collection of misc libnvdimm patches for 4.19 submission:

   - Adding support to read locked nvdimm capacity.

   - Change test code to make DSM failure code injection an override.

   - Add support for calculate maximum contiguous area for namespace.

   - Add support for queueing a short ARS when there is on going ARS for
     nvdimm.

   - Allow NULL to be passed in to ->direct_access() for kaddr and pfn
     params.

   - Improve smart injection support for nvdimm emulation testing.

   - Fix test code that supports for emulating controller temperature.

   - Fix hang on error before devm_memremap_pages()

   - Fix a bug that causes user memory corruption when data returned to
     user for ars_status.

   - Maintainer updates for Ross Zwisler emails and adding Jan Kara to
     fsdax"

* tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm: fix ars_status output length calculation
  device-dax: avoid hang on error before devm_memremap_pages()
  tools/testing/nvdimm: improve emulation of smart injection
  filesystem-dax: Do not request kaddr and pfn when not required
  md/dm-writecache: Don't request pointer dummy_addr when not required
  dax/super: Do not request a pointer kaddr when not required
  tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
  s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
  libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
  acpi/nfit: queue issuing of ars when an uc error notification comes in
  libnvdimm: Export max available extent
  libnvdimm: Use max contiguous area for namespace size
  MAINTAINERS: Add Jan Kara for filesystem DAX
  MAINTAINERS: update Ross Zwisler's email address
  tools/testing/nvdimm: Fix support for emulating controller temperature
  tools/testing/nvdimm: Make DSM failure code injection an override
  acpi, nfit: Prefer _DSM over _LSR for namespace label reads
  libnvdimm: Introduce locked DIMM capacity support
2018-08-25 18:13:10 -07:00
Shan Hai
3943b040f1 bcache: release dc->writeback_lock properly in bch_writeback_thread()
The writeback thread would exit with a lock held when the cache device
is detached via sysfs interface, fix it by releasing the held lock
before exiting the while-loop.

Fixes: fadd94e05c (bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set)
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Tested-by: Shenghui Wang <shhuiw@foxmail.com>
Cc: stable@vger.kernel.org #4.17+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:06:29 -06:00
Linus Torvalds
5bed49adfe for-4.19/post-20180822
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlt9on8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj1xEADKBmJlV9aVyxc5w6XggqAGeHqI4afFrl+v
 9fW6WUQMAaBUrr7PMIEJQ0Zm4B7KxgBaEWNtuuj4ULkjpgYm2AuGUuTJSyKz41rS
 Ma+KNyCA2Zmq4SvwGFbcdCuCbUqnoxTycscAgCjuDvIYLW0+nFGNc47ibmu9lZIV
 33Ef5LrxuCjhC2zyNxEdWpUxDCjoYzock85LW+wYyIYLU9uKdoExS+YmT8U+ebA/
 AkXBcxPztNDxwkcsIwgGVoTjwxiowqGz3uueWfyEmYgaCPiNOsxkoNQAtjX4ykQE
 hnqnHWyzJkRwbYo7Vd/bRAZXvszKGYE1YcJmu5QrNf0dK5MSq2o5OYJAEJWbucPj
 m0R2u7O9qbS2JEnxGrm5+oYJwBzwNY5/Lajr15WkljTqobKnqcvn/Hdgz/XdGtek
 0S1QHkkBsF7e+cax8sePWK+O3ilY7pl9CzyZKB/tJngl8A45Jv8xVojg0v3O7oS+
 zZib0rwWg/bwR/uN6uPCDcEsQusqL5YovB7m6NRVshwz6cV1zVNp2q+iOulk7KuC
 MprW4Du9CJf8HA19XtyJIG1XLstnuz+Exy+i5BiimUJ5InoEFDuj/6OZa6Qaczbo
 SrDDvpGtSf4h7czKpE5kV4uZiTOrjuI30TrI+4csdZ7HQIlboxNL72seNTLJs55F
 nbLjRM8L6g==
 =FS7e
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:

 - Set of bcache fixes and changes (Coly)

 - The flush warn fix (me)

 - Small series of BFQ fixes (Paolo)

 - wbt hang fix (Ming)

 - blktrace fix (Steven)

 - blk-mq hardware queue count update fix (Jianchao)

 - Various little fixes

* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
  block/DAC960.c: make some arrays static const, shrinks object size
  blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
  blk-mq: init hctx sched after update ctx and hctx mapping
  block: remove duplicate initialization
  tracing/blktrace: Fix to allow setting same value
  pktcdvd: fix setting of 'ret' error return for a few cases
  block: change return type to bool
  block, bfq: return nbytes and not zero from struct cftype .write() method
  block, bfq: improve code of bfq_bfqq_charge_time
  block, bfq: reduce write overcharge
  block, bfq: always update the budget of an entity when needed
  block, bfq: readd missing reset of parent-entity service
  blk-wbt: fix IO hang in wbt_wait()
  block: don't warn for flush on read-only device
  bcache: add the missing comments for smp_mb()/smp_wmb()
  bcache: remove unnecessary space before ioctl function pointer arguments
  bcache: add missing SPDX header
  bcache: move open brace at end of function definitions to next line
  bcache: add static const prefix to char * array declarations
  bcache: fix code comments style
  ...
2018-08-22 13:38:05 -07:00
Coly Li
d23599630b bcache: use routines from lib/crc64.c for CRC64 calculation
Now we have crc64 calculation in lib/crc64.c, it is unnecessary for
bcache to use its own version.  This patch changes bcache code to use
crc64 routines in lib/crc64.c.

Link: http://lkml.kernel.org/r/20180718165545.1622-3-colyli@suse.de
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Noah Massey <noah.massey@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:48 -07:00
Linus Torvalds
08b5fa8199 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:

 - a new driver for Rohm BU21029 touch controller

 - new bitmap APIs: bitmap_alloc, bitmap_zalloc and bitmap_free

 - updates to Atmel, eeti. pxrc and iforce drivers

 - assorted driver cleanups and fixes.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (57 commits)
  MAINTAINERS: Add PhoenixRC Flight Controller Adapter
  Input: do not use WARN() in input_alloc_absinfo()
  Input: mark expected switch fall-throughs
  Input: raydium_i2c_ts - use true and false for boolean values
  Input: evdev - switch to bitmap API
  Input: gpio-keys - switch to bitmap_zalloc()
  Input: elan_i2c_smbus - cast sizeof to int for comparison
  bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free()
  md: Avoid namespace collision with bitmap API
  dm: Avoid namespace collision with bitmap API
  Input: pm8941-pwrkey - add resin entry
  Input: pm8941-pwrkey - abstract register offsets and event code
  Input: iforce - reorganize joystick configuration lists
  Input: atmel_mxt_ts - move completion to after config crc is updated
  Input: atmel_mxt_ts - don't report zero pressure from T9
  Input: atmel_mxt_ts - zero terminate config firmware file
  Input: atmel_mxt_ts - refactor config update code to add context struct
  Input: atmel_mxt_ts - config CRC may start at T71
  Input: atmel_mxt_ts - remove unnecessary debug on ENOMEM
  Input: atmel_mxt_ts - remove duplicate setup of ABS_MT_PRESSURE
  ...
2018-08-18 16:48:07 -07:00
Linus Torvalds
b0e5c29426 - A couple stable fixes for the DM writecache target.
- A stable fix for the DM cache target that fixes the potential for data
   corruption after an unclean shutdown of a cache device using writeback
   mode.
 
 - Update DM integrity target to allow the metadata to be stored on a
   separate device from data.
 
 - Fix DM kcopyd and the snapshot target to cond_resched() where
   appropriate and be more efficient with processing completed work.
 
 - A few fixes and improvements for DM crypt.
 
 - Add DM delay target feature to configure delay of flushes independent
   of writes.
 
 - Update DM thin-provisioning target to include metadata_low_watermark
   threshold in pool status.
 
 - Fix stale DM thin-provisioning Documentation.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJbddbuAAoJEMUj8QotnQNaoZIH/1qxc4x26bUfkVwBCiqU0cqJ
 8PD8D5UBFB7+JXeI4P9p7ONVJNpk291QGeX+CBOXfhdeBBcXmuHnavJFcmH6+1Y3
 omPEIvKnsODEMyZuznA8HasvlZIDPNOESaDt/8vhDPkDmPLWi3h+fUO1ay+NRtua
 J+8ZTe35P5SY70uLFE3nTeZScZD8KAO9Py1W+5Lz1godS6UUHhNOAcm0zzw5Nvnc
 2gu2HZaCx0erFNou5Kj5Y4a/z/cITlAyXHQyzXBINk/X4sSwvzRr2tSy/FG91CuO
 kHduB0Tgo8eSeEta8jcOcYCm601XLlXzkIM69Z7c3fmCY+b8dY4ybHPxApTEv7c=
 =blC6
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - A couple stable fixes for the DM writecache target.

 - A stable fix for the DM cache target that fixes the potential for
   data corruption after an unclean shutdown of a cache device using
   writeback mode.

 - Update DM integrity target to allow the metadata to be stored on a
   separate device from data.

 - Fix DM kcopyd and the snapshot target to cond_resched() where
   appropriate and be more efficient with processing completed work.

 - A few fixes and improvements for DM crypt.

 - Add DM delay target feature to configure delay of flushes independent
   of writes.

 - Update DM thin-provisioning target to include metadata_low_watermark
   threshold in pool status.

 - Fix stale DM thin-provisioning Documentation.

* tag 'for-4.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits)
  dm writecache: fix a crash due to reading past end of dirty_bitmap
  dm crypt: don't decrease device limits
  dm cache metadata: set dirty on all cache blocks after a crash
  dm snapshot: remove stale FIXME in snapshot_map()
  dm snapshot: improve performance by switching out_of_order_list to rbtree
  dm kcopyd: avoid softlockup in run_complete_job
  dm cache metadata: save in-core policy_hint_size to on-disk superblock
  dm thin: stop no_space_timeout worker when switching to write-mode
  dm kcopyd: return void from dm_kcopyd_copy()
  dm thin: include metadata_low_watermark threshold in pool status
  dm writecache: report start_sector in status line
  dm crypt: convert essiv from ahash to shash
  dm crypt: use wake_up_process() instead of a wait queue
  dm integrity: recalculate checksums on creation
  dm integrity: flush journal on suspend when using separate metadata device
  dm integrity: use version 2 for separate metadata
  dm integrity: allow separate metadata device
  dm integrity: add ic->start in get_data_sector()
  dm integrity: report provided data sectors in the status
  dm integrity: implement fair range locks
  ...
2018-08-17 09:52:15 -07:00
Mikulas Patocka
1e1132ea21 dm writecache: fix a crash due to reading past end of dirty_bitmap
wc->dirty_bitmap_size is in bytes so must multiply it by 8, not by
BITS_PER_LONG, to get number of bitmap_bits.

Fixes crash in find_next_bit() that was reported:
https://bugzilla.kernel.org/show_bug.cgi?id=200819

Reported-by: edo.rus@gmail.com
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-16 13:43:01 -04:00
Linus Torvalds
b219a1d2de Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:
 "A few MD fixes for 4.19-rc1:

   - several md-cluster fixes from Guoqing

   - a data corruption fix from BingJing

   - other cleanups"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md/raid5: fix data corruption of replacements after originals dropped
  drivers/md/raid5: Do not disable irq on release_inactive_stripe_list() call
  drivers/md/raid5: Use irqsave variant of atomic_dec_and_lock()
  md/r5cache: remove redundant pointer bio
  md-cluster: don't send msg if array is closing
  md-cluster: show array's status more accurate
  md-cluster: clear another node's suspend_area after the copy is finished
2018-08-14 11:03:16 -07:00
Mikulas Patocka
bc9e9cf040 dm crypt: don't decrease device limits
dm-crypt should only increase device limits, it should not decrease them.

This fixes a bug where the user could creates a crypt device with 1024
sector size on the top of scsi device that had 4096 logical block size.
The limit 4096 would be lost and the user could incorrectly send
1024-I/Os to the crypt device.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-13 15:28:41 -04:00
Coly Li
eb2b3d0345 bcache: add the missing comments for smp_mb()/smp_wmb()
Checkpatch.pl warns there are 2 locations of smp_mb() and smp_wmb()
without code comment. This patch adds the missing code comments for
these memory barrier calls.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
d0c1b89a40 bcache: remove unnecessary space before ioctl function pointer arguments
This is warned by checkpatch.pl, this patch removes the extra space.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
87418ef9f0 bcache: add missing SPDX header
The SPDX header is missing fro closure.c, super.c and util.c, this
patch adds SPDX header for GPL-2.0 into these files.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
b3cf37bfa1 bcache: move open brace at end of function definitions to next line
This is not a preferred style to place open brace '{' at the end of
function definition, checkpatch.pl reports error for such coding
style. This patch moves them into the start of the next new line.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
e1f08f1bc0 bcache: add static const prefix to char * array declarations
This patch declares char * array with const prefix in sysfs.c,
which is suggested by checkpatch.pl.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
3be11dbab6 bcache: fix code comments style
This patch fixes 3 style issues warned by checkpatch.pl,
- Comment lines are not aligned
- Comments use "/*" on subsequent lines
- Comment lines use a trailing "*/"

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
3069211be3 bcache: do not check NULL pointer before calling kmem_cache_destroy
kmem_cache_destroy() is safe for NULL pointer as input, the NULL pointer
checking is unncessary. This patch just removes the NULL pointer checking
to make code simpler.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
bc81b47e82 bcache: prefer 'help' in Kconfig
Current bcache Kconfig uses '---help---' as header of help information,
for now 'help' is prefered. This patch fixes this style by replacing
'---help---' by 'help' in bcache Kconfig file.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
2b1edd23ec bcache: fix typo 'succesfully' to 'successfully'
This patch fixes typo 'succesfully' to correct 'successfully', which is
suggested by checkpatch.pl.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
d9c61d30e8 bcache: replace '%pF' by '%pS' in seq_printf()
'%pF' and '%pf' are deprecated vsprintf pointer extensions, this patch
replace them by '%pS', which is suggested by checkpatch.pl.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
c63ca7871a bcache: fix indent by replacing blank by tabs
bch_btree_insert_check_key() has unaligned indent, or indent by blank
characters. This patch makes the indent aligned and replace blank by
tabs.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
6ae63e3501 bcache: replace printk() by pr_*() routines
There are still many places in bcache use printk to display kernel
message, which are suggested to be preplaced by pr_*() routines like
pr_err(), pr_info(), or pr_notice().

This patch replaces all printk() with a proper pr_*() routine for
bcache code.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
958bf494ec bcache: replace Symbolic permissions by octal permission numbers
Symbolic permission names are used in bcache, for now octal permission
numbers are encouraged to use for readability. This patch replaces
all symbolic permissions by octal permission numbers.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
b0d30981c0 bcache: style fixes for lines over 80 characters
This patch fixes the lines over 80 characters into more lines, to minimize
warnings by checkpatch.pl. There are still some lines exceed 80 characters,
but it is better to be a single line and I don't change them.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
fc2d5988b5 bcache: add identifier names to arguments of function definitions
There are many function definitions do not have identifier argument names,
scripts/checkpatch.pl complains warnings like this,

 WARNING: function definition argument 'struct bcache_device *' should
  also have an identifier name
  #16735: FILE: writeback.h:120:
  +void bch_sectors_dirty_init(struct bcache_device *);

This patch adds identifier argument names to all bcache function
definitions to fix such warnings.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
1fae7cf052 bcache: style fix to add a blank line after declarations
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
6f10f7d1b0 bcache: style fix to replace 'unsigned' by 'unsigned int'
This patch fixes warning reported by checkpatch.pl by replacing 'unsigned'
with 'unsigned int'.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
46451874c7 bcache: fix error setting writeback_rate through sysfs interface
Commit ea8c5356d3 ("bcache: set max writeback rate when I/O request
is idle") changes struct bch_ratelimit member rate from uint32_t to
atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c
to set new writeback rate, after the input is converted from memory
buf to long int by sysfs_strtoul_clamp().

The above change has a problem because there is an implicit return
inside sysfs_strtoul_clamp() so the following atomic_long_set()
won't be called. This error is detected by 0day system with following
snipped smatch warnings:

drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized
symbol 'v'.
270  sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@271 atomic_long_set(&dc->writeback_rate.rate, v);

This patch fixes the above error by using strtoul_safe_clamp() to
convert the input buffer into a long int type result.

Fixes: ea8c5356d3 ("bcache: set max writeback rate when I/O request is idle")
Cc: Kai Krakow <kai@kaishome.de>
Cc: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-10 12:18:47 -06:00
Ilya Dryomov
5b1fe7bec8 dm cache metadata: set dirty on all cache blocks after a crash
Quoting Documentation/device-mapper/cache.txt:

  The 'dirty' state for a cache block changes far too frequently for us
  to keep updating it on the fly.  So we treat it as a hint.  In normal
  operation it will be written when the dm device is suspended.  If the
  system crashes all cache blocks will be assumed dirty when restarted.

This got broken in commit f177940a80 ("dm cache metadata: switch to
using the new cursor api for loading metadata") in 4.9, which removed
the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk
flag) when loading cache blocks.  This results in data corruption on an
unclean shutdown with dirty cache blocks on the fast device.  After the
crash those blocks are considered clean and may get evicted from the
cache at any time.  This can be demonstrated by doing a lot of reads
to trigger individual evictions, but uncache is more predictable:

  ### Disable auto-activation in lvm.conf to be able to do uncache in
  ### time (i.e. see uncache doing flushing) when the fix is applied.

  # xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb
  # vgcreate vg_cache /dev/vdb /dev/vdc
  # lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb
  # lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc
  # lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc
  # lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev
  # lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev
  # xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev
  # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
  0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  # dmsetup status vg_cache-lv_slowdev
  0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -
                                                            ^^^^
                                7065 * 64k = 441M yet to be written to the slow device
  # echo b >/proc/sysrq-trigger

  # vgchange -ay vg_cache
  # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
  0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  # lvconvert --uncache vg_cache/lv_slowdev
  Flushing 0 blocks for cache vg_cache/lv_slowdev.
  Logical volume "lv_cachedev" successfully removed
  Logical volume vg_cache/lv_slowdev is not cached.
  # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
  0fe00000:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
  0fe00010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................

This is the case with both v1 and v2 cache pool metatata formats.

After applying this patch:

  # vgchange -ay vg_cache
  # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
  0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  # lvconvert --uncache vg_cache/lv_slowdev
  Flushing 3724 blocks for cache vg_cache/lv_slowdev.
  ...
  Flushing 71 blocks for cache vg_cache/lv_slowdev.
  Logical volume "lv_cachedev" successfully removed
  Logical volume vg_cache/lv_slowdev is not cached.
  # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
  0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................

Cc: stable@vger.kernel.org
Fixes: f177940a80 ("dm cache metadata: switch to using the new cursor api for loading metadata")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-09 12:14:32 -04:00
Shenghui Wang
cbb751c060 bcache: trivial - remove tailing backslash in macro BTREE_FLAG
Remove the tailing backslash in macro BTREE_FLAG in btree.h

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:19 -06:00
Shenghui Wang
e921efeb07 bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
The pr_err statement in the code for sysfs_attatch section would run
for various error codes, which maybe confusing.

E.g,

Run the command twice:
   echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
				/sys/block/bcache0/bcache/attach
   [the backing dev got attached on the first run]
   echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
				/sys/block/bcache0/bcache/attach

In dmesg, after the command run twice, we can get:
	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
a8df5e8be891
               : cache set not found
The first statement in the message was right, but the second was
confusing.

bch_cached_dev_attach has various pr_ statements for various error
codes, except ENOENT.

After the change, rerun above command twice:
	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
			/sys/block/bcache0/bcache/attach
	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
			/sys/block/bcache0/bcache/attach

In dmesg we only got:
	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
No confusing "cache set not found" message anymore.

And for some not exist SET-UUID:
	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \
			/sys/block/bcache0/bcache/attach
In dmesg we can get:
	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
a8df5e8be898
	               : cache set not found

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:17 -06:00
Coly Li
ea8c5356d3 bcache: set max writeback rate when I/O request is idle
Commit b1092c9af9 ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce performance regression because multiple faster
writeback threads of the idle bcache devices will compete the btree level
locks with the bcache device who have I/O requests coming.

This patch fixes the above issue by only permitting fast writebac when
all bcache devices attached on the cache set are idle. And if one of the
bcache devices has new I/O request coming, minimized all writeback
throughput immediately and let PI controller __update_writeback_rate()
to decide the upcoming writeback rate for each bcache device.

Also when all bcache devices are idle, limited wrieback rate to a small
number is wast of thoughput, especially when backing devices are slower
non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
rate for each backing device if the whole cache set is idle. A faster
writeback rate in idle time means new I/Os may have more available space
for dirty data, and people may observe a better write performance then.

Please note bcache may change its cache mode in run time, and this patch
still works if the cache mode is switched from writeback mode and there
is still dirty data on cache.

Fixes: Commit b1092c9af9 ("bcache: allow quick writeback when backing idle")
Cc: stable@vger.kernel.org #4.16+
Signed-off-by: Coly Li <colyli@suse.de>
Tested-by: Kai Krakow <kai@kaishome.de>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Cc: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:15 -06:00
Coly Li
b467a6ac0b bcache: add code comments for bset.c
This patch tries to add code comments in bset.c, to make some
tricky code and designment to be more comprehensible. Most information
of this patch comes from the discussion between Kent and I, he
offers very informative details. If there is any mistake
of the idea behind the code, no doubt that's from me misrepresentation.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:12 -06:00
Coly Li
0cba2e7111 bcache: fix mistaken comments in request.c
This patch updates code comment in bch_keylist_realloc() by fixing
incorrected function names, to make the code to be more comprehennsible.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:10 -06:00
Coly Li
cb329dec11 bcache: fix mistaken code comments in bcache.h
This patch updates the code comment in struct cache with correct array
names, to make the code to be more comprehensible.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:09 -06:00
Coly Li
e57fd74684 bcache: add a comment in super.c
This patch adds a line of code comment in super.c:register_bdev(), to
make code to be more comprehensible.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:07 -06:00
Coly Li
c2e8dcf7fa bcache: avoid unncessary cache prefetch bch_btree_node_get()
In bch_btree_node_get() the read-in btree node will be partially
prefetched into L1 cache for following bset iteration (if there is).
But if the btree node read is failed, the perfetch operations will
waste L1 cache space. This patch checkes whether read operation and
only does cache prefetch when read I/O succeeded.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:05 -06:00
Coly Li
b4cb6efc1a bcache: display rate debug parameters to 0 when writeback is not running
When writeback is not running, writeback rate should be 0, other value is
misleading. And the following dyanmic writeback rate debug parameters
should be 0 too,
	rate, proportional, integral, change
otherwise they are misleading when writeback is not running.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:03 -06:00
Coly Li
78ac210717 bcache: do not check return value of debugfs_create_dir()
Greg KH suggests that normal code should not care about debugfs. Therefore
no matter successful or failed of debugfs_create_dir() execution, it is
unncessary to check its return value.

There are two functions called debugfs_create_dir() and check the return
value, which are bch_debug_init() and closure_debug_init(). This patch
changes these two functions from int to void type, and ignore return values
of debugfs_create_dir().

This patch does not fix exact bug, just makes things work as they should.

Signed-off-by: Coly Li <colyli@suse.de>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Cc: Kai Krakow <kai@kaishome.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:01 -06:00
Mike Snitzer
c9a5e6a968 dm snapshot: remove stale FIXME in snapshot_map()
Commit ae1093be ("dm snapshot: use mutex instead of rw_semaphore")
eliminated the need to worry about read vs write locking.  So remove a
FIXME in snapshot_map() that is concerned about selectively taking a
write lock.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-08 20:50:58 -04:00
David Jeffery
3db2776d9f dm snapshot: improve performance by switching out_of_order_list to rbtree
copy_complete()'s processing of out_of_order_list can result in
quadratic complexity in the worst case.  As such it was the source of
consuming too much cpu and the source of significant loss in
performance.

Fix this by converting out_of_order_list to an rbtree.  This improved
a dm-snapshot test copy workload from 32 seconds to 4 seconds.

Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Brett Hull <bhull@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-08 10:41:49 -04:00
John Pittman
784c9a29e9 dm kcopyd: avoid softlockup in run_complete_job
It was reported that softlockups occur when using dm-snapshot ontop of
slow (rbd) storage.  E.g.:

[ 4047.990647] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:23:26177]
...
[ 4048.034151] Workqueue: kcopyd do_work [dm_mod]
[ 4048.034156] RIP: 0010:copy_callback+0x41/0x160 [dm_snapshot]
...
[ 4048.034190] Call Trace:
[ 4048.034196]  ? __chunk_is_tracked+0x70/0x70 [dm_snapshot]
[ 4048.034200]  run_complete_job+0x5f/0xb0 [dm_mod]
[ 4048.034205]  process_jobs+0x91/0x220 [dm_mod]
[ 4048.034210]  ? kcopyd_put_pages+0x40/0x40 [dm_mod]
[ 4048.034214]  do_work+0x46/0xa0 [dm_mod]
[ 4048.034219]  process_one_work+0x171/0x370
[ 4048.034221]  worker_thread+0x1fc/0x3f0
[ 4048.034224]  kthread+0xf8/0x130
[ 4048.034226]  ? max_active_store+0x80/0x80
[ 4048.034227]  ? kthread_bind+0x10/0x10
[ 4048.034231]  ret_from_fork+0x35/0x40
[ 4048.034233] Kernel panic - not syncing: softlockup: hung tasks

Fix this by calling cond_resched() after run_complete_job()'s callout to
the dm_kcopyd_notify_fn (which is dm-snap.c:copy_callback in the above
trace).

Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-08 09:16:24 -04:00
Mike Snitzer
fd2fa95416 dm cache metadata: save in-core policy_hint_size to on-disk superblock
policy_hint_size starts as 0 during __write_initial_superblock().  It
isn't until the policy is loaded that policy_hint_size is set in-core
(cmd->policy_hint_size).  But it never got recorded in the on-disk
superblock because __commit_transaction() didn't deal with transfering
the in-core cmd->policy_hint_size to the on-disk superblock.

The in-core cmd->policy_hint_size gets initialized by metadata_open()'s
__begin_transaction_flags() which re-reads all superblock fields.
Because the superblock's policy_hint_size was never properly stored, when
the cache was created, hints_array_available() would always return false
when re-activating a previously created cache.  This means
__load_mappings() always considered the hints invalid and never made use
of the hints (these hints served to optimize).

Another detremental side-effect of this oversight is the cache_check
utility would fail with: "invalid hint width: 0"

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-07 14:30:30 -04:00
Hou Tao
75294442d8 dm thin: stop no_space_timeout worker when switching to write-mode
Now both check_for_space() and do_no_space_timeout() will read & write
pool->pf.error_if_no_space.  If these functions run concurrently, as
shown in the following case, the default setting of "queue_if_no_space"
can get lost.

precondition:
    * error_if_no_space = false (aka "queue_if_no_space")
    * pool is in Out-of-Data-Space (OODS) mode
    * no_space_timeout worker has been queued

CPU 0:                          CPU 1:
// delete a thin device
process_delete_mesg()
// check_for_space() invoked by commit()
set_pool_mode(pool, PM_WRITE)
    pool->pf.error_if_no_space = \
     pt->requested_pf.error_if_no_space

				// timeout, pool is still in OODS mode
				do_no_space_timeout
				    // "queue_if_no_space" config is lost
				    pool->pf.error_if_no_space = true
    pool->pf.mode = new_mode

Fix it by stopping no_space_timeout worker when switching to write mode.

Fixes: bcc696fac1 ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-07 14:30:29 -04:00
Jens Axboe
05b9ba4b55 Linux 4.18-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
 sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
 1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
 9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
 mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
 L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
 vJJlibI=
 =42a5
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc6' into for-4.19/block2

Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:32:09 -06:00
BingJing Chang
d63e2fc804 md/raid5: fix data corruption of replacements after originals dropped
During raid5 replacement, the stripes can be marked with R5_NeedReplace
flag. Data can be read from being-replaced devices and written to
replacing spares without reading all other devices. (It's 'replace'
mode. s.replacing = 1) If a being-replaced device is dropped, the
replacement progress will be interrupted and resumed with pure recovery
mode. However, existing stripes before being interrupted cannot read
from the dropped device anymore. It prints lots of WARN_ON messages.
And it results in data corruption because existing stripes write
problematic data into its replacement device and update the progress.

\# Erase disks (1MB + 2GB)
dd if=/dev/zero of=/dev/sda bs=1MB count=2049
dd if=/dev/zero of=/dev/sdb bs=1MB count=2049
dd if=/dev/zero of=/dev/sdc bs=1MB count=2049
dd if=/dev/zero of=/dev/sdd bs=1MB count=2049
mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152
\# Ensure array stores non-zero data
dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB
\# Start replacement
mdadm /dev/md0 -a /dev/sdd
mdadm /dev/md0 --replace /dev/sda

Then, Hot-plug out /dev/sda during recovery, and wait for recovery done.
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0.

Soon after you hot-plug out /dev/sda, you will see many WARN_ON
messages. The replacement recovery will be interrupted shortly. After
the recovery finishes, it will result in data corruption.

Actually, it's just an unhandled case of replacement. In commit
<f94c0b6658c7> (md/raid5: fix interaction of 'replace' and 'recovery'.),
if a NeedReplace device is not UPTODATE then that is an error, the
commit just simply print WARN_ON but also mark these corrupted stripes
with R5_WantReplace. (it means it's ready for writes.)

To fix this case, we can leverage 'sync and replace' mode mentioned in
commit <9a3e1101b827> (md/raid5: detect and handle replacements during
recovery.). We can add logics to detect and use 'sync and replace' mode
for these stripes.

Reported-by: Alex Chen <alexchen@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-02 11:22:06 -07:00
Andy Shevchenko
e64e4018d5 md: Avoid namespace collision with bitmap API
bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods.

On the other hand MD bitmap API is special case.
Adding 'md' prefix to it to avoid name space collision.

No functional changes intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2018-08-01 15:49:39 -07:00
Andy Shevchenko
5cc9cdf631 dm: Avoid namespace collision with bitmap API
bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods.

On the other hand DM bitmap API is special case.
Adding 'dm' prefix to it to avoid potential name space collision.

No functional changes intended.

Suggested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2018-08-01 15:49:38 -07:00
Mike Snitzer
7209049d40 dm kcopyd: return void from dm_kcopyd_copy()
dm_kcopyd_copy() only ever returns 0 so there is no need for callers to
account for possible failure.  Same goes for dm_kcopyd_zero().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-31 17:33:21 -04:00
Huaisheng Ye
f742267ae9 md/dm-writecache: Don't request pointer dummy_addr when not required
Function persistent_memory_claim doesn't need to get local pointer
dummy_addr from direct_access. Using NULL instead of having to pass
in a useless local pointer that caller then just throw away.

Suggested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-30 09:40:04 -07:00
Andy Grover
63c8ecb626 dm thin: include metadata_low_watermark threshold in pool status
The metadata low watermark threshold is set by the kernel.  But the
kernel depends on userspace to extend the thinpool metadata device when
the threshold is crossed.

Since the metadata low watermark threshold is not visible to userspace,
upon receiving an event, userspace cannot tell that the kernel wants the
metadata device extended, instead of some other eventing condition.
Making it visible (but not settable) enables userspace to affirmatively
know the kernel is asking for a metadata device extension, by comparing
metadata_low_watermark against nr_free_blocks_metadata, also reported in
status.

Current solutions like dmeventd have their own thresholds for extending
the data and metadata devices, and both devices are checked against
their thresholds on each event.  This lessens the value of the kernel-set
threshold, since userspace will either extend the metadata device sooner,
when receiving another event; or will receive the metadata lowater event
and do nothing, if dmeventd's threshold is less than the kernel's.
(This second case is dangerous. The metadata lowater event will not be
re-sent, so no further event will be generated before the metadata
device is out if space, unless some other event causes userspace to
recheck its thresholds.)

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-30 11:49:08 -04:00
Mikulas Patocka
9ff07e7d63 dm writecache: report start_sector in status line
Fixes: d284f8248c ("dm writecache: support optional offset for start of device")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:28:58 -04:00