2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

6692 Commits

Author SHA1 Message Date
Mike Snitzer
f0e90b6c66 dm raid: remove unnecessary discard limits for raid10
Commit bcc90d2804 ("md/raid10: improve raid10 discard request")
removes raid10's inability to properly handle large discards.  So
eliminate associated constraint from dm-raid's raid10 support.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:10 -04:00
Mike Snitzer
e0910c8e4f dm raid: fix discard limits for raid1 and raid10
Block core warned that discard_granularity was 0 for dm-raid with
personality of raid1.  Reason is that raid_io_hints() was incorrectly
special-casing raid1 rather than raid0.

But since commit 29efc390b9 ("md/md0: optimize raid0 discard
handling") even raid0 properly handles large discards.

Fix raid_io_hints() by removing discard limits settings for raid1.
Also, fix limits for raid10 by properly stacking underlying limits as
done in blk_stack_limits().

Depends-on: 29efc390b9 ("md/md0: optimize raid0 discard handling")
Fixes: 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:09 -04:00
Mikulas Patocka
cd74693870 dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
Don't use crypto drivers that have the flag CRYPTO_ALG_ALLOCATES_MEMORY
set. These drivers allocate memory and thus they are unsuitable for block
I/O processing.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:08 -04:00
Mike Snitzer
d4a512edcc dm: use dm_table_get_device_name() where appropriate in targets
dm_table_get_device_name() avoids calling dm_table_get_md() followed by
dm_device_name() -- saves intermediate dm_table_get_md() call.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:08 -04:00
Mike Snitzer
33bd6f0693 dm table: make 'struct dm_table' definition accessible to all of DM core
Move 'struct dm_table' definition from dm-table.c to dm-core.h and
update DM core to access its members directly.

Helps optimize max_io_len() and other methods slightly.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:07 -04:00
Mike Snitzer
7465d7ac50 dm: eliminate need for start_io_acct() forward declaration
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:06 -04:00
Mike Snitzer
9679b5a7ec dm: simplify __process_abnormal_io()
Only call bio_op() once in switch statement.  Also remove the
excessive factoring out to one line functions.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:05 -04:00
Mike Snitzer
828678b87e dm: push use of on-stack flush_bio down to __send_empty_flush()
Eliminates duplicate code, no functional change.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:04 -04:00
Mike Snitzer
3720281db9 dm: optimize max_io_len() by inlining max_io_len_target_boundary()
Saves redundant dm_target_offset() math.

Also, reverse argument order for max_io_len() to be consistent with
other similar functions.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:04 -04:00
Mike Snitzer
094ee64d7d dm: push md->immutable_target optimization down to __process_bio()
Also, update associated stale comment in __bind().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:03 -04:00
Mike Snitzer
5091cdec56 dm: change max_io_len() to use blk_max_size_offset()
Using blk_max_size_offset() enables DM core's splitting to impose
ti->max_io_len (via q->limits.chunk_sectors) and also fallback to
respecting q->limits.max_sectors if chunk_sectors isn't set.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:02 -04:00
Mike Snitzer
882ec4e609 dm table: stack 'chunk_sectors' limit to account for target-specific splitting
If target set ti->max_io_len it must be used when stacking
DM device's queue_limits to establish a 'chunk_sectors' that is
compatible with the IO stack.

By using lcm_not_zero() care is taken to avoid blindly overriding the
chunk_sectors limit stacked up by blk_stack_limits().

Depends-on: 07d098e6bb ("block: allow 'chunk_sectors' to be non-power-of-2")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:33:01 -04:00
Mike Snitzer
1471308fb5 Merge remote-tracking branch 'jens/for-5.10/block' into dm-5.10
DM depends on these block 5.10 commits:

22ada802ed block: use lcm_not_zero() when stacking chunk_sectors
07d098e6bb block: allow 'chunk_sectors' to be non-power-of-2
021a24460d block: add QUEUE_FLAG_NOWAIT
6abc49468e dm: add support for REQ_NOWAIT and enable it for linear target

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:31:35 -04:00
Konstantin Khlebnikov
6abc49468e dm: add support for REQ_NOWAIT and enable it for linear target
Add DM target feature flag DM_TARGET_NOWAIT which advertises that
target works with REQ_NOWAIT bios.

Add dm_table_supports_nowait() and update dm_table_set_restrictions()
to set/clear QUEUE_FLAG_NOWAIT accordingly.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:20:03 -06:00
Christoph Hellwig
4245e52d25 md: don't detour through bd_contains for the gendisk
bd_disk is set on all block devices, including those for partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:18:57 -06:00
Christoph Hellwig
61a27e1f52 md: compare bd_disk instead of bd_contains
To check for partitions of the same disk bd_contains works as well, but
bd_disk is way more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:18:57 -06:00
Christoph Hellwig
fa01b1e973 block: add a bdev_is_partition helper
Add a littler helper to make the somewhat arcane bd_contains checks a
little more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:18:57 -06:00
Xiao Ni
d3ee2d8415 md/raid10: improve discard request for far layout
For far layout, the discard region is not continuous on disks. So it needs
far copies r10bio to cover all regions. It needs a way to know all r10bios
have finish or not. Similar with raid10_sync_request, only the first r10bio
master_bio records the discard bio. Other r10bios master_bio record the
first r10bio. The first r10bio can finish after other r10bios finish and
then return the discard bio.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Xiao Ni
bcc90d2804 md/raid10: improve raid10 discard request
Now the discard request is split by chunk size. So it takes a long time
to finish mkfs on disks which support discard function. This patch improve
handling raid10 discard request. It uses the similar way with patch
29efc390b (md/md0: optimize raid0 discard handling).

But it's a little complex than raid0. Because raid10 has different layout.
If raid10 is offset layout and the discard request is smaller than stripe
size. There are some holes when we submit discard bio to underlayer disks.

For example: five disks (disk1 - disk5)
D01 D02 D03 D04 D05
D05 D01 D02 D03 D04
D06 D07 D08 D09 D10
D10 D06 D07 D08 D09
The discard bio just wants to discard from D03 to D10. For disk3, there is
a hole between D03 and D08. For disk4, there is a hole between D04 and D09.
D03 is a chunk, raid10_write_request can handle one chunk perfectly. So
the part that is not aligned with stripe size is still handled by
raid10_write_request.

If reshape is running when discard bio comes and the discard bio spans the
reshape position, raid10_write_request is responsible to handle this
discard bio.

I did a test with this patch set.
Without patch:
time mkfs.xfs /dev/md0
real4m39.775s
user0m0.000s
sys0m0.298s

With patch:
time mkfs.xfs /dev/md0
real0m0.105s
user0m0.000s
sys0m0.007s

nvme3n1           259:1    0   477G  0 disk
└─nvme3n1p1       259:10   0    50G  0 part
nvme4n1           259:2    0   477G  0 disk
└─nvme4n1p1       259:11   0    50G  0 part
nvme5n1           259:6    0   477G  0 disk
└─nvme5n1p1       259:12   0    50G  0 part
nvme2n1           259:9    0   477G  0 disk
└─nvme2n1p1       259:15   0    50G  0 part
nvme0n1           259:13   0   477G  0 disk
└─nvme0n1p1       259:14   0    50G  0 part

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Xiao Ni
f046f5d0d7 md/raid10: pull codes that wait for blocked dev into one function
The following patch will reuse these logics, so pull the same codes into
one function.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Xiao Ni
8650a88901 md/raid10: extend r10bio devs to raid disks
Now it allocs r10bio->devs[conf->copies]. Discard bio needs to submit
to all member disks and it needs to use r10bio. So extend to
r10bio->devs[geo.raid_disks].

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Xiao Ni
2628089b74 md: add md_submit_discard_bio() for submitting discard bio
Move these logic from raid0.c to md.c, so that we can also use it in
raid10.c.

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Zhen Lei
e287308b83 md: Simplify code with existing definition RESYNC_SECTORS in raid10.c
#define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)

"RESYNC_BLOCK_SIZE/512" is equal to "RESYNC_BLOCK_SIZE >> 9", replace it
with RESYNC_SECTORS.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Yufen Yu
3891258443 md/raid5: reallocate page array after setting new stripe_size
When try to resize stripe_size, we also need to free old
shared page array and allocate new.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Yufen Yu
f16acaf328 md/raid5: resize stripe_head when reshape array
When reshape array, we try to reuse shared pages of old stripe_head,
and allocate more for the new one if needed.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Yufen Yu
046169f048 md/raid5: let multiple devices of stripe_head share page
In current implementation, grow_buffers() uses alloc_page() to
allocate the buffers for each stripe_head, i.e. allocate a page
for each dev[i] in stripe_head.

After setting stripe_size as a configurable value by writing
sysfs entry, it means that we always allocate 64K buffers, but
just use 4K of them when stripe_size is 4K in 64KB arm64.

To avoid wasting memory, we try to let multiple sh->dev share
one real page. That means, multiple sh->dev[i].page will point
to the only page with different offset. Example of 64K PAGE_SIZE
and 4K stripe_size as following:

                    64K PAGE_SIZE
          +---+---+---+---+------------------------------+
          |   |   |   |   |
          |   |   |   |   |
          +-+-+-+-+-+-+-+-+------------------------------+
            ^   ^   ^   ^
            |   |   |   +----------------------------+
            |   |   |                                |
            |   |   +-------------------+            |
            |   |                       |            |
            |   +----------+            |            |
            |              |            |            |
            +-+            |            |            |
              |            |            |            |
        +-----+-----+------+-----+------+-----+------+------+
sh      | offset(0) | offset(4K) | offset(8K) | offset(12K) |
 +      +-----------+------------+------------+-------------+
 +----> dev[0].page  dev[1].page  dev[2].page  dev[3].page

A new 'pages' array will be added into stripe_head to record shared
page used by this stripe_head. Allocate them when grow_buffers()
and free them when shrink_buffers().

After trying to share page, the users of sh->dev[i].page need to take
care of the related page offset: page of issued bio and page passed
to xor compution functions. But thanks for previous different page offset
supported. Here, we just need to set correct dev[i].offset.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Yufen Yu
4f86ff5580 md/raid6: let async recovery function support different page offset
For now, asynchronous raid6 recovery calculate functions are require
common offset for pages. But, we expect them to support different page
offset after introducing stripe shared page. Do that by simplily adding
page offset where each page address are referred. Then, replace the
old interface with the new ones in raid6 and raid6test.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Yufen Yu
d69454bc9f md/raid6: let syndrome computor support different page offset
For now, syndrome compute functions require common offset in the pages
array. However, we expect them to support different offset when try to
use shared page in the following. Simplily covert them by adding page
offset where each page address are referred.

Since the only caller of async_gen_syndrome() and async_syndrome_val()
are in raid6, we don't want to reserve the old interface but modify the
interface directly. After that, replacing old interfaces with new ones
for raid6 and raid6test.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Yufen Yu
a7c224a820 md/raid5: convert to new xor compution interface
We try to replace async_xor() and async_xor_val() with the new
introduced interface async_xor_offs() and async_xor_val_offs()
for raid456.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Yufen Yu
248728dd04 md/raid5: make async_copy_data() to support different page offset
ops_run_biofill() and ops_run_biodrain() will call async_copy_data()
to copy sh->dev[i].page from or to bio page. For now, it implies the
offset of dev[i].page is 0. But we want to support different page offset
in the following.

Thus, pass page offset to these functions and replace 'page_offset'
with 'page_offset + poff'.

No functional change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Yufen Yu
7aba13b715 md/raid5: add a new member of offset into r5dev
Add a new member of offset into struct r5dev. It indicates the
offset of related dev[i].page. For now, since each device have a
privated page, the value is always 0. Thus, we set offset as 0
when allcate page in grow_buffers() and resize_stripes().

To support following different page offset, we try to use the page
offset rather than '0' directly for async_memcpy() and ops_run_io().

We try to support different page offset for xor compution functions
in the following. To avoid repeatly allocate a new array each time,
we add a memory region into scribble buffer to record offset.

No functional change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Xianting Tian
313b825fa2 md: only calculate blocksize once and use i_blocksize()
We alreday has the interface i_blocksize(), which can be used
to get blocksize, so use it.
Only calculate blocksize once and use it within read_page().

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Jens Axboe
ac8f7a0264 Merge branch 'for-5.10/block' into for-5.10/drivers
* for-5.10/block: (140 commits)
  bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
  bdi: invert BDI_CAP_NO_ACCT_WB
  bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
  mm: use SWP_SYNCHRONOUS_IO more intelligently
  bdi: remove BDI_CAP_SYNCHRONOUS_IO
  bdi: remove BDI_CAP_CGROUP_WRITEBACK
  block: lift setting the readahead size into the block layer
  md: update the optimal I/O size on reshape
  bdi: initialize ->ra_pages and ->io_pages in bdi_init
  aoe: set an optimal I/O size
  bcache: inherit the optimal I/O size
  drbd: remove dead code in device_to_statistics
  fs: remove the unused SB_I_MULTIROOT flag
  block: mark blkdev_get static
  PM: mm: cleanup swsusp_swap_check
  mm: split swap_type_of
  PM: rewrite is_hibernate_resume_dev to not require an inode
  mm: cleanup claim_swapfile
  ocfs2: cleanup o2hb_region_dev_store
  dasd: cleanup dasd_scan_partitions
  ...
2020-09-24 13:44:39 -06:00
Christoph Hellwig
1cb039f3dc bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
also is writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
c2e4cd57cf block: lift setting the readahead size into the block layer
Drivers shouldn't really mess with the readahead size, as that is a VM
concept.  Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk.  Also set
bdi->io_pages there as well by applying the same scheme based on
max_sectors.  To ensure the limits work well for stacking drivers a
new helper is added to update the readahead limits from the block
limits, which is also called from disk_stack_limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
16ef510139 md: update the optimal I/O size on reshape
The raid5 and raid10 drivers currently update the read-ahead size,
but not the optimal I/O size on reshape.  To prepare for deriving the
read-ahead size from the optimal I/O size make sure it is updated
as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
5d4ce78b25 bcache: inherit the optimal I/O size
Inherit the optimal I/O size setting just like the readahead window,
as any reason to do larger I/O does not apply to just readahead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:38 -06:00
Mike Snitzer
cf9c378655 dm: fix comment in dm_process_bio()
Refer to the correct function (->submit_bio instead of ->queue_bio).
Also, add details about why using blk_queue_split() isn't needed for
dm_wq_work()'s call to dm_process_bio().

Fixes: c62b37d96b ("block: move ->make_request_fn to struct block_device_operations")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-21 19:49:15 -04:00
Mike Snitzer
ee1dfad532 dm: fix bio splitting and its bio completion order for regular IO
dm_queue_split() is removed because __split_and_process_bio() _must_
handle splitting bios to ensure proper bio submission and completion
ordering as a bio is split.

Otherwise, multiple recursive calls to ->submit_bio will cause multiple
split bios to be allocated from the same ->bio_split mempool at the same
time. This would result in deadlock in low memory conditions because no
progress could be made (only one bio is available in ->bio_split
mempool).

This fix has been verified to still fix the loss of performance, due
to excess splitting, that commit 120c9257f5 provided.

Fixes: 120c9257f5 ("Revert "dm: always call blk_queue_split() in dm_process_bio()"")
Cc: stable@vger.kernel.org # 5.0+, requires custom backport due to 5.9 changes
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-21 19:42:20 -04:00
Jan Kara
e2ec512825 dm: Call proper helper to determine dax support
DM was calling generic_fsdax_supported() to determine whether a device
referenced in the DM table supports DAX. However this is a helper for "leaf" device drivers so that
they don't have to duplicate common generic checks. High level code
should call dax_supported() helper which that calls into appropriate
helper for the particular device. This problem manifested itself as
kernel messages:

dm-3: error: dax access failed (-95)

when lvm2-testsuite run in cases where a DM device was stacked on top of
another DM device.

Fixes: 7bf7eac8d6 ("dax: Arrange for dax_supported check to span multiple devices")
Cc: <stable@vger.kernel.org>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/160061715195.13131.5503173247632041975.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-09-20 08:55:09 -07:00
Dan Williams
02186d8897 dm/dax: Fix table reference counts
A recent fix to the dm_dax_supported() flow uncovered a latent bug. When
dm_get_live_table() fails it is still required to drop the
srcu_read_lock(). Without this change the lvm2 test-suite triggers this
warning:

    # lvm2-testsuite --only pvmove-abort-all.sh

    WARNING: lock held when returning to user space!
    5.9.0-rc5+ #251 Tainted: G           OE
    ------------------------------------------------
    lvm/1318 is leaving the kernel with locks still held!
    1 lock held by lvm/1318:
     #0: ffff9372abb5a340 (&md->io_barrier){....}-{0:0}, at: dm_get_live_table+0x5/0xb0 [dm_mod]

...and later on this hang signature:

    INFO: task lvm:1344 blocked for more than 122 seconds.
          Tainted: G           OE     5.9.0-rc5+ #251
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:lvm             state:D stack:    0 pid: 1344 ppid:     1 flags:0x00004000
    Call Trace:
     __schedule+0x45f/0xa80
     ? finish_task_switch+0x249/0x2c0
     ? wait_for_completion+0x86/0x110
     schedule+0x5f/0xd0
     schedule_timeout+0x212/0x2a0
     ? __schedule+0x467/0xa80
     ? wait_for_completion+0x86/0x110
     wait_for_completion+0xb0/0x110
     __synchronize_srcu+0xd1/0x160
     ? __bpf_trace_rcu_utilization+0x10/0x10
     __dm_suspend+0x6d/0x210 [dm_mod]
     dm_suspend+0xf6/0x140 [dm_mod]

Fixes: 7bf7eac8d6 ("dax: Arrange for dax_supported check to span multiple devices")
Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reported-by: Adrian Huang <ahuang12@lenovo.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Link: https://lore.kernel.org/r/160045867590.25663.7548541079217827340.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-09-20 08:33:56 -07:00
Song Liu
0806e60f31 bcache: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acct
This enables proper statistics in /proc/diskstats for bcache partitions.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11 16:41:30 -06:00
Song Liu
00fe60eae9 md: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acct
This enables proper statistics in /proc/diskstats for md partitions.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11 16:41:30 -06:00
Christoph Hellwig
818077d6e0 md: use bdev_check_media_change
The md driver does not have a ->revalidate_disk method, so it can just
use bdev_check_media_change without any additional changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10 09:32:31 -06:00
Ye Bin
3a653b205f dm thin metadata: Fix use-after-free in dm_bm_set_read_only
The following error ocurred when testing disk online/offline:

[  301.798344] device-mapper: thin: 253:5: aborting current metadata transaction
[  301.848441] device-mapper: thin: 253:5: failed to abort metadata transaction
[  301.849206] Aborting journal on device dm-26-8.
[  301.850489] EXT4-fs error (device dm-26) in __ext4_new_inode:943: Journal has aborted
[  301.851095] EXT4-fs (dm-26): Delayed block allocation failed for inode 398742 at logical offset 181 with max blocks 19 with error 30
[  301.854476] BUG: KASAN: use-after-free in dm_bm_set_read_only+0x3a/0x40 [dm_persistent_data]

Reason is:

 metadata_operation_failed
    abort_transaction
        dm_pool_abort_metadata
	    __create_persistent_data_objects
	        r = __open_or_format_metadata
	        if (r) --> If failed will free pmd->bm but pmd->bm not set NULL
		    dm_block_manager_destroy(pmd->bm);
    set_pool_mode
	dm_pool_metadata_read_only(pool->pmd);
	dm_bm_set_read_only(pmd->bm);  --> use-after-free

Add checks to see if pmd->bm is NULL in dm_bm_set_read_only and
dm_bm_set_read_write functions.  If bm is NULL it means creating the
bm failed and so dm_bm_is_read_only must return true.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-02 13:38:40 -04:00
Ye Bin
219403d7e5 dm thin metadata: Avoid returning cmd->bm wild pointer on error
Maybe __create_persistent_data_objects() caller will use PTR_ERR as a
pointer, it will lead to some strange things.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-02 13:38:31 -04:00
Ye Bin
d16ff19e69 dm cache metadata: Avoid returning cmd->bm wild pointer on error
Maybe __create_persistent_data_objects() caller will use PTR_ERR as a
pointer, it will lead to some strange things.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-02 13:38:24 -04:00
Christoph Hellwig
de09077c89 block: remove revalidate_disk()
Remove the now unused helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02 08:00:26 -06:00
Christoph Hellwig
659e56ba86 block: add a new revalidate_disk_size helper
revalidate_disk is a relative awkward helper for driver use, as it first
calls an optional driver method and then updates the block device size,
while most callers either don't need the method call at all, or want to
keep state between the caller and the called method.

Add a revalidate_disk_size helper that just performs the update of the
block device size from the gendisk one, and switch all drivers that do
not implement ->revalidate_disk to use the new helper instead of
revalidate_disk()

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02 08:00:07 -06:00
Christoph Hellwig
c2b4bb8cb3 block: fix locking for struct block_device size updates
Two different callers use two different mutexes for updating the
block device size, which obviously doesn't help to actually protect
against concurrent updates from the different callers.  In addition
one of the locks, bd_mutex is rather prone to deadlocks with other
parts of the block stack that use it for high level synchronization.

Switch to using a new spinlock protecting just the size updates, as
that is all we need, and make sure everyone does the update through
the proper helper.

This fixes a bug reported with the nvme revalidating disks during a
hot removal operation, which can currently deadlock on bd_mutex.

Reported-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:25 -06:00
Mikulas Patocka
e27fec66f0 dm integrity: fix error reporting in bitmap mode after creation
The dm-integrity target did not report errors in bitmap mode just after
creation. The reason is that the function integrity_recalc didn't clean up
ic->recalc_bitmap as it proceeded with recalculation.

Fix this by updating the bitmap accordingly -- the double shift serves
to rounddown.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 468dfca38b ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-01 16:41:57 -04:00
Damien Le Moal
7785a9e4c2 dm crypt: Initialize crypto wait structures
Use the DECLARE_CRYPTO_WAIT() macro to properly initialize the crypto
wait structures declared on stack before their use with
crypto_wait_req().

Fixes: 39d13a1ac4 ("dm crypt: reuse eboiv skcipher for IV generation")
Fixes: bbb1658461 ("dm crypt: Implement Elephant diffuser for Bitlocker compatibility")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-01 16:41:57 -04:00
Mike Snitzer
c322ee9320 dm mpath: fix racey management of PG initialization
Commit 935fcc56ab ("dm mpath: only flush workqueue when needed")
changed flush_multipath_work() to avoid needless workqueue
flushing (of a multipath global workqueue). But that change didn't
realize the surrounding flush_multipath_work() code should also only
run if 'pg_init_in_progress' is set.

Fix this by only doing all of flush_multipath_work()'s PG init related
work if 'pg_init_in_progress' is set.

Otherwise multipath_wait_for_pg_init_completion() will run
unconditionally but the preceeding flush_workqueue(kmpath_handlerd)
may not. This could lead to deadlock (though only if kmpath_handlerd
never runs a corresponding work to decrement 'pg_init_in_progress').

It could also be, though highly unlikely, that the kmpath_handlerd
work that does PG init completes before 'pg_init_in_progress' is set,
and then an intervening DM table reload's multipath_postsuspend()
triggers flush_multipath_work().

Fixes: 935fcc56ab ("dm mpath: only flush workqueue when needed")
Cc: stable@vger.kernel.org
Reported-by: Ben Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-01 16:41:56 -04:00
Mikulas Patocka
f9e040efcc dm writecache: handle DAX to partitions on persistent memory correctly
The function dax_direct_access doesn't take partitions into account,
it always maps pages from the beginning of the device. Therefore,
persistent_memory_claim() must get the partition offset using
get_start_sect() and add it to the page offsets passed to
dax_direct_access().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-01 16:41:55 -04:00
Linus Torvalds
4d41ead6ea block-5.9-2020-08-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9JYHoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjMMD/9pp/oWPRv2pRXf7JBdQUh7zznIg84Uzv6c
 6Bkfq3eGRk/Rvv21kqbI5gZOnp3/se4M3DNBph89/6Q3idlHAot9UCIAVtsQ2+I6
 p/kNKMMkzIguxP0B6pb9aQTjuvSxUOuxNgGVDU8akPktjDj7Gv/gxIQVchAHqXob
 GL1vpBK/fvjiDHrXCCFVldCsm98P3FT7DzRvXkGXVW7Hg5KHuUVgCZtzeDMaB/dx
 YeGenGm5ZCOLvM1N+jgmF89XSqUmrWlZz8iqFVGJfHRf0G0QwDc2mpdNusnZdXDr
 sf1oG+uXszTTcK/w7/MPaHJyj6s7cKJOsXEnNUjU4VloOZvuemk3W8+t5qQ/+S/C
 usPuUa+QhiqoDLa7YyYgNSO6lx70s+S58TLAEgRbU7SHq/g8huOBQ3YT0MYuhX9C
 ZVGZfa2Yb+zWoQVFPqJcClCNjqzLihSo4qZnGSXiY7aDYM088mzLWnnFLkQ+axUq
 lecWHGOkF5ComTePQV31YCHx9uPtRZzyRJ+bENbFIcQc/CztpARGbADAaphZjSmV
 LXqHJyyNHlIo5IVITQm/qbdU9SdWS8bGNoYkB9Fa53sSWt6NIRfvXyWPE1OuyJ7C
 ICaXyMvo+pM5nXo28BnPwO30tfN5byqnvb7P0g3l7j3Ej/ujkhGPmWDyvj27cnve
 nrKgMZ9bXw==
 =UytK
 -----END PGP SIGNATURE-----

Merge tag 'block-5.9-2020-08-28' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - nbd timeout fix (Hou)

 - device size fix for loop LOOP_CONFIGURE (Martijn)

 - MD pull from Song with raid5 stripe size fix (Yufen)

* tag 'block-5.9-2020-08-28' of git://git.kernel.dk/linux-block:
  md/raid5: make sure stripe_size as power of two
  loop: Set correct device size when using LOOP_CONFIGURE
  nbd: restore default timeout when setting it to zero
2020-08-28 16:38:29 -07:00
Yufen Yu
6af10a33c5 md/raid5: make sure stripe_size as power of two
Commit 3b5408b98e ("md/raid5: support config stripe_size by sysfs
entry") make stripe_size as a configurable value. It just requires
stripe_size as multiple of 4KB.

In fact, we should make sure stripe_size as power of two. Otherwise,
stripe_shift which is the result of ilog2 can not represent the real
stripe_size. Then, stripe_hash() and stripe_hash_locks_hash() may
get unexpected value.

Fixes: 3b5408b98e ("md/raid5: support config stripe_size by sysfs entry")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-27 22:41:03 -07:00
Gustavo A. R. Silva
df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Linus Torvalds
4b6c093e21 block-5.9-2020-08-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl83DGUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplTYEACs5kPPpSVdzVsOeHj0LkfQXiSlli8dbPBo
 NB2+TIyr9NxJgxn8B4x+5/4DZgJaCoHOeyyzOocXQmvWGwWOTkrfX/OSyQOlRB5z
 dpzqF0Huhw31MSQEiwA/8lo3omBmat9cMzMa5PJYPghMGfqyQDzVJk1lIX51a1th
 oE01eBpNNsDK0OTwKrl6Rx2/OuFZnA0P3lQwgPZSLnDM6Hq+xeHTdx2LNSyE2QFv
 GzYl4dFoXg3NReLv9D57b7hE6Dc95NcCDDeU7Y3cE7XPksKMA/TkVYOD20ysJ31l
 9uzscvvcm2UugN2r0d/B35lf6NWmOG24SmkLMKTtExPGHOCQIbDAlSP/QQ4zz9pQ
 2yA+eImpQnRsCzPbGcnBzwEF3yX5+lQYmFWac+0AHDiWEWkb8e3MzNSWPZrsN+cD
 7U7c5Zw6zDEtl/naJccuZZPgQGbZgFJ/P6Wo6l5ywIPtE7wzv4MUe4eUxZhitL9M
 0ZP6WIQd8oNQdNoCYVQDwPdYJYMq7uUQFUo40vaSfntZxVKZQao7cvUHwmzVzNlZ
 v5UazETAx+4Eg6MNwfjKp+kt3rr6Xul7K9Nzn6R/cVacIU349FovUshm7WieoAUu
 niZ40gXltxj7NDwHj3p/dqesW5Nhv/qk6hlVWoi9vdmh8vAVBy/fedQfocvKrFJy
 prCI1h1UOQ==
 =10Pr
 -----END PGP SIGNATURE-----

Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes on the block side of things:

   - Discard granularity fix (Coly)

   - rnbd cleanups (Guoqing)

   - md error handling fix (Dan)

   - md sysfs fix (Junxiao)

   - Fix flush request accounting, which caused an IO slowdown for some
     configurations (Ming)

   - Properly propagate loop flag for partition scanning (Lennart)"

* tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
  block: fix double account of flush request's driver tag
  loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
  rnbd: no need to set bi_end_io in rnbd_bio_map_kern
  rnbd: remove rnbd_dev_submit_io
  md-cluster: Fix potential error pointer dereference in resize_bitmaps()
  block: check queue's limits.discard_granularity in __blkdev_issue_discard()
  md: get sysfs entry after redundancy attr group create
2020-08-15 20:36:42 -07:00
Linus Torvalds
97d052ea3f A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in various
     situations caused by the lockdep additions to seqcount to validate that
     the write side critical sections are non-preemptible.
 
   - The seqcount associated lock debug addons which were blocked by the
     above fallout.
 
     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict per
     CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot
     validate that the lock is held.
 
     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored and
     write_seqcount_begin() has a lockdep assertion to validate that the
     lock is held.
 
     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API is
     unchanged and determines the type at compile time with the help of
     _Generic which is possible now that the minimal GCC version has been
     moved up.
 
     Adding this lockdep coverage unearthed a handful of seqcount bugs which
     have been addressed already independent of this.
 
     While generaly useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if the
     writers are serialized by an associated lock, which leads to the well
     known reader preempts writer livelock. RT prevents this by storing the
     associated lock pointer independent of lockdep in the seqcount and
     changing the reader side to block on the lock when a reader detects
     that a writer is in the write side critical section.
 
  - Conversion of seqcount usage sites to associated types and initializers.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8xmPYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTuQEACyzQCjU8PgehPp9oMqWzaX2fcVyuZO
 QU2yw6gmz2oTz3ZHUNwdW8UnzGh2OWosK3kDruoD9FtSS51lER1/ISfSPCGfyqxC
 KTjOcB1Kvxwq/3LcCx7Zi3ZxWApat74qs3EhYhKtEiQ2Y9xv9rLq8VV1UWAwyxq0
 eHpjlIJ6b6rbt+ARslaB7drnccOsdK+W/roNj4kfyt+gezjBfojGRdMGQNMFcpnv
 shuTC+vYurAVIiVA/0IuizgHfwZiXOtVpjVoEWaxg6bBH6HNuYMYzdSa/YrlDkZs
 n/aBI/Xkvx+Eacu8b1Zwmbzs5EnikUK/2dMqbzXKUZK61eV4hX5c2xrnr1yGWKTs
 F/juh69Squ7X6VZyKVgJ9RIccVueqwR2EprXWgH3+RMice5kjnXH4zURp0GHALxa
 DFPfB6fawcH3Ps87kcRFvjgm6FBo0hJ1AxmsW1dY4ACFB9azFa2euW+AARDzHOy2
 VRsUdhL9CGwtPjXcZ/9Rhej6fZLGBXKr8uq5QiMuvttp4b6+j9FEfBgD4S6h8csl
 AT2c2I9LcbWqyUM9P4S7zY/YgOZw88vHRuDH7tEBdIeoiHfrbSBU7EQ9jlAKq/59
 f+Htu2Io281c005g7DEeuCYvpzSYnJnAitj5Lmp/kzk2Wn3utY1uIAVszqwf95Ul
 81ppn2KlvzUK8g==
 =7Gj+
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Thomas Gleixner:
 "A set of locking fixes and updates:

   - Untangle the header spaghetti which causes build failures in
     various situations caused by the lockdep additions to seqcount to
     validate that the write side critical sections are non-preemptible.

   - The seqcount associated lock debug addons which were blocked by the
     above fallout.

     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict
     per CPU seqcounts. As the lock is not part of the seqcount, lockdep
     cannot validate that the lock is held.

     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored
     and write_seqcount_begin() has a lockdep assertion to validate that
     the lock is held.

     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API
     is unchanged and determines the type at compile time with the help
     of _Generic which is possible now that the minimal GCC version has
     been moved up.

     Adding this lockdep coverage unearthed a handful of seqcount bugs
     which have been addressed already independent of this.

     While generally useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if
     the writers are serialized by an associated lock, which leads to
     the well known reader preempts writer livelock. RT prevents this by
     storing the associated lock pointer independent of lockdep in the
     seqcount and changing the reader side to block on the lock when a
     reader detects that a writer is in the write side critical section.

   - Conversion of seqcount usage sites to associated types and
     initializers"

* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  locking/seqlock, headers: Untangle the spaghetti monster
  locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
  x86/headers: Remove APIC headers from <asm/smp.h>
  seqcount: More consistent seqprop names
  seqcount: Compress SEQCNT_LOCKNAME_ZERO()
  seqlock: Fold seqcount_LOCKNAME_init() definition
  seqlock: Fold seqcount_LOCKNAME_t definition
  seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
  hrtimer: Use sequence counter with associated raw spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  vfs: Use sequence counter with associated spinlock
  timekeeping: Use sequence counter with associated raw spinlock
  xfrm: policy: Use sequence counters with associated lock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  sched: tasks: Use sequence counter with associated spinlock
  ...
2020-08-10 19:07:44 -07:00
Linus Torvalds
2f12d44085 - DM multipath locking fixes around m->flags tests and improvements to
bio-based code so that it follows patterns established by
   request-based code.
 
 - Request-based DM core improvement to eliminate unnecessary call to
   blk_mq_queue_stopped().
 
 - Add "panic_on_corruption" error handling mode to DM verity target.
 
 - DM bufio fix to to perform buffer cleanup from a workqueue rather
   than wait for IO in reclaim context from shrinker.
 
 - DM crypt improvement to optionally avoid async processing via
   workqueues for reads and/or writes -- via "no_read_workqueue" and
   "no_write_workqueue" features.  This more direct IO processing
   improves latency and throughput with faster storage.  Avoiding
   workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is
   a requirement for adding zoned block device support to DM crypt.
 
 - Add zoned block device support to DM crypt.  Makes use of
   DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature
   (DM_CRYPT_WRITE_INLINE) that allows write completion to wait for
   encryption to complete.  This allows write ordering to be preserved,
   which is needed for zoned block devices.
 
 - Fix DM ebs target's check for REQ_OP_FLUSH.
 
 - Fix DM core's report zones support to not report more zones than
   were requested.
 
 - A few small compiler warning fixes.
 
 - DM dust improvements to return output directly to the user rather
   than require they scrape the system log for output.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl8tdOQTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWvDlB/sF8svagDqeqs27xTxCiUPykD29cMmS
 OGPr0Mp/BntZOBpSaTPM9s5XucP3WJhPsxet5qeoyM3OViSFx+O55PqPjn8C65y0
 eGMa4zknd9eO1933+ijmyQu6VNr4sf/6nusX4xSGqv00UR22dJ+3pHtfN9ANDXYX
 AAYA0Ve6UuOwAbGUCnRGI/2780aYY0B8Ok+cF21CskqryF+RpmbZ6BsR07+Hk4cy
 LX5EaHUqezW12cibLq2f0l7TLLJ86OscvqyU9lGVIxiV57e2i5c2S1HvhKZu+Wn3
 6CUmlOhGI0viCKgM1ArekZ+zOw9ROIaAKKPzC5mspqx9yuuCqdY8k8xV
 =X3tt
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - DM multipath locking fixes around m->flags tests and improvements to
   bio-based code so that it follows patterns established by
   request-based code.

 - Request-based DM core improvement to eliminate unnecessary call to
   blk_mq_queue_stopped().

 - Add "panic_on_corruption" error handling mode to DM verity target.

 - DM bufio fix to to perform buffer cleanup from a workqueue rather
   than wait for IO in reclaim context from shrinker.

 - DM crypt improvement to optionally avoid async processing via
   workqueues for reads and/or writes -- via "no_read_workqueue" and
   "no_write_workqueue" features. This more direct IO processing
   improves latency and throughput with faster storage. Avoiding
   workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is a
   requirement for adding zoned block device support to DM crypt.

 - Add zoned block device support to DM crypt. Makes use of
   DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature
   (DM_CRYPT_WRITE_INLINE) that allows write completion to wait for
   encryption to complete. This allows write ordering to be preserved,
   which is needed for zoned block devices.

 - Fix DM ebs target's check for REQ_OP_FLUSH.

 - Fix DM core's report zones support to not report more zones than were
   requested.

 - A few small compiler warning fixes.

 - DM dust improvements to return output directly to the user rather
   than require they scrape the system log for output.

* tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: don't call report zones for more than the user requested
  dm ebs: Fix incorrect checking for REQ_OP_FLUSH
  dm init: Set file local variable static
  dm ioctl: Fix compilation warning
  dm raid: Remove empty if statement
  dm verity: Fix compilation warning
  dm crypt: Enable zoned block device support
  dm crypt: add flags to optionally bypass kcryptd workqueues
  dm bufio: do buffer cleanup from a workqueue
  dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()
  dm dust: add interface to list all badblocks
  dm dust: report some message results directly back to user
  dm verity: add "panic_on_corruption" error handling mode
  dm mpath: use double checked locking in fast path
  dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctl
  dm mpath: rework __map_bio()
  dm mpath: factor out multipath_queue_bio
  dm mpath: push locking down to must_push_back_rq()
  dm mpath: take m->lock spinlock when testing QUEUE_IF_NO_PATH
  dm mpath: changes from initial m->flags locking audit
2020-08-07 13:08:09 -07:00
Linus Torvalds
81e11336d9 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few MM hotfixes

 - kthread, tools, scripts, ntfs and ocfs2

 - some of MM

Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan,
debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
  mm: vmscan: consistent update to pgrefill
  mm/vmscan.c: fix typo
  khugepaged: khugepaged_test_exit() check mmget_still_valid()
  khugepaged: retract_page_tables() remember to test exit
  khugepaged: collapse_pte_mapped_thp() protect the pmd lock
  khugepaged: collapse_pte_mapped_thp() flush the right range
  mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
  mm: thp: replace HTTP links with HTTPS ones
  mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
  mm/page_alloc.c: skip setting nodemask when we are in interrupt
  mm/page_alloc: fallbacks at most has 3 elements
  mm/page_alloc: silence a KASAN false positive
  mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
  mm/page_alloc.c: simplify pageblock bitmap access
  mm/page_alloc.c: extract the common part in pfn_to_bitidx()
  mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
  mm/shuffle: remove dynamic reconfiguration
  mm/memory_hotplug: document why shuffle_zone() is relevant
  mm/page_alloc: remove nr_free_pagecache_pages()
  mm: remove vm_total_pages
  ...
2020-08-07 11:39:33 -07:00
Waiman Long
453431a549 mm, treewide: rename kzfree() to kfree_sensitive()
As said by Linus:

  A symmetric naming is only helpful if it implies symmetries in use.
  Otherwise it's actively misleading.

  In "kzalloc()", the z is meaningful and an important part of what the
  caller wants.

  In "kzfree()", the z is actively detrimental, because maybe in the
  future we really _might_ want to use that "memfill(0xdeadbeef)" or
  something. The "zero" part of the interface isn't even _relevant_.

The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.

Rename kzfree() to kfree_sensitive() to follow the example of the recently
added kvfree_sensitive() and make the intention of the API more explicit.
In addition, memzero_explicit() is used to clear the memory to make sure
that it won't get optimized away by the compiler.

The renaming is done by using the command sequence:

  git grep -w --name-only kzfree |\
  xargs sed -i 's/kzfree/kfree_sensitive/'

followed by some editing of the kfree_sensitive() kerneldoc and adding
a kzfree backward compatibility macro in slab.h.

[akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
[akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:22 -07:00
Linus Torvalds
25d8d4eeca powerpc updates for 5.9
- Add support for (optionally) using queued spinlocks & rwlocks.
 
  - Support for a new faster system call ABI using the scv instruction on Power9
    or later.
 
  - Drop support for the PROT_SAO mmap/mprotect flag as it will be unsupported on
    Power10 and future processors, leaving us with no way to implement the
    functionality it requests. This risks breaking userspace, though we believe
    it is unused in practice.
 
  - A bug fix for, and then the removal of, our custom stack expansion checking.
    We now allow stack expansion up to the rlimit, like other architectures.
 
  - Remove the remnants of our (previously disabled) topology update code, which
    tried to react to NUMA layout changes on virtualised systems, but was prone
    to crashes and other problems.
 
  - Add PMU support for Power10 CPUs.
 
  - A change to our signal trampoline so that we don't unbalance the link stack
    (branch return predictor) in the signal delivery path.
 
  - Lots of other cleanups, refactorings, smaller features and so on as usual.
 
 Thanks to:
   Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey Kardashevskiy,
   Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton
   Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan S, Bharata B Rao, Bill
   Wendling, Bin Meng, Cédric Le Goater, Chris Packham, Christophe Leroy,
   Christoph Hellwig, Daniel Axtens, Dan Williams, David Lamparter, Desnes A.
   Nunes do Rosario, Erhard F., Finn Thain, Frederic Barrat, Ganesh Goudar,
   Gautham R. Shenoy, Geoff Levand, Greg Kurz, Gustavo A. R. Silva, Hari Bathini,
   Harish, Imre Kaloz, Joel Stanley, Joe Perches, John Crispin, Jordan Niethe,
   Kajol Jain, Kamalesh Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li
   RongQing, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal
   Suchanek, Milton Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan
   Chancellor, Nathan Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver
   O'Halloran, Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe
   Bergheaud, Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy
   Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh
   Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar Dronamraju,
   Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza Cascardo, Thiago Jung
   Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov, Wei Yongjun, Wen Xiong,
   YueHaibing.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl8tOxATHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgDQfEAClXHWf6hnxB84bEu39D51NkVotL1IG
 BRWFvyix+xHuUkHIouBPAAMl6ngY5X6wkYd+Z+CY9zHNtdSDoVlJE30YXdMQA/dE
 L/rYxR1884yGR/uU/3wusboO68ReXwcKQPmKOymUfh0zH7ujyJsSWLpXFK1YDC5d
 2TVVTi0Q+P5ucMHDh0L+AHirIxZvtZSp43+J7xLtywsj+XAxJWCTGo5WCJbdgbCA
 Qbv3aOkVyUa3EgsbdM/STPpv82ebqT+PHxeSIO4Jw6ZODtKRH0R5YsWCApuY9eZ+
 ebY9RLmgv9ZAhJqB2fv9A5NDcMoGpZNmjM7HrWpXwULKQpkBGHCzJ9FcSdHVMOx8
 nbVMFjt4uzLwV1w8lFYslQ2tNH/uH2o9BlryV1RLpiiKokDAJO/NOsWN9y0u/I4J
 EmAM5DSX2LgVvvas96IlGK8KX4xkOkf8FLX/H5UDvvAfloH8J4CZXk/CWCab/nqY
 KEHPnMmYvQZ1w9SzyZg9sO/1p6Bl1Gmm75Jv2F1lBiRW/42VcGBI/qLsJ4lC59Fc
 KbwufYNYYG38wbxDLW1HAPJhRonxIcaZj3EEqk7aTiLZ55nNbu8e2k32CpNXTGqt
 npOhzJHimcq7L6+878ZW+xpbZwogIEUdRSsmwb6aT8za3ShnYwSA2Q3LYxh9xyGH
 j3GifvPq6Efp3Q==
 =QMY1
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Add support for (optionally) using queued spinlocks & rwlocks.

 - Support for a new faster system call ABI using the scv instruction on
   Power9 or later.

 - Drop support for the PROT_SAO mmap/mprotect flag as it will be
   unsupported on Power10 and future processors, leaving us with no way
   to implement the functionality it requests. This risks breaking
   userspace, though we believe it is unused in practice.

 - A bug fix for, and then the removal of, our custom stack expansion
   checking. We now allow stack expansion up to the rlimit, like other
   architectures.

 - Remove the remnants of our (previously disabled) topology update
   code, which tried to react to NUMA layout changes on virtualised
   systems, but was prone to crashes and other problems.

 - Add PMU support for Power10 CPUs.

 - A change to our signal trampoline so that we don't unbalance the link
   stack (branch return predictor) in the signal delivery path.

 - Lots of other cleanups, refactorings, smaller features and so on as
   usual.

Thanks to: Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey
Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju
T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan
S, Bharata B Rao, Bill Wendling, Bin Meng, Cédric Le Goater, Chris
Packham, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Dan
Williams, David Lamparter, Desnes A. Nunes do Rosario, Erhard F., Finn
Thain, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geoff Levand,
Greg Kurz, Gustavo A. R. Silva, Hari Bathini, Harish, Imre Kaloz, Joel
Stanley, Joe Perches, John Crispin, Jordan Niethe, Kajol Jain, Kamalesh
Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li RongQing, Madhavan
Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal Suchanek, Milton
Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan Chancellor, Nathan
Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver O'Halloran,
Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud,
Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh
Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar
Dronamraju, Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza
Cascardo, Thiago Jung Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov,
Wei Yongjun, Wen Xiong, YueHaibing.

* tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (337 commits)
  selftests/powerpc: Fix pkey syscall redefinitions
  powerpc: Fix circular dependency between percpu.h and mmu.h
  powerpc/powernv/sriov: Fix use of uninitialised variable
  selftests/powerpc: Skip vmx/vsx/tar/etc tests on older CPUs
  powerpc/40x: Fix assembler warning about r0
  powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
  powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
  cpuidle: pseries: Fixup exit latency for CEDE(0)
  cpuidle: pseries: Add function to parse extended CEDE records
  cpuidle: pseries: Set the latency-hint before entering CEDE
  selftests/powerpc: Fix online CPU selection
  powerpc/perf: Consolidate perf_callchain_user_[64|32]()
  powerpc/pseries/hotplug-cpu: Remove double free in error path
  powerpc/pseries/mobility: Add pr_debug() for device tree changes
  powerpc/pseries/mobility: Set pr_fmt()
  powerpc/cacheinfo: Warn if cache object chain becomes unordered
  powerpc/cacheinfo: Improve diagnostics about malformed cache lists
  powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages
  powerpc/cacheinfo: Set pr_fmt()
  powerpc: fix function annotations to avoid section mismatch warnings with gcc-10
  ...
2020-08-07 10:33:50 -07:00
Linus Torvalds
e1ec517e18 Merge branch 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull init and set_fs() cleanups from Al Viro:
 "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
  init: add an init_dup helper
  init: add an init_utimes helper
  init: add an init_stat helper
  init: add an init_mknod helper
  init: add an init_mkdir helper
  init: add an init_symlink helper
  init: add an init_link helper
  init: add an init_eaccess helper
  init: add an init_chmod helper
  init: add an init_chown helper
  init: add an init_chroot helper
  init: add an init_chdir helper
  init: add an init_rmdir helper
  init: add an init_unlink helper
  init: add an init_umount helper
  init: add an init_mount helper
  init: mark create_dev as __init
  init: mark console_on_rootfs as __init
  init: initialize ramdisk_execute_command at compile time
  devtmpfs: refactor devtmpfsd()
  ...
2020-08-07 09:40:34 -07:00
Jens Axboe
905e321e21 Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.9
Pull MD fixes from Song.

* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md-cluster: Fix potential error pointer dereference in resize_bitmaps()
  md: get sysfs entry after redundancy attr group create
2020-08-06 17:23:22 -06:00
Ingo Molnar
a703f3633f Merge branch 'WIP.locking/seqlocks' into locking/urgent
Pick up the full seqlock series PeterZ is working on.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-08-06 10:16:38 +02:00
Dan Carpenter
e8abe1de43 md-cluster: Fix potential error pointer dereference in resize_bitmaps()
The error handling calls md_bitmap_free(bitmap) which checks for NULL
but will Oops if we pass an error pointer.  Let's set "bitmap" to NULL
on this error path.

Fixes: afd7562860 ("md-cluster/raid10: resize all the bitmaps before start reshape")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-05 17:07:48 -07:00
Junxiao Bi
e8efa9b88e md: get sysfs entry after redundancy attr group create
"sync_completed" and "degraded" belongs to redundancy attr group,
it was not exist yet when md device was created.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: e1a86dbbbd ("md: fix deadlock causing by sysfs_notify")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-05 16:07:39 -07:00
Linus Torvalds
060a72a268 for-5.9/block-merge-20200804
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8pjAEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphuHEAC5hNAi99HLktAQ7qZy4cBqnGNKCrguFszq
 Kxiecp3Nrb9EnAPNWYG+QMO0kD9z8quML85beBaJNxN1PlOk9pawqFAd4ziFncFI
 ruZwIMP+oH/0OmPUA2a4ymrqu+rpyFvfsDL2RKJ9dirAt9fuv9W0RZM5g7Oz83Xi
 cNdPRn0tOhK0DTPxL4M1/NR2OutSgvKDfA5Et3IrDFl7+bJAEFqmSO8wOSdZtvFp
 KcR4O/DXnr5Wl6cPvzlvooQze8vGGJkXAyIKaC9cuBm/nlzMCBGG8kE0v3kRJ8Sc
 uSSFkC+P+OlktY4JwXN+mCacDUdVBiiL/uUs1zel6HmociBgh67mgyJ6AfQtGZry
 yVl9mj44qWZjAzCODv5KnuxlH+gBacdmjcQqwxsZ2P477gfNkxmBXgHeWdfzO9A/
 zTUXaBDXg3VdYxQfD8zTWPkCwXYp+YG3SRb9pfrIWIiYuz2UECZTvl/8Upnacz2B
 POTf+6vcNDlILCtboVE0mKEYR0ckxqrbs0NQloQdmVOfXNyhLml9OrXmwJIffVtE
 pZ9g428c5bm44lIOiB2eW+QPsXo0s8GxqIrMtxzKsJ3WgFefwLiVDLJBqEt78jRJ
 RvpGUxrMLgWFubowH8yDmWV+Fp0NpqcqF+GU45z8nGC3OTS+i0ZvUFYgLM6a2uOf
 sv4bzDPDBg==
 =uMth
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block

Pull block stacking updates from Jens Axboe:
 "The stacking related fixes depended on both the core block and drivers
  branches, so here's a topic branch with that change.

  Outside of that, a late fix from Johannes for zone revalidation"

* tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block:
  block: don't do revalidate zones on invalid devices
  block: remove blk_queue_stack_limits
  block: remove bdev_stack_limits
  block: inherit the zoned characteristics in blk_stack_limits
2020-08-05 11:12:34 -07:00
Linus Torvalds
e0fc99e21e for-5.9/drivers-20200803
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8od3oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppkpD/9D+XqD9qYcYTj+ShVCc5+3RtMG5ZiAAX0y
 l4QXomentn/1Y0UYXFGJH7JLZWrKYT0QiktLtfpe5pmTqRUkckTIyJQlsHb+K6Dz
 lFjtywRK9pcFYgiWIUg80wlJKrTa8QdnrlS/Esn4YITKGRbgMIdFvq2jymXC+1ho
 RgodlgzcBUREgHSLo0H3cqEKA53fQiJhKC6CbFrFdrkpf2yUpcTfEDtpSwuIuPj3
 2AUed1qXUtNjdHciCn3N37OuHqXKAA9noXAWfg9Gx/5zfGUNX9QJvlsny1AopgS0
 jJvPSDVAhu/qRLHW6q/ZOT0JAlHegguuTAOtgMh2cMpAS5sumCAtltxVcI7Qnx41
 HalMpTefXsVoBo0gfjqldnIPt34ZNj5aH5GYaH/wPpSg6VkTVBJK8GuQDBvg27qT
 w+U/T6EzuqniWXh/P3COhfrMCR9ueUOY1qWCRwzomlpeIfBhCzidt2wUqIxX1TOA
 Q0Ltf0eERDevsZbE+tIm+VAAg98kHehcS2t8lfFYFO6/PKu2iJpJt/HtJbZNBE+W
 rm96E4qXRiy1UuL7D9vBkaWsbnosuNHgGQXx57GlokQU+2IGBmOxV52XHiSxxpXd
 AS1ZTd56ItmID8VaU09Pbf7ZFbiCgdEAxIbUFzaCuvo+lxryHFphIUARNi/zPnNT
 UC2OzunCqA==
 =oADH
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - NVMe:
      - ZNS support (Aravind, Keith, Matias, Niklas)
      - Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David,
        Dongli, Max, Sagi)

 - null_blk zone capacity support (Aravind)

 - MD:
      - raid5/6 fixes (ChangSyun)
      - Warning fixes (Damien)
      - raid5 stripe fixes (Guoqing, Song, Yufen)
      - sysfs deadlock fix (Junxiao)
      - raid10 deadlock fix (Vitaly)

 - struct_size conversions (Gustavo)

 - Set of bcache updates/fixes (Coly)

* tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits)
  md/raid5: Allow degraded raid6 to do rmw
  md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
  raid5: don't duplicate code for different paths in handle_stripe
  raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show
  md: print errno in super_written
  md/raid5: remove the redundant setting of STRIPE_HANDLE
  md: register new md sysfs file 'uuid' read-only
  md: fix max sectors calculation for super 1.0
  nvme-loop: remove extra variable in create ctrl
  nvme-loop: set ctrl state connecting after init
  nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths
  nvme-multipath: fix logic for non-optimized paths
  nvme-rdma: fix controller reset hang during traffic
  nvme-tcp: fix controller reset hang during traffic
  nvmet: introduce the passthru Kconfig option
  nvmet: introduce the passthru configfs interface
  nvmet: Add passthru enable/disable helpers
  nvmet: add passthru code to process commands
  nvme: export nvme_find_get_ns() and nvme_put_ns()
  nvme: introduce nvme_ctrl_get_by_path()
  ...
2020-08-05 10:51:40 -07:00
Linus Torvalds
2324d50d05 It's been a busy cycle for documentation - hopefully the busiest for a
while to come.  Changes include:
 
  - Some new Chinese translations
 
  - Progress on the battle against double words words and non-HTTPS URLs
 
  - Some block-mq documentation
 
  - More RST conversions from Mauro.  At this point, that task is
    essentially complete, so we shouldn't see this kind of churn again for a
    while.  Unless we decide to switch to asciidoc or something...:)
 
  - Lots of typo fixes, warning fixes, and more.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl8oVkwPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YoW8H/jJ/xnXFn7tkgVPQAlL3k5HCnK7A5nDP9RVR
 cg1pTx1cEFdjzxPlJyExU6/v+AImOvtweHXC+JDK7YcJ6XFUNYXJI3LxL5KwUXbY
 BL/xRFszDSXH2C7SJF5GECcFYp01e/FWSLN3yWAh+g+XwsKiTJ8q9+CoIDkHfPGO
 7oQsHKFu6s36Af0LfSgxk4sVB7EJbo8e4psuPsP5SUrl+oXRO43Put0rXkR4yJoH
 9oOaB51Do5fZp8I4JVAqGXvpXoExyLMO4yw0mASm6YSZ3KyjR8Fae+HD9Cq4ZuwY
 0uzb9K+9NEhqbfwtyBsi99S64/6Zo/MonwKwevZuhtsDTK4l4iU=
 =JQLZ
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.9' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "It's been a busy cycle for documentation - hopefully the busiest for a
  while to come. Changes include:

   - Some new Chinese translations

   - Progress on the battle against double words words and non-HTTPS
     URLs

   - Some block-mq documentation

   - More RST conversions from Mauro. At this point, that task is
     essentially complete, so we shouldn't see this kind of churn again
     for a while. Unless we decide to switch to asciidoc or
     something...:)

   - Lots of typo fixes, warning fixes, and more"

* tag 'docs-5.9' of git://git.lwn.net/linux: (195 commits)
  scripts/kernel-doc: optionally treat warnings as errors
  docs: ia64: correct typo
  mailmap: add entry for <alobakin@marvell.com>
  doc/zh_CN: add cpu-load Chinese version
  Documentation/admin-guide: tainted-kernels: fix spelling mistake
  MAINTAINERS: adjust kprobes.rst entry to new location
  devices.txt: document rfkill allocation
  PCI: correct flag name
  docs: filesystems: vfs: correct flag name
  docs: filesystems: vfs: correct sync_mode flag names
  docs: path-lookup: markup fixes for emphasis
  docs: path-lookup: more markup fixes
  docs: path-lookup: fix HTML entity mojibake
  CREDITS: Replace HTTP links with HTTPS ones
  docs: process: Add an example for creating a fixes tag
  doc/zh_CN: add Chinese translation prefer section
  doc/zh_CN: add clearing-warn-once Chinese version
  doc/zh_CN: add admin-guide index
  doc:it_IT: process: coding-style.rst: Correct __maybe_unused compiler label
  futex: MAINTAINERS: Re-add selftests directory
  ...
2020-08-04 22:47:54 -07:00
Linus Torvalds
99ea1521a0 Remove uninitialized_var() macro for v5.9-rc1
- Clean up non-trivial uses of uninitialized_var()
 - Update documentation and checkpatch for uninitialized_var() removal
 - Treewide removal of uninitialized_var()
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oYLQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsfjEACvf0D3WL3H7sLHtZ2HeMwOgAzq
 il08t6vUscINQwiIIK3Be43ok3uQ1Q+bj8sr2gSYTwunV2IYHFferzgzhyMMno3o
 XBIGd1E+v1E4DGBOiRXJvacBivKrfvrdZ7AWiGlVBKfg2E0fL1aQbe9AYJ6eJSbp
 UGqkBkE207dugS5SQcwrlk1tWKUL089lhDAPd7iy/5RK76OsLRCJFzIerLHF2ZK2
 BwvA+NWXVQI6pNZ0aRtEtbbxwEU4X+2J/uaXH5kJDszMwRrgBT2qoedVu5LXFPi8
 +B84IzM2lii1HAFbrFlRyL/EMueVFzieN40EOB6O8wt60Y4iCy5wOUzAdZwFuSTI
 h0xT3JI8BWtpB3W+ryas9cl9GoOHHtPA8dShuV+Y+Q2bWe1Fs6kTl2Z4m4zKq56z
 63wQCdveFOkqiCLZb8s6FhnS11wKtAX4czvXRXaUPgdVQS1Ibyba851CRHIEY+9I
 AbtogoPN8FXzLsJn7pIxHR4ADz+eZ0dQ18f2hhQpP6/co65bYizNP5H3h+t9hGHG
 k3r2k8T+jpFPaddpZMvRvIVD8O2HvJZQTyY6Vvneuv6pnQWtr2DqPFn2YooRnzoa
 dbBMtpon+vYz6OWokC5QNWLqHWqvY9TmMfcVFUXE4AFse8vh4wJ8jJCNOFVp8On+
 drhmmImUr1YylrtVOw==
 =xHmk
 -----END PGP SIGNATURE-----

Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull uninitialized_var() macro removal from Kees Cook:
 "This is long overdue, and has hidden too many bugs over the years. The
  series has several "by hand" fixes, and then a trivial treewide
  replacement.

   - Clean up non-trivial uses of uninitialized_var()

   - Update documentation and checkpatch for uninitialized_var() removal

   - Treewide removal of uninitialized_var()"

* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  compiler: Remove uninitialized_var() macro
  treewide: Remove uninitialized_var() usage
  checkpatch: Remove awareness of uninitialized_var() macro
  mm/debug_vm_pgtable: Remove uninitialized_var() usage
  f2fs: Eliminate usage of uninitialized_var() macro
  media: sur40: Remove uninitialized_var() usage
  KVM: PPC: Book3S PR: Remove uninitialized_var() usage
  clk: spear: Remove uninitialized_var() usage
  clk: st: Remove uninitialized_var() usage
  spi: davinci: Remove uninitialized_var() usage
  ide: Remove uninitialized_var() usage
  rtlwifi: rtl8192cu: Remove uninitialized_var() usage
  b43: Remove uninitialized_var() usage
  drbd: Remove uninitialized_var() usage
  x86/mm/numa: Remove uninitialized_var() usage
  docs: deprecated.rst: Add uninitialized_var()
2020-08-04 13:49:43 -07:00
Johannes Thumshirn
a9cb9f4148 dm: don't call report zones for more than the user requested
Don't call report zones for more zones than the user actually requested,
otherwise this can lead to out-of-bounds accesses in the callback
functions.

Such a situation can happen if the target's ->report_zones() callback
function returns 0 because we've reached the end of the target and then
restart the report zones on the second target.

We're again calling into ->report_zones() and ultimately into the user
supplied callback function but when we're not subtracting the number of
zones already processed this may lead to out-of-bounds accesses in the
user callbacks.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: d41003513e ("block: rework zone reporting")
Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04 16:31:12 -04:00
John Dorminy
4cb6f22612 dm ebs: Fix incorrect checking for REQ_OP_FLUSH
REQ_OP_FLUSH was being treated as a flag, but the operation
part of bio->bi_opf must be treated as a whole. Change to
accessing the operation part via bio_op(bio) and checking
for equality.

Signed-off-by: John Dorminy <jdorminy@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Fixes: d3c7b35c20 ("dm: add emulated block size target")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04 16:01:40 -04:00
Damien Le Moal
90e6bf0659 dm init: Set file local variable static
Declare dm_allowed_targets as static to avoid the warning:

drivers/md/dm-init.c:39:12: warning: symbol 'dm_allowed_targets' was
not declared. Should it be static?

when compiling with C=1.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04 15:51:28 -04:00
Damien Le Moal
1aeb6e7cd1 dm ioctl: Fix compilation warning
In retrieve_status(), when copying the target type name in the
target_type string field of struct dm_target_spec, copy at most
DM_MAX_TYPE_NAME - 1 character to avoid the compilation warning:

warning: ‘__builtin_strncpy’ specified bound 16 equals destination
size [-Wstringop-truncation]

when compiling with W-1.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04 15:50:58 -04:00
Damien Le Moal
04dc5330e5 dm raid: Remove empty if statement
In super_init_validation(), remove a body-less if statement testing only
variables to avoid a compilation warning when compiling with W=1.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04 15:48:48 -04:00
Damien Le Moal
a84c430833 dm verity: Fix compilation warning
For the case !CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG, declare the
functions verity_verify_root_hash(), verity_verify_is_sig_opt_arg(),
verity_verify_sig_parse_opt_args() and verity_verify_sig_opts_cleanup()
as inline to avoid a "no previous prototype for xxx" compilation
warning when compiling with W=1.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04 15:48:13 -04:00
Linus Torvalds
382625d0d4 for-5.9/block-20200802
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec
 yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/
 FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS
 x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB
 b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut
 8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh
 SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi
 yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv
 GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw
 ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb
 +BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR
 RzAUdZFbWw==
 =abJG
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Good amount of cleanups and tech debt removals in here, and as a
  result, the diffstat shows a nice net reduction in code.

   - Softirq completion cleanups (Christoph)

   - Stop using ->queuedata (Christoph)

   - Cleanup bd claiming (Christoph)

   - Use check_events, moving away from the legacy media change
     (Christoph)

   - Use inode i_blkbits consistently (Christoph)

   - Remove old unused writeback congestion bits (Christoph)

   - Cleanup/unify submission path (Christoph)

   - Use bio_uninit consistently, instead of bio_disassociate_blkg
     (Christoph)

   - sbitmap cleared bits handling (John)

   - Request merging blktrace event addition (Jan)

   - sysfs add/remove race fixes (Luis)

   - blk-mq tag fixes/optimizations (Ming)

   - Duplicate words in comments (Randy)

   - Flush deferral cleanup (Yufen)

   - IO context locking/retry fixes (John)

   - struct_size() usage (Gustavo)

   - blk-iocost fixes (Chengming)

   - blk-cgroup IO stats fixes (Boris)

   - Various little fixes"

* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
  block: blk-timeout: delete duplicated word
  block: blk-mq-sched: delete duplicated word
  block: blk-mq: delete duplicated word
  block: genhd: delete duplicated words
  block: elevator: delete duplicated word and fix typos
  block: bio: delete duplicated words
  block: bfq-iosched: fix duplicated word
  iocost_monitor: start from the oldest usage index
  iocost: Fix check condition of iocg abs_vdebt
  block: Remove callback typedefs for blk_mq_ops
  block: Use non _rcu version of list functions for tag_set_list
  blk-cgroup: show global disk stats in root cgroup io.stat
  blk-cgroup: make iostat functions visible to stat printing
  block: improve discard bio alignment in __blkdev_issue_discard()
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  block: defer flush request no matter whether we have elevator
  block: make blk_timeout_init() static
  block: remove retry loop in ioc_release_fn()
  block: remove unnecessary ioc nested locking
  block: integrate bd_start_claiming into __blkdev_get
  ...
2020-08-03 11:57:03 -07:00
ChangSyun Peng
45a4d8fd6c md/raid5: Allow degraded raid6 to do rmw
Degraded raid6 always do reconstruct-write now. With raid6 xor supported,
we can do rmw in degraded raid6. This patch can reduce many read IOs to
improve performance.

If the failed disk is P, Q or the disk we want to write to, we may need to
do reconstruct-write in max degraded raid6. In this situation we can not
read enough data from handle_stripe_dirtying() so we have to set force_rcw
in handle_stripe_fill() to read all data.

Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Reviewed-by: Danny Shih <dannyshih@synology.com>
Signed-off-by: ChangSyun Peng <allenpeng@synology.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02 23:54:31 -07:00
ChangSyun Peng
a1c6ae3d9f md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
In degraded raid5, we need to read parity to do reconstruct-write when
data disks fail. However, we can not read parity from
handle_stripe_dirtying() in force reconstruct-write mode.

Reproducible Steps:

1. Create degraded raid5
mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing
2. Set rmw_level to 0
echo 0 > /sys/block/md2/md/rmw_level
3. IO to raid5

Now some io may be stuck in raid5. We can use handle_stripe_fill() to read
the parity in this situation.

Cc: <stable@vger.kernel.org> # v4.4+
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Reviewed-by: Danny Shih <dannyshih@synology.com>
Signed-off-by: ChangSyun Peng <allenpeng@synology.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02 23:54:25 -07:00
Guoqing Jiang
3a31cf3d21 raid5: don't duplicate code for different paths in handle_stripe
As we can see, R5_LOCKED is set and s.locked is increased whether
R5_ReWrite is set or not, so move it to common path.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02 23:03:52 -07:00
Guoqing Jiang
01b5d32a57 raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show
Replace mddev_lock with spin_lock to align with other show methods in
raid5_attrs.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02 23:03:52 -07:00
Guoqing Jiang
b3db8a2163 md: print errno in super_written
It is better to print errno instead of bi_status.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02 23:03:51 -07:00
Guoqing Jiang
e3914d596f md/raid5: remove the redundant setting of STRIPE_HANDLE
The flag is already set before compare rcw with rmw, so it is
not necessary to do it again.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02 23:03:51 -07:00
Sebastian Parschauer
ec164d07aa md: register new md sysfs file 'uuid' read-only
Report the UUID of the MD array in the following format:
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

This is useful if you don't want to wait for udev to identify array.
And it is also easy for script to monitor it with the format.

Signed-off-by: Sebastian Parschauer <s.parschauer@gmx.de>
[Guoqing: mention the change in md.rst]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02 23:03:51 -07:00
Xiao Ni
d9c0fa509e md: fix max sectors calculation for super 1.0
To grow size of super 1.0 raid array, it is necessary to check the device
max usable size.

Now it uses rdev->sectors for max usable size. If one disk is 500G and the
raid device only uses the 100GB of this disk. rdev->sectors can't tell the
real max usable size. The max usable size should be

dev_size-(superblock_size+bitmap_size+badblock_size).

Also, remove unnecessary sb_start update in super_1_rdev_size_change().

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02 23:03:51 -07:00
Christoph Hellwig
716308a533 init: add an init_stat helper
Add a simple helper to stat with a kernel space file name and switch
the early init code over to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:54 +02:00
Ahmed S. Darwish
0a87b25ff2 raid5: Use sequence counter with associated spinlock
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-20-a.darwish@linutronix.de
2020-07-29 16:14:27 +02:00
Coly Li
c5be1f2c5b bcache: use disk_{start,end}_io_acct() to count I/O for bcache device
This patch is a fix to patch "bcache: fix bio_{start,end}_io_acct with
proper device". The previous patch uses a hack to temporarily set
bi_disk to bcache device, which is mistaken too.

As Christoph suggests, this patch uses disk_{start,end}_io_acct() to
count I/O for bcache device in the correct way.

Fixes: 85750aeb74 ("bcache: use bio_{start,end}_io_acct")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-28 09:14:52 -06:00
Coly Li
a2f32ee8fd bcache: fix bio_{start,end}_io_acct with proper device
Commit 85750aeb74 ("bcache: use bio_{start,end}_io_acct") moves the
io account code to the location after bio_set_dev(bio, dc->bdev) in
cached_dev_make_request(). Then the account is performed incorrectly on
backing device, indeed the I/O should be counted to bcache device like
/dev/bcache0.

With the mistaken I/O account, iostat does not display I/O counts for
bcache device and all the numbers go to backing device. In writeback
mode, the hard drive may have 340K+ IOPS which is impossible and wrong
for spinning disk.

This patch introduces bch_bio_start_io_acct() and bch_bio_end_io_acct(),
which switches bio->bi_disk to bcache device before calling
bio_start_io_acct() or bio_end_io_acct(). Now the I/Os are counted to
bcache device, and bcache device, cache device and backing device have
their correct I/O count information back.

Fixes: 85750aeb74 ("bcache: use bio_{start,end}_io_acct")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:21 -06:00
Coly Li
4e4d4e0962 bcache: avoid extra memory consumption in struct bbio for large bucket size
Bcache uses struct bbio to do I/Os for meta data pages like uuids,
disk_buckets, prio_buckets, and btree nodes.

Example writing a btree node onto cache device, the process is,
- Allocate a struct bbio from mempool c->bio_meta.
- Inside struct bbio embedded a struct bio, initialize bi_inline_vecs
  for this embedded bio.
- Call bch_bio_map() to map each meta data page to each bv from the
  inlined  bi_io_vec table.
- Call bch_submit_bbio() to submit the bio into underlying block layer.
- When the I/O completed, only release the struct bbio, don't touch the
  reference counter of the meta data pages.

The struct bbio is defined as,
738 struct bbio {
739     unsigned int            submit_time_us;
	[snipped]
748     struct bio              bio;
749 };

Because struct bio is embedded at the end of struct bbio, therefore the
actual size of struct bbio is sizeof(struct bio) + size of the embedded
bio->bi_inline_vecs.

Now all the meta data bucket size are limited to meta_bucket_pages(), if
the bucket size is large than meta_bucket_pages()*PAGE_SECTORS, rested
space in the bucket is unused. Therefore the most used space in meta
bucket is (1<<MAX_ORDER) pages, or (1<<CONFIG_FORCE_MAX_ZONEORDER) if it
is configured.

Therefore for large bucket size, it is unnecessary to calculate the
allocation size of mempool c->bio_meta as,
	mempool_init_kmalloc_pool(&c->bio_meta, 2,
			sizeof(struct bbio) +
			sizeof(struct bio_vec) * bucket_pages(c))
It is too large, neither the Linux buddy allocator cannot allocate so
much continuous pages, nor the extra allocated pages are wasted.

This patch replace bucket_pages() to meta_bucket_pages() in two places,
- In bch_cache_set_alloc(), when initialize mempool c->bio_meta, uses
  sizeof(struct bbio) + sizeof(struct bio_vec) * bucket_pages(c) to set
  the allocating object size.
- In bch_bbio_alloc(), when calling bio_init() to set inline bvec talbe
  bi_inline_bvecs, uses meta_bucket_pages() to indicate number of the
  inline bio vencs number.

Now the maximum size of embedded bio inside struct bbio exactly matches
the limit of meta_bucket_pages(), no extra page wasted.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:21 -06:00
Coly Li
6907dc498f bcache: avoid extra memory allocation from mempool c->fill_iter
Mempool c->fill_iter is used to allocate memory for struct btree_iter in
bch_btree_node_read_done() to iterate all keys of a read-in btree node.

The allocation size is defined in bch_cache_set_alloc() by,
  mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size))
where iter_size is defined by a calculation,
  (sb->bucket_size / sb->block_size + 1) * sizeof(struct btree_iter_set)

For 16bit width bucket_size the calculation is OK, but now the bucket
size is extended to 32bit, the bucket size can be 2GB. By the above
calculation, iter_size can be 2048 pages (order 11 is still accepted by
buddy allocator).

But the actual size holds the bkeys in meta data bucket is limited to
meta_bucket_pages() already, which is 16MB. By the above calculation,
if replace sb->bucket_size by meta_bucket_pages() * PAGE_SECTORS, the
result is 16 pages. This is the size large enough for the mempool
allocation to struct btree_iter.

Therefore in worst case every time mempool c->fill_iter allocates, at
most 4080 pages are wasted and won't be used. Therefore this patch uses
meta_bucket_pages() * PAGE_SECTORS to calculate the iter size in
bch_cache_set_alloc(), to avoid extra memory allocation from mempool
c->fill_iter.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:21 -06:00
Coly Li
092bd54d69 bcache: add sysfs file to display feature sets information of cache set
The following three sysfs files are created to display according feature
set information of bcache:
	/sys/fs/bcache/<cache set UUID>/internal/feature_compat
	/sys/fs/bcache/<cache set UUID>/internal/feature_ro_compat
	/sys/fs/bcache/<cache set UUID>/internal/feature_incompat
is added by this patch, to display feature sets information of the cache
set.

Now only an incompat feature 'large_bucket' added in bcache, the sysfs
file content is:
        [large_bucket]
string large_bucket means the running bcache drive supports incompat
feature 'large_bucket', the wrapping [] means the 'large_bucket' feature
is currently enabled on this cache set.

This patch is ready to display compat and ro_compat features, in future
once bcache code implements such feature sets, the according feature
strings will be displayed in their sysfs files too.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:21 -06:00
Coly Li
ffa4703275 bcache: add bucket_size_hi into struct cache_sb_disk for large bucket
The large bucket feature is to extend bucket_size from 16bit to 32bit.

When create cache device on zoned device (e.g. zoned NVMe SSD), making
a single bucket cover one or more zones of the zoned device is the
simplest way to support zoned device as cache by bcache.

But current maximum bucket size is 16MB and a typical zone size of zoned
device is 256MB, this is the major motiviation to extend bucket size to
a larger bit width.

This patch is the basic and first change to support large bucket size,
the major changes it makes are,
- Add BCH_FEATURE_INCOMPAT_LARGE_BUCKET for the large bucket feature,
  INCOMPAT means it introduces incompatible on-disk format change.
- Add BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET) routines.
- Adds __le16 bucket_size_hi into struct cache_sb_disk at offset 0x8d0
  for the on-disk super block format.
- For the in-memory super block struct cache_sb, member bucket_size is
  extended from __u16 to __32.
- Add get_bucket_size() to combine the bucket_size and bucket_size_hi
  from struct cache_sb_disk into an unsigned int value.

Since we already have large bucket size helpers meta_bucket_pages(),
meta_bucket_bytes() and alloc_meta_bucket_pages(), they make sure when
bucket size > 8MB, the memory allocation for bcache meta data bucket
won't fail no matter how large the bucket size extended. So these meta
data buckets are handled properly when the bucket size width increase
from 16bit to 32bit, we don't need to worry about them.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:21 -06:00
Coly Li
f9c32a5a90 bcache: handle btree node memory allocation properly for bucket size > 8MB
Currently the bcache internal btree node occupies a whole bucket. When
loading the btree node from cache device into memory, mca_data_alloc()
will call bch_btree_keys_alloc() to allocate memory for the whole bucket
size, ilog2(b->c->btree_pages) is send to bch_btree_keys_alloc() as the
parameter 'page_order'.

c->btree_pages is set as bucket_pages() in bch_cache_set_alloc(), for
bucket size > 8MB, ilog2(b->c->btree_pages) is 12 for 4KB page size. By
default the maximum page order __get_free_pages() accepts is MAX_ORDER
(11), in this condition bch_btree_keys_alloc() will always fail.

Because of other over-page-order allocation failure fails the cache
device registration, such btree node allocation failure wasn't observed
during runtime. After other blocking page allocation failures for bucket
size > 8MB, this btree node allocation issue may trigger potentical risk
e.g. infinite dead-loop to retry btree node allocation after failure.

This patch fixes the potential problem by setting c->btree_pages to
meta_bucket_pages() in bch_cache_set_alloc(). In the condition that
bucket size > 8MB, meta_bucket_pages() will always return a number which
won't exceed the maximum page order of the buddy allocator.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
bf6af17065 bcache: handle cache set verify_ondisk properly for bucket size > 8MB
In bch_btree_cache_alloc() when CONFIG_BCACHE_DEBUG is configured,
allocate memory for c->verify_ondisk may fail if the bucket size > 8MB,
which will require __get_free_pages() to allocate continuous pages
with order > 11 (the default MAX_ORDER of Linux buddy allocator). Such
over size allocation will fail, and cause 2 problems,
- When CONFIG_BCACHE_DEBUG is configured,  bch_btree_verify() does not
  work, because c->verify_ondisk is NULL and bch_btree_verify() returns
  immediately.
- bch_btree_cache_alloc() will fail due to c->verify_ondisk allocation
  failed, then the whole cache device registration fails. And because of
  this failure, the first problem of bch_btree_verify() has no chance to
  be triggered.

This patch fixes the above problem by two means,
1) If pages allocation of c->verify_ondisk fails, set it to NULL and
   returns bch_btree_cache_alloc() with -ENOMEM.
2) When calling __get_free_pages() to allocate c->verify_ondisk pages,
   use ilog2(meta_bucket_pages(&c->sb)) to make sure ilog2() will always
   generate a pages order <= MAX_ORDER (or CONFIG_FORCE_MAX_ZONEORDER).
   Then the buddy system won't directly reject the allocation request.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
c954ac8d66 bcache: handle cache prio_buckets and disk_buckets properly for bucket size > 8MB
Similar to c->uuids, struct cache's prio_buckets and disk_buckets also
have the potential memory allocation failure during cache registration
if the bucket size > 8MB.

ca->prio_buckets can be stored on cache device in multiple buckets, its
in-memory space is allocated by kzalloc() interface but normally
allocated by alloc_pages() because the size > KMALLOC_MAX_CACHE_SIZE.

So allocation of ca->prio_buckets has the MAX_ORDER restriction too. If
the bucket size > 8MB, by default the page allocator will fail because
the page order > 11 (default MAX_ORDER value). ca->prio_buckets should
also use meta_bucket_bytes(), meta_bucket_pages() to decide its memory
size and use alloc_meta_bucket_pages() to allocate pages, to avoid the
allocation failure during cache set registration when bucket size > 8MB.

ca->disk_buckets is a single bucket size memory buffer, it is used to
iterate each bucket of ca->prio_buckets, and compose the bio based on
memory of ca->disk_buckets, then write ca->disk_buckets memory to cache
disk one-by-one for each bucket of ca->prio_buckets. ca->disk_buckets
should have in-memory size exact to the meta_bucket_pages(), this is the
size that ca->prio_buckets will be stored into each on-disk bucket.

This patch fixes the above issues and handle cache's prio_buckets and
disk_buckets properly for bucket size larger than 8MB.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
21e478ddb2 bcache: handle c->uuids properly for bucket size > 8MB
Bcache allocates a whole bucket to store c->uuids on cache device, and
allocates continuous pages to store it in-memory. When the bucket size
exceeds maximum allocable continuous pages, bch_cache_set_alloc() will
fail and cache device registration will fail.

This patch allocates c->uuids by alloc_meta_bucket_pages(), and uses
ilog2(meta_bucket_pages(c)) to indicate order of c->uuids pages when
free it. When writing c->uuids to cache device, its size is decided
by meta_bucket_pages(c) * PAGE_SECTORS. Now c->uuids is properly handled
for bucket size > 8MB.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
de1fafab64 bcache: introduce meta_bucket_pages() related helper routines
Currently the in-memory meta data like c->uuids or c->disk_buckets
are allocated by alloc_bucket_pages(). The macro alloc_bucket_pages()
calls __get_free_pages() to allocated continuous pages with order
indicated by ilog2(bucket_pages(c)),
 #define alloc_bucket_pages(gfp, c)                      \
     ((void *) __get_free_pages(__GFP_ZERO|gfp, ilog2(bucket_pages(c))))

The maximum order is defined as MAX_ORDER, the default value is 11 (and
can be overwritten by CONFIG_FORCE_MAX_ZONEORDER). In bcache code the
maximum bucket size width is 16bits, this is restricted both by KEY_SIZE
size and bucket_size size from struct cache_sb_disk. The maximum 16bits
width and power-of-2 value is (1<<15) in unit of sector (512byte). It
means the maximum value of bucket size in bytes is (1<<24) bytes a.k.a
4096 pages.

When the bucket size is set to maximum permitted value, ilog2(4096) is
12, which exceeds the default maximum order __get_free_pages() can
accepted, the failed pages allocation will fail cache set registration
procedure and print a kernel oops message for the exceeded pages order.

This patch introduces meta_bucket_pages(), meta_bucket_bytes(), and
alloc_bucket_pages() helper routines. meta_bucket_pages() indicates the
maximum pages can be allocated to meta data bucket, meta_bucket_bytes()
indicates the according maximum bytes, and alloc_bucket_pages() does
the pages allocation for meta bucket. Because meta_bucket_pages()
chooses the smaller value among the bucket size and MAX_ORDER_NR_PAGES,
it still works when MAX_ORDER overwritten by CONFIG_FORCE_MAX_ZONEORDER.

Following patches will use these helper routines to decide maximum pages
can be allocated for different meta data buckets. If the bucket size is
larger than meta_bucket_bytes(), the bcache registration can continue to
success, just the space more than meta_bucket_bytes() inside the bucket
is wasted. Comparing bcache failed for large bucket size, wasting some
space for meta data buckets is acceptable at this moment.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
198efa35c5 bcache: move bucket related code into read_super_common()
Setting sb->first_bucket and checking sb->keys indeed are only for cache
device, it does not make sense to do them in read_super() for backing
device too.

This patch moves the related code piece into read_super_common()
explicitly for cache device and avoid the confusion.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
d721a43ff6 bcache: increase super block version for cache device and backing device
The new added super block version BCACHE_SB_VERSION_BDEV_WITH_FEATURES
(5) BCACHE_SB_VERSION_CDEV_WITH_FEATURES value (6), is for the feature
set bits.

Devices have super block version equal to the new version will have
three new members for feature set bits in the on-disk super block,
        __le64                  feature_compat;
        __le64                  feature_incompat;
        __le64                  feature_ro_compat;

They are used for further new features which may introduce on-disk
format change, and avoid unncessary super block version increase.

The very basic features handling code skeleton is also initialized in
this patch.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
117f636ea6 bcache: fix super block seq numbers comparision in register_cache_set()
In register_cache_set(), c is pointer to struct cache_set, and ca is
pointer to struct cache, if ca->sb.seq > c->sb.seq, it means this
registering cache has up to date version and other members, the in-
memory version and other members should be updated to the newer value.

But current implementation makes a cache set only has a single cache
device, so the above assumption works well except for a special case.
The execption is when a cache device new created and both ca->sb.seq and
c->sb.seq are 0, because the super block is never flushed out yet. In
the location for the following if() check,
2156         if (ca->sb.seq > c->sb.seq) {
2157                 c->sb.version           = ca->sb.version;
2158                 memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16);
2159                 c->sb.flags             = ca->sb.flags;
2160                 c->sb.seq               = ca->sb.seq;
2161                 pr_debug("set version = %llu\n", c->sb.version);
2162         }
c->sb.version is not initialized yet and valued 0. When ca->sb.seq is 0,
the if() check will fail (because both values are 0), and the cache set
version, set_uuid, flags and seq won't be updated.

The above problem is hiden for current code, because the bucket size is
compatible among different super block version. And the next time when
running cache set again, ca->sb.seq will be larger than 0 and cache set
super block version will be updated properly.

But if the large bucket feature is enabled,  sb->bucket_size is the low
16bits of the bucket size. For a power of 2 value, when the actual
bucket size exceeds 16bit width, sb->bucket_size will always be 0. Then
read_super_common() will fail because the if() check to
is_power_of_2(sb->bucket_size) is false. This is how the long time
hidden bug is triggered.

This patch modifies the if() check to the following way,
2156         if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) {
Then cache set's version, set_uuid, flags and seq will always be updated
corectly including for a new created cache device.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
a42d3c642c bcache: disassemble the big if() checks in bch_cache_set_alloc()
In bch_cache_set_alloc() there is a big if() checks combined by 11 items
together. When this big if() statement fails, it is difficult to tell
exactly which item fails indeed.

This patch disassembles this big if() checks into 11 single if() checks,
which makes code debug more easier.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
c557a5f7bb bcache: add more accurate error information in read_super_common()
The improperly set bucket or block size will trigger error in
read_super_common(). For large bucket size, a more accurate error message
for invalid bucket or block size is necessary.

This patch disassembles the combined if() checks into multiple single
if() check, and provide more accurate error message for each check
failure condition.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
5b21403c7f bcache: add read_super_common() to read major part of super block
Later patches will introduce feature set bits to on-disk super block and
increase super block version. Current code in read_super() which reads
common part of super block for version BCACHE_SB_VERSION_CDEV and version
BCACHE_SB_VERSION_CDEV_WITH_UUID will be shared with the new version.

Therefore this patch moves the reusable part into read_super_common(),
this preparation patch will make later patches more simplier and only
focus on new feature set bits.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
7a14812679 bcache: fix overflow in offset_to_stripe()
offset_to_stripe() returns the stripe number (in type unsigned int) from
an offset (in type uint64_t) by the following calculation,
	do_div(offset, d->stripe_size);
For large capacity backing device (e.g. 18TB) with small stripe size
(e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The actual
returned value which caller receives is 536870912, due to the overflow.

Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in
range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are
in range [0, bcache_dev->nr_stripes - 1].

This patch adds a upper limition check in offset_to_stripe(): the max
valid stripe number should be less than bcache_device->nr_stripes. If
the calculated stripe number from do_div() is equal to or larger than
bcache_device->nr_stripe, -EINVAL will be returned. (Normally nr_stripes
is less than INT_MAX, exceeding upper limitation doesn't mean overflow,
therefore -EOVERFLOW is not used as error code.)

This patch also changes nr_stripes' type of struct bcache_device from
'unsigned int' to 'int', and return value type of offset_to_stripe()
from 'unsigned int' to 'int', to match their exact data ranges.

All locations where bcache_device->nr_stripes and offset_to_stripe() are
referenced also get updated for the above type change.

Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li
65f0f017e7 bcache: avoid nr_stripes overflow in bcache_device_init()
For some block devices which large capacity (e.g. 8TB) but small io_opt
size (e.g. 8 sectors), in bcache_device_init() the stripes number calcu-
lated by,
	DIV_ROUND_UP_ULL(sectors, d->stripe_size);
might be overflow to the unsigned int bcache_device->nr_stripes.

This patch uses the uint64_t variable to store DIV_ROUND_UP_ULL()
and after the value is checked to be available in unsigned int range,
sets it to bache_device->nr_stripes. Then the overflow is avoided.

Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Gustavo A. R. Silva
29f1d5cace bcache: Use struct_size() in kzalloc()
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

This code was detected with the help of Coccinelle and, audited and
fixed manually.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Gustavo A. R. Silva
6706ad5643 bcache: movinggc: Use struct_size() helper in kzalloc()
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

This code was detected with the help of Coccinelle and, audited and
fixed manually.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Xu Wang
7236657c6b bcache: writeback: Remove unneeded variable i
Remove unneeded variable i in bch_dirty_init_thread().

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Xu Wang
ef4eeb855f bcache: journel: use for_each_clear_bit() to simplify the code
Using for_each_clear_bit() to simplify the code.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:19 -06:00
Coly Li
5fe4886785 bcache: allocate meta data pages as compound pages
There are some meta data of bcache are allocated by multiple pages,
and they are used as bio bv_page for I/Os to the cache device. for
example cache_set->uuids, cache->disk_buckets, journal_write->data,
bset_tree->data.

For such meta data memory, all the allocated pages should be treated
as a single memory block. Then the memory management and underlying I/O
code can treat them more clearly.

This patch adds __GFP_COMP flag to all the location allocating >0 order
pages for the above mentioned meta data. Then their pages are treated
as compound pages now.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:19 -06:00
Jean Delvare
6acd193b26 bcache: Fix typo in Kconfig name
registraion -> registration

Fixes: 0c8d3fcead ("bcache: configure the asynchronous registertion to be experimental")
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Reviewed-by: Coly Li <colyli@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:19 -06:00
Mikulas Patocka
5df96f2b9f dm integrity: fix integrity recalculation that is improperly skipped
Commit adc0daad36 ("dm: report suspended
device during destroy") broke integrity recalculation.

The problem is dm_suspended() returns true not only during suspend,
but also during resume. So this race condition could occur:
1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
2. integrity_recalc (&ic->recalc_work) preempts the current thread
3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
4. integrity_recalc exits and no recalculating is done.

To fix this race condition, add a function dm_post_suspending that is
only true during the postsuspend phase and use it instead of
dm_suspended().

Signed-off-by: Mikulas Patocka <mpatocka redhat com>
Fixes: adc0daad36 ("dm: report suspended device during destroy")
Cc: stable vger kernel org # v4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-23 14:39:37 -04:00
Yufen Yu
83c3e5e17b md/raid5: use do_div() for 64 bit divisions in raid5_sync_request
We get compilation error on 32-bit architectures (e.g. m68k), as:

  ERROR: modpost: "__udivdi3" [drivers/md/raid456.ko] undefined!

Since 'sync_blocks' is defined as u64, use do_div() to fix this error.

Fixes: c911c46c01 ("md/raid456: convert macro STRIPE_* to RAID5_STRIPE_*")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-22 22:49:46 -07:00
Vitaly Mayatskikh
fe630de009 md/raid10: avoid deadlock on recovery.
When disk failure happens and the array has a spare drive, resync thread
kicks in and starts to refill the spare. However it may get blocked by
a retry thread that resubmits failed IO to a mirror and itself can get
blocked on a barrier raised by the resync thread.

Acked-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Vitaly Mayatskikh <vmayatskikh@digitalocean.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-22 11:44:54 -07:00
Zhao Heming
edee9dfe51 md-cluster: fix rmmod issue when md_cluster convert bitmap to none
update_array_info misses calling module_put when removing cluster bitmap.

steps to reproduce:
```
node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda
/dev/sdb
mdadm: array /dev/md0 started.
node1 # lsmod | egrep "dlm|md_|raid1"
md_cluster             28672  1
dlm                   212992  14 md_cluster
configfs               57344  2 dlm
raid1                  53248  1
md_mod                176128  2 raid1,md_cluster
node1 # mdadm -G /dev/md0 -b none
node1 # lsmod | egrep "dlm|md_|raid1"
md_cluster             28672  1 <== should be zero
dlm                   212992  9 md_cluster
configfs               57344  2 dlm
raid1                  53248  1
md_mod                176128  2 raid1,md_cluster
node1 # mdadm -G /dev/md0 -b clustered
node1 # lsmod | egrep "dlm|md_|raid1"
md_cluster             28672  2 <== increase
dlm                   212992  14 md_cluster
configfs               57344  2 dlm
raid1                  53248  1
md_mod                176128  2 raid1,md_cluster
node1 # mdadm -G /dev/md0 -b none
node1 # mdadm -G /dev/md0 -b clustered
node1 # lsmod | egrep "dlm|md_|raid1"
md_cluster             28672  3 <== increase
dlm                   212992  14 md_cluster
configfs               57344  2 dlm
raid1                  53248  1
md_mod                176128  2 raid1,md_cluster
```

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21 17:30:37 -07:00
Zhao Heming
7c9d5c54fb md-cluster: fix safemode_delay value when converting to clustered bitmap
When array convert to clustered bitmap, the safe_mode_delay doesn't
clean and vice versa. the /sys/block/mdX/md/safe_mode_delay keep original
value after changing bitmap type. In safe_delay_store(), the code forbids
setting mddev->safemode_delay when array is clustered. So in cluster-md
env, the expected safemode_delay value should be 0.

Reproducible steps:
```
node1 # mdadm --zero-superblock /dev/sd{b,c,d}
node1 # mdadm -C /dev/md0 -b internal -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc
node1 # cat /sys/block/md0/md/safe_mode_delay
0.204
node1 # mdadm -G /dev/md0 -b none
node1 # mdadm --grow /dev/md0 --bitmap=clustered
node1 # cat /sys/block/md0/md/safe_mode_delay
0.204  <== doesn't change, should ZERO for cluster-md

node1 # mdadm --zero-superblock /dev/sd{b,c,d}
node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc
node1 # cat /sys/block/md0/md/safe_mode_delay
0.000
node1 # mdadm -G /dev/md0 -b none
node1 # cat /sys/block/md0/md/safe_mode_delay
0.000  <== doesn't change, should default value
```

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21 17:30:12 -07:00
Yufen Yu
3b5408b98e md/raid5: support config stripe_size by sysfs entry
Adding a new 'stripe_size' sysfs entry to set and show stripe_size.
stripe_size should not be bigger than PAGE_SIZE, and it requires to
be multiple of 4096. We can adjust stripe_size by writing value into
sysfs entry, likely, set stripe_size as 16KB:

          echo 16384 > /sys/block/md1/md/stripe_size

Show current stripe_size value:

          cat /sys/block/md1/md/stripe_size

For PAGE_SIZE is equal to 4096, 'stripe_size' can just be read.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21 17:18:22 -07:00
Yufen Yu
e236858243 md/raid5: set default stripe_size as 4096
In RAID5, if issued bio size is bigger than stripe_size, it will be
split in the unit of stripe_size and process them one by one. Even
for size less then stripe_size, RAID5 also request data from disk at
least of stripe_size.

Nowdays, stripe_size is equal to the value of PAGE_SIZE. Since filesystem
usually issue bio in the unit of 4KB, there is no problem for PAGE_SIZE
as 4KB. But, for 64KB PAGE_SIZE, bio from filesystem requests 4KB data
while RAID5 issue IO at least stripe_size (64KB) each time. That will
waste resource of disk bandwidth and compute xor.

To avoding the waste, we want to make stripe_size configurable. This
patch just set default stripe_size as 4096. User can also set the value
bigger than 4KB for some special requirements, such as we know the
issued io size is more than 4KB.

To evaluate the new feature, we create raid5 device '/dev/md5' with
4 SSD disk and test it on arm64 machine with 64KB PAGE_SIZE.

1) We format /dev/md5 with mkfs.ext4 and mount ext4 with default
 configure on /mnt directory. Then, trying to test it by dbench with
 command: dbench -D /mnt -t 1000 10. Result show as:

 'stripe_size = 64KB'

  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    9805011     0.021    64.728
  Close        7202525     0.001     0.120
  Rename        415213     0.051    44.681
  Unlink       1980066     0.079    93.147
  Deltree          240     1.793     6.516
  Mkdir            120     0.004     0.007
  Qpathinfo    8887512     0.007    37.114
  Qfileinfo    1557262     0.001     0.030
  Qfsinfo      1629582     0.012     0.152
  Sfileinfo     798756     0.040    57.641
  Find         3436004     0.019    57.782
  WriteX       4887239     0.021    57.638
  ReadX        15370483     0.005    37.818
  LockX          31934     0.003     0.022
  UnlockX        31933     0.001     0.021
  Flush         687205    13.302   530.088

 Throughput 307.799 MB/sec  10 clients  10 procs  max_latency=530.091 ms
 -------------------------------------------------------

 'stripe_size = 4KB'

  Operation      Count    AvgLat    MaxLat
  ----------------------------------------
  NTCreateX    11999166     0.021    36.380
  Close        8814128     0.001     0.122
  Rename        508113     0.051    29.169
  Unlink       2423242     0.070    38.141
  Deltree          300     1.885     7.155
  Mkdir            150     0.004     0.006
  Qpathinfo    10875921     0.007    35.485
  Qfileinfo    1905837     0.001     0.032
  Qfsinfo      1994304     0.012     0.125
  Sfileinfo     977450     0.029    26.489
  Find         4204952     0.019     9.361
  WriteX       5981890     0.019    27.804
  ReadX        18809742     0.004    33.491
  LockX          39074     0.003     0.025
  UnlockX        39074     0.001     0.014
  Flush         841022    10.712   458.848

 Throughput 376.777 MB/sec  10 clients  10 procs  max_latency=458.852 ms
 -------------------------------------------------------

 It show that setting stripe_size as 4KB has higher thoughput, i.e.
 (376.777 vs 307.799) and has smaller latency than that setting as 64KB.

 2) We try to evaluate IO throughput for /dev/md5 by fio with config:

 [4KB randwrite]
 direct=1
 numjob=2
 iodepth=64
 ioengine=libaio
 filename=/dev/md5
 bs=4KB
 rw=randwrite

 [64KB write]
 direct=1
 numjob=2
 iodepth=64
 ioengine=libaio
 filename=/dev/md5
 bs=1MB
 rw=write

 The result as follow:

               +                   +
               | stripe_size(64KB) | stripe_size(4KB)
 +----------------------------------------------------+
 4KB randwrite |     15MB/s        |      100MB/s
 +----------------------------------------------------+
 1MB write     |   1000MB/s        |      700MB/s

 The result show that when size of io is bigger than 4KB (64KB),
 64KB stripe_size has much higher IOPS. But for 4KB randwrite, that
 means, size of io issued to device are smaller, 4KB stripe_size
 have better performance.

Normally, default value (4096) can get relatively good performance.
But if each issued io is bigger than 4096, setting value more than
4096 may get better performance.

Here, we just set default stripe_size as 4096, and we will try to
support setting different stripe_size by sysfs interface in the
following patch.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21 17:18:17 -07:00
Yufen Yu
c911c46c01 md/raid456: convert macro STRIPE_* to RAID5_STRIPE_*
Convert macro STRIPE_SIZE, STRIPE_SECTORS and STRIPE_SHIFT to
RAID5_STRIPE_SIZE(), RAID5_STRIPE_SECTORS() and RAID5_STRIPE_SHIFT().

This patch is prepare for the following adjustable stripe_size.
It will not change any existing functionality.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21 17:18:12 -07:00
Christoph Hellwig
9efa82ef2b block: remove bdev_stack_limits
This function is just a tiny wrapper around blk_stack_limit and has
two callers.  Simplify the stack a bit by open coding it in the two
callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-20 15:38:52 -06:00
Christoph Hellwig
3093a47972 block: inherit the zoned characteristics in blk_stack_limits
Lift the code from device mapper into blk_stack_limits to inherity
the stacking limitations.  This ensures we do the right thing for
all stacked zoned block devices.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-20 15:38:52 -06:00
Jens Axboe
4f43d64807 Merge branch 'for-5.9/drivers' into for-5.9/block-merge
* for-5.9/drivers: (38 commits)
  block: add max_active_zones to blk-sysfs
  block: add max_open_zones to blk-sysfs
  s390/dasd: Use struct_size() helper
  s390/dasd: fix inability to use DASD with DIAG driver
  md-cluster: fix wild pointer of unlock_all_bitmaps()
  md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes
  md: fix deadlock causing by sysfs_notify
  md: improve io stats accounting
  md: raid0/linear: fix dereference before null check on pointer mddev
  rsxx: switch from 'pci_free_consistent()' to 'dma_free_coherent()'
  nvme: remove ns->disk checks
  nvme-pci: use standard block status symbolic names
  nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size()
  nvme-pci: add a blank line after declarations
  nvme-pci: fix some comments issues
  nvme-pci: remove redundant segment validation
  nvme: document quirked Intel models
  nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs
  nvme: support for zoned namespaces
  nvme: support for multiple Command Sets Supported and Effects log pages
  ...
2020-07-20 15:38:27 -06:00
Jens Axboe
9caaa66c91 Merge branch 'for-5.9/block' into for-5.9/block-merge
* for-5.9/block: (124 commits)
  blk-cgroup: show global disk stats in root cgroup io.stat
  blk-cgroup: make iostat functions visible to stat printing
  block: improve discard bio alignment in __blkdev_issue_discard()
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  block: defer flush request no matter whether we have elevator
  block: make blk_timeout_init() static
  block: remove retry loop in ioc_release_fn()
  block: remove unnecessary ioc nested locking
  block: integrate bd_start_claiming into __blkdev_get
  block: use bd_prepare_to_claim directly in the loop driver
  block: refactor bd_start_claiming
  block: simplify the restart case in __blkdev_get
  Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait."
  block: always remove partitions from blk_drop_partitions()
  block: relax jiffies rounding for timeouts
  blk-mq: remove redundant validation in __blk_mq_end_request()
  blk-mq: Remove unnecessary local variable
  writeback: remove bdi->congested_fn
  writeback: remove struct bdi_writeback_congested
  writeback: remove {set,clear}_wb_congested
  ...
2020-07-20 15:38:23 -06:00
Damien Le Moal
8e225f04d2 dm crypt: Enable zoned block device support
Enable support for zoned block devices. This is done by:
1) implementing the target report_zones method.
2) adding the DM_TARGET_ZONED_HM flag to the target features.
3) setting DM_CRYPT_NO_WRITE_WORKQUEUE flag to avoid IO
   processing via workqueue.
4) Introducing inline write encryption completion to preserve write
   ordering.

The last point is implemented by introducing the internal flag
DM_CRYPT_WRITE_INLINE. When set, kcryptd_crypt_write_convert() always
waits inline for the completion of a write request encryption if the
request is not already completed once crypt_convert() returns.
Completion of write request encryption is signaled using the
restart completion by kcryptd_async_done(). This mechanism allows
using ciphers that have an asynchronous implementation, isolating
dm-crypt from any potential request completion reordering for these
ciphers.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20 11:17:45 -04:00
Ignat Korchagin
39d42fa96b dm crypt: add flags to optionally bypass kcryptd workqueues
This is a follow up to [1] that detailed latency problems associated
with dm-crypt's use of workqueues when processing IO.

Current dm-crypt implementation creates a significant IO performance
overhead (at least on small IO block sizes) for both latency and
throughput. We suspect offloading IO request processing into
workqueues and async threads is more harmful these days with the
modern fast storage. I also did some digging into the dm-crypt git
history and much of this async processing is not needed anymore,
because the reasons it was added are mostly gone from the kernel. More
details can be found in [2] (see "Git archeology" section).

This change adds DM_CRYPT_NO_READ_WORKQUEUE and
DM_CRYPT_NO_WRITE_WORKQUEUE flags for read and write BIOs, which
direct dm-crypt to not offload crypto operations into kcryptd
workqueues.  In addition, writes are not buffered to be sorted in the
dm-crypt red-black tree, but dispatched immediately. For cases, where
crypto operations cannot happen (hard interrupt context, for example
the read path of some NVME drivers), we offload the work to a tasklet
rather than a workqueue.

These flags only ensure no async BIO processing in the dm-crypt
module. It is worth noting that some Crypto API implementations may
offload encryption into their own workqueues, which are independent of
the dm-crypt and its configuration. However upon enabling these new
flags dm-crypt will instruct Crypto API not to backlog crypto
requests.

To give an idea of the performance gains for certain workloads,
consider the script, and results when tested against various
devices, detailed here:
https://www.redhat.com/archives/dm-devel/2020-July/msg00138.html

[1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
[2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20 11:17:44 -04:00
Mikulas Patocka
70704c33db dm bufio: do buffer cleanup from a workqueue
Until now, DM bufio's waiting for IO from reclaim context in its
shrinker has caused kswapd to block; which results in systemic IO
stalls and even deadlock, e.g.:
https://www.redhat.com/archives/dm-devel/2020-March/msg00025.html

Here is Dave Chinner's problem description that motivated this fix,
from: https://lore.kernel.org/linux-fsdevel/20190809215733.GZ7777@dread.disaster.area/

"Waiting for IO in kswapd reclaim context is considered harmful -
kswapd context shrinker reclaim should be as non-blocking as possible,
and any back-off to wait for IO to complete should be done by the high
level reclaim core once it's completed an entire reclaim scan cycle of
everything....

What follows from that, and is pertinent in this situation, is that if
you don't block kswapd, then other reclaim contexts are not going to
get stuck waiting for it regardless of the reclaim context they use."

Continued elsewhere:

"The only way to fix this problem once and for all is to stop using
the shrinker as a mechanism to issue and wait on IO. If you need
background writeback of dirty buffers, do it from a WQ_MEM_RECLAIM
workqueue that isn't directly in the memory reclaim path and so can
issue writeback and block safely from a GFP_KERNEL context. Kick the
workqueue from the shrinker context, but get rid of the IO submission
and waiting from the shrinker and all the GFP_NOFS memory reclaim
recursion problems go away."

As such, this commit moves buffer cleanup to a workqueue.

Suggested-by: Dave Chinner <dchinner@redhat.com>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20 11:17:43 -04:00
Ming Lei
e766668c6c dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()
dm_stop_queue() only uses blk_mq_quiesce_queue() so it doesn't
formally stop the blk-mq queue; therefore there is no point making the
blk_mq_queue_stopped() check -- it will never be stopped.

In addition, even though dm_stop_queue() actually tries to quiesce hw
queues via blk_mq_quiesce_queue(), checking with blk_queue_quiesced()
to avoid unnecessary queue quiesce isn't reliable because: the
QUEUE_FLAG_QUIESCED flag is set before synchronize_rcu() and
dm_stop_queue() may be called when synchronize_rcu() from another
blk_mq_quiesce_queue() is in-progress.

Fixes: 7b17c2f729 ("dm: Fix a race condition related to stopping and starting queues")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20 11:17:42 -04:00
yangerkun
0c248ea27f dm dust: add interface to list all badblocks
This interface may help anyone who want to know all badblocks without
querying for each block.

[Bryan: DMEMIT message if no blocks are in the bad block list.]

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20 11:17:41 -04:00
yangerkun
4f7f590b15 dm dust: report some message results directly back to user
Some messages (queryblock, countbadblocks, removebadblock) are best
reported directly to user directly. Do so with DMEMIT.

[Bryan: maintain __func__ output in DMEMIT messages]

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20 11:17:34 -04:00
Kees Cook
3f649ab728 treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
	xargs perl -pi -e \
		's/\buninitialized_var\(([^\)]+)\)/\1/g;
		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16 12:35:15 -07:00
Guoqing Jiang
1684e97538 raid5: remove the meaningless check in raid5_make_request
We can't guarntee the batched stripe to be set with STRIPE_HANDLE since
there are lots of functions could set the flag, such as sync_request,
ops_complete_* and end_{read,write}_request etc.

Also clear_batch_ready called in handle_stripe ensures the batched list
can't continue to be handled by handle_stripe.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-16 10:14:08 -07:00
Guoqing Jiang
cb9902db38 raid5: put the comment of clear_batch_ready to the right place
To make people understand the function well, let's put the comment to
the right place.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-16 10:14:04 -07:00
Guoqing Jiang
a377a472b9 raid5: call clear_batch_ready before set STRIPE_ACTIVE
We tried to only put the head sh of batch list to handle_list, then the
handle_stripe doesn't handle other members in the batch list. However,
we still got the calltrace in break_stripe_batch_list.

[593764.644269] stripe state: 2003
kernel: [593764.644299] ------------[ cut here ]------------
kernel: [593764.644308] WARNING: CPU: 12 PID: 856 at drivers/md/raid5.c:4625 break_stripe_batch_list+0x203/0x240 [raid456]
[...]
kernel: [593764.644363] Call Trace:
kernel: [593764.644370]  handle_stripe+0x907/0x20c0 [raid456]
kernel: [593764.644376]  ? __wake_up_common_lock+0x89/0xc0
kernel: [593764.644379]  handle_active_stripes.isra.57+0x35f/0x570 [raid456]
kernel: [593764.644382]  ? raid5_wakeup_stripe_thread+0x96/0x1f0 [raid456]
kernel: [593764.644385]  raid5d+0x480/0x6a0 [raid456]
kernel: [593764.644390]  ? md_thread+0x11f/0x160
kernel: [593764.644392]  md_thread+0x11f/0x160
kernel: [593764.644394]  ? wait_woken+0x80/0x80
kernel: [593764.644396]  kthread+0xfc/0x130
kernel: [593764.644398]  ? find_pers+0x70/0x70
kernel: [593764.644399]  ? kthread_create_on_node+0x70/0x70
kernel: [593764.644401]  ret_from_fork+0x1f/0x30

As we can see, the stripe was set with STRIPE_ACTIVE and STRIPE_HANDLE,
and only handle_stripe could set those flags then return. And since the
stipe was already in the batch list, we need to return earlier before
set the two flags.

And after dig a little about git history especially commit 3664847d95
("md/raid5: fix a race condition in stripe batch"), it seems the batched
stipe still could be handled by handle_stipe, then handle_stipe needs to
return earlier if clear_batch_ready to return true.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-16 10:12:18 -07:00
Christoph Hellwig
7e0adbfc20 md: rewrite md_setup_drive to avoid ioctls
md_setup_drive knows it works with md devices, so it is rather pointless
to open a file descriptor and issue ioctls.  Just call directly into the
relevant low-level md routines after getting a handle to the device using
blkdev_get_by_dev instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: NeilBrown <neilb@suse.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 17:59:24 +02:00
Christoph Hellwig
d1100488c3 md: simplify md_setup_drive
Move the loop over the possible arrays into the caller to remove a level
of indentation for the whole function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: NeilBrown <neilb@suse.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:35:28 +02:00
Christoph Hellwig
1a6a050620 md: remove the kernel version of md_u.h
mdp_major can just move to drivers/md/md.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:35:21 +02:00
Christoph Hellwig
a1d6bc0189 md: remove the autoscan partition re-read
devfs is long gone, and autoscan works just fine without this these days.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: NeilBrown <neilb@suse.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:35:16 +02:00
Christoph Hellwig
d82fa81c23 md: replace the RAID_AUTORUN ioctl with a direct function call
Instead of using a spcial RAID_AUTORUN ioctl that only exists for
non-modular builds and is only called from the early init code, just
call the actual function directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: NeilBrown <neilb@suse.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:35:03 +02:00
Christoph Hellwig
4f5b246b37 md: move the early init autodetect code to drivers/md/
Just like the NFS and CIFS root code this better lives with the
driver it is tightly integrated with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:34:47 +02:00
Damien Le Moal
38ffc01f38 md: raid10: Fix compilation warning
Remove the if statement around the call to sysfs_link_rdev() in
raid10_start_reshape() to avoid the compilation warning:

warning: suggest braces around empty body in an ‘if’ statement

when compiling with W=1.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-15 22:46:07 -07:00
Damien Le Moal
2aada5b14b md: raid5: Fix compilation warning
Remove the if statement around the calls to sysfs_link_rdev() to avoid
the compilation warning "suggest braces around empty body in an ‘if’
statement" when compiling with W=1.

Also fix function description comments to avoid kdoc format warnings.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-15 22:46:07 -07:00
Damien Le Moal
52923083b5 md: raid5-cache: Remove set but unused variable
Remove the variable offset in r5c_tree_index() to avoid a "set but not
used" compilation warning when compiling with W=1.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-15 22:46:07 -07:00
Damien Le Moal
5e3b8a8d59 md: Fix compilation warning
Remove the if statement around the calls to sysfs_link_rdev() to avoid
the compilation warnings:

warning: suggest braces around empty body in an ‘if’ statement

when compiling with W=1. For the call to sysfs_create_link() generating
the same warning, use the err variable to store the function result,
avoiding triggering another warning as the function is declared
as 'warn_unused_result'.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-15 22:46:07 -07:00
Aneesh Kumar K.V
3e79f082eb libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier
Architectures like ppc64 provide persistent memory specific barriers
that will ensure that all stores for which the modifications are
written to persistent storage by preceding dcbfps and dcbstps
instructions have updated persistent storage before any data
access or data transfer caused by subsequent instructions is initiated.
This is in addition to the ordering done by wmb()

Update nvdimm core such that architecture can use barriers other than
wmb to ensure all previous writes are architecturally visible for
the platform buffer flush.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200701072235.223558-5-aneesh.kumar@linux.ibm.com
2020-07-16 13:00:22 +10:00
Zhao Heming
60f80d6f2d md-cluster: fix wild pointer of unlock_all_bitmaps()
reproduction steps:
```
node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda
/dev/sdb
node2 # mdadm -A /dev/md0 /dev/sda /dev/sdb
node1 # mdadm -G /dev/md0 -b none
mdadm: failed to remove clustered bitmap.
node1 # mdadm -S --scan
^C  <==== mdadm hung & kernel crash
```

kernel stack:
```
[  335.230657] general protection fault: 0000 [#1] SMP NOPTI
[...]
[  335.230848] Call Trace:
[  335.230873]  ? unlock_all_bitmaps+0x5/0x70 [md_cluster]
[  335.230886]  unlock_all_bitmaps+0x3d/0x70 [md_cluster]
[  335.230899]  leave+0x10f/0x190 [md_cluster]
[  335.230932]  ? md_super_wait+0x93/0xa0 [md_mod]
[  335.230947]  ? leave+0x5/0x190 [md_cluster]
[  335.230973]  md_cluster_stop+0x1a/0x30 [md_mod]
[  335.230999]  md_bitmap_free+0x142/0x150 [md_mod]
[  335.231013]  ? _cond_resched+0x15/0x40
[  335.231025]  ? mutex_lock+0xe/0x30
[  335.231056]  __md_stop+0x1c/0xa0 [md_mod]
[  335.231083]  do_md_stop+0x160/0x580 [md_mod]
[  335.231119]  ? 0xffffffffc05fb078
[  335.231148]  md_ioctl+0xa04/0x1930 [md_mod]
[  335.231165]  ? filename_lookup+0xf2/0x190
[  335.231179]  blkdev_ioctl+0x93c/0xa10
[  335.231205]  ? _cond_resched+0x15/0x40
[  335.231214]  ? __check_object_size+0xd4/0x1a0
[  335.231224]  block_ioctl+0x39/0x40
[  335.231243]  do_vfs_ioctl+0xa0/0x680
[  335.231253]  ksys_ioctl+0x70/0x80
[  335.231261]  __x64_sys_ioctl+0x16/0x20
[  335.231271]  do_syscall_64+0x65/0x1f0
[  335.231278]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-14 23:38:32 -07:00
Song Liu
c9020e64cf md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes
In recovery, if we process too much data, raid5-cache may set
MD_SB_CHANGE_PENDING, which causes spinning in handle_stripe().
Fix this issue by clearing the bit before flushing data only
stripes. This issue was initially discussed in [1].

[1] https://www.spinics.net/lists/raid/msg64409.html

Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-14 23:01:31 -07:00
Junxiao Bi
e1a86dbbbd md: fix deadlock causing by sysfs_notify
The following deadlock was captured. The first process is holding 'kernfs_mutex'
and hung by io. The io was staging in 'r1conf.pending_bio_list' of raid1 device,
this pending bio list would be flushed by second process 'md127_raid1', but
it was hung by 'kernfs_mutex'. Using sysfs_notify_dirent_safe() to replace
sysfs_notify() can fix it. There were other sysfs_notify() invoked from io
path, removed all of them.

 PID: 40430  TASK: ffff8ee9c8c65c40  CPU: 29  COMMAND: "probe_file"
  #0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec
  #1 [ffffb87c4df372f8] schedule at ffffffff9a867f06
  #2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6
  #3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs]
  #4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs]
  #5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs]
  #6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs]
  #7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs]
  #8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs]
  #9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7
 #10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93
 #11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968
 #12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2
 #13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445
 #14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f
 #15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a
 #16 [ffffb87c4df37958] new_slab at ffffffff9a251430
 #17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85
 #18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d
 #19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89
 #20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10
 #21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854
 #22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377
 #23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b
 #24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118
 #25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83
 #26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619
 #27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af
 #28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566
 #29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787
 #30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d
 #31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e
 #32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949
 #33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad
     RIP: 00007f617a5f2905  RSP: 00007f607334f838  RFLAGS: 00000246
     RAX: ffffffffffffffda  RBX: 00007f6064044b20  RCX: 00007f617a5f2905
     RDX: 00007f6064044b20  RSI: 00007f6064044b20  RDI: 00007f6064005890
     RBP: 00007f6064044aa0   R8: 0000000000000030   R9: 000000000000011c
     R10: 0000000000000013  R11: 0000000000000246  R12: 00007f606417e6d0
     R13: 00007f6064044aa0  R14: 00007f6064044b10  R15: 00000000ffffffff
     ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b

 PID: 927    TASK: ffff8f15ac5dbd80  CPU: 42  COMMAND: "md127_raid1"
  #0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec
  #1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06
  #2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e
  #3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc
  #4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013
  #5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f
  #6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83
  #7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a
  #8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696
  #9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5
 #10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c
 #11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1]
 #12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348
 #13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005
 #14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-14 22:58:51 -07:00
Artur Paszkiewicz
41d2d848e5 md: improve io stats accounting
Use generic io accounting functions to manage io stats. There was an
attempt to do this earlier in commit 18c0b223cf ("md: use generic io
stats accounting functions to simplify io stat accounting"), but it did
not include a call to generic_end_io_acct() and caused issues with
tracking in-flight IOs, so it was later removed in commit 74672d069b
("md: fix md io stats accounting broken").

This patch attempts to fix this by using both disk_start_io_acct() and
disk_end_io_acct(). To make it possible, a struct md_io is allocated for
every new md bio, which includes the io start_time. A new mempool is
introduced for this purpose. We override bio->bi_end_io with our own
callback and call disk_start_io_acct() before passing the bio to
md_handle_request(). When it completes, we call disk_end_io_acct() and
the original bi_end_io callback.

This adds correct statistics about in-flight IOs and IO processing time,
interpreted e.g. in iostat as await, svctm, aqu-sz and %util.

It also fixes a situation where too many IOs where reported if a bio was
re-submitted to the mddev, because io accounting is now performed only
on newly arriving bios.

Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-13 23:06:56 -07:00
Colin Ian King
9a5a85972c md: raid0/linear: fix dereference before null check on pointer mddev
Pointer mddev is being dereferenced with a test_bit call before mddev
is being null checked, this may cause a null pointer dereference. Fix
this by moving the null pointer checks to sanity check mddev before
it is dereferenced.

Addresses-Coverity: ("Dereference before null check")
Fixes: 62f7b1989c ("md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-13 23:06:56 -07:00
JeongHyeon Lee
e1fef0b08e dm verity: add "panic_on_corruption" error handling mode
Samsung smart phones may need the ability to panic on corruption.  Not
all devices provide the bootloader support needed to use the existing
"restart_on_corruption" mode.  Additional details for why Samsung needs
this new mode can be found here:
https://www.redhat.com/archives/dm-devel/2020-June/msg00235.html

Signed-off-by: jhs2.lee <jhs2.lee@samsung.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13 11:47:33 -04:00
Mike Snitzer
374117ad47 dm mpath: use double checked locking in fast path
Fast-path code biased toward lazy acknowledgement of bit being set
(primarily only for initialization).  Multipath code is very retry
oriented so even if state is missed it'll recover.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13 11:47:32 -04:00
Mike Snitzer
564dbb130b dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctl
Makes consistent with __map_bio() and multipath_clone_and_map().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13 11:47:32 -04:00
Mike Snitzer
17213ec180 dm mpath: rework __map_bio()
so that it follows same pattern as request-based
multipath_clone_and_map()

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13 11:47:32 -04:00
Mike Snitzer
f45f11868e dm mpath: factor out multipath_queue_bio
Enables further cleanup.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13 11:47:32 -04:00
Mike Snitzer
73265f3ffd dm mpath: push locking down to must_push_back_rq()
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13 11:47:32 -04:00
Mike Snitzer
a271a89c6e dm mpath: take m->lock spinlock when testing QUEUE_IF_NO_PATH
Fix multipath_end_io, multipath_end_io_bio and multipath_busy to take
m->lock while testing if MPATHF_QUEUE_IF_NO_PATH bit is set.  These are
all slow-path cases when no paths are available so extra locking isn't a
performance hit.  Correctness matters most.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13 11:47:32 -04:00
Mike Snitzer
69cea0d45a dm mpath: changes from initial m->flags locking audit
Fix locking in slow-paths where m->lock should be taken.

Signed-off-by: Mike Snitzer <snitzer@rredhat.com>
2020-07-13 11:47:32 -04:00
Christoph Hellwig
21cf866145 writeback: remove bdi->congested_fn
Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures.  And
pktdvd is a deprecated driver that isn't useful in stack setup
either.  So remove the dead congested_fn stacking infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 17:20:46 -06:00
Christoph Hellwig
8c911f3d4c writeback: remove struct bdi_writeback_congested
We never set any congested bits in the group writeback instances of it.
And for the simpler bdi-wide case a simple scalar field is all that
that is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 17:05:53 -06:00
Christoph Hellwig
a564e23f0f md: switch to ->check_events for media change notifications
md is the last driver using the legacy media_changed method.  Switch
it over to (not so) new ->clear_events approach, which also removes the
need for the ->revalidate_disk method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: remove unused 'bdops' variable in disk_clear_events()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 16:19:47 -06:00
Mikulas Patocka
6958c1c640 dm: use noio when sending kobject event
kobject_uevent may allocate memory and it may be called while there are dm
devices suspended. The allocation may recurse into a suspended device,
causing a deadlock. We must set the noio flag when sending a uevent.

The observed deadlock was reported here:
https://www.redhat.com/archives/dm-devel/2020-March/msg00025.html

Reported-by: Khazhismel Kumykov <khazhy@google.com>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Reported-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-08 12:50:51 -04:00
Damien Le Moal
174364f6a8 dm zoned: Fix zone reclaim trigger
Only triggering reclaim based on the percentage of unmapped cache
zones can fail to detect cases where reclaim is needed, e.g. if the
target has only 2 or 3 cache zones and only one unmapped cache zone,
the percentage of free cache zones is higher than
DMZ_RECLAIM_LOW_UNMAP_ZONES (30%) and reclaim does not trigger.

This problem, combined with the fact that dmz_schedule_reclaim() is
called from dmz_handle_bio() without the map lock held, leads to a
race between zone allocation and dmz_should_reclaim() result.
Depending on the workload applied, this race can lead to the write
path waiting forever for a free zone without reclaim being triggered.

Fix this by moving dmz_schedule_reclaim() inside dmz_alloc_zone()
under the map lock. This results in checking the need for zone reclaim
whenever a new data or buffer zone needs to be allocated.

Also fix dmz_reclaim_percentage() to always return 0 if the number of
unmapped cache (or random) zones is less than or equal to 1.

Suggested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-08 12:21:53 -04:00
Wei Yongjun
ce34c9b461 dm zoned: fix unused but set variable warnings
Fix unused but set variable warnings:

drivers/md/dm-zoned-reclaim.c:504:42: warning:
 variable nr_rnd set but not used [-Wunused-but-set-variable]
  504 |  unsigned int p_unmap, nr_unmap_rnd = 0, nr_rnd = 0;
      |                                          ^~~~~~
drivers/md/dm-zoned-reclaim.c:504:24: warning:
 variable nr_unmap_rnd set but not used [-Wunused-but-set-variable]
  504 |  unsigned int p_unmap, nr_unmap_rnd = 0, nr_rnd = 0;
      |                        ^~~~~~~~~~~~

Fixes: f97809aec5 ("dm zoned: per-device reclaim")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-08 11:11:21 -04:00
Michal Suchanek
a466245803 dm writecache: reject asynchronous pmem devices
DM writecache does not handle asynchronous pmem. Reject it when
supplied as cache.

Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc.fsf@linux.ibm.com/
Fixes: 6e84200c0a ("virtio-pmem: Add virtio pmem driver")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-08 11:06:57 -04:00
Christoph Hellwig
382761dc63 dm: use bio_uninit instead of bio_disassociate_blkg
bio_uninit is the proper API to clean up a BIO that has been allocated
on stack or inside a structure that doesn't come from the BIO allocator.
Switch dm to use that instead of bio_disassociate_blkg, which really is
an implementation detail.  Note that the bio_uninit calls are also moved
to the two callers of __send_empty_flush, so that they better pair with
the bio_init calls used to initialize them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-08 11:02:05 -04:00
Jens Axboe
482c6b614a Linux 5.8-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8CYDYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGcQkH/2vOsPf79yWtsc7x
 hd2LpCPfrm7T1xlQcYcXbEbyRI8sqPmguixO8pRI1ePl2lBZ7KurfyeYgYZNGpFU
 t74Ph6A6dSWoCgO68Genm/SQuK8ic6o9n1Vr8tDsGDp5KlHWNaweq4JwHrsPmO1T
 cI0PR/ClAhLG8cQZ4x988Es5HTNGY17XK27e+M/zKYxSMGY2NRdJBGQIq964i5Q8
 2d9G0rtVCaVDzgjrLwaFm6RBu21Il7HV6KsBsacyTFiL1ywx2vnUHzeZQyvuJSOQ
 4YpLo9v4tBP10WHC50LRStZyO0qRwPVd/Yl7fL4R/CKsJT9H4uiwasVoEBVSL/k6
 CUn3JL0=
 =P/Vx
 -----END PGP SIGNATURE-----

Merge tag 'v5.8-rc4' into for-5.9/drivers

Merge in 5.8-rc4 for-5.9/block to setup for-5.9/drivers, to provide
a clean base and making the life for the NVMe changes easier.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v5.8-rc4': (732 commits)
  Linux 5.8-rc4
  x86/ldt: use "pr_info_once()" instead of open-coding it badly
  MIPS: Do not use smp_processor_id() in preemptible code
  MIPS: Add missing EHB in mtc0 -> mfc0 sequence for DSPen
  .gitignore: Do not track `defconfig` from `make savedefconfig`
  io_uring: fix regression with always ignoring signals in io_cqring_wait()
  x86/ldt: Disable 16-bit segments on Xen PV
  x86/entry/32: Fix #MC and #DB wiring on x86_32
  x86/entry/xen: Route #DB correctly on Xen PV
  x86/entry, selftests: Further improve user entry sanity checks
  x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER
  i2c: mlxcpld: check correct size of maximum RECV_LEN packet
  i2c: add Kconfig help text for slave mode
  i2c: slave-eeprom: update documentation
  i2c: eg20t: Load module automatically if ID matches
  i2c: designware: platdrv: Set class based on DMI
  i2c: algo-pca: Add 0x78 as SCL stuck low status for PCA9665
  mm/page_alloc: fix documentation error
  vmalloc: fix the owner argument for the new __vmalloc_node_range callers
  mm/cma.c: use exact_nid true to fix possible per-numa cma leak
  ...
2020-07-08 08:02:13 -06:00
Ming Lei
85067747cf dm: do not use waitqueue for request-based DM
Given request-based DM now uses blk-mq's blk_mq_queue_inflight() to
determine if outstanding IO has completed (and DM has no control over
the blk-mq state machine used to track outstanding IO) it is unsafe to
wakeup waiter (dm_wait_for_completion) before blk-mq has cleared a
request's state bits (e.g. MQ_RQ_IN_FLIGHT or MQ_RQ_COMPLETE).  As
such dm_wait_for_completion() could be left to wait indefinitely if no
other requests complete.

Fix this by eliminating request-based DM's use of waitqueue to wait
for blk-mq requests to complete in dm_wait_for_completion.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Depends-on: 3c94d83cb3 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-07 12:00:01 -04:00
Alexander A. Klimov
6f3bc22bf5 Replace HTTP links with HTTPS ones: LVM
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

Deterministic algorithm:
For each file:
  If not .svg:
    For each line:
      If doesn't contain `\bxmlns\b`:
        For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
          If both the HTTP and HTTPS versions
          return 200 OK and serve the same content:
            Replace HTTP with HTTPS.

Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Link: https://lore.kernel.org/r/20200627103138.71885-1-grandmaster@al2klimov.de
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-07-05 14:28:27 -06:00
Jens Axboe
b53ac8b891 dm: remove unused variable
Since merging the commit identified in Fixes below, we trigger this
compile time warning:

drivers/md/dm.c: In function ‘__map_bio’:
drivers/md/dm.c:1296:24: warning: unused variable ‘md’ [-Wunused-variable]
 1296 |  struct mapped_device *md = io->md;
       |                        ^~

Remove the 'md' variable.

Fixes: 5a6c35f9af ("block: remove direct_make_request")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 15:45:19 -06:00
Christoph Hellwig
e556f6ba10 block: remove the bd_queue field from struct block_device
Just use bd_disk->queue instead.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 08:08:20 -06:00
Christoph Hellwig
5a6c35f9af block: remove direct_make_request
Now that submit_bio_noacct has a decent blk-mq fast path there is no
more need for this bypass.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig
ed00aabd5e block: rename generic_make_request to submit_bio_noacct
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig
c62b37d96b block: move ->make_request_fn to struct block_device_operations
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector.  Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does).  Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig
f695ca3886 block: remove the request_queue argument from blk_queue_split
The queue can be trivially derived from the bio, so pass one less
argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:23 -06:00
Christoph Hellwig
c4a59c4e5d dm: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:23 -06:00
Christoph Hellwig
987a0ef88b bcache: stop setting ->queuedata
Nothing in bcache actually uses the ->queuedata field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:23 -06:00
Christoph Hellwig
4ef2c5c246 dm: use bio_uninit instead of bio_disassociate_blkg
bio_uninit is the proper API to clean up a BIO that has been allocated
on stack or inside a structure that doesn't come from the BIO allocator.
Switch dm to use that instead of bio_disassociate_blkg, which really is
an implementation detail.  Note that the bio_uninit calls are also moved
to the two callers of __send_empty_flush, so that they better pair with
the bio_init calls used to initialize them.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 09:09:07 -06:00
Linus Torvalds
5e8eed279f - Quite a few DM zoned target fixes and a Zone append fix in DM core.
Considering the amount of dm-zoned changes that went in during the
   5.8 merge window these fixes are not that surprising.
 
 - A few DM writecache target fixes.
 
 - A fix to Documentation index to include DM ebs target docs.
 
 - Small cleanup to use struct_size() in DM core's retrieve_deps().
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl72SdYTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWtzXB/4y42NJLrH8EXNW/bGIpPlexpbWRW29
 5z6V1VL/q/s9AWrDunLsvhMAtx38FjHJO6i8GQld/gKRtEbOMHfEItsyrQZqKoSr
 M1NzlthuwFnUS30Uq6i06/ihdH5DDEhTElLs7+Q7pvNUd+KiI0mTgM/y856x3s4p
 4OMG1qEi6AnwvYFyCgu/w7/LLBYRipis+j+cm9Y0cTZujh4QBTeeQdBsu++fit2G
 7U1jkddiQSlb8DrbziRt9JK5WrvR4WpQmP9bnK1SxZtKY/55ZNpuNmFnUmUCgUAY
 Y7CEafoElJ6XYTBzKgYSDqY/eutXSzaxn1euDWizDS4H7kdgDS+3uVT9
 =zq2g
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Quite a few DM zoned target fixes and a Zone append fix in DM core.

   Considering the amount of dm-zoned changes that went in during the
   5.8 merge window these fixes are not that surprising.

 - A few DM writecache target fixes.

 - A fix to Documentation index to include DM ebs target docs.

 - Small cleanup to use struct_size() in DM core's retrieve_deps().

* tag 'for-5.8/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm writecache: add cond_resched to loop in persistent_memory_claim()
  dm zoned: Fix reclaim zone selection
  dm zoned: Fix random zone reclaim selection
  dm: update original bio sector on Zone Append
  dm zoned: Fix metadata zone size check
  docs: device-mapper: add dm-ebs.rst to an index file
  dm ioctl: use struct_size() helper in retrieve_deps()
  dm writecache: skip writecache_wait when using pmem mode
  dm writecache: correct uncommitted_block when discarding uncommitted entry
  dm zoned: assign max_io_len correctly
  dm zoned: fix uninitialized pointer dereference
2020-06-27 08:57:16 -07:00
Christoph Hellwig
15f73f5b3e blk-mq: move failure injection out of blk_mq_complete_request
Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superflous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:57 -06:00
Mikulas Patocka
d35bd764e6 dm writecache: add cond_resched to loop in persistent_memory_claim()
Add cond_resched() to a loop that fills in the mapper memory area
because the loop can be executed many times.

Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:32:24 -04:00
Shin'ichiro Kawasaki
f2cd9a5e85 dm zoned: Fix reclaim zone selection
When dm zoned has multiple devices, random zones are never selected for
reclaim if all reserved sequential write zones are in use and no
sequential write required zones can be selected for reclaim. This can
lead to deadlocks as selecting a cache zone allows reclaiming a
sequential zone, ensuring forward progress.

Fix this by always defaulting to selecting a random zone when no
sequential write required zone can be selected.

[Damien: fix commit message]

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:29:39 -04:00
Damien Le Moal
3ee39573e5 dm zoned: Fix random zone reclaim selection
Commit 2094045fe5 ("dm zoned: prefer full zones for reclaim")
modified dmz_get_rnd_zone_for_reclaim() to add a search for the buffer
zone with the heaviest weight as an optimal candidate for reclaim. This
modification uses the zone pointer variabl "last" which is set only once
and never modified as zones are scanned, resulting in the search being
inefective. Furthermore, if the selected buffer zone at the end of the
search loop is active or already locked for reclaim,
dmz_get_rnd_zone_for_reclaim() returns NULL even if other random zones
with a lesser weight can be reclaimed.

To fix the search and to guarantee that reclaim can make forward
progress, fix dmz_get_rnd_zone_for_reclaim() loop to correctly find
the buffer zone with the heaviest weight using the variable maxw_z.
Also make sure to fallback to finding the first random zone that can
be reclaimed if this best candidate zone cannot be reclaimed.

While at it, also fix the device index check to consider only random
zones, ignoring cache zones belonging to the cache device if one is
used as that device does not have a reclaim process.

Fixes: 2094045fe5 ("dm zoned: prefer full zones for reclaim")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:28:23 -04:00
Johannes Thumshirn
415c79e13b dm: update original bio sector on Zone Append
Naohiro reported that issuing zone-append bios to a zoned block device
underneath a dm-linear device does not work as expected.

This because we forgot to reverse-map the sector the device wrote to the
original bio.

For zone-append bios, get the offset in the zone of the written sector
from the clone bio and add that to the original bio's sector position.

Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:25:58 -04:00
Shin'ichiro Kawasaki
b38c0ad57f dm zoned: Fix metadata zone size check
When dm zoned has multiple devices, metadata is on the cache device, not
in random zones of the zoned devices. Then the number of metadata zones
shall be checked with the number of cache zones, not random zones.

Fixes: 34f5affd04 ("dm zoned: separate random and cache zones")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:21:57 -04:00
Gustavo A. R. Silva
da8996250a dm ioctl: use struct_size() helper in retrieve_deps()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct dm_target_deps {
      ...
        __u64 dev[0];   /* out */
};

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-17 12:31:45 -04:00
Huaisheng Ye
a143e172b6 dm writecache: skip writecache_wait when using pmem mode
The array bio_in_progress is only used with ssd mode. So skip
writecache_wait_for_ios in writecache_discard when pmem mode.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-17 12:25:42 -04:00
Huaisheng Ye
39495b12ef dm writecache: correct uncommitted_block when discarding uncommitted entry
When uncommitted entry has been discarded, correct wc->uncommitted_block
for getting the exact number.

Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-17 12:25:41 -04:00
Hou Tao
7b23774867 dm zoned: assign max_io_len correctly
The unit of max_io_len is sector instead of byte (spotted through
code review), so fix it.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-17 12:25:34 -04:00
Damien Le Moal
c69cb1d17b dm zoned: fix uninitialized pointer dereference
Make sure that the local variable rzone in dmz_do_reclaim() is always
initialized before being used for printing debug messages.

Fixes: f97809aec5 ("dm zoned: per-device reclaim")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-17 12:13:08 -04:00
Coly Li
4b25bbf52a bcache: pr_info() format clean up in bcache_device_init()
scripts/checkpatch.pl reports following warning for patch
("bcache: check and adjust logical block size for backing devices"),
    WARNING: quoted string split across lines
    #146: FILE: drivers/md/bcache/super.c:896:
    +  pr_info("%s: sb/logical block size (%u) greater than page size "
    +	       "(%lu) falling back to device logical block size (%u)",

There are two things to fix up,
- The kernel message print should be in a single line.
- pr_info() won't automatically add new line since v5.8, a '\n' should
  be added.

This patch just does the above cleanup in bcache_device_init().

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-14 16:47:56 -06:00
Coly Li
ee4a36f414 bcache: use delayed kworker fo asynchronous devices registration
This patch changes the asynchronous registration kworker to a delayed
kworker. There is probability queue_work() queues the async registration
kworker to the same CPU (even though very little), then the process
which writing sysfs interface to reigster bcache device may won't return
immeidately. queue_delayed_work() in this patch will delay 10 jiffies
before insert the kworker to run queue, which makes sure the registering
process may always returns to user space in time.

Fixes: 9e23ccf8f0 ("bcache: asynchronous devices registration")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-14 16:47:56 -06:00
Mauricio Faria de Oliveira
dcacbc1242 bcache: check and adjust logical block size for backing devices
It's possible for a block driver to set logical block size to
a value greater than page size incorrectly; e.g. bcache takes
the value from the superblock, set by the user w/ make-bcache.

This causes a BUG/NULL pointer dereference in the path:

  __blkdev_get()
  -> set_init_blocksize() // set i_blkbits based on ...
     -> bdev_logical_block_size()
        -> queue_logical_block_size() // ... this value
  -> bdev_disk_changed()
     ...
     -> blkdev_readpage()
        -> block_read_full_page()
           -> create_page_buffers() // size = 1 << i_blkbits
              -> create_empty_buffers() // give size/take pointer
                 -> alloc_page_buffers() // return NULL
                 .. BUG!

Because alloc_page_buffers() is called with size > PAGE_SIZE,
thus it initializes head = NULL, skips the loop, return head;
then create_empty_buffers() gets (and uses) the NULL pointer.

This has been around longer than commit ad6bf88a6c ("block:
fix an integer overflow in logical block size"); however, it
increased the range of values that can trigger the issue.

Previously only 8k/16k/32k (on x86/4k page size) would do it,
as greater values overflow unsigned short to zero, and queue_
logical_block_size() would then use the default of 512.

Now the range with unsigned int is much larger, and users w/
the 512k value, which happened to be zero'ed previously and
work fine, started to hit this issue -- as the zero is gone,
and queue_logical_block_size() does return 512k (>PAGE_SIZE.)

Fix this by checking the bcache device's logical block size,
and if it's greater than page size, fallback to the backing/
cached device's logical page size.

This doesn't affect cache devices as those are still checked
for block/page size in read_super(); only the backing/cached
devices are not.

Apparently it's a regression from commit 2903381fce ("bcache:
Take data offset from the bdev superblock."), moving the check
into BCACHE_SB_VERSION_CDEV only. Now that we have superblocks
of backing devices out there with this larger value, we cannot
refuse to load them (i.e., have a similar check in _BDEV.)

Ideally perhaps bcache should use all values from the backing
device (physical/logical/io_min block size)? But for now just
fix the problematic case.

Test-case:

    # IMG=/root/disk.img
    # dd if=/dev/zero of=$IMG bs=1 count=0 seek=1G
    # DEV=$(losetup --find --show $IMG)
    # make-bcache --bdev $DEV --block 8k
      < see dmesg >

Before:

    # uname -r
    5.7.0-rc7

    [   55.944046] BUG: kernel NULL pointer dereference, address: 0000000000000000
    ...
    [   55.949742] CPU: 3 PID: 610 Comm: bcache-register Not tainted 5.7.0-rc7 #4
    ...
    [   55.952281] RIP: 0010:create_empty_buffers+0x1a/0x100
    ...
    [   55.966434] Call Trace:
    [   55.967021]  create_page_buffers+0x48/0x50
    [   55.967834]  block_read_full_page+0x49/0x380
    [   55.972181]  do_read_cache_page+0x494/0x610
    [   55.974780]  read_part_sector+0x2d/0xaa
    [   55.975558]  read_lba+0x10e/0x1e0
    [   55.977904]  efi_partition+0x120/0x5a6
    [   55.980227]  blk_add_partitions+0x161/0x390
    [   55.982177]  bdev_disk_changed+0x61/0xd0
    [   55.982961]  __blkdev_get+0x350/0x490
    [   55.983715]  __device_add_disk+0x318/0x480
    [   55.984539]  bch_cached_dev_run+0xc5/0x270
    [   55.986010]  register_bcache.cold+0x122/0x179
    [   55.987628]  kernfs_fop_write+0xbc/0x1a0
    [   55.988416]  vfs_write+0xb1/0x1a0
    [   55.989134]  ksys_write+0x5a/0xd0
    [   55.989825]  do_syscall_64+0x43/0x140
    [   55.990563]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [   55.991519] RIP: 0033:0x7f7d60ba3154
    ...

After:

    # uname -r
    5.7.0.bcachelbspgsz

    [   31.672460] bcache: bcache_device_init() bcache0: sb/logical block size (8192) greater than page size (4096) falling back to device logical block size (512)
    [   31.675133] bcache: register_bdev() registered backing device loop0

    # grep ^ /sys/block/bcache0/queue/*_block_size
    /sys/block/bcache0/queue/logical_block_size:512
    /sys/block/bcache0/queue/physical_block_size:8192

Reported-by: Ryan Finnie <ryan@finnie.org>
Reported-by: Sebastian Marsching <sebastian@marsching.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-14 16:47:56 -06:00
Zhiqiang Liu
be23e83733 bcache: fix potential deadlock problem in btree_gc_coalesce
coccicheck reports:
  drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417

In btree_gc_coalesce func, if the coalescing process fails, we will goto
to out_nocoalesce tag directly without releasing new_nodes[i]->write_lock.
Then, it will cause a deadlock when trying to acquire new_nodes[i]->
write_lock for freeing new_nodes[i] before return.

btree_gc_coalesce func details as follows:
	if alloc new_nodes[i] fails:
		goto out_nocoalesce;
	// obtain new_nodes[i]->write_lock
	mutex_lock(&new_nodes[i]->write_lock)
	// main coalescing process
	for (i = nodes - 1; i > 0; --i)
		[snipped]
		if coalescing process fails:
			// Here, directly goto out_nocoalesce
			 // tag will cause a deadlock
			goto out_nocoalesce;
		[snipped]
	// release new_nodes[i]->write_lock
	mutex_unlock(&new_nodes[i]->write_lock)
	// coalesing succ, return
	return;
out_nocoalesce:
	btree_node_free(new_nodes[i])	// free new_nodes[i]
	// obtain new_nodes[i]->write_lock
	mutex_lock(&new_nodes[i]->write_lock);
	// set flag for reuse
	clear_bit(BTREE_NODE_dirty, &ew_nodes[i]->flags);
	// release new_nodes[i]->write_lock
	mutex_unlock(&new_nodes[i]->write_lock);

To fix the problem, we add a new tag 'out_unlock_nocoalesce' for
releasing new_nodes[i]->write_lock before out_nocoalesce tag. If
coalescing process fails, we will go to out_unlock_nocoalesce tag
for releasing new_nodes[i]->write_lock before free new_nodes[i] in
out_nocoalesce tag.

(Coly Li helps to clean up commit log format.)

Fixes: 2a285686c1 ("bcache: btree locking rework")
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-14 16:47:56 -06:00
Masahiro Yamada
a7f7f6248d treewide: replace '---help---' in Kconfig files with 'help'
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.

This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.

There are a variety of indentation styles found.

  a) 4 spaces + '---help---'
  b) 7 spaces + '---help---'
  c) 8 spaces + '---help---'
  d) 1 space + 1 tab + '---help---'
  e) 1 tab + '---help---'    (correct indentation)
  f) 1 tab + 1 space + '---help---'
  g) 1 tab + 2 spaces + '---help---'

In order to convert all of them to 1 tab + 'help', I ran the
following commend:

  $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-14 01:57:21 +09:00
Linus Torvalds
b25c6644bf - Largest change for this cycle is the DM zoned target's metadata
version 2 feature that adds support for pairing regular block
   devices with a zoned device to ease performance impact associated
   with finite random zones of zoned device.  Changes came in 3
   batches: first prepared for and then added the ability to pair a
   single regular block device, second was a batch of fixes to improve
   zoned's reclaim heuristic, third removed the limitation of only
   adding a single additional regular block device to allow many
   devices.  Testing has shown linear scaling as more devices are
   added.
 
 - Add new emulated block size (ebs) target that emulates a smaller
   logical_block_size than a block device supports.  Primary use-case
   is to emulate "512e" devices that have 512 byte logical_block_size
   and 4KB physical_block_size.  This is useful to some legacy
   applications otherwise wouldn't be ablee to be used on 4K devices
   because they depend on issuing IO in 512 byte granularity.
 
 - Add discard interfaces to DM bufio.  First consumer of the interface
   is the dm-ebs target that makes heavy use of dm-bufio.
 
 - Fix DM crypt's block queue_limits stacking to not truncate
   logic_block_size.
 
 - Add Documentation for DM integrity's status line.
 
 - Switch DMDEBUG from a compile time config option to instead use
   dynamic debug via pr_debug.
 
 - Fix DM multipath target's hueristic for how it manages
   "queue_if_no_path" state internally.  DM multipath now avoids
   disabling "queue_if_no_path" unless it is actually needed (e.g. in
   response to configure timeout or explicit "fail_if_no_path"
   message).  This fixes reports of spurious -EIO being reported back
   to userspace application during fault tolerance testing with an NVMe
   backend.  Added various dynamic DMDEBUG messages to assist with
   debugging queue_if_no_path in the future.
 
 - Add a new DM multipath "Historical Service Time" Path Selector.
 
 - Fix DM multipath's dm_blk_ioctl() to switch paths on IO error.
 
 - Improve DM writecache target performance by using explicit
   cache flushing for target's single-threaded usecase and a small
   cleanup to remove unnecessary test in persistent_memory_claim.
 
 - Other small cleanups in DM core, dm-persistent-data, and DM integrity.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl7alrgTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWl42B/9sBd+j60emy4Bliu/f3pd7SEkFrSXv
 K2jXicRFx4E5kO0aLK+fX65cOiq2vvLsDh8c++0TLXcD9q7oK0qxK9c8TCPq29Cx
 W2J2dwdjyyqqbr3/FZHYYM9KLOl5rsxJLygXwhJhQ2Gny44L7nVACrAXNzXIHJ3r
 f8xr+GLdF/jz7WLj8bwEDo3Bf8wvxDvl2ijqj7EceOhTutNE8xHQ6UcTPqTtozJy
 sNM8UQNk1L43DBAvXfrKZ+yQ5DYAKdXKJpV9C8qv5DEGbaEikuMrHddgO4KlDdp4
 VjPk9GSfPwGcJ4ecN8vgecZVGvh52ZU7OZ8qey/q0zqps74jeHTZQQyM
 =jRun
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - The largest change for this cycle is the DM zoned target's metadata
   version 2 feature that adds support for pairing regular block devices
   with a zoned device to ease the performance impact associated with
   finite random zones of zoned device.

   The changes came in three batches: the first prepared for and then
   added the ability to pair a single regular block device, the second
   was a batch of fixes to improve zoned's reclaim heuristic, and the
   third removed the limitation of only adding a single additional
   regular block device to allow many devices.

   Testing has shown linear scaling as more devices are added.

 - Add new emulated block size (ebs) target that emulates a smaller
   logical_block_size than a block device supports

   The primary use-case is to emulate "512e" devices that have 512 byte
   logical_block_size and 4KB physical_block_size. This is useful to
   some legacy applications that otherwise wouldn't be able to be used
   on 4K devices because they depend on issuing IO in 512 byte
   granularity.

 - Add discard interfaces to DM bufio. First consumer of the interface
   is the dm-ebs target that makes heavy use of dm-bufio.

 - Fix DM crypt's block queue_limits stacking to not truncate
   logic_block_size.

 - Add Documentation for DM integrity's status line.

 - Switch DMDEBUG from a compile time config option to instead use
   dynamic debug via pr_debug.

 - Fix DM multipath target's hueristic for how it manages
   "queue_if_no_path" state internally.

   DM multipath now avoids disabling "queue_if_no_path" unless it is
   actually needed (e.g. in response to configure timeout or explicit
   "fail_if_no_path" message).

   This fixes reports of spurious -EIO being reported back to userspace
   application during fault tolerance testing with an NVMe backend.
   Added various dynamic DMDEBUG messages to assist with debugging
   queue_if_no_path in the future.

 - Add a new DM multipath "Historical Service Time" Path Selector.

 - Fix DM multipath's dm_blk_ioctl() to switch paths on IO error.

 - Improve DM writecache target performance by using explicit cache
   flushing for target's single-threaded usecase and a small cleanup to
   remove unnecessary test in persistent_memory_claim.

 - Other small cleanups in DM core, dm-persistent-data, and DM
   integrity.

* tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
  dm crypt: avoid truncating the logical block size
  dm mpath: add DM device name to Failing/Reinstating path log messages
  dm mpath: enhance queue_if_no_path debugging
  dm mpath: restrict queue_if_no_path state machine
  dm mpath: simplify __must_push_back
  dm zoned: check superblock location
  dm zoned: prefer full zones for reclaim
  dm zoned: select reclaim zone based on device index
  dm zoned: allocate zone by device index
  dm zoned: support arbitrary number of devices
  dm zoned: move random and sequential zones into struct dmz_dev
  dm zoned: per-device reclaim
  dm zoned: add metadata pointer to struct dmz_dev
  dm zoned: add device pointer to struct dm_zone
  dm zoned: allocate temporary superblock for tertiary devices
  dm zoned: convert to xarray
  dm zoned: add a 'reserved' zone flag
  dm zoned: improve logging messages for reclaim
  dm zoned: avoid unnecessary device recalulation for secondary superblock
  dm zoned: add debugging message for reading superblocks
  ...
2020-06-05 15:45:03 -07:00
Eric Biggers
64611a15ca dm crypt: avoid truncating the logical block size
queue_limits::logical_block_size got changed from unsigned short to
unsigned int, but it was forgotten to update crypt_io_hints() to use the
new type.  Fix it.

Fixes: ad6bf88a6c ("block: fix an integer overflow in logical block size")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:59 -04:00
Mike Snitzer
04867370ec dm mpath: add DM device name to Failing/Reinstating path log messages
When there are many DM multipath devices it really helps to have
additional context for which DM device a failed or reinstated path is
part of.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:58 -04:00
Mike Snitzer
4c3f48380f dm mpath: enhance queue_if_no_path debugging
Add more DMDEBUG that shows arguments passed and caller, and another
that shows state of related flags at end of queue_if_no_path().

Also add queue_if_no_path DMDEBUG to multipath_resume().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:57 -04:00
Mike Snitzer
553ec94cb4 dm mpath: restrict queue_if_no_path state machine
Do not allow saving disabled queue_if_no_path if already saved as
enabled; implies multiple suspends (which shouldn't ever happen).  Log
if this unlikely scenario is ever triggered.

Also, only write MPATHF_SAVED_QUEUE_IF_NO_PATH during presuspend or if
"fail_if_no_path" message.  MPATHF_SAVED_QUEUE_IF_NO_PATH is no longer
always modified, e.g.: even if queue_if_no_path()'s save_old_value
argument wasn't set.  This just implies a bit tighter control over
the management of MPATHF_SAVED_QUEUE_IF_NO_PATH.  Side-effect is
multipath_resume() doesn't reset MPATHF_QUEUE_IF_NO_PATH unless
MPATHF_SAVED_QUEUE_IF_NO_PATH was set (during presuspend); and at that
time the MPATHF_SAVED_QUEUE_IF_NO_PATH bit gets cleared.  So
MPATHF_SAVED_QUEUE_IF_NO_PATH's use is much more narrow in scope.

Last, but not least, do _not_ disable queue_if_no_path during noflush
suspend.  There is no need/benefit to saving off queue_if_no_path via
MPATHF_SAVED_QUEUE_IF_NO_PATH and clearing MPATHF_QUEUE_IF_NO_PATH for
noflush suspend -- by avoiding this needless queue_if_no_path flag
churn there is less potential for MPATHF_QUEUE_IF_NO_PATH to get lost.
Which avoids potential for IOs to be errored back up to userspace
during DM multipath's handling of path failures.

That said, this last change papers over a reported issue concerning
request-based dm-multipath's interaction with blk-mq, relative to
suspend and resume: multipath_endio is being called _before_
multipath_resume.  This should never happen if DM suspend's
blk_mq_quiesce_queue() + dm_wait_for_completion() is genuinely waiting
for all inflight blk-mq requests to complete.  Similarly:
drivers/md/dm.c:__dm_resume() clearly calls dm_table_resume_targets()
_before_ dm_start_queue()'s blk_mq_unquiesce_queue() is called.  If
the queue isn't even restarted until after multipath_resume(); the BIG
question that still needs answering is: how can multipath_end_io beat
multipath_resume in a race!?

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:56 -04:00
Mike Snitzer
a862e4e215 dm mpath: simplify __must_push_back
Remove micro-optimization that infers device is between presuspend and
resume (was done purely to avoid call to dm_noflush_suspending, which
isn't expensive anyway).

Remove flags argument since they are no longer checked.

And remove must_push_back_bio() since it was simply a call to
__must_push_back().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:55 -04:00
Hannes Reinecke
27d49ac1dd dm zoned: check superblock location
When specifying several devices the superblock location must be
checked to ensure the devices are specified in the correct order.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:54 -04:00
Hannes Reinecke
2094045fe5 dm zoned: prefer full zones for reclaim
Prefer full zones when selecting the next zone for reclaim.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:54 -04:00
Hannes Reinecke
69875d443b dm zoned: select reclaim zone based on device index
per-device reclaim should select zones on that device only.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:53 -04:00
Hannes Reinecke
22c1ef66c4 dm zoned: allocate zone by device index
When allocating a zone, pass in an indicator on which device the zone
should be allocated; this increases performance for a multi-device
setup because reclaim will now allocate zones on the device for which
reclaim is running.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:52 -04:00
Hannes Reinecke
4dba12881f dm zoned: support arbitrary number of devices
Remove the hard-coded limit of two devices and support an unlimited
number of additional zoned devices.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:51 -04:00
Hannes Reinecke
bd82fdabf1 dm zoned: move random and sequential zones into struct dmz_dev
Random and sequential zones should be part of the respective
device structure to make arbitration between devices possible.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:50 -04:00
Hannes Reinecke
f97809aec5 dm zoned: per-device reclaim
Instead of having one reclaim workqueue for the entire set we should
be allocating a reclaim workqueue per device; doing so will reduce
contention and should boost performance for a multi-device setup.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:50 -04:00
Hannes Reinecke
18979819b5 dm zoned: add metadata pointer to struct dmz_dev
Add a metadata pointer within struct dmz_dev and use it as argument
for blkdev_report_zones() instead of the metadata itself.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:49 -04:00
Hannes Reinecke
8f22272af7 dm zoned: add device pointer to struct dm_zone
Add a pointer, to the containing device, within struct dm_zone and
kill dmz_zone_to_dev().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:48 -04:00
Hannes Reinecke
5d2c74f3dd dm zoned: allocate temporary superblock for tertiary devices
Checking the tertiary superblock just consists of validating UUIDs,
crcs, and the generation number; it doesn't have contents which would
be required during the actual operation.

So allocate a temporary superblock when checking tertiary devices to
avoid having to store it together with the 'real' superblocks.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:47 -04:00
Hannes Reinecke
a92fbc446d dm zoned: convert to xarray
The zones array is getting really large, and large arrays tend to
wreak havoc with the CPU caches.  So convert it to xarray to become
more cache friendly.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # fix leak in dmz_insert
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:46 -04:00
Hannes Reinecke
aec67b4ffa dm zoned: add a 'reserved' zone flag
Instead of counting the number of reserved zones in dmz_free_zone(),
mark the zone as 'reserved' during allocation and simplify
dmz_free_zone().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:46 -04:00
Hannes Reinecke
c3ff479dde dm zoned: improve logging messages for reclaim
Instead of just reporting the errno, add some more verbose debugging
message in the reclaim path.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:45 -04:00
Hannes Reinecke
1565929b87 dm zoned: avoid unnecessary device recalulation for secondary superblock
The secondary superblock must reside on the same device as the primary
superblock, so there is no need to re-calculate the device.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:44 -04:00
Hannes Reinecke
35d0c96e42 dm zoned: add debugging message for reading superblocks
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:43 -04:00
Mikulas Patocka
334b4fc172 dm ebs: use dm_bufio_forget_buffers
Use dm_bufio_forget_buffers instead of a block-by-block loop that
calls dm_bufio_forget. dm_bufio_forget_buffers is faster than the loop
because it searches for used buffers using rb-tree.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:42 -04:00
Mikulas Patocka
33a180623b dm bufio: introduce forget_buffer_locked
Introduce a function forget_buffer_locked that forgets a range of
buffers. It is more efficient than calling forget_buffer in a loop.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:41 -04:00
Mikulas Patocka
88f878e588 dm bufio: clean up rbtree block ordering
dm-bufio uses unnatural ordering in the rb-tree - blocks with smaller
numbers were put to the right node and blocks with bigger numbers were
put to the left node.

Reverse that logic so that it's natural.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-05 14:59:41 -04:00
Mikulas Patocka
a1c979f330 dm bufio: delete unused and inefficient dm_bufio_discard_buffers
There is no user for this interface.  If in future it is needed it can
be reimplemented to walk the rbtree of buffers instead of doing
block-by-block lookups.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-04 20:57:38 -04:00
Linus Torvalds
bce159d734 for-5.8/drivers-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VPc4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgQkEACnQlzWOfNQMz1AzgUAv/S8IYDJCLrkbjLZ
 JK4pJv8Hjhss/7sS+fd8kyKe9VtaZz2IjmrXcC66RMMwtpx4iHnkRffoNAgEdGOl
 /M5TCZGhs+F/mp3Lc0WdR5DFHkM6yy2Tkk9wCFLreB4bW67janAWnd7nbU4INqJj
 +WqIgpzNMc/kfUhpBYTeQLORhL4e2TG9ADTi/zeUITlpnEsA65LOgXKEpeIFYnSX
 KTl4GIZ9tjazG3Y1Eva7DYHDIErNNAtX67KBqf+WBgMV98eB0O6xIPN1WlmhDTqj
 FGMLkb8msH1HHntvxDAuc4/ortnUy8vPI4o6zKP89HJJNjIM5p5eHEuVF5JnBw42
 Rtu9Om6JqWx51nhAhJNBj9bUStYbhEl0vVQCwbkfPbDJhzTy3RR8z709q9+ZwOrL
 xbp4aJBzqrzscjBEiSQbNCf2PyuOAdU0r1x81UN81ZN41d5qUcumcinjw4Y7vru8
 z5zMlo1Iy/AWQYyu7jgHmnpI7ZyA/1Qclo5dV7aa72bLFaJa35e7QxgfQOFBA5dY
 UZl6QPJRlnB80uGRzD5jCh2O2sQ3XZqYnpaKsUAka1GgbceCp9IC4A5mfZvpACsh
 Xk8VXjlhvY/iPJsKLqrh4Oedg4Dj5M3PLL9C3MDfYeIP2qgXpbnk87UV1TPNSpY0
 QcTxsXXXIw==
 =H+/Z
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "On top of the core changes, here are the block driver changes for this
  merge window:

   - NVMe changes:
        - NVMe over Fibre Channel protocol updates, which also reach
          over to drivers/scsi/lpfc (James Smart)
        - namespace revalidation support on the target (Anthony
          Iliopoulos)
        - gcc zero length array fix (Arnd Bergmann)
        - nvmet cleanups (Chaitanya Kulkarni)
        - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg)
        - use a SRQ per completion vector (Max Gurtovoy)
        - fix handling of runtime changes to the queue count (Weiping
          Zhang)
        - t10 protection information support for nvme-rdma and
          nvmet-rdma (Israel Rukshin and Max Gurtovoy)
        - target side AEN improvements (Chaitanya Kulkarni)
        - various fixes and minor improvements all over, icluding the
          nvme part of the lpfc driver"

   - Floppy code cleanup series (Willy, Denis)

   - Floppy contention fix (Jiri)

   - Loop CONFIGURE support (Martijn)

   - bcache fixes/improvements (Coly, Joe, Colin)

   - q->queuedata cleanups (Christoph)

   - Get rid of ioctl_by_bdev (Christoph, Stefan)

   - md/raid5 allocation fixes (Coly)

   - zero length array fixes (Gustavo)

   - swim3 task state fix (Xu)"

* tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits)
  bcache: configure the asynchronous registertion to be experimental
  bcache: asynchronous devices registration
  bcache: fix refcount underflow in bcache_device_free()
  bcache: Convert pr_<level> uses to a more typical style
  bcache: remove redundant variables i and n
  lpfc: Fix return value in __lpfc_nvme_ls_abort
  lpfc: fix axchg pointer reference after free and double frees
  lpfc: Fix pointer checks and comments in LS receive refactoring
  nvme: set dma alignment to qword
  nvmet: cleanups the loop in nvmet_async_events_process
  nvmet: fix memory leak when removing namespaces and controllers concurrently
  nvmet-rdma: add metadata/T10-PI support
  nvmet: add metadata support for block devices
  nvmet: add metadata/T10-PI support
  nvme: add Metadata Capabilities enumerations
  nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len
  nvmet: rename nvmet_rw_len to nvmet_rw_data_len
  nvmet: add metadata characteristics for a namespace
  nvme-rdma: add metadata/T10-PI support
  nvme-rdma: introduce nvme_rdma_sgl structure
  ...
2020-06-02 15:37:03 -07:00
Linus Torvalds
750a02ab8d for-5.8/block-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VOwMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoR7EADAlz3TCkb4wwuHytTBDrm6gVDdsJ9zUfQW
 Cl2ASLtufA8PWZUCEI3vhFyOe6P5e+ZZ0O2HjljSevmHyogCaRYXFYVfbWKcQKuk
 AcxiTgnYNevh8KbGLfJY1WL4eXsY+C3QUGivg35cCgrx+kr9oDaHMeqA9Tm1plyM
 FSprDBoSmHPqRxiV/1gnr8uXLX6K7i/fHzwmKgySMhavum7Ma8W3wdAGebzvQwrO
 SbFSuJVgz06e4B1Fzr/wSvVNUE/qW/KqfGuQKIp7VQFIywbgG7TgRMHjE1FSnpnh
 gn+BfL+O5gc0sTvcOTGOE0SRWWwLx961WNg8Azq08l3fzsxLA6h8/AnoDf3i+QMA
 rHmLpWZIic2xPSvjaFHX3/V9ITyGYeAMpAR77EL+4ivWrKv5JrBhnSLDt1fKILdg
 5elxm7RDI+C4nCP4xuTlVCy5gCd6gwjgytKj+NUWhNq1WiGAD0B54SSiV+SbCSH6
 Om2f5trcxz8E4pqWcf0k3LjFapVKRNV8v/+TmVkCdRPBl3y9P0h0wFTkkcEquqnJ
 y7Yq6efdWviRCnX5w/r/yj0qBuk4xo5hMVsPmlthCWtnBm+xZQ6LwMRcq4HQgZgR
 2SYNscZ3OFMekHssH7DvY4DAy1J+n83ims+KzbScbLg2zCZjh/scQuv38R5Eh9WZ
 rCS8c+T7Ig==
 =HYf4
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Core block changes that have been queued up for this release:

   - Remove dead blk-throttle and blk-wbt code (Guoqing)

   - Include pid in blktrace note traces (Jan)

   - Don't spew I/O errors on wouldblock termination (me)

   - Zone append addition (Johannes, Keith, Damien)

   - IO accounting improvements (Konstantin, Christoph)

   - blk-mq hardware map update improvements (Ming)

   - Scheduler dispatch improvement (Salman)

   - Inline block encryption support (Satya)

   - Request map fixes and improvements (Weiping)

   - blk-iocost tweaks (Tejun)

   - Fix for timeout failing with error injection (Keith)

   - Queue re-run fixes (Douglas)

   - CPU hotplug improvements (Christoph)

   - Queue entry/exit improvements (Christoph)

   - Move DMA drain handling to the few drivers that use it (Christoph)

   - Partition handling cleanups (Christoph)"

* tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
  block: mark bio_wouldblock_error() bio with BIO_QUIET
  blk-wbt: rename __wbt_update_limits to wbt_update_limits
  blk-wbt: remove wbt_update_limits
  blk-throttle: remove tg_drain_bios
  blk-throttle: remove blk_throtl_drain
  null_blk: force complete for timeout request
  blk-mq: drain I/O when all CPUs in a hctx are offline
  blk-mq: add blk_mq_all_tag_iter
  blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
  blk-mq: use BLK_MQ_NO_TAG in more places
  blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  blk-mq: move more request initialization to blk_mq_rq_ctx_init
  blk-mq: simplify the blk_mq_get_request calling convention
  blk-mq: remove the bio argument to ->prepare_request
  nvme: force complete cancelled requests
  blk-mq: blk-mq: provide forced completion method
  block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
  block: blk-crypto-fallback: remove redundant initialization of variable err
  block: reduce part_stat_lock() scope
  block: use __this_cpu_add() instead of access by smp_processor_id()
  ...
2020-06-02 15:29:19 -07:00
Christoph Hellwig
88dca4ca5a mm: remove the pgprot argument to __vmalloc
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Guoqing Jiang
db2c1d86cc md: remove __clear_page_buffers and use attach/detach_page_private
After introduction attach/detach_page_private in pagemap.h, we can remove
the duplicated code and call the new functions.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Song Liu <song@kernel.org>
Link: http://lkml.kernel.org/r/20200517214718.468-3-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Ming Lei
bf0beec060 blk-mq: drain I/O when all CPUs in a hctx are offline
Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
up queue mapping. Thomas mentioned the following point[1]:

"That was the constraint of managed interrupts from the very beginning:

 The driver/subsystem has to quiesce the interrupt line and the associated
 queue _before_ it gets shutdown in CPU unplug and not fiddle with it
 until it's restarted by the core when the CPU is plugged in again."

However, current blk-mq implementation doesn't quiesce hw queue before
the last CPU in the hctx is shutdown.  Even worse, CPUHP_BLK_MQ_DEAD is a
cpuhp state handled after the CPU is down, so there isn't any chance to
quiesce the hctx before shutting down the CPU.

Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
where the last CPU goes away, and wait for completion of in-flight
requests.  This guarantees that there is no inflight I/O before shutting
down the managed IRQ.

Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
to wait for completion of in-flight requests from these drivers to avoid
a potential dead-lock. It is safe to do this for stacking drivers as those
do not use interrupts at all and their I/O completions are triggered by
underlying devices I/O completion.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

[hch: different retry mechanism, merged two patches, minor cleanups]

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29 10:23:25 -06:00
Christoph Hellwig
86240d5b68 dm: use bio_{start,end}_io_acct
Switch dm to use the nicer bio accounting helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Christoph Hellwig
85750aeb74 bcache: use bio_{start,end}_io_acct
Switch bcache to use the nicer bio accounting helpers, and call the
routines where we also sample the start time to give coherent accounting
results.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Coly Li
0c8d3fcead bcache: configure the asynchronous registertion to be experimental
In order to avoid the experimental async registration interface to
be treated as new kernel ABI for common users, this patch makes it
as an experimental kernel configure BCACHE_ASYNC_REGISTRAION.

This interface is for extreme large cached data situation, to make sure
the bcache device can always created without the udev timeout issue. For
normal users the async or sync registration does not make difference.

In future when we decide to use the asynchronous registration as default
behavior, this experimental interface may be removed.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:19:36 -06:00
Coly Li
9e23ccf8f0 bcache: asynchronous devices registration
When there is a lot of data cached on cache device, the bcach internal
btree can take a very long to validate during the backing device and
cache device registration. In my test, it may takes 55+ minutes to check
all the internal btree nodes.

The problem is that the registration is invoked by udev rules and the
udevd has 180 seconds timeout by default. If the btree node checking
time is longer than udevd timeout, the registering  process will be
killed by udevd with SIGKILL. If the registering process has pending
sigal, creating kthread for bcache will fail and the device registration
will fail. The result is, for bcache device which cached a lot of data
on cache device, the bcache device node like /dev/bcache<N> won't create
always due to the very long btree checking time.

A solution to avoid the udevd 180 seconds timeout is to register devices
in an asynchronous way. Which is, after writing cache or backing device
path into /sys/fs/bcache/register_async, the kernel code will create a
kworker and move all the btree node checking (for cache device) or dirty
data counting (for cached device) in the kwork context. Then the kworder
is scheduled on system_wq and the registration code just returned to
user space udev rule task. By this asynchronous way, the udev task for
bcache rule will complete in seconds, no matter how long time spent in
the kworker context, it won't be killed by udevd for a timeout.

After all the checking and counting are done asynchronously in the
kworker, the bcache device will eventually be created successfully.

This patch does the above chagne and add a register sysfs file
/sys/fs/bcache/register_async. Writing the registering device path into
this sysfs file will do the asynchronous registration.

The register_async interface is for very rare condition and won't be
used for common users. In future I plan to make the asynchronous
registration as default behavior, which depends on feedback for this
patch.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:19:36 -06:00
Coly Li
86da9f7367 bcache: fix refcount underflow in bcache_device_free()
The problematic code piece in bcache_device_free() is,

 785 static void bcache_device_free(struct bcache_device *d)
 786 {
 787     struct gendisk *disk = d->disk;
 [snipped]
 799     if (disk) {
 800             if (disk->flags & GENHD_FL_UP)
 801                     del_gendisk(disk);
 802
 803             if (disk->queue)
 804                     blk_cleanup_queue(disk->queue);
 805
 806             ida_simple_remove(&bcache_device_idx,
 807                               first_minor_to_idx(disk->first_minor));
 808             put_disk(disk);
 809         }
 [snipped]
 816 }

At line 808, put_disk(disk) may encounter kobject refcount of 'disk'
being underflow.

Here is how to reproduce the issue,
- Attche the backing device to a cache device and do random write to
  make the cache being dirty.
- Stop the bcache device while the cache device has dirty data of the
  backing device.
- Only register the backing device back, NOT register cache device.
- The bcache device node /dev/bcache0 won't show up, because backing
  device waits for the cache device shows up for the missing dirty
  data.
- Now echo 1 into /sys/fs/bcache/pendings_cleanup, to stop the pending
  backing device.
- After the pending backing device stopped, use 'dmesg' to check kernel
  message, a use-after-free warning from KASA reported the refcount of
  kobject linked to the 'disk' is underflow.

The dropping refcount at line 808 in the above code piece is added by
add_disk(d->disk) in bch_cached_dev_run(). But in the above condition
the cache device is not registered, bch_cached_dev_run() has no chance
to be called and the refcount is not added. The put_disk() for a non-
added refcount of gendisk kobject triggers a underflow warning.

This patch checks whether GENHD_FL_UP is set in disk->flags, if it is
not set then the bcache device was not added, don't call put_disk()
and the the underflow issue can be avoided.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:19:36 -06:00
Joe Perches
46f5aa8806 bcache: Convert pr_<level> uses to a more typical style
Remove the trailing newline from the define of pr_fmt and add newlines
to the uses.

Miscellanea:

o Convert bch_bkey_dump from multiple uses of pr_err to pr_cont
  as the earlier conversion was inappropriate done causing multiple
  lines to be emitted where only a single output line was desired
o Use vsprintf extension %pV in bch_cache_set_error to avoid multiple
  line output where only a single line output was desired
o Coalesce formats

Fixes: 6ae63e3501 ("bcache: replace printk() by pr_*() routines")

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:19:36 -06:00
Colin Ian King
3b5b7b1f70 bcache: remove redundant variables i and n
Variables i and n are being assigned but are never used. They are
redundant and can be removed.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Addresses-Coverity: ("Unused value")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:19:36 -06:00
Hannes Reinecke
b4756d43a1 dm zoned: remove leftover hunk for switching to sequential zones
Remove a leftover hunk to switch from random zones to sequential
zones when selecting a reclaim zone; the logic has moved into the
caller and this hunk is now pointless.

Fixes: 34f5affd04 ("dm zoned: separate random and cache zones")
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-22 12:07:14 -04:00
Christoph Hellwig
9398554fb3 block: remove the error_sector argument to blkdev_issue_flush
The argument isn't used by any caller, and drivers don't fill out
bi_sector for flush requests either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-22 08:45:46 -06:00
Hannes Reinecke
a16b7dee30 dm zoned: terminate reclaim on congestion
When dmz_get_chunk_mapping() selects a zone which is under reclaim
we should terminate the reclaim copy process. Since we're changing
the zone itself, reclaim needs to run afterwards again anyway.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:52 -04:00
Hannes Reinecke
c5c7885952 dm zoned: start reclaim with sequential zones
Sequential zones perform better for reclaim, so start off using
them and only use random zones as a fallback when cache zones are
present.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:51 -04:00
Hannes Reinecke
90a9b8693f dm zoned: reclaim random zones when idle
When the system is idle we should be starting reclaiming
random zones, too.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:50 -04:00
Hannes Reinecke
34f5affd04 dm zoned: separate random and cache zones
Instead of lumping emulated zones together with random zones we
should be handling them as separate 'cache' zones. This improves
code readability and allows an easier implementation of different
cache policies.

Also add additional allocation flags, to separate the type (cache,
random, or sequential) from the purpose (eg reclaim).

Also switch the allocation policy to not use random zones as buffer
zones if cache zones are present. This avoids a performance drop when
all cache zones are used.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:49 -04:00
Hannes Reinecke
489dc0f06a dm zoned: return NULL if dmz_get_zone_for_reclaim() fails to find a zone
The only case where dmz_get_zone_for_reclaim() cannot return a zone is
if the respective lists are empty. So we should just return a simple
NULL value here as we really don't have an error code which would make
sense.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:49 -04:00
Nathan Chancellor
42c689f671 dm zoned: Avoid 64-bit division error in dmz_fixup_devices
When building arm32 allyesconfig:

ld.lld: error: undefined symbol: __aeabi_uldivmod
>>> referenced by dm-zoned-target.c
>>>               md/dm-zoned-target.o:(dmz_ctr) in archive drivers/built-in.a

dmz_fixup_devices uses DIV_ROUND_UP with variables of type sector_t. As
such, it should be using DIV_ROUND_UP_SECTOR_T, which handles this
automatically.

Fixes: 70978208ec91 ("dm zoned: metadata version 2")
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:48 -04:00
Mike Snitzer
ac75b09fc6 dm: use DMDEBUG macros now that they use pr_debug variants
Now that DMDEBUG uses pr_debug and DMDEBUG_LIMIT uses
pr_debug_ratelimited cleanup DM's 2 direct pr_debug callers to use
them to get the benefit of consistent DM_FMT formatting of debugging
messages.

While doing so, dm-mpath.c:dm_report_EIO() was switched over to using
DMDEBUG_LIMIT due to the potential for error handling floods in the IO
completion path.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:47 -04:00
Hannes Reinecke
49de3b7d21 dm zoned: remove spurious newlines from debugging messages
DMDEBUG will already add a newline to the logging messages, so we
shouldn't be adding it to the message itself.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:46 -04:00
Gustavo A. R. Silva
b18ae8dd9d dm: replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:44 -04:00
Hannes Reinecke
bd5c40313a dm zoned: metadata version 2
Implement handling for metadata version 2. The new metadata adds a
label and UUID for the device mapper device, and additional UUID for
the underlying block devices.

It also allows for an additional regular drive to be used for
emulating random access zones. The emulated zones will be placed
logically in front of the zones from the zoned block device, causing
the superblocks and metadata to be stored on that device.

The first zone of the original zoned device will be used to hold
another, tertiary copy of the metadata; this copy carries a generation
number of 0 and is never updated; it's just used for identification.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:43 -04:00
Hannes Reinecke
dc076c838f dm zoned: ignore metadata zone in dmz_alloc_zone()
When looking up zones in dmz_alloc_zone() we need to ignore
metadata zones so as not to accidentally overwrite metadata.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 15:47:31 -04:00
Hannes Reinecke
ae3c1f1171 dm zoned: Reduce logging output on startup
dm-zoned is becoming quite chatty during startup; reduce the noise
by moving some information to 'debug' level.

Suggested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 15:47:30 -04:00
Hannes Reinecke
ca1a70450a dm zoned: add metadata logging functions
Use the metadata label for logging and not the underlying
device.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 15:47:25 -04:00
Hannes Reinecke
aa821c8dc0 dm zoned: use dmz_zone_to_dev() when handling metadata I/O
Use accessors to retrieve the device pointer in preparation
for adding an additional block device.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-19 13:30:32 -04:00
Hannes Reinecke
52d6775888 dm zoned: replace 'target' pointer in the bio context
Replace the 'target' pointer in the bio context with the
device pointer as this is what's actually used.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-19 13:30:31 -04:00
Hannes Reinecke
6c805f77f1 dm zoned: remove 'dev' argument from reclaim
Use the dmz_zone_to_dev() mapping function to remove the
'dev' argument from reclaim.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-19 13:30:30 -04:00
Christoph Hellwig
ac7c5675fa blk-mq: allow blk_mq_make_request to consume the q_usage_counter reference
blk_mq_make_request currently needs to grab an q_usage_counter
reference when allocating a request.  This is because the block layer
grabs one before calling blk_mq_make_request, but also releases it as
soon as blk_mq_make_request returns.  Remove the blk_queue_exit call
after blk_mq_make_request returns, and instead let it consume the
reference.  This works perfectly fine for the block layer caller, just
device mapper needs an extra reference as the old problem still
persists there.  Open code blk_queue_enter_live in device mapper,
as there should be no other callers and this allows better documenting
why we do a non-try get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 09:34:29 -06:00
Hannes Reinecke
d0e21ce40c dm zoned: Introduce dmz_dev_is_dying() and dmz_check_dev()
Introduce accessors dmz_dev_is_dying() and dmz_check_dev() to
avoid having to reference the devices directly.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:37 -04:00
Hannes Reinecke
2234e7321d dm zoned: introduce dmz_metadata_label() to format device name
Introduce dmz_metadata_label() to format the device-mapper device
name and use it instead of the device name of the underlying device.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:37 -04:00
Hannes Reinecke
3682056013 dm zoned: move fields from struct dmz_dev to dmz_metadata
Move fields from the device structure into the metadata structure
and provide accessor functions.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:37 -04:00
Hannes Reinecke
bf28a3ba09 dm zoned: store device in struct dmz_sb
Store the device together with the superblock so that
we don't have to recur to the metadata to find it.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:37 -04:00
Hannes Reinecke
735bd7e4cd dm zoned: use array for superblock zones
Instead of storing just the first superblock zone and calculate
the secondary relative to that we should be using an array for
holding the superblock zones.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:37 -04:00
Hannes Reinecke
b712287398 dm zoned: store zone id within the zone structure and kill dmz_id()
Instead of calculating the zone index by the offset within the
zone array store the index within the structure itself. With that
the helper dmz_id() is pointless and can be replaced with accessing
the ->id value directly.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Hannes Reinecke
90b39d58f3 dm zoned: add 'message' callback
Add callback for 'dmsetup message' to allow the reclaim process
to be triggered manually.
Eg.

	dmsetup message /dev/dm-X 0 message

will start the reclaim process even if the default threshold
of 50 percent of free random zones is not reached.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Hannes Reinecke
bc3d5717d2 dm zoned: add 'status' callback
Add callback to supply information for 'dmsetup status'
and 'dmsetup table'. The output for 'dmsetup status' is

0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential

where <nr_unmap_rnd> is the number of unmapped (ie free) random zones,
<nr_rnd> the total number of random zones, <nr_unmap_seq> the number
of unmapped sequential zones, and <nr_seq> the total number of
sequential zones.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Khazhismel Kumykov
2613eab119 dm mpath: add Historical Service Time Path Selector
This new selector keeps an exponential moving average of the service
time for each path (losely defined as delta between start_io and
end_io), and uses this along with the number of inflight requests to
estimate future service time for a path.  Since we don't have a prober
to account for temporally slow paths, re-try "slow" paths every once in
a while (num_paths * historical_service_time). To account for fast paths
transitioning to slow, if a path has not completed any request within
(num_paths * historical_service_time), limit the number of outstanding
requests.  To account for low volume situations where number of
inflight IOs would be zero, the last finish time of each path is
factored in.

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Co-developed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Gabriel Krisman Bertazi
087615bf3a dm mpath: pass IO start time to path selector
The HST path selector needs this information to perform path
prediction. For request-based mpath, struct request's io_start_time_ns
is used, while for bio-based, use the start_time stored in dm_io.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Mikulas Patocka
48338daaa0 dm writecache: improve performance on DDR persistent memory (Optane)
When testing the dm-writecache target on a real DDR persistent memory
(Intel Optane), it turned out that explicit cache flushing using the
clflushopt instruction performs better than non-temporal stores for
block sizes 1k, 2k and 4k.

The dm-writecache target is singlethreaded (all the copying is done
while holding the writecache lock), so it benefits from clwb, see:
http://lore.kernel.org/r/alpine.LRH.2.02.2004160411460.7833@file01.intranet.prod.int.rdu2.redhat.com

Add a new function memcpy_flushcache_optimized() that tests if
clflushopt is present - and if it is, we use it instead of
memcpy_flushcache.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Mikulas Patocka
499c18045e dm writecache: remove superfluous test in persistent_memory_claim
Remove superfluous test if dax_dev is NULL - dax_direct_access already
does this test.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Zhiqiang Liu
9431cf6efc dm persistent data: switch exit_ro_spine to return void
In commit 4c7da06f5a ("dm persistent data: eliminate unnecessary
return values"), r value in exit_ro_spine will not change, so
exit_ro_spine doesn't need a return value.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
YueHaibing
a86fe8be51 dm integrity: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/md/dm-integrity.c: In function 'integrity_metadata':
drivers/md/dm-integrity.c:1557:12: warning:
 variable 'save_metadata_offset' set but not used [-Wunused-but-set-variable]
drivers/md/dm-integrity.c:1556:12: warning:
 variable 'save_metadata_block' set but not used [-Wunused-but-set-variable]

They are never used, so remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Heinz Mauelshagen
a5089a95d8 dm ebs: pass discards down to underlying device
Make use of dm_bufio_issue_discard() to pass discards down to the
underlying device.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Mikulas Patocka
6fbeb0048e dm bufio: implement discard
Add functions dm_bufio_issue_discard and dm_bufio_discard_buffers.
dm_bufio_issue_discard sends discard request to the underlying device.
dm_bufio_discard_buffers frees buffers in the range and then calls
dm_bufio_issue_discard.

Also, factor out block_to_sector for reuse in dm_bufio_issue_discard.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Heinz Mauelshagen
d3c7b35c20 dm: add emulated block size target
This new target is similar to the linear target except that it emulates
a smaller logical block size on a device with a larger logical block
size.  Its main purpose is to emulate 512 byte sectors on 4K native
disks (i.e. 512e).

See Documentation/admin-guide/device-mapper/dm-ebs.rst for details.

Reviewed-by: Damien Le Moal <DamienLeMoal@wdc.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> [Kconfig fixes]
Signed-off-by: Zheng Bin <zhengbin13@huawei.com> [static fixes]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Martin Wilck
2361ae5953 dm mpath: switch paths in dm_blk_ioctl() code path
SCSI LUN passthrough code such as qemu's "scsi-block" device model
pass every IO to the host via SG_IO ioctls. Currently, dm-multipath
calls choose_pgpath() only in the block IO code path, not in the ioctl
code path (unless current_pgpath is NULL). This has the effect that no
path switching and thus no load balancing is done for SCSI-passthrough
IO, unless the active path fails.

Fix this by using the same logic in multipath_prepare_ioctl() as in
multipath_clone_and_map().

Note: The allegedly best path selection algorithm, service-time,
still wouldn't work perfectly, because the io size of the current
request is always set to 0. Changing that for the IO passthrough
case would require the ioctl cmd and arg to be passed to dm's
prepare_ioctl() method.

Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:35 -04:00
Dmitry Baryshkov
27f5411a71 dm crypt: support using encrypted keys
Allow one to use "encrypted" in addition to "user" and "logon" key
types for device encryption.

Signed-off-by: Dmitry Baryshkov <dmitry_baryshkov@mentor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:34 -04:00
Satya Tangirala
a892c8d52c block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.

We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.

We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.

Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.

Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.

Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 09:47:53 -06:00
Gustavo A. R. Silva
358369f03a md/raid1: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 12:02:23 -07:00
Xiongfeng Wang
3f99980c8f md: add a newline when printing parameter 'start_ro' by sysfs
Add a missing newline when printing module parameter 'start_ro' by
sysfs.

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:49:19 -07:00
Christoph Hellwig
e4fc5a7429 md: stop using ->queuedata
Pointer to mddev is already available in private_data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:48:28 -07:00
David Jeffery
c91114c2b8 md/raid1: release pending accounting for an I/O only after write-behind is also finished
When using RAID1 and write-behind, md can deadlock when errors occur. With
write-behind, r1bio structs can be accounted by raid1 as queued but not
counted as pending. The pending count is dropped when the original bio is
returned complete but write-behind for the r1bio may still be active.

This breaks the accounting used in some conditions to know when the raid1
md device has reached an idle state. It can result in calls to
freeze_array deadlocking. freeze_array will never complete from a negative
"unqueued" value being calculated due to a queued count larger than the
pending count.

To properly account for write-behind, move the call to allow_barrier from
call_bio_endio to raid_end_bio_io. When using write-behind, md can call
call_bio_endio before all write-behind I/O is complete. Using
raid_end_bio_io for the point to call allow_barrier will release the
pending count at a point where all I/O for an r1bio, even write-behind, is
done.

Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:22:31 -07:00
Coly Li
3024ba2d6c md: remove redundant memalloc scope API usage
In mddev_create_serial_pool(), memalloc scope APIs memalloc_noio_save()
and memalloc_noio_restore() are used when allocating memory by calling
mempool_create_kmalloc_pool(). After adding the memalloc scope APIs in
raid array suspend context, it is unncessary to explicitly call them
around mempool_create_kmalloc_pool() any longer.

This patch removes the redundant memalloc scope APIs in
mddev_create_serial_pool().

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:22:31 -07:00
Coly Li
7f8a30e5d2 raid5: update code comment of scribble_alloc()
Code comments of scribble_alloc() is outdated for a while. This patch
update the comments in function header for the new parameter list.

Suggested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:22:31 -07:00
Coly Li
ba54d4d4d2 raid5: remove gfp flags from scribble_alloc()
Using GFP_NOIO flag to call scribble_alloc() from resize_chunk() does
not have the expected behavior. kvmalloc_array() inside scribble_alloc()
which receives the GFP_NOIO flag will eventually call kmalloc_node() to
allocate physically continuous pages.

Now we have memalloc scope APIs in mddev_suspend()/mddev_resume() to
prevent memory reclaim I/Os during raid array suspend context, calling
to kvmalloc_array() with GFP_KERNEL flag may avoid deadlock of recursive
I/O as expected.

This patch removes the useless gfp flags from parameters list of
scribble_alloc(), and call kvmalloc_array() with GFP_KERNEL flag. The
incorrect GFP_NOIO flag does not exist anymore.

Fixes: b330e6a49d ("md: convert to kvmalloc")
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:22:31 -07:00
Coly Li
78f57ef9d5 md: use memalloc scope APIs in mddev_suspend()/mddev_resume()
In raid5.c:resize_chunk(), scribble_alloc() is called with GFP_NOIO
flag, then it is sent into kvmalloc_array() inside scribble_alloc().

The problem is kvmalloc_array() eventually calls kvmalloc_node() which
does not accept non GFP_KERNEL compatible flag like GFP_NOIO, then
kmalloc_node() is called indeed to allocate physically continuous
pages. When system memory is under heavy pressure, and the requesting
size is large, there is high probability that allocating continueous
pages will fail.

But simply using GFP_KERNEL flag to call kvmalloc_array() is also
progblematic. In the code path where scribble_alloc() is called, the
raid array is suspended, if kvmalloc_node() triggers memory reclaim I/Os
and such I/Os go back to the suspend raid array, deadlock will happen.

What is desired here is to allocate non-physically (a.k.a virtually)
continuous pages and avoid memory reclaim I/Os. Michal Hocko suggests
to use the mmealloc sceope APIs to restrict memory reclaim I/O in
allocating context, specifically to call memalloc_noio_save() when
suspend the raid array and to call memalloc_noio_restore() when
resume the raid array.

This patch adds the memalloc scope APIs in mddev_suspend() and
mddev_resume(), to restrict memory reclaim I/Os during the raid array
is suspended. The benifit of adding the memalloc scope API in the
unified entry point mddev_suspend()/mddev_resume() is, no matter which
md raid array type (personality), we are sure the deadlock by recursive
memory reclaim I/O won't happen on the suspending context.

Please notice that the memalloc scope APIs only take effect on the raid
array suspending context, if the memory allocation is from another new
created kthread after raid array suspended, the recursive memory reclaim
I/Os won't be restricted. The mddev_suspend()/mddev_resume() entries are
used for the critical section where the raid metadata is modifying,
creating a kthread to allocate memory inside the critical section is
queer and very probably being buggy.

Fixes: b330e6a49d ("md: convert to kvmalloc")
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:22:31 -07:00
Guoqing Jiang
3f79cc2234 md: remove the extra line for ->hot_add_disk
It is not not necessary to add a newline for them since they don't exceed
80 characters, and it is not intutive to distinguish ->hot_add_disk() from
hot_add_disk() too.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:22:31 -07:00
Guoqing Jiang
78b990cf28 md: flush md_rdev_misc_wq for HOT_ADD_DISK case
Since rdev->kobj is removed asynchronously, it is possible that the
rdev->kobj still exists when try to add the rdev again after rdev
is removed. But this path md_ioctl (HOT_ADD_DISK) -> hot_add_disk
-> bind_rdev_to_array missed it.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:22:31 -07:00
Guoqing Jiang
f6766ff6af md: don't flush workqueue unconditionally in md_open
We need to check mddev->del_work before flush workqueu since the purpose
of flush is to ensure the previous md is disappeared. Otherwise the similar
deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the
bdev->bd_mutex before flush workqueue.

kernel: [  154.522645] ======================================================
kernel: [  154.522647] WARNING: possible circular locking dependency detected
kernel: [  154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G           O
kernel: [  154.522651] ------------------------------------------------------
kernel: [  154.522653] mdadm/2482 is trying to acquire lock:
kernel: [  154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
kernel: [  154.522673]
kernel: [  154.522673] but task is already holding lock:
kernel: [  154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [  154.522691]
kernel: [  154.522691] which lock already depends on the new lock.
kernel: [  154.522691]
kernel: [  154.522694]
kernel: [  154.522694] the existing dependency chain (in reverse order) is:
kernel: [  154.522696]
kernel: [  154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
kernel: [  154.522704]        __mutex_lock+0x87/0x950
kernel: [  154.522706]        __blkdev_get+0x79/0x590
kernel: [  154.522708]        blkdev_get+0x65/0x140
kernel: [  154.522709]        blkdev_get_by_dev+0x2f/0x40
kernel: [  154.522716]        lock_rdev+0x3d/0x90 [md_mod]
kernel: [  154.522719]        md_import_device+0xd6/0x1b0 [md_mod]
kernel: [  154.522723]        new_dev_store+0x15e/0x210 [md_mod]
kernel: [  154.522728]        md_attr_store+0x7a/0xc0 [md_mod]
kernel: [  154.522732]        kernfs_fop_write+0x117/0x1b0
kernel: [  154.522735]        vfs_write+0xad/0x1a0
kernel: [  154.522737]        ksys_write+0xa4/0xe0
kernel: [  154.522745]        do_syscall_64+0x64/0x2b0
kernel: [  154.522748]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522749]
kernel: [  154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
kernel: [  154.522752]        __mutex_lock+0x87/0x950
kernel: [  154.522756]        new_dev_store+0xc9/0x210 [md_mod]
kernel: [  154.522759]        md_attr_store+0x7a/0xc0 [md_mod]
kernel: [  154.522761]        kernfs_fop_write+0x117/0x1b0
kernel: [  154.522763]        vfs_write+0xad/0x1a0
kernel: [  154.522765]        ksys_write+0xa4/0xe0
kernel: [  154.522767]        do_syscall_64+0x64/0x2b0
kernel: [  154.522769]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522770]
kernel: [  154.522770] -> #2 (kn->count#253){++++}:
kernel: [  154.522775]        __kernfs_remove+0x253/0x2c0
kernel: [  154.522778]        kernfs_remove+0x1f/0x30
kernel: [  154.522780]        kobject_del+0x28/0x60
kernel: [  154.522783]        mddev_delayed_delete+0x24/0x30 [md_mod]
kernel: [  154.522786]        process_one_work+0x2a7/0x5f0
kernel: [  154.522788]        worker_thread+0x2d/0x3d0
kernel: [  154.522793]        kthread+0x117/0x130
kernel: [  154.522795]        ret_from_fork+0x3a/0x50
kernel: [  154.522796]
kernel: [  154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
kernel: [  154.522800]        process_one_work+0x27e/0x5f0
kernel: [  154.522802]        worker_thread+0x2d/0x3d0
kernel: [  154.522804]        kthread+0x117/0x130
kernel: [  154.522806]        ret_from_fork+0x3a/0x50
kernel: [  154.522807]
kernel: [  154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
kernel: [  154.522813]        __lock_acquire+0x1392/0x1690
kernel: [  154.522816]        lock_acquire+0xb4/0x1a0
kernel: [  154.522818]        flush_workqueue+0xab/0x4b0
kernel: [  154.522821]        md_open+0xb6/0xc0 [md_mod]
kernel: [  154.522823]        __blkdev_get+0xea/0x590
kernel: [  154.522825]        blkdev_get+0x65/0x140
kernel: [  154.522828]        do_dentry_open+0x1d1/0x380
kernel: [  154.522831]        path_openat+0x567/0xcc0
kernel: [  154.522834]        do_filp_open+0x9b/0x110
kernel: [  154.522836]        do_sys_openat2+0x201/0x2a0
kernel: [  154.522838]        do_sys_open+0x57/0x80
kernel: [  154.522840]        do_syscall_64+0x64/0x2b0
kernel: [  154.522842]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522844]
kernel: [  154.522844] other info that might help us debug this:
kernel: [  154.522844]
kernel: [  154.522846] Chain exists of:
kernel: [  154.522846]   (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
kernel: [  154.522846]
kernel: [  154.522850]  Possible unsafe locking scenario:
kernel: [  154.522850]
kernel: [  154.522852]        CPU0                    CPU1
kernel: [  154.522853]        ----                    ----
kernel: [  154.522854]   lock(&bdev->bd_mutex);
kernel: [  154.522856]                                lock(&mddev->reconfig_mutex);
kernel: [  154.522858]                                lock(&bdev->bd_mutex);
kernel: [  154.522860]   lock((wq_completion)md_misc);
kernel: [  154.522861]
kernel: [  154.522861]  *** DEADLOCK ***
kernel: [  154.522861]
kernel: [  154.522864] 1 lock held by mdadm/2482:
kernel: [  154.522865]  #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [  154.522868]
kernel: [  154.522868] stack backtrace:
kernel: [  154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G           O      5.6.0-rc7-lp151.27-default #25
kernel: [  154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
kernel: [  154.522878] Call Trace:
kernel: [  154.522881]  dump_stack+0x8f/0xcb
kernel: [  154.522884]  check_noncircular+0x194/0x1b0
kernel: [  154.522888]  ? __lock_acquire+0x1392/0x1690
kernel: [  154.522890]  __lock_acquire+0x1392/0x1690
kernel: [  154.522893]  lock_acquire+0xb4/0x1a0
kernel: [  154.522895]  ? flush_workqueue+0x84/0x4b0
kernel: [  154.522898]  flush_workqueue+0xab/0x4b0
kernel: [  154.522900]  ? flush_workqueue+0x84/0x4b0
kernel: [  154.522905]  ? md_open+0xb6/0xc0 [md_mod]
kernel: [  154.522908]  md_open+0xb6/0xc0 [md_mod]
kernel: [  154.522910]  __blkdev_get+0xea/0x590
kernel: [  154.522912]  ? bd_acquire+0xc0/0xc0
kernel: [  154.522914]  blkdev_get+0x65/0x140
kernel: [  154.522916]  ? bd_acquire+0xc0/0xc0
kernel: [  154.522918]  do_dentry_open+0x1d1/0x380
kernel: [  154.522921]  path_openat+0x567/0xcc0
kernel: [  154.522923]  ? __lock_acquire+0x380/0x1690
kernel: [  154.522926]  do_filp_open+0x9b/0x110
kernel: [  154.522929]  ? __alloc_fd+0xe5/0x1f0
kernel: [  154.522935]  ? kmem_cache_alloc+0x28c/0x630
kernel: [  154.522939]  ? do_sys_openat2+0x201/0x2a0
kernel: [  154.522941]  do_sys_openat2+0x201/0x2a0
kernel: [  154.522944]  do_sys_open+0x57/0x80
kernel: [  154.522946]  do_syscall_64+0x64/0x2b0
kernel: [  154.522948]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [  154.522951] RIP: 0033:0x7f98d279d9ae

And md_alloc also flushed the same workqueue, but the thing is different
here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and
the flush is necessary to avoid race condition, so leave it as it is.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:22:31 -07:00
Guoqing Jiang
cc1ffe61c0 md: add new workqueue for delete rdev
Since the purpose of call flush_workqueue in new_dev_store is to ensure
md_delayed_delete() has completed, so we should check rdev->del_work is
pending or not.

To suppress lockdep warning, we have to check mddev->del_work while
md_delayed_delete is attached to rdev->del_work, so it is not aligned
to the purpose of flush workquee. So a new workqueue is needed to avoid
the awkward situation, and introduce a new func flush_rdev_wq to flush
the new workqueue after check if there was pending work.

Also like new_dev_store, ADD_NEW_DISK ioctl has the same purpose to flush
workqueue while it holds bdev->bd_mutex, so make the same change applies
to the ioctl to avoid similar lock issue.

And md_delayed_delete actually wants to delete rdev, so rename the function
to rdev_delayed_delete.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:22:31 -07:00
Guoqing Jiang
21e0958ec9 md: add checkings before flush md_misc_wq
Coly reported possible circular locking dependencyi with LOCKDEP enabled,
quote the below info from the detailed report [1].

[ 1607.673903] Chain exists of:
[ 1607.673903]   kn->count#256 --> (wq_completion)md_misc -->
(work_completion)(&rdev->del_work)
[ 1607.673903]
[ 1607.827946]  Possible unsafe locking scenario:
[ 1607.827946]
[ 1607.898780]        CPU0                    CPU1
[ 1607.952980]        ----                    ----
[ 1608.007173]   lock((work_completion)(&rdev->del_work));
[ 1608.069690]                                lock((wq_completion)md_misc);
[ 1608.149887]                                lock((work_completion)(&rdev->del_work));
[ 1608.242563]   lock(kn->count#256);
[ 1608.283238]
[ 1608.283238]  *** DEADLOCK ***
[ 1608.283238]
[ 1608.354078] 2 locks held by kworker/5:0/843:
[ 1608.405152]  #0: ffff8889eecc9948 ((wq_completion)md_misc){+.+.}, at:
process_one_work+0x42b/0xb30
[ 1608.512399]  #1: ffff888a1d3b7e10
((work_completion)(&rdev->del_work)){+.+.}, at: process_one_work+0x42b/0xb30
[ 1608.632130]

Since works (rdev->del_work and mddev->del_work) are queued in md_misc_wq,
then lockdep_map lock is held if either of them are running, then both of
them try to hold kernfs lock by call kobject_del. Then if new_dev_store
or array_state_store are triggered by write to the related sysfs node, so
the write operation gets kernfs lock, but need the lockdep_map because all
of them would trigger flush_workqueue(md_misc_wq) finally, then the same
lockdep_map lock is needed.

To suppress the lockdep warnning, we should flush the workqueue in case the
related work is pending. And several works are attached to md_misc_wq, so
we need to check which work should be checked:

1. for __md_stop_writes, the purpose of call flush workqueue is ensure sync
thread is started if it was starting, so check mddev->del_work is pending
or not since md_start_sync is attached to mddev->del_work.

2. __md_stop flushes md_misc_wq to ensure event_work is done, check the
event_work is enough. Assume raid_{ctr,dtr} -> md_stop -> __md_stop doesn't
need the kernfs lock.

3. both new_dev_store (holds kernfs lock) and ADD_NEW_DISK ioctl (holds the
bdev->bd_mutex) call flush_workqueue to ensure md_delayed_delete has
completed, this case will be handled in next patch.

4. md_open flushes workqueue to ensure the previous md is disappeared, but
it holds bdev->bd_mutex then try to flush workqueue, so it is better to
check mddev->del_work as well to avoid potential lock issue, this will be
done in another patch.

[1]: https://marc.info/?l=linux-raid&m=158518958031584&w=2

Cc: Coly Li <colyli@suse.de>
Reported-by: Coly Li <colyli@suse.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13 11:22:30 -07:00
Linus Torvalds
c45e8bccec - Document DM integrity allow_discard feature that was added during
5.7 merge window.
 
 - Fix potential for DM writecache data corruption during DM table
   reloads.
 
 - Fix DM verity's FEC support's hash block number calculation in
   verity_fec_decode().
 
 - Fix bio-based DM multipath crash due to use of stale copy of
   MPATHF_QUEUE_IO flag state in __map_bio().
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl6rSn4THHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWjmFB/9kwi4L5SQTg81ak0jmiO4lNSxqvc8s
 ZJUyT8m/Iheh6FWo6kZYyfN+YVPPwXRFDWk4TGSrKyoqElTbpxkkmVxE7nhp9Z+O
 1f7cfD8vCsTexHxYjxx3DxW529YVjvccifhaJFQgCA3II8+0to9PFzc7v6JpFBGS
 1CWd169OUDHe2XGNmah0lbgwEgb7ZQRn9MNrkdYu3L9HihNLs8h4uin390FlfhSj
 +/rS89jkA9X5MhFnspKrX0wGl3qoQBzFFvvn4KKPcH8EPt5zsUPVo+/rTOBL8Tr8
 VXDYKco4CDfuqzB5LdHeTi3mKwGq+fpPmwzRTk+/gosDZHA49m7mLKGB
 =cr8K
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Document DM integrity allow_discard feature that was added during 5.7
   merge window.

 - Fix potential for DM writecache data corruption during DM table
   reloads.

 - Fix DM verity's FEC support's hash block number calculation in
   verity_fec_decode().

 - Fix bio-based DM multipath crash due to use of stale copy of
   MPATHF_QUEUE_IO flag state in __map_bio().

* tag 'for-5.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm multipath: use updated MPATHF_QUEUE_IO on mapping for bio-based mpath
  dm verity fec: fix hash block number in verity_fec_decode
  dm writecache: fix data corruption when reloading the target
  dm integrity: document allow_discard option
2020-04-30 16:45:08 -07:00
Gabriel Krisman Bertazi
5686dee34d dm multipath: use updated MPATHF_QUEUE_IO on mapping for bio-based mpath
When adding devices that don't have a scsi_dh on a BIO based multipath,
I was able to consistently hit the warning below and lock-up the system.

The problem is that __map_bio reads the flag before it potentially being
modified by choose_pgpath, and ends up using the older value.

The WARN_ON below is not trivially linked to the issue. It goes like
this: The activate_path delayed_work is not initialized for non-scsi_dh
devices, but we always set MPATHF_QUEUE_IO, asking for initialization.
That is fine, since MPATHF_QUEUE_IO would be cleared in choose_pgpath.
Nevertheless, only for BIO-based mpath, we cache the flag before calling
choose_pgpath, and use the older version when deciding if we should
initialize the path.  Therefore, we end up trying to initialize the
paths, and calling the non-initialized activate_path work.

[   82.437100] ------------[ cut here ]------------
[   82.437659] WARNING: CPU: 3 PID: 602 at kernel/workqueue.c:1624
  __queue_delayed_work+0x71/0x90
[   82.438436] Modules linked in:
[   82.438911] CPU: 3 PID: 602 Comm: systemd-udevd Not tainted 5.6.0-rc6+ #339
[   82.439680] RIP: 0010:__queue_delayed_work+0x71/0x90
[   82.440287] Code: c1 48 89 4a 50 81 ff 00 02 00 00 75 2a 4c 89 cf e9
94 d6 07 00 e9 7f e9 ff ff 0f 0b eb c7 0f 0b 48 81 7a 58 40 74 a8 94 74
a7 <0f> 0b 48 83 7a 48 00 74 a5 0f 0b eb a1 89 fe 4c 89 cf e9 c8 c4 07
[   82.441719] RSP: 0018:ffffb738803977c0 EFLAGS: 00010007
[   82.442121] RAX: ffffa086389f9740 RBX: 0000000000000002 RCX: 0000000000000000
[   82.442718] RDX: ffffa086350dd930 RSI: ffffa0863d76f600 RDI: 0000000000000200
[   82.443484] RBP: 0000000000000200 R08: 0000000000000000 R09: ffffa086350dd970
[   82.444128] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa086350dd930
[   82.444773] R13: ffffa0863d76f600 R14: 0000000000000000 R15: ffffa08636738008
[   82.445427] FS:  00007f6abfe9dd40(0000) GS:ffffa0863dd80000(0000) knlGS:00000
[   82.446040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   82.446478] CR2: 0000557d288db4e8 CR3: 0000000078b36000 CR4: 00000000000006e0
[   82.447104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   82.447561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   82.448012] Call Trace:
[   82.448164]  queue_delayed_work_on+0x6d/0x80
[   82.448472]  __pg_init_all_paths+0x7b/0xf0
[   82.448714]  pg_init_all_paths+0x26/0x40
[   82.448980]  __multipath_map_bio.isra.0+0x84/0x210
[   82.449267]  __map_bio+0x3c/0x1f0
[   82.449468]  __split_and_process_non_flush+0x14a/0x1b0
[   82.449775]  __split_and_process_bio+0xde/0x340
[   82.450045]  ? dm_get_live_table+0x5/0xb0
[   82.450278]  dm_process_bio+0x98/0x290
[   82.450518]  dm_make_request+0x54/0x120
[   82.450778]  generic_make_request+0xd2/0x3e0
[   82.451038]  ? submit_bio+0x3c/0x150
[   82.451278]  submit_bio+0x3c/0x150
[   82.451492]  mpage_readpages+0x129/0x160
[   82.451756]  ? bdev_evict_inode+0x1d0/0x1d0
[   82.452033]  read_pages+0x72/0x170
[   82.452260]  __do_page_cache_readahead+0x1ba/0x1d0
[   82.452624]  force_page_cache_readahead+0x96/0x110
[   82.452903]  generic_file_read_iter+0x84f/0xae0
[   82.453192]  ? __seccomp_filter+0x7c/0x670
[   82.453547]  new_sync_read+0x10e/0x190
[   82.453883]  vfs_read+0x9d/0x150
[   82.454172]  ksys_read+0x65/0xe0
[   82.454466]  do_syscall_64+0x4e/0x210
[   82.454828]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[...]
[   82.462501] ---[ end trace bb39975e9cf45daa ]---

Cc: stable@vger.kernel.org
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-28 19:51:46 -04:00
Christoph Hellwig
8cf7961dab block: bypass ->make_request_fn for blk-mq drivers
Call blk_mq_make_request when no ->make_request_fn is set.  This is
safe now that blk_alloc_queue always sets up the pointer for make_request
based drivers.  This avoids an indirect call in the blk-mq driver I/O
fast path, which is rather expensive due to spectre mitigations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-25 09:45:44 -06:00
Christoph Hellwig
ae3cc8d8ff dm: remove the make_request_fn check in device_area_is_invalid
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-25 09:45:43 -06:00
Christoph Hellwig
a91b2014fc bcache: remove a duplicate ->make_request_fn assignment
The make_request_fn pointer should only be assigned by blk_alloc_queue.
Fix a left over manual initialization.

Fixes: ff27668ce8 ("bcache: pass the make_request methods to blk_queue_make_request")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-25 09:45:43 -06:00
Sunwook Eom
ad4e80a639 dm verity fec: fix hash block number in verity_fec_decode
The error correction data is computed as if data and hash blocks
were concatenated. But hash block number starts from v->hash_start.
So, we have to calculate hash block number based on that.

Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org
Signed-off-by: Sunwook Eom <speed.eom@samsung.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-16 16:16:38 -04:00
Mikulas Patocka
31b2212019 dm writecache: fix data corruption when reloading the target
The dm-writecache reads metadata in the target constructor. However, when
we reload the target, there could be another active instance running on
the same device. This is the sequence of operations when doing a reload:

1. construct new target
2. suspend old target
3. resume new target
4. destroy old target

Metadata that were written by the old target between steps 1 and 2 would
not be visible by the new target.

Fix the data corruption by loading the metadata in the resume handler.

Also, validate block_size is at least as large as both the devices'
logical block size and only read 1 block from the metadata during
target constructor -- no need to read entirety of metadata now that it
is done during resume.

Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-16 16:04:13 -04:00
Linus Torvalds
9b06860d7c libnvdimm for 5.7
- Add support for region alignment configuration and enforcement to
   fix compatibility across architectures and PowerPC page size
   configurations.
 
 - Introduce 'zero_page_range' as a dax operation. This facilitates
   filesystem-dax operation without a block-device.
 
 - Introduce phys_to_target_node() to facilitate drivers that want to
   know resulting numa node if a given reserved address range was
   onlined.
 
 - Advertise a persistence-domain for of_pmem and papr_scm. The
   persistence domain indicates where cpu-store cycles need to reach in
   the platform-memory subsystem before the platform will consider them
   power-fail protected.
 
 - Promote numa_map_to_online_node() to a cross-kernel generic facility.
 
 - Save x86 numa information to allow for node-id lookups for reserved
   memory ranges, deploy that capability for the e820-pmem driver.
 
 - Pick up some miscellaneous minor fixes, that missed v5.6-final,
   including a some smatch reports in the ioctl path and some unit test
   compilation fixups.
 
 - Fixup some flexible-array declarations.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAl6LtIAACgkQHtKRamZ9
 iAIwRA/8CLVVuQpgHQ1tqK4h8CZPrISFXh7wy7uhocEU2xrDh6iGVnLztmoLRr2k
 5f8T9lRzreSAwIVL5DbGqP1pFncqIt9VMnKsFlaPMBGCBNR+hURY0iBCNjIT+jiq
 BOzLd52MR2rqJxeXGTMUbWrBrbmuj4mZPdmGVuFFe7GFRpoaVpCgOo+296eWa/ot
 gIOFUTonZY7STYjNvDok0TXCmiCFuJb+P+y5ldfCPShHvZhTiaF53jircja8vAjO
 G5dt8ixBKUK0rXRc4SEQsQhAZNcAFHb6Gy5lg4C2QzhTF374xTc9usJZNWbIE9iM
 5mipBYvjVuoY+XaCNZDkaRcJIy/jqB15O6l3QIWbZLGaK9m95YPp9LmkPFwd3JpO
 e3rO24ML471DxqB9iWIiJCNcBBocLOlnd6qAQTpppWDpGNbudwXvfsmKHmKIScSE
 x+IDCdscLmmm+WG2dLmLraWOVPu42xZFccoQCi4M3TTqfeB9pZ9XckFQ37zX62zG
 5t+7Ek+t1W4QVt/JQYVKH03XT15sqUpVknvx0Hl4Y5TtbDOkFLkO8RN0/HyExDef
 7iegS35kqTsM4EfZQ+9juKbI2JBAjHANcbj0V4dogqaRj6vr3akumBzUtuYqAofv
 qU3s9skmLsEemOJC+ns2PT8vl5dyIoeDfH0r2XvGWxYqolMqJpA=
 =sY4N
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm and dax updates from Dan Williams:
 "There were multiple touches outside of drivers/nvdimm/ this round to
  add cross arch compatibility to the devm_memremap_pages() interface,
  enhance numa information for persistent memory ranges, and add a
  zero_page_range() dax operation.

  This cycle I switched from the patchwork api to Konstantin's b4 script
  for collecting tags (from x86, PowerPC, filesystem, and device-mapper
  folks), and everything looks to have gone ok there. This has all
  appeared in -next with no reported issues.

  Summary:

   - Add support for region alignment configuration and enforcement to
     fix compatibility across architectures and PowerPC page size
     configurations.

   - Introduce 'zero_page_range' as a dax operation. This facilitates
     filesystem-dax operation without a block-device.

   - Introduce phys_to_target_node() to facilitate drivers that want to
     know resulting numa node if a given reserved address range was
     onlined.

   - Advertise a persistence-domain for of_pmem and papr_scm. The
     persistence domain indicates where cpu-store cycles need to reach
     in the platform-memory subsystem before the platform will consider
     them power-fail protected.

   - Promote numa_map_to_online_node() to a cross-kernel generic
     facility.

   - Save x86 numa information to allow for node-id lookups for reserved
     memory ranges, deploy that capability for the e820-pmem driver.

   - Pick up some miscellaneous minor fixes, that missed v5.6-final,
     including a some smatch reports in the ioctl path and some unit
     test compilation fixups.

   - Fixup some flexible-array declarations"

* tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
  dax: Move mandatory ->zero_page_range() check in alloc_dax()
  dax,iomap: Add helper dax_iomap_zero() to zero a range
  dax: Use new dax zero page method for zeroing a page
  dm,dax: Add dax zero_page_range operation
  s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
  dax, pmem: Add a dax operation zero_page_range
  pmem: Add functions for reading/writing page to/from pmem
  libnvdimm: Update persistence domain value for of_pmem and papr_scm device
  tools/test/nvdimm: Fix out of tree build
  libnvdimm/region: Fix build error
  libnvdimm/region: Replace zero-length array with flexible-array member
  libnvdimm/label: Replace zero-length array with flexible-array member
  ACPI: NFIT: Replace zero-length array with flexible-array member
  libnvdimm/region: Introduce an 'align' attribute
  libnvdimm/region: Introduce NDD_LABELING
  libnvdimm/namespace: Enforce memremap_compat_align()
  libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
  libnvdimm: Out of bounds read in __nd_ioctl()
  acpi/nfit: improve bounds checking for 'func'
  mm/memremap_pages: Introduce memremap_compat_align()
  ...
2020-04-08 21:03:40 -07:00
Linus Torvalds
de3c913c6e - Fix excessive bio splitting that caused performance regressions.
- Fix logic bug in DM integrity discard support's integrity tag
   testing.
 
 - Fix DM integrity warning on ppc64le due to missing cast.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl6HbiETHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWiRFCACoiewv5jBvqzYvINy0FpgwLyU0SOOY
 11UCIzeLgQUPMZ4a5CUpPQuqSxKR3g7RAoD1EJZ1cyOnAJuk6A+VPkOLDioa4BEC
 uS7ifenihclFjcpWjQaKTNEhuTURYSjIwk1wqSwb7Fv+L2Uo+4XB8DvazBZdMq8+
 0EAO2CNl3r9Tkut0MRGr7DKrZa4QauaPsl7BvtzlLZbC3Nj2VncpwePYv9z7c7Ra
 g0zAq8IJlOPgDBUt4szbEAKiwGWQEiG6yxcyj5J3keU6mrHAg0mr08rxaapFkfSK
 F++3gTYjjLi8rQmczufvJ440t2kjK+b4dOfxMyjZccgjcoPxrxxkgaWq
 =Cj9X
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix excessive bio splitting that caused performance regressions

 - Fix logic bug in DM integrity discard support's integrity tag testing

 - Fix DM integrity warning on ppc64le due to missing cast

* tag 'for-5.7/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm integrity: fix logic bug in integrity tag testing
  Revert "dm: always call blk_queue_split() in dm_process_bio()"
  dm integrity: fix ppc64le warning
2020-04-03 14:44:48 -07:00
Mikulas Patocka
8267d8fb48 dm integrity: fix logic bug in integrity tag testing
If all the bytes are equal to DISCARD_FILLER, we want to accept the
buffer. If any of the bytes are different, we must do thorough
tag-by-tag checking.

The condition was inverted.

Fixes: 84597a44a9 ("dm integrity: add optional discard support")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-03 13:07:41 -04:00
Mike Snitzer
120c9257f5 Revert "dm: always call blk_queue_split() in dm_process_bio()"
This reverts commit effd58c95f.

blk_queue_split() is causing excessive IO splitting -- because
blk_max_size_offset() depends on 'chunk_sectors' limit being set and
if it isn't (as is the case for DM targets!) it falls back to
splitting on a 'max_sectors' boundary regardless of offset.

"Fix" this by reverting back to _not_ using blk_queue_split() in
dm_process_bio() for normal IO (reads and writes).  Long-term fix is
still TBD but it should focus on training blk_max_size_offset() to
call into a DM provided hook (to call DM's max_io_len()).

Test results from simple misaligned IO test on 4-way dm-striped device
with chunksize of 128K and stripesize of 512K:

xfs_io -d -c 'pread -b 2m 224s 4072s' /dev/mapper/stripe_dev

before this revert:

253,0   21        1     0.000000000  2206  Q   R 224 + 4072 [xfs_io]
253,0   21        2     0.000008267  2206  X   R 224 / 480 [xfs_io]
253,0   21        3     0.000010530  2206  X   R 224 / 256 [xfs_io]
253,0   21        4     0.000027022  2206  X   R 480 / 736 [xfs_io]
253,0   21        5     0.000028751  2206  X   R 480 / 512 [xfs_io]
253,0   21        6     0.000033323  2206  X   R 736 / 992 [xfs_io]
253,0   21        7     0.000035130  2206  X   R 736 / 768 [xfs_io]
253,0   21        8     0.000039146  2206  X   R 992 / 1248 [xfs_io]
253,0   21        9     0.000040734  2206  X   R 992 / 1024 [xfs_io]
253,0   21       10     0.000044694  2206  X   R 1248 / 1504 [xfs_io]
253,0   21       11     0.000046422  2206  X   R 1248 / 1280 [xfs_io]
253,0   21       12     0.000050376  2206  X   R 1504 / 1760 [xfs_io]
253,0   21       13     0.000051974  2206  X   R 1504 / 1536 [xfs_io]
253,0   21       14     0.000055881  2206  X   R 1760 / 2016 [xfs_io]
253,0   21       15     0.000057462  2206  X   R 1760 / 1792 [xfs_io]
253,0   21       16     0.000060999  2206  X   R 2016 / 2272 [xfs_io]
253,0   21       17     0.000062489  2206  X   R 2016 / 2048 [xfs_io]
253,0   21       18     0.000066133  2206  X   R 2272 / 2528 [xfs_io]
253,0   21       19     0.000067507  2206  X   R 2272 / 2304 [xfs_io]
253,0   21       20     0.000071136  2206  X   R 2528 / 2784 [xfs_io]
253,0   21       21     0.000072764  2206  X   R 2528 / 2560 [xfs_io]
253,0   21       22     0.000076185  2206  X   R 2784 / 3040 [xfs_io]
253,0   21       23     0.000077486  2206  X   R 2784 / 2816 [xfs_io]
253,0   21       24     0.000080885  2206  X   R 3040 / 3296 [xfs_io]
253,0   21       25     0.000082316  2206  X   R 3040 / 3072 [xfs_io]
253,0   21       26     0.000085788  2206  X   R 3296 / 3552 [xfs_io]
253,0   21       27     0.000087096  2206  X   R 3296 / 3328 [xfs_io]
253,0   21       28     0.000093469  2206  X   R 3552 / 3808 [xfs_io]
253,0   21       29     0.000095186  2206  X   R 3552 / 3584 [xfs_io]
253,0   21       30     0.000099228  2206  X   R 3808 / 4064 [xfs_io]
253,0   21       31     0.000101062  2206  X   R 3808 / 3840 [xfs_io]
253,0   21       32     0.000104956  2206  X   R 4064 / 4096 [xfs_io]
253,0   21       33     0.001138823     0  C   R 4096 + 200 [0]

after this revert:

253,0   18        1     0.000000000  4430  Q   R 224 + 3896 [xfs_io]
253,0   18        2     0.000018359  4430  X   R 224 / 256 [xfs_io]
253,0   18        3     0.000028898  4430  X   R 256 / 512 [xfs_io]
253,0   18        4     0.000033535  4430  X   R 512 / 768 [xfs_io]
253,0   18        5     0.000065684  4430  X   R 768 / 1024 [xfs_io]
253,0   18        6     0.000091695  4430  X   R 1024 / 1280 [xfs_io]
253,0   18        7     0.000098494  4430  X   R 1280 / 1536 [xfs_io]
253,0   18        8     0.000114069  4430  X   R 1536 / 1792 [xfs_io]
253,0   18        9     0.000129483  4430  X   R 1792 / 2048 [xfs_io]
253,0   18       10     0.000136759  4430  X   R 2048 / 2304 [xfs_io]
253,0   18       11     0.000152412  4430  X   R 2304 / 2560 [xfs_io]
253,0   18       12     0.000160758  4430  X   R 2560 / 2816 [xfs_io]
253,0   18       13     0.000183385  4430  X   R 2816 / 3072 [xfs_io]
253,0   18       14     0.000190797  4430  X   R 3072 / 3328 [xfs_io]
253,0   18       15     0.000197667  4430  X   R 3328 / 3584 [xfs_io]
253,0   18       16     0.000218751  4430  X   R 3584 / 3840 [xfs_io]
253,0   18       17     0.000226005  4430  X   R 3840 / 4096 [xfs_io]
253,0   18       18     0.000250404  4430  Q   R 4120 + 176 [xfs_io]
253,0   18       19     0.000847708     0  C   R 4096 + 24 [0]
253,0   18       20     0.000855783     0  C   R 4120 + 176 [0]

Fixes: effd58c95f ("dm: always call blk_queue_split() in dm_process_bio()")
Cc: stable@vger.kernel.org
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Barry Marson <bmarson@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-03 11:32:19 -04:00
Mike Snitzer
e7fc1e57d9 dm integrity: fix ppc64le warning
Otherwise:

In file included from drivers/md/dm-integrity.c:13:
drivers/md/dm-integrity.c: In function 'dm_integrity_status':
drivers/md/dm-integrity.c:3061:10: error: format '%llu' expects
argument of type 'long long unsigned int', but argument 4 has type
'long int' [-Werror=format=]
   DMEMIT("%llu %llu",
          ^~~~~~~~~~~
    atomic64_read(&ic->number_of_mismatches),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/linux/device-mapper.h:550:46: note: in definition of macro 'DMEMIT'
      0 : scnprintf(result + sz, maxlen - sz, x))
                                              ^
cc1: all warnings being treated as errors

Fixes: 7649194a16 ("dm integrity: remove sector type casts")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-03 10:44:24 -04:00
Vivek Goyal
4e4ced9379 dax: Move mandatory ->zero_page_range() check in alloc_dax()
zero_page_range() dax operation is mandatory for dax devices. Right now
that check happens in dax_zero_page_range() function. Dan thinks that's
too late and its better to do the check earlier in alloc_dax().

I also modified alloc_dax() to return pointer with error code in it in
case of failure. Right now it returns NULL and caller assumes failure
happened due to -ENOMEM. But with this ->zero_page_range() check, I
need to return -EINVAL instead.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20200401161125.GB9398@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-04-02 19:15:03 -07:00
Vivek Goyal
cdf6cdcd3b dm,dax: Add dax zero_page_range operation
This patch adds support for dax zero_page_range operation to dm targets.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20200228163456.1587-5-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-04-02 19:15:03 -07:00