2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

46 Commits

Author SHA1 Message Date
Sagi Grimberg
39d06079a5 nvme-tcp: fix possible crash in recv error flow
If the target misbehaves and sends us unexpected payload we
need to make sure to fail the controller and stop processing
the input stream. We clear the rd_enabled flag and stop
the io_work, but we may still requeue it if we still have pending
sends and then in the next invocation we will process the input
stream as the check is only in the .data_ready upcall.

To fix this we need to make sure not to self-requeue io_work
upon a recv flow error.

This fixes the crash:
 nvme nvme2: receive failed:  -22
 BUG: unable to handle page fault for address: ffffbeb5816c3b48
 nvme_ns_head_make_request: 29 callbacks suppressed
 block nvme0n5: no usable path - requeuing I/O
 block nvme0n5: no usable path - requeuing I/O
 block nvme0n7: no usable path - requeuing I/O
 block nvme0n7: no usable path - requeuing I/O
 block nvme0n3: no usable path - requeuing I/O
 block nvme0n3: no usable path - requeuing I/O
 block nvme0n3: no usable path - requeuing I/O
 block nvme0n7: no usable path - requeuing I/O
 block nvme0n3: no usable path - requeuing I/O
 block nvme0n3: no usable path - requeuing I/O
 #PF: supervisor read access inkernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 1039157067 P4D 1039157067 PUD 103915a067 PMD 102719f067 PTE 0
 Oops: 0000 [#1] SMP PTI
 CPU: 8 PID: 411 Comm: kworker/8:1H Not tainted 5.3.0-40-generic #32~18.04.1-Ubuntu
 Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0 12/17/2015
 Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
 RIP: 0010:nvme_tcp_recv_skb+0x2ae/0xb50 [nvme_tcp]
 RSP: 0018:ffffbeb5806cfd10 EFLAGS: 00010246
 RAX: ffffbeb5816c3b48 RBX: 00000000000003d0 RCX: 0000000000000008
 RDX: 00000000000003d0 RSI: 0000000000000001 RDI: ffff9a3040684b40
 RBP: ffffbeb5806cfd90 R08: 0000000000000000 R09: ffffffff946e6900
 R10: ffffbeb5806cfce0 R11: 0000000000000001 R12: 0000000000000000
 R13: ffff9a2ff86501c0 R14: 00000000000003d0 R15: ffff9a30b85f2798
 FS:  0000000000000000(0000) GS:ffff9a30bf800000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffbeb5816c3b48 CR3: 000000088400a006 CR4: 00000000003626e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  tcp_read_sock+0x8c/0x290
  ? __release_sock+0x9d/0xe0
  ? nvme_tcp_write_space+0xb0/0xb0 [nvme_tcp]
  nvme_tcp_io_work+0x4b4/0x830 [nvme_tcp]
  ? finish_task_switch+0x163/0x270
  process_one_work+0x1fd/0x3f0
  worker_thread+0x34/0x410
  kthread+0x121/0x140
  ? process_one_work+0x3f0/0x3f0
  ? kthread_park+0xb0/0xb0
  ret_from_fork+0x35/0x40

Reported-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-01 11:07:13 +02:00
Sagi Grimberg
f86e5bf817 nvme-tcp: don't poll a non-live queue
In error recovery we might be removing the queue so check we
can actually poll before we do.

Reported-by: Mark Wunderlich <mark.wunderlich@intel.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-03-31 17:57:32 +02:00
Sagi Grimberg
25e5cb780e nvme-tcp: fix possible crash in write_zeroes processing
We cannot look at blk_rq_payload_bytes without first checking
that the request has a mappable physical segments first (e.g.
blk_rq_nr_phys_segments(rq) != 0) and only then to take the
request payload bytes. This caused us to send a wrong sgl to
the target or even dereference a non-existing buffer in case
we actually got to the data send sequence (if it was in-capsule).

Reported-by: Tony Asleson <tasleson@redhat.com>
Suggested-by: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-03-31 17:57:28 +02:00
Israel Rukshin
bea54ef53f nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl
The transition to LIVE state should not fail in case of a new controller.
Moving to DELETING state before nvme_tcp_create_ctrl() allocates all the
resources may leads to NULL dereference at teardown flow (e.g., IO tagset,
admin_q, connect_q).

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
Israel Rukshin
726612b6b8 nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
Put the ctrl reference count at nvme_uninit_ctrl as opposed to
nvme_init_ctrl which takes it. This decrease the reference count at the
core layer instead of decreasing it on each transport separately.
Also move the call of nvme_uninit_ctrl at PCI driver after calling to
nvme_release_prp_pools and nvme_dev_unmap, in order to put the reference
count after using the dev. This is safe because those functions use
nvme_dev which is freed only later at nvme_pci_free_ctrl.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
Israel Rukshin
b780d7415a nvme: Fix ctrl use-after-free during sysfs deletion
In case nvme_sysfs_delete() is called by the user before taking the ctrl
reference count, the ctrl may be freed during the creation and cause the
bug. Take the reference as soon as the controller is externally visible,
which is done by cdev_device_add() in nvme_init_ctrl(). Also take the
reference count at the core layer instead of taking it on each transport
separately.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
Sagi Grimberg
761ad26c45 nvme-tcp: break from io_work loop if recv failed
If we failed to receive data from the socket, don't try
to further process it, we will for sure be handling a queue
error at this point. While no issue was seen with the
current behavior thus far, its safer to cease socket processing
if we detected an error.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:45 +09:00
Sagi Grimberg
5ff4e11264 nvme-tcp: move send failure to nvme_tcp_try_send
Consolidate the request failure handling code to where
it is being fetched (nvme_tcp_try_send).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:50:21 +09:00
Sagi Grimberg
40510a639e nvme-tcp: optimize queue io_cpu assignment for multiple queue maps
Currently, queue io_cpu assignment is done sequentially for default,
read and poll queues based on queue id. This causes miss-alignment between
context of CPU initiating I/O and the I/O worker thread processing
queued requests or completions.

Change to modify queue io_cpu assignment to take into account queue
maps offset. Each queue io_cpu will start at zero for each queue map.
This essentially aligns read/poll queues to start over the same range as
default queues.

Testing performed by Mark with:
- ram device (nvmet)
- single CPU core (pinned)
- 100% 4k reads
- engine io_uring (not using sq_thread option)
- hipri flag set

Micro-benchmark results show a net gain of:
- increase of 18%-29% in IOPs
- reduction of 16%-22% in average latency
- reduction of 7%-23% in 99.99% latency

Baseline:
========
QDepth/Batch	| IOPs [k]	| Avg. Lat [us]	| 99.99% Lat [us]
-----------------------------------------------------------------
1/1 		| 32.4		| 30.11		| 50.94
32/8		| 179		| 168.20	| 371

CPU alignment:
=============
QDepth/Batch	| IOPs [k]	| Avg. Lat [us]	| 99.99% Lat [us]
-----------------------------------------------------------------
1/1 		| 38.5		|   25.18	| 39.16
32/8		| 231		|   130.75	| 343

Reported-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:48:06 +09:00
Wunderlich, Mark
9912ade355 nvme-tcp: Set SO_PRIORITY for all host sockets
Enable ability to associate all sockets related to NVMf TCP traffic
to a priority group that will perform optimized network processing for
this traffic class. Maintain initial default behavior of using priority
of zero.

Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-04 09:09:08 -08:00
Nigel Kirkland
97b2512ad0 nvme: prevent warning triggered by nvme_stop_keep_alive
Delayed keep alive work is queued on system workqueue and may be cancelled
via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or nvme_wq.

Check_flush_dependency detects mismatched attributes between the work-queue
context used to cancel the keep alive work and system-wq. Specifically
system-wq does not have the WQ_MEM_RECLAIM flag, whereas the contexts used
to cancel keep alive work have WQ_MEM_RECLAIM flag.

Example warning:

  workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc]
	is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core]

To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq.

However this creates a secondary concern where work and a request to cancel
that work may be in the same work queue - namely err_work in the rdma and
tcp transports, which will want to flush/cancel the keep alive work which
will now be on nvme_wq.

After reviewing the transports, it looks like err_work can be moved to
nvme_reset_wq. In fact that aligns them better with transition into
RESETTING and performing related reset work in nvme_reset_wq.

Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq.

Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-14 10:12:04 -07:00
Anton Eidelman
2d570a7c02 nvme/tcp: fix bug on double requeue when send fails
When nvme_tcp_io_work() fails to send to socket due to
connection close/reset, error_recovery work is triggered
from nvme_tcp_state_change() socket callback.
This cancels all the active requests in the tagset,
which requeues them.

The failed request, however, was ended and thus requeued
individually as well unless send returned -EPIPE.
Another return code to be treated the same way is -ECONNRESET.

Double requeue caused BUG_ON(blk_queued_rq(rq))
in blk_mq_requeue_request() from either the individual requeue
of the failed request or the bulk requeue from
blk_mq_tagset_busy_iter(, nvme_cancel_request, );

Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-14 10:12:04 -07:00
Linus Torvalds
2d53943090 for-5.5/drivers-20191121
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WyEAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplgbD/4jNeqT0q2IkNcUUEWkZWsBOlfi0SiclS5v
 X8JY1IxTlL0kaBWm83mw06JewucQ97Fh7xblPE8/iDHJqpgEX4vvSQY1b8hcDulZ
 YOKUnLkFU22nICeT04/8x/+f8gqD5KOlGxkgEvUKViQW15oc0oNu4St/yFM1QEN0
 qNMzpcfFXV9lYsOPl0y3pKdP+qbfcpeSmaFD9Z65gxN6rJy1WR8rtUGXy2luoiEc
 dh15IL9AGN/r8VTo8yRpD9PStiuJqpALIR8OHJSHPj+s0pQ6twk4aehcnYseAMbH
 zSDpa9AJrfqlnh8tUfKYLWi/PM7pMH0F01rAiQv47j/C0+QhbiOU/uTFTzUW5hQ1
 eK6XzJ0slxwnDsHLKf+xJmCj0Oyk0jDimNQr/2MNsuhmr29V5lfvBNflub8eOLyZ
 ie2Eulv+z6pYBSJx6kqm0X3vhXOy4wgU+X8LzvfcP9iAjgU1rfzxUWxLEj+KfJS2
 Nl+ERV9nafoPpoKpNR7zWRBUulp1qZJzo/U9JaUKiI5cWkIH1hhHmU2++xMeyJpb
 XHoDFNTGv6z/eef65eSveFD7F274TSi16K56Obk+4KWaSrIR0d6VwUA7FDmJbSI+
 Jqk1OFdaRGsQ5OcVxF1Qo4WChn0FvhcD0c+yL0N19WZ01QeYsb3hlA+MUPDtGQ04
 U79MPfu7iA==
 =i0jf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Here are the main block driver updates for 5.5. Nothing major in here,
  mostly just fixes. This contains:

   - a set of bcache changes via Coly

   - MD changes from Song

   - loop unmap write-zeroes fix (Darrick)

   - spelling fixes (Geert)

   - zoned additions cleanups to null_blk/dm (Ajay)

   - allow null_blk online submit queue changes (Bart)

   - NVMe changes via Keith, nothing major here either"

* tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block: (56 commits)
  Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()"
  drivers/md/raid5-ppl.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
  drivers/md/raid5.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
  bcache: don't export symbols
  bcache: remove the extra cflags for request.o
  bcache: at least try to shrink 1 node in bch_mca_scan()
  bcache: add idle_max_writeback_rate sysfs interface
  bcache: add code comments in bch_btree_leaf_dirty()
  bcache: fix deadlock in bcache_allocator
  bcache: add code comment bch_keylist_pop() and bch_keylist_pop_front()
  bcache: deleted code comments for dead code in bch_data_insert_keys()
  bcache: add more accurate error messages in read_super()
  bcache: fix static checker warning in bcache_device_free()
  bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
  bcache: fix fifo index swapping condition in journal_pin_cmp()
  md/raid10: prevent access of uninitialized resync_pages offset
  md: avoid invalid memory access for array sb->dev_roles
  md/raid1: avoid soft lockup under high load
  null_blk: add zone open, close, and finish support
  dm: add zone open, close and finish support
  ...
2019-11-25 11:15:41 -08:00
Israel Rukshin
58a8df67e0 nvme: introduce nvme_is_aen_req function
This function improves code readability and reduces code duplication.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04 10:56:40 -07:00
Eric Dumazet
3f926af3f4 net: use skb_queue_empty_lockless() in busy poll contexts
Busy polling usually runs without locks.
Let's use skb_queue_empty_lockless() instead of skb_queue_empty()

Also uses READ_ONCE() in __skb_try_recv_datagram() to address
a similar potential problem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28 13:33:41 -07:00
Max Gurtovoy
28a4cac48c nvme-tcp: fix possible leakage during error flow
During nvme_tcp_setup_cmd_pdu error flow, one must call nvme_cleanup_cmd
since it's symmetric to nvme_setup_cmd.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-15 22:47:29 +09:00
Sebastian Andrzej Siewior
ac1c4e1885 nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
The access to sk->sk_ll_usec should be hidden behind
CONFIG_NET_RX_BUSY_POLL like the definition of sk_ll_usec.

Put access to ->sk_ll_usec behind CONFIG_NET_RX_BUSY_POLL.

Fixes: 1a9460cef5 ("nvme-tcp: support simple polling")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14 23:27:01 +09:00
Keith Busch
92b98e88d5 nvme: Restart request timers in resetting state
A controller in the resetting state has not yet completed its recovery
actions. The pci and fc transports were already handling this, so update
the remaining transports to not attempt additional recovery in this
state. Instead, just restart the request timer.

Tested-by: Edmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14 23:21:49 +09:00
Wunderlich, Mark
ddef29578a nvme-tcp: fix wrong stop condition in io_work
Allow the do/while statement to continue if current time
is not after the proposed time 'deadline'. Intent is to
allow loop to proceed for a specific time period. Currently
the loop, as coded, will exit after first pass.

Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25 12:53:14 -07:00
Colin Ian King
312910f4d2 nvme: tcp: remove redundant assignment to variable ret
The variable ret is being initialized with a value that is never read
and is being re-assigned immediately afterwards. The assignment is
redundant and hence can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:46 -07:00
Sagi Grimberg
1668601008 nvme-tcp: fail command with NVME_SC_HOST_PATH_ERROR send failed
This is a more appropriate error status for a transport error
detected by us (the host).

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:45 -07:00
Sagi Grimberg
e7832cb48a nvme: make fabrics command run on a separate request queue
We have a fundamental issue that fabric commands use the admin_q.
The reason is, that admin-connect, register reads and writes and
admin commands cannot be guaranteed ordering while we are running
controller resets.

For example, when we reset a controller we perform:
1. disable the controller
2. teardown the admin queue
3. re-establish the admin queue
4. enable the controller

In order to perform (3), we need to unquiesce the admin queue, however
we may have some admin commands that are already pending on the
quiesced admin_q and will immediate execute when we unquiesce it before
we execute (4). The host must not send admin commands to the controller
before enabling the controller.

To fix this, we have the fabric commands (admin connect and property
get/set, but not I/O queue connect) use a separate fabrics_q and make
sure to quiesce the admin_q before we disable the controller, and
unquiesce it only after we enable the controller.

This fixes the error prints from nvmet in a controller reset storm test:
kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0
Which indicate that the host is sending an admin command when the
controller is not enabled.

Reviewed-by:  James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:03 -07:00
Israel Rukshin
bb13985d5a nvme-tcp: Add TOS for tcp transport
TOS provide clients the ability to segregate traffic flows for
different type of data.
One of the TOS usage is bandwidth management which allows setting bandwidth
limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A
and 20% to controllers at QoS class B.

usage examples:
nvme connect --tos=0 --transport=tcp --traddr=10.0.1.1 --nqn=test-nvme

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:02 -07:00
Israel Rukshin
9924b0304a nvme-tcp: Use struct nvme_ctrl directly
This patch doesn't change any functionality.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:02 -07:00
Sagi Grimberg
1a9460cef5 nvme-tcp: support simple polling
Simple polling support via socket busy_poll interface.
Although we do not shutdown interrupts but simply hammer
the socket poll, we can sometimes find completions faster
than the normal interrupt driven RX path.

We add per queue nr_cqe counter that resets every time
RX path is invoked such that .poll callback can return it
to stay consistent with the semantics.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:01 -07:00
Sagi Grimberg
b5b0504878 nvme: don't pass cap to nvme_disable_ctrl
All seem to call it with ctrl->cap so no need to pass it
at all.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:00 -07:00
Sagi Grimberg
c0f2f45be2 nvme: move sqsize setting to the core
nvme_enable_ctrl reads the cap register right after, so
no need to do that locally in the transport driver. Have
sqsize setting in nvme_init_identify.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:00 -07:00
Potnuri Bharat Teja
10407ec9b4 nvme-tcp: Use protocol specific operations while reading socket
Using socket specific read_sock() calls instead of directly calling
tcp_read_sock() helps lld module registered handlers if any, to be called
from nvme-tcp host.
This patch therefore replaces the tcp_read_sock() with socket specific
prot_ops.

Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:00 -07:00
Sagi Grimberg
6be182607d nvme-tcp: cleanup nvme_tcp_recv_pdu
Can return directly in the switch statement

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:00 -07:00
Ming Lei
622b8b6893 nvme: wait until all completed request's complete fn is called
When aborting in-flight request for recovering controller, we have
to make sure that queue's complete function is called on completed
request before moving on. Otherwise, for example, the warning of
WARN_ON_ONCE(qp->mrs_used > 0) in ib_destroy_qp_user() may be
triggered on nvme-rdma.

Fix this issue by using blk_mq_tagset_wait_completed_request.

Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-04 21:41:29 -06:00
Mikhail Skorzhinskii
37c1521959 nvme-tcp: don't use sendpage for SLAB pages
According to commit a10674bf24 ("tcp: detecting the misuse of
.sendpage for Slab objects") and previous discussion, tcp_sendpage
should not be used for pages that is managed by SLAB, as SLAB is not
taking page reference counters into consideration.

Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09 14:18:09 -07:00
Sagi Grimberg
6486199378 nvme-tcp: fix queue mapping when queue count is limited
When the controller supports less queues than requested, we
should make sure that queue mapping does the right thing and
not assume that all queues are available. This fixes a crash
when the controller supports less queues than requested.

The rules are:
1. if no write queues are requested, we assign the available queues
   to the default queue map. The default and read queue maps share the
   existing queues.
2. if write queues are requested:
  - first make sure that read queue map gets the requested
    nr_io_queues count
  - then grant the default queue map the minimum between the requested
    nr_write_queues and the remaining queues. If there are no available
    queues to dedicate to the default queue map, fallback to (1) and
    share all the queues in the existing queue map.

Also, provide a log indication on how we constructed the different
queue maps.

Reported-by: Harris, James R <james.r.harris@intel.com>
Tested-by: Jim Harris <james.r.harris@intel.com>
Cc: <stable@vger.kernel.org> # v5.0+
Suggested-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-05-30 11:07:37 -07:00
Sagi Grimberg
f34e25898a nvme-tcp: fix possible null deref on a timed out io queue connect
If I/O queue connect times out, we might have freed the queue socket
already, so check for that on the error path in nvme_tcp_start_queue.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:17:15 -04:00
Sagi Grimberg
efb973b19b nvme-tcp: rename function to have nvme_tcp prefix
usually nvme_ prefix is for core functions.
While we're cleaning up, remove redundant empty lines

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:41 +02:00
Sagi Grimberg
7a42589654 nvme-tcp: fix a NULL deref when an admin connect times out
If we timeout the admin startup sequence we might not yet have
an I/O tagset allocated which causes the teardown sequence to crash.
Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags
if the tagset wasn't allocated.

Fixes: 39d5775746 ("nvme-tcp: fix timeout handler")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:37 +02:00
Christoph Hellwig
988aef9e8b nvme-tcp: fix an endianess miss-annotation
nvme_tcp_end_request just takes the status value and the converts
it to little endian as well as shifting for the phase bit.

Fixes: 43ce38a6d823 ("nvme-tcp: support C2HData with SUCCESS flag")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-03-28 18:15:02 +01:00
Sagi Grimberg
602d674ce9 nvme-tcp: support C2HData with SUCCESS flag
A C2HData PDU with the SUCCESS flag set indicates that the I/O was
completed by the controller successfully and means that a subsequent
completion response capsule PDU will be ommitted.

If we see this flag, fisrt we check that LAST_PDU flag is set as well,
and then we complete the request when the data transfer (and data digest
verification if its on) is done.

While we're at it, reuse a bit of code with nvme_fail_request.

Reported-by: Steve Blightman <steve.blightman@oracle.com>
Suggested-by: Oliver Smith-Denny <osmithde@cisco.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Oliver Smith-Denny <osmithde@cisco.com>
Tested-by: Oliver Smith-Denny <osmithde@cisco.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13 12:57:34 -06:00
Sagi Grimberg
794a4cb3d2 nvme: remove the .stop_ctrl callout
It is used now just to flush error recovery and reconnect work items in
the RDMA and TCP transports, which can simply be moved to the
corresponding teardown routines.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-04 15:41:25 +01:00
Sagi Grimberg
39d5775746 nvme-tcp: fix timeout handler
Currently, we have several problems with the timeout
handler:
1. If we timeout on the controller establishment flow, we will hang
because we don't execute the error recovery (and we shouldn't because
the create_ctrl flow needs to fail and cleanup on its own)
2. We might also hang if we get a disconnet on a queue while the
controller is already deleting. This racy flow can cause the controller
disable/shutdown admin command to hang.

We cannot complete a timed out request from the timeout handler without
mutual exclusion from the teardown flow (e.g. nvme_rdma_error_recovery_work).
So we serialize it in the timeout handler and teardown io and admin
queues to guarantee that no one races with us from completing the
request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-23 17:16:59 -07:00
Sagi Grimberg
e85037a2e9 nvme-tcp: don't ask if controller is fabrics
For sure we are a fabric driver.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:06 -05:00
Sagi Grimberg
e9c2edc098 nvme-tcp: remove dead code
We should never touch the opal device from the transport driver.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:06 -05:00
Sagi Grimberg
26c682274e nvme-fabrics: allow nvmf_connect_io_queue to poll
Preparation for polling support for fabrics. Polling support
means that our completion queues are not generating any interrupts
which means we need to poll for the nvmf io queue connect as well.

Reviewed by Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-18 17:50:48 +01:00
Colin Ian King
56a77d26d6 nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
There is a spelling mistake in a dev_info message, fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-18 17:50:47 +01:00
Christoph Hellwig
a7273d4023 nvme-tcp: fix endianess annotations
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-12-18 17:50:46 +01:00
Sagi Grimberg
873946f4b9 nvme-tcp: support separate queue maps for read and write
Allow NVMF_OPT_NR_WRITE_QUEUES to describe additional write queues.  In
addition, implement .map_queues that will apply 2 queue maps for read
and write queue sets.

Note that with the separate queue map, HCTX_TYPE_READ will always use
nr_io_queues and HCTX_TYPE_DEFAULT will use nr_write_queues.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-13 09:59:09 +01:00
Sagi Grimberg
3f2304f8c6 nvme-tcp: add NVMe over TCP host driver
This patch implements the NVMe over TCP host driver. It can be used to
connect to remote NVMe over Fabrics subsystems over good old TCP/IP.

The driver implements the TP 8000 of how nvme over fabrics capsules and
data are encapsulated in nvme-tcp pdus and exchaged on top of a TCP byte
stream. nvme-tcp header and data digest are supported as well.

To connect to all NVMe over Fabrics controllers reachable on a given taget
port over TCP use the following command:

	nvme connect-all -t tcp -a $IPADDR

This requires the latest version of nvme-cli with TCP support.

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-13 09:58:58 +01:00