Let's handle unplug in Big Block Mode similar to Sub Block Mode --
prioritize memory blocks onlined to ZONE_MOVABLE.
We won't care further about big blocks with mixed zones, as it's
rather a corner case that won't matter in practice.
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-8-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let's simplify high-level big block selection when unplugging in
Big Block Mode.
Combine handling of offline and online blocks. We can get rid of
virtio_mem_bbm_bb_is_offline() and simply use
virtio_mem_bbm_offline_remove_and_unplug_bb(), as that already tolerates
offline parts.
We can race with concurrent onlining/offlining either way, so we don;t
have to be super correct by failing if an offline big block we'd like to
unplug just got (partially) onlined.
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-7-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Until now, memory provided by a single virtio-mem device was usually
either onlined completely to ZONE_MOVABLE (online_movable) or to
ZONE_NORMAL (online_kernel); however, that will change in the future.
There are two reasons why we want to track to which zone a memory blocks
belongs to and prioritize ZONE_MOVABLE blocks:
1) Memory managed by ZONE_MOVABLE can more likely get unplugged, therefore,
resulting in a faster memory hotunplug process. Further, we can more
reliably unplug and remove complete memory blocks, removing metadata
allocated for the whole memory block.
2) We want to avoid corner cases where unplugging with the current scheme
(highest to lowest address) could result in accidential zone imbalances,
whereby we remove too much ZONE_NORMAL memory for ZONE_MOVABLE memory
of the same device.
Let's track the zone via memory block states and try unplug from
ZONE_MOVABLE first. Rename VIRTIO_MEM_SBM_MB_ONLINE* to
VIRTIO_MEM_SBM_MB_KERNEL* to avoid even longer state names.
In commit 27f852795a ("virtio-mem: don't special-case ZONE_MOVABLE"),
we removed slightly similar tracking for fully plugged memory blocks to
support unplugging from ZONE_MOVABLE at all -- as we didn't allow partially
plugged memory blocks in ZONE_MOVABLE before that. That commit already
mentioned "In the future, we might want to remember the zone again and use
the information when (un)plugging memory."
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-6-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let's simplify by introducing a new virtio_mem_sbm_unplug_any_sb(),
similar to virtio_mem_sbm_plug_any_sb(), to simplify high-level memory
block selection when unplugging in Sub Block Mode.
Rename existing virtio_mem_sbm_unplug_any_sb() to
virtio_mem_sbm_unplug_any_sb_raw().
The only change is that we now temporarily unlock the hotplug mutex around
cond_resched() when processing offline memory blocks, which doesn't
make a real difference as we already have to temporarily unlock in
virtio_mem_sbm_unplug_any_sb_offline() when removing a memory block.
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-5-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let's simplify high-level memory block selection when plugging in Sub
Block Mode.
No need for two separate loops when selecting memory blocks for plugging
memory. Avoid passing the "online" state by simply obtaining the state
in virtio_mem_sbm_plug_any_sb().
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-4-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We are reading a Big Block Mode value while in Sub Block Mode
when initializing. Fortunately, vm->bbm.bb_size maps to some counter
in the vm->sbm.mb_count array, which is 0 at that point in time.
No harm done; still, this was unintended and is not future-proof.
Fixes: 4ba50cd335 ("virtio-mem: Big Block Mode (BBM) memory hotplug")
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210602185720.31821-2-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Clear the available index as part of the initialization process to
clear and values that might be left from previous usage of the device.
For example, if the device was previously used by vhost_vdpa and now
probed by vhost_vdpa, you want to start with indices.
Fixes: c043b4a8cf ("virtio: introduce a vDPA based transport")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20210602021536.39525-5-jasowang@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eli Cohen <elic@nvidia.com>
We used to fail the set_vq_state() since it was not supported yet by
the virtio spec. But if the bus tries to set the state which is equal
to the device initial state after reset, we can let it go.
This is a must for virtio_vdpa() to set vq state during probe which is
required for some vDPA parents.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20210602021536.39525-4-jasowang@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eli Cohen <elic@nvidia.com>
This patch extends the vdpa_vq_state to support packed virtqueue
state which is basically the device/driver ring wrap counters and the
avail and used index. This will be used for the virito-vdpa support
for the packed virtqueue and the future vhost/vhost-vdpa support for
the packed virtqueue.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20210602021536.39525-2-jasowang@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eli Cohen <elic@nvidia.com>
For split virtqueue, we used to depend on the address, length and
flags stored in the descriptor ring for DMA unmapping. This is unsafe
for the case since the device can manipulate the behavior of virtio
driver, IOMMU drivers and swiotlb.
For safety, maintain the DMA address, DMA length, descriptor flags and
next filed of the non indirect descriptors in vring_desc_state_extra
when DMA API is used for virtio as we did for packed virtqueue and use
those metadata for performing DMA operations. Indirect descriptors
should be safe since they are using streaming mappings.
With this the descriptor ring is write only form the view of the
driver.
This slight increase the footprint of the drive but it's not noticed
through pktgen (64B) test and netperf test in the case of virtio-net.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20210604055350.58753-8-jasowang@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We should not depend on the DMA address, length and flag of descriptor
table since they could be wrote with arbitrary value by the device. So
this patch switches to use the stored one in desc_extra.
Note that the indirect descriptors are fine since they are read-only
streaming mappings.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20210604055350.58753-5-jasowang@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
After device reset, the virtqueues are not ready so clear the ready
field.
Failing to do so can result in virtio_vdpa failing to load if the device
was previously used by vhost_vdpa and the old values are ready.
virtio_vdpa expects to find VQs in "not ready" state.
Fixes: 1a86b377aa ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20210606053128.170399-1-elic@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
There are currently two cases where we poll TX vq not in response to a
callback: start xmit and rx napi. We currently do this with callbacks
enabled which can cause extra interrupts from the card. Used not to be
a big issue as we run with interrupts disabled but that is no longer the
case, and in some cases the rate of spurious interrupts is so high
linux detects this and actually kills the interrupt.
Fix up by disabling the callbacks before polling the tx vq.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
virtio_disable_cb is currently a nop for split ring with event index.
This is because it used to be always called from a callback when we know
device won't trigger more events until we update the index. However,
now that we run with interrupts enabled a lot we also poll without a
callback so that is different: disabling callbacks will help reduce the
number of spurious interrupts.
Further, if using event index with a packed ring, and if being called
from a callback, we actually do disable interrupts which is unnecessary.
Fix both issues by tracking whenever we get a callback. If that is
the case disabling interrupts with event index can be a nop.
If not the case disable interrupts. Note: with a split ring
there's no explicit "no interrupts" value. For now we write
a fixed value so our chance of triggering an interupt
is 1/ring size. It's probably better to write something
related to the last used index there to reduce the chance
even further. For now I'm keeping it simple.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We currently check num_free outside tx q lock
which is unsafe: new packets can arrive meanwhile
and there won't be space in the queue.
Thus a spurious queue wakeup causing overhead
and even packet drops.
Move the check under the lock to fix that.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
It's unsafe to operate a vq from multiple threads.
Unfortunately this is exactly what we do when invoking
clean tx poll from rx napi.
Same happens with napi-tx even without the
opportunistic cleaning from the receive interrupt: that races
with processing the vq in start_xmit.
As a fix move everything that deals with the vq to under tx lock.
Fixes: b92f1e6751 ("virtio-net: transmit napi")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In order to support running vdpa using vritio_vdpa driver, we need to
create a different kind of MR, one that has 1:1 mapping, since the
addresses referring to virtqueues are dma addresses.
We create the 1:1 MR in mlx5_vdpa_dev_add() only in case firmware
supports the general capability umem_uid_0. The reason for that is that
1:1 MRs must be created with uid == 0 while virtqueue objects can be
created with uid == 0 only when the firmware capability is on.
If the set_map() callback is called with new translations provided
through iotlb, the driver will destroy the 1:1 MR and create a regular
one.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20210602085854.62690-1-elic@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Before SF support was introduced, the DMA device was equal to
mdev->device which was in essence equal to pdev->dev.
With SF introduction this is no longer true. It has already been
handled for vhost_vdpa since the reference to the dma device can from
within mlx5_vdpa. With virtio_vdpa this broke. To fix this we set the
real dma device when initializing the device.
In addition, for the sake of consistency, previous references in the
code to the dma device are changed to vdev->dma_dev.
Fixes: d13a15d544 ("vdpa/mlx5: Use the correct dma device when registering memory")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20210606053150.170489-1-elic@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Currently all resources must be created with uid != 0 which is essential
when userspace processes are allocating virtquueue resources. Since this
is a kernel implementation, it is perfectly legal to open resources with
uid == 0.
In case firmware supports, avoid allocating user context.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20210531160404.31368-1-elic@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
umem size is a 32 bit unsigned value so assigning it to an int could
cause false failures. Set the calculated value inside the function and
modify function name to reflect the fact it updates the size.
This bug was found during code review but never had real impact to this
date.
Fixes: 1a86b377aa ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20210530090349.8360-1-elic@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Fix copy paste bug assigning umem1 size to umem2 and umem3. The issue
was discovered when trying to use a 1:1 MR that covers the entire
address space where firmware complained that provided sizes are not
large enough. 1:1 MRs are required to support virtio_vdpa.
Fixes: 1a86b377aa ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20210530090317.8284-1-elic@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
The struct virtio_blk_config seg_max value is read from the device and
incremented by 2 to account for the request header and status byte
descriptors added by the driver.
In preparation for supporting untrusted virtio-blk devices, protect
against integer overflow and limit the value to a safe maximum.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20210524154020.98195-1-stefanha@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Trivial change for the vhost_iotlb_del_range() documentation,
fixing the function name in the comment block.
Discovered with `make C=2 M=drivers/vhost`:
../drivers/vhost/iotlb.c:92: warning: expecting prototype for vring_iotlb_del_range(). Prototype was for vhost_iotlb_del_range() instead
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20210504135444.158716-1-sgarzare@redhat.com
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Pull cgroup updates from Tejun Heo:
- cgroup.kill is added which implements atomic killing of the whole
subtree.
Down the line, this should be able to replace the multiple userland
implementations of "keep killing till empty".
- PSI can now be turned off at boot time to avoid overhead for
configurations which don't care about PSI.
* 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: make per-cgroup pressure stall tracking configurable
cgroup: Fix kernel-doc
cgroup: inline cgroup_task_freeze()
tests/cgroup: test cgroup.kill
tests/cgroup: move cg_wait_for(), cg_prepare_for_wait()
tests/cgroup: use cgroup.kill in cg_killall()
docs/cgroup: add entry for cgroup.kill
cgroup: introduce cgroup.kill
Pull percpu updates from Dennis Zhou:
- percpu chunk depopulation - depopulate backing pages for chunks with
empty pages when we exceed a global threshold without those pages.
This lets us reclaim a portion of memory that would previously be
lost until the full chunk would be freed (possibly never).
- memcg accounting cleanup - previously separate chunks were managed
for normal allocations and __GFP_ACCOUNT allocations. These are now
consolidated which cleans up the code quite a bit.
- a few misc clean ups for clang warnings
* 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: optimize locking in pcpu_balance_workfn()
percpu: initialize best_upa variable
percpu: rework memcg accounting
mm, memcg: introduce mem_cgroup_kmem_disabled()
mm, memcg: mark cgroup_memory_nosocket, nokmem and noswap as __ro_after_init
percpu: make symbol 'pcpu_free_slot' static
percpu: implement partial chunk depopulation
percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
percpu: factor out pcpu_check_block_hint()
percpu: split __pcpu_balance_workfn()
percpu: fix a comment about the chunks ordering