2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

1274 Commits

Author SHA1 Message Date
Linus Torvalds
3719a04a80 pci-v6.16-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmhAa9EUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vyA3w//aX8d73z/xVxkYLMN/6XQA5fdmd4d
 Dv4n0Pjf0WCMKbsgRCdXEYLvcHV8VhH5iCR/b2UsFm9LjxSIRuqE5XosY3bNhrHn
 xVKEh2prq2XZOibWrFkJ+RZ0FF7Ogq1Uy5gUBbBHbE1q1byZzrOALaF3FWGaDIZQ
 6QLLAFtd3UtqOOUu8J8P9N15uFR8gunyfuM9U7TLMcy4B8txk6T6m/9xAWtRURuJ
 I6WN8lO+g8Nl2mL9m27+wyWiVT3tKqoMwp8rVtym/L5JQOmHycYhn0WQAr2dPCMs
 Xbgmoeei0je7mZvk5btpt68NAKQ3ZnCVkxbbINBkUxAjI0dbI6h37EhW18ShYVUk
 CCo4fmaFtwP8qNN9tSvDN8vZdGB44fN5tIz4lmGzKk5gt+oV50RC/APrzC+PJBQ0
 +2SdDVKj71Gr2H1VnI6uLB7oQ+tp7TOdhg+DGV4bdc6QFnsM+BpKWRq5f1UQcau/
 XVDmorM/2t6z0DNktAv3NFwSodUjk1loWESr/pRBH1AqAWZTK98PWIg97XYsal59
 zbJ3dLrnCqUNozeVgjtZo1LWD2FZaVTvhq2NY7D+QPpnMGhFUhHxNliZUXiQa1q4
 boI2hEFdu3IQP/OC2a1zGJyMRLU43d5rhZ1U5xQSVtM0c3lgCY7rn/t26LymQVPA
 SYdg2jBcnhe6gXo=
 =eWJw
 -----END PGP SIGNATURE-----

Merge tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci updates from Bjorn Helgaas:
 "Enumeration:

   - Print the actual delay time in pci_bridge_wait_for_secondary_bus()
     instead of assuming it was 1000ms (Wilfred Mallawa)

   - Revert 'iommu/amd: Prevent binding other PCI drivers to IOMMU PCI
     devices', which broke resume from system sleep on AMD platforms and
     has been fixed by other commits (Lukas Wunner)

  Resource management:

   - Remove mtip32xx use of pcim_iounmap_regions(), which is deprecated
     and unnecessary (Philipp Stanner)

   - Remove pcim_iounmap_regions() and pcim_request_region_exclusive()
     and related flags since all uses have been removed (Philipp
     Stanner)

   - Rework devres 'request' functions so they are no longer 'hybrid',
     i.e., their behavior no longer depends on whether
     pcim_enable_device or pci_enable_device() was used, and remove
     related code (Philipp Stanner)

   - Warn (not BUG()) about failure to assign optional resources (Ilpo
     Järvinen)

  Error handling:

   - Log the DPC Error Source ID only when it's actually valid (when
     ERR_FATAL or ERR_NONFATAL was received from a downstream device)
     and decode into bus/device/function (Bjorn Helgaas)

   - Determine AER log level once and save it so all related messages
     use the same level (Karolina Stolarek)

   - Use KERN_WARNING, not KERN_ERR, when logging PCIe Correctable
     Errors (Karolina Stolarek)

   - Ratelimit PCIe Correctable and Non-Fatal error logging, with sysfs
     controls on interval and burst count, to avoid flooding logs and
     RCU stall warnings (Jon Pan-Doh)

  Power management:

   - Increment PM usage counter when probing reset methods so we don't
     try to read config space of a powered-off device (Alex Williamson)

   - Set all devices to D0 during enumeration to ensure ACPI opregion is
     connected via _REG (Mario Limonciello)

  Power control:

   - Rename pwrctrl Kconfig symbols from 'PWRCTL' to 'PWRCTRL' to match
     the filename paths. Retain old deprecated symbols for
     compatibility, except for the pwrctrl slot driver
     (PCI_PWRCTRL_SLOT) (Johan Hovold)

   - When unregistering pwrctrl, cancel outstanding rescan work before
     cleaning up data structures to avoid use-after-free issues (Brian
     Norris)

  Bandwidth control:

   - Simplify link bandwidth controller by replacing the count of Link
     Bandwidth Management Status (LBMS) events with a PCI_LINK_LBMS_SEEN
     flag (Ilpo Järvinen)

   - Update the Link Speed after retraining, since the Link Speed may
     have changed (Ilpo Järvinen)

  PCIe native device hotplug:

   - Ignore Presence Detect Changed caused by DPC.

     pciehp already ignores Link Down/Up events caused by DPC, but on
     slots using in-band presence detect, DPC causes a spurious Presence
     Detect Changed event (Lukas Wunner)

   - Ignore Link Down/Up caused by Secondary Bus Reset.

     On hotplug ports using in-band presence detect, the reset causes a
     Presence Detect Changed event, which mistakenly caused teardown and
     re-enumeration of the device. Drivers may need to annotate code
     that resets their device (Lukas Wunner)

  Virtualization:

   - Add an ACS quirk for Loongson Root Ports that don't advertise ACS
     but don't allow peer-to-peer transactions between Root Ports; the
     quirk allows each Root Port to be in a separate IOMMU group (Huacai
     Chen)

  Endpoint framework:

   - For fixed-size BARs, retain both the actual size and the possibly
     larger size allocated to accommodate iATU alignment requirements
     (Jerome Brunet)

   - Simplify ctrl/SPAD space allocation and avoid allocating more space
     than needed (Jerome Brunet)

   - Correct MSI-X PBA offset calculations for DesignWare and Cadence
     endpoint controllers (Niklas Cassel)

   - Align the return value (number of interrupts) encoding for
     pci_epc_get_msi()/pci_epc_ops::get_msi() and
     pci_epc_get_msix()/pci_epc_ops::get_msix() (Niklas Cassel)

   - Align the nr_irqs parameter encoding for
     pci_epc_set_msi()/pci_epc_ops::set_msi() and
     pci_epc_set_msix()/pci_epc_ops::set_msix() (Niklas Cassel)

  Common host controller library:

   - Convert pci-host-common to a library so platforms that don't need
     native host controller drivers don't need to include these helper
     functions (Manivannan Sadhasivam)

  Apple PCIe controller driver:

   - Extract ECAM bridge creation helper from pci_host_common_probe() to
     separate driver-specific things like MSI from PCI things (Marc
     Zyngier)

   - Dynamically allocate RID-to_SID bitmap to prepare for SoCs with
     varying capabilities (Marc Zyngier)

   - Skip ports disabled in DT when setting up ports (Janne Grunau)

   - Add t6020 compatible string (Alyssa Rosenzweig)

   - Add T602x PCIe support (Hector Martin)

   - Directly set/clear INTx mask bits because T602x dropped the
     accessors that could do this without locking (Marc Zyngier)

   - Move port PHY registers to their own reg items to accommodate
     T602x, which moves them around; retain default offsets for existing
     DTs that lack phy%d entries with the reg offsets (Hector Martin)

   - Stop polling for core refclk, which doesn't work on T602x and the
     bootloader has already done anyway (Hector Martin)

   - Use gpiod_set_value_cansleep() when asserting PERST# in probe
     because we're allowed to sleep there (Hector Martin)

  Cadence PCIe controller driver:

   - Drop a runtime PM 'put' to resolve a runtime atomic count underflow
     (Hans Zhang)

   - Make the cadence core buildable as a module (Kishon Vijay Abraham I)

   - Add cdns_pcie_host_disable() and cdns_pcie_ep_disable() for use by
     loadable drivers when they are removed (Siddharth Vadapalli)

  Freescale i.MX6 PCIe controller driver:

   - Apply link training workaround only on IMX6Q, IMX6SX, IMX6SP
     (Richard Zhu)

   - Remove redundant dw_pcie_wait_for_link() from
     imx_pcie_start_link(); since the DWC core does this, imx6 only
     needs it when retraining for a faster link speed (Richard Zhu)

   - Toggle i.MX95 core reset to align with PHY powerup (Richard Zhu)

   - Set SYS_AUX_PWR_DET to work around i.MX95 ERR051624 erratum: in
     some cases, the controller can't exit 'L23 Ready' through Beacon or
     PERST# deassertion (Richard Zhu)

   - Clear GEN3_ZRXDC_NONCOMPL to work around i.MX95 ERR051586 erratum:
     controller can't meet 2.5 GT/s ZRX-DC timing when operating at 8
     GT/s, causing timeouts in L1 (Richard Zhu)

   - Wait for i.MX95 PLL lock before enabling controller (Richard Zhu)

   - Save/restore i.MX95 LUT for suspend/resume (Richard Zhu)

  Mobiveil PCIe controller driver:

   - Return bool (not int) for link-up check in
     mobiveil_pab_ops.link_up() and layerscape-gen4, mobiveil (Hans
     Zhang)

  NVIDIA Tegra194 PCIe controller driver:

   - Create debugfs directory for 'aspm_state_cnt' only when
     CONFIG_PCIEASPM is enabled, since there are no other entries (Hans
     Zhang)

  Qualcomm PCIe controller driver:

   - Add OF support for parsing DT 'eq-presets-<N>gts' property for lane
     equalization presets (Krishna Chaitanya Chundru)

   - Read Maximum Link Width from the Link Capabilities register if DT
     lacks 'num-lanes' property (Krishna Chaitanya Chundru)

   - Add Physical Layer 64 GT/s Capability ID and register offsets for
     8, 32, and 64 GT/s lane equalization registers (Krishna Chaitanya
     Chundru)

   - Add generic dwc support for configuring lane equalization presets
     (Krishna Chaitanya Chundru)

   - Add DT and driver support for PCIe on IPQ5018 SoC (Nitheesh Sekar)

  Renesas R-Car PCIe controller driver:

   - Describe endpoint BAR 4 as being fixed size (Jerome Brunet)

   - Document how to obtain R-Car V4H (r8a779g0) controller firmware
     (Yoshihiro Shimoda)

  Rockchip PCIe controller driver:

   - Reorder rockchip_pci_core_rsts because
     reset_control_bulk_deassert() deasserts in reverse order, to fix a
     link training regression (Jensen Huang)

   - Mark RK3399 as being capable of raising INTx interrupts (Niklas
     Cassel)

  Rockchip DesignWare PCIe controller driver:

   - Check only PCIE_LINKUP, not LTSSM status, to determine whether the
     link is up (Shawn Lin)

   - Increase N_FTS (used in L0s->L0 transitions) and enable ASPM L0s
     for Root Complex and Endpoint modes (Shawn Lin)

   - Hide the broken ATS Capability in rockchip_pcie_ep_init() instead
     of rockchip_pcie_ep_pre_init() so it stays hidden after PERST#
     resets non-sticky registers (Shawn Lin)

   - Call phy_power_off() before phy_exit() in rockchip_pcie_phy_deinit()
     (Diederik de Haas)

  Synopsys DesignWare PCIe controller driver:

   - Set PORT_LOGIC_LINK_WIDTH to one lane to make initial link training
     more robust; this will not affect the intended link width if all
     lanes are functional (Wenbin Yao)

   - Return bool (not int) for link-up check in dw_pcie_ops.link_up()
     and armada8k, dra7xx, dw-rockchip, exynos, histb, keembay,
     keystone, kirin, meson, qcom, qcom-ep, rcar_gen4, spear13xx,
     tegra194, uniphier, visconti (Hans Zhang)

   - Add debugfs support for exposing DWC device-specific PTM context
     (Manivannan Sadhasivam)

  TI J721E PCIe driver:

   - Make j721e buildable as a loadable and removable module (Siddharth
     Vadapalli)

   - Fix j721e host/endpoint dependencies that result in link failures
     in some configs (Arnd Bergmann)

  Device tree bindings:

   - Add qcom DT binding for 'global' interrupt (PCIe controller and
     link-specific events) for ipq8074, ipq8074-gen3, ipq6018, sa8775p,
     sc7280, sc8180x sdm845, sm8150, sm8250, sm8350 (Manivannan
     Sadhasivam)

   - Add qcom DT binding for 8 MSI SPI interrupts for msm8998, ipq8074,
     ipq8074-gen3, ipq6018 (Manivannan Sadhasivam)

   - Add dw rockchip DT binding for rk3576 and rk3562 (Kever Yang)

   - Correct indentation and style of examples in brcm,stb-pcie,
     cdns,cdns-pcie-ep, intel,keembay-pcie-ep, intel,keembay-pcie,
     microchip,pcie-host, rcar-pci-ep, rcar-pci-host, xilinx-versal-cpm
     (Krzysztof Kozlowski)

   - Convert Marvell EBU (dove, kirkwood, armada-370, armada-xp) and
     armada8k from text to schema DT bindings (Rob Herring)

   - Remove obsolete .txt DT bindings for content that has been moved to
     schemas (Rob Herring)

   - Add qcom DT binding for MHI registers in IPQ5332, IPQ6018, IPQ8074
     and IPQ9574 (Varadarajan Narayanan)

   - Convert v3,v360epc-pci from text to DT schema binding (Rob Herring)

   - Change microchip,pcie-host DT binding to be 'dma-noncoherent' since
     PolarFire may be configured that way (Conor Dooley)

  Miscellaneous:

   - Drop 'pci' suffix from intel_mid_pci.c filename to match similar
     files (Andy Shevchenko)

   - All platforms with PCI have an MMU, so add PCI Kconfig dependency
     on MMU to simplify build testing and avoid inadvertent build
     regressions (Arnd Bergmann)

   - Update Krzysztof Wilczyński's email address in MAINTAINERS
     (Krzysztof Wilczyński)

   - Update Manivannan Sadhasivam's email address in MAINTAINERS
     (Manivannan Sadhasivam)"

* tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (147 commits)
  MAINTAINERS: Update Manivannan Sadhasivam email address
  PCI: j721e: Fix host/endpoint dependencies
  PCI: j721e: Add support to build as a loadable module
  PCI: cadence-ep: Introduce cdns_pcie_ep_disable() helper for cleanup
  PCI: cadence-host: Introduce cdns_pcie_host_disable() helper for cleanup
  PCI: cadence: Add support to build pcie-cadence library as a kernel module
  MAINTAINERS: Update Krzysztof Wilczyński email address
  PCI: Remove unnecessary linesplit in __pci_setup_bridge()
  PCI: WARN (not BUG()) when we fail to assign optional resources
  PCI: Remove unused pci_printk()
  PCI: qcom: Replace PERST# sleep time with proper macro
  PCI: dw-rockchip: Replace PERST# sleep time with proper macro
  PCI: host-common: Convert to library for host controller drivers
  PCI/ERR: Remove misleading TODO regarding kernel panic
  PCI: cadence: Remove duplicate message code definitions
  PCI: endpoint: Align pci_epc_set_msix(), pci_epc_ops::set_msix() nr_irqs encoding
  PCI: endpoint: Align pci_epc_set_msi(), pci_epc_ops::set_msi() nr_irqs encoding
  PCI: endpoint: Align pci_epc_get_msix(), pci_epc_ops::get_msix() return value encoding
  PCI: endpoint: Align pci_epc_get_msi(), pci_epc_ops::get_msi() return value encoding
  PCI: cadence-ep: Correct PBA offset in .set_msix() callback
  ...
2025-06-04 11:26:17 -07:00
Li RongQing
4518e5a60c vfio/type1: Fix error unwind in migration dirty bitmap allocation
When setting up dirty page tracking at the vfio IOMMU backend for
device migration, if an error is encountered allocating a tracking
bitmap, the unwind loop fails to free previously allocated tracking
bitmaps.  This occurs because the wrong loop index is used to
generate the tracking object.  This results in unintended memory
usage for the life of the current DMA mappings where bitmaps were
successfully allocated.

Use the correct loop index to derive the tracking object for
freeing during unwind.

Fixes: d6a4c18566 ("vfio iommu: Implementation of ioctl for dirty pages tracking")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20250521034647.2877-1-lirongqing@baidu.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-22 10:41:24 -06:00
Leon Romanovsky
089803c401 vfio/mlx5: Enable the DMA link API
Remove intermediate scatter-gather table completely and
enable new DMA link API.

Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/f71638d50c9c79a462f2e0423501b1de77617656.1747747694.git.leon@kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-20 08:32:07 -06:00
Leon Romanovsky
ac6c973a48 vfio/mlx5: Rewrite create mkey flow to allow better code reuse
Change the creation of mkey to be performed in multiple steps:
data allocation, DMA setup and actual call to HW to create that mkey.

In this new flow, the whole input to MKEY command is saved to eliminate
the need to keep array of pointers for DMA addresses for receive list
and in the future patches for send list too.

In addition to memory size reduce and elimination of unnecessary data
movements to set MKEY input, the code is prepared for future reuse.

Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/d4ad0384fbd1e23a607cbbe9e5756748f3a761d9.1747747694.git.leon@kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-20 08:32:07 -06:00
Leon Romanovsky
674ebb64cd vfio/mlx5: Explicitly use number of pages instead of allocated length
allocated_length is a multiple of page size and number of pages,
so let's change the functions to accept number of pages. This improves
code readability, simplifies buffer handling, and enables combining DMA
send/receive operations, as will be introduced in the next patches.

Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/76f39993d2ca0311b3bcfe56038a669d03926815.1747747694.git.leon@kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-20 08:32:07 -06:00
Longfang Liu
1dcf2cf102 hisi_acc_vfio_pci: update function return values.
In this driver file, many functions call sub-functions and use ret
to store the error code of the sub-functions.
However, instead of directly returning ret to the caller, they use a
converted error code, which prevents the end-user from clearly
understanding the root cause of the error.
Therefore, the code needs to be modified to directly return the error
code from the sub-functions.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20250510081155.55840-7-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-19 12:41:31 -06:00
Longfang Liu
2777a40998 hisi_acc_vfio_pci: bugfix live migration function without VF device driver
If the VF device driver is not loaded in the Guest OS and we attempt to
perform device data migration, the address of the migrated data will
be NULL.
The live migration recovery operation on the destination side will
access a null address value, which will cause access errors.

Therefore, live migration of VMs without added VF device drivers
does not require device data migration.
In addition, when the queue address data obtained by the destination
is empty, device queue recovery processing will not be performed.

Fixes: b0eed08590 ("hisi_acc_vfio_pci: Add support for VFIO live migration")
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20250510081155.55840-6-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-19 12:41:31 -06:00
Longfang Liu
db6525a857 hisi_acc_vfio_pci: bugfix the problem of uninstalling driver
In a live migration scenario. If the number of VFs at the
destination is greater than the source, the recovery operation
will fail and qemu will not be able to complete the process and
exit after shutting down the device FD.

This will cause the driver to be unable to be unloaded normally due
to abnormal reference counting of the live migration driver caused
by the abnormal closing operation of fd.

Therefore, make sure the migration file descriptor references are
always released when the device is closed.

Fixes: b0eed08590 ("hisi_acc_vfio_pci: Add support for VFIO live migration")
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20250510081155.55840-5-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-19 12:41:31 -06:00
Longfang Liu
e63c466398 hisi_acc_vfio_pci: bugfix cache write-back issue
At present, cache write-back is placed in the device data
copy stage after stopping the device operation.
Writing back to the cache at this stage will cause the data
obtained by the cache to be written back to be empty.

In order to ensure that the cache data is written back
successfully, the data needs to be written back into the
stop device stage.

Fixes: b0eed08590 ("hisi_acc_vfio_pci: Add support for VFIO live migration")
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20250510081155.55840-4-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-19 12:41:31 -06:00
Longfang Liu
3495cec078 hisi_acc_vfio_pci: add eq and aeq interruption restore
In order to ensure that the task packets of the accelerator
device are not lost during the migration process, it is necessary
to send an EQ and AEQ command to the device after the live migration
is completed and to update the completion position of the task queue.

Let the device recheck the completed tasks data and if there are
uncollected packets, device resend a task completion interrupt
to the software.

Fixes: b0eed08590 ("hisi_acc_vfio_pci: Add support for VFIO live migration")
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20250510081155.55840-3-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-19 12:41:31 -06:00
Longfang Liu
8bb7170c5a hisi_acc_vfio_pci: fix XQE dma address error
The dma addresses of EQE and AEQE are wrong after migration and
results in guest kernel-mode encryption services  failure.
Comparing the definition of hardware registers, we found that
there was an error when the data read from the register was
combined into an address. Therefore, the address combination
sequence needs to be corrected.

Even after fixing the above problem, we still have an issue
where the Guest from an old kernel can get migrated to
new kernel and may result in wrong data.

In order to ensure that the address is correct after migration,
if an old magic number is detected, the dma address needs to be
updated.

Fixes: b0eed08590 ("hisi_acc_vfio_pci: Add support for VFIO live migration")
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20250510081155.55840-2-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-19 12:41:31 -06:00
Jason Gunthorpe
07970d048c vfio/type1: Remove Fine Grained Superpages detection
VFIO is looking to enable an optimization where it can rely on a fast
unmap operation that returned the size of a larger IOPTE.

Due to how the test was constructed this would only ever succeed on the
AMDv1 page table that supported an 8k contiguous size. Nothing else
supports this.

Alex says the performance win was fairly minor, so lets remove this
code. Always use iommu_iova_to_phys() to extent contiguous pages.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v2-97fa1da8d983+412-vfio_fgsp_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-19 12:39:27 -06:00
Alex Williamson
c1d9dac0db vfio/pci: Align huge faults to order
The vfio-pci huge_fault handler doesn't make any attempt to insert a
mapping containing the faulting address, it only inserts mappings if the
faulting address and resulting pfn are aligned.  This works in a lot of
cases, particularly in conjunction with QEMU where DMA mappings linearly
fault the mmap.  However, there are configurations where we don't get
that linear faulting and pages are faulted on-demand.

The scenario reported in the bug below is such a case, where the physical
address width of the CPU is greater than that of the IOMMU, resulting in a
VM where guest firmware has mapped device MMIO beyond the address width of
the IOMMU.  In this configuration, the MMIO is faulted on demand and
tracing indicates that occasionally the faults generate a VM_FAULT_OOM.
Given the use case, this results in a "error: kvm run failed Bad address",
killing the VM.

The host is not under memory pressure in this test, therefore it's
suspected that VM_FAULT_OOM is actually the result of a NULL return from
__pte_offset_map_lock() in the get_locked_pte() path from insert_pfn().
This suggests a potential race inserting a pte concurrent to a pmd, and
maybe indicates some deficiency in the mm layer properly handling such a
case.

Nevertheless, Peter noted the inconsistency of vfio-pci's huge_fault
handler where our mapping granularity depends on the alignment of the
faulting address relative to the order rather than aligning the faulting
address to the order to more consistently insert huge mappings.  This
change not only uses the page tables more consistently and efficiently, but
as any fault to an aligned page results in the same mapping, the race
condition suspected in the VM_FAULT_OOM is avoided.

Reported-by: Adolfo <adolfotregosa@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220057
Fixes: 09dfc8a5f2 ("vfio/pci: Fallback huge faults for unaligned pfn")
Cc: stable@vger.kernel.org
Tested-by: Adolfo <adolfotregosa@gmail.com>
Co-developed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20250502224035.3183451-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-06 12:59:12 -06:00
Arnd Bergmann
8fe743b5eb PCI: Add CONFIG_MMU dependency
It turns out that there are no platforms that have PCI but don't have an
MMU, so adding a Kconfig dependency on CONFIG_PCI simplifies build testing
kernels for those platforms a lot, and avoids a lot of inadvertent build
regressions.

Add a dependency for CONFIG_PCI and remove all the ones for PCI specific
device drivers that are currently marked not having it.

There are a few platforms that have an optional MMU, but they usually
cannot have PCI at all. The one exception is Coldfire MCF54xx, but this is
mainly for historic reasons, and anyone using those chips should really use
the MMU these days.

Link: https://lore.kernel.org/lkml/a41f1b20-a76c-43d8-8c36-f12744327a54@app.fastmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20250423202215.3315550-1-arnd@kernel.org
2025-04-23 15:40:30 -05:00
Alex Williamson
2bd42b03ab vfio/pci: Virtualize zero INTx PIN if no pdev->irq
Typically pdev->irq is consistent with whether the device itself
supports INTx, where device support is reported via the PIN register.
Therefore the PIN register is often already zero if pdev->irq is zero.

Recently virtualization of the PIN register was expanded to include
the case where the device supports INTx but the platform does not
route the interrupt.  This is reported by a value of IRQ_NOTCONNECTED
on some architectures.  Other architectures just report zero for
pdev->irq.

We already disallow INTx setup if pdev->irq is zero, therefore add
this to the PIN register virtualization criteria so that a consistent
view is provided to userspace through virtualized config space and
ioctls.

Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Link: https://lore.kernel.org/all/174231895238.2295.12586708771396482526.stgit@linux.ibm.com/
Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Link: https://lore.kernel.org/r/20250320194145.2816379-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-04-14 08:31:45 -06:00
Linus Torvalds
3491aa0478 VFIO updates for v6.15-rc1
- Relax IGD support code to match display class device rather than
    specifically requiring a VGA device. (Tomita Moeko)
 
  - Accelerate DMA mapping of device MMIO by iterating at PMD and PUD
    levels to take advantage of huge pfnmap support added in v6.12.
    (Alex Williamson)
 
  - Extend virtio vfio-pci variant driver to include migration support
    for block devices where enabled by the PF. (Yishai Hadas)
 
  - Virtualize INTx PIN register for devices where the platform does
    not route legacy PCI interrupts for the device and the interrupt
    is reported as IRQ_NOTCONNECTED. (Alex Williamson)
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmfq5nEbHGFsZXgud2ls
 bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsi3KAP/2MQcQaKTZ6/+dG6YdKT
 ZFaY4+xJ14DnUN/z96UlIWLk8bWgSyDFxdoFMbtFGENKRslEWxZ7In9Caow7f6ux
 7/usBjSvJa5Yx9YWRGsblrx7IyYfSW6R1V+jH3xPd+K8Ir4K7SUvb1CJLVPdfEYh
 OWer8eRpZ5tw3R2X4o+QxZ+H4Fx1zVQourW35h4daqrjnn7kOQMJIzGYOwHSDlCy
 lW0X0yD3sGgw9w7qAmEDmw9UbKGf245AVylIl5T1a7c3RaO+eKdKPZfNa18g0J/Q
 5pRMK+2PvZ+S0OTYxotcF9GtEJ3iBxY8W4QnlLiyTs9XyZ7tLMzGvLEKmCDKA0U8
 yAtoJ5T00PVXjMxkZx1+oMGja9Hx+b7gABTYpbf5wRtab6EdNWln++I1HCLKgZZ+
 yStvQNsMYGbJsLfwiGouMwD24JT+xg3A+Dv2Cx+Ai4NVJebxTD8Lhc0lz2I6IpOh
 wFBpBzBIPpcG53oQ1Syb9GLESQ0Acb4LUMjsSxIg7QFSrWgAAlq/PiLXv852S3xJ
 pUEh7r/YByQytUsQajgE7ekKqyXw0gn99Z+UTk0LUIq/y7SxrIPeqzq2qRf490RV
 wnkOrMxrAWj84lkIv8hLCiLsXMmvV4rsMJgV+s8KQZ+hPv38mPbkFYGdHj4h5RPA
 5J5h32dDkHLK+X5u//gBY8xh
 =fJPO
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.15-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Relax IGD support code to match display class device rather than
   specifically requiring a VGA device (Tomita Moeko)

 - Accelerate DMA mapping of device MMIO by iterating at PMD and PUD
   levels to take advantage of huge pfnmap support added in v6.12
   (Alex Williamson)

 - Extend virtio vfio-pci variant driver to include migration support
   for block devices where enabled by the PF (Yishai Hadas)

 - Virtualize INTx PIN register for devices where the platform does not
   route legacy PCI interrupts for the device and the interrupt is
   reported as IRQ_NOTCONNECTED (Alex Williamson)

* tag 'vfio-v6.15-rc1' of https://github.com/awilliam/linux-vfio:
  vfio/pci: Handle INTx IRQ_NOTCONNECTED
  vfio/virtio: Enable support for virtio-block live migration
  vfio/type1: Use mapping page mask for pfnmaps
  mm: Provide address mask in struct follow_pfnmap_args
  vfio/type1: Use consistent types for page counts
  vfio/type1: Use vfio_batch for vaddr_get_pfns()
  vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
  vfio/type1: Catch zero from pin_user_pages_remote()
  vfio/pci: match IGD devices in display controller class
2025-04-01 19:35:19 -07:00
Linus Torvalds
48552153cf iommufd 6.15 merge window pull
Two significant new items:
 
 - Allow reporting IOMMU HW events to userspace when the events are clearly
   linked to a device. This is linked to the VIOMMU object and is intended to
   be used by a VMM to forward HW events to the virtual machine as part of
   emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
   mechanism. Like the existing fault events the data is delivered through
   a simple FD returning event records on read().
 
 - PASID support in VFIO. "Process Address Space ID" is a PCI feature that
   allows the device to tag all PCI DMA operations with an ID. The IOMMU
   will then use the ID to select a unique translation for those DMAs. This
   is part of Intel's vIOMMU support as VT-D HW requires the hypervisor to
   manage each PASID entry. The support is generic so any VFIO user could
   attach any translation to a PASID, and the support should work on ARM
   SMMUv3 as well. AMD requires additional driver work.
 
 Some minor updates, along with fixes:
 
 - Prevent using nested parents with fault's, no driver support today
 
 - Put a single "cookie_type" value in the iommu_domain to indicate what
   owns the various opaque owner fields
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ+q6NgAKCRCFwuHvBreF
 YZ3zAQDbl4/Z0O+CLN2AXq4Zeiyq1HTSoF94hzqmm7lQ17zTIwD8CCdyLXHvupaq
 tkBIv5IovpaxlrSk6M0kh2K8vPCk9Qk=
 =CIM3
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd

Pull iommufd updates from Jason Gunthorpe:
 "Two significant new items:

   - Allow reporting IOMMU HW events to userspace when the events are
     clearly linked to a device.

     This is linked to the VIOMMU object and is intended to be used by a
     VMM to forward HW events to the virtual machine as part of
     emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
     mechanism. Like the existing fault events the data is delivered
     through a simple FD returning event records on read().

   - PASID support in VFIO.

     The "Process Address Space ID" is a PCI feature that allows the
     device to tag all PCI DMA operations with an ID. The IOMMU will
     then use the ID to select a unique translation for those DMAs. This
     is part of Intel's vIOMMU support as VT-D HW requires the
     hypervisor to manage each PASID entry.

     The support is generic so any VFIO user could attach any
     translation to a PASID, and the support should work on ARM SMMUv3
     as well. AMD requires additional driver work.

  Some minor updates, along with fixes:

   - Prevent using nested parents with fault's, no driver support today

   - Put a single "cookie_type" value in the iommu_domain to indicate
     what owns the various opaque owner fields"

* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (49 commits)
  iommufd: Test attach before detaching pasid
  iommufd: Fix iommu_vevent_header tables markup
  iommu: Convert unreachable() to BUG()
  iommufd: Balance veventq->num_events inc/dec
  iommufd: Initialize the flags of vevent in iommufd_viommu_report_event()
  iommufd/selftest: Add coverage for reporting max_pasid_log2 via IOMMU_HW_INFO
  iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
  vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
  vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  ida: Add ida_find_first_range()
  iommufd/selftest: Add coverage for iommufd pasid attach/detach
  iommufd/selftest: Add test ops to test pasid attach/detach
  iommufd/selftest: Add a helper to get test device
  iommufd/selftest: Add set_dev_pasid in mock iommu
  iommufd: Allow allocating PASID-compatible domain
  iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support
  iommufd: Enforce PASID-compatible domain for RID
  iommufd: Support pasid attach/replace
  iommufd: Enforce PASID-compatible domain in PASID path
  iommufd/device: Add pasid_attach array to track per-PASID attach
  ...
2025-04-01 18:03:46 -07:00
Linus Torvalds
7d06015d93 pci-v6.15-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmfmt1AUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vxqNw//cC1BlVe4UUVR5r9nfpoFeGAZeDJz
 32naGvZKjwL0tR6dStO/BEZx4QrBp+smVfJfuxtQxfzLHLgMigM2jVhfa7XUmaun
 7yZlJZu4Jmydc57sPf56CFOYOFP6zyPzSaE8u1Eb4IIqpvuoYpvDayDt6PSsLmFS
 PDzqmicT3nuNbbcfE4rYLyL6JsXooKCR1h+NNcDjy7Run9DvQbE6N2PPvXCu6O97
 aC3+kYUydEpgn9DfjBDghe+GBQCkBPldwnWqXBxKDFmYj5bKFujNccS9/IDSEuuX
 oWntDRAXgLWg048sBC1AuJQajF3UaqffRGJkzUBaZWbU/jB9t5N/Z3GpYlXzizRx
 CAqnt/ciGUKVbaESKwoeeIqgK+wG1bnrmoEaJHXFGqjr6sjm2A2T5EzyBMJ1hFwE
 wUq6SDnkp5igG7rWtsBPo/lGa5h/pNlaXng11570ikD2ZfHVfRgwy2MpXYxChrkt
 X2q/lRYU7yyfNJQ8O5LQJ6bYztatjxT0TxXNFv+cxfVrdI7vnQMuaeBE352jn+Lo
 aZ8fTJDFbRXtrZolcoetZjBdHwPIS42wYQtvxo/ylUl64xKEzZEzN09XODjwn74v
 nuOzDtbM0TAjZWxi6bwRFPnemTEAQxPuv1i4VdPQ7yblvC0at2agNBsz6Fdq60GS
 s80YMJVUU0UcVoc=
 =gM1i
 -----END PGP SIGNATURE-----

Merge tag 'pci-v6.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci updates from Bjorn Helgaas:
 "Enumeration:

   - Enable Configuration RRS SV, which makes device readiness visible,
     early instead of during child bus scanning (Bjorn Helgaas)

   - Log debug messages about reset methods being used (Bjorn Helgaas)

   - Avoid reset when it has been disabled via sysfs (Nishanth
     Aravamudan)

   - Add common pci-ep-bus.yaml schema for exporting several peripherals
     of a single PCI function via devicetree (Andrea della Porta)

   - Create DT nodes for PCI host bridges to enable loading device tree
     overlays to create platform devices for PCI devices that have
     several features that require multiple drivers (Herve Codina)

  Resource management:

   - Enlarge devres table[] to accommodate bridge windows, ROM, IOV
     BARs, etc., and validate BAR index in devres interfaces (Philipp
     Stanner)

   - Fix typo that repeatedly distributed resources to a bridge instead
     of iterating over subordinate bridges, which resulted in too little
     space to assign some BARs (Kai-Heng Feng)

   - Relax bridge window tail sizing for optional resources, e.g., IOV
     BARs, to avoid failures when removing and re-adding devices (Ilpo
     Järvinen)

   - Allow drivers to enable devices even if we haven't assigned
     optional IOV resources to them (Ilpo Järvinen)

   - Rework handling of optional resources (IOV BARs, ROMs) to reduce
     failures if we can't allocate them (Ilpo Järvinen)

   - Fix a NULL dereference in the SR-IOV VF creation error path (Shay
     Drory)

   - Fix s390 mmio_read/write syscalls, which didn't cause page faults
     in some cases, which broke vfio-pci lazy mapping on first access
     (Niklas Schnelle)

   - Add pdev->non_mappable_bars to replace CONFIG_VFIO_PCI_MMAP, which
     was disabled only for s390 (Niklas Schnelle)

   - Support mmap of PCI resources on s390 except for ISM devices
     (Niklas Schnelle)

  ASPM:

   - Delay pcie_link_state deallocation to avoid dangling pointers that
     cause invalid references during hot-unplug (Daniel Stodden)

  Power management:

   - Allow PCI bridges to go to D3Hot when suspending on all non-x86
     systems (Manivannan Sadhasivam)

  Power control:

   - Create pwrctrl devices in pci_scan_device() to make it more
     symmetric with pci_pwrctrl_unregister() and make pwrctrl devices
     for PCI bridges possible (Manivannan Sadhasivam)

   - Unregister pwrctrl devices in pci_destroy_dev() so DOE, ASPM, etc.
     can still access devices after pci_stop_dev() (Manivannan
     Sadhasivam)

   - If there's a pwrctrl device for a PCI device, skip scanning it
     because the pwrctrl core will rescan the bus after the device is
     powered on (Manivannan Sadhasivam)

   - Add a pwrctrl driver for PCI slots based on voltage regulators
     described via devicetree (Manivannan Sadhasivam)

  Bandwidth control:

   - Add set_pcie_speed.sh to TEST_PROGS to fix issue when executing the
     set_pcie_cooling_state.sh test case (Yi Lai)

   - Avoid a NULL pointer dereference when we run out of bus numbers to
     assign for a bridge secondary bus (Lukas Wunner)

  Hotplug:

   - Drop superfluous pci_hotplug_slot_list, try_module_get() calls, and
     NULL pointer checks (Lukas Wunner)

   - Drop shpchp module init/exit logging, replace shpchp dbg() with
     ctrl_dbg(), and remove unused dbg(), err(), info(), warn() wrappers
     (Ilpo Järvinen)

   - Drop 'shpchp_debug' module parameter in favor of standard dynamic
     debugging (Ilpo Järvinen)

   - Drop unused cpcihp .get_power(), .set_power() function pointers
     (Guilherme Giacomo Simoes)

   - Disable hotplug interrupts in portdrv only when pciehp is not
     enabled to avoid issuing two hotplug commands too close together
     (Feng Tang)

   - Skip pciehp 'device replaced' check if the device has been removed
     to address a deadlock when resuming after a device was removed
     during system sleep (Lukas Wunner)

   - Don't enable pciehp hotplug interupt when resuming in poll mode
     (Ilpo Järvinen)

  Virtualization:

   - Fix bugs in 'pci=config_acs=' kernel command line parameter (Tushar
     Dave)

  DOE:

   - Expose supported DOE features via sysfs (Alistair Francis)

   - Allow DOE support to be enabled even if CXL isn't enabled (Alistair
     Francis)

  Endpoint framework:

   - Convert PCI device data so pci-epf-test works correctly on
     big-endian endpoint systems (Niklas Cassel)

   - Add BAR_RESIZABLE type to endpoint framework and add DWC core
     support for EPF drivers to set BAR_RESIZABLE type and size (Niklas
     Cassel)

   - Fix pci-epf-test double free that causes an oops if the host
     reboots and PERST# deassertion restarts endpoint BAR allocation
     (Christian Bruel)

   - Fix endpoint BAR testing so tests can skip disabled BARs instead of
     reporting them as failures (Niklas Cassel)

   - Widen endpoint test BAR size variable to accommodate BARs larger
     than INT_MAX (Niklas Cassel)

   - Remove unused tools 'pci' build target left over after moving tests
     to tools/testing/selftests/pci_endpoint (Jianfeng Liu)

  Altera PCIe controller driver:

   - Add DT binding and driver support for Agilex family (P-Tile,
     F-Tile, R-Tile) (Matthew Gerlach and D M, Sharath Kumar)

  AMD MDB PCIe controller driver:

   - Add DT binding and driver for AMD MDB (Multimedia DMA Bridge)
     (Thippeswamy Havalige)

  Broadcom STB PCIe controller driver:

   - Add BCM2712 MSI-X DT binding and interrupt controller drivers and
     add softdep on irq_bcm2712_mip driver to ensure that it is loaded
     first (Stanimir Varbanov)

   - Expand inbound window map to 64GB so it can accommodate BCM2712
     (Stanimir Varbanov)

   - Add BCM2712 support and DT updates (Stanimir Varbanov)

   - Apply link speed restriction before bringing link up, not after
     (Jim Quinlan)

   - Update Max Link Speed in Link Capabilities via the internal
     writable register, not the read-only config register (Jim Quinlan)

   - Handle regulator_bulk_get() error to avoid panic when we call
     regulator_bulk_free() later (Jim Quinlan)

   - Disable regulators only when removing the bus immediately below a
     Root Port because we don't support regulators deeper in the
     hierarchy (Jim Quinlan)

   - Make const read-only arrays static (Colin Ian King)

  Cadence PCIe endpoint driver:

   - Correct MSG TLP generation so endpoints can generate INTx messages
     (Hans Zhang)

  Freescale i.MX6 PCIe controller driver:

   - Identify the second controller on i.MX8MQ based on devicetree
     'linux,pci-domain' instead of DBI 'reg' address (Richard Zhu)

   - Remove imx_pcie_cpu_addr_fixup() since dwc core can now derive the
     ATU input address (using parent_bus_offset) from devicetree (Frank
     Li)

  Freescale Layerscape PCIe controller driver:

   - Drop deprecated 'num-ib-windows' and 'num-ob-windows' and
     unnecessary 'status' from example (Krzysztof Kozlowski)

   - Correct the syscon_regmap_lookup_by_phandle_args("fsl,pcie-scfg")
     arg_count to fix probe failure on LS1043A (Ioana Ciornei)

  HiSilicon STB PCIe controller driver:

   - Call phy_exit() to clean up if histb_pcie_probe() fails (Christophe
     JAILLET)

  Intel Gateway PCIe controller driver:

   - Remove intel_pcie_cpu_addr() since dwc core can now derive the ATU
     input address (using parent_bus_offset) from devicetree (Frank Li)

  Intel VMD host bridge driver:

   - Convert vmd_dev.cfg_lock from spinlock_t to raw_spinlock_t so
     pci_ops.read() will never sleep, even on PREEMPT_RT where
     spinlock_t becomes a sleepable lock, to avoid calling a sleeping
     function from invalid context (Ryo Takakura)

  MediaTek PCIe Gen3 controller driver:

   - Remove leftover mac_reset assert for Airoha EN7581 SoC (Lorenzo
     Bianconi)

   - Add EN7581 PBUS controller 'mediatek,pbus-csr' DT property and
     program host bridge memory aperture to this syscon node (Lorenzo
     Bianconi)

  Qualcomm PCIe controller driver:

   - Add qcom,pcie-ipq5332 binding (Varadarajan Narayanan)

   - Add qcom i.MX8QM and i.MX8QXP/DXP optional DMA interrupt (Alexander
     Stein)

   - Add optional dma-coherent DT property for Qualcomm SA8775P (Dmitry
     Baryshkov)

   - Make DT iommu property required for SA8775P and prohibited for
     SDX55 (Dmitry Baryshkov)

   - Add DT IOMMU and DMA-related properties for Qualcomm SM8450 (Dmitry
     Baryshkov)

   - Add endpoint DT properties for SAR2130P and enable endpoint mode in
     driver (Dmitry Baryshkov)

   - Describe endpoint BAR0 and BAR2 as 64-bit only and BAR1 and BAR3 as
     RESERVED (Manivannan Sadhasivam)

  Rockchip DesignWare PCIe controller driver:

   - Describe rk3568 and rk3588 BARs as Resizable, not Fixed (Niklas
     Cassel)

  Synopsys DesignWare PCIe controller driver:

   - Add debugfs-based Silicon Debug, Error Injection, Statistical
     Counter support for DWC (Shradha Todi)

   - Add debugfs property to expose LTSSM status of DWC PCIe link (Hans
     Zhang)

   - Add Rockchip support for DWC debugfs features (Niklas Cassel)

   - Add dw_pcie_parent_bus_offset() to look up the parent bus address
     of a specified 'reg' property and return the offset from the CPU
     physical address (Frank Li)

   - Use dw_pcie_parent_bus_offset() to derive CPU -> ATU addr offset
     via 'reg[config]' for host controllers and 'reg[addr_space]' for
     endpoint controllers (Frank Li)

   - Apply struct dw_pcie.parent_bus_offset in ATU users to remove use
     of .cpu_addr_fixup() when programming ATU (Frank Li)

  TI J721E PCIe driver:

   - Correct the 'link down' interrupt bit for J784S4 (Siddharth
     Vadapalli)

  TI Keystone PCIe controller driver:

   - Describe AM65x BARs 2 and 5 as Resizable (not Fixed) and reduce
     alignment requirement from 1MB to 64KB (Niklas Cassel)

  Xilinx Versal CPM PCIe controller driver:

   - Free IRQ domain in probe error path to avoid leaking it
     (Thippeswamy Havalige)

   - Add DT .compatible "xlnx,versal-cpm5nc-host" and driver support for
     Versal Net CPM5NC Root Port controller (Thippeswamy Havalige)

   - Add driver support for CPM5_HOST1 (Thippeswamy Havalige)

  Miscellaneous:

   - Convert fsl,mpc83xx-pcie binding to YAML (J. Neuschäfer)

   - Use for_each_available_child_of_node_scoped() to simplify apple,
     kirin, mediatek, mt7621, tegra drivers (Zhang Zekun)"

* tag 'pci-v6.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (197 commits)
  PCI: layerscape: Fix arg_count to syscon_regmap_lookup_by_phandle_args()
  PCI: j721e: Fix the value of .linkdown_irq_regfield for J784S4
  misc: pci_endpoint_test: Add support for PCITEST_IRQ_TYPE_AUTO
  PCI: endpoint: pci-epf-test: Expose supported IRQ types in CAPS register
  PCI: dw-rockchip: Endpoint mode cannot raise INTx interrupts
  PCI: endpoint: Add intx_capable to epc_features struct
  dt-bindings: PCI: Add common schema for devices accessible through PCI BARs
  PCI: intel-gw: Remove intel_pcie_cpu_addr()
  PCI: imx6: Remove imx_pcie_cpu_addr_fixup()
  PCI: dwc: Use parent_bus_offset to remove need for .cpu_addr_fixup()
  PCI: dwc: ep: Ensure proper iteration over outbound map windows
  PCI: dwc: ep: Use devicetree 'reg[addr_space]' to derive CPU -> ATU addr offset
  PCI: dwc: ep: Consolidate devicetree handling in dw_pcie_ep_get_resources()
  PCI: dwc: ep: Call epc_create() early in dw_pcie_ep_init()
  PCI: dwc: Use devicetree 'reg[config]' to derive CPU -> ATU addr offset
  PCI: dwc: Add dw_pcie_parent_bus_offset() checking and debug
  PCI: dwc: Add dw_pcie_parent_bus_offset()
  PCI/bwctrl: Fix NULL pointer dereference on bus number exhaustion
  PCI: xilinx-cpm: Add cpm_csr register mapping for CPM5_HOST1 variant
  PCI: brcmstb: Make const read-only arrays static
  ...
2025-03-28 19:36:53 -07:00
Yi Liu
ad744ed5dd vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
This extends the VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT ioctls to attach/detach
a given pasid of a vfio device to/from an IOAS/HWPT.

Link: https://patch.msgid.link/r/20250321180143.8468-4-yi.l.liu@intel.com
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-25 10:18:31 -03:00
Yi Liu
290641346d vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
This adds pasid_at|de]tach_ioas ops for attaching hwpt to pasid of a
device and the helpers for it. For now, only vfio-pci supports pasid
attach/detach.

Link: https://patch.msgid.link/r/20250321180143.8468-3-yi.l.liu@intel.com
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-25 10:18:31 -03:00
Yi Liu
2fb69c602d iommufd: Support pasid attach/replace
This extends the below APIs to support PASID. Device drivers to manage pasid
attach/replace/detach.

    int iommufd_device_attach(struct iommufd_device *idev,
			      ioasid_t pasid, u32 *pt_id);
    int iommufd_device_replace(struct iommufd_device *idev,
			       ioasid_t pasid, u32 *pt_id);
    void iommufd_device_detach(struct iommufd_device *idev,
			       ioasid_t pasid);

The pasid operations share underlying attach/replace/detach infrastructure
with the device operations, but still have some different implications:

 - no reserved region per pasid otherwise SVA architecture is already
   broken (CPU address space doesn't count device reserved regions);

 - accordingly no sw_msi trick;

Cache coherency enforcement is still applied to pasid operations since
it is about memory accesses post page table walking (no matter the walk
is per RID or per PASID).

Link: https://patch.msgid.link/r/20250321171940.7213-12-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-25 10:18:31 -03:00
Niklas Schnelle
888bd8322d s390/pci: Introduce pdev->non_mappable_bars and replace VFIO_PCI_MMAP
The ability to map PCI resources to user-space is controlled by global
defines. For vfio there is VFIO_PCI_MMAP which is only disabled on s390 and
controls mapping of PCI resources using vfio-pci with a fallback option via
the pread()/pwrite() interface.

For the PCI core there is ARCH_GENERIC_PCI_MMAP_RESOURCE which enables a
generic implementation for mapping PCI resources plus the newer sysfs
interface. Then there is HAVE_PCI_MMAP which can be used with custom
definitions of pci_mmap_resource_range() and the historical /proc/bus/pci
interface. Both mechanisms are all or nothing.

For s390 mapping PCI resources is possible and useful for testing and
certain applications such as QEMU's vfio-pci based user-space NVMe driver.
For certain devices, however access to PCI resources via mappings to
user-space is not possible and these must be excluded from the general PCI
resource mapping mechanisms.

Introduce pdev->non_mappable_bars to indicate that a PCI device's BARs can
not be accessed via mappings to user-space. In the future this enables
per-device restrictions of PCI resource mapping.

For now, set this flag for all PCI devices on s390 in line with the
existing, general disable of PCI resource mapping. As s390 is the only user
of the VFI_PCI_MMAP Kconfig options this can already be replaced with a
check of this new flag. Also add similar checks in the other code protected
by HAVE_PCI_MMAP respectively ARCH_GENERIC_PCI_MMAP in preparation for
enabling these for supported devices.

Link: https://lore.kernel.org/lkml/20250212132808.08dcf03c.alex.williamson@redhat.com/
Link: https://lore.kernel.org/r/20250226-vfio_pci_mmap-v7-2-c5c0f1d26efd@linux.ibm.com
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-21 14:54:16 -05:00
Alex Williamson
860be250fc vfio/pci: Handle INTx IRQ_NOTCONNECTED
Some systems report INTx as not routed by setting pdev->irq to
IRQ_NOTCONNECTED, resulting in a -ENOTCONN error when trying to
setup eventfd signaling.  Include this in the set of conditions
for which the PIN register is virtualized to zero.

Additionally consolidate vfio_pci_get_irq_count() to use this
virtualized value in reporting INTx support via ioctl and sanity
checking ioctl paths since pdev->irq is re-used when the device
is in MSI mode.

The combination of these results in both the config space of the
device and the ioctl interface behaving as if the device does not
support INTx.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250311230623.1264283-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-03-17 15:15:17 -06:00
Yishai Hadas
384a530111 vfio/virtio: Enable support for virtio-block live migration
With a functional and tested backend for virtio-block live migration,
add the virtio-block device ID to the pci_device_id table.

Currently, the driver supports legacy IO functionality only for
virtio-net, and it is accounted for in specific parts of the code.

To enforce this limitation, an explicit check for virtio-net, has been
added in virtiovf_support_legacy_io(). Once a backend implements legacy
IO functionality for virtio-block, the necessary support will be added
to the driver, and this additional check should be removed.

The module description was updated accordingly.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20250302162723.82578-1-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-03-10 14:42:50 -06:00
Alex Williamson
0fd06844de vfio/type1: Use mapping page mask for pfnmaps
vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and
pmd mappings for well aligned mappings.  follow_pfnmap_start() walks the
page table and therefore knows the page mask of the level where the
address is found and returns this through follow_pfnmap_args.addr_mask.
Subsequent pfns from this address until the end of the mapping page are
necessarily consecutive.  Use this information to retrieve a range of
pfnmap pfns in a single pass.

With optimal mappings and alignment on systems with 1GB pud and 4KB
page size, this reduces iterations for DMA mapping PCI BARs by a
factor of 256K.  In real world testing, the overhead of iterating
pfns for a VM DMA mapping a 32GB PCI BAR is reduced from ~1s to
sub-millisecond overhead.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250218222209.1382449-7-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-02-27 11:55:54 -07:00
Alex Williamson
0635559233 vfio/type1: Use consistent types for page counts
Page count should more consistently be an unsigned long when passed as
an argument while functions returning a number of pages should use a
signed long to allow for -errno.

vaddr_get_pfns() can therefore be upgraded to return long, though in
practice it's currently limited by the batch capacity.  In fact, the
batch indexes are noted to never hold negative values, so while it
doesn't make sense to bloat the structure with unsigned longs in this
case, it does make sense to specify these as unsigned.

No change in behavior expected.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250218222209.1382449-5-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-02-27 11:55:54 -07:00
Alex Williamson
eb996eec78 vfio/type1: Use vfio_batch for vaddr_get_pfns()
Passing the vfio_batch to vaddr_get_pfns() allows for greater
distinction between page backed pfns and pfnmaps.  In the case of page
backed pfns, vfio_batch.size is set to a positive value matching the
number of pages filled in vfio_batch.pages.  For a pfnmap,
vfio_batch.size remains zero as vfio_batch.pages are not used.  In both
cases the return value continues to indicate the number of pfns and the
provided pfn arg is set to the initial pfn value.

This allows us to shortcut the pfnmap case, which is detected by the
zero vfio_batch.size.  pfnmaps do not contribute to locked memory
accounting, therefore we can update counters and continue directly,
which also enables a future where vaddr_get_pfns() can return a value
greater than one for consecutive pfnmaps.

NB. Now that we're not guessing whether the initial pfn is page backed
or pfnmap, we no longer need to special case the put_pfn() and batch
size reset.  It's safe for vfio_batch_unpin() to handle this case.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Link: https://lore.kernel.org/r/20250218222209.1382449-4-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-02-27 11:55:54 -07:00
Alex Williamson
7a701e90fc vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
This is a step towards passing the structure to vaddr_get_pfns()
directly in order to provide greater distinction between page backed
pfns and pfnmaps.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Link: https://lore.kernel.org/r/20250218222209.1382449-3-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-02-27 11:55:49 -07:00
Alex Williamson
afe84f3b7a vfio/type1: Catch zero from pin_user_pages_remote()
pin_user_pages_remote() can currently return zero for invalid args
or zero nr_pages, neither of which should ever happen.  However
vaddr_get_pfns() indicates it should only ever return a positive
value or -errno and there's a theoretical case where this can slip
through and be unhandled by callers.  Therefore convert zero to
-EFAULT.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Tested-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250218222209.1382449-2-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-02-27 11:55:39 -07:00
Tomita Moeko
41112160ca vfio/pci: match IGD devices in display controller class
IGD device can either expose as a VGA controller or display controller
depending on whether it is configured as the primary display device in
BIOS. In both cases, the OpRegion may be present. A new helper function
vfio_pci_is_intel_display() is introduced to check if the device might
be an IGD device.

Signed-off-by: Tomita Moeko <tomitamoeko@gmail.com>
Link: https://lore.kernel.org/r/20250123163416.7653-1-tomitamoeko@gmail.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-02-27 11:30:07 -07:00
Al Viro
f9835fa147
make use of anon_inode_getfile_fmode()
["fallen through the cracks" misc stuff]

A bunch of anon_inode_getfile() callers follow it with adjusting
->f_mode; we have a helper doing that now, so let's make use
of it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250118014434.GT1977892@ZenIV
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21 10:25:31 +01:00
Linus Torvalds
3673f5be0e VFIO updates for v6.14-rc1
- Extend vfio-pci 8-byte read/write support to include archs defining
    CONFIG_GENERIC_IOMAP, such as x86, and remove now extraneous #ifdefs
    around 64-bit accessors. (Ramesh Thomas)
 
  - Update vfio-pci shadow ROM handling and allow cached ROM from setup
    data to be exposed as a functional ROM BAR region when available.
    (Yunxiang Li)
 
  - Update nvgrace-gpu vfio-pci variant driver for new Grace Blackwell
    hardware, conditionalizing the uncached BAR workaround for previous
    generation hardware based on the presence of a flag in a new DVSEC
    capability, and include a delay during probe for link training to
    complete, a new requirement for GB devices. (Ankit Agrawal)
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmeZNf0bHGFsZXgud2ls
 bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiZDIP/16zZ6FwPxu52erigZCN
 6GwniPU051kDEKMhlO9aGSJNXkzB3qGYUXH1rExIYazRrovc9QYRG1JXBEGIj+Dj
 NkxhwJBvaj61dc1p+lfNH/jZYipE+mbuguz1gOZfwMEA/8uNmsFd2uhXBZBehI2X
 WCQQdxwz9GWB34pabN2OuR3cFwYbYD06kg/WfxsAEisDhRsjrPn//fbwZNBM7mLy
 4gmatoQ97uPexo+SACQSIIop7TlNeiA+Mo8i/XxTpmjry9Wl0tMNBKfNaxMf7o5p
 4laRU6EyT0/Cimc1w8Mct96fvO1AqKIRnBqFYwxzmtYthllpKPqnoZlwOPSE24f7
 zbB46NkhdE6JOsqJUMPj+hdW3bBhQgcpIMU3MkYgbzNVjcb5DcIDZk3b64DIJOqK
 HzItxvUNXVo9HYnc1gdI88c2btDA1hDOzH5fFX85AmQcUqs24+i7ekdEyws65J0O
 iVBJP/cC51vAJv0y4gtty+bq1OqcQ3jwnEvre52F9LPJVHFsKA8RheOyodlG0Ar8
 m1zWJVZbQIFbs8gp+q/GHdltQ9w0XvECQOe1EE7zxAQX0noks+3S+Ea+wAYYZH5p
 a1fbep0MoL3fZF+s4a7kc/avcm1WRpQTSY10HC3K/+0wQ5S0B8n/QG4S9mMsEEBn
 G/9+7ELvYrFop/CfC4Mkj0MA
 =NJZ8
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.14-rc1' of https://github.com/awilliam/linux-vfio

Pull vfio updates from Alex Williamson:

 - Extend vfio-pci 8-byte read/write support to include archs defining
   CONFIG_GENERIC_IOMAP, such as x86, and remove now extraneous #ifdefs
   around 64-bit accessors (Ramesh Thomas)

 - Update vfio-pci shadow ROM handling and allow cached ROM from setup
   data to be exposed as a functional ROM BAR region when available
   (Yunxiang Li)

 - Update nvgrace-gpu vfio-pci variant driver for new Grace Blackwell
   hardware, conditionalizing the uncached BAR workaround for previous
   generation hardware based on the presence of a flag in a new DVSEC
   capability, and include a delay during probe for link training to
   complete, a new requirement for GB devices (Ankit Agrawal)

* tag 'vfio-v6.14-rc1' of https://github.com/awilliam/linux-vfio:
  vfio/nvgrace-gpu: Add GB200 SKU to the devid table
  vfio/nvgrace-gpu: Check the HBM training and C2C link status
  vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
  vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem
  vfio/platform: check the bounds of read/write syscalls
  vfio/pci: Expose setup ROM at ROM bar when needed
  vfio/pci: Remove shadow ROM specific code paths
  vfio/pci: Remove #ifdef iowrite64 and #ifdef ioread64
  vfio/pci: Enable iowrite64 and ioread64 for vfio pci
2025-01-28 14:16:46 -08:00
Linus Torvalds
2ab002c755 Driver core and debugfs updates
Here is the big set of driver core and debugfs updates for 6.14-rc1.
 It's coming late in the merge cycle as there are a number of merge
 conflicts with your tree now, and I wanted to make sure they were
 working properly.  To resolve them, look in linux-next, and I will send
 the "fixup" patch as a response to the pull request.
 
 Included in here is a bunch of driver core, PCI, OF, and platform rust
 bindings (all acked by the different subsystem maintainers), hence the
 merge conflict with the rust tree, and some driver core api updates to
 mark things as const, which will also require some fixups due to new
 stuff coming in through other trees in this merge window.
 
 There are also a bunch of debugfs updates from Al, and there is at least
 one user that does have a regression with these, but Al is working on
 tracking down the fix for it.  In my use (and everyone else's linux-next
 use), it does not seem like a big issue at the moment.
 
 Here's a short list of the things in here:
   - driver core bindings for PCI, platform, OF, and some i/o functions.
     We are almost at the "write a real driver in rust" stage now,
     depending on what you want to do.
   - misc device rust bindings and a sample driver to show how to use
     them
   - debugfs cleanups in the fs as well as the users of the fs api for
     places where drivers got it wrong or were unnecessarily doing things
     in complex ways.
   - driver core const work, making more of the api take const * for
     different parameters to make the rust bindings easier overall.
   - other small fixes and updates
 
 All of these have been in linux-next with all of the aforementioned
 merge conflicts, and the one debugfs issue, which looks to be resolved
 "soon".
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ5koPA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymFHACfT5acDKf2Bov2Lc/5u3vBW/R6ChsAnj+LmgVI
 hcDSPodj4szR40RRnzBd
 =u5Ey
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core and debugfs updates from Greg KH:
 "Here is the big set of driver core and debugfs updates for 6.14-rc1.

  Included in here is a bunch of driver core, PCI, OF, and platform rust
  bindings (all acked by the different subsystem maintainers), hence the
  merge conflict with the rust tree, and some driver core api updates to
  mark things as const, which will also require some fixups due to new
  stuff coming in through other trees in this merge window.

  There are also a bunch of debugfs updates from Al, and there is at
  least one user that does have a regression with these, but Al is
  working on tracking down the fix for it. In my use (and everyone
  else's linux-next use), it does not seem like a big issue at the
  moment.

  Here's a short list of the things in here:

   - driver core rust bindings for PCI, platform, OF, and some i/o
     functions.

     We are almost at the "write a real driver in rust" stage now,
     depending on what you want to do.

   - misc device rust bindings and a sample driver to show how to use
     them

   - debugfs cleanups in the fs as well as the users of the fs api for
     places where drivers got it wrong or were unnecessarily doing
     things in complex ways.

   - driver core const work, making more of the api take const * for
     different parameters to make the rust bindings easier overall.

   - other small fixes and updates

  All of these have been in linux-next with all of the aforementioned
  merge conflicts, and the one debugfs issue, which looks to be resolved
  "soon""

* tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
  rust: device: Use as_char_ptr() to avoid explicit cast
  rust: device: Replace CString with CStr in property_present()
  devcoredump: Constify 'struct bin_attribute'
  devcoredump: Define 'struct bin_attribute' through macro
  rust: device: Add property_present()
  saner replacement for debugfs_rename()
  orangefs-debugfs: don't mess with ->d_name
  octeontx2: don't mess with ->d_parent or ->d_parent->d_name
  arm_scmi: don't mess with ->d_parent->d_name
  slub: don't mess with ->d_name
  sof-client-ipc-flood-test: don't mess with ->d_name
  qat: don't mess with ->d_name
  xhci: don't mess with ->d_iname
  mtu3: don't mess wiht ->d_iname
  greybus/camera - stop messing with ->d_iname
  mediatek: stop messing with ->d_iname
  netdevsim: don't embed file_operations into your structs
  b43legacy: make use of debugfs_get_aux()
  b43: stop embedding struct file_operations into their objects
  carl9170: stop embedding file_operations into their objects
  ...
2025-01-28 12:25:12 -08:00
Ankit Agrawal
2bb447540e vfio/nvgrace-gpu: Add GB200 SKU to the devid table
NVIDIA is productizing the new Grace Blackwell superchip
SKU bearing device ID 0x2941.

Add the SKU devid to nvgrace_gpu_vfio_pci_table.

CC: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20250124183102.3976-5-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27 09:43:33 -07:00
Ankit Agrawal
d85f69d520 vfio/nvgrace-gpu: Check the HBM training and C2C link status
In contrast to Grace Hopper systems, the HBM training has been moved
out of the UEFI on the Grace Blackwell systems. This reduces the system
bootup time significantly.

The onus of checking whether the HBM training has completed thus falls
on the module.

The HBM training status can be determined from a BAR0 register.
Similarly, another BAR0 register exposes the status of the CPU-GPU
chip-to-chip (C2C) cache coherent interconnect.

Based on testing, 30s is determined to be sufficient to ensure
initialization completion on all the Grace based systems. Thus poll
these register and check for 30s. If the HBM training is not complete
or if the C2C link is not ready, fail the probe.

While the time is not required on Grace Hopper systems, it is
beneficial to make the check to ensure the device is in an
expected state. Hence keeping it generalized to both the generations.

Ensure that the BAR0 is enabled before accessing the registers.

CC: Alex Williamson <alex.williamson@redhat.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20250124183102.3976-4-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27 09:43:33 -07:00
Ankit Agrawal
6a9eb2d125 vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
There is a HW defect on Grace Hopper (GH) to support the
Multi-Instance GPU (MIG) feature [1] that necessiated the presence
of a 1G region carved out from the device memory and mapped as
uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3)
to workaround the issue.

The Grace Blackwell systems (GB) differ from GH systems in the following
aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the
GPUdirect RDMA feature [2].

This patch accommodate those GB changes by showing the 64b physical
device BAR1 (region2 and 3) to the VM instead of the fake one. This
takes care of both the differences.

Moreover, the entire device memory is exposed on GB as cacheable to
the VM as there is no carveout required.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20250124183102.3976-3-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27 09:43:33 -07:00
Ankit Agrawal
bd53764a60 vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem
NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation with the Grace Hopper (GH) superchip that provides a
cache coherent access to CPU and GPU to each other's memory with
an internal proprietary chip-to-chip cache coherent interconnect.

There is a HW defect on GH systems to support the Multi-Instance
GPU (MIG) feature [1] that necessiated the presence of a 1G region
with uncached mapping carved out from the device memory. The 1G
region is shown as a fake BAR (comprising region 2 and 3) to
workaround the issue. This is fixed on the GB systems.

The presence of the fix for the HW defect is communicated by the
device firmware through the DVSEC PCI config register with ID 3.
The module reads this to take a different codepath on GB vs GH.

Scan through the DVSEC registers to identify the correct one and use
it to determine the presence of the fix. Save the value in the device's
nvgrace_gpu_pci_core_device structure.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]

CC: Jason Gunthorpe <jgg@nvidia.com>
CC: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20250124183102.3976-2-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-27 09:43:33 -07:00
Linus Torvalds
9c5968db9e The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
 
 - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
   page allocator so we end up with the ability to allocate and free
   zero-refcount pages.  So that callers (ie, slab) can avoid a refcount
   inc & dec.
 
 - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
   large folios other than PMD-sized ones.
 
 - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
   fixes for this small built-in kernel selftest.
 
 - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
   the mapletree code.
 
 - "mm: fix format issues and param types" from Keren Sun implements a
   few minor code cleanups.
 
 - "simplify split calculation" from Wei Yang provides a few fixes and a
   test for the mapletree code.
 
 - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
   continues the work of moving vma-related code into the (relatively) new
   mm/vma.c.
 
 - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
   Hildenbrand cleans up and rationalizes handling of gfp flags in the page
   allocator.
 
 - "readahead: Reintroduce fix for improper RA window sizing" from Jan
   Kara is a second attempt at fixing a readahead window sizing issue.  It
   should reduce the amount of unnecessary reading.
 
 - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
   addresses an issue where "huge" amounts of pte pagetables are
   accumulated
   (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
   Qi's series addresses this windup by synchronously freeing PTE memory
   within the context of madvise(MADV_DONTNEED).
 
 - "selftest/mm: Remove warnings found by adding compiler flags" from
   Muhammad Usama Anjum fixes some build warnings in the selftests code
   when optional compiler warnings are enabled.
 
 - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
   Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
 
 - "pkeys kselftests improvements" from Kevin Brodsky implements various
   fixes and cleanups in the MM selftests code, mainly pertaining to the
   pkeys tests.
 
 - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
   estimate application working set size.
 
 - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
   provides some cleanups to memcg's hugetlb charging logic.
 
 - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
   removes the global swap cgroup lock.  A speedup of 10% for a tmpfs-based
   kernel build was demonstrated.
 
 - "zram: split page type read/write handling" from Sergey Senozhatsky
   has several fixes and cleaups for zram in the area of zram_write_page().
   A watchdog softlockup warning was eliminated.
 
 - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
   cleans up the pagetable destructor implementations.  A rare
   use-after-free race is fixed.
 
 - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
   simplifies and cleans up the debugging code in the VMA merging logic.
 
 - "Account page tables at all levels" from Kevin Brodsky cleans up and
   regularizes the pagetable ctor/dtor handling.  This results in
   improvements in accounting accuracy.
 
 - "mm/damon: replace most damon_callback usages in sysfs with new core
   functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
   file interface logic.
 
 - "mm/damon: enable page level properties based monitoring" from
   SeongJae Park increases the amount of information which is presented in
   response to DAMOS actions.
 
 - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
   DAMON's long-deprecated debugfs interfaces.  Thus the migration to sysfs
   is completed.
 
 - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
   Xu cleans up and generalizes the hugetlb reservation accounting.
 
 - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
   removes a never-used feature of the alloc_pages_bulk() interface.
 
 - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
   extends DAMOS filters to support not only exclusion (rejecting), but
   also inclusion (allowing) behavior.
 
 - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
   "introduces a new memory descriptor for zswap.zpool that currently
   overlaps with struct page for now.  This is part of the effort to reduce
   the size of struct page and to enable dynamic allocation of memory
   descriptors."
 
 - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
   simplifies the swap allocator locking.  A speedup of 400% was
   demonstrated for one workload.  As was a 35% reduction for kernel build
   time with swap-on-zram.
 
 - "mm: update mips to use do_mmap(), make mmap_region() internal" from
   Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
   mmap_region() can be made MM-internal.
 
 - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
   regressions and otherwise improves MGLRU performance.
 
 - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
   updates DAMON documentation.
 
 - "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
 
 - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
   provides various cleanups in the areas of hugetlb folios, THP folios and
   migration.
 
 - "Uncached buffered IO" from Jens Axboe implements the new
   RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
   reading and writing.  To permite userspace to address issues with
   massive buildup of useless pagecache when reading/writing fast devices.
 
 - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
   Weißschuh fixes and optimizes some of the MM selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
 jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
 uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
 =Ib2s
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "The various patchsets are summarized below. Plus of course many
  indivudual patches which are described in their changelogs.

   - "Allocate and free frozen pages" from Matthew Wilcox reorganizes
     the page allocator so we end up with the ability to allocate and
     free zero-refcount pages. So that callers (ie, slab) can avoid a
     refcount inc & dec

   - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
     use large folios other than PMD-sized ones

   - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
     and fixes for this small built-in kernel selftest

   - "mas_anode_descend() related cleanup" from Wei Yang tidies up part
     of the mapletree code

   - "mm: fix format issues and param types" from Keren Sun implements a
     few minor code cleanups

   - "simplify split calculation" from Wei Yang provides a few fixes and
     a test for the mapletree code

   - "mm/vma: make more mmap logic userland testable" from Lorenzo
     Stoakes continues the work of moving vma-related code into the
     (relatively) new mm/vma.c

   - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
     Hildenbrand cleans up and rationalizes handling of gfp flags in the
     page allocator

   - "readahead: Reintroduce fix for improper RA window sizing" from Jan
     Kara is a second attempt at fixing a readahead window sizing issue.
     It should reduce the amount of unnecessary reading

   - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
     addresses an issue where "huge" amounts of pte pagetables are
     accumulated:

       https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

     Qi's series addresses this windup by synchronously freeing PTE
     memory within the context of madvise(MADV_DONTNEED)

   - "selftest/mm: Remove warnings found by adding compiler flags" from
     Muhammad Usama Anjum fixes some build warnings in the selftests
     code when optional compiler warnings are enabled

   - "mm: don't use __GFP_HARDWALL when migrating remote pages" from
     David Hildenbrand tightens the allocator's observance of
     __GFP_HARDWALL

   - "pkeys kselftests improvements" from Kevin Brodsky implements
     various fixes and cleanups in the MM selftests code, mainly
     pertaining to the pkeys tests

   - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
     estimate application working set size

   - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
     provides some cleanups to memcg's hugetlb charging logic

   - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
     removes the global swap cgroup lock. A speedup of 10% for a
     tmpfs-based kernel build was demonstrated

   - "zram: split page type read/write handling" from Sergey Senozhatsky
     has several fixes and cleaups for zram in the area of
     zram_write_page(). A watchdog softlockup warning was eliminated

   - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
     Brodsky cleans up the pagetable destructor implementations. A rare
     use-after-free race is fixed

   - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
     simplifies and cleans up the debugging code in the VMA merging
     logic

   - "Account page tables at all levels" from Kevin Brodsky cleans up
     and regularizes the pagetable ctor/dtor handling. This results in
     improvements in accounting accuracy

   - "mm/damon: replace most damon_callback usages in sysfs with new
     core functions" from SeongJae Park cleans up and generalizes
     DAMON's sysfs file interface logic

   - "mm/damon: enable page level properties based monitoring" from
     SeongJae Park increases the amount of information which is
     presented in response to DAMOS actions

   - "mm/damon: remove DAMON debugfs interface" from SeongJae Park
     removes DAMON's long-deprecated debugfs interfaces. Thus the
     migration to sysfs is completed

   - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
     Peter Xu cleans up and generalizes the hugetlb reservation
     accounting

   - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
     removes a never-used feature of the alloc_pages_bulk() interface

   - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
     extends DAMOS filters to support not only exclusion (rejecting),
     but also inclusion (allowing) behavior

   - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
     introduces a new memory descriptor for zswap.zpool that currently
     overlaps with struct page for now. This is part of the effort to
     reduce the size of struct page and to enable dynamic allocation of
     memory descriptors

   - "mm, swap: rework of swap allocator locks" from Kairui Song redoes
     and simplifies the swap allocator locking. A speedup of 400% was
     demonstrated for one workload. As was a 35% reduction for kernel
     build time with swap-on-zram

   - "mm: update mips to use do_mmap(), make mmap_region() internal"
     from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
     mmap_region() can be made MM-internal

   - "mm/mglru: performance optimizations" from Yu Zhao fixes a few
     MGLRU regressions and otherwise improves MGLRU performance

   - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
     Park updates DAMON documentation

   - "Cleanup for memfd_create()" from Isaac Manjarres does that thing

   - "mm: hugetlb+THP folio and migration cleanups" from David
     Hildenbrand provides various cleanups in the areas of hugetlb
     folios, THP folios and migration

   - "Uncached buffered IO" from Jens Axboe implements the new
     RWF_DONTCACHE flag which provides synchronous dropbehind for
     pagecache reading and writing. To permite userspace to address
     issues with massive buildup of useless pagecache when
     reading/writing fast devices

   - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
     Weißschuh fixes and optimizes some of the MM selftests"

* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm/compaction: fix UBSAN shift-out-of-bounds warning
  s390/mm: add missing ctor/dtor on page table upgrade
  kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
  tools: add VM_WARN_ON_VMG definition
  mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
  seqlock: add missing parameter documentation for raw_seqcount_try_begin()
  mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
  mm/page_alloc: remove the incorrect and misleading comment
  zram: remove zcomp_stream_put() from write_incompressible_page()
  mm: separate move/undo parts from migrate_pages_batch()
  mm/kfence: use str_write_read() helper in get_access_type()
  selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
  kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
  selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
  selftests/mm: vm_util: split up /proc/self/smaps parsing
  selftests/mm: virtual_address_range: unmap chunks after validation
  selftests/mm: virtual_address_range: mmap() without PROT_WRITE
  selftests/memfd/memfd_test: fix possible NULL pointer dereference
  mm: add FGP_DONTCACHE folio creation flag
  mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
  ...
2025-01-26 18:36:23 -08:00
Luiz Capitulino
6bf9b5b40a mm: alloc_pages_bulk: rename API
The previous commit removed the page_list argument from
alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function.

Now that only the *_array() flavour of the API remains, we can do the
following renaming (along with the _noprof() ones):

  alloc_pages_bulk_array -> alloc_pages_bulk
  alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy
  alloc_pages_bulk_array_node -> alloc_pages_bulk_node

Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:31 -08:00
Linus Torvalds
647d69605c pci-v6.14-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmeTr8wUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vxrMw//TJXH+U6x5LhYvBPD/KZ20ecGHqaA
 eGXrbHAasYbU1CfW7HM0onR8NffOIGoYvQrtefjQAln0w6rTvyFO0xJKLP15vMfN
 hnj+y1WWtKwAkSpu10Cl9nTj8uYRNNSQeoy5kS+1diwuXdby/DlgQONO2APSe9zd
 KMPXJcqSfDJlM5zHrcqqtlxauO9KHInLCc/iutd85AKjvcjOoNHNeZE0pTC0C3gE
 sXYHDqJiS3zdEG6X6mWFo3OzI/Q/7NGlHJ2j0CQaObsgQ9yA7eWkez25ifwZcugc
 TPtjm8DhaDo9/zx0NV9c2dPauHRC6NYUjAflMPK7Aye/41BE1Ag5Ka+tMDgC2i/N
 TbfBxSeArhjnjY+eZwRhrJNNC58TtHTUs69TO7Dbmuwr7cp99MIEDAYI5V6LFAdk
 plKqn1h8FztW5QKRPCgmzy6KTE+WPytiGAGAQFxzIGYkV/QqyvFaVs8FIyOJUIFM
 aDSa6Xy5WLGxmPZ9hPapzEm4ws/HTRpFjNgi/d4rRG5RWMwAxZZa44s9eldhN1D/
 ZwEmF2rJ+U8S7Q+mXPHDlwcsHe5APCbiaTEp4X+e3LNe0i9oxhhaUWG6LrDDmTlQ
 tU5j5daHiBa0nTDL1lfaayJlYX/oJ+IYQrIYzGbnivZv4ZVdPnuWSsOsMOEiLhEt
 4QqCoanqf0mCn2A=
 =O62O
 -----END PGP SIGNATURE-----

Merge tag 'pci-v6.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci updates from Bjorn Helgaas:
 "Enumeration:

   - Batch sizing of multiple BARs while memory decoding is disabled
     instead of disabling/enabling decoding for each BAR individually;
     this optimizes virtualized environments where toggling decoding
     enable is expensive (Alex Williamson)

   - Add host bridge .enable_device() and .disable_device() hooks for
     bridges that need to configure things like Requester ID to StreamID
     mapping when enabling devices (Frank Li)

   - Extend struct pci_ecam_ops with .enable_device() and
     .disable_device() hooks so drivers that use pci_host_common_probe()
     instead of their own .probe() have a way to set the
     .enable_device() callbacks (Marc Zyngier)

   - Drop 'No bus range found' message so we don't complain when DTs
     don't specify the default 'bus-range = <0x00 0xff>' (Bjorn Helgaas)

   - Rename the drivers/pci/of_property.c struct of_pci_range to
     of_pci_range_entry to avoid confusion with the global of_pci_range
     in include/linux/of_address.h (Bjorn Helgaas)

  Driver binding:

   - Update resource request API documentation to encourage callers to
     supply a driver name when requesting resources (Philipp Stanner)

   - Export pci_intx_unmanaged() and pcim_intx() (always managed) so
     callers of pci_intx() (which is sometimes managed) can explicitly
     choose the one they need (Philipp Stanner)

   - Convert drivers from pci_intx() to always-managed pcim_intx() or
     never-managed pci_intx_unmanaged(): amd_sfh, ata (ahci, ata_piix,
     pata_rdc, sata_sil24, sata_sis, sata_uli, sata_vsc), bnx2x, bna,
     ntb, qtnfmac, rtsx, tifm_7xx1, vfio, xen-pciback (Philipp Stanner)

   - Remove pci_intx_unmanaged() since pci_intx() is now always
     unmanaged and pcim_intx() is always managed (Philipp Stanner)

  Error handling:

   - Unexport pcie_read_tlp_log() to encourage drivers to use PCI core
     logging rather than building their own (Ilpo Järvinen)

   - Move TLP Log handling to its own file (Ilpo Järvinen)

   - Store number of supported End-End TLP Prefixes always so we can
     read the correct number of DWORDs from the TLP Prefix Log (Ilpo
     Järvinen)

   - Read TLP Prefixes in addition to the Header Log in
     pcie_read_tlp_log() (Ilpo Järvinen)

   - Add pcie_print_tlp_log() to consolidate printing of TLP Header and
     Prefix Log (Ilpo Järvinen)

   - Quirk the Intel Raptor Lake-P PIO log size to accommodate vendor
     BIOSes that don't configure it correctly (Takashi Iwai)

  ASPM:

   - Save parent L1 PM Substates config so when we restore it along with
     an endpoint's config, the parent info isn't junk (Jian-Hong Pan)

  Power management:

   - Avoid D3 for Root Ports on TUXEDO Sirius Gen1 with old BIOS because
     the system can't wake up from suspend (Werner Sembach)

  Endpoint framework:

   - Destroy the EPC device in devm_pci_epc_destroy(), which previously
     didn't call devres_release() (Zijun Hu)

   - Finish virtual EP removal in pci_epf_remove_vepf(), which
     previously caused a subsequent pci_epf_add_vepf() to fail with
     -EBUSY (Zijun Hu)

   - Write BAR_MASK before iATU registers in pci_epc_set_bar() so we
     don't depend on the BAR_MASK reset value being larger than the
     requested BAR size (Niklas Cassel)

   - Prevent changing BAR size/flags in pci_epc_set_bar() to prevent
     reads from bypassing the iATU if we reduced the BAR size (Niklas
     Cassel)

   - Verify address alignment when programming iATU so we don't attempt
     to write bits that are read-only because of the BAR size, which
     could lead to directing accesses to the wrong address (Niklas
     Cassel)

   - Implement artpec6 pci_epc_features so we can rely on all drivers
     supporting it so we can use it in EPC core code (Niklas Cassel)

   - Check for BARs of fixed size to prevent endpoint drivers from
     trying to change their size (Niklas Cassel)

   - Verify that requested BAR size is a power of two when endpoint
     driver sets the BAR (Niklas Cassel)

  Endpoint framework tests:

   - Clear pci-epf-test dma_chan_rx, not dma_chan_tx, after freeing
     dma_chan_rx (Mohamed Khalfella)

   - Correct the DMA MEMCPY test so it doesn't fail if the Endpoint
     supports both DMA_PRIVATE and DMA_MEMCPY (Manivannan Sadhasivam)

   - Add pci-epf-test and pci_endpoint_test support for capabilities
     (Niklas Cassel)

   - Add Endpoint test for consecutive BARs (Niklas Cassel)

   - Remove redundant comparison from Endpoint BAR test because a > 1MB
     BAR can always be exactly covered by iterating with a 1MB buffer
     (Hans Zhang)

   - Move and convert PCI Endpoint tests from tools/pci to Kselftests
     (Manivannan Sadhasivam)

  Apple PCIe controller driver:

   - Convert StreamID mapping configuration from a bus notifier to the
     .enable_device() and .disable_device() callbacks (Marc Zyngier)

  Freescale i.MX6 PCIe controller driver:

   - Add Requester ID to StreamID mapping configuration when enabling
     devices (Frank Li)

   - Use DWC core suspend/resume functions for imx6 (Frank Li)

   - Add suspend/resume support for i.MX8MQ, i.MX8Q, and i.MX95 (Richard
     Zhu)

   - Add DT compatible string 'fsl,imx8q-pcie-ep' and driver support for
     i.MX8Q series (i.MX8QM, i.MX8QXP, and i.MX8DXL) Endpoints (Frank
     Li)

   - Add DT binding for optional i.MX95 Refclk and driver support to
     enable it if the platform hasn't enabled it (Richard Zhu)

   - Configure PHY based on controller being in Root Complex or Endpoint
     mode (Frank Li)

   - Rely on dbi2 and iATU base addresses from DT via
     dw_pcie_get_resources() instead of hardcoding them (Richard Zhu)

   - Deassert apps_reset in imx_pcie_deassert_core_reset() since it is
     asserted in imx_pcie_assert_core_reset() (Richard Zhu)

   - Add missing reference clock enable or disable logic for IMX6SX,
     IMX7D, IMX8MM (Richard Zhu)

   - Remove redundant imx7d_pcie_init_phy() since
     imx7d_pcie_enable_ref_clk() does the same thing (Richard Zhu)

  Freescale Layerscape PCIe controller driver:

   - Simplify by using syscon_regmap_lookup_by_phandle_args() instead
     of syscon_regmap_lookup_by_phandle() followed by
     of_property_read_u32_array() (Krzysztof Kozlowski)

  Marvell MVEBU PCIe controller driver:

   - Add MODULE_DEVICE_TABLE() to enable module autoloading (Liao Chen)

  MediaTek PCIe Gen3 controller driver:

   - Use clk_bulk_prepare_enable() instead of separate
     clk_bulk_prepare() and clk_bulk_enable() (Lorenzo Bianconi)

   - Rearrange reset assert/deassert so they're both done in the
     *_power_up() callbacks (Lorenzo Bianconi)

   - Document that Airoha EN7581 requires PHY init and power-on before
     PHY reset deassert, unlike other MediaTek Gen3 controllers (Lorenzo
     Bianconi)

   - Move Airoha EN7581 post-reset delay from the en7581 clock .enable()
     method to mtk_pcie_en7581_power_up() (Lorenzo Bianconi)

   - Sleep instead of delay during Airoha EN7581 power-up, since this is
     a non-atomic context (Lorenzo Bianconi)

   - Skip PERST# assertion on Airoha EN7581 during probe and
     suspend/resume to avoid a hardware defect (Lorenzo Bianconi)

   - Enable async probe to reduce system startup time (Douglas Anderson)

  Microchip PolarFlare PCIe controller driver:

   - Set up the inbound address translation based on whether the
     platform allows coherent or non-coherent DMA (Daire McNamara)

   - Update DT binding such that platforms are DMA-coherent by default
     and must specify 'dma-noncoherent' if needed (Conor Dooley)

  Mobiveil PCIe controller driver:

   - Convert mobiveil-pcie.txt to YAML and update 'interrupt-names'
     and 'reg-names' (Frank Li)

  Qualcomm PCIe controller driver:

   - Add DT SM8550 and SM8650 optional 'global' interrupt for link
     events (Neil Armstrong)

   - Add DT 'compatible' strings for IPQ5424 PCIe controller (Manikanta
     Mylavarapu)

   - If 'global' IRQ is supported for detection of Link Up events, tell
     DWC core not to wait for link up (Krishna chaitanya chundru)

  Renesas R-Car PCIe controller driver:

   - Avoid passing stack buffer as resource name (King Dix)

  Rockchip PCIe controller driver:

   - Simplify clock and reset handling by using bulk interfaces (Anand
     Moon)

   - Pass typed rockchip_pcie (not void) pointer to
     rockchip_pcie_disable_clocks() (Anand Moon)

   - Return -ENOMEM, not success, when pci_epc_mem_alloc_addr() fails
     (Dan Carpenter)

  Rockchip DesignWare PCIe controller driver:

   - Use dll_link_up IRQ to detect Link Up and enumerate devices so
     users don't have to manually rescan (Niklas Cassel)

   - Tell DWC core not to wait for link up since the 'sys' interrupt is
     required and detects Link Up events (Niklas Cassel)

  Synopsys DesignWare PCIe controller driver:

   - Don't wait for link up in DWC core if driver can detect Link Up
     event (Krishna chaitanya chundru)

   - Update ICC and OPP votes after Link Up events (Krishna chaitanya
     chundru)

   - Always stop link in dw_pcie_suspend_noirq(), which is required at
     least for i.MX8QM to re-establish link on resume (Richard Zhu)

   - Drop racy and unnecessary LTSSM state check before sending
     PME_TURN_OFF message in dw_pcie_suspend_noirq() (Richard Zhu)

   - Add struct of_pci_range.parent_bus_addr for devices that need their
     immediate parent bus address, not the CPU address, e.g., to program
     an internal Address Translation Unit (iATU) (Frank Li)

  TI DRA7xx PCIe controller driver:

   - Simplify by using syscon_regmap_lookup_by_phandle_args() instead of
     syscon_regmap_lookup_by_phandle() followed by
     of_parse_phandle_with_fixed_args() or of_property_read_u32_index()
     (Krzysztof Kozlowski)

  Xilinx Versal CPM PCIe controller driver:

   - Add DT binding and driver support for Xilinx Versal CPM5
     (Thippeswamy Havalige)

  MicroSemi Switchtec management driver:

   - Add Microchip PCI100X device IDs (Rakesh Babu Saladi)

  Miscellaneous:

   - Move reset related sysfs code from pci.c to pci-sysfs.c where other
     similar code lives (Ilpo Järvinen)

   - Simplify reset_method_store() memory management by using __free()
     instead of explicit kfree() cleanup (Ilpo Järvinen)

   - Constify struct bin_attribute for sysfs, VPD, P2PDMA, and the IBM
     ACPI hotplug driver (Thomas Weißschuh)

   - Remove redundant PCI_VSEC_HDR and PCI_VSEC_HDR_LEN_SHIFT (Dongdong
     Zhang)

   - Correct documentation of the 'config_acs=' kernel parameter
     (Akihiko Odaki)"

* tag 'pci-v6.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (111 commits)
  PCI: Batch BAR sizing operations
  dt-bindings: PCI: microchip,pcie-host: Allow dma-noncoherent
  PCI: microchip: Set inbound address translation for coherent or non-coherent mode
  Documentation: Fix pci=config_acs= example
  PCI: Remove redundant PCI_VSEC_HDR and PCI_VSEC_HDR_LEN_SHIFT
  PCI: Don't include 'pm_wakeup.h' directly
  selftests: pci_endpoint: Migrate to Kselftest framework
  selftests: Move PCI Endpoint tests from tools/pci to Kselftests
  misc: pci_endpoint_test: Fix IOCTL return value
  dt-bindings: PCI: qcom: Document the IPQ5424 PCIe controller
  dt-bindings: PCI: qcom,pcie-sm8550: Document 'global' interrupt
  dt-bindings: PCI: mobiveil: Convert mobiveil-pcie.txt to YAML
  PCI: switchtec: Add Microchip PCI100X device IDs
  misc: pci_endpoint_test: Remove redundant 'remainder' test
  misc: pci_endpoint_test: Add consecutive BAR test
  misc: pci_endpoint_test: Add support for capabilities
  PCI: endpoint: pci-epf-test: Add support for capabilities
  PCI: endpoint: pci-epf-test: Fix check for DMA MEMCPY test
  PCI: endpoint: pci-epf-test: Set dma_chan_rx pointer to NULL on error
  PCI: dwc: Simplify config resource lookup
  ...
2025-01-25 16:03:40 -08:00
Alex Williamson
ce9ff21ea8 vfio/platform: check the bounds of read/write syscalls
count and offset are passed from user space and not checked, only
offset is capped to 40 bits, which can be used to read/write out of
bounds of the device.

Fixes: 6e3f264560 (“vfio/platform: read and write support for the device fd”)
Cc: stable@vger.kernel.org
Reported-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-23 13:13:27 -07:00
Dongdong Zhang
188973e353 PCI: Remove redundant PCI_VSEC_HDR and PCI_VSEC_HDR_LEN_SHIFT
Remove duplicate macro PCI_VSEC_HDR and its related macro
PCI_VSEC_HDR_LEN_SHIFT from pci_regs.h to avoid redundancy and
inconsistencies. Update VFIO PCI code to use PCI_VNDR_HEADER and
PCI_VNDR_HEADER_LEN() for consistent naming and functionality.

These changes aim to streamline header handling while minimizing impact,
given the niche usage of these macros in userspace.

Link: https://lore.kernel.org/r/20241216013536.4487-1-zhangdongdong@eswincomputing.com
Signed-off-by: Dongdong Zhang <zhangdongdong@eswincomputing.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-21 17:29:11 -06:00
Greg Kroah-Hartman
dd19f4116e Merge 6.13-rc7 into driver-core-next
We need the debugfs / driver-core fixes in here as well for testing and
to build on top of.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-13 06:40:34 +01:00
Heiner Kallweit
827ed8b159 drivers: core: remove device_link argument from class_compat_[create|remove]_link
After 7e722083fc ("i2c: Remove I2C_COMPAT config symbol and related
code") there's no caller left passing a non-null device_link argument.
So remove this argument to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/db49131d-fd79-4f23-93f2-0ab541a345fa@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-10 15:42:20 +01:00
Yunxiang Li
e021e6cbfb vfio/pci: Expose setup ROM at ROM bar when needed
If ROM bar is missing for any reason, we can fallback to using pdev->rom
to expose the ROM content to the guest. This fixes some passthrough use
cases where the upstream bridge does not have enough address window.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Link: https://lore.kernel.org/r/20250102185013.15082-3-Yunxiang.Li@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-06 08:09:54 -07:00
Yunxiang Li
c5a8b5d740 vfio/pci: Remove shadow ROM specific code paths
After commit 0c0e0736ac ("PCI: Set ROM shadow location in arch code,
not in PCI core"), the shadow ROM works the same as regular ROM BARs so
these code paths are no longer needed.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Link: https://lore.kernel.org/r/20250102185013.15082-2-Yunxiang.Li@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-06 08:09:54 -07:00
Ramesh Thomas
b44a06bd28 vfio/pci: Remove #ifdef iowrite64 and #ifdef ioread64
Remove the #ifdef iowrite64 and #ifdef ioread64 checks around calls to
64 bit IO access. Since default implementations have been enabled, the
checks are not required.

Signed-off-by: Ramesh Thomas <ramesh.thomas@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241210131938.303500-3-ramesh.thomas@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-06 08:09:54 -07:00
Ramesh Thomas
2b938e3db3 vfio/pci: Enable iowrite64 and ioread64 for vfio pci
Definitions of ioread64 and iowrite64 macros in asm/io.h called by vfio
pci implementations are enclosed inside check for CONFIG_GENERIC_IOMAP.
They don't get defined if CONFIG_GENERIC_IOMAP is defined. Include
linux/io-64-nonatomic-lo-hi.h to define iowrite64 and ioread64 macros
when they are not defined. io-64-nonatomic-lo-hi.h maps the macros to
generic implementation in lib/iomap.c. The generic implementation does
64 bit rw if readq/writeq is defined for the architecture, otherwise it
would do 32 bit back to back rw.

Note that there are two versions of the generic implementation that
differs in the order the 32 bit words are written if 64 bit support is
not present. This is not the little/big endian ordering, which is
handled separately. This patch uses the lo followed by hi word ordering
which is consistent with current back to back implementation in the
vfio/pci code.

Signed-off-by: Ramesh Thomas <ramesh.thomas@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241210131938.303500-2-ramesh.thomas@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-06 08:09:53 -07:00
Alex Williamson
09dfc8a5f2 vfio/pci: Fallback huge faults for unaligned pfn
The PFN must also be aligned to the fault order to insert a huge
pfnmap.  Test the alignment and fallback when unaligned.

Fixes: f9e54c3a2f ("vfio/pci: implement huge_fault support")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219619
Reported-by: Athul Krishna <athul.krishna.kr@protonmail.com>
Reported-by: Precific <precification@posteo.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Precific <precification@posteo.de>
Link: https://lore.kernel.org/r/20250102183416.1841878-1-alex.williamson@redhat.com
Cc: stable@vger.kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-01-03 08:49:05 -07:00
Linus Torvalds
ec8e2d3889 VFIO fixes for v6.13-rc3
- Fix migration dirty page tracking support in the mlx5-vfio-pci
    variant driver in configurations where the system page size exceeds
    the device maximum message size, and anticipate device updates where
    the opposite may also be required. (Yishai Hadas)
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmdZ+IMbHGFsZXgud2ls
 bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsithwP/iYRj4FFMArTRxx8etEe
 QE7ZCm17Rd7eAWNdJx9Rr8XE34OEigzxJy4IKHd/EajmCmq/FvRvRM/Svho1fKp7
 3p4dzP6pGMC7fEf6OywrL1/51Lu8KU6Bkl7njyd81UEK6667fEPJNoifPlYXi3o9
 NPczBETYjIAwhz4b6vUzHZ8mX52cgz1+wq7L2yOnMWrrwVwkQaZDUQVJWE+x3bkz
 /NRSYzJqaL/UJzowGQM+YDnYPYIQs1lg81wecVvTM1Hu3Dseh/mIwpKG6r8HBpJw
 kmEFqYNAfeCnNSleJA5s9iGxvX0VODBESfiFJyXYH25Tofjd8TEttTrXIJoO7dif
 4Ryptc2J5cQXkIwTDgQdjwtNWlABA0qelssBUyVd+Lusf/GhhwRwUJK5uoe+faOq
 0EcHUEEDrgDPhCxvy+rNvnZtYRGCaYmYnFWgP61m08cbbvFr1x6exuXRxMuMag4p
 hN2uroHNGHuymAfmo/K6rB1u2c6y23rUmmr0wJ6qCx9XDGRbxnN/ZnJkvbKsfIGQ
 pCEq4wHm6xJ9BpqXhebRA+3dEU5pH7Q+gAmJ680hyRile2G3cpZL+J3qXhMqFD9r
 jaqrJ2droe+9wYEQiqBzQMaetm2yC9AlgWa3k//a9reVuI3tU3dI82HNcgw6S+mJ
 bAEtKLe3ohs+td9Yo963anrr
 =oI9e
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.13-rc3' of https://github.com/awilliam/linux-vfio

Pull vfio fix from Alex Williamson:

 - Fix migration dirty page tracking support in the mlx5-vfio-pci
   variant driver in configurations where the system page size exceeds
   the device maximum message size, and anticipate device updates where
   the opposite may also be required (Yishai Hadas)

* tag 'vfio-v6.13-rc3' of https://github.com/awilliam/linux-vfio:
  vfio/mlx5: Align the page tracking max message size with the device capability
2024-12-11 13:48:25 -08:00