2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

14881 Commits

Author SHA1 Message Date
Dr. David Alan Gilbert
0ee2399116 drm/amdgpu: Remove unused amdgpu_i2c functions
amdgpu_i2c_add and amdgpu_i2c_init were added in 2015's commit
d38ceaf99e ("drm/amdgpu: add core driver (v4)")
but never used.

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:36 -04:00
Dr. David Alan Gilbert
9d7a8bdb90 drm/amdgpu: Remove unused amdgpu_gfx_bit_to_me_queue
amdgpu_gfx_bit_to_me_queue has been unused since it was added in
commit 7470bfcf20 ("drm/amdgpu: add helper function for gfx queue/bitmap
transition")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:33 -04:00
Dr. David Alan Gilbert
1e10c12263 drm/amdgpu: Remove unused amdgpu_gmc_vram_cpu_pa
amdgpu_gmc_vram_cpu_pa has been unused since commit
087451f372 ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's.")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:29 -04:00
Dr. David Alan Gilbert
6e261ecbb2 drm/amdgpu: Remove unused amdgpu_atpx functions
amdgpu_atpx_dgpu_req_power_for_displays has been unused since
commit bdb1ccb080 ("drm/amdgpu: remove ATPX_DGPU_REQ_POWER_FOR_DISPLAYS
check when hotplug-in")

amdgpu_atpx_get_dhandle has been unused since commit
f9b7f3703f ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:24 -04:00
Dr. David Alan Gilbert
632aac6299 drm/amdgpu: Remove unused amdgpu_device_ip_is_idle
amdgpu_device_ip_is_idle is unused.
It was renamed from 'amdgpu_is_idle' which was originally added in
commit 5dbbb60ba6 ("drm/amdgpu: add IP helpers for wait_for_idle and is_idle")

but hasn't been used.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:19 -04:00
Lijo Lazar
1e4acf4d93 drm/amdgpu: Add reset on init handler for XGMI
In some cases, device needs to be reset before first use. Add handlers
for doing device reset during driver init sequence.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <feifxu@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:19 -04:00
Lijo Lazar
f501057aff drm/amdgpu: Add callback get xcp resource info
Add a callback interface to get the resource information of a partition
mode. Presently the information has number of resources and number of
entities sharing the resource.

Add the implementation for aquavanjaram SOCs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:18 -04:00
Lijo Lazar
1bc0b33915 drm/amd: Add helper to get partition config modes
Add helper to get supported/available partition config modes

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:18 -04:00
WangYuli
307b4ab7ba drm/amdgpu: Fix typo "acccess" and improve the comment style here
There are some spelling mistakes of 'acccess' in comments which
should be instead of 'access'.

And the comment style should be like this:
 /*
  * Text
  * Text
  */

Suggested-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/all/f75fbe30-528e-404f-97e4-854d27d7a401@amd.com/
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/all/0c768bf6-bc19-43de-a30b-ff5e3ddfd0b3@suse.de/
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:18 -04:00
Alex Deucher
b1281b6d55 drm/amdgpu/gfx9: Explicitly halt CP before init
Need to make sure it's halted as we don't know what state
the GPU may have been left in previously.

Reviewed-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:18 -04:00
Alex Deucher
993fcc40ae drm/amdgpu/gfx9: set additional bits on CP halt
Need to set the pipe reset and cache invalidation bits
on halt otherwise we can get stale state if the CP firmware
changes (e.g., on module unload and reload).

Reviewed-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:18 -04:00
Sunil Khatri
37b993225d drm/amdgpu: add amdgpu_device reference in ip block
To handle amdgpu_device reference for different GPUs
we add it's reference in each ip block which can be
used to differentiate between difference gpu devices.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:18 -04:00
Lijo Lazar
6e37ae8b08 drm/amdgpu: Separate reinitialization after reset
Move the reinitialization part after a reset to another function. No
functional changes.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:18 -04:00
Tim Huang
381ec8161d drm/amdgpu: check return for setting engine dram timings
This resolves the unchecded return value warning reported by Coverity.

Signed-off-by: Tim Huang <tim.huang@amd.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:18 -04:00
Lijo Lazar
5839d27d5b drm/amdgpu: Use init level for pending_reset flag
Drop pending_reset flag in gmc block. Instead use init level to
determine which type of init is preferred - in this case MINIMAL.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:18 -04:00
YiPeng Chai
9e0feb7946 amd/amdgpu: Reduce unnecessary repetitive GPU resets
In multiple GPUs case, after a GPU has started
resetting all GPUs on hive, other GPUs do not
need to trigger GPU reset again.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:18 -04:00
Lijo Lazar
14f2fe34f5 drm/amdgpu: Add init levels
Add init levels to define the level to which device needs to be
initialized.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:17 -04:00
Jane Jian
8a84d2a472 drm/amdgpu: Remove unneeded write in JPEG v4.0.3
HDP_DEBUG1(offset = 0x3fbc) is no longer functional, remove the redundant write.

Signed-off-by: Jane Jian <Jane.Jian@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:17 -04:00
Lijo Lazar
8c50bf9beb drm/amdgpu: Fix JPEG v4.0.3 register write
EXTERNAL_REG_INTERNAL_OFFSET/EXTERNAL_REG_WRITE_ADDR should be used in
pairs. If an external register shouldn't be written, both packets
shouldn't be sent.

Fixes: a78b481469 ("drm/amdgpu: Skip PCTL0_MMHUB_DEEPSLEEP_IB write in jpegv4.0.3 under SRIOV")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:17 -04:00
Feifei Xu
3eebfd5e9c drm/amdkfd:Add kfd function to config sq perfmon
Expose the interface for kfd to config sq perfmon.

Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:17 -04:00
Sathishkumar S
f0b19b84d3 drm/amdgpu: add amdgpu_jpeg_sched_mask debugfs
JPEG_4_0_3 has up to 32 jpeg cores and a single mjpeg video decode
will use all available cores on the hardware. This debugfs entry
helps to disable or enable job submission to a cluster of cores or
one specific core in the ip for debugging. The entry is populated
only if there is at least two or more cores in the jpeg ip.

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:17 -04:00
Feifei Xu
400a7591d9 drm/amdgpu: Add psp command CONFIG_SQ_PERFMON
Add support for enable/disable perfmon profiling.

Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:16 -04:00
Prike Liang
6704dbf719 drm/amdgpu: update suspend status for aborting from deeper suspend
There're some other suspend abort cases which can call the noirq
suspend except for executing _S3 method. In those cases need to
process as incomplete suspendsion.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:16 -04:00
Asad Kamal
dc443aa4ab drm/amd/amdgpu: Add helper to get ip block valid
Add helper function to check if ip block is enabled

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:16 -04:00
Jiadong Zhu
92c9b3e8e4 drm/amdgpu/sdma6: implement ring reset callback for sdma6
Implement sdma queue reset callback using mes_reset_queue_mmio.

v2: check instance id before reset queue.

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:16 -04:00
Jiadong Zhu
df190e6753 drm/amdgpu/sdma6: split out per instance resume function
Extract the resume sequence for individual sdma instance from sdma_v6_0_gfx_resume.
The function could be used for start/restart scenario on a certain instance.

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:16 -04:00
Jiadong Zhu
ced65debf4 drm/amdgpu/mes11: update mes_reset_queue function to support sdma queue
Reset sdma queue through mmio based on me_id and queue_id.

v2: simplify callflows and register calculation.

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:16 -04:00
Alex Deucher
34ad56a467 drm/amdgpu: bump driver version for cleared VRAM
Driver now clears VRAM on allocation.  Bump the
driver version so mesa knows when it will get
cleared vram by default.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.11.x
2024-09-26 17:04:47 -04:00
Alex Deucher
a8387ddc0d drm/amdgpu: fix vbios fetching for SR-IOV
SR-IOV fetches the vbios from VRAM in some cases.
Re-enable the VRAM path for dGPUs and rename the function
to make it clear that it is not IGP specific.

Fixes: 042658d17a ("drm/amdgpu: clean up vbios fetching code")
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Tested-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:04:10 -04:00
Frank Min
3cb576bc6d drm/amdgpu: fix PTE copy corruption for sdma 7
Without setting dcc bit, there is ramdon PTE copy corruption on sdma 7.

so add this bit and update the packet format accordingly.

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.11.x
2024-09-26 17:03:39 -04:00
Saleemkhan Jamadar
8048e5ade8 drm/amdgpu/vcn: enable AV1 on both instances
v1 - remove cs parse code (Christian)

On VCN v4_0_6 AV1 is supported on both the instances.
Remove cs IB parse code since explict handling of AV1 schedule is
not required.

Signed-off-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-09-25 12:56:19 -04:00
Mukul Joshi
e45b011d2c drm/amdkfd: Fix CU occupancy for GFX 9.4.3
Make CU occupancy calculations work on GFX 9.4.3 by
updating the logic to handle multiple XCCs correctly.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-25 12:56:07 -04:00
Mukul Joshi
6ae9e1aba9 drm/amdkfd: Update logic for CU occupancy calculations
Currently, the code uses the IH_VMID_X_LUT register to map
a queue's vmid to the corresponding PASID. This logic is racy
since CP can update the VMID-PASID mapping anytime especially
when there are more processes than number of vmids. Update the
logic to calculate CU occupancy by matching doorbell offset of
the queue with valid wave counts against the process's queues.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-25 12:56:00 -04:00
ZhenGuo Yin
e1d27f7a9c drm/amdgpu: skip coredump after job timeout in SRIOV
VF FLR will be triggered by host driver before job timeout,
hence the error status of GPU get cleared. Performing a
coredump here is unnecessary.

Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-25 12:55:52 -04:00
Christian König
126be9b2be drm/amdgpu: sync to KFD fences before clearing PTEs
This patch tries to solve the basic problem we also need to sync to
the KFD fences of the BO because otherwise it can be that we clear
PTEs while the KFD queues are still running.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-25 12:55:44 -04:00
Jack Xiao
4771d2ecb7 drm/amdgpu/mes12: set enable_level_process_quantum_check
enable_level_process_quantum_check is requried to enable process
quantum based scheduling.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.11.x
2024-09-25 12:55:14 -04:00
Alex Deucher
84f76408ab drm/amdgpu/mes12: reduce timeout
The firmware timeout is 2s.  Reduce the driver timeout to
2.1 seconds to avoid back pressure on queue submissions.

Fixes: 94b51a3d01 ("drm/amdgpu/mes12: increase mes submission timeout")
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.11.x
2024-09-18 16:15:13 -04:00
Alex Deucher
856265caa9 drm/amdgpu/mes11: reduce timeout
The firmware timeout is 2s.  Reduce the driver timeout to
2.1 seconds to avoid back pressure on queue submissions.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3627
Fixes: f7c161a4c2 ("drm/amdgpu: increase mes submission timeout")
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-09-18 16:15:13 -04:00
Christian König
6dcba0975d drm/amdgpu: use GEM references instead of TTMs v2
Instead of a TTM reference grab a GEM reference whenever necessary.

v2: fix typo in amdgpu_bo_unref pointed out by Vitaly,
    initialize the GEM funcs for kernel allocations as well.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:13 -04:00
Frank Min
7b6df1d732 drm/amdgpu: update golden regs for gfx12
update golden regs for gfx12

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.11.x
2024-09-18 16:15:09 -04:00
Alex Deucher
042658d17a drm/amdgpu: clean up vbios fetching code
After splitting the logic between APU and dGPU,
clean up some of the APU and dGPU specific logic
that no longer applied.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:09 -04:00
Alex Deucher
375b035f68 drm/amdgpu/bios: split vbios fetching between APU and dGPU
We need some different logic for dGPUs and the APU path
can be simplified because there are some methods which
are never used on APUs.  This also fixes a regression
on some older APUs causing the driver to fetch the
unpatched ROM image rather than the patched image.

Fixes: 9c081c11c6 ("drm/amdgpu: Reorder to read EFI exported ROM first")
Reviewed-by: George Zhang <George.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:09 -04:00
Christian König
f2be7b39e4 drm/amdgpu: remove amdgpu_pin_restricted()
We haven't used the functionality to pin BOs in a certain range at all
while the driver existed. Just nuke it.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:09 -04:00
Christian König
54b86443fd drm/amdgpu: explicitely set the AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS flag
Instead of having that in the amdgpu_bo_pin() function applied for all
pinned BOs.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:09 -04:00
Lijo Lazar
42ac749d5b drm/amdgpu: Fix XCP instance mask calculation
Fix instance mask calculation for VCN IP. There are cases where VCN
instance could be shared across partitions. Fix here so that other
blocks don't need to check for any shared instances based on partition
mode.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:09 -04:00
Asad Kamal
ef126c06a9 drm/amdgpu: Fix get each xcp macro
Fix get each xcp macro to loop over each partition correctly

Fixes: 4bdca20579 ("drm/amdgpu: Add utility functions for xcp")
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:08 -04:00
Le Ma
2778701b16 drm/amdgpu: load sos binary properly on the basis of pmfw version
To be compatible with legacy IFWI, driver needs to carry legacy tOS and
query pmfw version to load them accordingly.

Add psp_firmware_header_v2_1 to handle the combined sos binary.

Double the sos count limit for the case of aux sos fw packed.

v2: pass the correct fw_bin_desc to parse_sos_bin_descriptor

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:06 -04:00
Le Ma
2ae6cd583c drm/amdgpu: add psp funcs callback to check if aux fw is needed
Query pmfw version to determine if aux sos fw needs to be loaded
in psp v13.0.

v2: refine callback to check if aux_fw loading is needed instead of
    getting pmfw version barely
v3: return the comparison directly

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:06 -04:00
Christian König
7181faaa47 drm/amdgpu: nuke the VM PD/PT shadow handling
This was only used as workaround for recovering the page tables after
VRAM was lost and is no longer necessary after the function
amdgpu_vm_bo_reset_state_machine() started to do the same.

Compute never used shadows either, so the only proplematic case left is
SVM and that is most likely not recoverable in any way when VRAM is
lost.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:06 -04:00
Alex Deucher
c1de938fb7 drm/amdgpu/gfx9.4.3: Explicitly halt MEC before init
Need to make sure it's halted as we don't know what state
the GPU may have been left in previously.

Tested-by: Amber Lin <Amber.Lin@amd.com>
Acked-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:06 -04:00
Alex Deucher
797fb15333 drm/amdgpu/gfx9.4.3: set additional bits on MEC halt
Need to set the pipe reset and cache invalidation bits
on halt otherwise we can get stale state if the CP firmware
changes (e.g., on module unload and reload).

Tested-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:15:06 -04:00
David Belanger
03b5038c0a drm/amdgpu: Fix selfring initialization sequence on soc24
Move enable_doorbell_selfring_aperture from common_hw_init
to common_late_init in soc24, otherwise selfring aperture is
initialized with an incorrect doorbell aperture base.

Port changes from this commit from soc21 to soc24:
commit 1c312e816c ("drm/amdgpu: Enable doorbell selfring after resize FB BAR")

Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.11.x
2024-09-18 16:14:47 -04:00
Jack Xiao
3c75518cf2 drm/amdgpu/mes12: switch SET_SHADER_DEBUGGER pkt to mes schq pipe
The SET_SHADER_DEBUGGER packet must work with the added
hardware queue, switch the packet submitting to mes schq pipe.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.11.x
2024-09-18 16:14:27 -04:00
Andrew Kreimer
c400ec6990 drm/amdgpu: Fix a typo
Fix a typo in comments.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Kreimer <algonell@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:14:26 -04:00
Yan Zhen
0110ac1195 drm/amdgpu: fix typo in the comment
Correctly spelled comments make it easier for the reader to understand
the code.

Replace 'udpate' with 'update' in the comment &
replace 'recieved' with 'received' in the comment &
replace 'dsiable' with 'disable' in the comment &
replace 'Initiailize' with 'Initialize' in the comment &
replace 'disble' with 'disable' in the comment &
replace 'Disbale' with 'Disable' in the comment &
replace 'enogh' with 'enough' in the comment &
replace 'availabe' with 'available' in the comment.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:14:26 -04:00
Alex Deucher
bfc00a7754 drm/amdgpu/gfx9.4.3: drop extra wrapper
Drop wrapper used in one place.  gfx_v9_4_3_xcc_cp_enable()
is used in one place.  gfx_v9_4_3_xcc_cp_compute_enable()
is used everywhere else.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18 16:14:26 -04:00
Bob Zhou
28b0ef9227 drm/amdgpu: Fix missing check pcie_p2p module param
The module param pcie_p2p should be checked for kfd p2p feature, so add it.

Fixes: 75f0efbc4b ("drm/amdgpu: Take IOMMU remapping into account for p2p checks")
Signed-off-by: Bob Zhou <bob.zhou@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-17 10:04:57 -04:00
Tao Zhou
c389a0604c drm/amdgpu: disable GPU RAS bad page feature for specific ASIC
The feature is not applicable to specific app platform.

v2: update the disablement condition and commit description
v3: move the setting to amdgpu_ras_check_supported

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-17 10:04:57 -04:00
Tim Huang
0da531c82a drm/amdgpu: ensure the connector is not null before using it
This resolves the dereference null return value warning
reported by Coverity.

Signed-off-by: Tim Huang <tim.huang@amd.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-17 10:04:57 -04:00
Al Viro
4c3140fea6 drm/amdgpu: get rid of bogus includes of fdtable.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-10 13:44:30 -04:00
Al Viro
6c6ca71bc1 drm/amdgpu: fix a race in kfd_mem_export_dmabuf()
Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into
descriptor table, only to have it looked up by file descriptor and
remove it from descriptor table is not just too convoluted - it's
racy; another thread might have modified the descriptor table while
we'd been going through that song and dance.

Switch kfd_mem_export_dmabuf() to using drm_gem_prime_handle_to_dmabuf()
and leave the descriptor table alone...

Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-10 13:44:30 -04:00
Srinivasan Shanmugam
b8faa981a7 drm/amdgpu: Fix kdoc entry in 'amdgpu_vm_cpu_prepare'
This commit updates described non-existent parameters 'resv' and
'sync_mode', and failed to describe the existing 'sync' parameter.

Fixes the below with gcc W=1:
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c:50: warning: Function parameter or struct member 'sync' not described in 'amdgpu_vm_cpu_prepare'
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c:50: warning: Excess function parameter 'resv' description in 'amdgpu_vm_cpu_prepare'
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c:50: warning: Excess function parameter 'sync_mode' description in 'amdgpu_vm_cpu_prepare'

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-10 13:44:29 -04:00
David (Ming Qiang) Wu
3d5adbdf1d drm/amd/amdgpu: apply command submission parser for JPEG v1
Similar to jpeg_v2_dec_ring_parse_cs() but it has different
register ranges and a few other registers access.

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-10 13:44:29 -04:00
David (Ming Qiang) Wu
88dcad2d07 drm/amd/amdgpu: apply command submission parser for JPEG v2+
This patch extends the same cs parser from JPEG v4.0.3 to
other JPEG versions (v2 and above).

Rename to more common name as jpeg_v2_dec_ring_parse_cs()
from jpeg_v4_0_3_dec_ring_parse_cs().

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-10 13:44:29 -04:00
Jani Nikula
0df8ef6e1b drm/amdgpu: drop redundant W=1 warnings from Makefile
Since commit a61ddb4393 ("drm: enable (most) W=1 warnings by default
across the subsystem"), most of the extra warnings in the driver
Makefile are redundant. Remove them.

Note that -Wmissing-declarations and -Wmissing-prototypes are always
enabled by default in scripts/Makefile.extrawarn.

Reviewed-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:55:17 -04:00
Christian König
7ccde2e6c0 drm/amdgpu: revert "use CPU for page table update if SDMA is unavailable"
That is clearly not something we should do upstream. The SDMA is
mandatory for the driver to work correctly.

We could do this for emulation and bringup, but in those cases the
engineer should probably enabled CPU based updates manually.

This reverts commit 62eefd10ac.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:55:06 -04:00
Dan Carpenter
27f9dcb9cc drm/amdgpu/mes11: Indent an if statment
Indent the "break" statement one more tab.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:55:05 -04:00
Ramesh Errabolu
01be2b62c0 drm/amdgpu: Surface svm_default_granularity, a RW module parameter
Enables users to update SVM's default granularity, used in
buffer migration and handling of recoverable page faults.
Param value is set in terms of log(numPages(buffer)),
e.g. 9 for a 2 MIB buffer

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:55:05 -04:00
Jesse Zhang
e8397d327e drm/amdgpu: fix queue reset issue by mmio
Initialize the queue type before resetting the queue using mmio.

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:54:54 -04:00
Srinivasan Shanmugam
559a285816 drm/amdgpu: Replace 'amdgpu_job_submit_direct' with 'drm_sched_entity' in cleaner shader
This commit replaces the use of amdgpu_job_submit_direct which submits
the job to the ring directly, with drm_sched_entity in the cleaner
shader job submission process. The change allows the GPU scheduler to
manage the cleaner shader job.

- The job is then submitted to the GPU using the
  drm_sched_entity_push_job function, which allows the GPU scheduler to
  manage the job.

This change improves the reliability of the cleaner shader job
submission process by leveraging the capabilities of the GPU scheduler.

Fixes: d361ad5d2f ("drm/amdgpu: Add sysfs interface for running cleaner shader")
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:42:33 -04:00
Srinivasan Shanmugam
2578487ebe drm/amdgpu/: Add missing kdoc entry in amdgpu_vm_handle_fault function
This commit adds a description for the 'ts' parameter in the
amdgpu_vm_handle_fault function's comment block.

Fixes the below with gcc W=1:
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2781: warning: Function parameter or struct member 'ts' not described in 'amdgpu_vm_handle_fault'

Cc: Xiaogang.Chen <Xiaogang.Chen@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202408251419.vgZHg3GV-lkp@intel.com/
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Xiaogang Chen <Xiaogang.Chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:42:05 -04:00
Lijo Lazar
4481df364d drm/amdgpu: Normalize reg offsets on JPEG v4.0.3
On VFs and SOCs with GC 9.4.4, VCN RRMT is disabled.
Only local register offsets should be used on JPEG v4.0.3 as they cannot
handle remote access to other AIDs. Since only local offsets are used,
the special write to MCM_ADDR register is no longer needed.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:40:47 -04:00
Li Zetao
760e3c8b32 drm/amdgpu: use clamp() in amdgpu_vm_adjust_size()
When it needs to get a value within a certain interval, using clamp()
makes the code easier to understand than min(max()).

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:38:57 -04:00
Li Zetao
6fbbb660b1 drm/amd: use clamp() in amdgpu_pll_get_fb_ref_div()
When it needs to get a value within a certain interval, using clamp()
makes the code easier to understand than min(max()).

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:38:53 -04:00
Peng Liu
2c7795e245 drm/amdgpu: enable gfxoff quirk on HP 705G4
Enabling gfxoff quirk results in perfectly usable
graphical user interface on HP 705G4 DM with R5 2400G.

Without the quirk, X server is completely unusable as
every few seconds there is gpu reset due to ring gfx timeout.

Signed-off-by: Peng Liu <liupeng01@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:38:45 -04:00
Peng Liu
0126c0ae11 drm/amdgpu: add raven1 gfxoff quirk
Fix screen corruption with openkylin.

Link: https://bbs.openkylin.top/t/topic/171497
Signed-off-by: Peng Liu <liupeng01@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:38:33 -04:00
Lang Yu
4453808d9e drm/amdgpu: fix invalid fence handling in amdgpu_vm_tlb_flush
CPU based update doesn't produce a fence, handle such cases properly.

Fixes: d8a3f0a034 ("drm/amdgpu: implement TLB flush fence")
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:36:50 -04:00
Christian König
4da5a95bf1 drm/amdgpu: re-work VM syncing
Rework how VM operations synchronize to submissions. Provide an
amdgpu_sync container to the backends instead of an reservation
object and fill in the amdgpu_sync object in the higher layers
of the code.

No intended functional change, just prepares for upcomming changes.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Friedrich Vock <friedrich.vock@gmx.de>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:36:24 -04:00
Alex Deucher
ead60e9c4e drm/amdgpu/gfx10: use rlc safe mode for soft recovery
Protect the MMIO access with safe mode.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:40 -04:00
Alex Deucher
3f2d35c325 drm/amdgpu/gfx11: use rlc safe mode for soft recovery
Protect the MMIO access with safe mode.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:38 -04:00
Alex Deucher
21818f39be drm/amdgpu/gfx12: use rlc safe mode for soft recovery
Protect the MMIO access with safe mode.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:34 -04:00
Alex Deucher
f8eee864ba drm/amdgpu/gfx12: use proper rlc safe mode helpers
Rather than open coding it for the queue reset.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:31 -04:00
Alex Deucher
01d05521f7 drm/amdgpu/gfx11: use proper rlc safe mode helpers
Rather than open coding it for the queue reset.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:28 -04:00
Alex Deucher
bcee4c3f89 drm/amdgpu/gfx10: use proper rlc safe mode helpers
Rather than open coding it for the queue reset.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:24 -04:00
Alex Deucher
1a1995b1dc drm/amdgpu/gfx12: per queue reset only on bare metal
It's not supported under SR-IOV at the moment.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:22 -04:00
Alex Deucher
01163079e1 drm/amdgpu/gfx11: per queue reset only on bare metal
It's not supported under SR-IOV at the moment.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:20 -04:00
Alex Deucher
4d5ddfa4b1 drm/amdgpu/gfx10: per queue reset only on bare metal
It's not supported under SR-IOV at the moment.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:16 -04:00
Jiadong Zhu
178ad0e280 drm/amdgpu/mes11: implement mmio queue reset for gfx11
Implement queue reset for graphic and compute queue.

v2: use amdgpu_gfx_rlc funcs to enter/exit safe mode.
v3: use gfx_v11_0_request_gfx_index_mutex()
v4: fix mutex handling

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:13 -04:00
Jiadong Zhu
01b4ae38e5 drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
The reset_queue api could be used from kfd or kgd.

v2: add use_mmio parameter for mes_reset_legacy_queue.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:07 -04:00
Jiadong Zhu
8b2429a13f drm/amdgpu/mes: modify mes api for mmio queue reset
Add me/pipe/queue parameters for queue reset input.

v2: fix build (Alex)

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:03 -04:00
Alex Deucher
8fe4fde381 drm/amdgpu/gfx12: fallback to driver reset compute queue directly
Since the MES FW resets kernel compute queue always failed, this
may caused by the KIQ failed to process unmap KCQ. So, before MES
FW work properly that will fallback to driver executes dequeue and
resets SPI directly. Besides, rework the ring reset function and make
the busy ring type reset in each function respectively.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:01 -04:00
Alex Deucher
2480599890 drm/amdgpu/gfx12: add ring reset callbacks
Add ring reset callbacks for gfx and compute.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:58 -04:00
Alex Deucher
d1f2144321 drm/amdgpu/gfx10: rework reset sequence
To match other GFX IPs.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:55 -04:00
Jiadong Zhu
097af47d3c drm/amdgpu/gfx10: wait for reset done before remap
There is a racing condition that cp firmware modifies
MQD in reset sequence after driver updates it for
remapping. We have to wait till CP_HQD_ACTIVE becoming
false then remap the queue.

v2: fix KIQ locking (Alex)
v3: fix KIQ locking harder (Jessie)

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:47 -04:00
Jiadong Zhu
2f3806f781 drm/amdgpu/gfx10: remap queue after reset successfully
Kiq command unmap_queues only does the dequeueing action.
We have to map the queue back with clean mqd.

v2: fix up error handling (Alex)

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:44 -04:00
Alex Deucher
1741281a15 drm/amdgpu/gfx10: add ring reset callbacks
Add ring reset callbacks for gfx and compute.

v2: fix gfx handling
v3: wait for KIQ to complete

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:38 -04:00
Jiadong Zhu
a10c93931b drm/amdgpu/gfx11: wait for reset done before remap
There is a racing condition that cp firmware modifies
MQD in reset sequence after driver updates it for
remapping. We have to wait till CP_HQD_ACTIVE becoming
false then remap the queue.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:34 -04:00
Alex Deucher
7d8e9e65f2 drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
Rename to gfx_v11_0_kgq_init_queue() to better align with
the other naming in the file.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:31 -04:00
Prike Liang
072b441478 drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
Since the MES FW resets kernel compute queue always failed, this
may caused by the KIQ failed to process unmap KCQ. So, before MES
FW work properly that will fallback to driver executes dequeue and
resets SPI directly. Besides, rework the ring reset function and make
the busy ring type reset in each function respectively.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:21 -04:00
Alex Deucher
b3e9bfd866 drm/amdgpu/gfx11: add ring reset callbacks
Add ring reset callbacks for gfx and compute.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:35:12 -04:00
Prike Liang
ad17b124c3 drm/amdgpu/gfx9.4.3: Implement compute pipe reset
Implement the compute pipe reset, and the driver will
fallback to pipe reset when queue reset fails.
The pipe reset only deactivates the queue which is
scheduled in the pipe, and meanwhile the MEC pipe
will be reset to the firmware _start pointer. So,
it seems pipe reset will cost more cycles than the
queue reset; therefore, the driver tries to recover
by doing queue reset first.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:32:32 -04:00
Alex Deucher
6c0a7c3c69 drm/amdgpu: always allocate cleared VRAM for GEM allocations
This adds allocation latency, but aligns better with user
expectations.  The latency should improve with the drm buddy
clearing patches that Arun has been working on.

In addition this fixes the high CPU spikes seen when doing
wipe on release.

v2: always set AMDGPU_GEM_CREATE_VRAM_CLEARED (Christian)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3528
Fixes: a68c7eaa7a ("drm/amdgpu: Enable clear page functionality")
Acked-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Cc: Christian König <christian.koenig@amd.com>
2024-08-29 14:56:37 -04:00
Jack Xiao
52491d97aa drm/amdgpu/mes: add mes mapping legacy queue switch
For mes11 old firmware has issue to map legacy queue,
add a flag to switch mes to map legacy queue.

Fixes: f9d8c5c785 ("drm/amdgpu/gfx: enable mes to map legacy queue support")
Reported-by: Andrew Worsley <amworsley@gmail.com>
Link: https://lists.freedesktop.org/archives/amd-gfx/2024-August/112773.html
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:41:02 -04:00
Alex Deucher
1125f95cd2 drm/amdgpu/gfx12: return early in preempt_ib()
When MES is enabled KIQ is not available.  Return an error
when someone uses the debugfs preempt test interface in
that case.

Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:57 -04:00
Alex Deucher
1e487c9173 drm/amdgpu/gfx11: return early in preempt_ib()
When MES is enabled KIQ is not available.  Return an error
when someone uses the debugfs preempt test interface in
that case.

Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:48 -04:00
Sunil Khatri
30e8f4c2bd drm/amdgpu: Move the dumping log out of for loop
log message "Dumping IP State Completed" needs to
be logged only once when state dumping is complete.

Hence moving it out of the for loop.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Acked-by: Trigger Huang <Trigger.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:19 -04:00
Victor Zhao
af76ca8e18 drm/amd/amdgpu: move drain_workqueue before shutdown is set
[background] when unloading amdgpu driver right after running a
workload, drain_workqueue is causing "Fence fallback timer
expired on ring sdma0.0". Under sriov, this issue will cause sriov
full access timeout and a reset happening.

move drain_workqueue before shutdown is set to allow ih process and
before enter full access under sriov to avoid full access time cost.

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:07 -04:00
Trigger Huang
c67db6a6a6 drm/amdgpu: Do core dump immediately when job tmo
Do the coredump immediately after a job timeout to get a closer
representation of GPU's error status.

V2: This will skip printing vram_lost as the GPU reset is not
happened yet (Alex)

V3: Unconditionally call the core dump as we care about all the reset
functions(soft-recovery and queue reset and full adapter reset, Alex)

V4: Do the dump after adev->job_hang = true (Sunil)

Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Acked-by:  Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:00 -04:00
Trigger Huang
6122f5c72e drm/amdgpu: skip printing vram_lost if needed
The vm lost status can only be obtained after a GPU reset occurs, but
sometimes a dev core dump can be happened before GPU reset. So a new
argument is added to tell the dev core dump implementation whether to
skip printing the vram_lost status in the dump.
And this patch is also trying to decouple the core dump function from
the GPU reset function, by replacing the argument amdgpu_reset_context
with amdgpu_job to specify the context for core dump.

V2: Inform user if VRAM lost check is skipped so users don't assume
VRAM wasn't lost (Alex)

Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:38:53 -04:00
Alex Deucher
7c1a2d8aba drm/amdgpu/gfx9: put queue resets behind a debug option
Pending extended validation.

Reviewed-and-tested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:38:48 -04:00
Alex Deucher
a9b67c036c drm/amdgpu: add experimental resets debug flag
Add this flag to enable experimental resets for testing before they
are fully validated.

Reviewed-and-tested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:38:36 -04:00
Daniel Vetter
e55ef65510 amd-drm-next-6.12-2024-08-26:
amdgpu:
 - SDMA devcoredump support
 - DCN 4.0.1 updates
 - DC SUBVP fixes
 - Refactor OPP in DC
 - Refactor MMHUBBUB in DC
 - DC DML 2.1 updates
 - DC FAMS2 updates
 - RAS updates
 - GFX12 updates
 - VCN 4.0.3 updates
 - JPEG 4.0.3 updates
 - Enable wave kill (soft recovery) for compute queues
 - Clean up CP error interrupt handling
 - Enable CP bad opcode interrupts
 - VCN 4.x fixes
 - VCN 5.x fixes
 - GPU reset fixes
 - Fix vbios embedded EDID size handling
 - SMU 14.x updates
 - Misc code cleanups and spelling fixes
 - VCN devcoredump support
 - ISP MFD i2c support
 - DC vblank fixes
 - GFX 12 fixes
 - PSR fixes
 - Convert vbios embedded EDID to drm_edid
 - DCN 3.5 updates
 - DMCUB updates
 - Cursor fixes
 - Overdrive support for SMU 14.x
 - GFX CP padding optimizations
 - DCC fixes
 - DSC fixes
 - Preliminary per queue reset infrastructure
 - Initial per queue reset support for GFX 9
 - Initial per queue reset support for GFX 7, 8
 - DCN 3.2 fixes
 - DP MST fixes
 - SR-IOV fixes
 - GFX 9.4.3/4 devcoredump support
 - Add process isolation framework
 - Enable process isolation support for GFX 9.4.3/4
 - Take IOMMU remapping into account for P2P DMA checks
 
 amdkfd:
 - CRIU fixes
 - Improved input validation for user queues
 - HMM fix
 - Enable process isolation support for GFX 9.4.3/4
 - Initial per queue reset support for GFX 9
 - Allow users to target recommended SDMA engines
 
 radeon:
 - remove .load and drm_dev_alloc
 - Fix vbios embedded EDID size handling
 - Convert vbios embedded EDID to drm_edid
 - Use GEM references instead of TTM
 - r100 cp init cleanup
 - Fix potential overflows in evergreen CS offset tracking
 
 UAPI:
 - KFD support for targetting queues on recommended SDMA engines
   Proposed userspace:
   2f588a2406
   eb30a5bbc7
 
 drm/buddy:
 - Add start address support for trim function
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZszhcQAKCRC93/aFa7yZ
 2M4ZAQD+xgIJkQ9HISQeqER5GblnfrorARd32yP/BH0c+JbGUAD9H/BIB41teZ80
 vw2WTx+4TyB39awgvtpDH8iEQdkcSAE=
 =717w
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.12-2024-08-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.12-2024-08-26:

amdgpu:
- SDMA devcoredump support
- DCN 4.0.1 updates
- DC SUBVP fixes
- Refactor OPP in DC
- Refactor MMHUBBUB in DC
- DC DML 2.1 updates
- DC FAMS2 updates
- RAS updates
- GFX12 updates
- VCN 4.0.3 updates
- JPEG 4.0.3 updates
- Enable wave kill (soft recovery) for compute queues
- Clean up CP error interrupt handling
- Enable CP bad opcode interrupts
- VCN 4.x fixes
- VCN 5.x fixes
- GPU reset fixes
- Fix vbios embedded EDID size handling
- SMU 14.x updates
- Misc code cleanups and spelling fixes
- VCN devcoredump support
- ISP MFD i2c support
- DC vblank fixes
- GFX 12 fixes
- PSR fixes
- Convert vbios embedded EDID to drm_edid
- DCN 3.5 updates
- DMCUB updates
- Cursor fixes
- Overdrive support for SMU 14.x
- GFX CP padding optimizations
- DCC fixes
- DSC fixes
- Preliminary per queue reset infrastructure
- Initial per queue reset support for GFX 9
- Initial per queue reset support for GFX 7, 8
- DCN 3.2 fixes
- DP MST fixes
- SR-IOV fixes
- GFX 9.4.3/4 devcoredump support
- Add process isolation framework
- Enable process isolation support for GFX 9.4.3/4
- Take IOMMU remapping into account for P2P DMA checks

amdkfd:
- CRIU fixes
- Improved input validation for user queues
- HMM fix
- Enable process isolation support for GFX 9.4.3/4
- Initial per queue reset support for GFX 9
- Allow users to target recommended SDMA engines

radeon:
- remove .load and drm_dev_alloc
- Fix vbios embedded EDID size handling
- Convert vbios embedded EDID to drm_edid
- Use GEM references instead of TTM
- r100 cp init cleanup
- Fix potential overflows in evergreen CS offset tracking

UAPI:
- KFD support for targetting queues on recommended SDMA engines
  Proposed userspace:
  2f588a2406
  eb30a5bbc7

drm/buddy:
- Add start address support for trim function

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826201528.55307-1-alexander.deucher@amd.com
2024-08-27 14:33:12 +02:00
Daniel Vetter
4461e9e5c3 Linux 6.11-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmbK2B8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGFwkH/10QpUgzIfbFKbF+
 5hwcvaqS5myxWwJ4PjN0eR1qGE6RzVO0Tb24+TVql+7pxu+iWm1kYgC3+/T5xJsP
 ECAszdmPWSco1xaHrh2y3PyCJjaBiqFbIxdjPp7odjDpG9qarbcty8YpWs44u/gd
 RDXzHUuScEShBhEt0ZhvE1pIDL8jJ8JL3yqOMZ+XaDxtJbjaHw4GHp8efxlBWc8N
 jZKIVJi22q5NWG5T0tGtPWwzCm0ewA/JNMTEvE9leoSoAgO85NZ0ivxMC76q/tbj
 BrYk5KnzfhJs4b/n/KtIwWaLTgLyXKGqHMaMq8sbXtp410aUdgnRJO2cl3fI+1vc
 vxQfAfk=
 =RemI
 -----END PGP SIGNATURE-----

Merge v6.11-rc5 into drm-next

amdgpu pr conconflicts due to patches cherry-picked to -fixes, I might
as well catch up with a backmerge and handle them all. Plus both misc
and intel maintainers asked for a backmerge anyway.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2024-08-27 14:09:45 +02:00
Xiaogang Chen
6ef29715ac drm/amdkfd: Change kfd/svm page fault drain handling
When app unmap vm ranges(munmap) kfd/svm starts drain pending page fault and
not handle any incoming pages fault of this process until a deferred work item
got executed by default system wq. The time period of "not handle page fault"
can be long and is unpredicable. That is advese to kfd performance on page
faults recovery.

This patch uses time stamp of incoming page fault to decide to drop or recover
page fault. When app unmap vm ranges kfd records each gpu device's ih ring
current time stamp. These time stamps are used at kfd page fault recovery
routine.

Any page fault happened on unmapped ranges after unmap events is application
bug that accesses vm range after unmap. It is not driver work to cover that.

By using time stamp of page fault do not need drain page faults at deferred
work. So, the time period that kfd does not handle page faults is reduced
and can be controlled.

Signed-off-by: Xiaogang.Chen <Xiaogang.Chen@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-23 10:55:13 -04:00
Likun Gao
875ff9a7ee drm/amdgpu: support for gc_info table v1.3
Add gc_info table v1.3 for IP discovery.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-23 10:54:57 -04:00
Yang Wang
4416377ae1 drm/amdgpu: add list empty check to avoid null pointer issue
Add list empty check to avoid null pointer issues in some corner cases.
- list_for_each_entry_safe()

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-23 10:53:45 -04:00
Alex Deucher
40318a2406 drm/amdgpu/gfx12: set UNORD_DISPATCH in compute MQDs
This needs to be set to 1 to avoid a potential deadlock in
the GC 10.x and newer.  On GC 9.x and older, this needs
to be set to 0. This can lead to hangs in some mixed
graphics and compute workloads.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3575
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-23 10:53:25 -04:00
Hawking Zhang
b05d6476ae drm/amdgpu: Retire query_utcl2_poison_status callback
Driver switches to interrupt source id to identify
utcl2 poison event. polling interface is not needed.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-23 10:53:16 -04:00
Rahul Jain
75f0efbc4b drm/amdgpu: Take IOMMU remapping into account for p2p checks
when trying to enable p2p the amdgpu_device_is_peer_accessible()
checks the condition where address_mask overlaps the aper_base
and hence returns 0, due to which the p2p disables for this platform

IOMMU should remap the BAR addresses so the device can access
them. Hence check if peer_adev is remapping DMA

v5: (Felix, Alex)
- fixing comment as per Alex feedback
- refactor code as per Felix

v4: (Alex)
- fix the comment and description

v3:
- remove iommu_remap variable

v2: (Alex)
- Fix as per review comments
- add new function amdgpu_device_check_iommu_remap to check if iommu
  remap

Signed-off-by: Rahul Jain <Rahul.Jain@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-23 10:53:05 -04:00
Alex Deucher
9cead81eff drm/amdgpu: fix eGPU hotplug regression
The driver needs to wait for the on board firmware
to finish its initialization before probing the card.
Commit 959056982a ("drm/amdgpu: Fix discovery initialization failure during pci rescan")
switched from using msleep() to using usleep_range() which
seems to have caused init failures on some navi1x boards. Switch
back to msleep().

Fixes: 959056982a ("drm/amdgpu: Fix discovery initialization failure during pci rescan")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3559
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3500
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Ma Jun <Jun.Ma2@amd.com>
(cherry picked from commit c69b07f7bb)
Cc: stable@vger.kernel.org # 6.10.x
2024-08-20 23:07:11 -04:00
Candice Li
c99769bcea drm/amdgpu: Validate TA binary size
Add TA binary size validation to avoid OOB write.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit c0a04e3570)
Cc: stable@vger.kernel.org
2024-08-20 23:04:17 -04:00
Alex Deucher
e3e4bf58ba drm/amdgpu/sdma5.2: limit wptr workaround to sdma 5.2.1
The workaround seems to cause stability issues on other
SDMA 5.2.x IPs.

Fixes: a03ebf1163 ("drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3556
Acked-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2dc3851ef7)
Cc: stable@vger.kernel.org
2024-08-20 22:51:37 -04:00
Yang Wang
0b43312902 drm/amdgpu: fixing rlc firmware loading failure issue
Skip rlc firmware validation to ignore firmware header size mismatch issues.
This restores the workaround added in
commit 849e133c97 ("drm/amdgpu: Fix the null pointer when load rlc firmware")

Fixes: 3af2c80ae2 ("drm/amdgpu: refine gfx10 firmware loading")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3551
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 89ec85d16e)
2024-08-20 22:51:31 -04:00
Alex Deucher
88c511dea1 drm/amd/gfx11: move the gfx mutex into the caller
Otherwise we can fail to drop the software mutex when
we fail to take the hardware mutex.

Fixes: 76acba7b7f ("drm/amdgpu/gfx11: add a mutex for the gfx semaphore")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:14:14 -04:00
Victor Zhao
bf2bc61638 drm/amd/amdgpu: allow use kiq to do hdp flush under sriov
when use cpu to do page table update under sriov runtime, since mmio
access is blocked, kiq has to be used to flush hdp.

change WREG32_NO_KIQ to WREG32 to allow kiq.

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:14:14 -04:00
Alex Deucher
c69b07f7bb drm/amdgpu: fix eGPU hotplug regression
The driver needs to wait for the on board firmware
to finish its initialization before probing the card.
Commit 959056982a ("drm/amdgpu: Fix discovery initialization failure during pci rescan")
switched from using msleep() to using usleep_range() which
seems to have caused init failures on some navi1x boards. Switch
back to msleep().

Fixes: 959056982a ("drm/amdgpu: Fix discovery initialization failure during pci rescan")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3559
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3500
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Ma Jun <Jun.Ma2@amd.com>
2024-08-20 22:14:14 -04:00
Candice Li
c0a04e3570 drm/amdgpu: Validate TA binary size
Add TA binary size validation to avoid OOB write.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:14:13 -04:00
Mukul Joshi
ccf8ef6b75 drm/amdgpu: Implement MES Suspend and Resume APIs for GFX11
Add implementation for MES Suspend and Resume APIs to unmap/map
all queues for GFX11. Support for GFX12 will be added when the
corresponding firmware support is in place.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:14:04 -04:00
Srinivasan Shanmugam
f846250b8a drm/amdgpu/gfx_v9_4_3: Apply Isolation Enforcement to GFX & Compute rings
This commit applies isolation enforcement to the GFX and Compute rings
in the gfx_v9_4_3 module.

The commit sets `amdgpu_gfx_enforce_isolation_ring_begin_use` and
`amdgpu_gfx_enforce_isolation_ring_end_use` as the functions to be
called when a ring begins and ends its use, respectively.

`amdgpu_gfx_enforce_isolation_ring_begin_use` is called when a ring
begins its use. This function cancels any scheduled
`enforce_isolation_work` and, if necessary, signals the Kernel Fusion
Driver (KFD) to stop the runqueue.

`amdgpu_gfx_enforce_isolation_ring_end_use` is called when a ring ends
its use. This function schedules `enforce_isolation_work` to be run
after a delay.

These functions are part of the Enforce Isolation Handler, which
enforces shader isolation on AMD GPUs to prevent data leakage between
different processes.

The commit also includes a check for the type of the ring. If the type
of the ring is `AMDGPU_RING_TYPE_COMPUTE`, the `xcp_id` of the
`enforce_isolation` structure in the `gfx` structure of the
`amdgpu_device` is set to the `xcp_id` of the ring. This ensures that
the correct `xcp_id` is used when enforcing isolation on compute rings.
The `xcp_id` is an identifier for an XCP partition, and different rings
can be associated with different XCP partitions.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
2024-08-20 22:08:02 -04:00
Srinivasan Shanmugam
b710dbe55d drm/amdgpu/gfx9: Apply Isolation Enforcement to GFX & Compute rings
This commit applies isolation enforcement to the GFX and Compute rings
in the gfx_v9_0 module.

The commit sets `amdgpu_gfx_enforce_isolation_ring_begin_use` and
`amdgpu_gfx_enforce_isolation_ring_end_use` as the functions to be
called when a ring begins and ends its use, respectively.

`amdgpu_gfx_enforce_isolation_ring_begin_use` is called when a ring
begins its use. This function cancels any scheduled
`enforce_isolation_work` and, if necessary, signals the Kernel Fusion
Driver (KFD) to stop the runqueue.

`amdgpu_gfx_enforce_isolation_ring_end_use` is called when a ring ends
its use. This function schedules `enforce_isolation_work` to be run
after a delay.

These functions are part of the Enforce Isolation Handler, which
enforces shader isolation on AMD GPUs to prevent data leakage between
different processes.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
2024-08-20 22:07:58 -04:00
Srinivasan Shanmugam
afefd6f245 drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization
This commit introduces the Enforce Isolation Handler designed to enforce
shader isolation on AMD GPUs, which helps to prevent data leakage
between different processes.

The handler counts the number of emitted fences for each GFX and compute
ring. If there are any fences, it schedules the `enforce_isolation_work`
to be run after a delay of `GFX_SLICE_PERIOD`. If there are no fences,
it signals the Kernel Fusion Driver (KFD) to resume the runqueue.

The function is synchronized using the `enforce_isolation_mutex`.

This commit also introduces a reference count mechanism
(kfd_sch_req_count) to keep track of the number of requests to enable
the KFD scheduler. When a request to enable the KFD scheduler is made,
the reference count is decremented. When the reference count reaches
zero, a delayed work is scheduled to enforce isolation after a delay of
GFX_SLICE_PERIOD.

When a request to disable the KFD scheduler is made, the function first
checks if the reference count is zero. If it is, it cancels the delayed
work for enforcing isolation and checks if the KFD scheduler is active.
If the KFD scheduler is active, it sends a request to stop the KFD
scheduler and sets the KFD scheduler state to inactive. Then, it
increments the reference count.

The function is synchronized using the kfd_sch_mutex to ensure that the
KFD scheduler state and reference count are updated atomically.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:07:35 -04:00
Amber Lin
234eebe161 drm/amdkfd: APIs to stop/start KFD scheduling
Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling
compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling.
When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from
runlist. If users send ioctls to KFD to create queues, they'll be added
but those queues won't be mapped to runlist (so not scheduled) until
amdgpu_amdkfd_start_sched is called.

v2: fix build (Alex)

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:07:28 -04:00
Srinivasan Shanmugam
b1f49ff9cb drm/amdgpu/gfx9: Add cleaner shader support for GFX9.4.4 hardware
This commit extends the cleaner shader feature to support GFX9.4.4
hardware.

The cleaner shader feature is used to clear or initialize certain GPU
resources, such as Local Data Share (LDS), Vector General Purpose
Registers (VGPRs), and Scalar General Purpose Registers (SGPRs). This
operation needs to be performed in isolation, while no other tasks
should be running on the GPU at the same time.

Previously, the cleaner shader feature was implemented for GFX9.4.3
hardware. This commit adds support for GFX9.4.4 hardware by allowing the
cleaner shader to be used with this hardware version.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:07:23 -04:00
Srinivasan Shanmugam
335288315a drm/amdgpu/gfx9: Add cleaner shader for GFX9.4.3
This commit adds the cleaner shader microcode for GFX9.4.3 GPUs. The
cleaner shader is a piece of GPU code that is used to clear or
initialize certain GPU resources, such as Local Data Share (LDS), Vector
General Purpose Registers (VGPRs), and Scalar General Purpose Registers
(SGPRs).

Clearing these resources is important for ensuring data isolation
between different workloads running on the GPU. Without the cleaner
shader, residual data from a previous workload could potentially be
accessed by a subsequent workload, leading to data leaks and incorrect
computation results.

The cleaner shader microcode is represented as an array of 32-bit words
(`gfx_9_4_3_cleaner_shader_hex`). This array is the binary
representation of the cleaner shader code, which is written in a
low-level GPU instruction set.

When the cleaner shader feature is enabled, the AMDGPU driver loads this
array into a specific location in the GPU memory. The GPU then reads
this memory location to fetch and execute the cleaner shader
instructions.

The cleaner shader is executed automatically by the GPU at the end of
each workload, before the next workload starts. This ensures that all
GPU resources are in a clean state before the start of each workload.

This addition is part of the cleaner shader feature implementation. The
cleaner shader feature helps improve GPU performance and resource
utilization by cleaning up GPU resources after they are used. It also
enhances security and reliability by preventing data leaks between
workloads.

v2: fix copyright date (Alex)

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:07:16 -04:00
Srinivasan Shanmugam
d4c3815495 drm/amdgpu/gfx9: Implement cleaner shader support for GFX9.4.3 hardware
The patch modifies the gfx_v9_4_3_kiq_set_resources function to write
the cleaner shader's memory controller address to the ring buffer. It
also adds a new function, gfx_v9_4_3_ring_emit_cleaner_shader, which
emits the PACKET3_RUN_CLEANER_SHADER packet to the ring buffer.

This patch adds support for the PACKET3_RUN_CLEANER_SHADER packet in the
gfx_v9_4_3 module. This packet is used to emit the cleaner shader, which
is used to clear GPU memory before it's reused, helping to prevent data
leakage between different processes.

Finally, the patch updates the ring function structures to include the
new gfx_v9_4_3_ring_emit_cleaner_shader function. This allows the
cleaner shader to be emitted as part of the ring's operations.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:07:11 -04:00
Srinivasan Shanmugam
c2e70d307f drm/amdgpu/gfx9: Implement cleaner shader support for GFX9 hardware
The patch modifies the gfx_v9_0_kiq_set_resources function to write
the cleaner shader's memory controller address to the ring buffer. It
also adds a new function, gfx_v9_0_ring_emit_cleaner_shader, which
emits the PACKET3_RUN_CLEANER_SHADER packet to the ring buffer.

This patch adds support for the PACKET3_RUN_CLEANER_SHADER packet in the
gfx_v9_0 module. This packet is used to emit the cleaner shader, which
is used to clear GPU memory before it's reused, helping to prevent data
leakage between different processes.

Finally, the patch updates the ring function structures to include the
new gfx_v9_0_ring_emit_cleaner_shader function. This allows the
cleaner shader to be emitted as part of the ring's operations.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:07:07 -04:00
Srinivasan Shanmugam
22ff907d4f drm/amdgpu: Add PACKET3_RUN_CLEANER_SHADER for cleaner shader execution
This commit adds the PACKET3_RUN_CLEANER_SHADER definition. This packet
is a command packet used to instruct the GPU to execute the cleaner
shader.

The cleaner shader is a piece of GPU code that is used to clear or
initialize certain GPU resources, such as Local Data Share (LDS), Vector
General Purpose Registers (VGPRs), and Scalar General Purpose Registers
(SGPRs). Clearing these resources is important for ensuring data
isolation between different workloads running on the GPU.

The PACKET3_RUN_CLEANER_SHADER packet is used to trigger the execution
of the cleaner shader on the GPU. The packet consists of a header
followed by a RESERVED field, which is programmed to zero. When the GPU
receives this packet, it fetches and executes the cleaner shader
instructions from the location specified in the packet.

The cleaner shader feature helps to enhances security and reliability by
preventing data leaks between workloads.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:07:02 -04:00
Srinivasan Shanmugam
d361ad5d2f drm/amdgpu: Add sysfs interface for running cleaner shader
This patch adds a new sysfs interface for running the cleaner shader on
AMD GPUs. The cleaner shader is used to clear GPU memory before it's
reused, which can help prevent data leakage between different processes.

The new sysfs file is write-only and is named `run_cleaner_shader`.
Write the number of the partition to this file to trigger the cleaner shader
on that partition. There is only one partition on GPUs which do not
support partitioning.

Changes made in this patch:

- Added `amdgpu_set_run_cleaner_shader` function to handle writes to the
  `run_cleaner_shader` sysfs file.
- Added `run_cleaner_shader` to the list of device attributes in
  `amdgpu_device_attrs`.
- Updated `default_attr_update` to handle `run_cleaner_shader`.
- Added `AMDGPU_DEVICE_ATTR_WO` macro to create write-only device
  attributes.

v2: fix error handling (Alex)

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
2024-08-20 22:06:56 -04:00
Srinivasan Shanmugam
e189be9b2e drm/amdgpu: Add enforce_isolation sysfs attribute
This commit adds a new sysfs attribute 'enforce_isolation' to control
the 'enforce_isolation' setting per GPU. The attribute can be read and
written, and accepts values 0 (disabled) and 1 (enabled).

When 'enforce_isolation' is enabled, reserved VMIDs are allocated for
each ring. When it's disabled, the reserved VMIDs are freed.

The set function locks a mutex before changing the 'enforce_isolation'
flag and the VMIDs, and unlocks it afterwards. This ensures that these
operations are atomic and prevents race conditions and other concurrency
issues.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-20 22:06:52 -04:00
Srinivasan Shanmugam
dba1a6cfc3 drm/amdgpu: Enforce isolation as part of the job
This patch adds a new parameter 'enforce_isolation' to the amdgpu_job
structure. This parameter is used to determine whether shader isolation
should be enforced for a job. The enforce_isolation parameter is then
stored in the amdgpu_job structure and used when flushing the VM.

The enforce_isolation field of the amdgpu_job structure is set directly
after the job is allocated

This change allows more fine-grained control over shader isolation,
making it possible to enforce isolation on a per-job basis rather than
globally. This can be useful in scenarios where only certain jobs
require isolation.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
2024-08-20 22:06:43 -04:00
Victor Skvortsov
19cff16559 drm/amdgpu: abort KIQ waits when there is a pending reset
Stop waiting for the KIQ to return back when there is a reset pending.
It's quite likely that the KIQ will never response.

Signed-off-by: Koenig Christian <Christian.Koenig@amd.com>
Suggested-by: Lazar Lijo <Lijo.Lazar@amd.com>
Tested-by: Victor Skvortsov <victor.skvortsov@amd.com>
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-16 14:27:50 -04:00
Srinivasan Shanmugam
9659520419 drm/amdgpu: Make enforce_isolation setting per GPU
This commit makes enforce_isolation setting to be per GPU and per
partition by adding the enforce_isolation array to the adev structure.
The adev variable is set based on the global enforce_isolation module
parameter during device initialization.

In amdgpu_ids.c, the adev->enforce_isolation value for the current GPU
is used to determine whether to enforce isolation between graphics and
compute processes on that GPU.

In amdgpu_ids.c, the adev->enforce_isolation value for the current GPU
and partition is used to determine whether to enforce isolation between
graphics and compute processes on that GPU and partition.

This allows the enforce_isolation setting to be controlled individually
for each GPU and each partition, which is useful in a system with
multiple GPUs and partitions where different isolation settings might be
desired for different GPUs and partitions.

v2: fix loop in amdgpu_vmid_mgr_init() (Alex)

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
2024-08-16 14:27:45 -04:00
Alex Deucher
ee7a846ea2 drm/amdgpu: Emit cleaner shader at end of IB submission
This commit introduces the emission of a cleaner shader at the end of
the IB submission process. This is achieved by adding a new function
pointer, `emit_cleaner_shader`, to the `amdgpu_ring_funcs` structure. If
the `emit_cleaner_shader` function is set in the ring functions, it is
called during the VM flush process.

The cleaner shader is only emitted if the `enable_cleaner_shader` flag
is set in the `amdgpu_device` structure. This allows the cleaner shader
emission to be controlled on a per-device basis.

By emitting a cleaner shader at the end of the IB submission, we can
ensure that the VM state is properly cleaned up after each submission.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
2024-08-16 14:27:42 -04:00
Srinivasan Shanmugam
aec773a1fb drm/amdgpu: Add infrastructure for Cleaner Shader feature
The cleaner shader is used by the CP firmware to clean LDS and GPRs
between processes on the CUs.

This adds an internal API for GFX IP code to allocate and initialize the
cleaner shader.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
2024-08-16 14:27:34 -04:00
Alex Deucher
f49280ffd2 drm/amdgpu: handle enforce isolation on non-0 gfxhub
Some chips have more than one gfxhub so check if we
are a gfxhub rather than just gfxhub 0.

Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-16 14:27:28 -04:00
Alex Deucher
2dc3851ef7 drm/amdgpu/sdma5.2: limit wptr workaround to sdma 5.2.1
The workaround seems to cause stability issues on other
SDMA 5.2.x IPs.

Fixes: a03ebf1163 ("drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3556
Acked-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-16 14:27:04 -04:00
Sunil Khatri
1a2103d685 drm/amdgpu: add vcn ip dump support for vcn_v2_6
Add support for logging the registers in devcoredump
buffer for vcn_v2_6.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-16 14:26:57 -04:00
Sunil Khatri
bc62abe1b9 drm/amdgpu: add print support for vcn_v2_5 ip dump
Add support for logging the registers in devcoredump
buffer for vcn_v2_5.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-16 14:26:52 -04:00
Sunil Khatri
0eea81ee2e drm/amdgpu: add vcn_v2_5 ip dump support
Add support of vcn ip dump in the devcoredump
for vcn_v2_5.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-16 14:26:47 -04:00
Sunil Khatri
b910cacb4e drm/amdgpu: add print support for vcn_v2_0 ip dump
Add support for logging the registers in devcoredump
buffer for vcn_v2_0.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-16 14:26:41 -04:00