The eviction fence destroy path incorrectly calls dma_fence_put() on
evf_mgr->ev_fence after releasing the ev_fence_lock. This introduces a
potential use-after-unlock or race because another thread concurrently
modifies evf_mgr->ev_fence.
Fix this by grabbing a local reference to evf_mgr->ev_fence under the
lock and using that for dma_fence_put() after waiting.
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
These structures are basically ported from MMSCH v4_0
The structures are the same as v4_0 except for the
init header
Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Compatible NPS modes for a partition mode are exposed through xcp_config
interface. To determine if a compute partition mode is valid, check if
the current NPS mode is part of compatible NPS modes.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If the MEC firmware supports chaining runlists of XNACK+/XNACK-
processes, set SQ_CONFIG1 chicken bit and SET_RESOURCES bit 28.
When the MEC/HWS supports it, KFD checks the XNACK+/XNACK- processes mix
happens or not. If it does, enter over-subscription.
Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Expose the debugfs file node for user space to dump the IFWI image
on spirom.
For one transaction between PSP and host, it will read out the
images on both active and inactive partitions so a buffer with two
times the size of maximum IFWI image (currently 16MByte) is needed.
v2: move the vbios gfl macros to the common header and rename the
bo triplet struct to spirom_bo for this specific usage (Hawking)
v3: return directly the result of last command execution (Lijo)
Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
As the userq resource was already freed at the drm_release
early phase, it should avoid freeing userq resource again
at the later kms postclose callback.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
A circular locking dependency was detected between the global
`adev->userq_mutex` and per-file `userq_mgr->userq_mutex` when
creating user queues. The issue occurs because:
1. `amdgpu_userq_suspend()` and `amdgpu_userq_resume` take `adev->userq_mutex` first, then
`userq_mgr->userq_mutex`
2. While `amdgpu_userq_create()` takes them in reverse order
This patch resolves the issue by:
1. Moving the `adev->userq_mutex` lock earlier in `amdgpu_userq_create()`
to cover the `amdgpu_userq_ensure_ev_fence()` call
2. Releasing it after we're done with both queue creation and the
scheduling halt check
v2: remove unused adev->userq_mutex lock (Prike)
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
On VCN v4.0.5 there is a race condition where the WPTR is not
updated after starting from idle when doorbell is used. Adding
register read-back after written at function end is to ensure
all register writes are done before they can be used.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This patch enables per-queue and per-pipe reset functionality for
GFX IP v9.5.0 when using MEC firmware version 21 (0x15) or later.
This change:
1. Refactors the pipe reset support check in gfx_v9_4_3_pipe_reset_support()
to use the compute_supported_reset flags instead of hardcoding
version checks.
2. Adds support for GFX9.5.0 (IP 9.5.0) with MEC firmware version >= 21
to enable per-queue and per-pipe reset capabilities.
v2: Replaced mec version check with !!(adev->gfx.compute_supported_reset & AMDGPU_RESET_TYPE_PER_PIPE) (Lijo)
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Set vram type so we can take different actions according to the type.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Make the code more general, user doesn't need to pay attention to the
detail of flip bits setting.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The number of newly added de counts and the number of
newly added error addresses remain consistent
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
During driver load, RAS event manager may not be initialized. This will
cause any ATHUB event during driver load to be skipped in dmesg log. Log
the error in dmesg log for easier diagnosis.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This resolves a deadlock between user queue management and GPU reset
paths by enforcing consistent lock ordering.
The deadlock occurred when:
1. Process exit path (amdgpu_userq_mgr_fini) would:
- Take uqm->userq_mutex
- Then try to take adev->userq_mutex for list operations
2. GPU reset path (amdgpu_userq_pre_reset) would:
- Take adev->userq_mutex first (for list traversal)
- Then take uqm->userq_mutex
The solution establishes a strict top-down locking order:
1. Always take adev->userq_mutex before any uqm->userq_mutex
2. Maintain this order consistently across all code paths
Changes made:
- Reordered locking in amdgpu_userq_mgr_fini() to take device lock first
- Kept existing proper order in amdgpu_userq_pre_reset()
- Simplified the fini flow by removing redundant operations
This prevents circular dependencies while maintaining thread safety
during both normal operation and GPU reset scenarios.
Fixes: 4ce60dbada ("drm/amdgpu: store userq_managers in a list in adev")
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Get RAS retire flip bits for HBM with different types in various NPS modes.
Also set flip row bit and MCA R13 bit in PA in different NPS modes.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The RAS bad page retire flip bits can be set per vram type,
vram vendor and NPS mode.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add the general interface to get flip bits for RAS bad page retirement.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Per UMC address conversion algorithm, the high row bits of UMC MCA
address are changed when they're converted into normalized address
on specific ASICs.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Support new version of HBM.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
there is only MCA records in V3, no need to care about PA records.
recalculate the value of ras_num_bad_pages when parsing failed and
go on with the left records instead of quit.
Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
On GFX1151, the reported MALL cache size reflects only
half of its actual size; this adjustment corrects the discrepancy.
Signed-off-by: Tim Huang <tim.huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To resolve the warning regarding the missing error code 'r' in
amdgpu_userq_wait_ioctl(), assign the value 'r = -EINVAL'.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202505080458.rnV8YfiY-lkp@intel.com/
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
After process exit to unmap csa and free GPU vm, if signal is accepted
and then waiting to take vm lock is interrupted and return, it causes
memory leaking and below warning backtrace.
Change to use uninterruptible wait lock fix the issue.
WARNING: CPU: 69 PID: 167800 at amd/amdgpu/amdgpu_kms.c:1525
amdgpu_driver_postclose_kms+0x294/0x2a0 [amdgpu]
Call Trace:
<TASK>
drm_file_free.part.0+0x1da/0x230 [drm]
drm_close_helper.isra.0+0x65/0x70 [drm]
drm_release+0x6a/0x120 [drm]
amdgpu_drm_release+0x51/0x60 [amdgpu]
__fput+0x9f/0x280
____fput+0xe/0x20
task_work_run+0x67/0xa0
do_exit+0x217/0x3c0
do_group_exit+0x3b/0xb0
get_signal+0x14a/0x8d0
arch_do_signal_or_restart+0xde/0x100
exit_to_user_mode_loop+0xc1/0x1a0
exit_to_user_mode_prepare+0xf4/0x100
syscall_exit_to_user_mode+0x17/0x40
do_syscall_64+0x69/0xc0
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Except HDP v5.2 all use a common logic for HDP flush. Use a generic
function. HDP v5.2 forces NO_KIQ logic, revisit it later.
Reapply after fixing up an HDP regression.
v2: merge the fix (Alex)
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
APU doesn't have second IH ring, so re-routing action here is a no-op.
It will take a lot of time to wait timeout from PSP during the
initialization. So remove the function in psp v12.
Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If there is a problem requiring a reset of the VCN engine, it is better to
reset the VCN engine rather than the entire GPU.
Add a reset callback for the ring which will stop and start VCN if an
issue happens.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250506204948.12048-4-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If there is a problem requiring a reset of the VCN engine, it is better to
reset the VCN engine rather than the entire GPU.
Add a reset callback for the ring which will stop and start VCN if an
issue happens.
Link: https://lore.kernel.org/r/20250506204948.12048-3-mario.limonciello@amd.com
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There is a problem occurring on VCN 4.0.5 where in some situations a job
is timing out. This triggers a job timeout which then causes a GPU
reset for recovery. That has exposed a number of issues with GPU reset
that have since been fixed. But also a GPU reset isn't actually needed
for this circumstance. Just restarting the ring is enough.
Add a reset callback for the ring which will stop and start VCN if the
issue happens.
Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3909
Link: https://lore.kernel.org/r/20250506204948.12048-2-mario.limonciello@amd.com
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit 18a878fd8a.
Revert this temporarily to make it easier to fix a regression
in the HDP handling.
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Since secure firmware is more stable than bring up phase, I believe we
don't need such mdelays any more before wait PSP response on PSP v12.
Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Trigger Huang <Trigger.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If applications unmap the memory before destroying the userptr, it needs
trigger a segfault to notify user space to correct the free sequence in
VM debug mode.
v2: Send gpu access fault to user space
v3: Report gpu address to user space, remove unnecessary params
v4: update pr_err into one line, remove userptr log info
Signed-off-by: Shane Xiao <shane.xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In VM debug mode, it is desirable to notify the application
to correct the freeing sequence by unmapping the memory before
destroying the userptr in the old userptr path. Add a bitmask
to decide whether to send gpu vm fault to the applition.
Signed-off-by: Shane Xiao <shane.xiao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It requires unlocking the reserved gem BO before returning from
attaching the eviction fence error.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>