Commit Graph

820 Commits

Author SHA1 Message Date
Alex Deucher
e9f58ff991 drm/amdgpu: rework how we handle TLB fences
Add a new VM flag to indicate whether or not we need
a TLB fence.  Userqs (KFD or KGD) require a TLB fence.
A TLB fence is not strictly required for kernel queues,
but it shouldn't hurt.  That said, enabling this
unconditionally should be fine, but it seems to tickle
some issues in KIQ/MES.  Only enable them for KFD,
or when KGD userq queues are enabled (currently via module
parameter).

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4798
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4749
Fixes: f3854e04b7 ("drm/amdgpu: attach tlb fence to the PTs update")
Cc: Christian König <christian.koenig@amd.com>
Cc: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 69c5fbd2b93b5ced77c6e79afe83371bca84c788)
Cc: stable@vger.kernel.org
2026-03-17 18:03:09 -04:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
d4a292c5f8 Merge tag 'drm-next-2026-02-21' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
 "This is the fixes and cleanups for the end of the merge window, it's
  nearly all amdgpu, with some amdkfd, then a pagemap core fix, i915/xe
  display fixes, and some xe driver fixes.

  Nothing seems out of the ordinary, except amdgpu is a little more
  volume than usual.

  pagemap:
   - drm/pagemap: pass pagemap_addr by reference

  amdgpu:
   - DML 2.1 fixes
   - Panel replay fixes
   - Display writeback fixes
   - MES 11 old firmware compat fix
   - DC CRC improvements
   - DPIA fixes
   - XGMI fixes
   - ASPM fix
   - SMU feature bit handling fixes
   - DC LUT fixes
   - RAS fixes
   - Misc memory leak in error path fixes
   - SDMA queue reset fixes
   - PG handling fixes
   - 5 level GPUVM page table fix
   - SR-IOV fix
   - Queue reset fix
   - SMU 13.x fixes
   - DC resume lag fix
   - MPO fixes
   - DCN 3.6 fix
   - VSDB fixes
   - HWSS clean up
   - Replay fixes
   - DCE cursor fixes
   - DCN 3.5 SR DDR5 latency fixes
   - HPD fixes
   - Error path unwind fixes
   - SMU13/14 mode1 reset fixes
   - PSP 15 updates
   - SMU 15 updates
   - Sync fix in amdgpu_dma_buf_move_notify()
   - HAINAN fix
   - PSP 13.x fix
   - GPUVM locking fix
   - Fixes for DC analog support
   - DC FAMS fixes
   - DML 2.1 fixes
   - eDP fixes
   - Misc DC fixes
   - Fastboot fix
   - 3DLUT fixes
   - GPUVM fixes
   - 64bpp format fix
   - Fix for MacBooks with switchable gfx

  amdkfd:
   - Fix possible double deletion of validate list
   - Event setup fix
   - Device disconnect regression fix
   - APU GTT as VRAM fix
   - Fix piority inversion with MQDs
   - NULL check fix

  radeon:
   - HAINAN fix

  i915/xe display:
   - Regresion fix for HDR 4k displays (#15503)
   - Fixup for Dell XPS 13 7390 eDP rate limit
   - Memory leak fix on ACPI _DSM handling
   - Add missing slice count check during DP mode validation

  xe:
   - drm/xe: Prevent VFs from exposing the CCS mode sysfs file
   - SRIOV related fixes
   - PAT cache fix
   - MMIO read fix
   - W/a fixes
   - Adjust type of xe_modparam.force_vram_bar_size
   - Wedge mode fix
   - HWMon fix

* tag 'drm-next-2026-02-21' of https://gitlab.freedesktop.org/drm/kernel: (143 commits)
  drm/amd/display: Remove unneeded DAC link encoder register
  drm/amd/display: Enable DAC in DCE link encoder
  drm/amd/display: Set CRTC source for DAC using registers
  drm/amd/display: Initialize DAC in DCE link encoder using VBIOS
  drm/amd/display: Turn off DAC in DCE link encoder using VBIOS
  drm/amd/display: Don't call find_analog_engine() twice
  drm/amdgpu: fix 4-level paging if GMC supports 57-bit VA v2
  drm/amdgpu: keep vga memory on MacBooks with switchable graphics
  drm/amdgpu: Set atomics to true for xgmi
  drm/amdkfd: Check for NULL return values
  drm/amd/display: Use same max plane scaling limits for all 64 bpp formats
  drm/amdgpu: Set vmid0 PAGE_TABLE_DEPTH for GFX12.1
  drm/amdkfd: Disable MQD queue priority
  drm/amd/display: Remove conditional for shaper 3DLUT power-on
  drm/amd/display: Check return of shaper curve to HW format
  drm/amd/display: Correct logic check error for fastboot
  drm/amd/display: Skip eDP detection when no sink
  Revert "drm/amd/display: Add Gfx Base Case For Linear Tiling Handling"
  Revert "drm/amd/display: Correct hubp GfxVersion verification"
  Revert "drm/amd/display: Add Handling for gfxversion DcGfxBase"
  ...
2026-02-20 15:36:38 -08:00
Christian König
aa25c111a7 drm/amdgpu: fix 4-level paging if GMC supports 57-bit VA v2
It turned that using 4 level page tables on GMC generations which support
57bit VAs actually doesn't work at all.

Background is that the GMC actually can't switch between 4 and 5 levels,
but rather just uses a subset of address space when less than 5 levels are
selected.

Philip already removed the automatically switch to 4levels, now fix it as
well should it be enabled by module parameters.

v2: fix AMDGPU_GMC_HOLE_MASK as well, fix off by one issue pointed out
    by Philip

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-19 12:16:12 -05:00
Christian König
fd1fa48b93 drm/amdgpu: lock both VM and BO in amdgpu_gem_object_open
The VM was not locked in the past since we initially only cleared the
linked list element and not added it to any VM state.

But this has changed quite some time ago, we just never realized this
problem because the VM state lock was masking it.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-12 15:24:59 -05:00
Linus Torvalds
136114e0ab Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:

 - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
   disk space by teaching ocfs2 to reclaim suballocator block group
   space (Heming Zhao)

 - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
   ARRAY_END() macro and uses it in various places (Alejandro Colomar)

 - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
   the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
   page size (Pnina Feder)

 - "kallsyms: Prevent invalid access when showing module buildid" cleans
   up kallsyms code related to module buildid and fixes an invalid
   access crash when printing backtraces (Petr Mladek)

 - "Address page fault in ima_restore_measurement_list()" fixes a
   kexec-related crash that can occur when booting the second-stage
   kernel on x86 (Harshit Mogalapalli)

 - "kho: ABI headers and Documentation updates" updates the kexec
   handover ABI documentation (Mike Rapoport)

 - "Align atomic storage" adds the __aligned attribute to atomic_t and
   atomic64_t definitions to get natural alignment of both types on
   csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)

 - "kho: clean up page initialization logic" simplifies the page
   initialization logic in kho_restore_page() (Pratyush Yadav)

 - "Unload linux/kernel.h" moves several things out of kernel.h and into
   more appropriate places (Yury Norov)

 - "don't abuse task_struct.group_leader" removes the usage of
   ->group_leader when it is "obviously unnecessary" (Oleg Nesterov)

 - "list private v2 & luo flb" adds some infrastructure improvements to
   the live update orchestrator (Pasha Tatashin)

* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
  watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
  procfs: fix missing RCU protection when reading real_parent in do_task_stat()
  watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
  kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
  kho: fix doc for kho_restore_pages()
  tests/liveupdate: add in-kernel liveupdate test
  liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
  liveupdate: luo_file: Use private list
  list: add kunit test for private list primitives
  list: add primitives for private list manipulations
  delayacct: fix uapi timespec64 definition
  panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
  netclassid: use thread_group_leader(p) in update_classid_task()
  RDMA/umem: don't abuse current->group_leader
  drm/pan*: don't abuse current->group_leader
  drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
  drm/amdgpu: don't abuse current->group_leader
  android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
  android/binder: don't abuse current->group_leader
  kho: skip memoryless NUMA nodes when reserving scratch areas
  ...
2026-02-12 12:13:01 -08:00
Philip Yang
3b948dd036 drm/amdgpu: Use 5-level paging if gmc support 57-bit VA
Regardless if CPU enable 5-level paging, GPU vm use 5-level paging if
gmc init with 57-bit address space support, because

ARM64 4-level paging support 48-bit VA, x86 and GPU 4-level paging
support 47-bit VA, require 5-level paging on GPU to support ARM64.

NPA address space 52-bit mapping on NPA GPU VM require 5-level paging.

Debugger trap get device snapshot expect LDS and Scratch base, limit
above 57-bit, which is set only for 5-level paging.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.19.x
2026-02-05 17:23:51 -05:00
Oleg Nesterov
a87da7a9fa drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
Nowadays task->group_leader->mm != task->mm is only possible if a) task is
not a group leader and b) task->group_leader->mm == NULL because
task->group_leader has already exited using sys_exit().

I don't think that drm/amd tries to detect/nack this case.

Link: https://lkml.kernel.org/r/aXY_yLVHd63UlWtm@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Christan König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-03 08:21:25 -08:00
Oleg Nesterov
7d08e0916a drm/amdgpu: don't abuse current->group_leader
Cleanup and preparation to simplify the next changes.

- Use current->tgid instead of current->group_leader->pid

- Use get_task_pid(current, PIDTYPE_TGID) instead of
  get_task_pid(current->group_leader, PIDTYPE_PID)

Link: https://lkml.kernel.org/r/aXY_wKewzV5lCa5I@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Christan König <christian.koenig@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-03 08:21:25 -08:00
Prike Liang
808c2052f0 Revert "drm/amdgpu: don't attach the tlb fence for SI"
This reverts commit 820b3d376e.

It’s better to validate VM TLB flushes in the flush‑TLB backend
rather than in the generic VM layer.

Reverting this patch depends on
commit fa7c231fc2b0 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()")
being present in the tree.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9163fe4d79)
2026-01-14 15:06:51 -05:00
Prike Liang
9163fe4d79 Revert "drm/amdgpu: don't attach the tlb fence for SI"
This reverts commit 820b3d376e.

It’s better to validate VM TLB flushes in the flush‑TLB backend
rather than in the generic VM layer.

Reverting this patch depends on
commit fa7c231fc2b0 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()")
being present in the tree.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-01-14 14:28:59 -05:00
Alex Deucher
eb296c0980 drm/amdgpu: don't attach the tlb fence for SI
SI hardware doesn't support pasids, user mode queues, or
KIQ/MES so there is no need for this.  Doing so results in
a segfault as these callbacks are non-existent for SI.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4744
Fixes: f3854e04b7 ("drm/amdgpu: attach tlb fence to the PTs update")
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 820b3d376e)
2025-12-08 15:24:16 -05:00
Alex Deucher
820b3d376e drm/amdgpu: don't attach the tlb fence for SI
SI hardware doesn't support pasids, user mode queues, or
KIQ/MES so there is no need for this.  Doing so results in
a segfault as these callbacks are non-existent for SI.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4744
Fixes: f3854e04b7 ("drm/amdgpu: attach tlb fence to the PTs update")
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08 14:20:59 -05:00
Philip Yang
f6b1c1f5fd drm/amdgpu: GPU vm support 5-level page table
If GPU supports 5-level page table, but CPU disable 5-level page table
by using boot option no5lvl or CPU feature not available, the virtual
address will be 48bit, not needed to enable 5-level page table on GPU
vm.

If adev->vm_manager.num_level, number of pde levels, set to 4, then
gfxhub and mmhub register VM_CONTEXTx_CNTL/PAGE_TABLE_DEPTH will set
to 4 to enable 5-level page table in page table walker.

Set vm_manager.root_level to AMDGPU_VM_PDE3, then update GPU mapping
will allocate and update PDE3/PDE2/PDE1/PDE0/PTB 5-level page tables.

If max_level is not 4, no change for the logic to support features
needed by old ASICs.

v2: squash in CONFIG fix

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08 13:56:30 -05:00
James Zhu
0bebe9b9fc drm/amdkfd: refactor rlc/gfx spm
for adding multiple xcc support.

Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Bing Ma <Bing.Ma@amd.com>
Reviewed-by: Gang Ba <gaba@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-08 13:56:30 -05:00
Natalie Vock
8defb4f081 drm/amdgpu: Forward VMID reservation errors
Otherwise userspace may be fooled into believing it has a reserved VMID
when in reality it doesn't, ultimately leading to GPU hangs when SPM is
used.

Fixes: 80e709ee6e ("drm/amdgpu: add option params to enforce process isolation between graphics and compute")
Cc: stable@vger.kernel.org
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Natalie Vock <natalie.vock@gmx.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-12-02 11:02:07 -05:00
Prike Liang
f3854e04b7 drm/amdgpu: attach tlb fence to the PTs update
Ensure the userq TLB flush is emitted only after
the VM update finishes and the PT BOs have been
annotated with bookkeeping fences.

Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-26 11:50:43 -05:00
Timur Kristóf
8feeab26c8 drm/amdgpu/vm: Check PRT uAPI flag instead of PTE flag
This fixes sparse mappings (aka. partially resident textures).

Check the correct flags.
Since a recent refactor, the code works with uAPI flags (for
mapping buffer objects), and not PTE (page table entry) flags.

Fixes: 6716a823d1 ("drm/amdgpu: rework how PTE flags are generated v3")
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-19 17:34:15 -05:00
Christian König
20459c098d drm/amdgpu: avoid memory allocation in the critical code path v3
When we run out of VMIDs we need to wait for some to become available.
Previously we were using a dma_fence_array for that, but this means that
we have to allocate memory.

Instead just wait for the first not signaled fence from the least recently
used VMID to signal. That is not as efficient since we end up in this
function multiple times again, but allocating memory can easily fail or
deadlock if we have to wait for memory to become available.

v2: remove now unused VM manager fields
v3: fix dma_fence reference

Signed-off-by: Christian König <christian.koenig@amd.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4258
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-14 11:27:46 -05:00
Alex Deucher
f903b85ed0 drm/amdgpu: fix possible fence leaks from job structure
If we don't end up initializing the fences, free them when
we free the job.  We can't set the hw_fence to NULL after
emitting it because we need it in the cleanup path for the
submit direct case.

v2: take a reference to the fences if we emit them
v3: handle non-job fence in error paths

Fixes: db36632ea5 ("drm/amdgpu: clean up and unify hw fence handling")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> (v1)
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04 11:53:59 -05:00
Christian König
c72d41a8f3 drm/amdgpu: grab a BO reference in vm_lock_done_list.
Otherwise it is possible that between dropping the status lock and
locking the BO that the BO is freed up.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04 11:53:21 -05:00
Alex Deucher
db36632ea5 drm/amdgpu: clean up and unify hw fence handling
Decouple the amdgpu fence from the amdgpu_job structure.
This lets us clean up the separate fence ops for the embedded
fence and other fences.  This also allows us to allocate the
vm fence up front when we allocate the job.

v2: Additional cleanup suggested by Christian
v3: Additional cleanups suggested by Christian
v4: Additional cleanups suggested by David and
    vm fence fix
v5: cast seqno (David)

Cc: David.Wu3@amd.com
Cc: christian.koenig@amd.com
Tested-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:35 -04:00
Prike Liang
2e7ceac0ea drm/amdgpu: validate userq va for GEM unmap
When a user unmaps a userq VA, the driver must ensure
the queue has no in-flight jobs. If there is pending work,
the kernel should wait for the attached eviction (bookkeeping)
fence to signal before deleting the mapping.

Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:34 -04:00
Christian König
a107aeb6a2 drm/amdgpu: partially revert "revert to old status lock handling v3"
The CI systems are pointing out list corruptions, so we still need to
fix something here.

Keep the asserts, but revert the lock changes for now.

Fixes: 59e4405e9e ("drm/amdgpu: revert to old status lock handling v3")
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Jesse.Zhang
8d557eab3a drm/amdgpu: Fix general protection fault in amdgpu_vm_bo_reset_state_machine
After GPU reset with VRAM loss, a general protection fault occurs
during user queue restoration when accessing vm_bo->vm after
spinlock release in amdgpu_vm_bo_reset_state_machine.

The root cause is that vm_bo points to the last entry from the
list_for_each_entry loop, but this becomes invalid after the
spinlock is released. Accessing vm_bo->vm at this point leads
to memory corruption.

Crash log shows:
[  326.981811] Oops: general protection fault, probably for non-canonical address 0x4156415741e58ac8: 0000 [#1] SMP NOPTI
[  326.981820] CPU: 13 UID: 0 PID: 1035 Comm: kworker/13:3 Tainted: G            E       6.16.0+ #25 PREEMPT(voluntary)
[  326.981826] Tainted: [E]=UNSIGNED_MODULE
[  326.981827] Hardware name: Gigabyte Technology Co., Ltd. X870E AORUS PRO ICE/X870E AORUS PRO ICE, BIOS F3i 12/19/2024
[  326.981831] Workqueue: events amdgpu_userq_restore_worker [amdgpu]
[  326.981999] RIP: 0010:amdgpu_vm_assert_locked+0x16/0x70 [amdgpu]
[  326.982094] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 45 48 8b 87 80 03 00 00 48 85 c0 74 40 <48> 8b b8 80 01 00 00 48 85 ff 74 3b 8b 05 0c b7 0e f0 85 c0 75 05
[  326.982098] RSP: 0018:ffffaa91c2a6bc20 EFLAGS: 00010206
[  326.982100] RAX: 4156415741e58948 RBX: ffff9e8f013e8330 RCX: 0000000000000000
[  326.982102] RDX: 0000000000000005 RSI: 000000001d254e88 RDI: ffffffffc144814a
[  326.982104] RBP: ffffaa91c2a6bc68 R08: 0000004c21a25674 R09: 0000000000000001
[  326.982106] R10: 0000000000000001 R11: dccaf3f2f82863fc R12: ffff9e8f013e8000
[  326.982108] R13: ffff9e8f013e8000 R14: 0000000000000000 R15: ffff9e8f09980000
[  326.982110] FS:  0000000000000000(0000) GS:ffff9e9e79995000(0000) knlGS:0000000000000000
[  326.982112] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  326.982114] CR2: 000055ed6c9caa80 CR3: 0000000797060000 CR4: 0000000000750ef0
[  326.982116] PKRU: 55555554

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Jesse.Zhang
b809ca91a5 drm/amdgpu: Merge amdgpu_vm_set_pasid into amdgpu_vm_init
As KFD no longer uses a separate PASID, the global amdgpu_vm_set_pasid()function is no longer necessary.
Merge its functionality directly intoamdgpu_vm_init() to simplify code flow and eliminate redundant locking.

v2: remove superflous check
  adjust amdgpu_vm_fin and remove amdgpu_vm_set_pasid (Chritian)

v3: drop amdgpu_vm_assert_locked (Chritian)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4614
Fixes: 59e4405e9e ("drm/amdgpu: revert to old status lock handling v3")
Reviewed-by: Christian König <christian.koenig@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:07 -04:00
Christian König
90e09ea4cf drm/amdgpu: revert "rework reserved VMID handling" v2
This reverts commit e44a0fe630.

Initially we used VMID reservation to enforce isolation between
processes. That has now been replaced by proper fence handling.

Both OpenGL, RADV and ROCm developers requested a way to reserve a VMID
for SPM, so restore that approach by reverting back to only allowing a
single process to use the reserved VMID.

Only compile tested for now.

v2: use -ENOENT instead of -EINVAL if VMID is not available

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:39:00 -04:00
Dave Airlie
342f141ba9 Merge tag 'amd-drm-next-6.18-2025-09-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.18-2025-09-19:

amdgpu:
- Fence drv clean up fix
- DPC fixes
- Misc display fixes
- Support the MMIO remap page as a ttm pool
- JPEG parser updates
- UserQ updates
- VCN ctx handling fixes
- Documentation updates
- Misc cleanups
- SMU 13.0.x updates
- SI DPM updates
- GC 11.x cleaner shader updates
- DMCUB updates
- DML fixes
- Improve fallback handling for pixel encoding
- VCN reset improvements
- DCE6 DC updates
- DSC fixes
- Use devm for i2c buses
- GPUVM locking updates
- GPUVM documentation improvements
- Drop non-DC DCE11 code
- S0ix fixes
- Backlight fix
- SR-IOV fixes

amdkfd:
- SVM updates

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250919193354.2989255-1-alexander.deucher@amd.com
2025-09-22 08:45:51 +10:00
Christian König
59e4405e9e drm/amdgpu: revert to old status lock handling v3
It turned out that protecting the status of each bo_va with a
spinlock was just hiding problems instead of solving them.

Revert the whole approach, add a separate stats_lock and lockdep
assertions that the correct reservation lock is held all over the place.

This not only allows for better checks if a state transition is properly
protected by a lock, but also switching back to using list macros to
iterate over the state of lists protected by the dma_resv lock of the
root PD.

v2: re-add missing check
v3: split into two patches

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 16:59:14 -04:00
Sunil Khatri
0aa09d8a6c drm/amdgpu: add missing comment for the new argument
In function 'amdgpu_vm_lock_done_list' update the comment
for the new argument 'vm'.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509180211.UAqME0zj-lkp@intel.com/
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18 09:43:38 -04:00
Christian König
930595df25 drm/amdgpu: remove check for BO reservation add assert instead
We should leave such checks to lockdep and not implement something
manually.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-16 17:51:35 -04:00
Christian König
39203f5e6d drm/amdgpu: fix userq VM validation v4
That was actually complete nonsense and not validating the BOs
at all. The code just cleared all VM areas were it couldn't grab the
lock for a BO.

Try to fix this. Only compile tested at the moment.

v2: fix fence slot reservation as well as pointed out by Sunil.
    also validate PDs, PTs, per VM BOs and update PDEs
v3: grab the status_lock while working with the done list.
v4: rename functions, add some comments, fix waiting for updates to
    complete.
v4: rename amdgpu_vm_lock_done_list(), add some more comments

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-16 17:47:06 -04:00
Dave Airlie
0d9f0083f7 Merge tag 'v6.17-rc6' into drm-next
This is a backmerge of Linux 6.17-rc6, needed for msm,
also requested by misc.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-09-15 17:51:07 +10:00
Dave Airlie
6dc1d3c191 Merge tag 'drm-misc-next-2025-09-04' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-next for v6.18:

Cross-subsystem Changes:

- Update a number of DT bindings for STM32MP25 Arm SoC

Core Changes:

gem:
- Simplify locking for GPUVM

panel-backlight-quirks:
- Add additional quirks for EDID, DMI, brightness

sched:
- Fix race condition in trace code
- Clean up

sysfb:
- Clean up

Driver Changes:

amdgpu:
- Give kernel jobs a unique id for better tracing

amdxdna:
- Improve error reporting

bridge:
- Improve ref counting on bridge management
- adv7511: Provide SPD and HDMI infoframes
- it6505: Replace crypto_shash with sha()
- synopsys: Add support for DW DPTX Controller plus DT bindings

gud:
- Replace simple-KMS pipe with regular atomic helpers

imagination:
- Improve power management
- Add support for TH1520 GPU
- Support Risc-V architectures

ivpu:
- Clean up

nouveau:
- Improve error reporting

panthor:
- Fail VM bind if BO has offset
- Clean up

rcar-du:
- Make number of lanes configurable

rockchip:
- Add support for RK3588 DPTX output

rocket:
- Use kfree() and sizeof() correctly
- Test DMA status
- Clean up

sitronix:
- st7571-i2c: Add support for inverted displays and 2-bit grayscale
- Clean up

stm:
- ltdc: Add support support for STM32MP257F-EV1 plus DT bindings

tidss:
- Convert to kernel's FIELD_ macros

v3d:
- Improve job management and locking

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/r/20250904090932.GA193997@linux.fritz.box
2025-09-05 11:49:01 +10:00
Pierre-Eric Pelloux-Prayer
256576ed68 drm/amdgpu: give each kernel job a unique id
Userspace jobs have drm_file.client_id as a unique identifier
as job's owners. For kernel jobs, we can allocate arbitrary
values - the risk of overlap with userspace ids is small (given
that it's a u64 value).
In the unlikely case the overlap happens, it'll only impact
trace events.

Since this ID is traced in the gpu_scheduler trace events, this
allows to determine the source of each job sent to the hardware.

To make grepping easier, the IDs are defined as they will appear
in the trace output.

Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Link: https://lore.kernel.org/r/20250604122827.2191-1-pierre-eric.pelloux-prayer@amd.com
2025-09-01 10:49:31 +05:30
Maxime Ripard
1a2cf179e2 Merge drm/drm-fixes into drm-misc-fixes
Update drm-misc-fixes to -rc2.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-08-20 16:08:49 +02:00
Thomas Zimmermann
3eb61d7cb7 Revert "drm/amdgpu: Use dma_buf from GEM object instance"
This reverts commit 515986100d.

The dma_buf field in struct drm_gem_object is not stable over the
object instance's lifetime. The field becomes NULL when user space
releases the final GEM handle on the buffer object. This resulted
in a NULL-pointer deref.

Workarounds in commit 5307dce878 ("drm/gem: Acquire references on
GEM handles for framebuffers") and commit f6bfc9afc7 ("drm/framebuffer:
Acquire internal references on GEM handles") only solved the problem
partially. They especially don't work for buffer objects without a DRM
framebuffer associated.

Hence, this revert to going back to using .import_attach->dmabuf.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250715082635.34974-1-tzimmermann@suse.de
2025-08-13 11:14:17 +02:00
Liu01 Tong
aa5fc4362f drm/amdgpu: fix task hang from failed job submission during process kill
During process kill, drm_sched_entity_flush() will kill the vm
entities. The following job submissions of this process will fail, and
the resources of these jobs have not been released, nor have the fences
been signalled, causing tasks to hang and timeout.

Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to
stopped entity.

v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in
function amdgpu_cs_vm_handling().

Fixes: 1f02f2044b ("drm/amdgpu: Avoid extra evict-restore process.")
Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com>
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f101c13a87)
2025-08-12 16:16:36 -04:00
Liu01 Tong
f101c13a87 drm/amdgpu: fix task hang from failed job submission during process kill
During process kill, drm_sched_entity_flush() will kill the vm
entities. The following job submissions of this process will fail, and
the resources of these jobs have not been released, nor have the fences
been signalled, causing tasks to hang and timeout.

Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to
stopped entity.

v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in
function amdgpu_cs_vm_handling().

Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com>
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:22:54 -04:00
Christian König
6716a823d1 drm/amdgpu: rework how PTE flags are generated v3
Previously we tried to keep the HW specific PTE flags in each mapping,
but for CRIU that isn't sufficient any more since the original value is
needed for the checkpoint procedure.

So rework the whole handling, nuke the early mapping function, keep the
UAPI flags in each mapping instead of the HW flags and translate them to
the HW flags while filling in the PTEs.

Only tested on Navi 23 for now, so probably needs quite a bit of more
work.

v2: fix KFD and SVN handling
v3: one more SVN fix pointed out by Felix
v4: squash in gfx12 fix from David

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:26:38 -04:00
Gang Ba
1f02f2044b drm/amdgpu: Avoid extra evict-restore process.
If vm belongs to another process, this is fclose after fork,
wait may enable signaling KFD eviction fence and cause parent process queue evicted.

[677852.634569]  amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
[677852.634814]  __dma_fence_enable_signaling+0x3e/0xe0
[677852.634820]  dma_fence_wait_timeout+0x3a/0x140
[677852.634825]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[677852.634831]  amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
[677852.635026]  amdgpu_flush+0x34/0x50 [amdgpu]
[677852.635208]  filp_flush+0x38/0x90
[677852.635213]  filp_close+0x14/0x30
[677852.635216]  do_close_on_exec+0xdd/0x130
[677852.635221]  begin_new_exec+0x1da/0x490
[677852.635225]  load_elf_binary+0x307/0xea0
[677852.635231]  ? srso_alias_return_thunk+0x5/0xfbef5
[677852.635235]  ? ima_bprm_check+0xa2/0xd0
[677852.635240]  search_binary_handler+0xda/0x260
[677852.635245]  exec_binprm+0x58/0x1a0
[677852.635249]  bprm_execve.part.0+0x16f/0x210
[677852.635254]  bprm_execve+0x45/0x80
[677852.635257]  do_execveat_common.isra.0+0x190/0x200

Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Gang Ba <Gang.Ba@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-07-28 16:25:19 -04:00
Alex Deucher
77cc0da39c drm/amdgpu: track ring state associated with a fence
We need to know the wptr and sequence number associated
with a fence so that we can re-emit the unprocessed state
after a ring reset.  Pre-allocate storage space for
the ring buffer contents and add helpers to save off
and re-emit the unprocessed state so that it can be
re-emitted after the queue is reset.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:14:11 -04:00
Dave Airlie
7e2818386a Merge tag 'amd-drm-next-6.17-2025-07-01' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.17-2025-07-01:

amdgpu:
- FAMS2 fixes
- OLED fixes
- Misc cleanups
- AUX fixes
- DMCUB updates
- SR-IOV hibernation support
- RAS updates
- DP tunneling fixes
- DML2 fixes
- Backlight improvements
- Suspend improvements
- Use scaling for non-native modes on eDP
- SDMA 4.4.x fixes
- PCIe DPM fixes
- SDMA 5.x fixes
- Cleaner shader updates for GC 9.x
- Remove fence slab
- ISP genpd support
- Parition handling rework
- SDMA FW checks for userq support
- Add missing firmware declaration
- Fix leak in amdgpu_ctx_mgr_entity_fini()
- Freesync fix
- Ring reset refactoring
- Legacy dpm verbosity changes

amdkfd:
- GWS fix
- mtype fix for ext coherent system memory
- MMU notifier fix
- gfx7/8 fix

radeon:
- CS validation support for additional GL extensions
- Bump driver version for new CS validation checks

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250701194707.32905-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-07-04 10:06:29 +10:00
Thomas Zimmermann
515986100d drm/amdgpu: Use dma_buf from GEM object instance
Avoid dereferencing struct drm_gem_object.import_attach for the
imported dma-buf. The dma_buf field in the GEM object instance refers
to the same buffer. Prepares to make import_attach an implementation
detail of PRIME.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:55:16 -04:00
Thomas Zimmermann
4948e6c7fb drm/amdgpu: Test for imported buffers with drm_gem_is_imported()
Instead of testing import_attach for imported GEM buffers, invoke
drm_gem_is_imported() to do the test.

v2:
- keep amdgpu_bo_print_info() as-is (Christian)

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:54:59 -04:00
Lijo Lazar
a3e510fd69 drm/amdgpu: Convert from DRM_* to dev_*
Convert from generic DRM_* to dev_* calls to have device context info.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:54:55 -04:00
André Almeida
35dc4ce200 drm: amdgpu: Use struct drm_wedge_task_info inside of struct amdgpu_task_info
To avoid a cast when calling drm_dev_wedged_event(), replace pid and
task name inside of struct amdgpu_task_info with struct
drm_wedge_task_info.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-6-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
André Almeida
3bfd1af74a drm: amdgpu: Create amdgpu_vm_print_task_info()
To avoid repetitive code in amdgpu, create a function that prints the
content of struct amdgpu_task_info.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-3-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
André Almeida
2a4f069d0f drm: amdgpu: Allow NULL pointers at amdgpu_vm_put_task_info()
Allow NULL pointers at amdgpu_vm_put_task_info() as it common practice
for "put" or "free" functions. This avoid an extra check for NULL for
callers.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-2-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00