2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

13994 Commits

Author SHA1 Message Date
YiPeng Chai
bfa579b38b drm/amdgpu: prepare to handle pasid poison consumption
Prepare to handle pasid poison consumption.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
YiPeng Chai
314c38cde6 drm/amdgpu: retire bad pages for umc v12_0
Retire bad pages for umc v12_0.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
YiPeng Chai
e74313be5a drm/amdgpu: add condition check for amdgpu_umc_fill_error_record
Add condition check for amdgpu_umc_fill_error_record.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
2cf8e50ec3 drm/amdgpu: Add delay work to retire bad pages
Add delay work to retire bad pages.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
f27defca68 drm/amdgpu: umc v12_0 logs ecc errors
1. umc v12_0 logs ecc errors.
2. Reserve newly detected ecc error pages.
3. Add tag for bad pages, so that they can
   be retired later.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
b2aa6b108d drm/amdgpu: umc v12_0 converts error address
Umc v12_0 converts error address.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
95b4063de4 drm/amdgpu: add interface to update umc v12_0 ecc status
Add interface to update umc v12_0 ecc status.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
a734adfbcd drm/amdgpu: add poison creation handler
Add poison creation handler.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
f493dd64ee drm/amdgpu: prepare for logging ecc errors
Prepare for logging ecc errors.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
98b5bc878d drm/amdgpu: add message fifo to handle RAS poison events
Add message fifo to handle RAS poison events.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
Jesse Zhang
88a9a467c5 drm/amdgpu: Using uninitialized value *size when calling amdgpu_vce_cs_reloc
Initialize the size before calling amdgpu_vce_cs_reloc, such as case 0x03000001.
V2: To really improve the handling we would actually
   need to have a separate value of 0xffffffff.(Christian)

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
Alex Deucher
eef016ba89 drm/amdgpu/mes11: Use a separate fence per transaction
We can't use a shared fence location because each transaction
should be considered independently.

Reviewed-by: Shaoyun.liu <shaoyunl@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
Alex Deucher
497d7cee24 drm/amdgpu: add a spinlock to wb allocation
As we use wb slots more dynamically, we need to lock
access to avoid racing on allocation or free.

Reviewed-by: Shaoyun.liu <shaoyunl@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:40 -04:00
Sonny Jiang
754c366e41 drm/amdgpu: update fw_share for VCN5
kmd_fw_shared changed in VCN5

Signed-off-by: Sonny Jiang <sonny.jiang@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:40 -04:00
Mukul Joshi
8e1d190595 drm/amdgpu: Fix VRAM memory accounting
Subtract the VRAM pinned memory when checking for available memory
in amdgpu_amdkfd_reserve_mem_limit function since that memory is not
available for use.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:40 -04:00
Sathishkumar S
c551316e15 drm/amdgpu: update jpeg max decode resolution
jpeg ip version v2.1 and higher supports 16kx16k resolution decode

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:40 -04:00
Sunil Khatri
af8644121e drm/amdgpu: add ip dump for each ip in devcoredump
Add ip dump for each ip of the asic in the
devcoredump for all the ips where a callback
is registered for register dump.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:40 -04:00
Sunil Khatri
e043a35dc2 drm/amdgpu: dump ip state before reset for each ip
Invoke the dump_ip_state function for each ip before
the asic resets and save the register values for
debugging via devcoredump.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:39 -04:00
Sunil Khatri
c8732c80de drm/amdgpu: add support for gfx v10 print
Add support to print ip information to be
used to print registers in devcoredump
buffer.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:39 -04:00
Sunil Khatri
40356542c3 drm/amdgpu: add protype for print ip state
Add the protoype for print ip state to be used
to print the registers in devcoredump during
a gpu reset.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:39 -04:00
Sunil Khatri
c395dbb68b drm/amdgpu: add support of gfx10 register dump
Adding gfx10 gc registers to be used for register
dump via devcoredump during a gpu reset.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:39 -04:00
Sunil Khatri
e21d253bd7 drm/amdgpu: add prototype for ip dump
Add the prototype to dump ip registers
for all ips of different asics and set
them to NULL for now. Based on the
requirement add a function pointer for
each of them.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:39 -04:00
YiPeng Chai
af730e0820 drm/amdgpu: Add interface to reserve bad page
Add interface to reserve bad page.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:39 -04:00
Ma Jun
60c448439f drm/amdgpu: Fix uninitialized variable warnings
return 0 to avoid returning an uninitialized variable r

Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:39 -04:00
Jack Xiao
f88da7fbf6 drm/amdgpu/mes: fix use-after-free issue
Delete fence fallback timer to fix the ramdom
use-after-free issue.

v2: move to amdgpu_mes.c

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:39 -04:00
Alex Deucher
e0a9bbeea0 drm/amdgpu/sdma5.2: use legacy HDP flush for SDMA2/3
This avoids a potential conflict with firmwares with the newer
HDP flush mechanism.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:39 -04:00
Rajneesh Bhardwaj
a16b951586 drm/amdgpu: Update CGCG settings for GFXIP 9.4.3
Tune coarse grain clock gating idle threshold and rlc idle timeout to
achieve better kernel launch latency.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:38 -04:00
Frank Min
ea9238a81b drm/amdgpu: replace tmz flag into buffer flag
Replace tmz flag into buffer flag to make it easier to understand
and extend

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:38 -04:00
Le Ma
92ed1e9cd5 drm/amdgpu: init microcode chip name from ip versions
To adapt to different gc versions in gfx_v9_4_3.c file.

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:38 -04:00
Prike Liang
26de73bc0a drm/amdgpu: Fix the ring buffer size for queue VM flush
Here are the corrections needed for the queue ring buffer size
calculation for the following cases:
- Remove the KIQ VM flush ring usage.
- Add the invalidate TLBs packet for gfx10 and gfx11 queue.
- There's no VM flush and PFP sync, so remove the gfx9 real
  ring and compute ring buffer usage.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:38 -04:00
Lang Yu
a522ec528c drm/amdgpu/umsch: don't execute umsch test when GPU is in reset/suspend
umsch test needs full GPU functionality(e.g., VM update, TLB flush,
possibly buffer moving under memory pressure) which may be not ready
under these states. Just skip it to avoid potential issues.

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Veerabadhran Gopalakrishnan <Veerabadhran.Gopalakrishnan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:38 -04:00
Damien Le Moal
6927b01680 drm/amdgpu: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
Use the macro PCI_IRQ_INTX instead of the deprecated PCI_IRQ_LEGACY macro.

Link: https://lore.kernel.org/r/20240325070944.3600338-11-dlemoal@kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-25 12:53:31 -05:00
Jack Xiao
9482552820 drm/amdgpu/mes: fix use-after-free issue
Delete fence fallback timer to fix the ramdom
use-after-free issue.

v2: move to amdgpu_mes.c

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 23:23:46 -04:00
Alex Deucher
9792b7cc18 drm/amdgpu/sdma5.2: use legacy HDP flush for SDMA2/3
This avoids a potential conflict with firmwares with the newer
HDP flush mechanism.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-23 23:23:46 -04:00
Prike Liang
fe93b0927b drm/amdgpu: Fix the ring buffer size for queue VM flush
Here are the corrections needed for the queue ring buffer size
calculation for the following cases:
- Remove the KIQ VM flush ring usage.
- Add the invalidate TLBs packet for gfx10 and gfx11 queue.
- There's no VM flush and PFP sync, so remove the gfx9 real
  ring and compute ring buffer usage.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 23:23:46 -04:00
Lang Yu
661d71ee5a drm/amdgpu/umsch: don't execute umsch test when GPU is in reset/suspend
umsch test needs full GPU functionality(e.g., VM update, TLB flush,
possibly buffer moving under memory pressure) which may be not ready
under these states. Just skip it to avoid potential issues.

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Veerabadhran Gopalakrishnan <Veerabadhran.Gopalakrishnan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-23 23:23:28 -04:00
Felix Kuehling
b0b13d5321 drm/amdgpu: Update BO eviction priorities
Make SVM BOs more likely to get evicted than other BOs. These BOs
opportunistically use available VRAM, but can fall back relatively
seamlessly to system memory. It also avoids SVM migrations evicting
other, more important BOs as they will evict other SVM allocations
first.

Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Mukul Joshi <mukul.joshi@amd.com>
Tested-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 23:23:28 -04:00
Peyton Lee
d59198d2d0 drm/amdgpu/vpe: fix vpe dpm setup failed
The vpe dpm settings should be done before firmware is loaded.
Otherwise, the frequency cannot be successfully raised.

Signed-off-by: Peyton Lee <peytolee@amd.com>
Reviewed-by: Lang Yu <lang.yu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 23:23:02 -04:00
Lijo Lazar
aebd3eb9d3 drm/amdgpu: Assign correct bits for SDMA HDP flush
HDP Flush request bit can be kept unique per AID, and doesn't need to be
unique SOC-wide. Assign only bits 10-13 for SDMA v4.4.2.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-23 23:22:54 -04:00
Lang Yu
9c783a1121 drm/amdkfd: make sure VM is ready for updating operations
When page table BOs were evicted but not validated before
updating page tables, VM is still in evicting state,
amdgpu_vm_update_range returns -EBUSY and
restore_process_worker runs into a dead loop.

v2: Split the BO validation and page table update into two
separate loops in amdgpu_amdkfd_restore_process_bos. (Felix)
  1.Validate BOs
  2.Validate VM (and DMABuf attachments)
  3.Update page tables for the BOs validated above

Fixes: 50661eb1a2 ("drm/amdgpu: Auto-validate DMABuf imports in compute VMs")
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-23 23:22:23 -04:00
Mukul Joshi
25e9227c6a drm/amdgpu: Fix leak when GPU memory allocation fails
Free the sync object if the memory allocation fails for any
reason.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-23 23:17:30 -04:00
Felix Kuehling
e76691f45a drm/amdgpu: Update BO eviction priorities
Make SVM BOs more likely to get evicted than other BOs. These BOs
opportunistically use available VRAM, but can fall back relatively
seamlessly to system memory. It also avoids SVM migrations evicting
other, more important BOs as they will evict other SVM allocations
first.

Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Mukul Joshi <mukul.joshi@amd.com>
Tested-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 12:08:31 -04:00
Pierre-Eric Pelloux-Prayer
6e042cee74 drm/amdgpu/vcn: fix unitialized variable warnings
Avoid returning an uninitialized value if we never enter the loop.
This case should never be hit in practice, but returning 0 doesn't
hurt.

The same fix is applied to the 4 places using the same pattern.

v2: - fixed typos in commit message (Alex)
    - use "return 0;" before the done label instead of initializing
      r to 0

Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 12:08:30 -04:00
Alex Deucher
3f0664110a drm/amdgpu/mes11: print MES opcodes rather than numbers
Makes it easier to review the logs when there are MES
errors.

v2: use dbg for emitted, add helpers for fetching strings
v3: fix missing commas (Harish)
v4: drop command prefixes (Felix)
v5: squash in bounds fix (Jun)

Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed by Shaoyun.liu <Shaoyun.liu@amd.com> (v2)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 12:08:30 -04:00
Peyton Lee
2476c6bd95 drm/amdgpu/vpe: fix vpe dpm setup failed
The vpe dpm settings should be done before firmware is loaded.
Otherwise, the frequency cannot be successfully raised.

Signed-off-by: Peyton Lee <peytolee@amd.com>
Reviewed-by: Lang Yu <lang.yu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 12:08:30 -04:00
Lijo Lazar
7b19f1f346 drm/amdgpu: Assign correct bits for SDMA HDP flush
HDP Flush request bit can be kept unique per AID, and doesn't need to be
unique SOC-wide. Assign only bits 10-13 for SDMA v4.4.2.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 12:08:30 -04:00
Stanley.Yang
939c475181 drm/amdgpu: Support setting reset_method at runtime
In order to support more test cases, support user
change reset_method at runtime.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 12:08:30 -04:00
Maxime Ripard
c058e7a8f8
Merge drm/drm-next into drm-misc-next
Maíra needs a backmerge to apply v3d patches, and Danilo for some
nouveau patches.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
2024-04-23 08:48:56 +02:00
Arunpravin Paneer Selvam
a68c7eaa7a drm/amdgpu: Enable clear page functionality
Add clear page support in vram memory region.

v1(Christian):
  - Dont handle clear page as TTM flag since when moving the BO back
    in from GTT again we don't need that.
  - Make a specialized version of amdgpu_fill_buffer() which only
    clears the VRAM areas which are not already cleared
  - Drop the TTM_PL_FLAG_WIPE_ON_RELEASE check in
    amdgpu_object.c

v2:
  - Modify the function name amdgpu_ttm_* (Alex)
  - Drop the delayed parameter (Christian)
  - handle amdgpu_res_cleared(&cursor) just above the size
    calculation (Christian)
  - Use AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE for clearing the buffers
    in the free path to properly wait for fences etc.. (Christian)

v3(Christian):
  - Remove buffer clear code in VRAM manager instead change the
    AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE handling to set
    the DRM_BUDDY_CLEARED flag.
  - Remove ! from amdgpu_res_cleared(&cursor) check.

v4(Christian):
  - vres flag setting move to vram manager file
  - use dma_fence_get_stub in amdgpu_ttm_clear_buffer function
  - make fence a mandatory parameter and drop the if and the get/put dance

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240419063538.11957-2-Arunpravin.PaneerSelvam@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
2024-04-22 19:44:16 +02:00
Arunpravin Paneer Selvam
96950929eb drm/buddy: Implement tracking clear page feature
- Add tracking clear page feature.

- Driver should enable the DRM_BUDDY_CLEARED flag if it
  successfully clears the blocks in the free path. On the otherhand,
  DRM buddy marks each block as cleared.

- Track the available cleared pages size

- If driver requests cleared memory we prefer cleared memory
  but fallback to uncleared if we can't find the cleared blocks.
  when driver requests uncleared memory we try to use uncleared but
  fallback to cleared memory if necessary.

- When a block gets freed we clear it and mark the freed block as cleared,
  when there are buddies which are cleared as well we can merge them.
  Otherwise, we prefer to keep the blocks as separated.

- Add a function to support defragmentation.

v1:
  - Depends on the flag check DRM_BUDDY_CLEARED, enable the block as
    cleared. Else, reset the clear flag for each block in the list(Christian)
  - For merging the 2 cleared blocks compare as below,
    drm_buddy_is_clear(block) != drm_buddy_is_clear(buddy)(Christian)
  - Defragment the memory beginning from min_order
    till the required memory space is available.

v2: (Matthew)
  - Add a wrapper drm_buddy_free_list_internal for the freeing of blocks
    operation within drm buddy.
  - Write a macro block_incompatible() to allocate the required blocks.
  - Update the xe driver for the drm_buddy_free_list change in arguments.
  - add a warning if the two blocks are incompatible on
    defragmentation
  - call full defragmentation in the fini() function
  - place a condition to test if min_order is equal to 0
  - replace the list with safe_reverse() variant as we might
    remove the block from the list.

v3:
  - fix Gitlab user reported lockup issue.
  - Keep DRM_BUDDY_HEADER_CLEAR define sorted(Matthew)
  - modify to pass the root order instead max_order in fini()
    function(Matthew)
  - change bool 1 to true(Matthew)
  - add check if min_block_size is power of 2(Matthew)
  - modify the min_block_size datatype to u64(Matthew)

v4:
  - rename the function drm_buddy_defrag with __force_merge.
  - Include __force_merge directly in drm buddy file and remove
    the defrag use in amdgpu driver.
  - Remove list_empty() check(Matthew)
  - Remove unnecessary space, headers and placement of new variables(Matthew)
  - Add a unit test case(Matthew)

v5:
  - remove force merge support to actual range allocation and not to bail
    out when contains && split(Matthew)
  - add range support to force merge function.

v6:
  - modify the alloc_range() function clear page non merged blocks
    allocation(Matthew)
  - correct the list_insert function name(Matthew).

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Suggested-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240419063538.11957-1-Arunpravin.PaneerSelvam@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
2024-04-22 19:44:16 +02:00
Dave Airlie
377b5b397d amd-drm-next-6.10-2024-04-19:
amdgpu:
 - DC resource allocation logic updates
 - DC IPS fixes
 - DC YUV fixes
 - DMCUB fixes
 - DML2 fixes
 - Devcoredump updates
 - USB-C DSC fix
 - Misc display code cleanups
 - PSR fixes
 - MES timeout fix
 - RAS updates
 - UAF fix in VA IOCTL
 - Fix visible VRAM handling during faults
 - Fix IP discovery handling during PCI rescans
 - Misc code cleanups
 - PSP 14 updates
 - More runtime PM code rework
 - SMU 14.0.2 support
 - GPUVM page fault redirection to secondary IH rings for IH 6.x
 - Suspend/resume fixes
 - SR-IOV fixes
 
 amdkfd:
 - Fix eviction fence handling
 - Fix leak in GPU memory allocation failure case
 - DMABuf import handling fix
 
 radeon:
 - Silence UBSAN warnings related to flexible arrays
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZiLyawAKCRC93/aFa7yZ
 2BzfAPoDVZjTunizh6SyCFmQamR3eelnxWeY1xaVzmKBHqLCOAEAo2EyThRGyPCH
 SjD+f+ZlflaXQZtZpiQrOr0rkLvh5Q4=
 =96hT
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.10-2024-04-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.10-2024-04-19:

amdgpu:
- DC resource allocation logic updates
- DC IPS fixes
- DC YUV fixes
- DMCUB fixes
- DML2 fixes
- Devcoredump updates
- USB-C DSC fix
- Misc display code cleanups
- PSR fixes
- MES timeout fix
- RAS updates
- UAF fix in VA IOCTL
- Fix visible VRAM handling during faults
- Fix IP discovery handling during PCI rescans
- Misc code cleanups
- PSP 14 updates
- More runtime PM code rework
- SMU 14.0.2 support
- GPUVM page fault redirection to secondary IH rings for IH 6.x
- Suspend/resume fixes
- SR-IOV fixes

amdkfd:
- Fix eviction fence handling
- Fix leak in GPU memory allocation failure case
- DMABuf import handling fix

radeon:
- Silence UBSAN warnings related to flexible arrays

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240419224332.2938259-1-alexander.deucher@amd.com
2024-04-22 12:28:49 +10:00
Lang Yu
81bf14519a drm/amdkfd: make sure VM is ready for updating operations
When page table BOs were evicted but not validated before
updating page tables, VM is still in evicting state,
amdgpu_vm_update_range returns -EBUSY and
restore_process_worker runs into a dead loop.

v2: Split the BO validation and page table update into two
separate loops in amdgpu_amdkfd_restore_process_bos. (Felix)
  1.Validate BOs
  2.Validate VM (and DMABuf attachments)
  3.Update page tables for the BOs validated above

Fixes: 50661eb1a2 ("drm/amdgpu: Auto-validate DMABuf imports in compute VMs")
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:54:49 -04:00
Mukul Joshi
e53a1713de drm/amdgpu: Fix leak when GPU memory allocation fails
Free the sync object if the memory allocation fails for any
reason.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:54:49 -04:00
Zhigang Luo
6a009ca1bf drm/amdgpu: remove virt_init_data_exchange from poison consumption handler
Host will initiate an FLR for all poison consumption.
Guest should wait for FLR message to re-init data exchange.

Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:47:26 -04:00
Lijo Lazar
8954c3fbe7 drm/amdgpu: Change AID detection logic
On GFX 9.4.3 SOCs, only 2 SDMA instances need to be available to be
considered as a valid AID.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:47:19 -04:00
Sunil Khatri
93522c1948 drm/amdgpu: enable redirection of irq's for IH V6.1
Enable redirection of irq for pagefaults for specific
clients to avoid overflow without dropping interrupts.

So here we redirect the interrupts to another IH ring
i.e ring1 where only these interrupts are processed.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:46:56 -04:00
Ahmad Rehman
ea137071ad drm/amdgpu: Skip the coredump collection on reset during driver reload
In passthrough environment, the driver triggers the mode-1 reset on
reload. The reset causes the core dump collection which is delayed task
and prevents driver from unloading until it is completed. Since we do
not need to collect data on "reset on reload" case, we can skip core
dump collection.

v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.

Signed-off-by: Ahmad Rehman <Ahmad.Rehman@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:46:45 -04:00
Sunil Khatri
ca0afa2f41 drm/amdgpu: enable redirection of irq's for IH V6.0
Enable redirection of irq for pagefaults for specific
clients to avoid overflow without dropping interrupts.

So here we redirect the interrupts to another IH ring
i.e ring1 where only these interrupts are processed.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:46:37 -04:00
Sunil Khatri
5adcd78fa2 drm:amdgpu: enable IH ring1 for IH v6.1
We need IH ring1 for handling the pagefault
interrupts which over flow in default
ring for specific usecases.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:45:37 -04:00
Sunil Khatri
eefc85a277 drm:amdgpu: enable IH RB ring1 for IH v6.0
We need IH ring1 for handling the pagefault
interrupts which are overflowing the default
ring for specific usecases.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:45:22 -04:00
Christian König
a6ff969fe9 drm/amdgpu: fix visible VRAM handling during faults
When we removed the hacky start code check we actually didn't took into
account that *all* VRAM pages needs to be CPU accessible.

Clean up the code and unify the handling into a single helper which
checks if the whole resource is CPU accessible.

The only place where a partial check would make sense is during
eviction, but that is neglitible.

Signed-off-by: Christian König <christian.koenig@amd.com>
Fixes: aed01a6804 ("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
CC: stable@vger.kernel.org
2024-04-17 11:50:43 -04:00
xinhui pan
6fef2d4c00 drm/amdgpu: validate the parameters of bo mapping operations more clearly
Verify the parameters of
amdgpu_vm_bo_(map/replace_map/clearing_mappings) in one common place.

Fixes: dc54d3d174 ("drm/amdgpu: implement AMDGPU_VA_OP_CLEAR v2")
Cc: stable@vger.kernel.org
Reported-by: Vlad Stolyarov <hexed@google.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-17 11:50:43 -04:00
Christian König
ca7c4507ba drm/amdgpu: remove invalid resource->start check v2
The majority of those where removed in the commit aed01a6804
("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")

But this one was missed because it's working on the resource and not the
BO. Since we also no longer use a fake start address for visible BOs
this will now trigger invalid mapping errors.

v2: also remove the unused variable

Signed-off-by: Christian König <christian.koenig@amd.com>
Fixes: aed01a6804 ("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")
CC: stable@vger.kernel.org
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-17 11:02:57 -04:00
Dave Airlie
34633158b8 amd-drm-next-6.10-2024-04-13:
amdgpu:
 - HDCP fixes
 - ODM fixes
 - RAS fixes
 - Devcoredump improvements
 - Misc code cleanups
 - Expose VCN activity via sysfs
 - SMY 13.0.x updates
 - Enable fast updates on DCN 3.1.4
 - Add dclk and vclk reporting on additional devices
 - Add ACA RAS infrastructure
 - Implement TLB flush fence
 - EEPROM handling fixes
 - SMUIO 14.0.2 support
 - SMU 14.0.1 Updates
 - Sync page table freeing with TLB flushes
 - DML2 refactor
 - DC debug improvements
 - SR-IOV fixes
 - Suspend and Resume fixes
 - DCN 3.5.x Updates
 - Z8 fixes
 - UMSCH fixes
 - GPU reset fixes
 - HDP fix for second GFX pipe on GC 10.x
 - Enable secondary GFX pipe on GC 10.3
 - Refactor and clean up BACO/BOCO/BAMACO handling
 - VCN partitioning fix
 - DC DWB fixes
 - VSC SDP fixes
 - DCN 3.1.6 fix
 - GC 11.5 fixes
 - Remove invalid TTM resource start check
 - DCN 1.0 fixes
 
 amdkfd:
 - MQD handling cleanup
 - Preemption handling fixes for XCDs
 - TLB flush fix for GC 9.4.2
 - Properly clean up workqueue during module unload
 - Fix memory leak process create failure
 - Range check CP bad op exception targets to avoid reporting invalid exceptions to userspace
 
 radeon:
 - Misc code cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZhr4EAAKCRC93/aFa7yZ
 2B8jAP9z1JpOnjSQvc2mhHAooXRYO4Mj5HCQ25ZE8N4c8ZZhjAEAqefmEx5/UyLh
 lv2pWILL4o597qhq9nA7hJ6tTICLPAU=
 =HUwY
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.10-2024-04-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.10-2024-04-13:

amdgpu:
- HDCP fixes
- ODM fixes
- RAS fixes
- Devcoredump improvements
- Misc code cleanups
- Expose VCN activity via sysfs
- SMY 13.0.x updates
- Enable fast updates on DCN 3.1.4
- Add dclk and vclk reporting on additional devices
- Add ACA RAS infrastructure
- Implement TLB flush fence
- EEPROM handling fixes
- SMUIO 14.0.2 support
- SMU 14.0.1 Updates
- Sync page table freeing with TLB flushes
- DML2 refactor
- DC debug improvements
- SR-IOV fixes
- Suspend and Resume fixes
- DCN 3.5.x Updates
- Z8 fixes
- UMSCH fixes
- GPU reset fixes
- HDP fix for second GFX pipe on GC 10.x
- Enable secondary GFX pipe on GC 10.3
- Refactor and clean up BACO/BOCO/BAMACO handling
- VCN partitioning fix
- DC DWB fixes
- VSC SDP fixes
- DCN 3.1.6 fix
- GC 11.5 fixes
- Remove invalid TTM resource start check
- DCN 1.0 fixes

amdkfd:
- MQD handling cleanup
- Preemption handling fixes for XCDs
- TLB flush fix for GC 9.4.2
- Properly clean up workqueue during module unload
- Fix memory leak process create failure
- Range check CP bad op exception targets to avoid reporting invalid exceptions to userspace

radeon:
- Misc code cleanups

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240413213708.3427038-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2024-04-17 15:48:59 +10:00
Kenneth Feng
0c1195ca0d drm/amd/swsmu: support smu block discovery for smu v14
Support for smu ip block add for SMU v14.

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:16 -04:00
Hawking Zhang
577cbed318 drm/amdgpu: rename DBG_DRV to HAD_DRV for psp v14
Add a psp bl command enum for HAD_DRV.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:16 -04:00
Ma Jun
1347853271 drm/amdgpu: refactoring the runtime pm mode detection code
refactor the code of runtime pm mode detection to support
amdgpu_runtime_pm =2 and 1 two cases

Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:16 -04:00
Hawking Zhang
6c6acc5f33 drm/amdgpu: Load ipkeymgr drv for psp v14
while DBG_DRV is renamed to HAD_DRV for psp v14,
part of its APIs/functionality is moved to a new
component named Ipkeymgr_Drv.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:16 -04:00
Thorsten Blum
12b8b4e685 drm/amdgpu: Add missing space to DRM_WARN() message
s/,please/, please/

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:15 -04:00
Ma Jun
959056982a drm/amdgpu: Fix discovery initialization failure during pci rescan
Waiting for system ready to fix the discovery initialization
failure issue. This failure usually occurs when dGPU is removed
and then rescanned via command line.
It's caused by following two errors:
[1] vram size is 0
[2] wrong binary signature

Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:15 -04:00
Christian König
394ae0603a drm/amdgpu: fix visible VRAM handling during faults
When we removed the hacky start code check we actually didn't took into
account that *all* VRAM pages needs to be CPU accessible.

Clean up the code and unify the handling into a single helper which
checks if the whole resource is CPU accessible.

The only place where a partial check would make sense is during
eviction, but that is neglitible.

Signed-off-by: Christian König <christian.koenig@amd.com>
Fixes: aed01a6804 ("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
CC: stable@vger.kernel.org
2024-04-16 22:39:15 -04:00
xinhui pan
98856136c4 drm/amdgpu: validate the parameters of bo mapping operations more clearly
Verify the parameters of
amdgpu_vm_bo_(map/replace_map/clearing_mappings) in one common place.

Fixes: dc54d3d174 ("drm/amdgpu: implement AMDGPU_VA_OP_CLEAR v2")
Cc: stable@vger.kernel.org
Reported-by: Vlad Stolyarov <hexed@google.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:15 -04:00
Yang Wang
f23558627f drm/amdgpu: add new aca smu callback func parse_error_code()
add new aca smu callback parse_error_code{} to avoid specific asic check
in amdgpu_aca.c file

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:15 -04:00
Jonathan Kim
f7c161a4c2 drm/amdgpu: increase mes submission timeout
MES internally has a timeout allowance of 2 seconds.
Increase driver timeout to 3 seconds to be safe.

Signed-off-by: Jonathan Kim <Jonathan.Kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:38:59 -04:00
Sunil Khatri
3c858cf65e drm/amdgpu: add missing vbios version from devcoredump
Add vbios version in the devcoredump along with formatting
the information with proper alignment.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 21:25:28 -04:00
Alex Deucher
8b9130bae0 drm/amdgpu/gfx11: properly handle regGRBM_GFX_CNTL in soft reset
Need to take the srbm_mutex and while we are here, use the
helper function soc21_grbm_select();

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 21:25:23 -04:00
Christian König
c8962679af drm/amdgpu: remove invalid resource->start check v2
The majority of those where removed in the commit aed01a6804
("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")

But this one was missed because it's working on the resource and not the
BO. Since we also no longer use a fake start address for visible BOs
this will now trigger invalid mapping errors.

v2: also remove the unused variable

Signed-off-by: Christian König <christian.koenig@amd.com>
Fixes: aed01a6804 ("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")
CC: stable@vger.kernel.org
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-12 00:33:51 -04:00
Jack Xiao
a0e002cdac drm/amdgpu/sdma6: set sdma hang watchdog
Set SDMAx_WATCHDOG_CNTL.QUEUE_HANG_COUNT registers
to improve SDMA reliability.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-12 00:33:43 -04:00
Luqmaan Irshad
6b0d78032f drm/amd/amdgpu: Update PF2VF Header
Adding a new field for GPU Capacity to align the header with the host.

Signed-off-by: Luqmaan Irshad <Luqmaan.Irshad@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-12 00:33:11 -04:00
Yifan Zhang
6dba20d23e drm/amdgpu: differentiate external rev id for gfx 11.5.0
This patch to differentiate external rev id for gfx 11.5.0.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Tim Huang <Tim.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-10 00:00:32 -04:00
Tim Huang
bbca7f414a drm/amdgpu: fix incorrect number of active RBs for gfx11
The RB bitmap should be global active RB bitmap &
active RB bitmap based on active SA.

Signed-off-by: Tim Huang <Tim.Huang@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:30:30 -04:00
ZhenGuo Yin
e33997e18d drm/amdgpu: clear set_q_mode_offs when VM changed
[Why]
set_q_mode_offs don't get cleared after GPU reset, nexting SET_Q_MODE
packet to init shadow memory will be skiped, hence there has a page fault.

[How]
VM flush is needed after GPU reset, clear set_q_mode_offs when
emitting VM flush.

Fixes: 8bc75586ea ("drm/amdgpu: workaround to avoid SET_Q_MODE packets v2")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:27:58 -04:00
Lijo Lazar
f7e232de51 drm/amdgpu: Fix VCN allocation in CPX partition
VCN need not be shared in CPX mode always for all GFX 9.4.3 SOC SKUs. In
certain configs, VCN instance can be exclusively allocated to a
partition even under CPX mode.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:27:23 -04:00
Kenneth Feng
3818708e9c drm/amd/pm: fix the high voltage issue after unload
fix the high voltage issue after unload on smu 13.0.10

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:26:32 -04:00
Tao Zhou
f886b49fea drm/amdgpu: implement IRQ_STATE_ENABLE for SDMA v4.4.2
SDMA_CNTL is not set in some cases, driver configures it by itself.

v2: simplify code

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:16:51 -04:00
Yifan Zhang
533eefb9be drm/amdgpu: add smu 14.0.1 discovery support
This patch to add smu 14.0.1 support

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:15:05 -04:00
shaoyunl
5b0cd091d9 drm/amdgpu : Increase the mes log buffer size as per new MES FW version
From MES version 0x54, the log entry increased and require the log buffer
size to be increased. The 16k is maximum size agreed

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:14:40 -04:00
shaoyunl
a3a4c0b123 drm/amdgpu : Add mes_log_enable to control mes log feature
The MES log might slow down the performance for extra step of log the data,
disable it by default and introduce a parameter can enable it when necessary

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:14:18 -04:00
Lijo Lazar
8b2be55f4d drm/amdgpu: Reset dGPU if suspend got aborted
For SOC21 ASICs, there is an issue in re-enabling PM features if a
suspend got aborted. In such cases, reset the device during resume
phase. This is a workaround till a proper solution is finalized.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:12:44 -04:00
Lang Yu
0f1bbcc2ba drm/amdgpu/umsch: reinitialize write pointer in hw init
Otherwise the old one will be used during GPU reset.
That's not expected.

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:12:20 -04:00
Lijo Lazar
4b18a91faf drm/amdgpu: Refine IB schedule error logging
Downgrade to debug information when IBs are skipped. Also, use dev_* to
identify the device.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:11:59 -04:00
Alex Deucher
65ff8092e4 drm/amdgpu: always force full reset for SOC21
There are cases where soft reset seems to succeed, but
does not, so always use mode1/2 for now.

Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:10:06 -04:00
Yifan Zhang
526b184e88 drm/amdgpu: differentiate external rev id for gfx 11.5.0
This patch to differentiate external rev id for gfx 11.5.0.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Tim Huang <Tim.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:20:01 -04:00
Tim Huang
d6d6561f93 drm/amdgpu: fix incorrect number of active RBs for gfx11
The RB bitmap should be global active RB bitmap &
active RB bitmap based on active SA.

Signed-off-by: Tim Huang <Tim.Huang@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:14:55 -04:00
Zhigang Luo
d1999b4017 amd/amdgpu: improve VF recover time
1. change AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT from 30 to 5.
2. set fatel error detected flag.

Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:14:30 -04:00
Lijo Lazar
b41f742d6f drm/amdgpu: Set fatal errror detected flag earlier
In case of fatal errors, set FED status when interrupt is received. Set
the flag on other devices in the hive before RAS recovery work.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:13:36 -04:00
ZhenGuo Yin
05e4014168 drm/amdgpu: clear set_q_mode_offs when VM changed
[Why]
set_q_mode_offs don't get cleared after GPU reset, nexting SET_Q_MODE
packet to init shadow memory will be skiped, hence there has a page fault.

[How]
VM flush is needed after GPU reset, clear set_q_mode_offs when
emitting VM flush.

Fixes: 8bc75586ea ("drm/amdgpu: workaround to avoid SET_Q_MODE packets v2")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:09:21 -04:00
Tao Zhou
4b0cb230bd drm/amdgpu: retire UMC v12 mca_addr_to_pa
RAS TA will handle it, the function is useless.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:09:15 -04:00
chongli2
f6ac084236 drm/amd/amdgpu: support MES command SET_HW_RESOURCE1 in sriov
support MES command SET_HW_RESOURCE1 in sriov

Signed-off-by: chongli2 <chongli2@amd.com>
Reviewed-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Acked-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:08:53 -04:00
Tao Zhou
9ecef5b2d0 drm/amdgpu: update check condition for XGMI ACA UE
Check more possible ext error codes.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:08:47 -04:00