linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00

Author	SHA1	Message	Date
fanhuang	2f0268ca1c	drm/amdgpu/jpeg: sriov support for jpeg_v5_0_1 initialization table handshake with mmsch Signed-off-by: fanhuang <FangSheng.Huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-16 13:39:14 -04:00
fanhuang	56fc141a5c	drm/amdgpu/vcn: sriov support for vcn_v5_0_1 initialization table handshake with mmsch Signed-off-by: fanhuang <FangSheng.Huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-16 13:39:10 -04:00
Arvind Yadav	13d0724f0f	drm/amdgpu: fix use-after-unlock in eviction fence destroy The eviction fence destroy path incorrectly calls dma_fence_put() on evf_mgr->ev_fence after releasing the ev_fence_lock. This introduces a potential use-after-unlock or race because another thread concurrently modifies evf_mgr->ev_fence. Fix this by grabbing a local reference to evf_mgr->ev_fence under the lock and using that for dma_fence_put() after waiting. Cc: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-16 13:38:10 -04:00
Lijo Lazar	cc473057bb	drm/amdgpu: Allow NPS2-CPX combination for VFs CPX partition mode is compatible with NPS2 on aquavanjaram VFs. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-16 13:38:05 -04:00
fanhuang	67cc7f9096	drm/amdgpu/mmsch: Add MMSCH v5_0 support for sriov These structures are basically ported from MMSCH v4_0 The structures are the same as v4_0 except for the init header Signed-off-by: fanhuang <FangSheng.Huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-16 13:37:57 -04:00
Lijo Lazar	3aa37922c6	drm/amdgpu: Use compatible NPS mode info Compatible NPS modes for a partition mode are exposed through xcp_config interface. To determine if a compute partition mode is valid, check if the current NPS mode is part of compatible NPS modes. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-16 13:37:50 -04:00
Asad Kamal	58c397890f	drm/amdgpu: Add pldm version reporting Add pldm version reporting through sysfs node Signed-off-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-16 13:37:38 -04:00
Amber Lin	e3d0870a90	drm/amdkfd: Support chain runlists of XNACK+/XNACK- If the MEC firmware supports chaining runlists of XNACK+/XNACK- processes, set SQ_CONFIG1 chicken bit and SET_RESOURCES bit 28. When the MEC/HWS supports it, KFD checks the XNACK+/XNACK- processes mix happens or not. If it does, enter over-subscription. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-16 13:37:29 -04:00
Shiwu Zhang	72ea78335e	drm/amdgpu: add debugfs for spirom IFWI dump Expose the debugfs file node for user space to dump the IFWI image on spirom. For one transaction between PSP and host, it will read out the images on both active and inactive partitions so a buffer with two times the size of maximum IFWI image (currently 16MByte) is needed. v2: move the vbios gfl macros to the common header and rename the bo triplet struct to spirom_bo for this specific usage (Hawking) v3: return directly the result of last command execution (Lijo) Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-14 11:30:15 -04:00
Prike Liang	64db767013	drm/amdgpu: fix userq resource double freed As the userq resource was already freed at the drm_release early phase, it should avoid freeing userq resource again at the later kms postclose callback. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-14 11:30:03 -04:00
Jesse.Zhang	96a86dcb5b	drm/amdgpu: Fix circular locking in userq creation A circular locking dependency was detected between the global `adev->userq_mutex` and per-file `userq_mgr->userq_mutex` when creating user queues. The issue occurs because: 1. `amdgpu_userq_suspend()` and `amdgpu_userq_resume` take `adev->userq_mutex` first, then `userq_mgr->userq_mutex` 2. While `amdgpu_userq_create()` takes them in reverse order This patch resolves the issue by: 1. Moving the `adev->userq_mutex` lock earlier in `amdgpu_userq_create()` to cover the `amdgpu_userq_ensure_ev_fence()` call 2. Releasing it after we're done with both queue creation and the scheduling halt check v2: remove unused adev->userq_mutex lock (Prike) Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-14 11:29:38 -04:00
David (Ming Qiang) Wu	07c9db090b	drm/amdgpu: read back register after written for VCN v4.0.5 On VCN v4.0.5 there is a race condition where the WPTR is not updated after starting from idle when doorbell is used. Adding register read-back after written at function end is to ensure all register writes are done before they can be used. Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528 Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Tested-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Ruijing Dong <ruijing.dong@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-14 11:28:39 -04:00
Arunpravin Paneer Selvam	553ad6fc2b	drm/amdgpu/userq: Fix DEBUG_LOCKS_WARN_ON(lock->magic != lock) Fix DEBUG_LOCKS_WARN_ON(lock->magic != lock) warning logs. Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 13:39:52 -04:00
Arunpravin Paneer Selvam	bc5bab82d3	drm/amdgpu: Fix userq ttm_bo_pin and ttm_bo_unpin lockdep warnings The ttm_bo_pin and ttm_bo_unpin warnings are resolved by moving the doorbell bo reserve up before pin/unpin. WARNING: CPU: 11 PID: 1818 at drivers/gpu/drm/ttm/ttm_bo.c:592 ttm_bo_pin+0x1f6/0x270 [ttm] [ +0.000277] CPU: 11 UID: 1000 PID: 1818 Comm: Xwayland Tainted: G W 6.12.0+ #15 [ +0.000006] Tainted: [W]=WARN [ +0.000004] Hardware name: ASUS System Product Name/TUF GAMING B650-PLUS, BIOS 3072 12/20/2024 [ +0.000004] RIP: 0010:ttm_bo_pin+0x1f6/0x270 [ttm] [ +0.000005] RSP: 0018:ffff88846ca879d0 EFLAGS: 00010246 [ +0.000007] RAX: 0000000000000000 RBX: ffff88810b7ca848 RCX: 0000000000000000 [ +0.000004] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ +0.000005] RBP: ffff88846ca879e8 R08: 0000000000000000 R09: 0000000000000000 [ +0.000004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88810b7ca848 [ +0.000004] R13: ffff88846c666250 R14: 1ffff1108d950f44 R15: ffff88846ca87aa0 [ +0.000005] FS: 00007c45ff436d00(0000) GS:ffff888409580000(0000) knlGS:0000000000000000 [ +0.000004] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000005] CR2: 00005b0c142a60e0 CR3: 000000012ce5a000 CR4: 0000000000f50ef0 [ +0.000004] PKRU: 55555554 [ +0.000004] Call Trace: [ +0.000004] <TASK> [ +0.000005] ? show_regs+0x6c/0x80 [ +0.000007] ? __warn+0xd2/0x2d0 [ +0.000007] ? ttm_bo_pin+0x1f6/0x270 [ttm] [ +0.000031] ? report_bug+0x282/0x2f0 [ +0.000012] ? handle_bug+0x6e/0xc0 [ +0.000007] ? exc_invalid_op+0x18/0x50 [ +0.000007] ? asm_exc_invalid_op+0x1b/0x20 [ +0.000017] ? ttm_bo_pin+0x1f6/0x270 [ttm] [ +0.000014] amdgpu_bo_pin+0x365/0x9d0 [amdgpu] [ +0.000191] ? __pfx_amdgpu_bo_pin+0x10/0x10 [amdgpu] [ +0.000185] ? drm_gem_object_lookup+0x81/0xc0 [ +0.000008] ? kasan_save_alloc_info+0x37/0x60 [ +0.000007] ? __kasan_kmalloc+0xc3/0xd0 [ +0.000013] amdgpu_userqueue_get_doorbell_index+0xee/0x5f0 [amdgpu] [ +0.000209] amdgpu_userq_ioctl+0x6b4/0xd40 [amdgpu] [ +0.000193] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu] [ +0.000211] ? lock_acquire+0x7c/0xc0 [ +0.000006] ? drm_dev_enter+0x51/0x190 [ +0.000015] drm_ioctl_kernel+0x18b/0x330 [ +0.000007] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu] [ +0.000190] ? __pfx_drm_ioctl_kernel+0x10/0x10 [ +0.000005] ? lock_acquire+0x7c/0xc0 [ +0.000009] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? __kasan_check_write+0x14/0x30 [ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000011] drm_ioctl+0x589/0xd00 [ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000006] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu] [ +0.000194] ? __pfx_drm_ioctl+0x10/0x10 [ +0.000006] ? __pm_runtime_resume+0x80/0x110 [ +0.000021] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? trace_hardirqs_on+0x53/0x60 [ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? _raw_spin_unlock_irqrestore+0x51/0x80 [ +0.000013] amdgpu_drm_ioctl+0xd2/0x1c0 [amdgpu] [ +0.000185] __x64_sys_ioctl+0x13a/0x1c0 [ +0.000010] x64_sys_call+0x11ad/0x25f0 [ +0.000007] do_syscall_64+0x91/0x180 [ +0.000007] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? irqentry_exit+0x77/0xb0 [ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? exc_page_fault+0x93/0x150 [ +0.000009] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000005] RIP: 0033:0x7c45ff924ded [ +0.000005] RSP: 002b:00007ffff7167810 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.000008] RAX: ffffffffffffffda RBX: 00000000c0486456 RCX: 00007c45ff924ded [ +0.000004] RDX: 00007ffff7167870 RSI: 00000000c0486456 RDI: 000000000000000b [ +0.000004] RBP: 00007ffff7167860 R08: ffff800100000000 R09: 0000000000010000 [ +0.000005] R10: 00007ffff7167950 R11: 0000000000000246 R12: 00005b0c2a51bc48 [ +0.000004] R13: 000000000000000b R14: 0000000000000000 R15: 00007ffff7167950 [ +0.000022] </TASK> [ +0.000004] irq event stamp: 80693 [ +0.000004] hardirqs last enabled at (80699): [<ffffffff86a693a9>] __up_console_sem+0x79/0xa0 [ +0.000005] hardirqs last disabled at (80704): [<ffffffff86a6938e>] __up_console_sem+0x5e/0xa0 [ +0.000005] softirqs last enabled at (80390): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0 [ +0.000005] softirqs last disabled at (80385): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0 [ +0.000006] ---[ end trace 0000000000000000 ]--- ------------------------------------------------------------------------------------------------------ [ +0.000006] WARNING: CPU: 10 PID: 1818 at drivers/gpu/drm/ttm/ttm_bo.c:611 ttm_bo_unpin+0x21f/0x2c0 [ttm] [ +0.000280] CPU: 10 UID: 1000 PID: 1818 Comm: Xwayland Tainted: G W 6.12.0+ #15 [ +0.000006] Tainted: [W]=WARN [ +0.000004] Hardware name: ASUS System Product Name/TUF GAMING B650-PLUS, BIOS 3072 12/20/2024 [ +0.000004] RIP: 0010:ttm_bo_unpin+0x21f/0x2c0 [ttm] [ +0.000005] RSP: 0018:ffff88846ca87888 EFLAGS: 00010246 [ +0.000007] RAX: 0000000000000000 RBX: ffff88810b7ca848 RCX: 0000000000000000 [ +0.000005] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ +0.000004] RBP: ffff88846ca878a0 R08: 0000000000000000 R09: 0000000000000000 [ +0.000004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888164e90050 [ +0.000005] R13: ffff88846c666200 R14: 0000000000000001 R15: ffff888168402d28 [ +0.000004] FS: 00007c45ff436d00(0000) GS:ffff888409500000(0000) knlGS:0000000000000000 [ +0.000005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000004] CR2: 00007c45f7373b20 CR3: 000000012ce5a000 CR4: 0000000000f50ef0 [ +0.000005] PKRU: 55555554 [ +0.000004] Call Trace: [ +0.000004] <TASK> [ +0.000005] ? show_regs+0x6c/0x80 [ +0.000008] ? __warn+0xd2/0x2d0 [ +0.000007] ? ttm_bo_unpin+0x21f/0x2c0 [ttm] [ +0.000012] ? report_bug+0x282/0x2f0 [ +0.000013] ? handle_bug+0x6e/0xc0 [ +0.000006] ? exc_invalid_op+0x18/0x50 [ +0.000008] ? asm_exc_invalid_op+0x1b/0x20 [ +0.000017] ? ttm_bo_unpin+0x21f/0x2c0 [ttm] [ +0.000011] ? ttm_bo_unpin+0x217/0x2c0 [ttm] [ +0.000011] amdgpu_bo_unpin+0x45/0x250 [amdgpu] [ +0.000216] amdgpu_userq_ioctl+0x2c3/0xd40 [amdgpu] [ +0.000226] ? drm_dev_exit+0x2d/0x60 [ +0.000010] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu] [ +0.000201] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? lock_acquire+0x7c/0xc0 [ +0.000006] ? drm_dev_enter+0x51/0x190 [ +0.000015] drm_ioctl_kernel+0x18b/0x330 [ +0.000007] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu] [ +0.000188] ? __pfx_drm_ioctl_kernel+0x10/0x10 [ +0.000006] ? lock_acquire+0x7c/0xc0 [ +0.000008] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? __kasan_check_write+0x14/0x30 [ +0.000006] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000010] drm_ioctl+0x589/0xd00 [ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000006] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu] [ +0.000211] ? __pfx_drm_ioctl+0x10/0x10 [ +0.000006] ? __pm_runtime_resume+0x80/0x110 [ +0.000020] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000006] ? trace_hardirqs_on+0x53/0x60 [ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? _raw_spin_unlock_irqrestore+0x51/0x80 [ +0.000013] amdgpu_drm_ioctl+0xd2/0x1c0 [amdgpu] [ +0.000186] __x64_sys_ioctl+0x13a/0x1c0 [ +0.000010] x64_sys_call+0x11ad/0x25f0 [ +0.000007] do_syscall_64+0x91/0x180 [ +0.000007] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? do_syscall_64+0x9d/0x180 [ +0.000007] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000010] ? __pfx___rseq_handle_notify_resume+0x10/0x10 [ +0.000005] ? __pfx_blkcg_maybe_throttle_current+0x10/0x10 [ +0.000013] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000009] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000008] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? syscall_exit_to_user_mode+0x95/0x260 [ +0.000008] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? do_syscall_64+0x9d/0x180 [ +0.000007] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? do_syscall_64+0x9d/0x180 [ +0.000011] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000010] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000009] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000008] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? irqentry_exit_to_user_mode+0x8b/0x260 [ +0.000007] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000006] ? irqentry_exit+0x77/0xb0 [ +0.000004] ? srso_alias_return_thunk+0x5/0xfbef5 [ +0.000005] ? exc_page_fault+0x93/0x150 [ +0.000010] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000005] RIP: 0033:0x7c45ff924ded [ +0.000005] RSP: 002b:00007ffff7168790 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.000008] RAX: ffffffffffffffda RBX: 00000000c0486456 RCX: 00007c45ff924ded [ +0.000005] RDX: 00007ffff71687f0 RSI: 00000000c0486456 RDI: 000000000000000b [ +0.000004] RBP: 00007ffff71687e0 R08: 00005b0c2a49b010 R09: 0000000000000007 [ +0.000004] R10: 00005b0c2a4d7140 R11: 0000000000000246 R12: 000000000000000b [ +0.000004] R13: 00007c45ff19e5cc R14: 00005b0c2a51c538 R15: 00005b0c2a51bbd8 [ +0.000022] </TASK> [ +0.000005] irq event stamp: 87419 [ +0.000004] hardirqs last enabled at (87425): [<ffffffff86a693a9>] __up_console_sem+0x79/0xa0 [ +0.000005] hardirqs last disabled at (87430): [<ffffffff86a6938e>] __up_console_sem+0x5e/0xa0 [ +0.000005] softirqs last enabled at (87058): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0 [ +0.000006] softirqs last disabled at (87053): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0 [ +0.000005] ---[ end trace 0000000000000000 ]--- Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 13:39:46 -04:00
Arunpravin Paneer Selvam	218caca4ba	drm/amdgpu/userq: Fix lock contention in userq fence Fix lockdep warnings. [ +0.000637] ================================ [ +0.000004] WARNING: inconsistent lock state [ +0.000004] 6.12.0+ #18 Tainted: G W OE [ +0.000004] -------------------------------- [ +0.000004] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. [ +0.000004] Xwayland/1952 [HC0[0]:SC0[0]:HE1:SE1] takes: [ +0.000005] ffff8884636f4740 (&fence_drv->fence_list_lock){?...}-{2:2}, at: amdgpu_userq_fence_driver_destroy+0xb8/0x540 [amdgpu] [ +0.000208] {IN-HARDIRQ-W} state was registered at: [ +0.000004] lock_acquire.part.0+0x116/0x360 [ +0.000005] lock_acquire+0x7c/0xc0 [ +0.000005] _raw_spin_lock+0x2f/0x60 [ +0.000005] amdgpu_userq_fence_driver_process+0x75/0x400 [amdgpu] [ +0.000185] gfx_v12_0_eop_irq+0x29f/0x420 [amdgpu] [ +0.000210] amdgpu_irq_dispatch+0x2a4/0x7b0 [amdgpu] [ +0.000191] amdgpu_ih_process+0x1e1/0x3d0 [amdgpu] [ +0.000185] amdgpu_irq_handler+0x28/0xc0 [amdgpu] [ +0.000183] __handle_irq_event_percpu+0x1bb/0x590 [ +0.000005] handle_irq_event+0xab/0x1d0 [ +0.000005] handle_edge_irq+0x1fd/0xc10 [ +0.000005] __common_interrupt+0x83/0x190 [ +0.000004] common_interrupt+0xb1/0xe0 [ +0.000005] asm_common_interrupt+0x27/0x40 [ +0.000004] cpuidle_enter_state+0x2ba/0x530 [ +0.000005] cpuidle_enter+0x4f/0xb0 [ +0.000006] call_cpuidle+0x46/0xd0 [ +0.000005] do_idle+0x367/0x430 [ +0.000004] cpu_startup_entry+0x58/0x70 [ +0.000005] start_secondary+0x224/0x2b0 [ +0.000005] common_startup_64+0x13e/0x141 [ +0.000005] irq event stamp: 88271 [ +0.000004] hardirqs last enabled at (88271): [<ffffffffad9ca7a1>] _raw_spin_unlock_irqrestore+0x51/0x80 [ +0.000005] hardirqs last disabled at (88270): [<ffffffffad9ca424>] _raw_spin_lock_irqsave+0x74/0x80 [ +0.000005] softirqs last enabled at (87858): [<ffffffffaa67377e>] __irq_exit_rcu+0x17e/0x1d0 [ +0.000005] softirqs last disabled at (87849): [<ffffffffaa67377e>] __irq_exit_rcu+0x17e/0x1d0 [ +0.000005] other info that might help us debug this: [ +0.000004] Possible unsafe locking scenario: [ +0.000003] CPU0 [ +0.000004] ---- [ +0.000003] lock(&fence_drv->fence_list_lock); v2: Drop fence_list_flags and use xa_lock_irqsave() flags parameter (Christian) Fix merge conflicts. Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 13:39:20 -04:00
Jesse.Zhang	3b63602614	drm/amdgpu: Add GFX 9.5.0 support for per-queue/pipe reset This patch enables per-queue and per-pipe reset functionality for GFX IP v9.5.0 when using MEC firmware version 21 (0x15) or later. This change: 1. Refactors the pipe reset support check in gfx_v9_4_3_pipe_reset_support() to use the compute_supported_reset flags instead of hardcoding version checks. 2. Adds support for GFX9.5.0 (IP 9.5.0) with MEC firmware version >= 21 to enable per-queue and per-pipe reset capabilities. v2: Replaced mec version check with !!(adev->gfx.compute_supported_reset & AMDGPU_RESET_TYPE_PER_PIPE) (Lijo) Signed-off-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:37:32 -04:00
Tao Zhou	5d6fddac55	drm/amdgpu: set vram type for GC 9.5.0 Set vram type so we can take different actions according to the type. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:37:28 -04:00
Tao Zhou	dc111f8fb1	drm/amdgpu: set flip bits for RAS bad pages Make the code more general, user doesn't need to pay attention to the detail of flip bits setting. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:37:19 -04:00
Ce Sun	533aa8bdbe	drm/amdgpu: Modify the count method of defer error The number of newly added de counts and the number of newly added error addresses remain consistent Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:35:12 -04:00
Lijo Lazar	937467b7d5	drm/amdgpu: Log RAS errors during load During driver load, RAS event manager may not be initialized. This will cause any ATHUB event during driver load to be skipped in dmesg log. Log the error in dmesg log for easier diagnosis. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:34:02 -04:00
Jesse.Zhang	648a0dc0d7	drm/amdgpu: Fix user queue deadlock by reordering mutex locking This resolves a deadlock between user queue management and GPU reset paths by enforcing consistent lock ordering. The deadlock occurred when: 1. Process exit path (amdgpu_userq_mgr_fini) would: - Take uqm->userq_mutex - Then try to take adev->userq_mutex for list operations 2. GPU reset path (amdgpu_userq_pre_reset) would: - Take adev->userq_mutex first (for list traversal) - Then take uqm->userq_mutex The solution establishes a strict top-down locking order: 1. Always take adev->userq_mutex before any uqm->userq_mutex 2. Maintain this order consistently across all code paths Changes made: - Reordered locking in amdgpu_userq_mgr_fini() to take device lock first - Kept existing proper order in amdgpu_userq_pre_reset() - Simplified the fini flow by removing redundant operations This prevents circular dependencies while maintaining thread safety during both normal operation and GPU reset scenarios. Fixes: `4ce60dbada` ("drm/amdgpu: store userq_managers in a list in adev") Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:32:25 -04:00
Ce Sun	f71509fdd0	drm/amdgpu: Fix the kernel panic caused by RAS records exceed threshold kernel panic caused by RAS records exceeding the threshold when load driver specifying RMA(bad_page_threshold=128) 1.Fix the warnings caused by disabling the interrupt source before it was enabled 2.Fix kernel panic when xcp sysfs is not initialized,null pointer appears during fini 3.Fix the memory leak caused by the device's early exit due to rma The first reason: [ 2744.246650] ------------[ cut here ]------------ [ 2744.246651] WARNING: CPU: 0 PID: 289 at /tmp/amd.BkfTLqYV/amd/amdgpu/amdgpu_irq.c:635 amdgpu_irq_put.cold+0x42/0x6e [amdgpu] [ 2744.247108] Modules linked in: amdgpu(OE+) amddrm_ttm_helper(OE) amdttm(OE) amdxcp(OE) amddrm_buddy(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay binfmt_misc intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel nls_iso8859_1 kvm rapl isst_if_mbox_pci isst_if_mmio pmt_telemetry pmt_crashlog isst_if_common pmt_class mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath [ 2744.247167] linear mlx5_ib ib_uverbs ib_core ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper crct10dif_pclmul syscopyarea crc32_pclmul ghash_clmulni_intel mlx5_core sysfillrect sysimgblt aesni_intel mlxfw fb_sys_fops psample cec crypto_simd cryptd rc_core i2c_i801 nvme xhci_pci tls intel_pmt drm pci_hyperv_intf nvme_core i2c_smbus i2c_ismt xhci_pci_renesas wmi pinctrl_emmitsburg [ 2744.247194] CPU: 0 PID: 289 Comm: kworker/0:1 Tainted: G OE 5.15.0-70-generic #77-Ubuntu [ 2744.247197] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C23.AG.2 11/21/2024 [ 2744.247198] Workqueue: events work_for_cpu_fn [ 2744.247206] RIP: 0010:amdgpu_irq_put.cold+0x42/0x6e [amdgpu] [ 2744.247634] Code: 79 7f ff 44 89 ee 48 c7 c7 4d 5a 42 c2 89 55 d4 e8 90 09 bc bf 8b 55 d4 4c 89 e6 4c 89 ff e8 3c 76 7f ff 8b 55 d4 84 c0 75 07 <0f> 0b e9 95 79 7f ff 49 03 5c 24 08 f0 ff 0b 75 13 4c 89 e6 4c 89 [ 2744.247636] RSP: 0018:ffa0000019e27cb0 EFLAGS: 00010246 [ 2744.247639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff11000150fa87c0 [ 2744.247641] RDX: 0000000000000000 RSI: ffffffffc2222430 RDI: ff1100019f200000 [ 2744.247642] RBP: ffa0000019e27ce0 R08: 0000000000000003 R09: ffffffffffe41a08 [ 2744.247643] R10: 0000000000ffff0a R11: 0000000000000001 R12: ff1100019f22ce60 [ 2744.247644] R13: 0000000000000000 R14: 00000000ffffffea R15: ff1100019f200000 [ 2744.247645] FS: 0000000000000000(0000) GS:ff11007e7e400000(0000) knlGS:0000000000000000 [ 2744.247647] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2744.247649] CR2: 00007f3d2002819c CR3: 0000000006810003 CR4: 0000000000771ef0 [ 2744.247650] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2744.247651] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 2744.247652] PKRU: 55555554 [ 2744.247653] Call Trace: [ 2744.247654] <TASK> [ 2744.247656] sdma_v4_4_2_hw_fini+0x7a/0xc0 [amdgpu] [ 2744.247997] ? vcn_v4_0_3_hw_fini+0x5f/0xa0 [amdgpu] [ 2744.248336] amdgpu_ip_block_hw_fini+0x31/0x61 [amdgpu] [ 2744.248776] amdgpu_device_fini_hw+0x3bb/0x47b [amdgpu] [ 2744.249197] ? blocking_notifier_chain_unregister+0x56/0xb0 [ 2744.249202] amdgpu_driver_unload_kms+0x51/0x60 [amdgpu] [ 2744.249482] amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu] [ 2744.249913] amdgpu_pci_probe+0x23e/0x590 [amdgpu] [ 2744.250187] local_pci_probe+0x48/0x90 [ 2744.250191] work_for_cpu_fn+0x17/0x30 [ 2744.250196] process_one_work+0x228/0x3d0 [ 2744.250198] worker_thread+0x223/0x420 [ 2744.250200] ? process_one_work+0x3d0/0x3d0 [ 2744.250201] kthread+0x127/0x150 [ 2744.250204] ? set_kthread_struct+0x50/0x50 [ 2744.250207] ret_from_fork+0x1f/0x30 [ 2744.250212] </TASK> [ 2744.250213] ---[ end trace 488c997a88508bc3 ]--- The second reason: [ 5139.303446] Memory manager not clean during takedown. [ 5139.303509] WARNING: CPU: 145 PID: 117699 at drivers/gpu/drm/drm_mm.c:998 drm_mm_takedown+0x27/0x30 [drm] [ 5139.303542] Modules linked in: amdgpu(OE+) amddrm_ttm_helper(OE) amdttm(OE) amdxcp(OE) amddrm_buddy(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel binfmt_misc kvm nls_iso8859_1 rapl isst_if_mbox_pci pmt_telemetry pmt_crashlog isst_if_mmio pmt_class isst_if_common mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath [ 5139.303572] linear mlx5_ib ib_uverbs ib_core crct10dif_pclmul ast crc32_pclmul i2c_algo_bit ghash_clmulni_intel aesni_intel crypto_simd drm_vram_helper cryptd drm_ttm_helper mlx5_core ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core mlxfw psample intel_pmt nvme xhci_pci drm tls i2c_i801 pci_hyperv_intf nvme_core i2c_smbus i2c_ismt xhci_pci_renesas wmi pinctrl_emmitsburg [last unloaded: amdkcl] [ 5139.303588] CPU: 145 PID: 117699 Comm: modprobe Tainted: G U OE 5.15.0-70-generic #77-Ubuntu [ 5139.303590] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C23.AG.2 11/21/2024 [ 5139.303591] RIP: 0010:drm_mm_takedown+0x27/0x30 [drm] [ 5139.303605] Code: cc 66 90 0f 1f 44 00 00 48 8b 47 38 48 83 c7 38 48 39 f8 75 05 c3 cc cc cc cc 55 48 c7 c7 18 d0 10 c0 48 89 e5 e8 5a bc c3 c1 <0f> 0b 5d c3 cc cc cc cc 90 0f 1f 44 00 00 55 b9 15 00 00 00 48 89 [ 5139.303607] RSP: 0018:ffa00000325c3940 EFLAGS: 00010286 [ 5139.303608] RAX: 0000000000000000 RBX: ff1100012f5cecb0 RCX: 0000000000000027 [ 5139.303609] RDX: ff11007e7fa60588 RSI: 0000000000000001 RDI: ff11007e7fa60580 [ 5139.303610] RBP: ffa00000325c3940 R08: 0000000000000003 R09: fffffffff00c2b78 [ 5139.303610] R10: 000000000000002b R11: 0000000000000001 R12: ff1100012f5cec00 [ 5139.303611] R13: ff1100012138f068 R14: 0000000000000000 R15: ff1100012f5cec90 [ 5139.303611] FS: 00007f42ffca0000(0000) GS:ff11007e7fa40000(0000) knlGS:0000000000000000 [ 5139.303612] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5139.303613] CR2: 00007f23d945ab68 CR3: 00000001212ce005 CR4: 0000000000771ee0 [ 5139.303614] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5139.303615] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 5139.303615] PKRU: 55555554 [ 5139.303616] Call Trace: [ 5139.303617] <TASK> [ 5139.303619] amdttm_range_man_fini_nocheck+0xfe/0x1c0 [amdttm] [ 5139.303625] amdgpu_ttm_fini+0x2ed/0x390 [amdgpu] [ 5139.303800] amdgpu_bo_fini+0x27/0xc0 [amdgpu] [ 5139.303959] gmc_v9_0_sw_fini+0x63/0x90 [amdgpu] [ 5139.304144] amdgpu_device_fini_sw+0x125/0x6a0 [amdgpu] [ 5139.304302] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] [ 5139.304455] devm_drm_dev_init_release+0x4a/0x80 [drm] [ 5139.304472] devm_action_release+0x12/0x20 [ 5139.304476] release_nodes+0x3d/0xb0 [ 5139.304478] devres_release_all+0x9b/0xd0 [ 5139.304480] really_probe+0x11d/0x420 [ 5139.304483] __driver_probe_device+0x119/0x190 [ 5139.304485] driver_probe_device+0x23/0xc0 [ 5139.304487] __driver_attach+0xf7/0x1f0 [ 5139.304489] ? __device_attach_driver+0x140/0x140 [ 5139.304491] bus_for_each_dev+0x7c/0xd0 [ 5139.304493] driver_attach+0x1e/0x30 [ 5139.304494] bus_add_driver+0x148/0x220 [ 5139.304496] driver_register+0x95/0x100 [ 5139.304498] __pci_register_driver+0x68/0x70 [ 5139.304500] amdgpu_init+0xbc/0x1000 [amdgpu] [ 5139.304655] ? 0xffffffffc0b8f000 [ 5139.304657] do_one_initcall+0x46/0x1e0 [ 5139.304659] ? kmem_cache_alloc_trace+0x19e/0x2e0 [ 5139.304663] do_init_module+0x52/0x260 [ 5139.304665] load_module+0xb2b/0xbc0 [ 5139.304667] __do_sys_finit_module+0xbf/0x120 [ 5139.304669] __x64_sys_finit_module+0x18/0x20 [ 5139.304670] do_syscall_64+0x59/0xc0 [ 5139.304673] ? exit_to_user_mode_prepare+0x37/0xb0 [ 5139.304676] ? syscall_exit_to_user_mode+0x27/0x50 [ 5139.304678] ? __x64_sys_mmap+0x33/0x50 [ 5139.304680] ? do_syscall_64+0x69/0xc0 [ 5139.304681] entry_SYSCALL_64_after_hwframe+0x61/0xcb [ 5139.304684] RIP: 0033:0x7f42ffdbf88d [ 5139.304686] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48 [ 5139.304687] RSP: 002b:00007ffcb7427158 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 5139.304688] RAX: ffffffffffffffda RBX: 000055ce8b8f3150 RCX: 00007f42ffdbf88d [ 5139.304689] RDX: 0000000000000000 RSI: 000055ce8b8f9a70 RDI: 000000000000000a [ 5139.304690] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000011 [ 5139.304690] R10: 000000000000000a R11: 0000000000000246 R12: 000055ce8b8f9a70 [ 5139.304691] R13: 000055ce8b8f2ec0 R14: 000055ce8b8f2ab0 R15: 000055ce8b8f9aa0 [ 5139.304692] </TASK> [ 5139.304693] ---[ end trace 8536b052f7883003 ]--- Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:32:11 -04:00
Tao Zhou	b7674ae75b	drm/amdgu: get RAS retire flip bits for new type of HBM Get RAS retire flip bits for HBM with different types in various NPS modes. Also set flip row bit and MCA R13 bit in PA in different NPS modes. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:32:08 -04:00
Tao Zhou	9b5b71895b	drm/amdgpu: implement get_retire_flip_bits for UMC v12 The RAS bad page retire flip bits can be set per vram type, vram vendor and NPS mode. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:32:05 -04:00
Tao Zhou	699bff37a5	drm/amdgpu: add get_retire_flip_bits for UMC Add the general interface to get flip bits for RAS bad page retirement. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:32:01 -04:00
Tao Zhou	4ce5b99128	drm/amdgpu: adjust high bits for RAS retired page Per UMC address conversion algorithm, the high row bits of UMC MCA address are changed when they're converted into normalized address on specific ASICs. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:31:46 -04:00
Tao Zhou	1df57411a6	drm/amd: add definition for new memory type Support new version of HBM. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:31:40 -04:00
ganglxie	f5db59067c	Refine RAS bad page records counting and parsing in eeprom V3 there is only MCA records in V3, no need to care about PA records. recalculate the value of ras_num_bad_pages when parsing failed and go on with the left records instead of quit. Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:31:33 -04:00
Tim Huang	0a5c060b59	drm/amdgpu: fix incorrect MALL size for GFX1151 On GFX1151, the reported MALL cache size reflects only half of its actual size; this adjustment corrects the discrepancy. Signed-off-by: Tim Huang <tim.huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:23:23 -04:00
Arvind Yadav	010503a3cb	drm/amdgpu: Fix amdgpu_userq_wait_ioctl() warn missing error code 'r' To resolve the warning regarding the missing error code 'r' in amdgpu_userq_wait_ioctl(), assign the value 'r = -EINVAL'. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202505080458.rnV8YfiY-lkp@intel.com/ Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:21:56 -04:00
Arvind Yadav	f10eb185ad	drm/amdgpu: Fix NULL dereference in amdgpu_userq_restore_worker Switch cancel_delayed_work() to cancel_delayed_work_sync() to ensure the delayed work has finished executing before proceeding with resource cleanup. This prevents a potential use-after-free or NULL dereference if the resume_work is still running during finalization. BUG: kernel NULL pointer dereference, address: 0000000000000140 [ +0.000050] #PF: supervisor read access in kernel mode [ +0.000019] #PF: error_code(0x0000) - not-present page [ +0.000021] PGD 0 P4D 0 [ +0.000015] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI [ +0.000021] CPU: 17 UID: 0 PID: 196299 Comm: kworker/17:0 Tainted: G U 6.14.0-org-staging #1 [ +0.000032] Tainted: [U]=USER [ +0.000015] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F39 03/22/2024 [ +0.000029] Workqueue: events amdgpu_userq_restore_worker [amdgpu] [ +0.000426] RIP: 0010:drm_exec_lock_obj+0x32/0x210 [drm_exec] [ +0.000025] Code: e5 41 57 41 56 41 55 49 89 f5 41 54 49 89 fc 48 83 ec 08 4c 8b 77 30 4d 85 f6 0f 85 c0 00 00 00 4c 8d 7f 08 48 39 77 38 74 54 <49> 8b bd f8 00 00 00 4c 89 fe 41 f6 04 24 01 75 3c e8 08 50 bc e0 [ +0.000046] RSP: 0018:ffffab1b04da3ce8 EFLAGS: 00010297 [ +0.000020] RAX: 0000000000000001 RBX: ffff930cc60e4bc0 RCX: 0000000000000000 [ +0.000025] RDX: 0000000000000004 RSI: 0000000000000048 RDI: ffffab1b04da3d88 [ +0.000028] RBP: ffffab1b04da3d10 R08: ffff930cc60e4000 R09: 0000000000000000 [ +0.000022] R10: ffffab1b04da3d18 R11: 0000000000000001 R12: ffffab1b04da3d88 [ +0.000023] R13: 0000000000000048 R14: 0000000000000000 R15: ffffab1b04da3d90 [ +0.000023] FS: 0000000000000000(0000) GS:ffff9313dea80000(0000) knlGS:0000000000000000 [ +0.000024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000021] CR2: 0000000000000140 CR3: 000000018351a000 CR4: 0000000000350ef0 [ +0.000025] Call Trace: [ +0.000018] <TASK> [ +0.000015] ? show_regs+0x69/0x80 [ +0.000022] ? __die+0x25/0x70 [ +0.000019] ? page_fault_oops+0x15d/0x510 [ +0.000024] ? do_user_addr_fault+0x312/0x690 [ +0.000024] ? sched_clock_cpu+0x10/0x1a0 [ +0.000028] ? exc_page_fault+0x78/0x1b0 [ +0.000025] ? asm_exc_page_fault+0x27/0x30 [ +0.000024] ? drm_exec_lock_obj+0x32/0x210 [drm_exec] [ +0.000024] drm_exec_prepare_obj+0x21/0x60 [drm_exec] [ +0.000021] amdgpu_vm_lock_pd+0x22/0x30 [amdgpu] [ +0.000266] amdgpu_userq_validate_bos+0x6c/0x320 [amdgpu] [ +0.000333] amdgpu_userq_restore_worker+0x4a/0x120 [amdgpu] [ +0.000316] process_one_work+0x189/0x3c0 [ +0.000021] worker_thread+0x2a4/0x3b0 [ +0.000022] kthread+0x109/0x220 [ +0.000018] ? __pfx_worker_thread+0x10/0x10 [ +0.000779] ? _raw_spin_unlock_irq+0x1f/0x40 [ +0.000560] ? __pfx_kthread+0x10/0x10 [ +0.000543] ret_from_fork+0x3c/0x60 [ +0.000507] ? __pfx_kthread+0x10/0x10 [ +0.000515] ret_from_fork_asm+0x1a/0x30 [ +0.000515] </TASK> v2: Replace cancel_delayed_work() to cancel_delayed_work_sync() in amdgpu_userq_destroy() and amdgpu_userq_evict(). Cc: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Arvind Yadav <arvind.yadav@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:21:39 -04:00
Philip Yang	7dbbfb3c17	drm/amdgpu: csa unmap use uninterruptible lock After process exit to unmap csa and free GPU vm, if signal is accepted and then waiting to take vm lock is interrupted and return, it causes memory leaking and below warning backtrace. Change to use uninterruptible wait lock fix the issue. WARNING: CPU: 69 PID: 167800 at amd/amdgpu/amdgpu_kms.c:1525 amdgpu_driver_postclose_kms+0x294/0x2a0 [amdgpu] Call Trace: <TASK> drm_file_free.part.0+0x1da/0x230 [drm] drm_close_helper.isra.0+0x65/0x70 [drm] drm_release+0x6a/0x120 [drm] amdgpu_drm_release+0x51/0x60 [amdgpu] __fput+0x9f/0x280 ____fput+0xe/0x20 task_work_run+0x67/0xa0 do_exit+0x217/0x3c0 do_group_exit+0x3b/0xb0 get_signal+0x14a/0x8d0 arch_do_signal_or_restart+0xde/0x100 exit_to_user_mode_loop+0xc1/0x1a0 exit_to_user_mode_prepare+0xf4/0x100 syscall_exit_to_user_mode+0x17/0x40 do_syscall_64+0x69/0xc0 Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-13 09:21:31 -04:00
Dave Airlie	1faeeb315f	amd-drm-next-6.16-2025-05-09: amdgpu: - IPS fixes - DSC cleanup - DC Scaling updates - DC FP fixes - Fused I2C-over-AUX updates - SubVP fixes - Freesync fix - DMUB AUX fixes - VCN fix - Hibernation fixes - HDP fixes - DCN 2.1 fixes - DPIA fixes - DMUB updates - Use drm_file_err in amdgpu - Enforce isolation updates - Use new dma_fence helpers - USERQ fixes - Documentation updates - Misc code cleanups - SR-IOV updates - RAS updates - PSP 12 cleanups amdkfd: - Update error messages for SDMA - Userptr updates drm: - Add drm_file_err function dma-buf: - Add a helper to sort and deduplicate dma_fence arrays -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaB6KhwAKCRC93/aFa7yZ 2FYIAQD43sIYoZCo6l7lZt0m54d6Mt1jangVhMK/jH31eOgKYQEAoRJKSMjT6Ktl FLBzG8FXekwQELNu6EgyG8Ywwim9AQ0= =dfkK -----END PGP SIGNATURE----- Merge tag 'amd-drm-next-6.16-2025-05-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.16-2025-05-09: amdgpu: - IPS fixes - DSC cleanup - DC Scaling updates - DC FP fixes - Fused I2C-over-AUX updates - SubVP fixes - Freesync fix - DMUB AUX fixes - VCN fix - Hibernation fixes - HDP fixes - DCN 2.1 fixes - DPIA fixes - DMUB updates - Use drm_file_err in amdgpu - Enforce isolation updates - Use new dma_fence helpers - USERQ fixes - Documentation updates - Misc code cleanups - SR-IOV updates - RAS updates - PSP 12 cleanups amdkfd: - Update error messages for SDMA - Userptr updates drm: - Add drm_file_err function dma-buf: - Add a helper to sort and deduplicate dma_fence arrays From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250509230951.3871914-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>	2025-05-12 07:14:34 +10:00
Lijo Lazar	afc6053d4c	Reapply: drm/amdgpu: Use generic hdp flush function Except HDP v5.2 all use a common logic for HDP flush. Use a generic function. HDP v5.2 forces NO_KIQ logic, revisit it later. Reapply after fixing up an HDP regression. v2: merge the fix (Alex) Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> (v1) Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-08 11:21:37 -04:00
Alex Deucher	dbc064adfc	drm/amdgpu/hdp7: use memcfg register to post the write for HDP flush Reading back the remapped HDP flush register seems to cause problems on some platforms. All we need is a read, so read back the memcfg register. Fixes: `689275140c` ("drm/amdgpu/hdp7.0: do a posting read when flushing HDP") Reported-by: Alexey Klimov <alexey.klimov@linaro.org> Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908 Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-08 11:21:12 -04:00
Alex Deucher	84141ff615	drm/amdgpu/hdp6: use memcfg register to post the write for HDP flush Reading back the remapped HDP flush register seems to cause problems on some platforms. All we need is a read, so read back the memcfg register. Fixes: `abe1cbaec6` ("drm/amdgpu/hdp6.0: do a posting read when flushing HDP") Reported-by: Alexey Klimov <alexey.klimov@linaro.org> Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908 Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-08 11:20:48 -04:00
Huang Rui	793fa8ce4e	drm/amdgpu: cleanup sriov function for psp v12 PSP v12 won't have SRIOV function. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-08 11:20:43 -04:00
Alex Deucher	4a89b7698e	drm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush Reading back the remapped HDP flush register seems to cause problems on some platforms. All we need is a read, so read back the memcfg register. Fixes: `f756dbac1c` ("drm/amdgpu/hdp5.2: do a posting read when flushing HDP") Reported-by: Alexey Klimov <alexey.klimov@linaro.org> Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908 Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-08 11:20:19 -04:00
Alex Deucher	a5cb344033	drm/amdgpu/hdp5: use memcfg register to post the write for HDP flush Reading back the remapped HDP flush register seems to cause problems on some platforms. All we need is a read, so read back the memcfg register. Fixes: `cf424020e0` ("drm/amdgpu/hdp5.0: do a posting read when flushing HDP") Reported-by: Alexey Klimov <alexey.klimov@linaro.org> Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908 Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-08 11:18:30 -04:00
Huang Rui	518e22b42c	drm/amdgpu: remove re-route ih in psp v12 APU doesn't have second IH ring, so re-routing action here is a no-op. It will take a lot of time to wait timeout from PSP during the initialization. So remove the function in psp v12. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-08 11:18:24 -04:00
Mario Limonciello	b54695dae9	drm/amd: Add per-ring reset for vcn v5.0.0 use If there is a problem requiring a reset of the VCN engine, it is better to reset the VCN engine rather than the entire GPU. Add a reset callback for the ring which will stop and start VCN if an issue happens. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250506204948.12048-4-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-07 17:48:24 -04:00
Mario Limonciello	b8b6e6f165	drm/amd: Add per-ring reset for vcn v4.0.0 use If there is a problem requiring a reset of the VCN engine, it is better to reset the VCN engine rather than the entire GPU. Add a reset callback for the ring which will stop and start VCN if an issue happens. Link: https://lore.kernel.org/r/20250506204948.12048-3-mario.limonciello@amd.com Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-07 17:48:19 -04:00
Mario Limonciello	d1a46cdd00	drm/amd: Add per-ring reset for vcn v4.0.5 use There is a problem occurring on VCN 4.0.5 where in some situations a job is timing out. This triggers a job timeout which then causes a GPU reset for recovery. That has exposed a number of issues with GPU reset that have since been fixed. But also a GPU reset isn't actually needed for this circumstance. Just restarting the ring is enough. Add a reset callback for the ring which will stop and start VCN if the issue happens. Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528 Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3909 Link: https://lore.kernel.org/r/20250506204948.12048-2-mario.limonciello@amd.com Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-07 17:48:11 -04:00
Alex Deucher	5c937b4a60	drm/amdgpu/hdp4: use memcfg register to post the write for HDP flush Reading back the remapped HDP flush register seems to cause problems on some platforms. All we need is a read, so read back the memcfg register. Fixes: `c9b8dcabb5` ("drm/amdgpu/hdp4.0: do a posting read when flushing HDP") Reported-by: Alexey Klimov <alexey.klimov@linaro.org> Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908 Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-07 17:47:33 -04:00
Alex Deucher	e8614fc769	Revert "drm/amdgpu: Use generic hdp flush function" This reverts commit `18a878fd8a`. Revert this temporarily to make it easier to fix a regression in the HDP handling. Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-07 17:47:02 -04:00
Sunil Khatri	c2a3bac7c8	drm/amdgpu: fix the indentation fix the indentation drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c:6992 gfx_v11_ip_dump compiler: gcc-11 (Debian 11.3.0-12) 11.3.0 Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202505071619.7sHTLpNg-lkp@intel.com/ Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-07 17:45:27 -04:00
Huang Rui	8465f0a372	drm/amdgpu: remove mdelay in psp v12 Since secure firmware is more stable than bring up phase, I believe we don't need such mdelays any more before wait PSP response on PSP v12. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Trigger Huang <Trigger.Huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-07 17:45:16 -04:00
Shane Xiao	2d274bf709	amd/amdkfd: Trigger segfault for early userptr unmmapping If applications unmap the memory before destroying the userptr, it needs trigger a segfault to notify user space to correct the free sequence in VM debug mode. v2: Send gpu access fault to user space v3: Report gpu address to user space, remove unnecessary params v4: update pr_err into one line, remove userptr log info Signed-off-by: Shane Xiao <shane.xiao@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-07 17:45:09 -04:00
Shane Xiao	8e320f67d4	drm/amdgpu: Add debug bit for userptr usage In VM debug mode, it is desirable to notify the application to correct the freeing sequence by unmapping the memory before destroying the userptr in the old userptr path. Add a bitmask to decide whether to send gpu vm fault to the applition. Signed-off-by: Shane Xiao <shane.xiao@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-07 17:45:04 -04:00
Prike Liang	def41146b9	drm/amdgpu: unreserve the gem BO before returning from attach error It requires unlocking the reserved gem BO before returning from attaching the eviction fence error. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-05-07 17:44:59 -04:00

1 2 3 4 5 ...

15900 Commits