mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
linux/drivers/gpu/drm/amd/amdkfd
Philip Yang cf234231fc drm/amdkfd: Don't call mmput from MMU notifier callback
If the process is exiting, the mmput inside the MMU notifier callback,
invoked from compaction, fork, or NUMA balancing, could release the last
reference to the mm struct and call exit_mmap and free_pgtables. This
triggers a deadlock with the backtrace below.

The deadlock leaks the kfd process, because the MMU notifier release is
never called, and causes a VRAM leak.

The fix is to take an mm reference with mmget_not_zero when adding a
prange to the deferred list, pairing it with the mmput in the deferred
list work.
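
The pairing described above can be sketched in plain user-space C. This is
a toy model, not the driver code: every toy_* name is hypothetical, the mm
is reduced to a bare counter standing in for mm_users, and the only point
shown is that the notifier side takes a reference and never drops one,
while the deferred worker does the matching put.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mm_struct; only the user count is modeled. */
struct toy_mm {
    int users;       /* models mm_users */
    bool torn_down;  /* set when the last reference is dropped */
};

/* Models mmget_not_zero(): take a reference unless the count is already 0. */
static bool toy_mmget_not_zero(struct toy_mm *mm)
{
    if (mm->users == 0)
        return false;
    mm->users++;
    return true;
}

/* Models mmput(): the final put runs teardown (exit_mmap in the kernel). */
static void toy_mmput(struct toy_mm *mm)
{
    assert(mm->users > 0);
    if (--mm->users == 0)
        mm->torn_down = true;
}

/* Hypothetical deferred-list entry that owns one mm reference while queued. */
struct toy_work_item {
    struct toy_mm *mm;
    bool queued;
};

/* Notifier side: pin the mm for the worker; no put happens in this path. */
static bool toy_notifier_queue(struct toy_work_item *item, struct toy_mm *mm)
{
    if (!toy_mmget_not_zero(mm))
        return false;  /* mm already gone, nothing to defer */
    item->mm = mm;
    item->queued = true;
    return true;
}

/* Worker side: runs outside the notifier, so a final put is safe here. */
static void toy_deferred_work(struct toy_work_item *item)
{
    if (!item->queued)
        return;
    /* ... handle the unmapped range ... */
    toy_mmput(item->mm);  /* pairs with the get in toy_notifier_queue() */
    item->mm = NULL;
    item->queued = false;
}
```

In this model, if the process drops its own reference while an item is
queued, teardown is postponed until toy_deferred_work runs, so the
teardown path is never entered from the notifier side.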

If a prange is split and added to the pchild list, the pchild
work_item.mm is not used, so remove the mm parameter from
svm_range_unmap_split and svm_range_add_child.

The backtrace of hung task:

 INFO: task python:348105 blocked for more than 64512 seconds.
 Call Trace:
  __schedule+0x1c3/0x550
  schedule+0x46/0xb0
  rwsem_down_write_slowpath+0x24b/0x4c0
  unlink_anon_vmas+0xb1/0x1c0
  free_pgtables+0xa9/0x130
  exit_mmap+0xbc/0x1a0
  mmput+0x5a/0x140
  svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu]
  mn_itree_invalidate+0x72/0xc0
  __mmu_notifier_invalidate_range_start+0x48/0x60
  try_to_unmap_one+0x10fa/0x1400
  rmap_walk_anon+0x196/0x460
  try_to_unmap+0xbb/0x210
  migrate_page_unmap+0x54d/0x7e0
  migrate_pages_batch+0x1c3/0xae0
  migrate_pages_sync+0x98/0x240
  migrate_pages+0x25c/0x520
  compact_zone+0x29d/0x590
  compact_zone_order+0xb6/0xf0
  try_to_compact_pages+0xbe/0x220
  __alloc_pages_direct_compact+0x96/0x1a0
  __alloc_pages_slowpath+0x410/0x930
  __alloc_pages_nodemask+0x3a9/0x3e0
  do_huge_pmd_anonymous_page+0xd7/0x3e0
  __handle_mm_fault+0x5e3/0x5f0
  handle_mm_fault+0xf7/0x2e0
  hmm_vma_fault.isra.0+0x4d/0xa0
  walk_pmd_range.isra.0+0xa8/0x310
  walk_pud_range+0x167/0x240
  walk_pgd_range+0x55/0x100
  __walk_page_range+0x87/0x90
  walk_page_range+0xf6/0x160
  hmm_range_fault+0x4f/0x90
  amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu]
  amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu]
  init_user_pages+0xb1/0x2a0 [amdgpu]
  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu]
  kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu]
  kfd_ioctl+0x29d/0x500 [amdgpu]

Fixes: fa582c6f36 ("drm/amdkfd: Use mmget_not_zero in MMU notifier")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a29e067bd3)
Cc: stable@vger.kernel.org
2025-06-30 13:57:12 -04:00
cik_event_interrupt.c drm/amdkfd: Identical code for different branches 2025-05-29 10:56:54 -04:00
cik_int.h
cik_regs.h
cwsr_trap_handler_gfx8.asm
cwsr_trap_handler_gfx9.asm drm/amdkfd: Clear MODE.VSKIP in gfx9 trap handler 2025-01-24 09:53:05 -05:00
cwsr_trap_handler_gfx10.asm drm/amdkfd: Move gfx12 trap handler to separate file 2025-01-09 16:02:56 -05:00
cwsr_trap_handler_gfx12.asm drm/amdkfd: Fix instruction hazard in gfx12 trap handler 2025-03-13 23:12:28 -04:00
cwsr_trap_handler.h drm/amdkfd: Fix instruction hazard in gfx12 trap handler 2025-03-13 23:12:28 -04:00
Kconfig drm/amdkfd: enable kfd on RISCV systems 2025-06-03 15:02:48 -04:00
kfd_chardev.c drm/amdkfd: Change svm_range_get_info return type 2025-05-22 12:00:30 -04:00
kfd_crat.c drm/amdgpu: simplify xgmi peer info calls 2025-02-25 11:45:12 -05:00
kfd_crat.h drm/amdkfd: Reconcile the definition and use of oem_id in struct kfd_topology_device 2024-05-08 18:46:26 -04:00
kfd_debug.c drm/amdkfd: delete stray tab in kfd_dbg_set_mes_debug_mode() 2025-03-11 12:36:38 -04:00
kfd_debug.h drm/amdkfd: add gc 9.5.0 support on kfd 2024-12-10 10:26:51 -05:00
kfd_debugfs.c drm/amdkfd: add pasid debugfs entries 2025-04-30 18:06:14 -04:00
kfd_device_queue_manager_cik.c drm/amdkfd: Add support for more per-process flag 2025-03-07 15:33:49 -05:00
kfd_device_queue_manager_v9.c drm/amdkfd: Correct F8_MODE for gfx950 2025-03-13 23:13:12 -04:00
kfd_device_queue_manager_v10.c drm/amdkfd: Add support for more per-process flag 2025-03-07 15:33:49 -05:00
kfd_device_queue_manager_v11.c drm/amdkfd: Add support for more per-process flag 2025-03-07 15:33:49 -05:00
kfd_device_queue_manager_v12.c drm/amdkfd: Add support for more per-process flag 2025-03-07 15:33:49 -05:00
kfd_device_queue_manager_vi.c drm/amdkfd: Add support for more per-process flag 2025-03-07 15:33:49 -05:00
kfd_device_queue_manager.c drm/amdkfd: change error to warning message for SDMA queues creation 2025-05-07 17:41:27 -04:00
kfd_device_queue_manager.h drm/amdkfd: Add pm_config_dequeue_wait_counts API 2025-03-10 13:23:12 -04:00
kfd_device.c drm/amdkfd: Drop workaround for GC v9.4.3 revID 0 2025-04-07 15:18:59 -04:00
kfd_doorbell.c drm/amdkfd: get doorbell's absolute offset based on the db_size 2023-10-09 17:02:34 -04:00
kfd_events.c amd/amdkfd: fix a kfd_process ref leak 2025-05-29 10:57:02 -04:00
kfd_events.h
kfd_flat_memory.c drm/amdkfd: Use device based logging for errors 2024-07-01 16:10:47 -04:00
kfd_int_process_v9.c drm/amdkfd: Use dev_* instead of pr_* for messages 2025-04-07 15:18:31 -04:00
kfd_int_process_v10.c drm/amdkfd: drop warning in event_interrupt_isr_v1*() 2025-05-13 09:34:09 -04:00
kfd_int_process_v11.c drm/amdkfd: drop warning in event_interrupt_isr_v1*() 2025-05-13 09:34:09 -04:00
kfd_interrupt.c drm/amdgpu: Show warning message if IH ring overflow 2024-12-18 12:39:07 -05:00
kfd_kernel_queue.c drm/amdkfd: Use the correct wptr size 2024-11-21 15:55:20 -05:00
kfd_kernel_queue.h drm/amdkfd: Skip packet submission on fatal error 2024-02-26 11:14:31 -05:00
kfd_migrate.c drm/amdkfd: add a new flag to manage where VRAM allocations go 2025-02-12 21:04:08 -05:00
kfd_migrate.h drm/amdkfd: Use partial migrations/mapping for GPU/CPU page faults in SVM 2023-12-06 15:22:32 -05:00
kfd_module.c
kfd_mqd_manager_cik.c drm/amdkfd: Check preemption status on all XCDs 2024-03-20 13:38:12 -04:00
kfd_mqd_manager_v9.c drm/amdkfd: Set SDMA_RLCx_IB_CNTL/SWITCH_INSIDE_IB 2025-04-30 18:05:46 -04:00
kfd_mqd_manager_v10.c drm/amdkfd: Preserve cp_hqd_pq_control on update_mqd 2025-02-25 11:43:58 -05:00
kfd_mqd_manager_v11.c drm/amdkfd: Preserve cp_hqd_pq_control on update_mqd 2025-02-25 11:43:58 -05:00
kfd_mqd_manager_v12.c drm/amdkfd: Preserve cp_hqd_pq_control on update_mqd 2025-02-25 11:43:58 -05:00
kfd_mqd_manager_vi.c drm/amdkfd: Check preemption status on all XCDs 2024-03-20 13:38:12 -04:00
kfd_mqd_manager.c drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer 2024-07-23 17:34:44 -04:00
kfd_mqd_manager.h drm/amdkfd: Check preemption status on all XCDs 2024-03-20 13:38:12 -04:00
kfd_packet_manager_v9.c drm/amdkfd: Fix race in GWS queue scheduling 2025-06-18 13:17:32 -04:00
kfd_packet_manager_vi.c drm/amdkfd: Add pm_config_dequeue_wait_counts API 2025-03-10 13:23:12 -04:00
kfd_packet_manager.c drm/amdkfd: Support chain runlists of XNACK+/XNACK- 2025-05-16 13:37:29 -04:00
kfd_pm4_headers_ai.h drm/amdkfd: Support chain runlists of XNACK+/XNACK- 2025-05-16 13:37:29 -04:00
kfd_pm4_headers_aldebaran.h drm/amdkfd: Enable processes isolation on gfx9 2024-08-20 22:08:07 -04:00
kfd_pm4_headers_vi.h
kfd_pm4_headers.h
kfd_pm4_opcodes.h
kfd_priv.h amd/amdkfd: Trigger segfault for early userptr unmmapping 2025-05-07 17:45:09 -04:00
kfd_process_queue_manager.c drm/amdkfd: Map wptr BO to GART unconditionally 2025-05-29 10:58:44 -04:00
kfd_process.c drm/amdkfd: add pasid debugfs entries 2025-04-30 18:06:14 -04:00
kfd_queue.c drm/amdkfd: Drop workaround for GC v9.4.3 revID 0 2025-04-07 15:18:59 -04:00
kfd_smi_events.c drm/amdkfd: fix a bug of smi event for superuser 2025-04-21 10:54:44 -04:00
kfd_smi_events.h drm/amdkfd: add smi events for process start and end 2025-04-11 17:01:25 -04:00
kfd_svm.c drm/amdkfd: Don't call mmput from MMU notifier callback 2025-06-30 13:57:12 -04:00
kfd_svm.h drm/amdkfd: Change svm_range_get_info return type 2025-05-22 12:00:30 -04:00
kfd_topology.c drm/amdkfd: move SDMA queue reset capability check to node_show 2025-06-18 12:56:52 -04:00
kfd_topology.h drm/amdkfd: flag per-sdma queue reset supported to user space 2025-03-05 10:47:33 -05:00
Makefile drm/amdkfd: remove kfd_pasid.c from amdgpu driver build 2025-02-27 16:50:04 -05:00
soc15_int.h drm/amdkfd: Check int source id for utcl2 poison event 2024-08-23 10:52:33 -04:00