linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-18 04:09:08 +08:00

Author	SHA1	Message	Date
Lucas De Marchi	59b7ed0ba2	drm/xe/configfs: Improve doc for ctx_restore* attributes Spell out the syntax instead of only using examples. Particularly important the <engine-class> part since that's different than engines_allowed and may confuse users. The same batch buffer is used for all engines of a certain class. Cc: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Fixes: `e2a9854d80` ("drm/xe/configfs: Allow to select by class only") Link: https://lore.kernel.org/r/20250924152709.659483-4-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `47ca7acff4`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:51 -07:00
Lucas De Marchi	7646423c7f	drm/xe/configfs: Fix engine class parsing If mask is NULL, only the engine class should be accepted, so the pattern string should be completely parsed. This should fix passing e.g. rcs0 to ctx_restore_post_bb when it's only expecting the engine class. Reported-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Closes: https://lore.kernel.org/r/20250922155544.67712-1-jonathan.cavitt@intel.com Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/aNJKnrCQmL9xS9Gv@stanley.mountain Fixes: `e2a9854d80` ("drm/xe/configfs: Allow to select by class only") Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250924152709.659483-3-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `dd79796716`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:51 -07:00
Michal Wajdeczko	7bd03e3914	drm/xe/tests: Fix build break on clang 16.0.6 The following error was reported when building with clang 16.0.6: In file included from drivers/gpu/drm/xe/xe_pci.c:1104: >> drivers/gpu/drm/xe/tests/xe_pci.c:214:2: error: initializer \ element is not a compile-time constant graphics_ip_xelp, ^~~~~~~~~~~~~~~~ drivers/gpu/drm/xe/tests/xe_pci.c:221:2: error: initializer \ element is not a compile-time constant media_ip_xem, ^~~~~~~~~~~~ 2 errors generated. Fix that by explicit re-definition of pre-GMDID IPs, as there are not so many of them. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509192041.tQwdE4DS-lkp@intel.com/ Fixes: `5bb5258e35` ("drm/xe/tests: Add pre-GMDID IP descriptors to param generators") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250922101207.192028-1-michal.wajdeczko@intel.com (cherry picked from commit `2de80e2da7`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:51 -07:00
Dave Airlie	b2ec5ca9d5	Merge tag 'amd-drm-next-6.18-2025-09-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.18-2025-09-26: amdgpu: - Misc fixes - Misc cleanups - SMU 13.x fixes - MES fix - VCN 5.0.1 reset fixes - DCN 3.2 watermark fixes - AVI infoframe fixes - PSR fix - Brightness fixes - DCN 3.1.4 fixes - DCN 3.1+ DTM fixes - DCN powergating fixes - DMUB fixes - DCN/SMU cleanup - DCN stutter fixes - DCN 3.5 fixes - GAMMA_LUT fixes - Add UserQ documentation - GC 9.4 reset fixes - Enforce isolation cleanups - UserQ fixes - DC/non-DC common modes cleanup - DCE6-10 fixes amdkfd: - Fix a race in sw_fini - Switch partition fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250926143918.2030854-1-alexander.deucher@amd.com	2025-09-30 09:26:31 +10:00
Dave Airlie	62bea0e1d5	Merge tag 'drm-habanalabs-next-2025-09-25' of https://github.com/HabanaAI/drivers.accel.habanalabs.kernel into drm-next This tag contains habanalabs driver changes for v6.18. It continues the previous upstream work from tags/drm-habanalabs-next-2024-06-23, Including improvements in debug and visibility, alongside general code cleanups, and new features such as vmalloc-backed coherent mmap, HLDIO infrastructure, etc. Signed-off-by: Dave Airlie <airlied@redhat.com> From: "Elbaz, Koby" <koby.elbaz@intel.com> Link: https://lore.kernel.org/r/da02d370-9967-49d2-9eef-7aeaa40c987c@intel.com	2025-09-26 13:26:51 +10:00
Dave Airlie	a2caae58f8	Merge tag 'drm-misc-next-fixes-2025-09-25' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next Short summary of fixes pull: bridge: - waveshare-dsi: Fix error handling in probe function pixpaper: - select GEM SHMEM helpers Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://lore.kernel.org/r/20250925064257.GA9107@linux.fritz.box	2025-09-26 13:06:39 +10:00
Mario Limonciello	df2ba57094	drm/amd: Add name to modes from amdgpu_connector_add_common_modes() [Why] When DC adds common modes it adds modes with a string to match what they are. Non-DC doesn't. This can be inconsistent when turning on/off DC support. [How] Add a name member to common_modes[] and copy it into the drm display mode. Cc: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Link: https://lore.kernel.org/r/20250924161624.1975819-6-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:54:22 -04:00
Mario Limonciello	6d622755bc	drm/amd: Drop some common modes from amdgpu_connector_add_common_modes() [Why] DC and non-DC codepaths have different sets of common modes that are added for eDP and LVDS cases. This can cause different behaviors for turning on DC on hardware that can support both. [How] Drop extra modes from amdgpu_connector_add_common_modes() not present in amdgpu_dm_connector_add_common_modes(). Cc: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250924161624.1975819-5-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:54:12 -04:00
Alex Deucher	dbf2341569	drm/amdgpu: update MODULE_PARM_DESC for freesync_video To better describe what it does. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3756 Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:54:07 -04:00
Mario Limonciello	123a1750c5	drm/amd: Use dynamic array size declaration for amdgpu_connector_add_common_modes() [Why] Adding or removing a mode from common_modes[] can be fragile if a user forgot to update the for loop boundaries. [How] Use ARRAY_SIZE() to detect size of the array and use that instead. Cc: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Link: https://lore.kernel.org/r/20250924161624.1975819-4-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:55 -04:00
Timur Kristóf	1f721ebcf3	drm/amd/display: Share dce100_validate_global with DCE6-8 The dce100_validate_global function was verbatim exactly the same as dce60_validate_global and dce80_validate_global. Share dce100_validate_global between DCE6-10 to save code size. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:46 -04:00
Timur Kristóf	ee352f6c56	drm/amd/display: Share dce100_validate_bandwidth with DCE6-8 DCE6-8 have very similar capabilities to DCE10, they support the same DP and HDMI versions and work similarly. Share dce100_validate_bandwidth between DCE6-10 to reduce code duplication in the DC driver. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:33 -04:00
Jesse.Zhang	b8ae2640f9	drm/amdgpu: Fix fence signaling race condition in userqueue This commit fixes a potential race condition in the userqueue fence signaling mechanism by replacing dma_fence_is_signaled_locked() with dma_fence_is_signaled(). The issue occurred because: 1. dma_fence_is_signaled_locked() should only be used when holding the fence's individual lock, not just the fence list lock 2. Using the locked variant without the proper fence lock could lead to double-signaling scenarios: - Hardware completion signals the fence - Software path also tries to signal the same fence By using dma_fence_is_signaled() instead, we properly handle the locking hierarchy and avoid the race condition while still maintaining the necessary synchronization through the fence_list_lock. v2: drop the comment (Christian) Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:23 -04:00
Yifan Zhang	45da20e00d	amd/amdkfd: enhance kfd process check in switch partition current switch partition only check if kfd_processes_table is empty. kfd_prcesses_table entry is deleted in kfd_process_notifier_release, but kfd_process tear down is in kfd_process_wq_release. consider two processes: Process A (workqueue) -> kfd_process_wq_release -> Access kfd_node member Process B switch partition -> amdgpu_xcp_pre_partition_switch -> amdgpu_amdkfd_device_fini_sw -> kfd_node tear down. Process A and B may trigger a race as shown in dmesg log. This patch is to resolve the race by adding an atomic kfd_process counter kfd_processes_count, it increment as create kfd process, decrement as finish kfd_process_wq_release. v2: Put kfd_processes_count per kfd_dev, move decrement to kfd_process_destroy_pdds and bug fix. (Philip Yang) [3966658.307702] divide error: 0000 [#1] SMP NOPTI [3966658.350818] i10nm_edac [3966658.356318] CPU: 124 PID: 38435 Comm: kworker/124:0 Kdump: loaded Tainted [3966658.356890] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu] [3966658.362839] nfit [3966658.366457] RIP: 0010:kfd_get_num_sdma_engines+0x17/0x40 [amdgpu] [3966658.366460] Code: 00 00 e9 ac 81 02 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 48 8b 4f 08 48 8b b7 00 01 00 00 8b 81 58 26 03 00 99 <f7> be b8 01 00 00 80 b9 70 2e 00 00 00 74 0b 83 f8 02 ba 02 00 00 [3966658.380967] x86_pkg_temp_thermal [3966658.391529] RSP: 0018:ffffc900a0edfdd8 EFLAGS: 00010246 [3966658.391531] RAX: 0000000000000008 RBX: ffff8974e593b800 RCX: ffff888645900000 [3966658.391531] RDX: 0000000000000000 RSI: ffff888129154400 RDI: ffff888129151c00 [3966658.391532] RBP: ffff8883ad79d400 R08: 0000000000000000 R09: ffff8890d2750af4 [3966658.391532] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000000 [3966658.391533] R13: ffff8883ad79d400 R14: ffffe87ff662ba00 R15: ffff8974e593b800 [3966658.391533] FS: 0000000000000000(0000) GS:ffff88fe7f600000(0000) knlGS:0000000000000000 [3966658.391534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3966658.391534] CR2: 0000000000d71000 CR3: 000000dd0e970004 CR4: 0000000002770ee0 [3966658.391535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3966658.391535] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [3966658.391536] PKRU: 55555554 [3966658.391536] Call Trace: [3966658.391674] deallocate_sdma_queue+0x38/0xa0 [amdgpu] [3966658.391762] process_termination_cpsch+0x1ed/0x480 [amdgpu] [3966658.399754] intel_powerclamp [3966658.402831] kfd_process_dequeue_from_all_devices+0x5b/0xc0 [amdgpu] [3966658.402908] kfd_process_wq_release+0x1a/0x1a0 [amdgpu] [3966658.410516] coretemp [3966658.434016] process_one_work+0x1ad/0x380 [3966658.434021] worker_thread+0x49/0x310 [3966658.438963] kvm_intel [3966658.446041] ? process_one_work+0x380/0x380 [3966658.446045] kthread+0x118/0x140 [3966658.446047] ? __kthread_bind_mask+0x60/0x60 [3966658.446050] ret_from_fork+0x1f/0x30 [3966658.446053] Modules linked in: kpatch_20765354(OEK) [3966658.455310] kvm [3966658.464534] mptcp_diag xsk_diag raw_diag unix_diag af_packet_diag netlink_diag udp_diag act_pedit act_mirred act_vlan cls_flower kpatch_21951273(OEK) kpatch_18424469(OEK) kpatch_19749756(OEK) [3966658.473462] idxd_mdev [3966658.482306] kpatch_17971294(OEK) sch_ingress xt_conntrack amdgpu(OE) amdxcp(OE) amddrm_buddy(OE) amd_sched(OE) amdttm(OE) amdkcl(OE) intel_ifs iptable_mangle tcm_loop target_core_pscsi tcp_diag target_core_file inet_diag target_core_iblock target_core_user target_core_mod coldpgs kpatch_18383292(OEK) ip6table_nat ip6table_filter ip6_tables ip_set_hash_ipportip ip_set_hash_ipportnet ip_set_hash_ipport ip_set_bitmap_port xt_comment iptable_nat nf_nat iptable_filter ip_tables ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sn_core_odd(OE) i40e overlay binfmt_misc tun bonding(OE) aisqos(OE) aisqos_hotfixes(OE) rfkill uio_pci_generic uio cuse fuse nf_tables nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm idxd_mdev [3966658.491237] vfio_pci [3966658.501196] vfio_pci vfio_virqfd mdev vfio_iommu_type1 vfio iax_crypto intel_pmt_telemetry iTCO_wdt intel_pmt_class iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq [3966658.508537] vfio_virqfd [3966658.517569] snd_seq_device ipmi_ssif isst_if_mbox_pci isst_if_mmio pcspkr snd_pcm idxd intel_uncore ses isst_if_common intel_vsec idxd_bus enclosure snd_timer mei_me snd i2c_i801 i2c_smbus mei i2c_ismt soundcore joydev acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad vfat fat [3966658.526851] mdev [3966658.536096] nfsd auth_rpcgss nfs_acl lockd grace slb_vtoa(OE) sunrpc dm_mod hookers mlx5_ib(OE) ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm mlx5_core(OE) mlxfw(OE) [3966658.540381] vfio_iommu_type1 [3966658.544341] nvme mpt3sas tls drm nvme_core pci_hyperv_intf raid_class psample libcrc32c crc32c_intel mlxdevm(OE) i2c_core [3966658.551254] vfio [3966658.558742] scsi_transport_sas wmi pinctrl_emmitsburg sd_mod t10_pi sg ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE) [3966658.563004] iax_crypto [3966658.570988] [last unloaded: diagnose] [3966658.571027] ---[ end trace cc9dbb180f9ae537 ]--- Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Philip.Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:08 -04:00
Yifan Zhang	99d7181bca	amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw There is race in amdgpu_amdkfd_device_fini_sw and interrupt. if amdgpu_amdkfd_device_fini_sw run in b/w kfd_cleanup_nodes and kfree(kfd), and KGD interrupt generated. kernel panic log: BUG: kernel NULL pointer dereference, address: 0000000000000098 amdgpu 0000:c8:00.0: amdgpu: Requesting 4 partitions through PSP PGD d78c68067 P4D d78c68067 kfd kfd: amdgpu: Allocated 3969056 bytes on gart PUD 1465b8067 PMD @ Oops: @002 [#1] SMP NOPTI kfd kfd: amdgpu: Total number of KFD nodes to be created: 4 CPU: 115 PID: @ Comm: swapper/115 Kdump: loaded Tainted: G S W OE K RIP: 0010:_raw_spin_lock_irqsave+0x12/0x40 Code: 89 e@ 41 5c c3 cc cc cc cc 66 66 2e Of 1f 84 00 00 00 00 00 OF 1f 40 00 Of 1f 44% 00 00 41 54 9c 41 5c fa 31 cO ba 01 00 00 00 <fO> OF b1 17 75 Ba 4c 89 e@ 41 Sc 89 c6 e8 07 38 5d RSP: 0018: ffffc90@1a6b0e28 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000018 0000000000000001 RSI: ffff8883bb623e00 RDI: 0000000000000098 ffff8883bb000000 RO8: ffff888100055020 ROO: ffff888100055020 0000000000000000 R11: 0000000000000000 R12: 0900000000000002 ffff888F2b97da0@ R14: @000000000000098 R15: ffff8883babdfo00 CS: 010 DS: 0000 ES: 0000 CRO: 0000000080050033 CR2: 0000000000000098 CR3: 0000000e7cae2006 CR4: 0000000002770ce0 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 0000000000000000 DR6: 00000000fffeO7FO DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> kgd2kfd_interrupt+@x6b/0x1f@ [amdgpu] ? amdgpu_fence_process+0xa4/0x150 [amdgpu] kfd kfd: amdgpu: Node: 0, interrupt_bitmap: 3 YcpxFl Rant tErace amdgpu_irq_dispatch+0x165/0x210 [amdgpu] amdgpu_ih_process+0x80/0x100 [amdgpu] amdgpu: Virtual CRAT table created for GPU amdgpu_irq_handler+0x1f/@x60 [amdgpu] __handle_irq_event_percpu+0x3d/0x170 amdgpu: Topology: Add dGPU node [0x74a2:0x1002] handle_irq_event+0x5a/@xcO handle_edge_irq+0x93/0x240 kfd kfd: amdgpu: KFD node 1 partition @ size 49148M asm_call_irq_on_stack+0xf/@x20 </IRQ> common_interrupt+0xb3/0x130 asm_common_interrupt+0x1le/0x40 5.10.134-010.a1i5000.a18.x86_64 #1 Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:52:36 -04:00
Timur Kristóf	118800b079	drm/amd/display: Reject modes with too high pixel clock on DCE6-10 Reject modes with a pixel clock higher than the maximum display clock. Use 400 MHz as a fallback value when the maximum display clock is not known. Pixel clocks that are higher than the display clock just won't work and are not supported. With the addition of the YUV422 fallback, DC can now accidentally select a mode requiring higher pixel clock than actually supported when the DP version supports the required bandwidth but the clock is otherwise too high for the display engine. DCE 6-10 don't support these modes but they don't have a bandwidth calculation to reject them properly. Fixes: `db291ed173` ("drm/amd/display: Add fallback path for YCBCR422") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:51:07 -04:00
Mario Limonciello	210844d2c0	drm/amd: Drop unnecessary check in amdgpu_connector_add_common_modes() [Why] amdgpu_connector_add_common_modes() has a check for the width and height of common modes being too small, but the array of common_modes[] has fixed values. The check is dead code. [How] Drop unnecessary check. Cc: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250924161624.1975819-3-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:50:58 -04:00
Mario Limonciello	0fb915d64d	drm/amd/display: Only enable common modes for eDP and LVDS [Why] The main reason common modes are added is for compatibility with clone mode when a laptop is connected to a projector or external monitor. Since commit `978fa2f6d0` ("drm/amd/display: Use scaling for non-native resolutions on eDP") when non-native modes are picked for eDP the GPU scalar will be used. This is because it is inconsistent whether eDP panels have the capability to actually drive non-native resolutions. With panels connected to other connectors this limitation generally doesn't exist as we the EDID will advertise support for a number of resolutions and monitors will use built in scaling hardware. Comparing DC and non-DC code paths the non-DC code path only adds common modes for LVDS and eDP whereas the DC codepath does it for all connector types. In the past there was an experiment done to disable common mode adding for eDP and LVDS from commit `6d396e7ac1` ("drm/amd/display: Disable common modes for LVDS") and commit `7948afb46a` ("drm/amd/display: Disable common modes for eDP") but this was reverted in commit `a8b79b0918` ("drm/amd: Re-enable common modes for eDP and LVDS") because it caused problems with Xorg. [How] Only add common modes for eDP and LVDS for DC, matching the behavior of non-DC. Suggested-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250924161624.1975819-2-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:47:55 -04:00
Sunil Khatri	4e3b45d7b6	drm/amdgpu: remove the redeclaration of variable i Variable "i" has been redeclared as integer later in the function which is wrong and not serving any purpose. Fixes: `899fbde146` ("drm/amdgpu: replace get_user_pages with HMM mirror helpers") Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:46:35 -04:00
Prike Liang	883bd89d00	drm/amdgpu/userq: assign an error code for invalid userq va It should return an error code if userq VA validation fails. Fixes: `9e46b8bb05` ("drm/amdgpu: validate userq buffer virtual address and size") Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:40:18 -04:00
Christian König	90e09ea4cf	drm/amdgpu: revert "rework reserved VMID handling" v2 This reverts commit `e44a0fe630`. Initially we used VMID reservation to enforce isolation between processes. That has now been replaced by proper fence handling. Both OpenGL, RADV and ROCm developers requested a way to reserve a VMID for SPM, so restore that approach by reverting back to only allowing a single process to use the reserved VMID. Only compile tested for now. v2: use -ENOENT instead of -EINVAL if VMID is not available Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:39:00 -04:00
Christian König	66f3883dbc	drm/amdgpu: remove leftover from enforcing isolation by VMID Initially we enforced isolation by reserving a VMID, but that practice was now removed. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:38:54 -04:00
Jesse.Zhang	7469567d88	drm/amdgpu: Add fallback to pipe reset if KCQ ring reset fails Add a fallback mechanism to attempt pipe reset when KCQ reset fails to recover the ring. After performing the KCQ reset and queue remapping, test the ring functionality. If the ring test fails, initiate a pipe reset as an additional recovery step. v2: fix the typo (Lijo) v3: try pipeline reset when kiq mapping fails (Lijo) Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:38:48 -04:00
Pavan S	6ca282c3e6	accel/habanalabs: add Infineon version check On HL338 ASICs, the Infineon first‑stage firmware is not present and the reported version is 0. In this case printing a version number is misleading, as it suggests valid firmware when it does not exist. Fix this by printing the first‑stage Infineon firmware version only if the reported value is non‑zero. This avoids confusing or incorrect log messages on devices where the first stage is not applicable. Signed-off-by: Pavan S <pavan.sreenivas@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:14:45 +03:00
Konstantin Sinyuk	a0d866bab1	accel/habanalabs/gaudi2: read preboot status after recovering from dirty state Dirty state can occur when the host VM undergoes a reset while the device does not. In such a case, the driver must reset the device before it can be used again. As part of this reset, the device capabilities are zeroed. Therefore, the driver must read the Preboot status again to learn the Preboot state, capabilities, and security configuration. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:32 +03:00
Ariel Aviad	65a3f5bc33	accel/habanalabs: add HL_GET_P_STATE passthrough type Add a new passthrough type HL_GET_P_STATE to the cpucp generic ioctl to allow userspace to read the device performance state via firmware. Signed-off-by: Ariel Aviad <ariel.aviad@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:31 +03:00
Konstantin Sinyuk	eeb38d0e91	accel/habanalabs: add debugfs interface for HLDIO testing Add debugfs files for NVMe Direct I/O (HLDIO) functionality. This interface allows userspace access to direct SSD ↔ device transfers through debugfs nodes. Four debugfs files are created under /sys/kernel/debug/habanalabs/hlN/: - dio_ssd2hl : trigger SSD-to-device transfers - dio_hl2ssd : trigger device-to-SSD transfers (placeholder, not yet implemented) - dio_stats : show transfer statistics - dio_reset : reset statistics counters Usage examples: # Perform SSD → device transfer echo "fd=3 va=0x10000 off=0 len=4096" > \ /sys/kernel/debug/habanalabs/hl0/dio_ssd2hl # View statistics cat /sys/kernel/debug/habanalabs/hl0/dio_stats # Reset counters echo 1 > /sys/kernel/debug/habanalabs/hl0/dio_reset This interface provides access to HLDIO functionality for validation and diagnostics. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Farah Kassabri <farah.kassabri@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:31 +03:00
Konstantin Sinyuk	8cbacc9a27	accel/habanalabs: add NVMe Direct I/O (HLDIO) infrastructure Introduce NVMe Direct I/O (HLDIO) infrastructure to support peer‑to‑peer DMA in the habanalabs driver. This adds internal helpers and data structures to enable direct transfers between NVMe storage and device memory. The feature is built only when CONFIG_HL_HLDIO is enabled. A debugfs interface is also provided for functional validation. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Farah Kassabri <farah.kassabri@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:30 +03:00
Moti Haimovski	513024d5a0	accel/habanalabs: support mapping cb with vmalloc-backed coherent memory When IOMMU is enabled, dma_alloc_coherent() with GFP_USER may return addresses from the vmalloc range. If such an address is mapped without VM_MIXEDMAP, vm_insert_page() will trigger a BUG_ON due to the VM_PFNMAP restriction. Fix this by checking for vmalloc addresses and setting VM_MIXEDMAP in the VMA before mapping. This ensures safe mapping and avoids kernel crashes. The memory is still driver-allocated and cannot be accessed directly by userspace. Signed-off-by: Moti Haimovski <moti.haimovski@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:30 +03:00
Ilia Levi	0668db41b5	accel/habanalabs: remove old interface variation of 'access_ok()' The access_ok() API no longer requires the VERIFY_WRITE argument, and the use of the old interface with VERIFY_WRITE is deprecated. Clean up the habanalabs memory manager to use the modern access_ok() interface consistently. This removes old #ifdef guards and aligns the driver with current upstream kernel APIs. Signed-off-by: Ilia Levi <ilia.levi@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:29 +03:00
Konstantin Sinyuk	0529b191ac	accel/habanalabs/gaudi2: use the CPLD_SHUTDOWN event handler After CPLD shutdown event the device is not usable anymore. The common CPLD_SHUTDOWN event handler disables any subsequent device access. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:29 +03:00
Konstantin Sinyuk	083c53a854	accel/habanalabs: disable device access after CPLD_SHUTDOWN After a CPLD shutdown event the device becomes unusable. Prevent further device access once this event is received. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:28 +03:00
Tomer Tayar	cade027efa	accel/habanalabs: fix typo in trace output (cms -> cmd) Fix a typo in TP_printk format string of habanalabs tracepoint: replace "cms" with "cmd". Signed-off-by: Tomer Tayar <tomer.tayar@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:28 +03:00
Tomer Tayar	d0dd796bec	accel/habanalabs: clarify ctx use after hl_ctx_put() in dmabuf release In hl_release_dmabuf(), ctx is dereferenced after calling hl_ctx_put() to obtain the compute device file. This is safe because the dma-buf object holds a file reference taken in export_dmabuf(), and the file release (which drops another ctx reference) can only happen after we drop that file reference via fput(). Thus, this hl_ctx_put() call cannot be the last one at this point. Add a comment explaining this to avoid confusion. Signed-off-by: Tomer Tayar <tomer.tayar@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:27 +03:00
Sharley Calzolari	b5cddeb0dc	accel/habanalabs/gaudi2: add support for logging register accesses from debugfs Add infrastructure for logging the last configuration register accesses that occur via debugfs read/write operations. At interrupt time, these log entries can be dumped to dmesg, which helps in diagnosing the cause of RAZWI and ADDR_DEC interrupts. The logging is implemented as a ring buffer of access entries, with each entry recording timestamp and access details. To ensure correctness under concurrent access, operations are now protected using spinlocks. Entries are copied under lock and then printed after releasing it, which minimizes time spent in the critical section. Signed-off-by: Sharley Calzolari <sharley.calzolari@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:26 +03:00
Ariel Suller	214e26a43f	accel/habanalabs/gaudi2: stringify engine/queue ids Print engine/queue names instead of numerical engine/queue IDs to make logs and debug output more readable. Signed-off-by: Ariel Suller <ariel.suller@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:25 +03:00
Vitaly Margolin	5295be6c4e	accel/habanalabs: add generic message type to get error counters Add a new CPUCP generic message type to retrieve HBM, SRAM and critical error counters from the device. Signed-off-by: Vitaly Margolin <vitaly.margolin@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:25 +03:00
Vered Yavniely	b4fd8e56c9	accel/habanalabs/gaudi2: fix BMON disable configuration Change the BMON_CR register value back to its original state before enabling, so that BMON does not continue to collect information after being disabled. Signed-off-by: Vered Yavniely <vered.yavniely@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:24 +03:00
Tomer Tayar	9f5067531c	accel/habanalabs: return ENOMEM if less than requested pages were pinned EFAULT is currently returned if less than requested user pages are pinned. This value means a "bad address" which might be confusing to the user, as the address of the given user memory is not necessarily "bad". Modify the return value to ENOMEM, as "out of memory" is more suitable in this case. Signed-off-by: Tomer Tayar <tomer.tayar@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:22 +03:00
Jesse.Zhang	4c709ccc47	drm/amd/pm: Add VCN reset message support for SMU v13.0.12 This commit adds support for VCN reset functionality in SMU v13.0.12 by: 1. Adding two new PPSMC messages in smu_v13_0_12_ppsmc.h: - PPSMC_MSG_ResetVCN (0x5E) - Updates PPSMC_Message_Count to 0x5F to account for new messages 2. Adding message mapping for ResetVCN in smu_v13_0_12_ppt.c: - Maps SMU_MSG_ResetVCN to PPSMC_MSG_ResetVCN These changes enable proper VCN reset handling through the SMU firmware interface for compatible AMD GPUs. v2: Added fw version check to support vcn queue reset. Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Sonny Jiang <sonny.jiang@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:41:41 -04:00
Jesse.Zhang	5886090032	drm/amdgpu: Move VCN reset mask setup to late_init for VCN 5.0.1 This patch moves the initialization of the VCN supported_reset mask from sw_init to a new late_init function for VCN 5.0.1. The change ensures that all necessary hardware and firmware initialization is complete before determining the supported reset types. Key changes: - Added vcn_v5_0_1_late_init() function to handle late initialization - Moved supported_reset mask setup from sw_init to late_init - Added check for per-queue reset support via amdgpu_dpm_reset_vcn_is_supported() - Updated ip_funcs to use the new late_init function This change helps ensure proper reset behavior by waiting until all dependencies are initialized before determining available reset types. Reviewed-by: Sonny Jiang <sonny.jiang@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Ruili Ji <ruiliji2@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:41:37 -04:00
Jesse.Zhang	dc704458dd	drm/amdgpu: Add ring reset support for VCN v5.0.1 Implement the ring reset callback for VCN v5.0.1 to properly handle hardware recovery when encountering GPU hangs. The new functionality: 1. Adds vcn_v5_0_1_ring_reset() function that: - Prepares for reset using amdgpu_ring_reset_helper_begin() - Performs VCN instance reset via amdgpu_dpm_reset_vcn() - Re-initializes hardware through vcn_v5_0_1_hw_init_inst() - Restarts DPG mode with vcn_v5_0_1_start_dpg_mode() - Completes reset with amdgpu_ring_reset_helper_end() 2. Hooks the reset function into the unified ring functions via: - Adding .reset = vcn_v5_0_1_ring_reset to vcn_v5_0_1_unified_ring_vm_funcs 3. Maintains existing behavior for SR-IOV VF cases by checking RRMT status This provides proper hardware recovery capabilities for VCN 5.0.1 IP block during fault conditions, matching functionality available in other VCN versions. v2: Remove the RRMT_ENABLED cap setting in the reset function and replace adev->vcn.inst[ring->me].indirect_sram with vinst->indirect_sram (Lijo) Reviewed-by: Sonny Jiang <sonny.jiang@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Ruili Ji <ruiliji2@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:41:27 -04:00
Jesse.Zhang	eb6910cdaa	drm/amdgpu: Refactor VCN v5.0.1 HW init into separate instance function Split the per-instance initialization code from vcn_v5_0_1_hw_init() into a new vcn_v5_0_1_hw_init_inst() function. This improves code organization by: 1. Separating the instance-specific initialization logic 2. Making the main init function more readable 3. Following the pattern used in queue reset The SR-IOV specific initialization remains in the main function since it has different requirements. Reviewed-by: Sonny Jiang <sonny.jiang@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Ruili Ji <ruiliji2@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:41:11 -04:00
Alex Deucher	0c1f3fe9a5	Documentation: add initial documenation for user queues Add an initial documentation page for user mode queues. Reviewed-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:41:06 -04:00
Melissa Wen	752e6f283e	drm/amd/display: remove output_tf_change flag Remove this flag as the driver stopped managing it individually since commit `a4056c2a63` ("drm/amd/display: use HW hdr mult for brightness boost"). After some back and forth it was reintroduced as a condition to `set_output_transfer_func()` in [1]. Without direct management, this flag only changes value when all surface update flags are set true on UPDATE_TYPE_FULL with no output TF status meaning. Fixes: `bb622e0c00` ("drm/amd/display: program output tf when required") [1] Signed-off-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:37:01 -04:00
Kuan-Wei Chiu	43f06e8165	drm/amd/display: Optimize remove_duplicates() from O(N^2) to O(N) Replace the previous O(N^2) implementation of remove_duplicates() with a O(N) version using a fast/slow pointer approach. The new version keeps only the first occurrence of each element and compacts the array in place, improving efficiency without changing functionality. Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:36:53 -04:00
Melissa Wen	51cb93aa0c	drm/amd/display: change dc stream color settings only in atomic commit Don't update DC stream color components during atomic check. The driver will continue validating the new CRTC color state but will not change DC stream color components. The DC stream color state will only be programmed at commit time in the `atomic_setup_commit` stage. It fixes gamma LUT loss reported by KDE users when changing brightness quickly or changing Display settings (such as overscan) with nightlight on and HDR. As KWin can do a test commit with color settings different from those that should be applied in a non-test-only commit, if the driver changes DC stream color state in atomic check, this state can be eventually HW programmed in commit tail, instead of the respective state set by the non-blocking commit. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4444 Reported-by: Xaver Hugl <xaver.hugl@gmail.com> Signed-off-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:36:15 -04:00
YiPeng Chai	2330437da0	drm/amd/ras: Add rascore status definition Add rascore status definition. V5: Merge the previous empty files. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:36:02 -04:00
Rahul Kumar	86a54e45fd	drm/amdgpu: Use kmalloc_array() instead of kmalloc() Documentation/process/deprecated.rst recommends against the use of kmalloc with dynamic size calculations due to the risk of overflow and smaller allocation being made than the caller was expecting. Replace kmalloc() with kmalloc_array() in amdgpu_amdkfd_gfx_v10.c, amdgpu_amdkfd_gfx_v10_3.c, amdgpu_amdkfd_gfx_v11.c and amdgpu_amdkfd_gfx_v12.c to make the intended allocation size clearer and avoid potential overflow issues. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rahul Kumar <rk0006818@gmail.com> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:35:54 -04:00
Melissa Wen	2f9c638837	drm/amd/display: update color on atomic commit time Use `atomic_commit_setup` to change the DC stream state. It's a preparation to remove from `atomic_check` changes in CRTC color components of DC stream state and prevent DC to commit TEST_ONLY changes. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/4444 Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Melissa Wen <mwen@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:35:17 -04:00

1 2 3 4 5 ...

1384744 Commits