linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-03-21 23:16:50 +08:00

Author	SHA1	Message	Date
Matthew Brost	fb3738693c	drm/xe: Forcefully tear down exec queues in GuC submit fini In GuC submit fini, forcefully tear down any exec queues by disabling CTs, stopping the scheduler (which cleans up lost G2H), killing all remaining queues, and resuming scheduling to allow any remaining cleanup actions to complete and signal any remaining fences. Split guc_submit_fini into device related and software only part. Using device-managed and drm-managed action guarantees the correct ordering of cleanup. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org Reviewed-by: Zhanjun Dong <zhanjun.dong@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260310225039.1320161-3-zhanjun.dong@intel.com (cherry picked from commit a6ab444a111a59924bd9d0c1e0613a75a0a40b89) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>	2026-03-19 14:22:28 +01:00
Daniele Ceraolo Spurio	9b72283ec9	drm/xe/guc: Fail immediately on GuC load error By using the same variable for both the return of poll_timeout_us and the return of the polled function guc_wait_ucode, the return value of the latter is overwritten and lost after exiting the polling loop. Since guc_wait_ucode returns -1 on GuC load failure, we lose that information and always continue as if the GuC had been loaded correctly. This is fixed by simply using 2 separate variables. Fixes: `a4916b4da4` ("drm/xe/guc: Refactor GuC load to use poll_timeout_us()") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://patch.msgid.link/20260303001732.2540493-2-daniele.ceraolospurio@intel.com (cherry picked from commit c85ec5c5753a46b5c2aea1292536487be9470ffe) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>	2026-03-19 14:22:17 +01:00
Daniele Ceraolo Spurio	6e035abf98	drm/xe/guc: Fix CFI violation in debugfs access. xe_guc_print_info is void-returning, but the function pointer it is assigned to expects an int-returning function, leading to the following CFI error: [ 206.873690] CFI failure at guc_debugfs_show+0xa1/0xf0 [xe] (target: xe_guc_print_info+0x0/0x370 [xe]; expected type: 0xbe3bc66a) Fix this by updating xe_guc_print_info to return an integer. Fixes: `e15826bb3c` ("drm/xe/guc: Refactor GuC debugfs initialization") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: George D Sworo <george.d.sworo@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129182547.32899-2-daniele.ceraolospurio@intel.com (cherry picked from commit dd8ea2f2ab71b98887fdc426b0651dbb1d1ea760) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2026-02-05 08:03:25 -05:00
Daniele Ceraolo Spurio	8d87fa1916	drm/xe/gt: Add engine masks for each class Follow up patches will need the engine masks for VCS and VECS engines. Since we already have a macro for the CCS engines, just extend the same approach to all classes. To avoid confusion with the XE_HW_ENGINE_*_MASK masks, the new macros use the _INSTANCES suffix instead. For consistency, rename CCS_MASK to CCS_INSTANCES as well. Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251218223846.1146344-15-daniele.ceraolospurio@intel.com	2025-12-22 10:21:56 -08:00
Michal Wajdeczko	47f5cee419	drm/xe/guc: Fix version check for page-reclaim feature Page reclamation interfaces were introduced in GuC firmware version 70.31.0 (which corresponds to GuC ABI version 1.14.0), but since this feature is also available for the VFs and VFs don't know the firmware version, use GuC compatibility version check instead. Fixes: `77ebc7c10d` ("drm/xe/guc: Add page reclamation interface to GuC") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Brian Nguyen <brian3.nguyen@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Brian Nguyen <brian3.nguyen@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251215170433.196398-1-michal.wajdeczko@intel.com	2025-12-15 13:33:55 -08:00
Brian Nguyen	77ebc7c10d	drm/xe/guc: Add page reclamation interface to GuC Add page reclamation related changes to GuC interface, handlers, and senders to support page reclamation. Currently TLB invalidations will perform an entire PPC flush in order to prevent stale memory access for noncoherent system memory. Page reclamation is an extension of the typical TLB invalidation workflow, allowing disabling of full PPC flush and enable selective PPC flushing. Selective flushing will be decided by a list of pages whom's address is passed to GuC at time of action. Page reclamation interfaces require at least GuC FW ver 70.31.0. v2: - Moved send_page_reclaim to first patch usage. - Add comments explaining shared done handler. (Matthew B) - Add FW version fallback to disable page reclaim on older versions. (Matthew B, Shuicheng) Signed-off-by: Brian Nguyen <brian3.nguyen@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251212213225.3564537-16-brian3.nguyen@intel.com	2025-12-12 16:59:09 -08:00
Satyanarayana K V P	75e7d26281	drm/xe/vf: Requeue recovery on GuC MIGRATION error during VF post-migration Handle GuC response `XE_GUC_RESPONSE_VF_MIGRATED` as a special case in the VF post-migration recovery flow. When this error occurs, it indicates that a new migration was detected while the resource fixup process was still in progress. Instead of failing immediately, requeue the VF into the recovery path to allow proper handling of the new migration event. This improves robustness of VF recovery in SR-IOV environments where migrations can overlap with resource fixup steps. Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251201095011.21453-9-satyanarayana.k.v.p@intel.com	2025-12-02 16:17:25 +01:00
Raag Jadav	43fb9e113b	drm/xe/gt: Introduce runtime suspend/resume If power state is retained between suspend/resume cycle, we don't need to perform full GT re-initialization. Introduce runtime helpers for GT which greatly reduce suspend/resume delay. v2: Drop redundant xe_gt_sanitize() and xe_guc_ct_stop() (Daniele) Use runtime naming for guc helpers (Daniele) v3: Drop redundant logging, add kernel doc (Michal) Use runtime naming for ct helpers (Michal) v4: Fix tags (Rodrigo) v5: Include host_l2_vram workaround (Daniele) Reuse xe_guc_submit_enable/disable() helpers (Daniele) Co-developed-by: Riana Tauro <riana.tauro@intel.com> Signed-off-by: Riana Tauro <riana.tauro@intel.com> Signed-off-by: Raag Jadav <raag.jadav@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patch.msgid.link/20251030122357.128825-5-raag.jadav@intel.com	2025-11-27 09:05:28 -08:00
Zhanjun Dong	f4c8298cf5	drm/xe/guc: Cleanup GuC log buffer macros and helpers Cleanup GuC log buffer macros and helpers, add Xe style macro prefix. Update buffer type values to align with the GuC specification Update buffer offset calculation. Remove helper functions, replaced with macros. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patch.msgid.link/20251105233143.1168759-1-zhanjun.dong@intel.com	2025-11-24 10:50:07 -08:00
Matt Roper	3947e482b5	drm/xe/guc: Use scope-based cleanup Use scope-based cleanup for forcewake and runtime PM. Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Link: https://patch.msgid.link/20251118164338.3572146-34-matthew.d.roper@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>	2025-11-19 11:58:57 -08:00
Michał Winiarski	d608fbf400	drm/xe/pf: Increase PF GuC Buffer Cache size and use it for VF migration Contiguous PF GGTT VMAs can be scarce after creating VFs. Increase the GuC buffer cache size to 8M for PF so that we can fit GuC migration data (which currently maxes out at just over 4M) and use the cache instead of allocating fresh BOs. Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251112132220.516975-13-michal.winiarski@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>	2025-11-13 11:48:19 +01:00
Balasubramani Vivekanandan	e681ddca30	drm/xe/xe3p_lpm: Add special check in Media GT for Main GAMCTRL For Xe3p arch some subunits of an IP may be different. The GMD_ID register returns the Xe3p arch and dedicates the reserved field to mark possible subunit differences. Generally this is an under-the-hood implementation detail that drivers don't need to worry about, but the new Main_GAMCTRL may be enabled or not depending on those. Those reserved bits are described for Xe3p as: "If Zero, No special case to be handled. If Non-Zero, special case to be handled by Software agent.". That special case is defined per Arch. So if media version is 35, also check the additional reserved bits. To avoid confusion with the usual meaning of "reserved", define them as GMD_ID_SUBIP_FLAG_MASK. Bspec: 74201 Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20251019-xe3p-gamctrl-v1-2-ad66d3c1908f@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-20 17:21:11 -07:00
Brian Welty	94edd65186	drm/xe/xe3p_lpm: Configure MAIN_GAMCTRL_QUEUE_SELECT Starting from Xe3p, there are two different copies of some of the GAM registers: the traditional MCR variant at their old locations, and a new unicast copy known as "main_gamctrl." The Xe driver doesn't use these registers directly, but we need to instruct the GuC on which set it should use. Since the new, unicast registers are preferred (since they avoid the need for unnecessary MCR synchronization), set a new GuC feature flag, GUC_CTL_MAIN_GAMCTRL_QUEUES to convey this decision. A new helper function, xe_guc_using_main_gamctrl_queues(), is added for use in the 3 independent places that need to handle configuration of the new reporting queues. The mmio write to enable the main gamctl is only done during the general GuC upload. The gamctrl registers are not accessed by the GuC during hwconfig load. Last, the ADS blob for communicating the queue addresses contains both a DPA and GGTT offset. The GuC documentation states that DPA is now MBZ when using the MAIN_GAMCTRL queues. Bspec: 76445, 73540 Signed-off-by: Brian Welty <brian.welty@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20251019-xe3p-gamctrl-v1-1-ad66d3c1908f@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-20 17:21:11 -07:00
Satyanarayana K V P	60e2667557	drm/xe/guc: Increase wait timeout to 2sec after BUSY reply from GuC Some VF2GUC actions may take longer to process. Increase default timeout after received BUSY indication to 2sec to cover all worst case scenarios. Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20251008214532.3442967-35-matthew.brost@intel.com	2025-10-09 03:24:28 -07:00
Lucas De Marchi	a4916b4da4	drm/xe/guc: Refactor GuC load to use poll_timeout_us() Currently there are 2 wait loops for loading GuC: one in xe_mmio_wait32_not() and one guc_wait_ucode(). Now that there's a generic poll_timeout_us(), refactor the code to use that to be more readable. Main change in behavior is that there's no exponential wait anymore: that is now replaced by a 10msec retry. Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250922-xe-iopoll-v4-5-06438311a63f@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-09-24 21:23:19 -07:00
Lucas De Marchi	abde96d844	drm/xe/guc: Extract function to print load error Move the error parsing and print out of guc_wait_ucode() into a helper to clean up the wait function. Since now the `load_done != 1` condition has a return statement, also simplify the if/else chain. Reviewed-by: John Harrison <John.C.Harrison@Intel.com> # v2 Link: https://lore.kernel.org/r/20250922-xe-iopoll-v4-4-06438311a63f@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-09-24 21:23:19 -07:00
Lucas De Marchi	2a16f47dcc	drm/xe/guc: Drop helper to read freq As the forcewake is already held during GuC load, there's no need to use a helper function to call xe_guc_pc_get_cur_freq(). Just call xe_guc_pc_get_cur_freq_fw() directly. Suggested-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://lore.kernel.org/r/20250922-xe-iopoll-v4-3-06438311a63f@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-09-24 21:23:19 -07:00
John Harrison	3b09b11805	drm/xe/guc: Return an error code if the GuC load fails Due to multiple explosion issues in the early days of the Xe driver, the GuC load was hacked to never return a failure. That prevented kernel panics and such initially, but now all it achieves is creating more confusing errors when the driver tries to submit commands to a GuC it already knows is not there. So fix that up. As a stop-gap and to help with debug of load failures due to invalid GuC init params, a wedge call had been added to the inner GuC load function. The reason being that it leaves the GuC log accessible via debugfs. However, for an end user, simply aborting the module load is much cleaner than wedging and trying to continue. The wedge blocks user submissions but it seems that various bits of the driver itself still try to submit to a dead GuC and lots of subsequent errors occur. And with regards to developers debugging why their particular code change is being rejected by the GuC, it is trivial to either add the wedge back in and hack the return code to zero again or to just do a GuC log dump to dmesg. v2: Add support for error injection testing and drop the now redundant wedge call. CC: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Link: https://lore.kernel.org/r/20250909224132.536320-1-John.C.Harrison@Intel.com	2025-09-16 12:11:08 -07:00
John Harrison	456b32c9c1	drm/xe/guc: Add test for G2G communications Add a test for sending messages from every GuC to every other GuC to test G2G communications. Note that, being a debug only feature, the test interface only exists in pre-production builds of the GuC firmware. v2: Fix 'default' case to actually use the driver's registration code as well as allocation. Add comments explaining the different test types. Fix (C) date and an assert. Review feedback from Daniele. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://lore.kernel.org/r/20250910210237.603576-5-John.C.Harrison@Intel.com	2025-09-15 09:53:26 -07:00
Daniele Ceraolo Spurio	8843444843	drm/xe/guc: Set RCS/CCS yield policy All recent platforms (including all the ones officially supported by the Xe driver) do not allow concurrent execution of RCS and CCS workloads from different address spaces, with the HW blocking the context switch when it detects such a scenario. The DUAL_QUEUE flag helps with this, by causing the GuC to not submit a context it knows will not be able to execute. This, however, causes a new problem: if RCS and CCS queues have pending workloads from different address spaces, the GuC needs to choose from which of the 2 queues to pick the next workload to execute. By default, the GuC prioritizes RCS submissions over CCS ones, which can lead to CCS workloads being significantly (or completely) starved of execution time. The driver can tune this by setting a dedicated scheduling policy KLV; this KLV allows the driver to specify a quantum (in ms) and a ratio (percentage value between 0 and 100), and the GuC will prioritize the CCS for that percentage of each quantum. Given that we want to guarantee enough RCS throughput to avoid missing frames, we set the yield policy to 20% of each 80ms interval. v2: updated quantum and ratio, improved comment, use xe_guc_submit_disable in gt_sanitize Fixes: `d9a1ae0d17` ("drm/xe/guc: Enable WA_DUAL_QUEUE for newer platforms") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Tested-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://lore.kernel.org/r/20250905235632.3333247-2-daniele.ceraolospurio@intel.com	2025-09-11 09:45:35 -07:00
John Harrison	0b05857dc1	drm/xe/guc: Clean up of GuC 'CTL' defines All the field generation for the CTL defines (used for GuC init data) were hand-rolled rather than using FIELD_PREP/REG_GENMASK/BIT macros. Also, there were a bunch of macros defined for verbosity settings that were never used. So fix that all up. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250904195752.3846138-2-John.C.Harrison@Intel.com	2025-09-05 13:31:32 -07:00
Satyanarayana K V P	ee4b32220a	drm/xe/guc: Add devm release action to safely tear down CT When a buffer object (BO) is allocated with the XE_BO_FLAG_GGTT_INVALIDATE flag, the driver initiates TLB invalidation requests via the CTB mechanism while releasing the BO. However a premature release of the CTB BO can lead to system crashes, as observed in: Oops: Oops: 0000 [#1] SMP NOPTI RIP: 0010:h2g_write+0x2f3/0x7c0 [xe] Call Trace: guc_ct_send_locked+0x8b/0x670 [xe] xe_guc_ct_send_locked+0x19/0x60 [xe] send_tlb_invalidation+0xb4/0x460 [xe] xe_gt_tlb_invalidation_ggtt+0x15e/0x2e0 [xe] ggtt_invalidate_gt_tlb.part.0+0x16/0x90 [xe] ggtt_node_remove+0x110/0x140 [xe] xe_ggtt_node_remove+0x40/0xa0 [xe] xe_ggtt_remove_bo+0x87/0x250 [xe] Introduce a devm-managed release action during xe_guc_ct_init() and xe_guc_ct_init_post_hwconfig() to ensure proper CTB disablement before resource deallocation, preventing the use-after-free scenario. Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Summers Stuart <stuart.summers@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250901072541.31461-1-satyanarayana.k.v.p@intel.com	2025-09-02 08:21:58 +02:00
Vinay Belgaumkar	95b3899b4d	drm/xe/psmi: Add Wa_16023683509 This WA ensures GuC will restore the media MCFG registers at C6 exit. Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Link: https://lore.kernel.org/r/20250821-psmi-v5-5-34ab7550d3d8@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-08-22 11:46:44 -07:00
Lucas De Marchi	efeb036ffd	drm/xe/psmi: Add GuC flag to enable PSMI PSMI allows to capture data from the GPU useful for early validation. From the kernel side there isn't much to be done, just a few things: 1) Toggle the feature support in GuC 2) Enable some additional WAs 3) Allocate buffers Here is the first step, with the next ones to follow. For now everything is disabled through a check in configfs that is currently hardcoded to disabled. Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://lore.kernel.org/r/20250821-psmi-v5-1-34ab7550d3d8@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-08-22 11:46:43 -07:00
Matt Atwood	4d5c98eb77	drm/xe: rename XE_WA to XE_GT_WA Now that there are two types of wa tables and infrastructure, be more concise in the naming of GT wa macros. v2: update the documentation Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250807214224.32728-1-matthew.s.atwood@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-08-08 10:50:45 -04:00
John Harrison	45fbb51050	drm/xe/guc: Add more GuC load error status codes The GuC load process will abort if certain status codes (which are indicative of a fatal error) are reported. Otherwise, it keeps waiting until the 'success' code is returned. New error codes have been added in recent GuC releases, so add support for aborting on those as well. v2: Shuffle HWCONFIG_START to the front of the switch to keep the ordering as per the enum define for clarity (review feedback by Jonathan). Also add a description for the basic 'invalid init data' code which was missing. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250726024337.4056272-1-John.C.Harrison@Intel.com	2025-07-28 14:38:44 -07:00
Zhanjun Dong	b2c4ac219f	drm/xe/uc: Disable GuC communication on hardware initialization error Disable GuC communication on Xe micro controller hardware initialization error. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4917 Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250707231108.3217573-1-zhanjun.dong@intel.com	2025-07-08 14:56:49 -07:00
Matthew Brost	ec9223b49a	drm/xe: Drop bo->size bo->size is redundant because the base GEM object already has a size field with the same value. Drop bo->size and use the base GEM object’s size instead. While at it, introduce xe_bo_size() to abstract the BO size. v2: - Fix typo in kernel doc (Ashutosh) - Fix kunit (CI) - Fix line wrap (Checkpatch) v3: - Fix sriov build (CI) v4: - Fix display build (CI) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://lore.kernel.org/r/20250625144128.2827577-1-matthew.brost@intel.com	2025-06-27 14:52:31 -07:00
Daniele Ceraolo Spurio	9c7d93a8f1	drm/xe/guc: Enable the Dynamic Inhibit Context Switch optimization The Dynamic Inhibit Context Switch is an optimization aimed at reducing the amount of time the HW is stuck waiting on an unsatisfied semaphore. When this optimization is enabled, the GuC will dynamically modify the CTX_CTRL_INHIBIT_SYN_CTX_SWITCH in the CTX_CONTEXT_CONTROL register of LRCs to enable immediate switching out on an unsatisfied semaphore wait when multiple contexts are competing for time on the same engine. This feature is available on recent HW from GuC 70.40.1 onwards and it is enabled via a per-VF feature opt-in. v2: rebase v3: switch to using guc_buf_cache instead of dedicated alloc v4: add helper to check for feature availability (Michal), don't enable if multi-lrc is possible. Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250625205405.1653212-4-daniele.ceraolospurio@intel.com	2025-06-27 11:08:50 -07:00
Daniele Ceraolo Spurio	a7ffcea863	drm/xe/guc: Enable extended CAT error reporting On newer HW (Xe2 onwards + PVC) it is possible to get extra information when a CAT error occurs, specifically a dword reporting the error type. To enable this extra reporting, we need to opt-in with the GuC, which is done via a specific per-VF feature opt-in H2G. On platforms where the HW does not support the extra reporting, the GuC will set the type to 0xdeadbeef, so we can keep the code simple and opt-in to the feature on every platform and then just discard the data if it is invalid. Note that on native/PF we're guaranteed that the opt in is available because we don't support any GuC old enough to not have it, but if we're a VF we might be running on a non-XE PF with an older GuC, so we need to handle that case. We can re-use the invalid type above to handle this scenario the same way as if the feature was not supported in HW. Given that this patch is the first user of the guc_buf_cache on native and VF, it also extends that feature to non-PF use-cases. v2: simpler print for the error type (John), rebase v3: use guc_buf_cache instead of new alloc, simpler doc (Michal) Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> #v1 Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250625205405.1653212-3-daniele.ceraolospurio@intel.com	2025-06-27 11:08:40 -07:00
Maarten Lankhorst	a42939ee86	drm/xe: Remove xe_uc_fini_hw xe_uc_init_hw() is called multiple times from xe_gt.c, and that makes the name xe_uc_fini_hw(), called for a different reason in xe_guc.c confusing. Remove it and inline the xe_uc_sanitize_reset into xe_guc.c directly. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250619104858.418440-23-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-26 22:11:35 +02:00
Maarten Lankhorst	396044c9d8	drm/xe: Simplify GuC early initialization Add a 2-stage GuC init. An early one for everything that is needed for VF, and a full one that loads GuC and is allowed to do allocations. Link: https://lore.kernel.org/r/20250619104858.418440-16-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-06-26 22:10:30 +02:00
Maarten Lankhorst	b3412d7233	drm/xe/sriov: Move VF bootstrap and query_config to vf_guc_init We want to split up GUC init to an alloc and noalloc part to keep the init path the same for VF and !VF as much as possible. Everything in vf_guc_init should be done as early as possible, otherwise VRAM probing becomes impossible. Also move xe_gt_mmio_init to the end of xe_gt_init_early(), cleaning up the init in xe_device slightly. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250619104858.418440-15-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>	2025-06-26 22:07:44 +02:00
Daniele Ceraolo Spurio	90f4d3f756	drm/xe/vf: Boostrap all GTs immediately after MMIO init Currently we perform the bootstrap for the primary GT early on during device init, while the media GT bootstrap happens when we try and fetch the hwconfig table. For consistency, move the bootstrap of the media GT happen at the same time as the primary GT, so that all the subsequent code can rely on both GTs being in the same state. v2: Also drop config query from min_guc_load since we now do it early (Michal) Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250603235432.720833-8-daniele.ceraolospurio@intel.com	2025-06-06 08:33:18 -07:00
Michal Wajdeczko	d65650a9d1	drm/xe/guc: Resend potentially lost H2G MMIO request There could be a scenario where the VF driver is resuming faster than the driver PF is able to complete the VF FLR sequence which includes reset of the VF scratch registers. This may result in deletion of the ongoing HXG message (it could be either a host request or a GuC response). When we detect that HXG message was likey lost (scratch register with HXG header was zeroed) try to send this request once more before giving up. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Link: https://lore.kernel.org/r/20250528090021.329-1-michal.wajdeczko@intel.com	2025-06-02 19:22:03 +02:00
Michal Wajdeczko	b86babc9d9	drm/xe/guc: Unblock GuC buffer cache for all modes Today we were using GuC buffer cache only in the PF mode, but shortly we will want to use it also in native and VF mode. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://lore.kernel.org/r/20250512220018.172-2-michal.wajdeczko@intel.com	2025-05-15 12:29:54 +02:00
Daniele Ceraolo Spurio	dba7d17d50	drm/xe/vf: Fix guc_info debugfs for VFs The guc_info debugfs attempts to read a bunch of registers that the VFs doesn't have access to, so fix it by skipping the reads. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4775 Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lukasz Laguna <lukasz.laguna@intel.com> Reviewed-by: Lukasz Laguna <lukasz.laguna@intel.com> Link: https://lore.kernel.org/r/20250423173908.1571412-1-daniele.ceraolospurio@intel.com	2025-04-29 13:20:57 -07:00
Satyanarayana K V P	e9dea328e8	drm/xe: Introduce fault injection for guc mmio send/recv. Fault can be injected with below steps. FAILTYPE=fail_function FAILFUNC=xe_guc_mmio_send_recv echo > /sys/kernel/debug/$FAILTYPE/inject echo $FAILFUNC > /sys/kernel/debug/$FAILTYPE/inject printf %#x -5 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval echo N > /sys/kernel/debug/$FAILTYPE/task-filter echo 10 > /sys/kernel/debug/$FAILTYPE/probability echo 0 > /sys/kernel/debug/$FAILTYPE/interval echo -1 > /sys/kernel/debug/$FAILTYPE/times echo 0 > /sys/kernel/debug/$FAILTYPE/space echo 1 > /sys/kernel/debug/$FAILTYPE/verbose Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Michał Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://lore.kernel.org/r/20250403120641.7258-2-satyanarayana.k.v.p@intel.com	2025-04-17 22:11:16 +02:00
Matthew Brost	045448da87	drm/xe: Add XE_BO_FLAG_PINNED_NORESTORE Not all BOs need to be restored on resume / d3cold exit, add XE_BO_FLAG_PINNED_NO_RESTORE which skips restoring of BOs rather just allocates VRAM for the BO. This should slightly speedup resume / d3cold exit flows. Marking GuC ADS, GuC CT, GuC log, GuC PC, and SA as NORESTORE. v2: - s/WONTNEED/NORESTORE (Vivi) - Rebase on newly added g2g and backup object flow Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250403102440.266113-11-matthew.auld@intel.com	2025-04-04 11:41:01 +01:00
Rodrigo Vivi	fc858ddf9c	drm/xe/guc_pc: Remove duplicated pc_start call xe_guc_pc_start() was getting called from both xe_uc_init_hw() and from xe_guc_start(). But both are called from do_gt_restart() and only xe_uc_init_hw() is called at initialization. So, let's remove the duplication in the regular gt_restart path. The only place where xe_guc_pc_start() won't get called now is on the gt_reset failure path. However, if gt_reset has failed, it is really unlikely that the PC start will work or is desired. Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250306220643.1014049-1-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2025-03-07 11:10:05 -05:00
Riana Tauro	6978c5f5a6	drm/xe/xe_pmu: Add PMU support for engine activity PMU provides two counters (engine-active-ticks, engine-total-ticks) to calculate engine activity. When querying engine activity, user must group these 2 counters using the perf_event group mechanism to ensure both counters are sampled together. To list the events ./perf list xe_0000_03_00.0/engine-active-ticks/ [Kernel PMU event] xe_0000_03_00.0/engine-total-ticks/ [Kernel PMU event] The formats to be used with the above are engine_instance - config:12-19 engine_class - config:20-27 gt - config:60-63 The events can then be read using perf tool ./perf stat -e xe_0000_03_00.0/engine-active-ticks,gt=0, engine_class=0,engine_instance=0/, xe_0000_03_00.0/engine-total-ticks,gt=0, engine_class=0,engine_instance=0/ -I 1000 Engine activity can then be calculated as below engine activity % = (engine active ticks/engine total ticks) * 100 v2: validate gt rename total-ticks to engine-total-ticks add helper to get hwe (Umesh) v3: fix checkpatch warning add details to documentation (Umesh) remove ascii formats from documentation (Lucas) v4: remove unnecessary warn within raw_spinlock (Lucas) Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250224053903.2253539-5-riana.tauro@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-02-24 13:23:57 -08:00
Michal Wajdeczko	696bfdf273	drm/xe/guc: Introduce the GuC Buffer Cache The purpose of the GuC Buffer Cache is to maintain a set ofreusable buffers that could be used while sending some of the CTB H2G actions that require separate buffer with indirect data. Currently only few PF actions need this so initialize it only when running as a PF. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241220194205.995-9-michal.wajdeczko@intel.com	2025-01-19 00:12:03 +01:00
Daniele Ceraolo Spurio	d9a1ae0d17	drm/xe/guc: Enable WA_DUAL_QUEUE for newer platforms The DUAL_QUEUE_WA tells the GuC to not allow concurrent submissions on RCS and CCSes with different address spaces, which on DG2 is required as a WA for an HW bug. On newer platforms, this block has been moved in HW at the CS level, by stalling the RCS/CCS context switch when one of the other RCS/CCSes is busy with a different address space. While functionally correct, having a submission stalled on the HW limits the GuC ability to shuffle things around and can cause complications if the non-stalled submission runs for a long time, because the GuC doesn't know that the stalled submission isn't actually running and might declare it as hung. Therefore, we enable the DUAL_QUEUE_WA on all newer platforms to move management back to the GuC. Note that the GuC specs also recommend enabling this for all platforms starting from MTL that have a CCS. v2: only apply the WA on GTs that have CCS engines v3: split comment (Jonathan) Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Jesus Narvaez <jesus.narvaez@intel.com> Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241213181012.2178794-1-daniele.ceraolospurio@intel.com	2024-12-16 13:24:27 -08:00
John Harrison	a9f7b97dda	drm/xe/guc: Add support for G2G communications Some features require inter-GuC communication channels on multi-tile devices. So allocate and enable such. v2: Correct use of xe_bo_get/put (review feedback from Matthew Brost) Add extra assert, re-order a calculation for better clarity and add comments to slot calculation (review feedback from Daniele). Also slightly re-work the slot calc to avoid code duplication. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241120000222.204095-3-John.C.Harrison@Intel.com	2024-11-22 19:10:56 -08:00
Tomasz Lis	f7278da76d	drm/xe/guc: Do not assert CTB state while sending MMIO During VF post-migration recovery, MMIO communication channel to GuC is used, despite CTB channel being enabled. This behavior is rooted in the save-restore architecture specification. Therefore, a VF driver cannot assert that CTB is disabled while sending MMIO messages to GuC. Such assertion needs to be PF only, or be removed. This patch simply removes the assertion. Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Suggested-by: Michał Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107151357.1623733-2-tomasz.lis@intel.com	2024-11-09 15:11:13 +01:00
Tomasz Lis	6e6d7b41f9	drm/xe/vf: React to MIGRATED interrupt To properly support VF Save/Restore procedure, fixups need to be applied after PF driver finishes its part of VF Restore. The fixups are required to adjust the ongoing execution for a hardware switch that happened, because some GFX resources are not fully virtualized, and assigned to a VF as range from a global pool. The VF on which a VM is restored will often have different ranges provisioned than the VF on which save process happened. Those resource fixups are applied by the VF driver within a restored VM. A VF driver gets informed that it was migrated by receiving an interrupt from each GuC. The interrupt assigned for that purpose is "GUC SW interrupt 0". Seeing that fields set from within the irq handler should be the trigger for fixups. The VF can safely do post-migration fixups on resources associated to each GuC only after that GuC issued the MIGRATED interrupt. This change introduces a worker to be used for post-migration fixups, and a mechanism to schedule said worker when all GuCs sent the irq. v2: renamed and moved functions, updated logged messages, removed unused includes, used anon struct (Michal) v3: ordering, kerneldoc, asserts, debug messages, on_all_tiles -> on_all_gts (Michal) v4: fixed missing header include v5: Explained what fixups are, explained which IRQ is used, style fixes (Michal) Bspec: 50868 Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241104213449.1455694-2-tomasz.lis@intel.com Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>	2024-11-06 14:53:35 +01:00
John Harrison	db38fdb7bf	drm/xe/guc: Separate full CTB content from guc_info debugfs The guc_info debugfs file is meant to be a quick view of the current software state of the GuC interface. Including the full CTB contents makes the file as a whole much less human readable and is not partiular useful in the general case. So don't pollute the info dump with the full buffers. Instead, move those into a separate debugfs entry that can be read when that information is actually required. Also, improve the human readability by adding a few extra blank lines to delimt the sections. v2: Hide the internal capture/print params from external callers that don't need to know (review feedback from Matthew Brost). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241024002554.1983101-3-John.C.Harrison@Intel.com	2024-10-29 13:11:34 -07:00
Fei Yang	6ef3bb6055	drm/xe: enable lite restore The lite restore is a performance improvement feature which avoids unnecessary context switch (flush, save and restore) if the incoming context has a ContextID matching that of the outgoing context. The scheduling is done by the GuC firmware, so on the driver side it's just a matter of setting corresponding GUC_CTL_FEATURE flag. This is supposed to be enabled by default, thus the flag is set unconditionally. Signed-off-by: Fei Yang <fei.yang@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241017162710.942553-2-fei.yang@intel.com	2024-10-21 10:34:45 -07:00
Himal Prasad Ghimiray	31a5dce0a3	drm/xe/guc: Update handling of xe_force_wake_get return xe_force_wake_get() now returns the reference count-incremented domain mask. If it fails for individual domains, the return value will always be 0. However, for XE_FORCEWAKE_ALL, it may return a non-zero value even in the event of failure. Use helper xe_force_wake_ref_has_domain to verify all domains are initialized or not. Update the return handling of xe_force_wake_get() to reflect this behavior, and ensure that the return value is passed as input to xe_force_wake_put(). v3 - return xe_wakeref_t instead of int in xe_force_wake_get() - xe_force_wake_put() error doesn't need to be checked. It internally WARNS on domain ack failure. v5 - return unsigned int from xe_force_wake_get() - Remove redundant xe_gt_WARN_ON v6 - use helper xe_force_wake_ref_has_domain() v7 - Fix commit message v9 - Rebase Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241014075601.2324382-17-himal.prasad.ghimiray@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-10-17 10:17:08 -04:00
Zhanjun Dong	9c8c7a7e6f	drm/xe/guc: Prepare GuC register list and update ADS size for error capture Add referenced registers defines and list of registers. Update GuC ADS size allocation to include space for the lists of error state capture register descriptors. Then, populate GuC ADS with the lists of registers we want GuC to report back to host on engine reset events. This list should include global, engine-class and engine-instance registers for every engine-class type on the current hardware. Ensure we allocate a persistent storage for the register lists that are populated into ADS so that we don't need to allocate memory during GT resets when GuC is reloaded and ADS population happens again. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-2-zhanjun.dong@intel.com	2024-10-08 09:34:04 -07:00

1 2 3 4

152 Commits