mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, synced 2025-09-04 20:19:47 +08:00
			
		
		
		
85d86c8aa5
		
	
	
	
	
		
			
Use the new eeh_dev_check and eeh_dev_break interfaces to test EEH recovery. Historically this has been done manually using platform specific EEH error injection facilities (e.g. via RTAS). However, documentation on how to use these facilities is haphazard at best and non-existent at worst, so it's hard to develop a cross-platform test.

The new debugfs interfaces allow the kernel to handle the platform specific details so we can write a more generic set of tests. This patch adds the most basic of recovery tests where:

a) Errors are injected and recovered from sequentially.
b) Errors are not injected into PCI-PCI bridges, such as PCIe switches.
c) Errors are only injected into device function zero.
d) No errors are injected into Virtual Functions.

a), b) and c) are largely due to limitations of Linux's EEH support. EEH recovery is serialised in the EEH recovery thread, which forces a). b) is because we currently more or less ignore PCI bridges during recovery and assume that the recovered topology will be the same as the original. As for c), multi-function PCI devices are almost always grouped into the same PE, so injecting an error on one function exercises the same code paths.

d) is due to the limits of the eeh_dev_break interface. With the current implementation we can't inject an error into a specific VF without potentially causing additional errors on other VFs. Due to the serialised recovery process we might end up timing out waiting for another function to recover before the function of interest is recovered. The platform specific error injection facilities are finer-grained and allow this capability, but doing that requires working out how to use those facilities first.

Basically, it's better than nothing and it's a base to build on.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-15-oohall@gmail.com
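The selection rules b), c) and d) above could be sketched roughly as follows. This is a minimal illustration, not the test added by the patch: the function name `eeh_candidate_devs` and its optional directory argument are hypothetical, and it assumes that bridges can be recognised by a `pci_bus` entry and Virtual Functions by a `physfn` entry in their sysfs device directory.

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch: print the devices that rules
# b), c) and d) would leave as error injection candidates.
# The optional argument overrides the sysfs directory so the logic can be
# tried against a fake tree.
eeh_candidate_devs() {
	local sysfs="${1:-/sys/bus/pci/devices}"
	local path dev

	# c) Only function zero: match names ending in ".0".
	for path in "$sysfs"/*.0 ; do
		# An unmatched glob stays literal, so skip non-existent paths.
		[ -e "$path" ] || continue
		dev="${path##*/}"

		# b) Assumption: a bridge has a pci_bus entry for its
		# secondary bus.
		if [ -e "$path/pci_bus" ] ; then
			echo "$dev, Skipped: bridge" >&2
			continue
		fi

		# d) Assumption: a Virtual Function has a physfn link
		# pointing at its parent Physical Function.
		if [ -e "$path/physfn" ] ; then
			echo "$dev, Skipped: virtfn" >&2
			continue
		fi

		echo "$dev"
	done
}
```

Rule a) then follows from feeding the printed names to the recovery helper one at a time, waiting for each device to recover before moving on to the next.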
		
			
				
	
	
		
77 lines, 1.8 KiB, Bash, Executable File
		
	
	
	
	
			
		
		
	
	
		
	
	
	
	
#!/bin/sh
# SPDX-License-Identifier: GPL-2.0-only

pe_ok() {
	local dev="$1"
	local path="/sys/bus/pci/devices/$dev/eeh_pe_state"

	if ! [ -e "$path" ] ; then
		return 1
	fi

	# Field 1 is the firmware (platform) state, field 2 the software state.
	local fw_state="$(cut -d' ' -f1 < "$path")"
	local sw_state="$(cut -d' ' -f2 < "$path")"

	# If EEH_PE_ISOLATED or EEH_PE_RECOVERING are set then the PE is in an
	# error state or being recovered. Either way, not ok.
	if [ "$((sw_state & 0x3))" -ne 0 ] ; then
		return 1
	fi

	# A functioning PE should have the EEH_STATE_MMIO_ACTIVE and
	# EEH_STATE_DMA_ACTIVE flags set. For some goddamn stupid reason
	# the platform backends set these when the PE is in reset. The
	# RECOVERING check above should stop any false positives though.
	if [ "$((fw_state & 0x18))" -ne "$((0x18))" ] ; then
		return 1
	fi

	return 0
}

eeh_supported() {
	test -e /proc/powerpc/eeh && \
	grep -q 'EEH Subsystem is enabled' /proc/powerpc/eeh
}

eeh_one_dev() {
	local dev="$1"

	# Using this function from the command line is sometimes useful for
	# testing so check that the argument is a well-formed sysfs device
	# name.
	if ! test -e "/sys/bus/pci/devices/$dev/" ; then
		echo "Error: '$dev' must be a sysfs device name (DDDD:BB:DD.F)"
		return 1
	fi

	# Break it
	echo "$dev" >/sys/kernel/debug/powerpc/eeh_dev_break

	# Force an EEH device check. If the kernel has already
	# noticed the EEH (due to a driver poll or whatever), this
	# is a no-op.
	echo "$dev" >/sys/kernel/debug/powerpc/eeh_dev_check

	# Enforce a 30s timeout for recovery. Even the IPR, which is infamously
	# slow to reset, should recover within 30s.
	max_wait=30

	# Poll once per second until the PE looks healthy or we run out of time.
	for i in $(seq 0 ${max_wait}) ; do
		if pe_ok "$dev" ; then
			break
		fi
		echo "$dev, waited $i/${max_wait}"
		sleep 1
	done

	if ! pe_ok "$dev" ; then
		echo "$dev, Failed to recover!"
		return 1
	fi

	echo "$dev, Recovered after $i seconds"
	return 0
}
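
The two mask checks in pe_ok() can be tried out without EEH hardware using a sketch like the one below. It is an illustration only: `pe_state_verdict` is a hypothetical name, the verdict strings are made up, and the sample file is assumed to hold the same two space-separated values (firmware state, then software state) that pe_ok() reads from eeh_pe_state.

```shell
#!/bin/sh
# Hypothetical helper, not part of eeh-functions.sh: classify the two
# fields of an eeh_pe_state-style file using the same masks as pe_ok().
# The file path is an argument so the logic can be run on a sample file.
pe_state_verdict() {
	local path="$1"
	local fw=""
	local sw=""

	if ! [ -r "$path" ] ; then
		return 2
	fi

	# Firmware state is the first field, software state the second.
	read -r fw sw < "$path" || true
	if [ -z "$fw" ] || [ -z "$sw" ] ; then
		return 2
	fi

	# Mask 0x3 covers EEH_PE_ISOLATED and EEH_PE_RECOVERING.
	if [ "$(($sw & 0x3))" -ne 0 ] ; then
		echo "isolated-or-recovering"
		return 1
	fi

	# Mask 0x18 covers EEH_STATE_MMIO_ACTIVE and EEH_STATE_DMA_ACTIVE;
	# both bits have to be set for the PE to count as working.
	if [ "$(($fw & 0x18))" -ne "$((0x18))" ] ; then
		echo "mmio-or-dma-inactive"
		return 1
	fi

	echo "ok"
	return 0
}
```

For example, a sample file containing `0x00000018 0x00000000` has both ACTIVE bits set and neither software flag, so it is reported as `ok`, whereas `0x00000018 0x00000001` is rejected by the first check.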