AMDGPU RAS Support¶
The AMDGPU RAS interfaces are exposed via sysfs (for informational queries) and debugfs (for error injection).
RAS debugfs/sysfs Control and Error Injection Interfaces¶
The control interface accepts struct ras_debug_if which has two members.
First member: ras_debug_if::head or ras_debug_if::inject.
head is used to indicate which IP block will be under control.
head has four members, they are block, type, sub_block_index, name. block: which IP will be under control. type: what kind of error will be enabled/disabled/injected. sub_block_index: some IPs have subcomponets. say, GFX, sDMA. name: the name of IP.
inject has two more members than head, they are address, value. As their names indicate, inject operation will write the value to the address.
The second member: struct ras_debug_if::op. It has three kinds of operations.
0: disable RAS on the block. Take ::head as its data.
1: enable RAS on the block. Take ::head as its data.
2: inject errors on the block. Take ::inject as its data.
How to use the interface?
In a program
Copy the struct ras_debug_if in your code and initialize it. Write the struct to the control interface.
echo "disable <block>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl echo "enable <block> <error>" > /sys/kernel/debug/dri/<N>/ras/ras_ctrl echo "inject <block> <error> <sub-block> <address> <value> > /sys/kernel/debug/dri/<N>/ras/ras_ctrl
Where N, is the card which you want to affect.
“disable” requires only the block. “enable” requires the block and error type. “inject” requires the block, error type, address, and value.
- The block is one of: umc, sdma, gfx, etc.
see ras_block_string for details
- The error type is one of: ue, ce, where,
ue is multi-uncorrectable ce is single-correctable
The sub-block is a the sub-block index, pass 0 if there is no sub-block. The address and value are hexadecimal numbers, leading 0x is optional.
echo inject umc ue 0x0 0x0 0x0 > /sys/kernel/debug/dri/0/ras/ras_ctrl echo inject umc ce 0 0 0 > /sys/kernel/debug/dri/0/ras/ras_ctrl echo disable umc > /sys/kernel/debug/dri/0/ras/ras_ctrl
How to check the result of the operation?
To check disable/enable, see “ras” features at, /sys/class/drm/card[0/1/2…]/device/ras/features
To check inject, see the corresponding error count at, /sys/class/drm/card[0/1/2…]/device/ras/[gfx|sdma|umc|…]_err_count
Operations are only allowed on blocks which are supported. Check the “ras” mask at /sys/module/amdgpu/parameters/ras_mask to see which blocks support RAS on a particular asic.
RAS Reboot Behavior for Unrecoverable Errors¶
Normally when there is an uncorrectable error, the driver will reset the GPU to recover. However, in the event of an unrecoverable error, the driver provides an interface to reboot the system automatically in that event.
The following file in debugfs provides that interface: /sys/kernel/debug/dri/[0/1/2…]/ras/auto_reboot
echo true > .../ras/auto_reboot
RAS Error Count sysfs Interface¶
It allows the user to read the error count for each IP block on the gpu through /sys/class/drm/card[0/1/2…]/device/ras/[gfx/sdma/…]_err_count
It outputs the multiple lines which report the uncorrected (ue) and corrected (ce) error counts.
The format of one line is below,
ue: 0 ce: 1
RAS EEPROM debugfs Interface¶
Some boards contain an EEPROM which is used to persistently store a list of bad pages which experiences ECC errors in vram. This interface provides a way to reset the EEPROM, e.g., after testing error injection.
echo 1 > ../ras/ras_eeprom_reset
will reset EEPROM table to 0 entries.
RAS VRAM Bad Pages sysfs Interface¶
It allows user to read the bad pages of vram on the gpu through /sys/class/drm/card[0/1/2…]/device/ras/gpu_vram_bad_pages
It outputs multiple lines, and each line stands for one gpu page.
The format of one line is below, gpu pfn : gpu page size : flags
gpu pfn and gpu page size are printed in hex format. flags can be one of below character,
R: reserved, this gpu page is reserved and not able to use.
P: pending for reserve, this gpu page is marked as bad, will be reserved in next window of page_reserve.
F: unable to reserve. this gpu page can’t be reserved due to some reasons.
0x00000001 : 0x00001000 : R 0x00000002 : 0x00001000 : P
Sample code for testing error injection can be found here: https://cgit.freedesktop.org/mesa/drm/tree/tests/amdgpu/ras_tests.c
This is part of the libdrm amdgpu unit tests which cover several areas of the GPU. There are four sets of tests:
RAS Basic Test
The test verifies the RAS feature enabled status and makes sure the necessary sysfs and debugfs files are present.
RAS Query Test
This test checks the RAS availability and enablement status for each supported IP block as well as the error counts.
RAS Inject Test
This test injects errors for each IP.
RAS Disable Test
This test tests disabling of RAS features for each IP block.