The Linux Kernel
5.7.0

Devlink Health

Background

The devlink health mechanism targets real-time alerting, so that users know when something bad has happened to a PCI device. It aims to:

  • Provide alert debug information.
  • Enable self-healing.
  • Provide a way to gather all needed debugging information when a problem needs vendor support.

Overview

The main idea is to unify and centralize driver health reports in the generic devlink instance and allow the user to set different attributes of the health reporting and recovery procedures.

The devlink health reporter: The device driver creates a “health reporter” for each error/health type. An error/health type can be known/generic (e.g. PCI error, firmware error, RX/TX error) or unknown (driver-specific). For each registered health reporter, the driver can issue error/health reports asynchronously. All handling of health reports is done by devlink. The device driver can provide specific callbacks for each “health reporter”, e.g.:

  • Recovery procedures
  • Diagnostics procedures
  • Object dump procedures
  • OOB initial parameters

Different parts of the driver can register different types of health reporters with different handlers.
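
As a rough illustration of the reporter model described above, the following is a minimal userspace sketch, not kernel code. All names here (struct reporter_ops, struct health_core, reporter_create, reporter_find, tx_ops) are hypothetical and chosen for this example; they are not the kernel's devlink API. The sketch assumes one reporter per error/health type, each carrying its own optional callbacks, registered with a central instance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace model of per-reporter callbacks; not the kernel API. */
struct reporter;

struct reporter_ops {
    const char *name;  /* error/health type, e.g. "tx" or "fw" */
    int (*recover)(struct reporter *r);
    int (*diagnose)(struct reporter *r, char *buf, size_t len);
    int (*dump)(struct reporter *r, char *buf, size_t len);
};

struct reporter {
    const struct reporter_ops *ops;
    void *priv;               /* driver-private context */
    unsigned int recoveries;  /* successful recoveries so far */
};

#define MAX_REPORTERS 8

/* Stand-in for the generic devlink instance that owns all reporters. */
struct health_core {
    struct reporter reporters[MAX_REPORTERS];
    size_t count;
};

/* Register one reporter per error/health type; returns NULL when full. */
struct reporter *reporter_create(struct health_core *core,
                                 const struct reporter_ops *ops, void *priv)
{
    assert(core && ops && ops->name);
    if (core->count == MAX_REPORTERS)
        return NULL;
    struct reporter *r = &core->reporters[core->count++];
    r->ops = ops;
    r->priv = priv;
    r->recoveries = 0;
    return r;
}

/* Look a reporter up by its type name, as a user request would. */
struct reporter *reporter_find(struct health_core *core, const char *name)
{
    for (size_t i = 0; i < core->count; i++)
        if (strcmp(core->reporters[i].ops->name, name) == 0)
            return &core->reporters[i];
    return NULL;
}

/* Example driver-side callbacks for a "tx" reporter. */
int tx_recover(struct reporter *r)
{
    r->recoveries++;  /* a real driver would reset the TX queue here */
    return 0;
}

int tx_diagnose(struct reporter *r, char *buf, size_t len)
{
    snprintf(buf, len, "tx reporter: %u recoveries", r->recoveries);
    return 0;
}

const struct reporter_ops tx_ops = {
    .name = "tx",
    .recover = tx_recover,
    .diagnose = tx_diagnose,
    .dump = NULL,  /* optional: this reporter provides no object dump */
};
```

A driver with several error types would define one such ops table per type and register each of them separately, which is how different parts of a driver end up with different handlers.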

Actions

Once an error is reported, devlink health will perform the following actions:

  • A log is sent to the kernel trace events buffer.
  • Health status and statistics are updated for the reporter instance.
  • An object dump is taken and saved at the reporter instance (as long as no other dump is already stored).
  • An auto-recovery attempt is made, depending on:
    • Auto-recovery configuration
    • Grace period vs. time passed since the last recovery
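
The four actions above can be sketched as a single report handler. This is a simplified userspace model that follows the order of the list, and the kernel's actual ordering and error codes may differ. Every name in it (struct reporter, health_report, the report_result values, demo_recover, demo_dump) is hypothetical, and time is passed in as a plain millisecond counter rather than read from a clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum health_state { STATE_HEALTHY, STATE_ERROR };

enum report_result {
    REPORT_NO_RECOVERY,       /* auto recovery disabled or no callback */
    REPORT_RECOVERED,         /* recovery callback ran and succeeded */
    REPORT_RECOVERY_SKIPPED,  /* still inside the grace period */
    REPORT_RECOVERY_FAILED,   /* recovery callback returned an error */
};

/* Hypothetical reporter instance; field names are illustrative only. */
struct reporter {
    bool auto_recover;
    uint64_t grace_period_ms;
    uint64_t last_recovery_ms;
    uint64_t error_count;
    uint64_t recover_count;
    enum health_state state;
    bool dump_stored;
    int (*recover)(struct reporter *r);
    int (*dump)(struct reporter *r);
};

enum report_result health_report(struct reporter *r, const char *msg,
                                 uint64_t now_ms)
{
    assert(r && msg);

    /* 1. Log the event (stderr stands in for the trace events buffer). */
    fprintf(stderr, "health report: %s\n", msg);

    /* 2. Update health status and statistics. */
    r->error_count++;
    r->state = STATE_ERROR;

    /* 3. Take a dump only if none is already stored. */
    if (!r->dump_stored && r->dump && r->dump(r) == 0)
        r->dump_stored = true;

    /* 4. Attempt auto recovery, subject to configuration and grace period. */
    if (!r->auto_recover || !r->recover)
        return REPORT_NO_RECOVERY;
    if (r->recover_count > 0 &&
        now_ms - r->last_recovery_ms < r->grace_period_ms)
        return REPORT_RECOVERY_SKIPPED;
    if (r->recover(r) != 0)
        return REPORT_RECOVERY_FAILED;

    r->recover_count++;
    r->last_recovery_ms = now_ms;
    r->state = STATE_HEALTHY;
    return REPORT_RECOVERED;
}

/* Trivial callbacks so the sketch can be exercised. */
int demo_recover(struct reporter *r) { (void)r; return 0; }
int demo_dump(struct reporter *r) { (void)r; return 0; }
```

In this model a second error arriving within the grace period leaves the reporter in the error state without calling the recovery callback, so a repeatedly failing device is not reset in a tight loop.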

User Interface

The user can access or change each reporter’s parameters and driver-specific callbacks via devlink, e.g. per error type (per health reporter):

  • Configure the reporter’s generic parameters (e.g. disable/enable auto recovery)
  • Invoke recovery procedure
  • Run diagnostics
  • Object dump

List of devlink health interfaces

Name                                     Description
DEVLINK_CMD_HEALTH_REPORTER_GET          Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET          Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER      Triggers a reporter’s recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE     Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET     Retrieves the last stored dump. Devlink health saves a single dump.
                                         If a dump is not already stored by devlink for this reporter,
                                         devlink generates a new dump. Dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR   Clears the last saved dump file for the specified reporter.
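
The single-dump behaviour behind DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET and DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR can be modelled in a few lines. The following is a userspace sketch under stated assumptions: dump_get, dump_clear, demo_dump and the struct layout are hypothetical names for illustration, and a dump is represented as a short string rather than the reporter-defined output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define DUMP_MAX 128

/* Hypothetical reporter holding at most one stored dump. */
struct reporter {
    char dump[DUMP_MAX];
    bool dump_stored;
    unsigned int dumps_generated;
    int (*dump_cb)(struct reporter *r, char *buf, size_t len);
};

/* Example driver callback; the dump content is defined by the reporter. */
int demo_dump(struct reporter *r, char *buf, size_t len)
{
    snprintf(buf, len, "demo dump #%u", r->dumps_generated + 1);
    return 0;
}

/* Models DUMP_GET: return the stored dump, generating one if none exists. */
const char *dump_get(struct reporter *r)
{
    assert(r && r->dump_cb);
    if (!r->dump_stored) {
        if (r->dump_cb(r, r->dump, sizeof(r->dump)) != 0)
            return NULL;  /* the driver could not produce a dump */
        r->dumps_generated++;
        r->dump_stored = true;
    }
    return r->dump;
}

/* Models DUMP_CLEAR: drop the stored dump so the next get generates anew. */
void dump_clear(struct reporter *r)
{
    assert(r);
    memset(r->dump, 0, sizeof(r->dump));
    r->dump_stored = false;
}
```

Calling dump_get twice in a row returns the same stored dump; only after dump_clear does the next dump_get invoke the driver callback again.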

The following diagram provides a general overview of devlink-health:

                                               netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
 mlx5_core                             devlink     |recover,
                                                   |dump)
+--------+                            +--------------------------+
|        |                            |    reporter|             |
|        |                            |  +---------v----------+  |
|        |   ops execution            |  |                    |  |
|     <----------------------------------+                    |  |
|        |                            |  |                    |  |
|        |                            |  + ^------------------+  |
|        |                            |    | request for ops     |
|        |                            |    | (recover, dump)     |
|        |                            |    |                     |
|        |                            |  +-+------------------+  |
|        |     health report          |  | health handler     |  |
|        +------------------------------->                    |  |
|        |                            |  +--------------------+  |
|        |     health reporter create |                          |
|        +---------------------------->                          |
+--------+                            +--------------------------+

© Copyright The kernel development community
