Aya Levin | db2ab7a | 2019-02-07 11:36:42 +0200

The health mechanism is targeted at real-time alerting, in order to know when
something bad has happened to a PCI device:
- Provide alert debug information
- Self-healing
- If the problem needs vendor support, provide a way to gather all needed
  debugging information.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error,
RX/TX error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by devlink.
A device driver can provide specific callbacks for each "health reporter", e.g.
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB initial parameters
Different parts of the driver can register different types of health reporters
with different handlers.

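The reporter/callback relationship described above can be modelled in a few
lines of userspace Python. This is only an illustrative sketch, not the kernel
API: the names ReporterOps, DevlinkModel and reporter_create are hypothetical,
and raising NotImplementedError for a missing callback is a choice made for
this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ReporterOps:
    """Hypothetical stand-in for the callbacks a driver supplies per reporter."""
    name: str
    recover: Optional[Callable[[], int]] = None
    diagnose: Optional[Callable[[], dict]] = None
    dump: Optional[Callable[[], dict]] = None


class DevlinkModel:
    """Toy registry: one reporter per error/health type, keyed by name."""

    def __init__(self) -> None:
        self.reporters: Dict[str, ReporterOps] = {}

    def reporter_create(self, ops: ReporterOps) -> ReporterOps:
        if ops.name in self.reporters:
            raise ValueError(f"reporter {ops.name!r} already exists")
        self.reporters[ops.name] = ops
        return ops

    def recover(self, name: str) -> int:
        ops = self.reporters[name]
        if ops.recover is None:
            # Assumption of this sketch: a missing callback is "not supported".
            raise NotImplementedError(f"{name}: no recover callback")
        return ops.recover()


# Different parts of a driver register different reporters and handlers.
devlink = DevlinkModel()
devlink.reporter_create(ReporterOps(name="tx", recover=lambda: 0))
devlink.reporter_create(ReporterOps(name="fw", diagnose=lambda: {"state": "ok"}))
```

Here the "tx" reporter supports recovery while the "fw" reporter only supports
diagnostics, so each reporter exposes just the operations its part of the
driver can implement.
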
Once an error is reported, devlink health will perform the following actions:
  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto recovery attempt is made, depending on:
    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery

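The sequence of actions above can be sketched as a small userspace model.
This is a hedged illustration under stated assumptions, not the kernel
implementation: the Reporter fields and the health_report function are
hypothetical names, the clock is passed in as a plain number, and the rule
that only a successful recovery restarts the grace period is an assumption
of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Reporter:
    """Hypothetical model of one health reporter instance."""
    name: str
    recover_cb: Callable[[], bool]
    dump_cb: Callable[[], dict]
    auto_recover: bool = True
    grace_period: float = 0.0          # same unit as the `now` argument
    error_count: int = 0
    recover_count: int = 0
    healthy: bool = True
    stored_dump: Optional[dict] = None
    last_recovery: Optional[float] = None
    trace_log: List[str] = field(default_factory=list)


def health_report(r: Reporter, msg: str, now: float) -> bool:
    """Handle one error report; return True if auto recovery succeeded."""
    r.trace_log.append(f"{r.name}: {msg}")   # 1. log the event
    r.error_count += 1                       # 2. update status and statistics
    r.healthy = False
    if r.stored_dump is None:                # 3. dump only if none is stored
        r.stored_dump = r.dump_cb()
    if not r.auto_recover:                   # 4a. auto recovery disabled
        return False
    if r.last_recovery is not None and now - r.last_recovery < r.grace_period:
        return False                         # 4b. still inside the grace period
    if r.recover_cb():
        r.healthy = True
        r.recover_count += 1
        r.last_recovery = now                # assumption: success restarts the timer
        return True
    return False


dumps_taken = []

def take_dump() -> dict:
    dumps_taken.append(1)
    return {"seq": len(dumps_taken)}

tx = Reporter("tx", recover_cb=lambda: True, dump_cb=take_dump, grace_period=5.0)
first = health_report(tx, "tx timeout", now=100.0)    # recovers, takes a dump
second = health_report(tx, "tx timeout", now=102.0)   # within the grace period
third = health_report(tx, "tx timeout", now=106.0)    # grace period has passed
```

In this run the second report is counted and logged but does not trigger
recovery, and the dump taken for the first report stays in place because a
dump is already stored.
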
The user interface:
The user can access/change each reporter's parameters and driver-specific
callbacks via devlink, e.g. per error type (per health reporter):
 - Configure the reporter's generic parameters (e.g. disable/enable auto
   recovery)
 - Invoke the recovery procedure
 - Run diagnostics
 - Object dump

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
  Retrieves the last stored dump. Devlink health
  saves a single dump. If a dump is not already stored by devlink
  for this reporter, devlink generates a new dump.
  The dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
  Clears the last saved dump file for the specified reporter.


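The single-dump semantics of DUMP_GET and DUMP_CLEAR described above can be
illustrated with a minimal userspace model. The class and method names
(DumpStore, dump_get, dump_clear) are hypothetical, and the sketch assumes
that a dump generated on demand is then kept as the stored dump.

```python
from typing import Callable, Optional


class DumpStore:
    """Hypothetical model of the one-dump-per-reporter storage."""

    def __init__(self, dump_cb: Callable[[], dict]) -> None:
        self._dump_cb = dump_cb               # reporter-defined dump format
        self._stored: Optional[dict] = None

    def dump_get(self) -> dict:
        # Return the stored dump; generate one only if nothing is stored.
        if self._stored is None:
            self._stored = self._dump_cb()    # assumption: new dump is kept
        return self._stored

    def dump_clear(self) -> None:
        # Drop the saved dump so that the next get produces a fresh one.
        self._stored = None


generated = []

def make_dump() -> dict:
    generated.append(1)
    return {"seq": len(generated)}

store = DumpStore(make_dump)
a = store.dump_get()     # nothing stored yet: a new dump is generated
b = store.dump_get()     # the stored dump is returned again
store.dump_clear()
c = store.dump_get()     # after clearing, a new dump is generated
```

Repeated gets return the same dump until it is cleared, which is why a stale
dump should be cleared before capturing the state of a newer failure.
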
                                               netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
     mlx5_core                             devlink |recover,
                                                   |dump)
    +--------+                        +--------------------------+
    |        |                        |    reporter|             |
    |        |                        |  +---------v----------+  |
    |        |   ops execution        |  |                    |  |
    |     <----------------------------------+                |  |
    |        |                        |  |                    |  |
    |        |                        |  + ^------------------+  |
    |        |                        |    | request for ops     |
    |        |                        |    | (recover, dump)     |
    |        |                        |    |                     |
    |        |                        |  +-+------------------+  |
    |        |     health report      |  | health handler     |  |
    |        +------------------------------->                |  |
    |        |                        |  +--------------------+  |
    |        | health reporter create |                          |
    |        +---------------------------->                      |
    +--------+                        +--------------------------+