===========================
HPE iLO NMI Watchdog Driver
===========================

for iLO based ProLiant Servers
==============================

Last reviewed: 08/20/2018


The HPE iLO NMI Watchdog driver is a kernel module that provides basic
watchdog functionality and a handler for the iLO "Generate NMI to System"
virtual button.

All references to iLO in this document also apply to iLO2 and all
subsequent generations.

Watchdog functionality is enabled as with any other common watchdog driver:
an application must be started that kicks off the watchdog timer. A basic
application, watchdog-test.c, exists in tools/testing/selftests/watchdog/.
Simply compile the C file and run it. If the system gets into a bad state
and hangs, the HPE ProLiant iLO timer register will not be updated in a
timely fashion and a hardware system reset (also known as an Automatic
Server Recovery (ASR) event) will occur.

The hpwdt driver also has the following module parameters:

============  ================================================================
soft_margin   Allows the user to set the watchdog timer value.
              Default value is 30 seconds.
timeout       An alias of soft_margin.
pretimeout    Allows the user to set the watchdog pretimeout value.
              This is the number of seconds before timeout when an
              NMI is delivered to the system. Setting the value to
              zero disables the pretimeout NMI.
              Default value is 9 seconds.
nowayout      Basic watchdog parameter that does not allow the timer to
              be restarted or an impending ASR to be escaped.
              Default value is set when compiling the kernel. If it is set
              to "Y", then there is no way of disabling the watchdog once
              it has been started.
kdumptimeout  Minimum timeout in seconds to apply upon receipt of an NMI
              before calling panic. (-1) disables the watchdog. When value
              is > 0, the timer is reprogrammed with the greater of
              value or current timeout value.
============  ================================================================

NOTE:
       More information about watchdog drivers in general, including the ioctl
       interface to /dev/watchdog, can be found in
       Documentation/watchdog/watchdog-api.rst and Documentation/IPMI.txt.

Due to limitations in the iLO hardware, the NMI pretimeout, if enabled,
can only be set to 9 seconds. Attempts to set pretimeout to other
non-zero values will be rounded, possibly to zero. Users should verify
the pretimeout value after attempting to set pretimeout or timeout.

Upon receipt of an NMI from the iLO, the hpwdt driver will initiate a
panic. This is to allow for a crash dump to be collected. It is incumbent
upon the user to have properly configured the system for kdump.

The default Linux kernel behavior upon panic is to print a kernel tombstone
and loop forever. This is generally not what a watchdog user wants.

For those wishing to learn more please see:

	- Documentation/admin-guide/kdump/kdump.rst
	- Documentation/admin-guide/kernel-parameters.txt (panic=)
	- Your Linux distribution's specific documentation.

If the hpwdt driver does not receive the NMI associated with an expiring
timer, the iLO will proceed to reset the system at timeout if the timer
hasn't been updated.

--

The HPE iLO NMI Watchdog Driver and documentation were originally developed
by Tom Mingarelli.