lockup_detector: Combine nmi_watchdog and softlockup detector
The new nmi_watchdog (which uses the perf event subsystem) is very
similar in structure to the softlockup detector. Using Ingo's
suggestion, I combined the two functionalities into one file:
kernel/watchdog.c.
Now both the nmi_watchdog (or hardlockup detector) and softlockup
detector sit on top of the perf event subsystem, which is run every
60 seconds or so to see if there are any lockups.
To detect hardlockups, cpus not responding to interrupts, I
implemented an hrtimer that runs 5 times for every perf event
overflow event. If that stops counting on a cpu, then the cpu is
most likely in trouble.
To detect softlockups, tasks not yielding to the scheduler, I used the
previous kthread idea that now gets kicked every time the hrtimer fires.
If the kthread isn't being scheduled neither is anyone else and the
warning is printed to the console.
I tested this on x86_64 and both the softlockup and hardlockup paths
work.
V2:
- cleaned up the Kconfig and softlockup combination
- surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI
- seperated out the softlockup case from perf event subsystem
- re-arranged the enabling/disabling nmi watchdog from proc space
- added cpumasks for hardlockup failure cases
- removed fallback to soft events if no PMU exists for hard events
V3:
- comment cleanups
- drop support for older softlockup code
- per_cpu cleanups
- completely remove software clock base hardlockup detector
- use per_cpu masking on hard/soft lockup detection
- #ifdef cleanups
- rename config option NMI_WATCHDOG to LOCKUP_DETECTOR
- documentation additions
V4:
- documentation fixes
- convert per_cpu to __get_cpu_var
- powerpc compile fixes
V5:
- split apart warn flags for hard and soft lockups
TODO:
- figure out how to make an arch-agnostic clock2cycles call
(if possible) to feed into perf events as a sample period
[fweisbec: merged conflict patch]
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 220ae60..49e285d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -153,7 +153,7 @@
points; some don't and need to be caught.
config DETECT_SOFTLOCKUP
- bool "Detect Soft Lockups"
+ bool
depends on DEBUG_KERNEL && !S390
default y
help
@@ -171,17 +171,27 @@
can be detected via the NMI-watchdog, on platforms that
support it.)
-config NMI_WATCHDOG
- bool "Detect Hard Lockups with an NMI Watchdog"
- depends on DEBUG_KERNEL && PERF_EVENTS && PERF_EVENTS_NMI
+config LOCKUP_DETECTOR
+ bool "Detect Hard and Soft Lockups"
+ depends on DEBUG_KERNEL
+ default DETECT_SOFTLOCKUP
help
- Say Y here to enable the kernel to use the NMI as a watchdog
- to detect hard lockups. This is useful when a cpu hangs for no
- reason but can still respond to NMIs. A backtrace is displayed
- for reviewing and reporting.
+ Say Y here to enable the kernel to act as a watchdog to detect
+ hard and soft lockups.
- The overhead should be minimal, just an extra NMI every few
- seconds.
+ Softlockups are bugs that cause the kernel to loop in kernel
+ mode for more than 60 seconds, without giving other tasks a
+ chance to run. The current stack trace is displayed upon
+ detection and the system will stay locked up.
+
+ Hardlockups are bugs that cause the CPU to loop in kernel mode
+ for more than 60 seconds, without letting other interrupts have a
+ chance to run. The current stack trace is displayed upon detection
+ and the system will stay locked up.
+
+ The overhead should be minimal. A periodic hrtimer runs to
+ generate interrupts and kick the watchdog task every 10-12 seconds.
+ An NMI is generated every 60 seconds or so to check for hardlockups.
config BOOTPARAM_SOFTLOCKUP_PANIC
bool "Panic (Reboot) On Soft Lockups"