=========================
Hardware Latency Detector
=========================

Introduction
-------------

The tracer hwlat_detector is a special-purpose tracer that is used to
detect large system latencies induced by the behavior of certain underlying
hardware or firmware, independent of Linux itself. The code was developed
originally to detect SMIs (System Management Interrupts) on x86 systems;
however, there is nothing x86-specific about this patchset. It was
originally written for use by the "RT" patch since the Real Time
kernel is highly latency sensitive.

SMIs are not serviced by the Linux kernel, which means that it does not
even know that they are occurring. SMIs are instead set up by BIOS code
and are serviced by BIOS code, usually for "critical" events such as
management of thermal sensors and fans. Sometimes, though, SMIs are used for
other tasks, and those tasks can spend an inordinate amount of time in the
handler (sometimes measured in milliseconds). Obviously this is a problem if
you are trying to keep event service latencies down in the microsecond range.

The hardware latency detector works by hogging one of the CPUs for configurable
amounts of time (with interrupts disabled), polling the CPU Time Stamp Counter
for some period, then looking for gaps in the TSC data. Any gap indicates a
time when the polling was interrupted and, since interrupts are disabled,
the only thing that could do that would be an SMI or other hardware hiccup
(or an NMI, but those can be tracked).
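
The gap search described above can be sketched in user space. This is only an
illustration, not the kernel's code: it assumes the timestamps are already
collected as one integer (in usecs) per line, and the helper name
``report_gaps`` is hypothetical.

```shell
#!/bin/sh
# Illustrative sketch only; the real detector polls the TSC in kernel space.
# report_gaps THRESHOLD (hypothetical helper): read one integer timestamp
# (usecs) per line from stdin and print "<gap index> <gap size>" for every
# gap between consecutive samples that exceeds THRESHOLD. Gap k lies
# between sample k and sample k+1.
report_gaps() {
    awk -v thresh="$1" '
        NR > 1 && ($1 - prev) > thresh { print NR - 1, $1 - prev }
        { prev = $1 }
    '
}
```

For example, ``printf '%s\n' 0 1 2 153 154 | report_gaps 10`` reports a single
151 usec gap between the third and fourth samples.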

Note that the hwlat detector should *NEVER* be used in a production environment.
It is intended to be run manually to determine if the hardware platform has a
problem with long system firmware service routines.

Usage
------

Write the ASCII text "hwlat" into the current_tracer file of the tracing system
(mounted at /sys/kernel/tracing or /sys/kernel/debug/tracing). It is possible to
redefine the threshold in microseconds (us) above which latency spikes will
be taken into account.

Example::

	# echo hwlat > /sys/kernel/tracing/current_tracer
	# echo 100 > /sys/kernel/tracing/tracing_thresh
| 48 | The /sys/kernel/tracing/hwlat_detector interface contains the following files: |
| 49 | |
Changbin Du | 8df2d75 | 2018-02-17 13:39:48 +0800 | [diff] [blame] | 50 | - width - time period to sample with CPUs held (usecs) |
| 51 | must be less than the total window size (enforced) |
| 52 | - window - total period of sampling, width being inside (usecs) |
Jon Masters | c850ed3 | 2015-04-10 14:57:46 -0400 | [diff] [blame] | 53 | |
| 54 | By default the width is set to 500,000 and window to 1,000,000, meaning that |
| 55 | for every 1,000,000 usecs (1s) the hwlat detector will spin for 500,000 usecs |
| 56 | (0.5s). If tracing_thresh contains zero when hwlat tracer is enabled, it will |
| 57 | change to a default of 10 usecs. If any latencies that exceed the threshold is |
| 58 | observed then the data will be written to the tracing ring buffer. |
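
As a minimal sketch, the two files can be updated together as shown here. The
helper name ``set_hwlat_sampling`` and the ``TRACING_DIR`` variable are
hypothetical, and the write ordering rests on an assumption: that each write
is validated against the current value of the other file.

```shell
#!/bin/sh
# TRACING_DIR is an assumption; override it if tracefs is mounted elsewhere.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/tracing}"

# set_hwlat_sampling WIDTH WINDOW (hypothetical helper, values in usecs)
set_hwlat_sampling() {
    width=$1
    window=$2
    dir="$TRACING_DIR/hwlat_detector"

    # width must be less than the total window size.
    if [ "$width" -ge "$window" ]; then
        echo "width ($width) must be less than window ($window)" >&2
        return 1
    fi

    # Assumption: each write is checked against the other file's current
    # value, so choose an order in which every intermediate state is valid.
    cur_width=$(cat "$dir/width") || return 1
    if [ "$window" -gt "$cur_width" ]; then
        echo "$window" > "$dir/window" && echo "$width" > "$dir/width"
    else
        echo "$width" > "$dir/width" && echo "$window" > "$dir/window"
    fi
}
```

Running ``set_hwlat_sampling 200000 400000`` as root would then request a
200,000 usec spin inside every 400,000 usec window.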

The minimum sleep time between periods is 1 millisecond, even if the
difference between window and width is less than 1 millisecond, so that the
system is not totally starved.
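
The arithmetic can be sketched as follows. This assumes the sleep time is
simply window minus width with a floor of 1,000 usecs; the kernel's exact
rounding may differ, and the helper name ``hwlat_sleep_usecs`` is
hypothetical.

```shell
#!/bin/sh
# hwlat_sleep_usecs WIDTH WINDOW (hypothetical helper, values in usecs):
# print the idle time per window, assuming it is window minus width.
hwlat_sleep_usecs() {
    sleep_us=$(( $2 - $1 ))
    # Apply the 1 millisecond (1000 usecs) minimum sleep time.
    if [ "$sleep_us" -lt 1000 ]; then
        sleep_us=1000
    fi
    echo "$sleep_us"
}
```

With the defaults, ``hwlat_sleep_usecs 500000 1000000`` prints 500000, while
``hwlat_sleep_usecs 999500 1000000`` prints 1000 because the floor applies.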

If tracing_thresh was zero when the hwlat detector was started, it will be set
back to zero if another tracer is loaded. Note that the last value in
tracing_thresh that the hwlat detector had will be saved, and this value will
be restored in tracing_thresh if it is still zero when the hwlat detector is
started again.

The following tracing directory files are used by the hwlat_detector:

In /sys/kernel/tracing:

 - tracing_thresh - minimum latency value to be considered (usecs)
 - tracing_max_latency - maximum hardware latency actually observed (usecs)
 - tracing_cpumask - the CPUs to move the hwlat thread across
 - hwlat_detector/width - specified amount of time to spin within window (usecs)
 - hwlat_detector/window - amount of time between (width) runs (usecs)
 - hwlat_detector/mode - the thread mode
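
A small sketch for inspecting these files in one pass is given here. The
helper name ``show_hwlat_settings`` and the ``TRACING_DIR`` variable are
hypothetical, and the threshold file is assumed to be named tracing_thresh, as
in the usage example.

```shell
#!/bin/sh
# TRACING_DIR is an assumption; override it if tracefs is mounted elsewhere.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/tracing}"

# show_hwlat_settings (hypothetical helper): print "<file>: <contents>" for
# each file the hwlat detector uses, or a marker if a file cannot be read.
show_hwlat_settings() {
    for f in tracing_thresh tracing_max_latency tracing_cpumask \
             hwlat_detector/width hwlat_detector/window hwlat_detector/mode; do
        if [ -r "$TRACING_DIR/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$TRACING_DIR/$f")"
        else
            printf '%s: (not readable)\n' "$f"
        fi
    done
}
```

Reading these files normally requires root privileges.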

By default, a single hwlat detector kernel thread migrates across each CPU
specified in tracing_cpumask at the beginning of each new window, in a
round-robin fashion. This behavior can be changed through the thread mode.
The available options are:

 - none: do not force migration
 - round-robin: migrate across each CPU specified in tracing_cpumask [default]
 - per-cpu: create one thread for each CPU in tracing_cpumask
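
Selecting a mode could look like the sketch below. The helper name
``set_hwlat_mode`` and the ``TRACING_DIR`` variable are hypothetical; the
accepted values are the three options listed above.

```shell
#!/bin/sh
# TRACING_DIR is an assumption; override it if tracefs is mounted elsewhere.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/tracing}"

# set_hwlat_mode MODE (hypothetical helper): write MODE to
# hwlat_detector/mode after checking it against the documented options.
set_hwlat_mode() {
    case "$1" in
        none|round-robin|per-cpu)
            echo "$1" > "$TRACING_DIR/hwlat_detector/mode"
            ;;
        *)
            echo "unknown hwlat mode: $1" >&2
            return 1
            ;;
    esac
}
```

For example, ``set_hwlat_mode per-cpu`` (run as root) requests one thread per
CPU in tracing_cpumask.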