================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact they have on complex
workloads or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing low
priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as such:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously. In this state
actual CPU cycles are going to waste, and a workload that spends
extended time in this state is considered to be thrashing. This has
severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.

The ratios are tracked as recent trends over ten, sixty, and three
hundred second windows, which gives insight into short term events as
well as medium and long term trends. The total absolute stall time is
tracked and exported as well, to allow detection of latency spikes
which wouldn't necessarily make a dent in the time averages, or to
average trends over custom time frames.
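
A minimal userspace sketch of how these counters can be consumed is
shown below; it is illustrative only and not part of the kernel or
its tooling. It parses the "some" line of /proc/pressure/memory and
derives an average over a custom five-second window from two samples
of the total counter, assuming total is reported in microseconds of
stall time; error handling is kept to a minimum.

  /*
   * Illustrative sketch: sample memory pressure over a custom window
   * by reading /proc/pressure/memory twice and diffing the "some"
   * total counter (assumed to be microseconds of stall time).
   */
  #include <stdio.h>
  #include <unistd.h>

  static unsigned long long read_some_total(void)
  {
          unsigned long long total = 0;
          FILE *f = fopen("/proc/pressure/memory", "r");

          if (!f)
                  return 0;
          /* First line: some avg10=... avg60=... avg300=... total=... */
          if (fscanf(f, "some avg10=%*f avg60=%*f avg300=%*f total=%llu",
                     &total) != 1)
                  total = 0;
          fclose(f);
          return total;
  }

  int main(void)
  {
          unsigned int window_s = 5;   /* custom averaging window */
          unsigned long long t0, t1;

          t0 = read_some_total();
          sleep(window_s);
          t1 = read_some_total();

          /* Share of wall time in which at least one task was stalled. */
          printf("some memory pressure over %us: %.2f%%\n",
                 window_s, (t1 - t0) / (window_s * 10000.0));
          return 0;
  }

Sampling the total counter like this can also catch short latency
spikes that are too brief to move even the ten-second average
noticeably.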

Cgroup2 interface
=================

In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.
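
As with the system-wide files, these can be read with ordinary file
I/O. A minimal sketch is shown below; the cgroup2 mount point
(/sys/fs/cgroup) and the cgroup name (workload.slice) are assumptions
for illustration and depend on the local setup.

  /*
   * Illustrative sketch: dump the memory pressure of one cgroup. The
   * mount point and cgroup name below are placeholders.
   */
  #include <stdio.h>

  int main(void)
  {
          char line[256];
          FILE *f = fopen("/sys/fs/cgroup/workload.slice/memory.pressure", "r");

          if (!f) {
                  perror("memory.pressure");
                  return 1;
          }
          /* Same two-line format as /proc/pressure/memory. */
          while (fgets(line, sizeof(line), f))
                  fputs(line, stdout);
          fclose(f);
          return 0;
  }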