Per-task statistics interface
-----------------------------


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed to provide the following benefits:

- efficient delivery of statistics during the lifetime of a task and on its
  exit
- a unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, the task that is the thread group leader gets no special
treatment: a process is deemed alive as long as any task belongs to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by any one
listener and assists in flow control over the netlink interface; both points
are explained in more detail below.
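
The register command takes the cpumask as an ASCII string of comma-separated
cpu ranges, for example "1-3,5,7-8" for cpus 1,2,3,5,7,8. As a minimal sketch,
a listener could build that string from a sorted list of cpu numbers as shown
here. format_cpumask is a hypothetical helper name and is not part of the
taskstats interface or of getdelays.c.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format a sorted, duplicate-free list of cpu numbers as comma-separated
 * ranges. Returns the string length, or -1 if buf is too small.
 * (Hypothetical helper, shown for illustration only.)
 */
int format_cpumask(const int *cpus, int n, char *buf, size_t size)
{
	size_t used = 0;
	int i = 0;

	if (size == 0)
		return -1;
	buf[0] = '\0';

	while (i < n) {
		int start = cpus[i];
		int end = start;
		int w;

		/* Extend the range while the next cpu is consecutive. */
		while (i + 1 < n && cpus[i + 1] == end + 1)
			end = cpus[++i];
		i++;

		if (start == end)
			w = snprintf(buf + used, size - used, "%s%d",
				     used ? "," : "", start);
		else
			w = snprintf(buf + used, size - used, "%s%d-%d",
				     used ? "," : "", start, end);
		if (w < 0 || (size_t)w >= size - used)
			return -1;	/* output would be truncated */
		used += (size_t)w;
	}
	return (int)used;
}
```

The resulting string is what goes into the payload of the register or
deregister attribute.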

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h.

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. Where the two differ, taskstats.h
overrides the description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format:

  +----------+- - -+-------------+-------------------+
  | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
  +----------+- - -+-------------+-------------------+
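
To make the layout concrete, the following sketch fills a buffer with a
TASKSTATS_CMD_GET request whose payload is a single u32 attribute of type
TASKSTATS_CMD_ATTR_PID. It only builds the bytes and does not open a socket.
build_pid_query is a hypothetical helper name, and family_id is assumed to
have been looked up beforehand through the generic netlink controller (the
family name is TASKSTATS_GENL_NAME). Setting nlmsg_pid to getpid() follows
what getdelays.c does.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>

/*
 * Build a TASKSTATS_CMD_GET request for one pid into buf.
 * Returns the total message length, or -1 if buf is too small.
 * (Hypothetical helper; family_id must be resolved by the caller.)
 */
int build_pid_query(void *buf, size_t size, __u16 family_id, __u32 pid)
{
	char *p = buf;
	struct nlmsghdr nlh;
	struct genlmsghdr genl;
	struct nlattr na;
	size_t genl_off = NLMSG_HDRLEN;			/* nlmsghdr + pad */
	size_t attr_off = genl_off + GENL_HDRLEN;	/* start of payload */
	size_t total = attr_off + NLA_ALIGN(NLA_HDRLEN + sizeof(pid));

	if (size < total)
		return -1;
	memset(p, 0, total);

	/* Netlink header: the message type is the generic netlink family id. */
	memset(&nlh, 0, sizeof(nlh));
	nlh.nlmsg_len = total;
	nlh.nlmsg_type = family_id;
	nlh.nlmsg_flags = NLM_F_REQUEST;
	nlh.nlmsg_pid = getpid();
	memcpy(p, &nlh, sizeof(nlh));

	/* Generic netlink header: selects the taskstats command. */
	memset(&genl, 0, sizeof(genl));
	genl.cmd = TASKSTATS_CMD_GET;
	genl.version = TASKSTATS_GENL_VERSION;
	memcpy(p + genl_off, &genl, sizeof(genl));

	/* Payload: one attribute carrying the u32 pid. */
	na.nla_type = TASKSTATS_CMD_ATTR_PID;
	na.nla_len = NLA_HDRLEN + sizeof(pid);
	memcpy(p + attr_off, &na, sizeof(na));
	memcpy(p + attr_off + NLA_HDRLEN, &pid, sizeof(pid));

	return (int)total;
}
```

A tgid query would differ only in the attribute type. The returned length is
the number of bytes to pass to send() on the netlink socket.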


The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
the task/process for which userspace wants statistics.

Commands to register/deregister interest in exit data from a set of cpus
consist of one attribute, of type
TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK, containing a cpumask in the
attribute payload. The cpumask is specified as an ASCII string of
comma-separated cpu ranges. For example, to listen to exit data from cpus
1,2,3,5,7,8 the cpumask would be "1-3,5,7-8". If userspace forgets to
deregister interest in cpus before closing the listening socket, the kernel
cleans up its interest set over time. However, for the sake of efficiency, an
explicit deregistration is advisable.

2. Response for a command: sent from the kernel in response to a userspace
command. The payload is a series of three attributes of type:

a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute that contains no payload but
indicates that a pid/tgid will be followed by some stats.

b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
are being returned.

c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
series of attributes of the following types; d) to f) are present only when
the exiting task is the last thread of its thread group:

a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
b) TASKSTATS_TYPE_PID: contains exiting task's pid
c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process
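
The attribute sequences above can be walked with a small loop. The sketch
below is one possible reading, not the kernel's or getdelays.c's code: it
enters an AGGR attribute by skipping only its header, so it copes both with
id and stats attributes that simply follow the AGGR attribute and with ones
carried inside it as nested attributes (which is how getdelays.c treats
them). struct ts_record and parse_taskstats are hypothetical names. The
input is assumed to be the taskstats payload only, with the nlmsghdr and
genlmsghdr already stripped.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/taskstats.h>

/* One id + stats pair found in a reply or exit message (hypothetical type). */
struct ts_record {
	int is_tgid;		/* 0: per-pid record, 1: per-tgid record */
	__u32 id;		/* the pid or tgid */
	const void *stats;	/* start of the struct taskstats payload */
	size_t stats_len;	/* payload length as sent by the kernel */
};

/*
 * Walk the attributes in [buf, buf + len) and fill up to max records.
 * Unknown attribute types are skipped.
 * Returns the number of records stored, or -1 on a malformed attribute.
 */
int parse_taskstats(const void *buf, size_t len, struct ts_record *out, int max)
{
	const char *p = buf;
	size_t off = 0;
	int n = 0;
	int have_id = 0;
	struct ts_record cur;

	memset(&cur, 0, sizeof(cur));
	while (off + NLA_HDRLEN <= len) {
		struct nlattr na;

		memcpy(&na, p + off, sizeof(na));
		if (na.nla_len < NLA_HDRLEN || off + na.nla_len > len)
			return -1;

		switch (na.nla_type) {
		case TASKSTATS_TYPE_AGGR_PID:
		case TASKSTATS_TYPE_AGGR_TGID:
			off += NLA_HDRLEN;	/* step into the aggregate */
			continue;
		case TASKSTATS_TYPE_PID:
		case TASKSTATS_TYPE_TGID:
			if (na.nla_len < NLA_HDRLEN + sizeof(__u32))
				return -1;
			cur.is_tgid = (na.nla_type == TASKSTATS_TYPE_TGID);
			memcpy(&cur.id, p + off + NLA_HDRLEN, sizeof(cur.id));
			have_id = 1;
			break;
		case TASKSTATS_TYPE_STATS:
			if (have_id && n < max) {
				cur.stats = p + off + NLA_HDRLEN;
				cur.stats_len = na.nla_len - NLA_HDRLEN;
				out[n++] = cur;
			}
			have_id = 0;
			break;
		default:
			break;			/* ignore types we do not know */
		}
		off += NLA_ALIGN(na.nla_len);
	}
	return n;
}
```

A command response yields one record. An exit message yields one per-pid
record and, when the per-tgid attributes are present, a second record with
is_tgid set. The caller should compare stats_len with sizeof(struct taskstats)
before copying the payload out.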


per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process stats in addition to per-task stats within
the kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the accumulated process-level data is
also sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited threads of
the same thread group.
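
The bookkeeping described above can be illustrated in userspace with a toy
model. This is not kernel code: struct toy_stats is a hypothetical
two-counter stand-in for struct taskstats, and the function names are
invented for the example.

```c
#include <stddef.h>

/* Hypothetical two-counter stand-in for struct taskstats. */
struct toy_stats {
	unsigned long long cpu_ns;
	unsigned long long blkio_ns;
};

/* Add the counters of s into sum. */
void toy_add(struct toy_stats *sum, const struct toy_stats *s)
{
	sum->cpu_ns += s->cpu_ns;
	sum->blkio_ns += s->blkio_ns;
}

/* A thread exits: fold its stats into the process-wide total. */
void toy_thread_exit(struct toy_stats *exited_total, const struct toy_stats *t)
{
	toy_add(exited_total, t);
}

/* Answer a per-tgid query: exited total plus every live thread. */
struct toy_stats toy_tgid_query(const struct toy_stats *exited_total,
				const struct toy_stats *live, size_t nlive)
{
	struct toy_stats sum = *exited_total;
	size_t i;

	for (i = 0; i < nlive; i++)
		toy_add(&sum, &live[i]);
	return sum;
}
```

Only one extra structure per process is kept, and it is touched on thread
exit and on per-tgid queries rather than on every accounting update.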

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in the future:

1. Adding more fields to the end of the existing struct taskstats. Backward
compatibility is ensured by the version number within the
structure. Userspace will use only the fields of the struct that correspond
to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
interface to return them. Since userspace processes each netlink attribute
independently, it can always ignore attributes whose type it does not
understand (because it is using an older version of the interface).
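
For option 1, a userspace tool can compare the version field of a received
struct taskstats with the TASKSTATS_VERSION it was compiled against and read
only the fields common to both. The sketch below shows one way to express
that check. usable_version and field_usable are hypothetical helpers, and
the "since" version of any given field has to be taken from taskstats.h.

```c
#include <linux/taskstats.h>

/*
 * Return the newest struct taskstats version understood by both the
 * running kernel (ts->version) and this program (TASKSTATS_VERSION).
 */
unsigned int usable_version(const struct taskstats *ts)
{
	return ts->version < TASKSTATS_VERSION ? ts->version : TASKSTATS_VERSION;
}

/*
 * Hypothetical check: may a field that was added in version "since"
 * be read from ts?
 */
int field_usable(const struct taskstats *ts, unsigned int since)
{
	return since <= usable_version(ts);
}
```

Because new fields are only appended, a field that passes this check sits at
the same offset in every later version of the structure.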


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
each listener. In the extreme case, there could be one listener for each cpu.
Users may also consider setting the cpu affinity of the listener to the subset
of cpus to which it listens, especially if they are listening to just one cpu.

If, despite these measures, userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
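
As a rough sketch of the first measure and of ENOBUFS handling, a listener
might wrap the two socket calls as follows. set_rcvbuf and recv_exit_data are
hypothetical helper names. How a listener reacts to a reported loss (for
example by logging it or by re-querying live tasks) is left to the
application. Note that the kernel limits SO_RCVBUF requests to the
net.core.rmem_max sysctl value.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Ask for a larger receive buffer on fd; returns 0 on success, -1 on error. */
int set_rcvbuf(int fd, int bytes)
{
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}

/*
 * Receive one message from the listening socket.
 * Returns the byte count, 0 if the kernel reported ENOBUFS (records were
 * dropped and *lost is incremented), or -1 on any other error.
 */
ssize_t recv_exit_data(int fd, void *buf, size_t size, unsigned long *lost)
{
	ssize_t n = recv(fd, buf, size, 0);

	if (n < 0 && errno == ENOBUFS) {
		(*lost)++;	/* overflow: some exit records are gone */
		return 0;
	}
	return n;
}
```

Counting overflows instead of aborting lets the listener keep draining the
socket while still knowing that its totals are incomplete.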

----