=====================
I/O statistics fields
=====================

Since 2.4.20 (and some versions before, with patches), and 2.5.45,
more extensive disk statistics have been introduced to help measure disk
activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do
the work for you, but in case you are interested in creating your own
tools, the fields are explained here.

In 2.4, the information is found as additional fields in
``/proc/partitions``. In 2.6 and later, the same information is found in two
places: one is the file ``/proc/diskstats``, and the other is within
the sysfs file system, which must be mounted in order to obtain
the information. Throughout this document we'll assume that sysfs
is mounted on ``/sys``, although of course it may be mounted anywhere.
Both ``/proc/diskstats`` and sysfs use the same source for the information
and so should not differ.

Here are examples of these different formats::

   2.4:
      3     0   39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3     1    9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030

   2.6+ sysfs:
      446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      35486    38030    38030    38030

   2.6+ diskstats:
      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3    1   hda1 35486 38030 38030 38030

   4.18+ diskstats:
      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0

On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.

The sysfs choice works well if you are watching a known, small set of
disks. ``/proc/diskstats`` may be a better choice if you are watching a
large number of disks, because you'll avoid the overhead of 50, 100, or
500 or more opens/closes with each snapshot of your disk statistics.
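
As an illustration of the two access paths, here is a minimal Python sketch
that parses both layouts. It assumes the 2.6+ formats shown in the examples
above; ``parse_sysfs_stat`` and ``parse_diskstats`` are hypothetical helper
names, not part of any existing tool.

```python
def parse_sysfs_stat(text):
    """Parse the contents of one /sys/block/<dev>/stat file into integers."""
    return [int(value) for value in text.split()]


def parse_diskstats(text):
    """Parse /proc/diskstats-style text into {device name: [fields...]}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or truncated lines
        # parts[0] and parts[1] are the major and minor device numbers,
        # parts[2] is the device name; the statistics fields follow.
        stats[parts[2]] = [int(value) for value in parts[3:]]
    return stats


# Sample data copied from the "2.6+ diskstats" example above.
SAMPLE_DISKSTATS = """\
   3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
   3    1   hda1 35486 38030 38030 38030
"""

sample = parse_diskstats(SAMPLE_DISKSTATS)
print(sample["hda"][0])   # first statistics field of hda
print(sample["hda1"])     # the four partition fields
```

On a live system the same function could be fed the result of a single
``open("/proc/diskstats").read()``, which covers every device at once, while
the sysfs route needs one open per watched disk.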

In 2.4, the statistics fields are those after the device name. In
the above example, the first field of statistics would be 446216.
By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
find just the eleven fields, beginning with 446216. If you look at
``/proc/diskstats``, the eleven fields will be preceded by the major and
minor device numbers, and device name. Each of these formats provides
eleven fields of statistics, each meaning exactly the same things;
newer kernels append the discard and flush fields (12 through 17)
described below, as in the 4.18+ example above.
All fields except field 9 are cumulative since boot. Field 9 should
go to zero as I/Os complete; all others only increase (unless they
overflow and wrap). These are unsigned long numbers (32-bit or 64-bit,
the native word size), and on a very busy or long-lived system they
may wrap. Applications should be prepared to deal with that; unless
the interval between your observations is measured in large numbers of
minutes or hours, they should not wrap twice before you notice them.
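
One way to tolerate a single wrap between two observations is modular
subtraction, sketched below in Python. The helper name ``counter_delta`` is
hypothetical, and the counter width is an assumption the caller must supply
to match the system being observed; the result is only right if the counter
wrapped at most once between the two samples.

```python
def counter_delta(previous, current, bits=64):
    """Return current - previous for a counter that wraps at 2**bits.

    Correct only if the counter wrapped at most once between samples.
    """
    return (current - previous) % (1 << bits)


# No wrap: the plain difference.
print(counter_delta(446216, 446300))            # 84
# One wrap of a 32-bit counter between the two samples.
print(counter_delta(2**32 - 10, 5, bits=32))    # 15
```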

Each set of stats only applies to the indicated device; if you want
system-wide stats you'll have to find all the devices and sum them all up.
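
A system-wide total can be sketched as a sum of one field over a chosen list
of devices. In the Python example below, ``sum_field`` is a hypothetical
helper, the ``hda`` values come from the example above, and the ``hdb``
values are invented purely for illustration. Listing only whole disks is an
assumption made here so that I/O to a partition is not counted a second time
through its parent disk.

```python
def sum_field(stats, field_number, devices):
    """Sum one 1-based statistics field over the named devices."""
    return sum(stats[name][field_number - 1] for name in devices)


# Hypothetical snapshot for two whole disks (device name -> fields 1..11).
snapshot = {
    "hda": [446216, 784926, 9550688, 4382310, 424847,
            312726, 5922052, 19310380, 0, 3376340, 23705160],
    "hdb": [1000, 10, 8000, 500, 2000, 20, 16000, 900, 0, 1200, 1400],
}

total_reads = sum_field(snapshot, 1, ["hda", "hdb"])
total_sectors_written = sum_field(snapshot, 7, ["hda", "hdb"])
print(total_reads, total_sectors_written)
```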

Field 1 -- # of reads completed
    This is the total number of reads completed successfully.

Field 2 -- # of reads merged, field 6 -- # of writes merged
    Reads and writes which are adjacent to each other may be merged for
    efficiency. Thus two 4K reads may become one 8K read before it is
    ultimately handed to the disk, and so it will be counted (and queued)
    as only one I/O. This field lets you know how often this was done.

Field 3 -- # of sectors read
    This is the total number of sectors read successfully.

Field 4 -- # of milliseconds spent reading
    This is the total number of milliseconds spent by all reads (as
    measured from __make_request() to end_that_request_last()).

Field 5 -- # of writes completed
    This is the total number of writes completed successfully.

Field 6 -- # of writes merged
    See the description of field 2.

Field 7 -- # of sectors written
    This is the total number of sectors written successfully.

Field 8 -- # of milliseconds spent writing
    This is the total number of milliseconds spent by all writes (as
    measured from __make_request() to end_that_request_last()).

Field 9 -- # of I/Os currently in progress
    The only field that should go to zero. Incremented as requests are
    given to the appropriate struct request_queue and decremented as they
    finish.

Field 10 -- # of milliseconds spent doing I/Os
    This field increases so long as field 9 is nonzero.

    Since 5.0 this field counts jiffies when at least one request was
    started or completed. If a request runs for more than 2 jiffies, then
    some I/O time will not be accounted unless there are other requests.

Field 11 -- weighted # of milliseconds spent doing I/Os
    This field is incremented at each I/O start, I/O completion, I/O
    merge, or read of these stats by the number of I/Os in progress
    (field 9) times the number of milliseconds spent doing I/O since the
    last update of this field. This can provide an easy measure of both
    I/O completion time and the backlog that may be accumulating.
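
To show how fields 10 and 11 can be turned into such measures, the Python
sketch below compares two snapshots of fields 1 through 11 taken a known
number of milliseconds apart. The function ``derive_rates``, its key names,
and the second snapshot are hypothetical; the arithmetic is one plausible
reading of the field descriptions above rather than the method of any
particular tool, and it ignores counter wraparound for brevity.

```python
def derive_rates(previous, current, interval_ms):
    """Derive simple per-interval figures from two snapshots of fields 1..11."""
    delta = [c - p for p, c in zip(previous, current)]
    completed = delta[0] + delta[4]  # fields 1 + 5: reads + writes completed
    return {
        # Fraction of the interval during which I/O was in progress (field 10).
        "utilization": delta[9] / interval_ms,
        # Average number of I/Os in progress over the interval (field 11).
        "average_queue": delta[10] / interval_ms,
        # Average milliseconds per completed I/O (fields 4 + 8).
        "average_io_ms": (delta[3] + delta[7]) / completed if completed else 0.0,
    }


# First snapshot: the hda example line. Second snapshot: invented values
# one second later, for illustration only.
before = [446216, 784926, 9550688, 4382310, 424847,
          312726, 5922052, 19310380, 0, 3376340, 23705160]
after = [446316, 784926, 9551488, 4382710, 424947,
         312726, 5922852, 19310980, 2, 3376840, 23706160]

rates = derive_rates(before, after, interval_ms=1000)
print(rates)
```

Field 9 is a gauge rather than a cumulative counter, so its difference
between snapshots is deliberately left unused.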

Field 12 -- # of discards completed
    This is the total number of discards completed successfully.

Field 13 -- # of discards merged
    See the description of field 2.

Field 14 -- # of sectors discarded
    This is the total number of sectors discarded successfully.

Field 15 -- # of milliseconds spent discarding
    This is the total number of milliseconds spent by all discards (as
    measured from __make_request() to end_that_request_last()).

Field 16 -- # of flush requests completed
    This is the total number of flush requests completed successfully.

    The block layer combines flush requests and executes at most one at a
    time. This field counts flush requests executed by the disk. It is not
    tracked for partitions.

Field 17 -- # of milliseconds spent flushing
    This is the total number of milliseconds spent by all flush requests.
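
Because the number of fields on a line depends on the kernel version, a
parser can name whatever fields are present instead of assuming a fixed
count. The Python sketch below does this with ``zip``, which stops at the
shorter of its inputs. The field names and the helper ``name_fields`` are
hypothetical labels chosen for this example; only the field order comes from
the descriptions above.

```python
FIELD_NAMES = (
    "reads_completed", "reads_merged", "sectors_read", "ms_reading",
    "writes_completed", "writes_merged", "sectors_written", "ms_writing",
    "ios_in_progress", "ms_doing_io", "weighted_ms_doing_io",
    "discards_completed", "discards_merged", "sectors_discarded",
    "ms_discarding", "flushes_completed", "ms_flushing",
)


def name_fields(values):
    """Label the statistics fields that are present, in documented order."""
    if len(values) < 11:
        raise ValueError("expected at least the eleven base fields")
    # zip() stops at the shorter sequence, so absent fields are simply omitted.
    return dict(zip(FIELD_NAMES, values))


# The "4.18+ diskstats" example line from above (fifteen statistics fields).
LINE_4_18 = ("   3    0   hda 446216 784926 9550688 4382310 424847 312726 "
             "5922052 19310380 0 3376340 23705160 0 0 0 0")
values = [int(v) for v in LINE_4_18.split()[3:]]
named = name_fields(values)
print(len(named), named["discards_completed"], "flushes_completed" in named)
```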

To avoid introducing performance bottlenecks, no locks are held while
modifying these counters. This implies that minor inaccuracies may be
introduced when changes collide, so (for instance) adding up all the
read I/Os issued per partition should equal those made to the disks ...
but due to the lack of locking it may only be very close.

In 2.6+, there are counters for each CPU, which make the lack of locking
almost a non-issue. When the statistics are read, the per-CPU counters
are summed (possibly overflowing the unsigned long variable they are
summed to) and the result given to the user. There is no convenient
user interface for accessing the per-CPU counters themselves.

Disks vs Partitions
-------------------

There were significant changes between 2.4 and 2.6+ in the I/O subsystem.
As a result, some statistics information disappeared. The translation from
a disk address relative to a partition to the disk address relative to
the host disk happens much earlier. All merges and timings now happen
at the disk level rather than at both the disk and partition level as
in 2.4. Consequently, you'll see a different statistics output on 2.6+ for
partitions from that for disks. There are only *four* fields available
for partitions on 2.6+ machines. This is reflected in the examples above.

Field 1 -- # of reads issued
    This is the total number of reads issued to this partition.

Field 2 -- # of sectors read
    This is the total number of sectors requested to be read from this
    partition.

Field 3 -- # of writes issued
    This is the total number of writes issued to this partition.

Field 4 -- # of sectors written
    This is the total number of sectors requested to be written to
    this partition.

Note that since the address is translated to a disk-relative one, and no
record of the partition-relative address is kept, the subsequent success
or failure of the read cannot be attributed to the partition. In other
words, reads are counted slightly before the time of queuing for
partitions, and at completion for whole disks. This is
a subtle distinction that is probably uninteresting for most cases.

More significant is the error induced by counting the numbers of
reads/writes before merges for partitions and after for disks. Since a
typical workload usually contains a lot of successive and adjacent requests,
the number of reads/writes issued can be several times higher than the
number of reads/writes completed.

Since 2.6.25, the full statistic set is again available for partitions and
disk and partition statistics are consistent again. Because we still don't
keep a record of the partition-relative address, an operation is attributed
to the partition which contains the first sector of the request after the
eventual merges. As requests can be merged across partitions, this could lead
to some (probably insignificant) inaccuracy.
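
A tool that must cope with both partition layouts can tell them apart by the
number of fields, as in the Python sketch below. ``partition_summary`` and
its key names are hypothetical. The four-field sample is the ``hda1`` line
from the 2.6+ examples above; the eleven-field sample reuses the 2.4 ``hda1``
values simply as a line with the full field set.

```python
LEGACY_PARTITION_FIELDS = (
    "reads_issued", "sectors_read", "writes_issued", "sectors_written",
)


def partition_summary(values):
    """Return read/write counts and sectors for either partition layout."""
    if len(values) == 4:
        # Four-field layout: counts are taken at issue time, before merges.
        return dict(zip(LEGACY_PARTITION_FIELDS, values))
    if len(values) >= 11:
        # Full layout: pick fields 1, 3, 5 and 7 of the standard set.
        return {
            "reads_completed": values[0],
            "sectors_read": values[2],
            "writes_completed": values[4],
            "sectors_written": values[6],
        }
    raise ValueError("unrecognized number of statistics fields")


print(partition_summary([35486, 38030, 38030, 38030]))
print(partition_summary([35486, 0, 35496, 38030, 0, 0, 0, 0, 0, 38030, 38030]))
```

The two layouts use different key names on purpose: issued and completed
counts are not directly comparable, for the reasons given above.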

Additional notes
----------------

In 2.6+, sysfs is not mounted by default. If your distribution of
Linux hasn't added it already, here's the line you'll want to add to
your ``/etc/fstab``::

   none /sys sysfs defaults 0 0


In 2.6+, all disk statistics were removed from ``/proc/stat``. In 2.4, they
appear in both ``/proc/partitions`` and ``/proc/stat``, although the ones in
``/proc/stat`` take a very different format from those in ``/proc/partitions``
(see proc(5), if your system has it).

-- ricklind@us.ibm.com