.. SPDX-License-Identifier: GPL-2.0

=======================
Squashfs 4.0 Filesystem
=======================

Squashfs is a compressed read-only filesystem for Linux.

It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
directories. Inodes are very small, and all blocks are packed to minimise
data overhead. Block sizes greater than 4 KiB are supported, up to a maximum
of 1 MiB (the default block size is 128 KiB).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and for systems with
constrained block device or memory resources (e.g. embedded systems) where
low overhead is needed.

Mailing list: squashfs-devel@lists.sourceforge.net
Web site: www.squashfs.org

1. Filesystem Features
----------------------

Squashfs filesystem features versus Cramfs:

============================== ========= ==========
                               Squashfs  Cramfs
============================== ========= ==========
Max filesystem size            2^64      256 MiB
Max file size                  ~ 2 TiB   16 MiB
Max files                      unlimited unlimited
Max directories                unlimited unlimited
Max entries per directory      unlimited unlimited
Max block size                 1 MiB     4 KiB
Metadata compression           yes       no
Directory indexes              yes       no
Sparse file support            yes       no
Tail-end packing (fragments)   yes       no
Exportable (NFS etc.)          yes       no
Hard link support              yes       no
"." and ".." in readdir        yes       no
Real inode numbers             yes       no
32-bit uids/gids               yes       no
File creation time             yes       no
Xattr support                  yes       no
ACL support                    no        no
============================== ========= ==========

Squashfs compresses data, inodes and directories. In addition, inode and
directory data are highly compacted, and packed on byte boundaries. Each
compressed inode is on average 8 bytes in length (the exact length varies with
file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes).

2. Using Squashfs
-----------------

As Squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated Squashfs filesystems. This and other Squashfs utilities,
together with usage instructions, can be obtained from
http://www.squashfs.org.

The squashfs-tools development tree is now located on kernel.org::

    git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git

3. Squashfs Filesystem Design
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment::

     ---------------
    |  superblock   |
    |---------------|
    |  compression  |
    |    options    |
    |---------------|
    |  datablocks   |
    |  & fragments  |
    |---------------|
    |  inode table  |
    |---------------|
    |   directory   |
    |     table     |
    |---------------|
    |   fragment    |
    |     table     |
    |---------------|
    |    export     |
    |     table     |
    |---------------|
    |    uid/gid    |
    |  lookup table |
    |---------------|
    |     xattr     |
    |     table     |
     ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates. Once all file data has been
written, the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.

3.1 Compression options
-----------------------

Compressors can optionally support compression-specific options (e.g.
dictionary size). If non-default compression options have been used, they
are stored in this part of the filesystem.

3.2 Inodes
----------

Metadata (inodes and directories) are compressed in 8 KiB blocks. Each
compressed block is prefixed by a two-byte length; the top bit is set if the
block is uncompressed. A block will be uncompressed if the -noI option is set,
or if the compressed block was larger than the uncompressed block.

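As an illustration, the two-byte prefix could be decoded as in the following
Python sketch. It assumes the prefix is stored little-endian with the stored
length in the low 15 bits; the text above only states that the top bit marks
an uncompressed block. The name ``parse_metadata_header`` is hypothetical.

```python
import struct

UNCOMPRESSED_FLAG = 0x8000  # top bit of the 16-bit prefix


def parse_metadata_header(header):
    """Return (stored_length, is_uncompressed) for a metadata block prefix.

    Assumption: the prefix is a little-endian unsigned 16-bit value.
    """
    (raw,) = struct.unpack("<H", header[:2])
    stored_length = raw & (UNCOMPRESSED_FLAG - 1)  # low 15 bits
    is_uncompressed = bool(raw & UNCOMPRESSED_FLAG)
    return stored_length, is_uncompressed
```

A reader would then fetch ``stored_length`` bytes following the prefix and
pass them to the decompressor only when ``is_uncompressed`` is false.
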
Inodes are packed into the metadata blocks, and are not aligned to block
boundaries, so an inode can straddle two compressed blocks. Inodes are
identified by a 48-bit number which encodes the location of the compressed
metadata block containing the inode, and the byte offset into that block where
the inode is placed (<block, offset>).

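The <block, offset> pair can be packed into and recovered from a single
integer with shifts and masks. The sketch below assumes a split of 32 bits
for the block location and 16 bits for the offset, which fits the 48-bit
total and the 8 KiB metadata block size but is not spelled out in the text
above. Both function names are hypothetical.

```python
OFFSET_BITS = 16  # assumption: low 16 bits hold the offset into the block
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_inode_ref(block, offset):
    """Pack a metadata block location and a byte offset into one number."""
    if not 0 <= block < (1 << 32) or not 0 <= offset <= OFFSET_MASK:
        raise ValueError("block or offset out of range")
    return (block << OFFSET_BITS) | offset


def split_inode_ref(ref):
    """Recover (block, offset) from a 48-bit inode reference."""
    if not 0 <= ref < (1 << 48):
        raise ValueError("inode reference does not fit in 48 bits")
    return ref >> OFFSET_BITS, ref & OFFSET_MASK
```
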
To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), with the inode contents and length
varying with the type.

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table. Directories are accessed using the start address of
the metablock containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are not simply a list of file names. Their organisation takes
advantage of the fact that (in most cases) the inodes of the files will be in
the same compressed metadata block, and can therefore share the start block.
Directories are organised as a two-level list: a directory header containing
the shared start block value, followed by a sequence of directory entries,
each of which uses that start block. A new directory header is written
whenever the inode start block changes. The directory header/directory entry
list is repeated as many times as necessary.

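The two-level layout can be pictured as grouping consecutive entries by inode
start block. The Python sketch below is a simplified model only: the tuple
layout and the name ``group_directory`` are hypothetical, and any other
condition that might force a new header in the real format is not modelled.

```python
from itertools import groupby


def group_directory(entries):
    """Group name-sorted entries into (start_block, [(name, offset), ...]) runs.

    entries: list of (name, inode_start_block, inode_offset) tuples.
    A new run (i.e. a new directory header) begins whenever the inode
    start block differs from that of the previous entry.
    """
    runs = []
    for start_block, run in groupby(entries, key=lambda entry: entry[1]):
        runs.append((start_block, [(name, offset) for name, _, offset in run]))
    return runs
```

For example, four entries whose inodes live in blocks 0, 0, 100 and 0 produce
three headers, because the start block changes twice.
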
Directories are sorted in alphabetical order, and can contain a directory
index to speed up file lookup. Directory indexes store one entry per
metablock, each entry storing the index/filename mapping to the first
directory header in each metadata block. At lookup the index is scanned
linearly, looking for the first filename alphabetically larger than the
filename being looked up. At this point the location of the metadata block
the filename is in has been found.
The general idea of the index is to ensure only one metadata block needs to be
decompressed to do a lookup irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.

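The linear index scan described above can be sketched as follows. The index
is modelled as a list of (first_filename, block) pairs and the directory's
first block is passed separately; this representation and the name
``find_block`` are assumptions made for the example, not the on-disk layout.

```python
def find_block(index, first_block, name):
    """Return the metadata block that would contain `name`.

    index: list of (first_filename, block) pairs in sorted order, one per
    metadata block after the first.
    The scan stops at the first index filename larger than `name`; the
    block remembered just before that point is the one to decompress.
    """
    block = first_block
    for first_filename, candidate in index:
        if first_filename > name:
            break
        block = candidate
    return block
```

Only the block returned by the scan needs to be decompressed and searched,
however many metadata blocks the directory occupies.
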
3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block). The compressed size
of each datablock is stored in a block list contained within the
file inode.

To speed up access to datablocks when reading 'large' files (256 MiB or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk. The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.

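The figures quoted above are consistent with one another, as the short
calculation below shows. It assumes each block list entry occupies 4 bytes,
which is not stated in this document; under that assumption one 8 KiB
metadata block of block list describes exactly 256 MiB of file data at the
default 128 KiB block size, matching the 'large' file threshold.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB

METADATA_BLOCK = 8 * KIB    # metadata block size from section 3.2
DATA_BLOCK = 128 * KIB      # default data block size
BLOCK_LIST_ENTRY = 4        # assumption: bytes per block list entry

# Data covered by one metadata block's worth of block list entries.
entries_per_metablock = METADATA_BLOCK // BLOCK_LIST_ENTRY
bytes_per_metablock = entries_per_metablock * DATA_BLOCK

# Total file size the cache can describe when all slots serve one file.
SLOTS = 8
SLOT_COVERAGE = 224 * GIB
max_indexed_size = SLOTS * SLOT_COVERAGE
```

Here ``bytes_per_metablock`` works out to 256 MiB and ``max_indexed_size`` to
1792 GiB, i.e. 1.75 TiB.
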
3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table. This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these. For speed of access (and
because it is small), this second index table is read at mount time and
cached in memory.

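The two-level lookup amounts to a division and a remainder: the quotient
selects an entry in the in-memory index table (and hence a metadata block),
and the remainder is the byte offset inside that block once decompressed.
The sketch below assumes fixed-size entries of 16 bytes for the fragment
table; that size and the name ``locate_entry`` are assumptions. The same
arithmetic would apply to any table of fixed-size entries stored in 8 KiB
metadata blocks.

```python
METADATA_SIZE = 8192       # uncompressed metadata block size in bytes
FRAGMENT_ENTRY_SIZE = 16   # assumption: bytes per fragment table entry


def locate_entry(index, entry_size=FRAGMENT_ENTRY_SIZE,
                 metadata_size=METADATA_SIZE):
    """Return (metablock_number, byte_offset) for a lookup table entry.

    metablock_number indexes the second (in-memory) index table, which
    gives the on-disk location of the metadata block to decompress.
    """
    byte_position = index * entry_size
    return byte_position // metadata_size, byte_position % metadata_size
```
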
3.6 Uid/gid lookup table
------------------------

For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table. This table is
stored compressed into metadata blocks. A second index table is used to
locate these. For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.

3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.), filesystems
can optionally (disabled with the -no-exports mksquashfs option) contain
an inode number to inode disk location lookup table. This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks. A second index table is
used to locate these. For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode. The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field. The type field encodes the xattr prefix
("user.", "trusted." etc.) and it also encodes how the name/value fields
should be interpreted. Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored). This allows large values
to be stored out of line, improving scanning and lookup performance, and it
also allows values to be de-duplicated: the value is stored once, and
all other occurrences hold an out of line reference to that value.

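A type field that carries both a prefix and an inline/out-of-line marker can
be decoded with a mask and a flag bit, as sketched below. The bit layout
(prefix in the low 8 bits, bit 8 as the out-of-line flag) and the prefix
numbering are assumptions for illustration and should be checked against the
kernel's Squashfs headers. The name ``decode_xattr_type`` is hypothetical.

```python
PREFIX_MASK = 0xFF        # assumption: low 8 bits select the prefix
VALUE_OUT_OF_LINE = 0x100  # assumption: bit 8 marks an out of line value

# Assumed prefix numbering, for illustration only.
PREFIXES = {0: "user.", 1: "trusted.", 2: "security."}


def decode_xattr_type(type_field):
    """Return (prefix, is_out_of_line) for an xattr entry type field.

    Raises KeyError if the prefix number is not in PREFIXES.
    """
    prefix = PREFIXES[type_field & PREFIX_MASK]
    is_out_of_line = bool(type_field & VALUE_OUT_OF_LINE)
    return prefix, is_out_of_line
```

When ``is_out_of_line`` is true, the value field would be read as a reference
to the shared value rather than as the value itself.
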
The xattr lists are packed into compressed 8K metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored. This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOs and Outstanding Issues
-------------------------------

4.1 TODO list
-------------

Implement ACL support.

4.2 Squashfs Internal Cache
---------------------------

Blocks in Squashfs are compressed. To avoid repeatedly decompressing
recently accessed data, Squashfs uses two small caches, one for metadata and
one for fragments.

The cache is not used for file datablocks; these are decompressed and cached in
the page-cache in the normal way. The cache is used to temporarily cache
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access. Because metadata and fragments
are packed together into blocks (to gain greater compression), the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it. Because of locality of reference, these may be
read in the near future. Temporarily caching them ensures they are available
for near future access without requiring an additional read and decompress.

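The benefit of such a cache can be shown with a small user-space model. The
sketch below keeps a bounded number of decompressed blocks keyed by their
on-disk location and evicts the least recently used one. The eviction policy,
the ``BlockCache`` class and the ``read_block`` callback are all assumptions
chosen for the example and do not describe the kernel implementation; zlib
stands in for whichever compressor the filesystem uses.

```python
import zlib
from collections import OrderedDict


class BlockCache:
    """Small cache of decompressed blocks keyed by on-disk location."""

    def __init__(self, read_block, capacity=8):
        self._read_block = read_block  # hypothetical: location -> compressed bytes
        self._capacity = capacity
        self._entries = OrderedDict()
        self.decompressions = 0        # counts cache misses, for illustration

    def get(self, location):
        if location in self._entries:
            # Hit: mark as most recently used, no read or decompress needed.
            self._entries.move_to_end(location)
            return self._entries[location]
        data = zlib.decompress(self._read_block(location))
        self.decompressions += 1
        self._entries[location] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return data
```

Two lookups that land in the same packed block then cost a single read and
decompress, which is the effect described above.
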
In the future this internal cache may be replaced with an implementation which
uses the kernel page cache. Because the page cache operates on page-sized
units, this may introduce additional complexity in terms of locking and
associated race conditions.