.. SPDX-License-Identifier: GPL-2.0

============================
Ceph Distributed File System
============================

Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.

Basic features include:

* POSIX semantics
* Seamless scaling from 1 to many thousands of nodes
* High availability and reliability. No single point of failure.
* N-way replication of data across storage nodes
* Fast recovery from node failures
* Automatic rebalancing of data on node addition/removal
* Easy deployment: most FS components are userspace daemons

Also,

* Flexible snapshots (on any directory)
* Recursive accounting (nested files, directories, bytes)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre. Unlike Lustre, however, metadata and
storage nodes run entirely as userspace daemons. File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughput. When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.

Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures. The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads. In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation. The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.

The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."

Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system. Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.
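
The same two operations can be driven from a program. The following is a
minimal Python sketch, assuming the target directory lives on a mounted
Ceph file system; the helper names 'snapshot_path', 'create_snapshot' and
'remove_snapshot' are hypothetical and exist only for illustration.

```python
import os

SNAP_DIR = ".snap"  # hidden snapshot directory named in the text above


def snapshot_path(directory, name):
    """Return the path that represents snapshot `name` of `directory`."""
    return os.path.join(directory, SNAP_DIR, name)


def create_snapshot(directory, name):
    """Equivalent of running 'mkdir .snap/<name>' inside `directory`."""
    path = snapshot_path(directory, name)
    os.mkdir(path)
    return path


def remove_snapshot(directory, name):
    """Equivalent of running 'rmdir .snap/<name>' inside `directory`."""
    os.rmdir(snapshot_path(directory, name))
```

On a file system other than Ceph the '.snap' directory does not exist, so
'create_snapshot' would raise FileNotFoundError there.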

Ceph also provides some recursive accounting on directories for nested
files and bytes. That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
subdirectories, and a summation of all nested file sizes. This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.
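
As an illustration, the recursive statistics can be read programmatically
through the extended attribute interface. This is a sketch only: the
attribute names 'ceph.dir.rfiles', 'ceph.dir.rsubdirs' and 'ceph.dir.rbytes'
are assumptions not stated in the text above, and 'read_recursive_stats' is
a hypothetical helper. The 'getxattr' parameter lets the function be
exercised without a Ceph mount.

```python
import os

# Assumed names of the virtual extended attributes that carry the
# recursive statistics; the text above only shows 'getfattr -d'.
RSTAT_ATTRS = ("ceph.dir.rfiles", "ceph.dir.rsubdirs", "ceph.dir.rbytes")


def read_recursive_stats(directory, getxattr=None):
    """Return {'rfiles': n, 'rsubdirs': n, 'rbytes': n} for `directory`."""
    # os.getxattr is only available on Linux, so resolve it lazily.
    getxattr = getxattr or os.getxattr
    stats = {}
    for attr in RSTAT_ATTRS:
        raw = getxattr(directory, attr)  # bytes holding a decimal number
        short_name = attr.rsplit(".", 1)[1]
        stats[short_name] = int(raw.decode("ascii"))
    return stats
```

A call such as 'read_recursive_stats("/mnt/ceph/projects")' would then
replace a recursive 'du' style walk with three attribute reads.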

Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy. Quotas can be set using
extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', e.g.::

  setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
  getfattr -n ceph.quota.max_bytes /some/dir

A limitation of the current quotas implementation is that it relies on the
cooperation of the client mounting the file system to stop writers when a
limit is reached. A modified or adversarial client cannot be prevented
from writing as much data as it needs.
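
The 'setfattr' and 'getfattr' calls above map onto Python's 'os.setxattr'
and 'os.getxattr'. The sketch below assumes a Linux host and a directory
on a mounted Ceph file system; 'set_quota' and 'get_quota' are hypothetical
helper names, and the injectable 'setxattr'/'getxattr' parameters exist
only so the logic can be tried without such a mount.

```python
import os

QUOTA_PREFIX = "ceph.quota."  # attribute namespace used in the text above


def set_quota(directory, max_bytes=None, max_files=None, setxattr=None):
    """Set one or both quota limits on `directory`."""
    setxattr = setxattr or os.setxattr
    if max_bytes is not None:
        value = str(max_bytes).encode("ascii")
        setxattr(directory, QUOTA_PREFIX + "max_bytes", value)
    if max_files is not None:
        value = str(max_files).encode("ascii")
        setxattr(directory, QUOTA_PREFIX + "max_files", value)


def get_quota(directory, name, getxattr=None):
    """Read quota `name` ('max_bytes' or 'max_files') as an integer."""
    getxattr = getxattr or os.getxattr
    raw = getxattr(directory, QUOTA_PREFIX + name)
    return int(raw.decode("ascii"))
```

For example, 'set_quota("/some/dir", max_bytes=100000000)' corresponds to
the 'setfattr' line shown above.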

Mount Syntax
============

The basic mount syntax is::

  # mount -t ceph user@fsid.fs_name=/[subdir] mnt -o mon_addr=monip1[:port][/monip2[:port]]

You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify
happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
1.2.3.4::

  # mount -t ceph cephuser@07fe3187-00d9-42a3-814b-72a4d5e7d5be.cephfs=/ /mnt/ceph -o mon_addr=1.2.3.4

is sufficient. If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address and the cluster FSID can be left out
(as the mount helper will fill it in by reading the Ceph configuration
file)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=mon-addr

Multiple monitor addresses can be passed by separating each address with a slash (`/`)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=192.168.1.100/192.168.1.101

When using the mount helper, the monitor address can be read from the Ceph
configuration file if available. Note that the cluster FSID (passed as part
of the device string) is validated against the FSID reported by
the monitor.
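
To make the structure of the device string and the 'mon_addr' option
explicit, the following Python sketch assembles the commands shown above
from their parts. It only builds strings and does not run 'mount'; the
function names 'device_string', 'mon_addr_option' and 'mount_argv' are
hypothetical.

```python
def device_string(user, fs_name, fsid=None, path="/"):
    """Build 'user@fsid.fs_name=/path'.

    The FSID may be omitted when the mount helper fills it in.
    """
    cluster = f"{fsid}.{fs_name}" if fsid else fs_name
    return f"{user}@{cluster}={path}"


def mon_addr_option(monitors):
    """Join one or more monitor addresses with '/' as described above."""
    return "mon_addr=" + "/".join(monitors)


def mount_argv(user, fs_name, mount_point, monitors, fsid=None, path="/"):
    """Return the argument vector for the mount command."""
    return [
        "mount", "-t", "ceph",
        device_string(user, fs_name, fsid=fsid, path=path),
        mount_point,
        "-o", mon_addr_option(monitors),
    ]
```

Joining the result of 'mount_argv("cephuser", "cephfs", "/mnt/ceph",
["192.168.1.100", "192.168.1.101"])' with spaces reproduces the
multiple-monitor example above.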

Mount Options
=============

mon_addr=ip_address[:port][/ip_address[:port]]
    Monitor address to the cluster. This is used to bootstrap the
    connection to the cluster. Once the connection is established, the
    monitor addresses in the monitor map are followed.

fsid=cluster-id
    FSID of the cluster (from the `ceph fsid` command).

ip=A.B.C.D[:N]
    Specify the IP and/or port the client should bind to locally.
    There is normally not much reason to do this. If the IP is not
    specified, the client's IP address is determined by looking at the
    address its connection to the monitor originates from.

wsize=X
    Specify the maximum write size in bytes. Default: 64 MB.

rsize=X
    Specify the maximum read size in bytes. Default: 64 MB.

rasize=X
    Specify the maximum readahead size in bytes. Default: 8 MB.

mount_timeout=X
    Specify the timeout value for mount (in seconds), in the case
    of a non-responsive Ceph file system. The default is 60
    seconds.

caps_max=X
    Specify the maximum number of caps to hold. Unused caps are released
    when the number of caps exceeds the limit. The default is 0 (no limit).

rbytes
    When stat() is called on a directory, set st_size to 'rbytes',
    the summation of file sizes over all files nested beneath that
    directory. This is the default.

norbytes
    When stat() is called on a directory, set st_size to the
    number of entries in that directory.

nocrc
    Disable CRC32C calculation for data writes. If set, the storage node
    must rely on TCP's checksum to detect data corruption
    in the data payload.

dcache
    Use the dcache contents to perform negative lookups and
    readdir when the client has the entire directory contents in
    its cache. (This does not change correctness; the client uses
    cached metadata only when a lease or capability ensures it is
    valid.)

nodcache
    Do not use the dcache as above. This avoids a significant amount of
    complex code, sacrificing performance without affecting correctness,
    and is useful for tracking down bugs.

noasyncreaddir
    Do not use the dcache as above for readdir.

noquotadf
    Report overall filesystem usage in statfs instead of using the root
    directory quota.

nocopyfrom
    Don't use the RADOS 'copy-from' operation to perform remote object
    copies. Currently, it's only used in copy_file_range, which will revert
    to the default VFS implementation if this option is used.

recover_session=<no\|clean>
    Set auto reconnect mode in the case where the client is blocklisted. The
    available modes are "no" and "clean". The default is "no".

    * no: never attempt to reconnect when the client detects that it has
      been blocklisted. Operations will generally fail after being
      blocklisted.

    * clean: the client reconnects to the Ceph cluster automatically when
      it detects that it has been blocklisted. During reconnect, the client
      drops dirty data/metadata and invalidates page caches and writable
      file handles. After reconnect, file locks become stale because the
      MDS loses track of them. If an inode contains any stale file locks,
      read/write on the inode is not allowed until applications release
      all stale file locks.
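
As a rough illustration of how these options combine, the Python sketch
below assembles a comma-separated string for 'mount -o' and rejects names
or 'recover_session' modes that are not listed above. The helper
'build_options' is hypothetical; the validation mirrors only this document
and is not a description of what the kernel client itself checks.

```python
# Options that take a value, as listed above.
VALUE_OPTIONS = {
    "mon_addr", "fsid", "ip", "wsize", "rsize", "rasize",
    "mount_timeout", "caps_max", "recover_session",
}
# Options that are plain flags, as listed above.
FLAG_OPTIONS = {
    "rbytes", "norbytes", "nocrc", "dcache", "nodcache",
    "noasyncreaddir", "noquotadf", "nocopyfrom",
}
RECOVER_SESSION_MODES = {"no", "clean"}


def build_options(flags=(), **values):
    """Return a comma-separated option string suitable for 'mount -o'."""
    parts = []
    for flag in flags:
        if flag not in FLAG_OPTIONS:
            raise ValueError(f"unknown flag option: {flag}")
        parts.append(flag)
    for key, value in values.items():
        if key not in VALUE_OPTIONS:
            raise ValueError(f"unknown value option: {key}")
        if key == "recover_session" and value not in RECOVER_SESSION_MODES:
            raise ValueError(f"invalid recover_session mode: {value}")
        parts.append(f"{key}={value}")
    return ",".join(parts)
```

For instance, 'build_options(flags=["norbytes"], mon_addr="1.2.3.4",
recover_session="clean")' yields a string that can be passed after '-o'.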

More Information
================

For more information on Ceph, see the home page at
    https://ceph.com/

The Linux kernel client source tree is available at
    - https://github.com/ceph/ceph-client.git
    - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git

and the source for the full system is at
    https://github.com/ceph/ceph.git