====================
TCM Userspace Design
====================


.. Contents:

   1) Design
     a) Background
     b) Benefits
     c) Design constraints
     d) Implementation overview
        i. Mailbox
        ii. Command ring
        iii. Data Area
     e) Device discovery
     f) Device events
     g) Other contingencies
   2) Writing a user pass-through handler
     a) Discovering and configuring TCMU uio devices
     b) Waiting for events on the device(s)
     c) Managing the command ring
   3) A final note


Design
======

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel. TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols. TCM also modularizes the data storage. There are existing
modules for file, block device, RAM or using another SCSI device as
storage. These are called "backstores" or "storage engines". These
built-in modules are implemented entirely as kernel code.

Background
----------

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also modularizes
the actual data storage. These are referred to as "backstores"
or "storage engines". The target comes with backstores that allow a
file, a block device, RAM, or another SCSI device to be used for the
local storage needed for the exported SCSI LUN. Like the rest of LIO,
these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".


Benefits
--------

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is that there are more distinct components to
configure, and potentially to malfunction. This is unavoidable, but
hopefully not fatal if we're careful to keep things as simple as
possible.

Design constraints
------------------

- Good performance: high throughput, low latency
- Cleanly handle the cases where userspace:

   1) never attaches
   2) hangs
   3) dies
   4) misbehaves

- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend


Implementation overview
-----------------------

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region are: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands. Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.

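As a minimal sketch of what "everything is an offset" means for a
handler, the helper below resolves an offset against whatever base
address mmap() returned this time. The name demo_region_ptr() and the
bounds check are illustrative assumptions, not part of the TCMU
interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn an offset found in the shared region into a
 * pointer that is valid for the current mapping of that region.
 */
static inline void *demo_region_ptr(void *region_base, size_t region_len,
                                    uint64_t off)
{
    /* An offset that points outside the mapping is never valid. */
    assert(off < region_len);
    return (uint8_t *)region_base + off;
}
```

Because only offsets are stored, the same value resolves correctly
after a restart: a handler passes the new base address and gets the
same location within the region.
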
See target_core_user.h for the struct definitions.

The Mailbox
-----------

The mailbox is always at the start of the shared memory region, and
contains a version, details about the starting offset and size of the
command ring, and head and tail pointers to be used by the kernel and
userspace (respectively) to put commands on the ring, and indicate
when the commands are completed.

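A handler may want to sanity-check the mailbox once after mapping the
region. The sketch below is only an illustration: struct demo_mailbox
is a simplified stand-in for struct tcmu_mailbox (real code should use
the definition in target_core_user.h), the expected version follows the
value given in this section, and the range checks are assumptions about
what a cautious handler might verify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the mailbox fields described in this section. */
struct demo_mailbox {
    uint16_t version;
    uint16_t flags;
    uint32_t cmdr_off;   /* ring start, relative to the region start */
    uint32_t cmdr_size;  /* ring size in bytes */
    uint32_t cmd_head;   /* advanced by the kernel */
    uint32_t cmd_tail;   /* advanced by userspace */
};

#define DEMO_EXPECTED_VERSION 1

/* Returns 0 if the mailbox looks usable, -1 otherwise. */
static int demo_mailbox_check(const struct demo_mailbox *mb, size_t region_len)
{
    assert(mb != NULL);

    if (mb->version != DEMO_EXPECTED_VERSION)
        return -1;  /* unknown interface version: give up */
    if (mb->cmdr_size == 0)
        return -1;
    if ((uint64_t)mb->cmdr_off + mb->cmdr_size > region_len)
        return -1;  /* the ring must lie inside the mapping */
    if (mb->cmd_head >= mb->cmdr_size || mb->cmd_tail >= mb->cmdr_size)
        return -1;  /* head and tail are offsets within the ring */
    return 0;
}
```

A handler would typically call such a check right after mmap() and
refuse to service the device if it fails.
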
version - 1 (userspace should abort if otherwise)

flags:
    - TCMU_MAILBOX_FLAG_CAP_OOOC:
      indicates out-of-order completion is supported.
      See "The Command Ring" for details.

cmdr_off
    The offset of the start of the command ring from the start
    of the memory region, to account for the mailbox size.
cmdr_size
    The size of the command ring. This does *not* need to be a
    power of two.
cmd_head
    Modified by the kernel to indicate when a command has been
    placed on the ring.
cmd_tail
    Modified by userspace to indicate when it has completed
    processing of a command.

The Command Ring
----------------

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.

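The head and tail arithmetic can be sketched as follows. The function
names are hypothetical. A true modulo is used rather than a bit mask
because cmdr_size does not need to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Advance a ring position (cmd_head or cmd_tail) past one entry. */
static inline uint32_t demo_ring_advance(uint32_t pos, uint32_t entry_len,
                                         uint32_t ring_size)
{
    assert(ring_size != 0);
    /* Widen first so that pos + entry_len cannot overflow 32 bits. */
    return (uint32_t)(((uint64_t)pos + entry_len) % ring_size);
}

/* The ring holds no pending commands when head and tail coincide. */
static inline bool demo_ring_empty(uint32_t head, uint32_t tail)
{
    return head == tail;
}
```

For example, with a ring of 3072 bytes, advancing position 3008 by a
64-byte entry wraps around to position 0.
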
TCMU commands are 8-byte aligned. They start with a common header
containing "len_op", a 32-bit value that stores the length, as well as
the opcode in the lowest unused bits. It also contains cmd_id and
flags fields for setting by the kernel (kflags) and userspace
(uflags).

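The packing of len_op can be illustrated with a few lines of C. Real
handlers should use tcmu_hdr_get_op() and tcmu_hdr_get_len() from
target_core_user.h. The demo_* names and the 3-bit mask below are
assumptions derived from the 8-byte alignment stated above: a length
that is a multiple of 8 always has its three lowest bits clear, which
leaves them free for the opcode.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mask: the three low bits that an 8-byte-aligned length never uses. */
#define DEMO_OP_MASK 0x7u

/* Combine an entry length and an opcode into one 32-bit value. */
static inline uint32_t demo_pack_len_op(uint32_t len, uint32_t op)
{
    assert((len & DEMO_OP_MASK) == 0);  /* length must be a multiple of 8 */
    assert(op <= DEMO_OP_MASK);
    return len | op;
}

/* Extract the entry length, discarding the opcode bits. */
static inline uint32_t demo_len(uint32_t len_op)
{
    return len_op & ~DEMO_OP_MASK;
}

/* Extract the opcode from the low bits. */
static inline uint32_t demo_op(uint32_t len_op)
{
    return len_op & DEMO_OP_MASK;
}
```

Packing a 72-byte entry with opcode 1 gives the value 73, from which
both fields can be recovered independently.
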
Currently only two opcodes are defined, TCMU_OP_CMD and TCMU_OP_PAD.

When the opcode is CMD, the entry in the command ring is a struct
tcmu_cmd_entry. Userspace finds the SCSI CDB (Command Data Block) via
tcmu_cmd_entry.req.cdb_off. This is an offset from the start of the
overall shared memory region, not the entry. The data in/out buffers
are accessible via the req.iov[] array. iov_cnt contains the number of
entries in iov[] needed to describe either the Data-In or Data-Out
buffers. For bidirectional commands, iov_cnt specifies how many iovec
entries cover the Data-Out area, and iov_bidi_cnt specifies how many
iovec entries immediately after that in iov[] cover the Data-In
area. Just like other fields, iov.iov_base is an offset from the start
of the region.

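As a rough sketch of how a handler might walk iov[] for a bidirectional
command, the helpers below treat the first iov_cnt entries as Data-Out
and the following iov_bidi_cnt entries as Data-In, and resolve iov_base
as an offset into the region. The helper names are hypothetical, and
bounds checking against the region size is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Sum the lengths of cnt iovec entries starting at iov[first]. */
static size_t demo_iov_total(const struct iovec *iov, size_t first, size_t cnt)
{
    size_t total = 0;

    assert(iov != NULL);
    for (size_t i = first; i < first + cnt; i++)
        total += iov[i].iov_len;
    return total;
}

/* Resolve an iovec whose iov_base holds a region offset, not a pointer. */
static void *demo_iov_ptr(void *region_base, const struct iovec *v)
{
    return (uint8_t *)region_base + (uintptr_t)v->iov_base;
}
```

With iov_cnt = 2 and iov_bidi_cnt = 1, the Data-Out length would be
demo_iov_total(iov, 0, 2) and the Data-In length
demo_iov_total(iov, 2, 1).
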
When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by the length of the entry, as stored in hdr.len_op
(mod cmdr_size), and signals the kernel via the UIO method, a 4-byte
write to the file descriptor.

If TCMU_MAILBOX_FLAG_CAP_OOOC is set in mailbox->flags, the kernel is
capable of handling out-of-order completions. In this case, userspace
can handle commands in an order different from the one in which they
were queued. Since the kernel still processes entries in the order in
which they appear in the command ring, userspace needs to update
cmd->id when completing a command (a.k.a. steal the original command's
entry).

When the opcode is PAD, userspace only updates cmd_tail as above --
it's a no-op. (The kernel inserts PAD entries to ensure each CMD entry
is contiguous within the command ring.)

More opcodes may be added in the future. If userspace encounters an
opcode it does not handle, it must set the UNKNOWN_OP bit (bit 0) in
hdr.uflags, update cmd_tail, and proceed with processing additional
commands, if any.

The Data Area
-------------

This is shared-memory space after the command ring. The organization
of this area is not defined in the TCMU interface, and userspace
should access only the parts referenced by pending iovs.


Device Discovery
----------------

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning sysfs
class/uio/uio*/name. For TCMU devices, these names will be of the
format::

    tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at::

    /sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size" that
userspace needs to know for correct operation.

<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to
LIO limitations.

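The name format above lends itself to a small parser. The following is
a sketch only: the struct, the function names and the buffer sizes are
made up for illustration, and treating <path> as optional is an
assumption based on the "if needed" wording above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the parts of a TCMU UIO name. */
struct demo_tcmu_name {
    unsigned int hba_num;
    char device_name[64];
    char subtype[64];
    char path[256];
};

/* Returns 0 on success, -1 if the name is not in the expected format. */
static int demo_parse_uio_name(const char *name, struct demo_tcmu_name *out)
{
    int n;

    assert(name != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    n = sscanf(name, "tcm-user/%u/%63[^/]/%63[^/]/%255[^\n]",
               &out->hba_num, out->device_name, out->subtype, out->path);
    /* hba_num, device_name and subtype are required; path may be absent. */
    return n >= 3 ? 0 : -1;
}

/* Build the configfs directory for the device, assuming the usual mount point. */
static int demo_configfs_path(const struct demo_tcmu_name *n,
                              char *buf, size_t len)
{
    int w = snprintf(buf, len, "/sys/kernel/config/target/core/user_%u/%s",
                     n->hba_num, n->device_name);

    return (w > 0 && (size_t)w < len) ? 0 : -1;
}
```

A handler would read each /sys/class/uio/uio*/name file, strip the
trailing newline, pass the string to the parser, and skip any device
whose subtype is not its own.
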
For all devices so discovered, the user handler opens /dev/uioX and
calls mmap()::

    mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.


Device Events
-------------

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config". This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.


Other contingencies
-------------------

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period
  (30 seconds).

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. The command ring is preserved. However, after the timeout
  period, the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory areas.


Writing a user pass-through handler (with example code)
=======================================================

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin
authors.

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.

a) Discovering and configuring TCMU UIO devices::

      /* error checking omitted for brevity */

      #include <fcntl.h>      /* open() */
      #include <stdlib.h>     /* exit(), strtoull() */
      #include <string.h>     /* strncmp() */
      #include <sys/mman.h>   /* mmap() */
      #include <unistd.h>     /* read(), close() */

      int fd, dev_fd;
      ssize_t ret;
      char buf[256];
      unsigned long long map_len;
      void *map;

      fd = open("/sys/class/uio/uio0/name", O_RDONLY);
      ret = read(fd, buf, sizeof(buf));
      close(fd);
      buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

      /* we only want uio devices whose name is a format we expect */
      if (strncmp(buf, "tcm-user", 8))
        exit(-1);

      /* Further checking for subtype also needed here */

      /* uio0 is hard-coded here; a real handler uses the device it found */
      fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
      ret = read(fd, buf, sizeof(buf));
      close(fd);
      buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

      map_len = strtoull(buf, NULL, 0);

      dev_fd = open("/dev/uio0", O_RDWR);
      map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);


b) Waiting for events on the device(s)::

      while (1) {
        char buf[4];

        int ret = read(dev_fd, buf, 4); /* will block */

        if (ret != 4)
          break; /* read error or short read: stop waiting */

        handle_device_events(dev_fd, map);
      }


c) Managing the command ring::

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <linux/target_core_user.h>

      /*
       * Not provided by target_core_user.h. These are the status byte
       * values from the SCSI specifications (see "A final note").
       */
      #define SCSI_NO_SENSE 0
      #define SCSI_CHECK_CONDITION 2

      int handle_device_events(int fd, void *map)
      {
        struct tcmu_mailbox *mb = map;
        struct tcmu_cmd_entry *ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
        int did_some_work = 0;

        /* Process events from cmd ring until we catch up with cmd_head */
        while (ent != (void *)mb + mb->cmdr_off + mb->cmd_head) {

          if (tcmu_hdr_get_op(ent->hdr.len_op) == TCMU_OP_CMD) {
            uint8_t *cdb = (void *)mb + ent->req.cdb_off;
            bool success = true;

            /* Handle command here. */
            printf("SCSI opcode: 0x%x\n", cdb[0]);

            /* Set response fields */
            if (success)
              ent->rsp.scsi_status = SCSI_NO_SENSE;
            else {
              /* Also fill in rsp->sense_buffer here */
              ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
            }
          }
          else if (tcmu_hdr_get_op(ent->hdr.len_op) != TCMU_OP_PAD) {
            /* Tell the kernel we didn't handle unknown opcodes */
            ent->hdr.uflags |= TCMU_UFLAG_UNKNOWN_OP;
          }
          else {
            /* Do nothing for PAD entries except update cmd_tail */
          }

          /* update cmd_tail; the entry length is stored in hdr.len_op */
          mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(ent->hdr.len_op)) % mb->cmdr_size;
          ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
          did_some_work = 1;
        }

        /* Notify the kernel that work has been finished */
        if (did_some_work) {
          uint32_t buf = 0;

          write(fd, &buf, 4);
        }

        return 0;
      }


A final note
============

Please be careful to return codes as defined by the SCSI
specifications. These are different from some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code
is 2, not 1.
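
To make the point concrete, here is a minimal sketch of reporting a
failed command with the status value from the SCSI specifications.
The DEMO_* names and the helper are hypothetical. The use of
fixed-format sense data (response code 0x70) is an assumption about
what the initiator expects, not something this document specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Status byte values as defined by the SCSI Architecture Model. */
#define DEMO_SAM_STAT_GOOD            0x00
#define DEMO_SAM_STAT_CHECK_CONDITION 0x02

/*
 * Fill sense[] with fixed-format sense data and return the status byte
 * that a handler would store in rsp.scsi_status.
 */
static uint8_t demo_set_check_condition(uint8_t *sense, size_t sense_len,
                                        uint8_t key, uint8_t asc, uint8_t ascq)
{
    assert(sense != NULL && sense_len >= 18);
    memset(sense, 0, sense_len);
    sense[0] = 0x70;        /* current error, fixed format */
    sense[2] = key & 0x0f;  /* sense key */
    sense[7] = 0x0a;        /* additional sense length: bytes 8..17 */
    sense[12] = asc;        /* additional sense code */
    sense[13] = ascq;       /* additional sense code qualifier */
    return DEMO_SAM_STAT_CHECK_CONDITION;
}
```

For an unsupported CDB, a handler might call this with sense key 0x05
(ILLEGAL REQUEST) and ASC/ASCQ 0x20/0x00 (INVALID COMMAND OPERATION
CODE), and store the returned value 2 as the status.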