Contents:

  1) TCM Userspace Design
    a) Background
    b) Benefits
    c) Design constraints
    d) Implementation overview
      i. Mailbox
      ii. Command ring
      iii. Data Area
    e) Device discovery
    f) Device events
    g) Other contingencies
  2) Writing a user pass-through handler
    a) Discovering and configuring TCMU uio devices
    b) Waiting for events on the device(s)
    c) Managing the command ring
  3) A final note


TCM Userspace Design
--------------------

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel. TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols. TCM also modularizes the data storage. There are existing
modules for file, block device, RAM or using another SCSI device as
storage. These are called "backstores" or "storage engines". These
built-in modules are implemented entirely as kernel code.

Background:

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also
modularizes the actual data storage. These are referred to as
"backstores" or "storage engines". The target comes with backstores
that allow a file, a block device, RAM, or another SCSI device to be
used for the local storage needed for the exported SCSI LUN. Like the
rest of LIO, these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".


Benefits:

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is there are more distinct components to configure, and
potentially to malfunction. This is unavoidable, but hopefully not
fatal if we're careful to keep things as simple as possible.

Design constraints:

- Good performance: high throughput, low latency
- Cleanly handle the cases where userspace:
   1) never attaches
   2) hangs
   3) dies
   4) misbehaves
- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend


Implementation overview:

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region are: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands. Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.

See target_core_user.h for the struct definitions.

The Mailbox:

The mailbox is always at the start of the shared memory region, and
contains a version, details about the starting offset and size of the
command ring, and head and tail pointers to be used by the kernel and
userspace (respectively) to put commands on the ring, and indicate
when the commands are completed.

version - 1 (userspace should abort if otherwise)
flags - none yet defined.
cmdr_off - The offset of the start of the command ring from the start
of the memory region, to account for the mailbox size.
cmdr_size - The size of the command ring. This does *not* need to be a
power of two.
cmd_head - Modified by the kernel to indicate when a command has been
placed on the ring.
cmd_tail - Modified by userspace to indicate when it has completed
processing of a command.

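As a minimal sketch of how a handler might sanity-check the mailbox
right after mapping the region, the function below tests the fields
listed above. struct mailbox_view and mailbox_is_usable() are
hypothetical names used only for illustration; the struct is not the
real layout, and real code should use struct tcmu_mailbox from
target_core_user.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative stand-in holding only the fields named in the text.
 * This is an assumption for the sketch, NOT the real mailbox layout.
 */
struct mailbox_view {
  uint16_t version;
  uint16_t flags;
  uint32_t cmdr_off;
  uint32_t cmdr_size;
  uint32_t cmd_head;
  uint32_t cmd_tail;
};

/* Return true if the mailbox describes a ring that fits in the mapping. */
static bool mailbox_is_usable(const struct mailbox_view *mb, size_t map_len)
{
  if (mb->version != 1)
    return false;  /* userspace should abort on any other version */
  if (mb->cmdr_size == 0)
    return false;
  if (mb->cmdr_off > map_len || mb->cmdr_size > map_len - mb->cmdr_off)
    return false;  /* the ring must lie inside the shared region */
  /* head and tail are offsets into the ring, so both must be in range */
  return mb->cmd_head < mb->cmdr_size && mb->cmd_tail < mb->cmdr_size;
}
```

A handler could call this once with the size read from sysfs and refuse
to service the device if it returns false.
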
The Command Ring:

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.

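The head/tail arithmetic described above can be sketched with three
small helpers. The function names are hypothetical and the code only
models the offsets; it does not touch a real mailbox. Because
cmdr_size need not be a power of two, the sketch uses the modulo
operator rather than a bit mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Advance a ring offset by len bytes, wrapping at ring_size. */
static uint32_t ring_advance(uint32_t off, uint32_t len, uint32_t ring_size)
{
  return (off + len) % ring_size;
}

/* The ring is empty when the producer and consumer offsets match. */
static bool ring_is_empty(uint32_t head, uint32_t tail)
{
  return head == tail;
}

/* Bytes the kernel has queued that userspace has not yet consumed. */
static uint32_t ring_pending_bytes(uint32_t head, uint32_t tail,
                                   uint32_t ring_size)
{
  /* head and tail are both < ring_size, so this cannot underflow */
  return (head + ring_size - tail) % ring_size;
}
```

For example, with a 1024-byte ring, advancing offset 1000 by a 48-byte
entry wraps around to offset 24.
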
TCMU commands are 8-byte aligned. They start with a common header
containing "len_op", a 32-bit value that stores the length, as well as
the opcode in the lowest unused bits. It also contains cmd_id and
flags fields for setting by the kernel (kflags) and userspace
(uflags).

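To illustrate how a length and an opcode can share one 32-bit value,
here is a minimal sketch. It assumes the opcode occupies the low three
bits, which is what 8-byte alignment of the length leaves free; the
mask value and the sketch_* names are illustrative assumptions. Real
handlers should use the accessors provided by target_core_user.h
rather than these helpers.

```c
#include <stdint.h>

/* Assumption: lengths are multiples of 8, so the low 3 bits are free. */
#define SKETCH_OP_MASK 0x7u

/* Combine an 8-byte-aligned length with a small opcode. */
static uint32_t sketch_pack_len_op(uint32_t len, uint32_t op)
{
  return (len & ~SKETCH_OP_MASK) | (op & SKETCH_OP_MASK);
}

/* Extract the opcode from the low bits. */
static uint32_t sketch_get_op(uint32_t len_op)
{
  return len_op & SKETCH_OP_MASK;
}

/* Extract the entry length by clearing the opcode bits. */
static uint32_t sketch_get_len(uint32_t len_op)
{
  return len_op & ~SKETCH_OP_MASK;
}
```

Packing a length of 80 with opcode 1 yields the value 81, from which
both parts can be recovered independently.
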
Currently only two opcodes are defined, TCMU_OP_CMD and TCMU_OP_PAD.

When the opcode is CMD, the entry in the command ring is a struct
tcmu_cmd_entry. Userspace finds the SCSI CDB (Command Descriptor
Block) via tcmu_cmd_entry.req.cdb_off. This is an offset from the
start of the overall shared memory region, not the entry. The data
in/out buffers are accessible via the req.iov[] array. iov_cnt
contains the number of entries in iov[] needed to describe either the
Data-In or Data-Out buffers. For bidirectional commands, iov_cnt
specifies how many iovec entries cover the Data-Out area, and
iov_bidi_cnt specifies how many iovec entries immediately after that
in iov[] cover the Data-In area. Just like other fields, iov.iov_base
is an offset from the start of the region.

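Since iov_base holds an offset rather than a pointer, a handler has to
add the mapping's base address before touching the data. The sketch
below shows one way to do that and to gather a Data-Out payload into a
flat buffer. It assumes the entries can be treated as ordinary struct
iovec values; iov_to_ptr() and gather_iovs() are hypothetical helper
names, not part of the TCMU interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Convert an offset-style iov_base into a pointer inside the mapping. */
static void *iov_to_ptr(void *map, const struct iovec *iov)
{
  return (char *)map + (uintptr_t)iov->iov_base;
}

/* Copy the buffers described by iov[0..cnt) into dst; returns bytes copied. */
static size_t gather_iovs(void *map, const struct iovec *iov, size_t cnt,
                          void *dst, size_t dst_len)
{
  size_t done = 0;

  for (size_t i = 0; i < cnt && done < dst_len; i++) {
    size_t n = iov[i].iov_len;

    if (n > dst_len - done)
      n = dst_len - done;  /* never overrun the destination */
    memcpy((char *)dst + done, iov_to_ptr(map, &iov[i]), n);
    done += n;
  }
  return done;
}
```

For a bidirectional command, the same helper could be applied to the
entries starting at iov[iov_cnt] to reach the Data-In area.
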
When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by the entry's length, as stored in hdr.len_op (mod
cmdr_size), and signals the kernel via the UIO method, a 4-byte write
to the file descriptor.

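When a command fails, the sense buffer needs valid sense data to go
with the CHECK CONDITION status. The following is a minimal sketch
that builds fixed-format sense data as laid out in the SCSI Primary
Commands specification (response code 0x70, sense key in byte 2,
ASC/ASCQ in bytes 12 and 13). The SKETCH_* constants and
fill_fixed_sense() are hypothetical names, and the required buffer
size is an assumption of the sketch rather than something the TCMU
interface dictates.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_STATUS_GOOD            0x00
#define SKETCH_STATUS_CHECK_CONDITION 0x02
#define SKETCH_FIXED_SENSE_LEN        18

/*
 * Fill sense[] with fixed-format sense data. Returns false, leaving the
 * buffer untouched, if it is too small to hold the 18-byte structure.
 */
static bool fill_fixed_sense(uint8_t *sense, size_t sense_len,
                             uint8_t key, uint8_t asc, uint8_t ascq)
{
  if (sense_len < SKETCH_FIXED_SENSE_LEN)
    return false;

  memset(sense, 0, sense_len);
  sense[0] = 0x70;        /* response code: current error, fixed format */
  sense[2] = key & 0x0f;  /* sense key */
  sense[7] = 0x0a;        /* additional sense length: bytes 8..17 */
  sense[12] = asc;        /* additional sense code */
  sense[13] = ascq;       /* additional sense code qualifier */
  return true;
}
```

A handler rejecting an unsupported CDB might call this with sense key
0x05 (ILLEGAL REQUEST) and ASC 0x20 (INVALID COMMAND OPERATION CODE),
then store SKETCH_STATUS_CHECK_CONDITION in rsp.scsi_status.
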
When the opcode is PAD, userspace only updates cmd_tail as above --
it's a no-op. (The kernel inserts PAD entries to ensure each CMD entry
is contiguous within the command ring.)

More opcodes may be added in the future. If userspace encounters an
opcode it does not handle, it must set the UNKNOWN_OP bit (bit 0) in
hdr.uflags, update cmd_tail, and proceed with processing additional
commands, if any.

The Data Area:

This is shared-memory space after the command ring. The organization
of this area is not defined in the TCMU interface, and userspace
should access only the parts referenced by pending iovs.


Device Discovery:

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning
/sys/class/uio/uio*/name in sysfs. For TCMU devices, these names will
be of the format:

tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at:

/sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size", that
userspace needs to know for correct operation.

<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to
LIO limitations.

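The name format above can be split into its parts with sscanf(). The
following is a minimal sketch under a few assumptions: <hba_num> is
taken to be a decimal number, the field size limits are arbitrary, and
struct tcmu_name_parts, parse_tcmu_name() and tcmu_configfs_path() are
hypothetical names introduced only for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of a TCMU UIO name. */
struct tcmu_name_parts {
  unsigned int hba_num;
  char device_name[64];
  char subtype[64];
  char path[128];
};

/* Parse "tcm-user/<hba_num>/<device_name>/<subtype>/<path>"; 0 on success. */
static int parse_tcmu_name(const char *name, struct tcmu_name_parts *out)
{
  int n;

  out->path[0] = '\0';  /* <path> may be empty */
  n = sscanf(name, "tcm-user/%u/%63[^/]/%63[^/]/%127[^\n]",
             &out->hba_num, out->device_name, out->subtype, out->path);
  return n >= 3 ? 0 : -1;
}

/* Build the configfs directory for the device; 0 on success. */
static int tcmu_configfs_path(const struct tcmu_name_parts *p,
                              char *buf, size_t len)
{
  int n = snprintf(buf, len, "/sys/kernel/config/target/core/user_%u/%s",
                   p->hba_num, p->device_name);

  return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A handler would typically compare the parsed subtype against its own
known string and skip any device that does not match.
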
For all devices so discovered, the user handler opens /dev/uioX and
calls mmap():

mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.


Device Events:

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config". This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.


Other contingencies:

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period
  (30 seconds.)

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. Command ring is preserved. However, after the timeout period,
  the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory areas.


Writing a user pass-through handler (with example code)
-------------------------------------------------------

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin
authors.

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.

a) Discovering and configuring TCMU UIO devices:

(error checking omitted for brevity)

/* needs <fcntl.h>, <stdlib.h>, <string.h>, <sys/mman.h>, <unistd.h> */

int fd, dev_fd, ret;
char buf[256];
unsigned long long map_len;
void *map;

fd = open("/sys/class/uio/uio0/name", O_RDONLY);
ret = read(fd, buf, sizeof(buf));
close(fd);
buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

/* we only want uio devices whose name is a format we expect */
if (strncmp(buf, "tcm-user", 8))
  exit(-1);

/* Further checking for subtype also needed here */

fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
ret = read(fd, buf, sizeof(buf));
close(fd);
buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

map_len = strtoull(buf, NULL, 0);

dev_fd = open("/dev/uio0", O_RDWR);
map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);


b) Waiting for events on the device(s)

while (1) {
  char buf[4];

  int ret = read(dev_fd, buf, 4); /* will block */

  handle_device_events(dev_fd, map);
}


c) Managing the command ring

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/target_core_user.h>

int handle_device_events(int fd, void *map)
{
  struct tcmu_mailbox *mb = map;
  struct tcmu_cmd_entry *ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
  int did_some_work = 0;

  /* Process events from cmd ring until we catch up with cmd_head */
  while (ent != (void *)mb + mb->cmdr_off + mb->cmd_head) {

    if (tcmu_hdr_get_op(ent->hdr.len_op) == TCMU_OP_CMD) {
      uint8_t *cdb = (void *)mb + ent->req.cdb_off;
      bool success = true;

      /* Handle command here. */
      printf("SCSI opcode: 0x%x\n", cdb[0]);

      /* Set response fields (status constants are supplied by the handler) */
      if (success)
        ent->rsp.scsi_status = SCSI_NO_SENSE;
      else {
        /* Also fill in rsp.sense_buffer here */
        ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
      }
    }
    else if (tcmu_hdr_get_op(ent->hdr.len_op) != TCMU_OP_PAD) {
      /* Tell the kernel we didn't handle unknown opcodes */
      ent->hdr.uflags |= TCMU_UFLAG_UNKNOWN_OP;
    }
    else {
      /* Do nothing for PAD entries except update cmd_tail */
    }

    /* update cmd_tail by the entry length encoded in len_op */
    mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(ent->hdr.len_op)) % mb->cmdr_size;
    ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
    did_some_work = 1;
  }

  /* Notify the kernel that work has been finished */
  if (did_some_work) {
    uint32_t buf = 0;

    write(fd, &buf, 4);
  }

  return 0;
}


A final note
------------

Please be careful to return codes as defined by the SCSI
specifications. These are different from some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code
is 2, not 1.