Ublk is a new framework available since Linux v6.0 for creating virtual block devices using io_uring in user space. It was written and is maintained by long-time Linux kernel contributor Ming Lei and improved by others. As of right now it's marked experimental and its APIs might change in the future.
Ublk applications are not part of the Linux kernel and run entirely in user space like any other application. The main advantage is that the applications can be written in any programming language with any third-party libraries and tools. The only requirement is that the language provides access to the low-level primitives required to work with io_uring and memory buffers.
By being in user space, the applications should also not destabilize the system in case of crashes or other bugs. Last but not least, by being outside of the kernel, the applications are not tied to its release process and schedule.
There are two parts to ublk. The in-kernel driver ublk_drv and ublk servers,
user space applications which communicate with the driver. A ublk server is
exposed to the rest of the system as a virtual block device /dev/ublkbN (and
its control part /dev/ublkcN). Any I/O request directed at this device is
redirected by the driver to the corresponding ublk server. The server handles
the request by its own logic and communicates the result to the driver, which in
turn completes the I/O request.
The end user of such a block device can be a file system, a framework such as device mapper, or any other system that can work with block devices. Similarly, a server can handle the request in any way it wants - it can fetch data from the local system, compute it on the fly, or fetch it over the network.
The idea of moving certain parts or responsibilities of the kernel to user space is not new. However all attempts have mostly fallen flat due to the performance cost of system calls and associated context switching. Switching back and forth between kernel space and user space has always been costly and has become even more so due to various security mitigations for CPU bugs.
What makes ublk's approach feasible is its usage of io_uring for coordination and data sharing. A server starts by asking for work and waiting (~ syscall and context switch). When work arrives, the server handles it, asks for more, and waits again. The key part is that when the server gets notified about pending work, there can be multiple work requests pending in the queue, not just one. When all requests are finished, the server asks for more. But by the time it wants to wait, there might be already new pending work requests and so it just keeps on working. A busy server will process work in large batches, amortizing the cost of context switching.
There is nothing that forces a specific architecture. However, a typical server uses the main thread to establish communication with the driver and then creates a thread for each I/O request queue and other resources. The number of these queues depends on the particular server. For example, a server backed by a local file system might use the number of hardware queues of the storage device. It can also just use the number of available CPU threads on the machine or anything in between.
Since each queue is associated with buffers for data transfer, care must be taken to properly configure the depth and the number of queues to prevent excessive memory usage. More on this later.
Each thread is responsible for processing I/O requests coming from the driver. To do that, each thread maintains a queue of so-called tags. A queue here is not a queue in the data structure sense; rather, each thread knows its queue ID (q_id) and the queue's depth (the number of elements, or tags).
When a thread is ready to work, it sends a work request for each tag via io_uring to the driver along with its queue ID. A pair of (q_id, tag) uniquely identifies a single work request. The thread now waits for at least one completion to arrive on its ring.
When a successful completion arrives, the server knows that there's work to be done associated with this (q_id, tag) pair. The completion doesn't contain details about the specifics of the work however. To get those details, the server looks up a descriptor based on the tag. The descriptor map is a per-thread memory-mapped region shared with the driver where each element of the map corresponds to a tag.
Most of the time the result value will map directly to the result of the call
which was required to fulfill the operation. For example, if the server is asked
to read a certain number of bytes at a given offset, the server might use the
pread(2) syscall via io_uring and the result will contain the number of
actually read bytes.
Waiting for work, fetching descriptors, doing the work, and finally committing
the result is the basic workflow of each worker thread. Notice that as an
optimization, committing a result also asks for more work, all in one request.
For that reason, the command UBLK_U_IO_FETCH_REQ is used only for the very
first request and all following requests use UBLK_U_IO_COMMIT_AND_FETCH_REQ.
Keep in mind that this is done for every (q_id, tag) pair at the same time.
With 2 threads, each with a queue 128 tags deep, there will be 256 pending work
requests. Having 128 work requests per thread might seem like a lot but the
actual work is always delegated to io_uring; the worker thread is just a
mediator. For example, when a write request comes in, the worker thread just
submits an io_uring_prep_write submission (or similar) based on the work
description and moves on to other work requests. When the write operation
completes, it commits the result and makes the specific pair ready for new work.
What this means in practice is that it's necessary to distinguish between work
requests coming from the driver and completions of the actual work submitted by
the worker thread itself. A common solution to this is to utilize the user_data
field on each io_uring submission to mark the operation as either driver or
server specific. If you're reading the official examples, you will often see
this referred to as submissions / completions for "target IO" or "tgt io"
(target being the thing that's underlying the server).
Another solution might be to just use a different ring for the target IO
altogether, purely for management reasons. This should not be expensive in terms
of resources because rings can be configured to share their kernel thread pools
with IORING_SETUP_ATTACH_WQ.
When a server receives a read data request, where does it put the actual data to deliver it to the driver? Ublk supports a few different modes for that. The mode is configured when a server starts and is fixed for its lifetime.
The most straightforward mode is the default mode, the one used when you don't explicitly ask for another. You start by allocating a buffer for each (q_id, tag) pair. The length of each buffer is the same and can be chosen by the server at startup. You can of course allocate a single buffer and slice it up. The only requirement is that each buffer is aligned to the machine's page size.
To make the driver aware of these buffers, you specify their addresses when
requesting work. When you later fetch the corresponding descriptor, the addr
field will contain the address you are supposed to write to or read from. You
can of course index into the buffers the same way you index into the descriptor
map. Under that scheme, you can ignore the addr field and just fetch the buffer
based on the tag.
Specifying the buffers as part of the work requests might not be ideal in
memory-constrained environments however. By default Linux will not actually
reserve the underlying memory until it's used, but after a while all of the
buffers will nevertheless have been touched. A common size for a buffer is 1MiB
combined with a queue depth of 128. This means 128MiB of memory per thread. You
can tackle this in multiple ways. For example, you can simply lower both of
these numbers at the possible expense of performance. You can also mark the
buffers (pages) as not needed (MADV_DONTNEED) after being idle for a while, as
sketched below. This will not lower the peak memory usage but at least the
server will not hog memory while not being used.
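A minimal sketch of the idle-buffer trick, assuming buffer_ptr and buffer_len describe one of the per-tag buffers:

// Let the kernel reclaim the pages; the allocation stays valid and the
// pages come back zero-filled the next time they are touched.
unsafe {
    libc::madvise(
        buffer_ptr as *mut libc::c_void,
        buffer_len,
        libc::MADV_DONTNEED,
    );
}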
Yet another alternative is to switch to a different mode called user copy
(UBLK_F_USER_COPY). The user copy mode doesn't use preallocated buffers in
work requests and gives the server complete freedom to choose when, how, and if
to allocate. For example, a server can choose to allocate the memory on request
arrival and release it right after, or maintain a small memory pool.
When a request arrives, the server fetches the corresponding descriptor. The
descriptor contains the name of the operation (let's assume it's a read) and the
length of the required buffer. It is now up to the server to obtain a buffer of
the given length and fill it by doing the actual work (e.g. reading from a file
using io_uring). To deliver the data to the driver, it writes the buffer into
the control device /dev/ublkcN and commits the result. If the operation was a
write, the server would instead read from the device to get the data.
The control device /dev/ublkcN is a counterpart to the block device (notice
the letter c in the name) and internally reserves a logical buffer of the same
length. The offset into the reserved area identifies the request and is
computed as the base offset constant plus per-queue (q_id) and per-tag
offsets.
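As a sketch, here is a helper mirroring the ublk_pos() logic from ublk_cmd.h (UBLKSRV_IO_BUF_OFFSET, UBLK_QID_OFF, and UBLK_TAG_OFF come from the bindgen-generated bindings used later in this post), together with the read delivery via nix's pwrite:

// Compute the control device offset identifying a (q_id, tag) request.
fn user_copy_offset(q_id: u16, tag: u16) -> i64 {
    (UBLKSRV_IO_BUF_OFFSET as u64
        + (((q_id as u64) << UBLK_QID_OFF) | ((tag as u64) << UBLK_TAG_OFF)))
        as i64
}

// For a read request: fill buf by doing the actual work, then deliver it.
let n = nix::sys::uio::pwrite(&ublkc, &buf, user_copy_offset(q_id, tag))?;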
The user copy mode gives you complete control over the allocation. For example, if you're delegating the requests to an external library (which maintains its own memory buffers), you might not need to allocate anything yourself at all. The same goes if you can work with the control device directly without any intermediate buffers.
Finally, the user copy mode replaces an older mode called get data
(UBLK_F_NEED_GET_DATA). The mode is still supported but there's no point in
using it.
The last supported mode is called zero copy (UBLK_F_SUPPORT_ZERO_COPY).
As the name suggests, its main goal is to eliminate a copy of request data
between the kernel space and user space. To use this mode, the server registers
a sparse buffer table on its instance of io_uring, which it plans to use for
requesting work. The length of the table should match the depth of its queue
such that there's a slot for each tag.
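With the io-uring crate this is presumably a single call per worker ring, sized so that every tag gets its own slot:

// One sparse slot per tag; the slots are tied to buffers later.
ring.submitter().register_buffers_sparse(queue_depth as u32)?;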
When a request comes in, the server needs to tie a buffer to the request. It does so by sending a register command to the driver with the request's queue_id and tag pair along with an index into the buffer table. When the driver receives the command, it maps kernel IO buffers related to the request to the specified io_uring buffer.
At this point the server can continue handling the request and when it needs to read from or write to the buffer, it uses the io_uring's fixed buffer family of functions by setting "buf_index" to the previously registered buffer index. This means the server never gains direct access to the request data, it can only refer to it by the index, something which can be a limitation in certain scenarios.
In order for the server to refer to the buffer by the index, the register command needs to finish first. It's up to the server to handle this, either by linking submissions together (register -> work -> un-register, as sketched below) or by other logic.
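A hedged sketch of the linked variant, borrowing the bindgen types and helpers introduced later in this post (the buffer index travels in the io command's addr field; register_data is a hypothetical user_data marker):

let mut cmd = ublksrv_io_cmd { tag, q_id, ..Default::default() };
cmd.__bindgen_anon_1.addr = index as u64; // slot in the sparse buffer table
let register = UringCmd16::new(Fixed(UBLKC_FD_IDX), UBLK_U_IO_REGISTER_IO_BUF)
    .cmd(serialize(cmd))
    .build()
    .flags(squeue::Flags::IO_LINK)
    .user_data(register_data);
// ...the target IO goes here (e.g. ReadFixed / WriteFixed with
// buf_index = index), also linked, followed by UBLK_U_IO_UNREGISTER_IO_BUF.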
Registering and un-registering buffers is not only inconvenient for the
programmer but also inefficient because of the additional communication with the
driver. Fortunately, the driver can do this automatically. Enter zero copy
with auto buf reg mode (UBLK_F_AUTO_BUF_REG). In reality it's not a
separate mode but rather an option you enable in addition to the zero copy mode.
When the automatic buffer registration option is active, the driver will
automatically register and prepare a buffer before notifying the server and will
also clean it up when the request is committed. Similarly to the default mode,
the request for work is used to communicate the specific buffer which should be
automatically registered for the request via the addr field.
Just a heads up here. While I think it's obvious that, for simplicity, these
diagrams don't match the underlying data structures exactly as defined, in this
case it's worth pointing out that there are two addr fields and it's very easy
to confuse the two and spend a lovely time debugging it. To set the buffer
index, you need to set addr on the final submission, not on the io command.
Alright, back to the program.
The auto buffer registration can fail for a number of reasons ranging from user error (specifying an out-of-bounds index into the sparse table, using an index that's currently in use) to io_uring failing to allocate memory for the table slot. If that happens, the I/O request is aborted and the server will not be notified.
To handle this situation gracefully, it's possible to specify
UBLK_AUTO_BUF_REG_FALLBACK in the ublk_auto_buf_reg object. If the auto
registration fails with this option enabled, the driver will set
UBLK_IO_F_NEED_REG_BUF in the request descriptor's flags and it's up to the
server to always check for this flag and manually register a buffer if set. You
don't need to un-register it however.
Now that we have warmed up a little, it's time to take ublk for a spin. The project
itself ships with libraries written in C
and Rust but we will not use those.
Since this is all just to learn about the internals, I think it makes more sense
to reimplement the logic ourselves. I will however often use the official
command line utility ublk which provides a nice way to list all of the current
devices and their state.
All of the code snippets will be written in Rust with the io-uring library but again one of the benefits of ublk is that you can use your own favorite language instead. Just make sure it has sufficient support for working with io_uring either in the standard library or through bindings to the official library and that you can allocate and map raw memory.
For io_uring, ublk also requires support for 128 byte sized submissions
(IORING_SETUP_SQE128) and 32 byte sized completions (IORING_SETUP_CQE32).
When these flags are specified, the kernel will advance through the respective
queues in increments of two entries, using the second entry as free real estate
for data.
Let this be a warning to the Zig enjoyers out there. Zig's io_uring support currently lacks these flags and the silent mismatch between the kernel's and user space's entry sizes leads to hilarious results. Funnily enough, the person trying to use these flags is doing so because of ublk.
Finally, make sure your kernel actually supports ublk. If you're running on
v6.0+ (except for 6.18.4 or .5), you
should be fine but keep in mind that ublk is a moving target and you should use
the latest available version if possible to get all of the features and fixes.
The kernel driver itself is called ublk_drv and is compiled as a module by
default.
# uname -r
6.18.3-arch1-1
# modprobe ublk_drv
# file /dev/ublk-control
/dev/ublk-control: character special (10/262)
The device at /dev/ublk-control is the entry point for all ublk servers and
the first thing to do is to open the device for read and write.
let control = OpenOptions::new()
.read(true)
.write(true)
.open("/dev/ublk-control")?;
To make the code samples more manageable I will omit most of the error handling but you should always assume that these calls might fail. I will also sometimes omit boilerplate that keeps Rust's type checker happy in exchange for making the code simpler and focused on the main point.
In this case, the effective user might not have enough permissions to open the device. There will be a separate section talking about ublk's permission model.
The file needs to stay open until the server is shut down. The easiest way to do that is to convert the file into an OwnedFd and keep the reference alive.
type Ring = IoUring<squeue::Entry128, cqueue::Entry32>;
let main_ring: Ring = Ring::builder()
.setup_coop_taskrun()
.setup_defer_taskrun()
.setup_single_issuer()
.build(8)?;
// UBLK_CONTROL_FD_IDX = 0
main_ring.submitter().register_files(&[control.as_raw_fd()])?;
The main_ring ("main" as in main thread) will be used exclusively to
communicate with the control using the control fd. As such, it makes sense to
register the file descriptor and refer to it using an index in the future. This
allows the kernel to optimize access to the fd and avoid costly reference
increments and decrements for every operation.
In the case of the main ring, it doesn't matter much because there will be only a few messages exchanged on it, but it should make a measurable difference for rings used in worker threads so it's good to get used to this practice. Similarly with the options. We don't need to be interrupted or wait between transitions (in worker threads we want to do work in large batches we control) and we will submit operations from a single thread (each worker thread will have its own ring; the kernel side doesn't yet take advantage of this but the option will allow skipping a lock on the internal ring's state).
Finally, we will need access to ublk's struct and constant definitions from
ublk_cmd.h.
Using bindgen we can generate just what's necessary.
bindgen --ignore-methods --with-derive-default \
--wrap-unsafe-ops ublk_cmd.h > src/bindings.rs
Unfortunately even with correct headers included, bindgen is unable to correctly
handle command definitions which use macros such as Linux's _IOWR to define
ioctls. There's a few workarounds but the simplest solution seems to be to just
use a library such as nix which can replicate these
macros in Rust.
pub const UBLK_U_CMD_ADD_DEV: u32 =
    request_code_readwrite!(b'u', 0x04, size_of::<ublksrv_ctrl_cmd>()) as u32;
Note that since these ioctls are just numbers it might be tempting to hard code them. That's not a good idea because these numbers can be different on other architectures.
Adding a new device to the system and making it usable requires three separate stages, each represented with a separate command coming from a server to the control device.
First, we need to add a device and pass along a few very basic parameters. This
will create a character device /dev/ublkcN, where N is replaced with the
device's ID. Second, we need to set parameters related to the upcoming block
device.
Before we send the final command to actually start and expose the new block
device to the system, we need to start the worker threads and they all need to
mmap the /dev/ublkcN device at their offsets into their address space. The
"start dev" command will not finish processing until all the threads are up.
let dev_info = ublksrv_ctrl_dev_info {
    dev_id: 42, // /dev/ublk[cb]42
    nr_hw_queues: 2,
    max_io_buf_bytes: 512 << 11, // 1MiB
    queue_depth: 128,
    ..Default::default()
};
The dev_id field specifies the desired ID of the device. If set to u32::MAX,
the value is interpreted as "allocate the first unused ID starting from zero".
Setting the value means the device will have a stable "address" which can be
then used in scripts or e.g. /etc/fstab.
The nr_hw_queues field specifies the number of hardware queues the virtual
device claims to support. In theory if the server is backed by a real device, it
should query that device and pass through the actual value. In practice it's
easier to think of this value as the number of worker threads.
The max_io_buf_bytes specifies the maximum size of the I/O request in bytes
the server might receive. Remember when we talked about buffers in the various
modes of operation? This is the size of one such buffer.
The queue_depth specifies the number of requests that can be in flight against
the server at any given moment. Again, remember the queue from the diagrams
above.
We have the data in dev_info ready but how do we send it? Ublk uses the
relatively new io_uring opcode IORING_OP_URING_CMD. You can think of it as
an async ioctl. Here the usage of the opcode is abstracted over by the
io-uring library and its UringCmd80.
let cmd = ublksrv_ctrl_cmd {
dev_id: dev_info.dev_id,
queue_id: u16::MAX,
len: size_of::<ublksrv_ctrl_dev_info>() as u16,
addr: &raw mut dev_info as u64,
..Default::default()
};
let fd = Fixed(UBLK_CONTROL_FD_IDX);
let sqe = UringCmd80::new(fd, UBLK_U_CMD_ADD_DEV)
.cmd(serialize(cmd))
.build();
First, we need to wrap the data into a ublksrv_ctrl_cmd command and fill in a
few fields. The dev_id field is just a copy of the inner dev_id. The
queue_id must be set to u16::MAX to indicate that the command is not for a
specific queue (the field is currently not used by the driver). The len field
is the size of the struct and finally addr is the address of dev_info.
The fd uses the Fixed variant because we registered the actual file
descriptor and now refer to it only by index. Internally the io-uring library
will set the FIXED_FILE flag for us.
The serialize function takes the cmd and turns it into a fixed size byte
array [u8; 80]. There are multiple ways how to achieve it and one of the
safest is to use Google's zerocopy.
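One possible implementation, assuming the bindgen structs derive zerocopy's IntoBytes and Immutable; the const generic lets the same helper also produce the 16 byte payloads used later with UringCmd16:

// Copy the command into a zero-padded fixed size array.
fn serialize<const N: usize, T: IntoBytes + Immutable>(cmd: T) -> [u8; N] {
    let mut buf = [0u8; N];
    buf[..size_of::<T>()].copy_from_slice(cmd.as_bytes());
    buf
}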
An important caveat of this interface is that similarly to a standard
ioctl, the kernel driver will not only read the data pointed at by addr, it
might also write to that data (and in this case it will). That's because the
driver performs validation of the dev_info values and might adjust them to its
liking. It's also the way the driver communicates back the allocated dev_id.
What this means is that addr must point to a valid memory location until we
get a response from the driver. In other words, the dev_info variable must not
go out of scope until we get a completion from the IORING_OP_URING_CMD
submission. The easiest way to achieve this is to send and wait for the
completion all within one function, making the whole process essentially
synchronous. In Rust you could also tie the u64 address back to dev_info by
creating a wrapper with a phantom that indicates that it mutably borrows it.
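A sketch of that wrapper idea (hypothetical, not part of any official library):

struct CmdAddr<'a> {
    addr: u64,
    // Holds the mutable borrow so dev_info can't move or be dropped
    // while the driver might still write to it.
    _borrow: PhantomData<&'a mut ublksrv_ctrl_dev_info>,
}

impl<'a> CmdAddr<'a> {
    fn new(info: &'a mut ublksrv_ctrl_dev_info) -> Self {
        Self { addr: info as *mut _ as u64, _borrow: PhantomData }
    }
}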
The operation completion's result should be zero if successful, negative errno
otherwise. If the result is -17 (EEXIST 17 File exists), it means there's
already a device with the given dev_id. We will talk about this more in the
recovery section.
If the command successfully finished, the driver will create a new character
device /dev/ublkc42. Note that the device node might not exist immediately
after the command finishes and you should retry with a back off of a few
hundred milliseconds.
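A simple (hypothetical) way to do that:

// Wait up to ~4 seconds for the device node to show up.
let path = format!("/dev/ublkc{}", dev_info.dev_id);
for _ in 0..20 {
    if std::path::Path::new(&path).exists() {
        break;
    }
    std::thread::sleep(std::time::Duration::from_millis(200));
}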
$ file /dev/ublkc42
/dev/ublkc42: character special (241/42)
The device has to be opened only once and the file descriptor shared with the worker threads. We will see how the device is used when starting the worker threads but for now all communication still happens with the control device.
let path = format!("/dev/ublkc{}", dev_id);
let ublkc = OpenOptions::new()
.read(true)
.write(true)
.open(&path)?;
Now it's time to configure the parameters of the actual block device that will be exposed to the system. These parameters are passed to the block infrastructure of the kernel and are not specific to ublk.
The ublk_params structure is used to carry over the values. Since there are
different "classes" of parameters, the types field is used as a bit field to
specify which are filled in. Besides the required "basic" parameter, you can set
parameters specific to zoned devices, discard support, DMA alignments, or
segment limits.
let size: u64 = 2 << 31; // 4 GiB
let params = ublk_params {
len: size_of::<ublk_params>() as u32,
types: UBLK_PARAM_TYPE_BASIC,
basic: ublk_param_basic {
attrs: UBLK_ATTR_VOLATILE_CACHE,
logical_bs_shift: 9, // 512
physical_bs_shift: 12, // 4096
io_opt_shift: 12,
io_min_shift: 9,
max_sectors: dev_info.max_io_buf_bytes >> 9,
dev_sectors: size >> 9,
..Default::default()
},
..Default::default()
};
The len field represents the size of the structure itself. The driver needs to
be informed about the size because the definition of the struct in the kernel
and user space applications might diverge in the future. If a new class of
attributes is added, the size of the struct in the kernel would increase but
remain the same in all user space applications without the updated definition.
This in turn would cause the kernel to read from user space beyond the actual end
of the struct.
In the basic class of parameters, the attrs field is used to configure
miscellaneous attributes about the device such as read only, rotational, or if
the device has a volatile cache with or without FUA support.
The first attributes are self-explanatory but the volatile cache is interesting.
If a physical device advertises having a volatile cache, it will acknowledge
writes as soon as they hit its internal "RAM" and before the data is necessarily
persisted. The kernel reacts to this by issuing flush requests when a file
system requests a sync and in the context of ublk it means a server gets a
UBLK_IO_OP_FLUSH request.
$ grep -r . /sys/class/block/sdb/queue | grep -E 'write_cache|fua'
/sys/class/block/sdb/queue/write_cache:write back
/sys/class/block/sdb/queue/fua:0
A device with a volatile cache might also support FUA (Force Unit Access). These
are write commands which bypass caches and a write is not acknowledged until the
data is persisted. In ublk this is communicated via work descriptor's flags by
checking for UBLK_IO_F_FUA.
It's up to you to decide if these attributes make sense for your server. If it's backed by a real device, you should ideally query it and pass along its attributes. Most drives these days do have a volatile cache at least though.
The next parameters, such as logical_bs_shift, are common block device
parameters and again depend on your specific server. The only difference is that
these values are expressed as shifts, not actual values. In this case, the
kernel queue_limits structure calls it logical_block_size.
The max_sectors field represents the maximum number of sectors in a single
read or write request. The value, once converted to bytes, must not exceed
max_io_buf_bytes because otherwise there wouldn't be enough buffer space to
fulfill the request. A sector in the Linux kernel is hardcoded to 512 bytes
and so it's common to divide max_io_buf_bytes by 512, usually by shifting by
9 to feel smart.
The dev_sectors field represents the total number of sectors of the device. In
other words, it's the size of the device in bytes divided by the sector size.
Sending the command is the same as in the "add dev" case. Of course the
operation needs to be set appropriately (UBLK_U_CMD_SET_PARAMS), as shown
below.
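The sketch below mirrors the earlier snippet; only the opcode, addr, and len change:

let cmd = ublksrv_ctrl_cmd {
    dev_id: dev_info.dev_id,
    queue_id: u16::MAX,
    len: size_of::<ublk_params>() as u16,
    addr: &raw const params as u64,
    ..Default::default()
};
let sqe = UringCmd80::new(Fixed(UBLK_CONTROL_FD_IDX), UBLK_U_CMD_SET_PARAMS)
    .cmd(serialize(cmd))
    .build();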
Now that the device is configured, it's time to start worker threads, one for
each queue (nr_hw_queues). Before each thread is ready for work, it has to do
a few things.
First, it has to create its own io_uring ring. The ring doesn't have to use 128
byte sized submissions (IORING_SETUP_SQE128) but has to have enough submission
entry slots to accept queue_depth entries. The number of completion entries
should match. Since Linux 5.5, completion events are never dropped even if
there's no room for them, but a full completion queue will cause new
submissions to fail until it is drained.
The ring will be used to communicate with the /dev/ublkcN device and so to
lower overhead it's a good idea to register its file descriptor with the ring.
// UBLKC_FD_IDX = 0 (/dev/ublkcN)
ring.submitter().register_files(&[ublkc_dev_fd])?;
Second, it has to memory map the /dev/ublkcN device into its address space to
access work descriptors. Since there are very likely multiple threads and only a
single device, an offset is used to give each thread its own "portion" of the
device.
The address space available to individual threads is first offset by the
UBLKSRV_CMD_BUF_OFFSET constant. Next, each thread is given
UBLK_MAX_QUEUE_DEPTH worth of space, one after another: the queue ID, which is
effectively the index of the thread starting at 0, is multiplied by the
maximum depth. This way the first thread's address space starts right after
the initial offset, the second's right after it, and so on. Within the maximum
depth lies the actual memory mapped region with the configured length
(queue_depth).
fn len(depth: u16) -> usize {
let io_size = depth as usize * size_of::<ublksrv_io_desc>();
let page_size = page_size();
io_size.next_multiple_of(page_size)
}
fn offset(queue_id: usize) -> usize {
let max_len = len(UBLK_MAX_QUEUE_DEPTH as u16);
UBLKSRV_CMD_BUF_OFFSET as usize + max_len * queue_id
}
let ptr = unsafe {
    libc::mmap(ptr::null_mut(), len(queue_depth),
        libc::PROT_READ, libc::MAP_SHARED | libc::MAP_POPULATE,
        ublkc_dev_fd, offset(queue_id) as i64)
};
Each work descriptor is now available at the base ptr address plus the
descriptor's index (tag) times the size of ublksrv_io_desc.
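In code, fetching a descriptor is plain pointer arithmetic over the mapped region (a sketch; the driver writes into this shared memory, hence the volatile read):

let descriptors = ptr as *const ublksrv_io_desc;
let desc = unsafe { descriptors.add(tag as usize).read_volatile() };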
Third, it needs to do set up specific to its mode of operation as explained earlier. To demonstrate I will use the default mode. To use it, each thread has to allocate a buffer for each tag. Instead of doing literally that we can allocate a single buffer and cut it into smaller pieces similarly to the way the descriptors work.
let elem_size = (max_io_buf_bytes as usize)
.checked_next_multiple_of(page_size)?;
let size = elem_size * queue_depth as usize;
let layout = Layout::from_size_align(size, page_size)?;
let ptr = unsafe { alloc::alloc(layout) };
Notice that the value of max_io_buf_bytes is at play here yet again as the
size of one such buffer before alignment. It should also be clear how this
value affects the memory usage of your server.
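The snippets below use a buffer_addr_for_tag helper. One possible definition, assuming ptr and elem_size from above live in the queue state (the later calls pass only the tag for brevity):

// Address of the tag's slice within the single large allocation.
fn buffer_addr_for_tag(ptr: *mut u8, elem_size: usize, tag: u16) -> u64 {
    unsafe { ptr.add(elem_size * tag as usize) as u64 }
}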
Finally, the last thing a thread has to do is to send work requests. This means sending an io_uring command for every tag.
for tag in 0..dev_info.queue_depth {
    let mut cmd = ublksrv_io_cmd {
        tag,
        q_id,
        ..Default::default()
    };
    cmd.__bindgen_anon_1.addr = buffer_addr_for_tag(tag);
    let sqe = UringCmd16::new(Fixed(UBLKC_FD_IDX), UBLK_U_IO_FETCH_REQ)
        .cmd(serialize(cmd))
        .build()
        .user_data(tag.into());
    // push sqe
}
ring.submit_and_wait(1)?;
A worker thread works in a loop. It sends the initial work requests and waits.
When a work request completes, it examines the work descriptor, does the work,
and commits it. The only difference is that initially it sends
UBLK_U_IO_FETCH_REQ to request work and later on uses
UBLK_U_IO_COMMIT_AND_FETCH_REQ to commit and request more work all in one
command.
To be able to match a completion to a tag, you use io_uring's support for
user_data. It's a u64 field on a submission which is passed through intact
to its completion.
impl ublksrv_io_desc {
    pub fn op(&self) -> u32 {
        self.op_flags & 0xff
    }
}
let tag = cqe.user_data();
let desc = descriptors[tag as usize];
match desc.op() {
    UBLK_IO_OP_READ => handle_read_request(desc),
    UBLK_IO_OP_WRITE => handle_write_request(desc),
    // ...
}
The ublksrv_io_desc struct describes the type of operation and also the actual
work description. Let's assume it's a UBLK_IO_OP_READ (your server is supposed
to read data and pass it back to the driver) and that the server forwards the
requests to a file. The start_sector field represents the starting offset from
where to start reading. It's in sectors, so the value must be multiplied by 512
to get the byte offset. The nr_sectors field represents the amount of data to
read. Again, in sectors.
The addr field represents the address of buffer where you are supposed to
place the result. In this mode, the address will be the same as
buffer_addr_for_tag(tag).
The way you handle these requests is up to your server's logic. One thing to keep in mind is that each thread is handling work for all of its tags. Unless there's a good reason, it should not block and rather just dispatch the work somewhere else, most likely back to io_uring.
let sqe = Read::new(
    Fd(backing_file_fd),
    desc.addr as *mut u8,
    desc.nr_sectors << 9,
)
.offset(desc.start_sector << 9)
.build()
.user_data(internal_tag);
If you're going to send these submissions into the same ring you're using to
communicate with the driver, you must be able to distinguish them from the
regular "work available" completions. A common way of doing that is to encode
the information along with the tag in user_data. You could also use an
entirely different ring and possibly share the underlying kernel thread pool
(IORING_SETUP_ATTACH_WQ).
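A minimal version of such an encoding (hypothetical; real servers often pack more state into the remaining bits):

const TARGET_IO: u64 = 1 << 63;

fn target_user_data(tag: u16) -> u64 {
    TARGET_IO | tag as u64
}

fn is_target_io(user_data: u64) -> bool {
    user_data & TARGET_IO != 0
}

fn tag_of(user_data: u64) -> u16 {
    (user_data & 0xffff) as u16
}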
When the work is finished and you have the result of the read, it's time to
commit it and ask for more work. The structure is the same except now
you fill in the result field.
let mut cmd = ublksrv_io_cmd {
tag,
q_id,
result: cqe.result(),
..Default::default()
};
cmd.__bindgen_anon_1.addr = buffer_addr_for_tag(tag);
In most cases, it's possible to just pass through the result from the work cqe directly. In this case, a read operation expects the number of bytes read as its result. If there's an error, return negative errno.
If you're still working on the basic architecture of the server and are not yet
ready to return the actual requested data, you can simply return -libc::EIO to
fail the request.
Now that all threads are up, it's time to start the device and expose it to the
system. All that's necessary is to send a command with op UBLK_U_CMD_START_DEV
to the control device and include the server's PID.
let mut cmd = ublksrv_ctrl_cmd {
dev_id: dev_info.dev_id,
queue_id: u16::MAX,
..Default::default()
};
cmd.data[0] = pid as u64;
The operation will not complete until all threads are up and have requested work. This means you don't need to explicitly make your worker threads signal that they are ready and you can proceed to start the device right away. The downside is that if one of your threads crashes during its initialization, the main thread will wait forever. You can handle it by waiting for the completion with a timeout, as sketched below.
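A sketch using the io-uring crate's submit_with_args; the ten second budget is an arbitrary choice:

let ts = types::Timespec::new().sec(10);
let args = types::SubmitArgs::new().timespec(&ts);
match main_ring.submitter().submit_with_args(1, &args) {
    Err(e) if e.raw_os_error() == Some(libc::ETIME) => {
        // The timeout fired: a worker likely died during initialization.
    }
    other => {
        other?;
    }
}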
$ file /dev/ublk?42
/dev/ublkb42: block special (259/1)
/dev/ublkc42: character special (241/42)
# ublk list -n 42
dev id 42: nr_hw_queues 2 queue_depth 128 block size 512 dev_capacity 4194304
max rq size 1048576 daemon pid 20093 state LIVE
flags 0x604a [ URING_CMD_COMP_IN_TASK CMD_IOCTL_ENCODE ]
ublkc: 241:1 ublkb: 259:1 owner: 0:0
The server will most likely receive its first IO requests immediately. That's
because the kernel will start probing the new device for partitions and other
information. It might also signal user space applications (such as udev) to
trigger their own logic for new devices. As already mentioned, if you're not yet
ready to accept IO requests, you can fail these requests immediately with
-libc::EIO. In the future there will be
a flag to turn the automatic scan off (useful both for development and in case
probing is not desired at all).
We have spent so much time bringing the server up but let's talk about how to
bring it down. The call to shutdown will usually come in the form of a signal
delivered to your server, be it a SIGINT or SIGTERM. Handling signals is
tricky
but fortunately we can leverage io_uring yet again and tell it to wait for
signals too (using the nix crate).
let mut set = SigSet::empty();
set.add(Signal::SIGINT);
set.add(Signal::SIGTERM);
signal::sigprocmask(SigmaskHow::SIG_BLOCK, Some(&set), None)?;
let fd = SignalFd::with_flags(&set, SfdFlags::SFD_NONBLOCK)?;
let sqe = PollAdd::new(Fd(fd.as_raw_fd()), libc::POLLIN as _).build();
Make sure to set this up before you start the worker threads so that they
inherit the signal mask. With this in place, the sqe will complete when one of
the signals is delivered to the server.
When a signal arrives, you send the UBLK_U_CMD_STOP_DEV command to the ublk
control. The driver will react to this by halting all I/O requests and by
removing the /dev/ublkbN device. All operations in worker threads waiting on
either UBLK_U_IO_FETCH_REQ or UBLK_U_IO_COMMIT_AND_FETCH_REQ will complete
with -19 (UBLK_IO_RES_ABORT).
Stopping the server this way is rather abrupt - everybody still using the device
will suddenly get IO errors. In the generic case there's currently no
alternative but that's very likely to
change
with the introduction of UBLK_U_CMD_TRY_STOP_DEV. If there are any active
users of the device, the command will instead complete with -libc::EBUSY.
At this point you should join the threads and clean up all resources, including
closing the /dev/ublkcN file descriptor and unmapping the memory mappings.
All that remains is to send the UBLK_U_CMD_DEL_DEV command which will remove
the remaining /dev/ublkcN device.
If the command is not completing and appears to hang, it's because one of the
resources is still open and the driver is waiting for it to be released. You can
either be more careful or you can use the async variant of the command
UBLK_CMD_DEL_DEV_ASYNC. The only difference is that with this command the
driver will not wait for all resources to be released because it assumes the
server's process is about to be terminated and all resources released
automatically anyway.
Imagine you start a server, start working with it, and suddenly the server crashes. Maybe the server's block device hosts a file system and the crash happened right in the middle of putting some data on it. The mount still exists but no I/O requests can be successfully completed. Naturally it should not be technically possible for the server to crash given your extraordinary programming skills and split mechanical keyboard, but bear with me.
By default when a server crashes, the kernel driver will remove the block device
/dev/ublkbN (but not its /dev/ublkcN counterpart) and all further I/O
requests will fail.
$ ls -1 /dev/ublk*
/dev/ublkc42
$ ls /tmp/mount
ls: reading directory '/tmp/mount': Input/output error
You can confirm with the official utility that the device is now considered
DEAD. There's not much that can be done with the device at this point. You can
just leave it there to warn the others.
# ublk list -n 42
dev id 42: nr_hw_queues 2 queue_depth 128 block size 512 dev_capacity 4194304
max rq size 1048576 daemon pid -1 state DEAD
flags 0x6042 [ URING_CMD_COMP_IN_TASK CMD_IOCTL_ENCODE ]
ublkc: 241:23 ublkb: 0:0 owner: 0:0
There are very few use cases where this is acceptable though and so the ublk
framework offers a feature called recovery. To enable the feature, start the
server with a flag called UBLK_F_USER_RECOVERY.
let dev_info = ublksrv_ctrl_dev_info {
    flags: UBLK_F_USER_RECOVERY.into(),
    // ...the rest as before
    ..Default::default()
};
When the server crashes with this flag enabled, instead of transitioning to the
DEAD state, it becomes QUIESCED. Notice that the driver will not remove
the block device.
$ ls -1 /dev/ublk*
/dev/ublkb42
/dev/ublkc42
# ublk list -n 42
dev id 42: nr_hw_queues 2 queue_depth 128 block size 512 dev_capacity 4194304
max rq size 1048576 daemon pid 6162 state QUIESCED
flags 0x604a [ URING_CMD_COMP_IN_TASK RECOVERY CMD_IOCTL_ENCODE ]
ublkc: 241:25 ublkb: 259:1 owner: 0:0
To restart the server, attempt to add a device with the same dev_id (via
UBLK_U_CMD_ADD_DEV as usual) but handle a case where the call fails with -17
(EEXIST 17 File exists). If that happens, send a UBLK_U_CMD_GET_DEV_INFO call
to get details about the given device and examine the state field.
If the device's state is QUIESCED, proceed with
UBLK_U_CMD_START_USER_RECOVERY. If it's something else, you need to handle it
based on your own application logic. There's one interesting corner case here
though. If you start a server and shut it down gracefully while its block
device is still in use (mounted for example), the dev_id will not be available
even though the device is no longer visible.
$ ls -1 /dev/ublk?42
$ mount | grep 42
/dev/ublkb42 on /tmp/mount type ext4 (rw,relatime,shutdown)
The state in this case will be reported as DEAD. If this happens, the only
thing you can do is to tell the user to release the device by shutting down
programs which still use it. This is not specific to ublk.
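Putting the flow together, with hypothetical helpers (add_device, get_device_info, start_user_recovery) wrapping the control commands described above:

match add_device(&main_ring, &mut dev_info) {
    Err(e) if e.raw_os_error() == Some(libc::EEXIST) => {
        let info = get_device_info(&main_ring, dev_info.dev_id)?;
        if info.state as u32 == UBLK_S_DEV_QUIESCED {
            start_user_recovery(&main_ring, dev_info.dev_id)?;
        } else {
            // e.g. DEAD: ask the user to release the device and bail out.
            return Err(io::Error::other("device exists and is not recoverable"));
        }
    }
    other => {
        other?;
    }
}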
The happy path is that the device is indeed in the QUIESCED state and the
UBLK_U_CMD_START_USER_RECOVERY command succeeds. Since it's a recovery, you skip
setting its parameters (UBLK_U_CMD_SET_PARAMS). You can now continue
initializing the server and when you're done, instead of calling
UBLK_U_CMD_START_DEV you call UBLK_U_CMD_END_USER_RECOVERY. The device should
now transition back to the LIVE state.
What happens with I/O requests while the server is down? By default all requests which were sent to the server right before it crashed are aborted and future requests are queued (up to a limit). In practice this means that programs using the block device will appear to hang and will continue once the server is back online as if nothing happened.
Ublk supports two alternative modes, namely UBLK_F_USER_RECOVERY_FAIL_IO and
UBLK_F_USER_RECOVERY_REISSUE.
let dev_info = ublksrv_ctrl_dev_info {
    flags: (UBLK_F_USER_RECOVERY | UBLK_F_USER_RECOVERY_FAIL_IO).into(),
    // ...the rest as before
    ..Default::default()
};
With UBLK_F_USER_RECOVERY_FAIL_IO, all requests are immediately aborted. If
there's a file system on the device, it will most likely react to this by going
read only right away (emergency_ro) and the device itself will transition into
the FAIL_IO state.
With UBLK_F_USER_RECOVERY_REISSUE, all requests will be queued, including
those which were sent to the server right before it crashed, and will be sent
again. This means the server might receive the same request twice. This is meant
for e.g. read-only devices.
Special care should be taken to ensure that the device being recovered actually corresponds to the server. In other words, there's nothing preventing an unrelated server from "hijacking" a recoverable device which belongs to a different server / program.
Ublk applications run in user space and as such their effect on the stability
and security of the system should be limited. So far however all of the examples
assumed that servers run with elevated privileges (CAP_SYS_ADMIN) and that
negates much of the security benefits of running in user space. There are also
scenarios, such as rootless containers, where the entire point is to not have
elevated privileges in the first place.
As the answer to these concerns, ublk supports a so-called unprivileged mode. In this mode, servers run as regular users and do not require any extra privileges to operate. The driver will also make sure only the owner of a device can work with it. This makes it possible not only to use ublk in the aforementioned scenarios but also to use standard Linux user accounts to separate servers from each other.
let dev_info = ublksrv_ctrl_dev_info {
    flags: UBLK_F_UNPRIVILEGED_DEV.into(),
    // ...the rest as before
    ..Default::default()
};
In the unprivileged mode, only the owner of the device can work with it. This is enforced simply by requiring all commands (except for the add command) to be accompanied by the path to the device itself. The driver then checks that the calling effective user has permissions to access the device.
let dev_path = format!("/dev/ublkc{}", dev_info.dev_id);
let mut cmd = ublksrv_ctrl_cmd {
dev_id: dev_info.dev_id,
queue_id: u16::MAX,
// new
addr: dev_path.as_ptr() as u64,
len: dev_path.len() as u16,
dev_path_len: dev_path.len() as u16,
..Default::default()
};
Doing this is easy for commands such as UBLK_U_CMD_START_DEV. The only change
is that now we include a few more fields. For commands such as
UBLK_U_CMD_SET_PARAMS however the situation is more complicated because these
commands already use the addr and len fields to send or receive their data.
let cmd = ublksrv_ctrl_cmd {
dev_id: dev_info.dev_id,
queue_id: u16::MAX,
len: ublk_params::len(),
addr: &raw const params as u64,
..Default::default()
};
To work around this while keeping backwards compatibility, the dev_path along
with the rest of the original params is copied into a new buffer and that buffer
is sent instead.
// The new buffer
let mut buf = Vec::<u8>::with_capacity(
dev_path.len() + ublk_params::len() as usize,
);
// Copy over the dev_path
buf.extend_from_slice(dev_path.as_bytes());
// Copy over the params
let ptr = &raw const params;
let slice = unsafe {
slice::from_raw_parts(ptr.cast(), ublk_params::len() as usize)
};
buf.extend_from_slice(slice);
// Set the address and lengths
cmd.addr = buf.as_ptr() as u64;
cmd.len = buf.len() as u16;
cmd.dev_path_len = dev_path.len() as u16;
Unfortunately running in unprivileged mode is not as simple as making these changes and comes with a few issues and limitations, some of which are not related to ublk itself.
The first issue is that access to the /dev/ublk-control device is limited
to the root user. The driver creates the device but it's up to user space to
make it visible and assign its permissions. On systemd-based Linux
distributions, this will be the job of systemd-udevd.
crw------- root root /dev/ublk-control
Similarly, when a new ublk server is exposed to the system, its devices are only accessible to the root user. The driver doesn't have any way of communicating the desired permissions and so one way to handle this is to create custom udev rules.
KERNEL=="ublk-control", MODE="0666", OPTIONS+="static_node=ublk-control"
ACTION=="add",KERNEL=="ublk[bc]*",RUN+="/usr/bin/ublk_chown.sh %k add"
Setting the permissions of /dev/ublk-control is easy. However, setting the
permissions of server devices is more complicated because the point is to have
these devices owned and accessible only by the users who started them and again
the driver doesn't have a way of communicating the values to user space on its
own (it knows the information but it's up to the user space to ask for it).
That's where the ublk_chown script comes in.
When the device manager receives a kernel event informing it about a new device,
it launches the script which in turn starts a program (ublk_user_id) to query
the control about the newly created device by sending it
UBLK_CMD_GET_DEV_INFO2 along with the dev_id of the device.
# ublk_user_id ublkc42
1000:1000
Since the systemd-udevd daemon is running as root, it bypasses the ownership
check and the driver returns IDs of the owner. Finally, the script uses the
returned information to chown the device.
crw-rw-rw- root root /dev/ublk-control
brw-rw---- user user /dev/ublkb42
crw------- user user /dev/ublkc42
Of course, you are not required to use these rules or scripts, you can write
your own, especially if you control both the server and the system. Keep in mind
though that you cannot postpone chowning the server devices until some later
point. At least for the /dev/ublkcN part, it has to be done as soon as the
device appears, otherwise the server itself couldn't access it to finish
initializing.
Doing exactly this via udev rules is also not bulletproof though. Udev makes
the new device visible to the system immediately and only then runs the matching
scripts. What this means is that there's a race during initialization between
the server trying to set the device's parameters and udev running the scripts to
make the device accessible to the server. If you hit the jackpot, the server
will be faster than udev and the set parameters command will fail with -13
(EACCES 13 Permission denied). Handle it by simply trying again after a short
delay to give udev time to finish.
That's it for the issues. Now for the limitations. As of right now unprivileged servers can use neither the zero copy nor the user copy mode. This is for security reasons: the server could in theory use them to leak uninitialized kernel memory. It might depend on a specific kernel configuration because I haven't been able to reproduce it in my admittedly naive attempts.
Unprivileged servers also don't get their block devices scanned for partitions to avoid exposing kernel parts which assume trusted data to potentially malicious input. Finally, unprivileged servers don't yet support recovery.
Besides test cases specific to your application, you can exercise the read / write logic with fio. Fio is a command line utility from the author of io_uring which can, among many other things, generate load on a block device and verify the content afterwards.
Running a load test for 30s with verified mixed reads and writes of various block sizes can be accomplished as follows.
fio --name=randrw --filename=/dev/ublkb42 --rw=randrw \
--bsrange=4k-4M --direct=1 --ioengine=libaio --iodepth=128 --verify=crc32c \
--verify_state_save=0 --verify_fatal=1 --time_based --runtime=30s
Of course this exercises not only your own application but the kernel as well. You can even find bugs in the block subsystem itself (patch).
Sometimes it's difficult to tell why the driver rejected your perfectly valid commands and returned an error. Fortunately the driver is compiled as a module and we can replace it with a more talkative version without recompiling the whole kernel or even rebooting.
First, download the source code of your distribution's kernel, the one you're currently running. On Arch Linux it's possible to use the vanilla upstream sources but on many other distributions the sources are heavily patched and things might be silently broken.
$ curl -O https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.18.3.tar.xz
$ tar xf linux-6.18.3.tar.xz
Second, go to drivers/block and edit ublk_drv.c. Enable debugging at the
very top and add any log messages you need to figure out what's wrong.
#define DEBUG
pr_devel("dev_id=%d\n", header->dev_id);
Finally, compile the module against your current kernel's headers and swap it in. This should compile just this one module without building anything else.
$ make -C /lib/modules/$(uname -r)/build M=$(pwd) ublk_drv.ko
# rmmod ublk_drv && insmod ublk_drv.ko
# dmesg --follow
You could also try using Kprobes, which would not require recompiling the module at all, but modifying the source directly is much more flexible.
The best way to learn something is to actually use it in practice and not just read about it (I'm looking at you). While browsing the repository, I stumbled upon a question about whether it's possible to use ublk to create a virtual device backed by a series of regular files. Think JBOD but with files.
To make it more interesting, the individual chunks should be sparse, allocated on demand and opened only when needed later. Similarly to the "hands on" section of this post, I wanted to implement everything ublk related myself and not use the official libraries. The server is more or less a proof of concept at this point but feel free to check it out at blkchnkr.
$ blkchnkr init -r /tmp/repository --dev-id 42 --size 1t
[INFO] Created a new repository at /tmp/repository
# blkchnkr start -r /tmp/repository
[INFO] Starting up (v0.1.0)
[INFO] Created a new block device at /dev/ublkb42
[INFO] Ready!
# mkfs.xfs /dev/ublkb42
$ tree /tmp/repository
/tmp/repository
├── chunks
│ ├── 00
│ │ ├── 0
│ │ ├── 1024
│ │ ├── 1536
│ │ └── 512
│ ├── 01
│ │ └── 1025
│ └── ff
│ └── 2047
└── config
# mount --mkdir /dev/ublkb42 /tmp/mounted
The post is starting to get long and the only ones still reading are probably just LLM crawlers so let's end it here. I haven't talked about (or even tried) everything ublk can do but hopefully you now have enough context to figure out things on your own.
You can find more examples in the official repository or in the kernel tree. Everything in this post is based on these resources and my own experiments.
Did you find an error? Please send me an email.