bdev_ubi provides an SPDK virtual bdev layered over another bdev, enabling copy-on-access for a base image.
This can be utilized to set up a block device initialized with an image file, without the delay of an initial copy. With bdev_ubi, copying occurs lazily upon block access, rather than at the provisioning time.
For instance, let's say you already have a bdev named "aio0" and you wish to
populate it with data from /opt/images/large-image.raw
. A conventional
approach would be to copy large-image.raw
to aio0
using tools like spdk_dd,
which can be time-consuming. However, with bdev_ubi, you can proceed by invoking
the following json-rpc method:
{
"jsonrpc": "2.0",
"id": 1,
"method": "bdev_ubi_create",
"params": {
"name": "ubi0",
"base_bdev": "aio0",
"image_path": "/opt/images/large-image.raw",
"stripe_size_mb": 1
}
}
and rather than directly using aio0
, use ubi0
as the block device. This
action completes almost instantly, eliminating the need for initial data
copying. The actual data copying will occur lazily during I/O requests.
Install some requirements:
sudo apt update
sudo apt install pkg-config build-essential liburing-dev
Clone and build SPDK:
git clone https://github.com/spdk/spdk
cd spdk
git submodule update --init
sudo scripts/pkgdep.sh
./configure --with-crypto --with-vhost
make -j16
Build bdev_ubi:
SPDK_PATH=/path/to/spdk/build/ make
The steps in the previous section generates an SPDK app in
build/bin/vhost_ubi
. The application can be managed through JSON-RPC or by
providing block device (bdev) configuration via command-line arguments,
analogous to the functionality offered by the SPDK's standard vhost
application.
For instance, begin by downloading and preparing the Ubuntu Jammy image, which will be used as the base read-only image:
wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
qemu-img convert -p -f qcow2 -O raw jammy-server-cloudimg-amd64.img jammy.raw
Next, generate the file that will function as the writable layer for the block device:
touch write-space
truncate -s 5G write-space
Next, reserve 1G of hugepages:
echo 512 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
Finally, start the app with the configuration provided in
examples/spdk_conf.json
:
sudo build/bin/vhost_ubi --json examples/spdk_conf.json -S /var/tmp
What will happen is:
- An aio bdev called
aio0
will be created pointing towrite-space
. - A ubi bdev called
ubi0
will be created, with base imagejammy.raw
and base bdevaio0
. - A vhost-user-blk controller called
vhost.0
will be created, which will be bound to the UNIX domain socket/var/tmp/vhost.0
.
Now you can start a VM using the vhost disk:
sudo cloud-hypervisor \
...
--disk vhost_user=true,socket=/var/tmp/vhost.0,num_queues=4,queue_size=256 \
...
Parameters:
name
(text, required): Name of the bdev to be created.image_path
(text, required): Path to the image file.base_bdev
(text, required): Name of base bdev.stripe_size_kb
(integer, required): Stripe size in kibibytes.no_sync
(boolean, optional): Ignore sync requests. Defaults to false.copy_on_read
(boolean, optional): Fetch stripes for reads. Defaults to true.directio
(boolean, optional): Use O_DIRECT when opening the image file. Defaults to true;
Note. When creating the bdev for the first time, magic bits in the metadata
section of base image should be zeroed. For unencrypted base bdev, truncate
command in the previous section will take care of this. For encrypted base bdev,
spdk_dd
can be used with parameters --bs 512 --count 1 --if /dev/zero --ob [ubi_bdev_name]
.
Parameters:
name
(text, required): Name of the bdev to be deleted.
First 8MB of base bdev is reserved for metadata. Metadata consists of:
- Magic bytes (9 bytes):
BDEV_UBI\0
- Metadata version major (2 bytes)
- Metadata version minor (2 bytes)
- stripe_size_kb (1 byte)
- Stripe headers: 2 byte per stripes. Currently it specifies whether a stripe has been fetched from image or not. 15-bits are reserved for future extension.
- Padding to make the total size 8MB.
Then at the 8MB offset the actual disk data starts.
If the stripe containing the requested block range hasn't been fetched yet, then a stripe fetch is enqueued. Once stripe has been fetched, the actual I/O operation is served.
- Data for the requested range is flushed to base bdev.
- Once data flush is finished, and if metadata has been modified in memory, then metadata is first written and then flushed to base bdev.