The term "queue" in the context of NVMe refers to a data structure used to manage and organize I/O operations for an NVMe device. NVMe queues are implemented as memory structures both in the system's main memory (host memory) and within the NVMe device itself.
Here's how it works:
On the Host (System Memory):
Submission Queue (SQ): The host has one or more Submission Queues (SQs) in its system memory. These are used to hold commands (I/O requests) that the host CPU wants the NVMe device to process.
Completion Queue (CQ): The host also has one or more Completion Queues (CQs) in its system memory. These are used to hold completion entries generated by the NVMe device when it completes processing a command.
On the NVMe Device:
Command fetch and internal tracking: The NVMe controller fetches command entries from the host's SQs (over PCIe) into its own internal buffers and works through them there; the host does not push each command into device memory.
Completion generation: When a command finishes, the controller builds a completion entry internally and writes it back into the appropriate host CQ.
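To make the layout concrete, here is a minimal C sketch of what one queue pair in host memory looks like, assuming the standard 64-byte submission entries and 16-byte completion entries (which is what the devices later in this post report). The field names are simplified for illustration and are not the Linux driver's actual definitions.

/* Minimal sketch of the host-memory queue layout: one SQ/CQ pair is just
 * two arrays of fixed-size entries in host DRAM plus head/tail indices. */
#include <stdint.h>

struct nvme_sqe {                 /* Submission Queue Entry: 64 bytes    */
    uint32_t cdw0;                /* opcode, flags, command identifier   */
    uint32_t nsid;                /* namespace ID (e.g. 1 for nvme0n1)   */
    uint64_t rsvd;
    uint64_t mptr;                /* metadata pointer                    */
    uint64_t prp1;                /* pointer to the data in host memory  */
    uint64_t prp2;                /* second pointer / PRP list           */
    uint32_t cdw10_15[6];         /* command-specific: LBA, block count  */
};

struct nvme_cqe {                 /* Completion Queue Entry: 16 bytes    */
    uint32_t result;              /* command-specific result             */
    uint32_t rsvd;
    uint16_t sq_head;             /* how far the controller has consumed */
    uint16_t sq_id;               /* which SQ the command came from      */
    uint16_t cid;                 /* matches the submitted command       */
    uint16_t status;              /* status code + phase bit             */
};

struct nvme_queue_pair {
    struct nvme_sqe *sq;          /* e.g. 1024 entries                   */
    struct nvme_cqe *cq;
    uint16_t sq_tail;             /* host advances when submitting       */
    uint16_t cq_head;             /* host advances when consuming        */
};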
Doorbell Registers:
In NVMe, a "doorbell register" is a mechanism used for the NVMe controller (in the NVMe device) to notify the host CPU about certain events, such as when it has completed processing a command. The doorbell registers are associated with the Submission Queue (SQ) and Completion Queue (CQ) pairs.
The host has a doorbell register for each Submission Queue (SQ) and Completion Queue (CQ) pair. It writes to the doorbell register to signal the NVMe device about new commands in the SQ or new completion entries in the CQ.
The NVMe device has doorbell registers for its Device Submission Queue (Device SQ) and Device Completion Queue (Device CQ). It writes to the doorbell register to signal the host about new commands in the Device SQ or new completion entries in the Device CQ.
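As a rough sketch, assuming the register offsets from the NVMe specification and the doorbell stride (CAP.DSTRD, usually 0 on consumer devices), this is where the per-queue doorbells sit in the controller's BAR0 register space:

/* Hedged sketch: doorbell register offsets inside the controller's
 * memory-mapped register space (PCI BAR0). The stride between doorbells
 * is (4 << DSTRD) bytes, where DSTRD comes from the CAP register. */
#include <stdint.h>

#define NVME_DOORBELL_BASE 0x1000   /* doorbells start at this BAR0 offset */

static inline uint32_t sq_tail_doorbell_offset(uint16_t qid, uint8_t dstrd)
{
    return NVME_DOORBELL_BASE + (2u * qid)     * (4u << dstrd);
}

static inline uint32_t cq_head_doorbell_offset(uint16_t qid, uint8_t dstrd)
{
    return NVME_DOORBELL_BASE + (2u * qid + 1) * (4u << dstrd);
}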
For most use cases, it's best to use the "none" scheduler (the multi-queue successor to the legacy NOOP scheduler) with NVMe devices. It is essentially a pass-through: I/O requests are handed to the device without reordering, which suits devices such as NVMe SSDs that have their own sophisticated internal queueing and reordering capabilities.
Even with "none" selected, the Linux kernel still routes I/O through the Block Multi-Queue (blk-mq) subsystem. blk-mq is a multi-queue block layer framework introduced to handle I/O more efficiently on modern hardware, including NVMe devices; it improves parallelism and scalability by spreading work across multiple hardware queues and CPU cores. We can count those hardware queues with the small sketch after the scheduler check below.
❯ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline
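As a quick illustration, the sketch below counts the blk-mq hardware queues the kernel created for a device by listing /sys/block/nvme0n1/mq, which contains one subdirectory per hardware queue (the exact sysfs layout is a kernel detail and may vary between versions):

/* Count blk-mq hardware queues for a device by listing its mq/ directory. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/nvme0n1/mq";    /* adjust device name  */
    DIR *d = opendir(path);
    if (!d) { perror(path); return 1; }

    int nr_hw_queues = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (isdigit((unsigned char)e->d_name[0]))  /* dirs named 0, 1, ... */
            nr_hw_queues++;
    }
    closedir(d);

    printf("%s: %d hardware queue(s)\n", path, nr_hw_queues);
    return 0;
}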
Submission Queue (SQ) Enqueue:
The host sends a command descriptor (metadata) to the Submission Queue (SQ) to initiate an I/O operation (read or write).
This command descriptor contains information about the operation, such as the command type, the starting logical block address, the number of blocks, and the location of data in host memory.
After writing the command descriptor to the SQ, the host rings the SQ's tail doorbell register.
Doorbell Register:
The doorbell register is a memory-mapped controller register that the host writes to over PCIe.
Ringing the doorbell means writing a value to that register, indicating that one or more new command descriptors have been added to the SQ.
The value written is the SQ's new tail index, i.e. the slot just past the last command the host queued (sketched below).
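Putting the enqueue and the doorbell together, a minimal host-side submit path might look like the following sketch. The structures and names are illustrative, not the Linux driver's; a real driver also issues a memory barrier so the entry is visible in memory before the doorbell write.

/* Hedged sketch of the host-side submit path: copy a 64-byte command into
 * the next SQ slot, advance the tail with wraparound, then write the new
 * tail to the mapped SQ tail doorbell. */
#include <stdint.h>
#include <string.h>

struct nvme_sqe { uint8_t bytes[64]; };        /* 64-byte entry, as above  */

struct sq_state {
    struct nvme_sqe   *entries;                /* SQ ring in host memory   */
    uint16_t           tail;                   /* next free slot           */
    uint16_t           depth;                  /* e.g. 1024                */
    volatile uint32_t *sq_db;                  /* mapped SQ tail doorbell  */
};

static void submit_command(struct sq_state *sq, const struct nvme_sqe *cmd)
{
    memcpy(&sq->entries[sq->tail], cmd, sizeof(*cmd)); /* fill the slot    */
    sq->tail = (uint16_t)((sq->tail + 1) % sq->depth); /* wrap the ring    */
    /* a real driver places a write barrier here before the MMIO write     */
    *sq->sq_db = sq->tail;         /* "ring the doorbell": tell the        */
                                   /* controller the new tail index        */
}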
Controller Processing:
The NVMe controller fetches command descriptors from the SQ, from its current head position up to the new tail indicated by the doorbell write.
The controller interprets the command descriptor and initiates the necessary actions, which could include reading or writing data.
Data Transfer:
For a write operation, the NVMe controller accesses the data in the host memory specified by the command descriptor.
The data is transferred from host memory to the NVMe device's internal buffers.
Internal Processing:
The NVMe device processes the command, performing any necessary internal operations, such as wear leveling, error correction, or garbage collection.
Completion Generation:
Once the requested operation is complete, the NVMe device generates a completion entry that includes the operation's status (success or failure) and any relevant information.
Completion Queue (CQ) Update:
The completion entry is placed in the Completion Queue (CQ) by the NVMe controller.
The NVMe controller then notifies the host, typically by raising an interrupt (e.g. MSI-X), that a new completion entry is available.
Host Retrieval:
The host periodically polls or uses interrupt-driven mechanisms to check the CQ for new completion entries.
When the host finds a new completion entry, it reads the entry's status and any relevant details.
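A minimal sketch of that retrieval loop, assuming the 16-byte completion entry layout and the phase bit defined by the specification (the controller flips the phase on every pass through the ring, so the host can tell fresh entries from stale ones):

/* Hedged sketch of host-side completion handling: consume entries whose
 * phase bit matches the expected phase, then hand the slots back to the
 * controller by writing the CQ head doorbell. */
#include <stdint.h>

struct nvme_cqe {
    uint32_t result;
    uint32_t rsvd;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t cid;
    uint16_t status;              /* bit 0 is the phase tag               */
};

struct cq_state {
    volatile struct nvme_cqe *entries;
    uint16_t           head;
    uint16_t           depth;
    uint8_t            phase;     /* expected phase for fresh entries     */
    volatile uint32_t *cq_db;     /* mapped CQ head doorbell              */
};

static void poll_completions(struct cq_state *cq)
{
    for (;;) {
        volatile struct nvme_cqe *cqe = &cq->entries[cq->head];
        if ((cqe->status & 1) != cq->phase)        /* nothing new yet     */
            break;

        /* handle_completion(cqe->cid, cqe->status >> 1); */

        if (++cq->head == cq->depth) {             /* wrap and flip phase */
            cq->head = 0;
            cq->phase ^= 1;
        }
    }
    *cq->cq_db = cq->head;        /* let the controller reuse the slots   */
}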
Memory Access:
For a read operation, the NVMe controller retrieves the requested data from the NVMe device's internal buffers.
The data is then written to the host memory location specified by the command descriptor.
Acknowledgment:
The host acknowledges the completion of the I/O operation, processes the data as needed, and writes the CQ's head doorbell so the controller can reuse the consumed completion slots.
Throughout this process, the doorbell registers and queues help manage the flow of commands and completions between the host and the NVMe device. The device accesses the host memory as necessary during the data transfer stages, ensuring that the requested data is read from or written to the correct memory locations. The completion entries in the CQ signal the host that the requested operation has been fully processed and is ready for further action.
In Linux, there are a few different tools we can use to interact with NVMe devices and obtain details about them. One of the most common and best supported is nvme-cli, which I'll be using for these examples.
The below command will give us a high level view of the device(s):
❯ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 50026B7685968D5C KINGSTON OM8PDP3512B-A01 1 512.11 GB / 512.11 GB 4 KiB + 0 B EDFK0S03
We can see the Node (the namespace block device) along with the serial number, model, firmware revision, on-device format (block size), and more.
This NVMe is using physical and logical block sizes of 4096 bytes instead of the traditional 512-byte sectors that legacy drives usually use. We can verify that the kernel's view matches what the NVMe reports by doing the following:
❯ cat /sys/block/nvme0n1/queue/{logical,physical}_block_size
4096
4096
Typically, NVMe devices leverage multiple queues: each SQ can hold many outstanding I/O requests, and having multiple SQs allows the host to submit requests concurrently, improving parallelism and potentially reducing latency.
We can check how many queues have been allocated with the following command. Something to note: the number of host SQs is negotiated between the driver and the device, so it can never exceed what the controller supports and is also bounded by what the driver requests (often one queue per CPU core).
❯ sudo nvme get-feature /dev/nvme0 -f 7 -H
get-feature:0x07 (Number of Queues), Current value:0x00070007
Number of IO Completion Queues Allocated (NCQA): 8
Number of IO Submission Queues Allocated (NSQA): 8
❯ cat /sys/class/nvme/nvme0/sqsize
1023
We can see there are 8 Host SQs and each can hold up to 1023 requests.
The queue entries themselves are small. The actual data from a pwrite64 (with default Aerospike storage this would be our 1048576-byte write-block-size) stays in host memory; the request struct that is formed contains pointers to the memory holding the data we intend to write to disk, not the data itself.
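For illustration only, here is roughly what that looks like from userspace: one aligned 1 MiB buffer handed to a single pwrite() (pwrite64 on 64-bit systems). The device path and offset are placeholders, and this would overwrite data if actually run against a disk in use.

/* Illustration only: a single write-block flush is one pwrite() of an
 * aligned 1 MiB buffer; the kernel and driver pass pointers to these
 * pages down to the device rather than copying the payload into the SQ. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_BLOCK_SIZE (1024 * 1024)     /* 1048576 bytes */

int main(void)
{
    int fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);   /* placeholder */
    if (fd < 0) return 1;

    void *buf;
    /* O_DIRECT needs alignment; 4096 matches the logical block size. */
    if (posix_memalign(&buf, 4096, WRITE_BLOCK_SIZE) != 0) return 1;
    memset(buf, 0xAB, WRITE_BLOCK_SIZE);

    /* One syscall, one write block; the driver splits it into
     * MDTS-sized commands (256 KiB on the device above). */
    ssize_t n = pwrite(fd, buf, WRITE_BLOCK_SIZE, 0);

    free(buf);
    close(fd);
    return n == WRITE_BLOCK_SIZE ? 0 : 1;
}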
❯ sudo nvme id-ctrl /dev/nvme0 | egrep 'sqes|cqes|mdts'
mdts : 6
sqes : 0x66
cqes : 0x44
Submission Queue Entry Size (SQES) - This field packs two 4-bit values: the low nibble is the required (minimum) entry size and the high nibble is the maximum, each as a power of two. 0x66 therefore means 2^6 = 64, so each submission queue entry is 64 bytes.
Completion Queue Entry Size (CQES) - Same encoding: 0x44 means 2^4 = 16, so each completion queue entry is 16 bytes.
Maximum Data Transfer Size (MDTS) - This indicates the maximum amount of data that can be transferred in a single command between the host and the NVMe device. It is a power of two in units of the controller's minimum memory page size (4 KiB here), so the calculation is 2^(mdts + 12) bytes. In the above example this is 2^(6 + 12) = 2^18 = 256 KiB. Thus each command placed in the SQ covers at most 256 KiB, so our 1 MiB write block will be split into 4 commands of 256 KiB each.
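A small sketch of that arithmetic, using the values this device reported (sqes 0x66, cqes 0x44, mdts 6) and assuming a 4 KiB minimum page size:

/* Decode the id-ctrl fields shown above: sqes/cqes pack two 4-bit powers
 * of two (low nibble = required size, high nibble = maximum), and mdts is
 * a power of two in units of the 4 KiB minimum page size. */
#include <stdio.h>

int main(void)
{
    unsigned sqes = 0x66, cqes = 0x44, mdts = 6;

    printf("SQ entry size: %u bytes\n", 1u << (sqes & 0xf));       /*  64 */
    printf("CQ entry size: %u bytes\n", 1u << (cqes & 0xf));       /*  16 */
    printf("Max transfer : %u KiB\n", (1u << (mdts + 12)) / 1024); /* 256 */
    printf("Commands per 1 MiB write block: %u\n",
           (1024u * 1024u) / (1u << (mdts + 12)));                 /*   4 */
    return 0;
}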
This is how the cluster is setup:
❯ aerolab cluster create -c 3 -v '5.7*' --instance c2d-standard-16 --zone us-central1-a --disk pd-ssd:110 --disk local-ssd@4 --start n
The list of NVMe devices:
root@mydc-1:~# nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 nvme_card nvme_card 1 0.00 B / 402.65 GB 4 KiB + 0 B 2
/dev/nvme0n2 nvme_card nvme_card 2 0.00 B / 402.65 GB 4 KiB + 0 B 2
/dev/nvme0n3 nvme_card nvme_card 3 0.00 B / 402.65 GB 4 KiB + 0 B 2
/dev/nvme0n4 nvme_card nvme_card 4 0.00 B / 402.65 GB 4 KiB + 0 B 2
The number of SQs and CQs:
root@mydc-1:~# nvme get-feature -f 7 /dev/nvme0n1 -H
get-feature:0x7 (Number of Queues), Current value:0x0f000f
Number of IO Completion Queues Allocated (NCQA): 16
Number of IO Submission Queues Allocated (NSQA): 16
The entry sizes and maximum data transfer size (MDTS):
root@mydc-1:~# nvme id-ctrl /dev/nvme0n1 | egrep 'mdts|sqes|cqes'
mdts : 9
sqes : 0x66
cqes : 0x44
Number of requests allowed in the SQ:
root@mydc-1:~# cat /sys/class/nvme/nvme0/sqsize
1023
Let’s summarize:
The NVMe block-size is 4096
We can have 16 Host SQs for device nvme0n1
The maximum amount of data we can transfer to nvme0n1 in a single command is 2 MiB [ 2^(9+12) ]
Each SQ can hold up to 1023 requests
This is how the cluster is setup:
$ aerolab cluster create -n z3-176 --count 1 --instance=z3-highmem-176 --zone=us-central1-a
$ aerolab cluster attach -n z3-176
The list of NVMe devices:
root@z3-176-1:~# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 nvme_card-pd nvme_card-pd 1 0.00 B / 21.47 GB 512 B + 0 B 2
/dev/nvme1n1 nvme_card0 nvme_card0 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme10n1 nvme_card5 nvme_card5 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme11n1 nvme_card1 nvme_card1 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme12n1 nvme_card2 nvme_card2 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme2n1 nvme_card3 nvme_card3 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme3n1 nvme_card6 nvme_card6 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme4n1 nvme_card9 nvme_card9 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme5n1 nvme_card4 nvme_card4 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme6n1 nvme_card10 nvme_card10 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme7n1 nvme_card7 nvme_card7 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme8n1 nvme_card11 nvme_card11 1 3.22 TB / 3.22 TB 4 KiB + 0 B
/dev/nvme9n1 nvme_card8 nvme_card8 1 3.22 TB / 3.22 TB 4 KiB + 0 B
root@z3-176-1:~# cat /sys/block/nvme1n1/queue/{logical,physical}_block_size
4096
4096
The number of SQs and CQs:
root@z3-176-1:~# nvme get-feature -f 7 /dev/nvme0n1 -H
get-feature:0x07 (Number of Queues), Current value:0x00030003
Number of IO Completion Queues Allocated (NCQA): 4
Number of IO Submission Queues Allocated (NSQA): 4
root@z3-176-1:~# nvme get-feature -f 7 /dev/nvme1n1 -H
get-feature:0x07 (Number of Queues), Current value:0x00070007
Number of IO Completion Queues Allocated (NCQA): 8
Number of IO Submission Queues Allocated (NSQA): 8
The entry sizes and maximum data transfer size (MDTS):
root@z3-176-1:~# nvme id-ctrl /dev/nvme1n1 | egrep 'mdts|sqes|cqes'
mdts : 5
sqes : 0x66
cqes : 0x44
root@z3-176-1:~# sudo nvme id-ctrl /dev/nvme1 | grep 'mdts' | awk '{print "2^("$3"+12)"}' | bc | numfmt --to=iec
128K
Number of requests allowed in the SQ:
root@z3-176-1:~# cat /sys/class/nvme/nvme1/sqsize
1023
Let’s summarize:
The NVMe block-size is 4096
We can have 8 Host SQs for device nvme1n1
The maximum amount of data we can transfer to nvme1n1 in a single command is 128K [ 2^(5+12) ]
Each SQ can hold up to 1023 requests