From the man pages, the tools are described as follows:
[2] blktrace - blktrace is a block layer IO tracing mechanism which provides detailed information about request queue operations up to user space. There are three major components: a kernel component, a utility to record the i/o trace information for the kernel to user space, and utilities to analyse and view the trace information.
[3] blkparse - The blkparse utility will attempt to combine streams of events for various devices on various CPUs, and produce a formatted output of the event information. Specifically, it will take the (machine-readable) output of the blktrace utility and convert it to a nicely formatted and human-readable form.
[3] Tracepoints are static probe points that are located in strategic points throughout the kernel. 'Probes' register/unregister with tracepoints via a callback mechanism. The 'probes' are strictly typed functions that are passed a unique set of parameters defined by each tracepoint.
The man page for blkparse lists the trace actions that are recognized, which I will paste below:
The following trace actions are recognised:
C -- complete A previously issued request has been completed. The output will detail the sector and size of that request, as well as the success or failure of it.
D -- issued A request that previously resided on the block layer queue or in the i/o scheduler has been sent to the driver.
I -- inserted A request is being sent to the i/o scheduler for addition to the internal queue and later service by the driver. The request is fully formed at this time.
Q -- queued This notes intent to queue i/o at the given location. No real requests exists yet.
B -- bounced The data pages attached to this bio are not reachable by the hardware and must be bounced to a lower memory location. This causes a big slowdown in i/o performance, since the data must be copied to/from kernel buffers. Usually this can be fixed with using better hardware -- either a better i/o controller, or a platform with an IOMMU.
M -- back merge A previously inserted request exists that ends on the boundary of where this i/o begins, so the i/o scheduler can merge them together.
F -- front merge Same as the back merge, except this i/o ends where a previously inserted requests starts.
M -- front or back merge One of the above.
G -- get request To send any type of request to a block device, a struct request container must be allocated first.
S -- sleep No available request structures were available, so the issuer has to wait for one to be freed.
P -- plug When i/o is queued to a previously empty block device queue, Linux will plug the queue in anticipation of future ios being added before this data is needed.
U -- unplug Some request data already queued in the device, start sending requests to the driver. This may happen automatically if a timeout period has passed (see next entry) or if a number of requests have been added to the queue.
T -- unplug due to timer If nobody requests the i/o that was queued after plugging the queue, Linux will automatically unplug it after a defined period has passed.
X -- split On raid or device mapper setups, an incoming i/o may straddle a device or internal zone and needs to be chopped up into smaller pieces for service. This may indicate a performance problem due to a bad setup of that raid/dm device, but may also just be part of normal boundary conditions. dm is notably bad at this and will clone lots of i/o.
A -- remap For stacked devices, incoming i/o is remapped to device below it in the i/o stack. The remap action details what exactly is being remapped to what.
Collecting traces is relatively simple: you specify which device(s) to monitor and where to output the trace files for blkparse to reference later.
To trace a single device and output the trace files to the current working directory:
sudo blktrace -d /dev/nvme0n1 -o nvme0n1 -w 10
To trace all NVMe devices and put the trace files into the ./traces/ directory:
sudo blktrace -d /dev/nvme0n* -D traces/ -w 10
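If you just want to watch events live instead of writing trace files, blktrace can also stream to stdout and be piped straight into blkparse:
sudo blktrace -d /dev/nvme0n1 -o - | blkparse -i -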
Once we have blktrace output files, we can use blkparse to inspect how the I/O is behaving.
blkparse lets us specify an output format with -f, so we can shape the output however we like. This is the format I tend to use:
blkparse -f "Time: %5T.%9t, PID: %p, Device: %M,%m CPU: %c, Command: %C, Action: %a, RWBS: %3d, Bytes: %N, Sector: %S, Blocks: %n, SequenceNum: %s\n" -i nvme0n1.blktrace.0
Which shows the lines as:
Time: 0.000000000, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: A, RWBS: WFS, Bytes: 1048576, Sector: 10264576, Blocks: 2048, SequenceNum: 1
So let’s review the output for a single write being flushed to the disk in Aerospike.
The following asbench command was used:
asbench -n test -s blktrace -w I -t 0 -o '1*S102400,1*S24320' -k 5 -K 15000 -T 0 -h localhost:31775 --connect-timeout 10000 --throughput 500 -U admin -P admin123
and flush-max-ms is set to 1000, so even though the write buffer may not be completely full, the full 1048576-byte block is still written to the disk since we are forcing a flush based on time.
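For reference, here is a minimal sketch of the relevant namespace storage configuration for this test; the values are illustrative rather than copied from my aerospike.conf:
namespace test {
    storage-engine device {
        device /dev/nvme0n1p1
        write-block-size 1M    # matches the 1048576-byte writes seen in the trace
        flush-max-ms 1000      # flush a partially filled write buffer after 1 second
    }
}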
We can see the following from this workload using blkparse:
❯ blkparse -f "Time: %5T.%9t, PID: %p, Device: %M,%m CPU: %c, Command: %C, Action: %a, RWBS: %3d, Bytes: %N, Sector: %S, Blocks: %n, SequenceNum: %s\n" -i nvme0n1.blktrace.0
Input file nvme0n1.blktrace.2 added
Input file nvme0n1.blktrace.6 added
Input file nvme0n1.blktrace.7 added
Time: 0.000000000, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: A, RWBS: WFS, Bytes: 1048576, Sector: 10264576, Blocks: 2048, SequenceNum: 1
Time: 0.000002794, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: Q, RWBS: WFS, Bytes: 1048576, Sector: 10264576, Blocks: 2048, SequenceNum: 2
Time: 0.000008940, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: X, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 3
Time: 0.000015644, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 4
Time: 0.000057549, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 5
Time: 0.000070190, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: X, RWBS: WFS, Bytes: 262144, Sector: 10265088, Blocks: 512, SequenceNum: 6
Time: 0.000072565, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: WFS, Bytes: 262144, Sector: 10265088, Blocks: 512, SequenceNum: 7
Time: 0.000095961, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: WFS, Bytes: 262144, Sector: 10265088, Blocks: 512, SequenceNum: 8
Time: 0.000105460, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: X, RWBS: WFS, Bytes: 262144, Sector: 10265600, Blocks: 512, SequenceNum: 9
Time: 0.000107765, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: WFS, Bytes: 262144, Sector: 10265600, Blocks: 512, SequenceNum: 10
Time: 0.000130253, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: WFS, Bytes: 262144, Sector: 10265600, Blocks: 512, SequenceNum: 11
Time: 0.000138774, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: WFS, Bytes: 262144, Sector: 10266112, Blocks: 512, SequenceNum: 12
Time: 0.000162660, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: WFS, Bytes: 262144, Sector: 10266112, Blocks: 512, SequenceNum: 13
Time: 0.038976446, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 1
Time: 0.038988249, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WFS, Bytes: 262144, Sector: 10265088, Blocks: 512, SequenceNum: 2
Time: 0.039132680, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WFS, Bytes: 262144, Sector: 10265600, Blocks: 512, SequenceNum: 3
Time: 0.039579034, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WFS, Bytes: 262144, Sector: 10266112, Blocks: 512, SequenceNum: 4
Time: 0.039614304, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: Q, RWBS: FWS, Bytes: 0, Sector: 0, Blocks: 0, SequenceNum: 14
Time: 0.039620799, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: FWS, Bytes: 0, Sector: 0, Blocks: 0, SequenceNum: 15
Time: 0.039636723, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: FN, Bytes: 0, Sector: 0, Blocks: 0, SequenceNum: 16
Time: 0.041816251, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: FN, Bytes: 0, Sector: 0, Blocks: 0, SequenceNum: 5
Time: 0.041820581, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WS, Bytes: 0, Sector: 0, Blocks: 0, SequenceNum: 6
CPU6 (nvme0n1):
Reads Queued: 0, 0KiB Writes Queued: 2, 1024KiB
Read Dispatches: 1, 0KiB Write Dispatches: 4, 1024KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 1 Write depth: 4
IO unplugs: 0 Timer unplugs: 0
CPU7 (nvme0n1):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 1, 0KiB Writes Completed: 5, 1024KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 1 Write depth: 4
IO unplugs: 0 Timer unplugs: 0
Total (nvme0n1):
Reads Queued: 0, 0KiB Writes Queued: 2, 1024KiB
Read Dispatches: 1, 0KiB Write Dispatches: 4, 1024KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 1, 0KiB Writes Completed: 5, 1024KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
IO unplugs: 0 Timer unplugs: 0
Throughput (R/W): 0KiB/s / 24975KiB/s
Events (nvme0n1): 22 entries
Skips: 0 forward (0 - 0.0%)
We get a summary of the reads and writes for the disk at the bottom. The interesting behavior, though, is in the trace lines themselves:
Time: 0.000000000, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: A, RWBS: WFS, Bytes: 1048576, Sector: 10264576, Blocks: 2048, SequenceNum: 1
Time: 0.000002794, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: Q, RWBS: WFS, Bytes: 1048576, Sector: 10264576, Blocks: 2048, SequenceNum: 2
Time: 0.000008940, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: X, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 3
Time: 0.000015644, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 4
Time: 0.000057549, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 5
Time: 0.000070190, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: X, RWBS: WFS, Bytes: 262144, Sector: 10265088, Blocks: 512, SequenceNum: 6
Time: 0.000072565, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: WFS, Bytes: 262144, Sector: 10265088, Blocks: 512, SequenceNum: 7
Time: 0.000095961, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: WFS, Bytes: 262144, Sector: 10265088, Blocks: 512, SequenceNum: 8
Time: 0.000105460, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: X, RWBS: WFS, Bytes: 262144, Sector: 10265600, Blocks: 512, SequenceNum: 9
Time: 0.000107765, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: WFS, Bytes: 262144, Sector: 10265600, Blocks: 512, SequenceNum: 10
Time: 0.000130253, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: WFS, Bytes: 262144, Sector: 10265600, Blocks: 512, SequenceNum: 11
Time: 0.000138774, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: WFS, Bytes: 262144, Sector: 10266112, Blocks: 512, SequenceNum: 12
Time: 0.000162660, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: WFS, Bytes: 262144, Sector: 10266112, Blocks: 512, SequenceNum: 13
Time: 0.038976446, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 1
Time: 0.038988249, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WFS, Bytes: 262144, Sector: 10265088, Blocks: 512, SequenceNum: 2
Time: 0.039132680, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WFS, Bytes: 262144, Sector: 10265600, Blocks: 512, SequenceNum: 3
Time: 0.039579034, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WFS, Bytes: 262144, Sector: 10266112, Blocks: 512, SequenceNum: 4
Time: 0.039614304, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: Q, RWBS: FWS, Bytes: 0, Sector: 0, Blocks: 0, SequenceNum: 14
Time: 0.039620799, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: FWS, Bytes: 0, Sector: 0, Blocks: 0, SequenceNum: 15
Time: 0.039636723, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: FN, Bytes: 0, Sector: 0, Blocks: 0, SequenceNum: 16
Time: 0.041816251, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: FN, Bytes: 0, Sector: 0, Blocks: 0, SequenceNum: 5
Time: 0.041820581, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WS, Bytes: 0, Sector: 0, Blocks: 0, SequenceNum: 6
So, we wrote 5 keys into the database and the database flushed the partial block down to the disk.
We see the first line is an Action A for a remap. This is because the device layout on the machine looks like this:
nvme0n1 259:0 0 476.9G 0 disk
├─nvme0n1p1 259:1 0 150G 0 part
├─nvme0n1p2 259:2 0 100G 0 part
├─nvme0n1p3 259:3 0 100G 0 part /mnt/nvme-test
└─nvme0n1p4 259:4 0 4K 0 part
Thus, the write is issued against nvme0n1p1, the raw partition backing my Aerospike namespace, and is remapped down to the parent device nvme0n1 (259,0), with the sector adjusted to be relative to the whole disk.
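If you want to verify where a remapped sector lands, you can compare it against the partition's start offset, which sysfs reports in 512-byte sectors:
# Start offset of the partition, in 512-byte sectors.
cat /sys/block/nvme0n1/nvme0n1p1/start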
Next, Action: Q notes the intent to queue the I/O at the block layer. The request then has to be split into smaller chunks of 512 sectors of 512 bytes each (256KiB). Why is this the case? Because this NVMe controller can only accept 256KiB per command, as advertised by its Maximum Data Transfer Size (mdts).
❯ sudo nvme id-ctrl /dev/nvme0 | egrep 'sqes|cqes|mdts'
mdts : 6
sqes : 0x66
cqes : 0x44
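The mdts value is reported as a power of two in units of the controller's minimum memory page size (MPSMIN, most commonly 4KiB), so mdts = 6 works out to 2^6 * 4KiB = 256KiB. A quick way to compute it, assuming a 4KiB minimum page size:
# Read mdts from the controller and convert it to bytes (assumes MPSMIN is 4KiB).
mdts=$(sudo nvme id-ctrl /dev/nvme0 | awk '/^mdts/ {print $3}')
echo "max transfer size: $(( (1 << mdts) * 4096 )) bytes"   # 262144 bytes == 256KiB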
So our single 1048576-byte write is split into 4 requests, each referencing a 256KiB host buffer containing our data.
Once the request is split, Action: G (get request) shows the block layer allocating a struct request container for each chunk. Action: D (issued) marks the request being handed to the NVMe driver; at that point the driver places the command into a submission queue and 'rings the doorbell' so the controller knows a request has been submitted and can fetch and process it.
Lastly, we have Action: C which is when the Host CPU receives the completion entry and can determine the results of the I/O operation.
This is the fun part!
We can get a good look into the I/O operations and how they're performing based on the data above. We can track which request is which by its sector. We know the pwrite64 from Aerospike is 1048576 bytes, and we also know it's being chunked into 256KiB-sized commands. Each Action: X shows the starting sector (or offset) of the chunk. The original starting sector is on the Action: Q line, and each subsequent chunk starts at that initial sector plus a multiple of 512 sectors.
Please note that the Linux kernel tracepoints report block counts in units of 512 bytes (a hardcoded right shift of the byte count by 9 bits). This specific NVMe is formatted with a 4096-byte logical block size, so in reality each 256KiB chunk is 64 device blocks (256KiB / 4096 == 64), but the tracepoints will still report 512 because of that hardcoded 512-byte unit.
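In other words, each 512-sector chunk in the trace is 262144 bytes, which on this 4096-byte formatted device is 64 logical blocks. A quick check of the arithmetic:
# 512 trace sectors x 512 bytes == 262144 bytes; divided by the 4096-byte block size == 64 device blocks
echo $(( 512 * 512 )) $(( 512 * 512 / 4096 ))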
There are a few metrics we can use to roughly measure device latency. Let's explain them in terms of the trace actions above (a sketch using btt, which automates these calculations, follows the list):
Q2C (Queue to Complete): This metric measures the time from when an I/O is queued at the block layer (Action: Q) until its completion event (Action: C). It reflects the entire round-trip time for processing an I/O request.
Q2G (Queue to Get request): Q2G represents the time from when an I/O is queued (Action: Q) until a struct request container is allocated for it (Action: G). It doesn't include the time for the I/O operation to complete; it only measures how long the I/O waits before a request container is available.
D2C (Dispatch to Complete): D2C measures the time from when the request is issued to the driver (Action: D) until its completion event (Action: C). It captures the time spent by the driver and the device servicing the I/O request.
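As mentioned above, the blktrace suite also ships a tool called btt that computes these per-phase timings (Q2C, D2C, and more) across an entire trace, so you don't have to eyeball individual lines. A rough invocation, assuming the file names from the earlier capture:
# Dump the parsed events into a binary file that btt understands, then summarize per-phase latencies.
blkparse -i nvme0n1 -d nvme0n1.bin > /dev/null
btt -i nvme0n1.bin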
Now let’s analyze a few lines for a single split request and calculate its latency.
Time: 0.000000000, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: A, RWBS: WFS, Bytes: 1048576, Sector: 10264576, Blocks: 2048, SequenceNum: 1
Time: 0.000002794, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: Q, RWBS: WFS, Bytes: 1048576, Sector: 10264576, Blocks: 2048, SequenceNum: 2
Time: 0.000008940, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: X, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 3
Time: 0.000015644, PID: 2007033, Device: 259,0 CPU: 6, Command: asd, Action: G, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 4
Time: 0.000057549, PID: 331, Device: 259,0 CPU: 6, Command: kworker/6:1H, Action: D, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 5
Time: 0.038976446, PID: 0, Device: 259,0 CPU: 7, Command: swapper/7, Action: C, RWBS: WFS, Bytes: 262144, Sector: 10264576, Blocks: 512, SequenceNum: 1
In this case, our Q2G is a bit higher because we have to chunk the request into 256KiB segments. This has a total time of (0.000015644 - 0.000002794 == 0.00001285, or 12.85us).
However, if we look at the delta between the split ('X') and the request allocation ('G'), it is only 6.704us.
D2C spans from the dispatch to the driver until the completion event. We can see this 256KiB chunk took roughly 38.92ms at the device level, which is quite long.
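Just to double-check those deltas, here is a small awk sketch with the values copied from the trace lines above:
awk 'BEGIN {
    printf "Q2G: %.6f ms\n", (0.000015644 - 0.000002794) * 1000   # 0.012850 ms == 12.85us
    printf "X2G: %.6f ms\n", (0.000015644 - 0.000008940) * 1000   # 0.006704 ms == 6.704us
    printf "D2C: %.6f ms\n", (0.038976446 - 0.000057549) * 1000   # 38.918897 ms
}'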
This particular NVMe is pretty cheap and has low specs. It's rated for only 1.1GB/s writes and 2.4GB/s reads. If I run an fio benchmark on one of the partitions, we can see the latencies are pretty bad for the device:
fio output:
❯ sudo fio nvme-rand-read.fio
nvme0n1p2: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=16
...
fio-3.28
Starting 16 processes
Jobs: 16 (f=16): [W(16)][2.0%][w=1142MiB/s][w=1141 IOPS][eta 29m:03s]
nvme0n1p2: (groupid=0, jobs=16): err= 0: pid=4008446: Thu Aug 31 12:34:32 2023
write: IOPS=1151, BW=1159MiB/s (1216MB/s)(34.2GiB/30234msec); 0 zone resets
slat (usec): min=61, max=3305, avg=177.16, stdev=41.47
clat (msec): min=29, max=897, avg=221.15, stdev=160.91
lat (msec): min=29, max=897, avg=221.33, stdev=160.91
clat percentiles (msec):
| 1.00th=[ 32], 5.00th=[ 36], 10.00th=[ 43], 20.00th=[ 70],
| 30.00th=[ 100], 40.00th=[ 140], 50.00th=[ 190], 60.00th=[ 241],
| 70.00th=[ 300], 80.00th=[ 355], 90.00th=[ 443], 95.00th=[ 531],
| 99.00th=[ 684], 99.50th=[ 726], 99.90th=[ 852], 99.95th=[ 869],
| 99.99th=[ 894]
bw ( MiB/s): min= 328, max= 3044, per=100.00%, avg=1161.48, stdev=37.19, samples=959
iops : min= 328, max= 3044, avg=1161.32, stdev=37.19, samples=959
lat (msec) : 50=13.54%, 100=17.19%, 250=31.38%, 500=32.24%, 750=5.90%
lat (msec) : 1000=0.44%
cpu : usr=0.33%, sys=1.18%, ctx=34756, majf=0, minf=932
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,34809,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=1159MiB/s (1216MB/s), 1159MiB/s-1159MiB/s (1216MB/s-1216MB/s), io=34.2GiB (36.8GB), run=30234-30234msec
Disk stats (read/write):
nvme0n1: ios=30/162631, merge=0/0, ticks=40/35369585, in_queue=35369625, util=99.94%
The p99 is sitting at 684 milliseconds! This works fine for some of my use cases where it's just low write rates and occasional reads, but use cases that depend on high throughput would hit this bottleneck very quickly.
Let's quickly look at an I/O operation from blktrace on a GCP production instance:
Time: 0.437987441, PID: 3052, Device: 259,11 CPU: 47, Command: asd, Action: Q, RWBS: WFS, Bytes: 1048576, Sector: 360308736, Blocks: 2048, SequenceNum: 40
Time: 0.437989005, PID: 3052, Device: 259,11 CPU: 47, Command: asd, Action: G, RWBS: WFS, Bytes: 1048576, Sector: 360308736, Blocks: 2048, SequenceNum: 41
Time: 0.437992731, PID: 3052, Device: 259,11 CPU: 47, Command: asd, Action: D, RWBS: WFS, Bytes: 1048576, Sector: 360308736, Blocks: 2048, SequenceNum: 42
Time: 0.438313389, PID: 3424, Device: 259,11 CPU: 45, Command: asd, Action: C, RWBS: WFS, Bytes: 1048576, Sector: 360308736, Blocks: 2048, SequenceNum: 76
Right off the bat, notice how we don't split the request? This is because this NVMe has an mdts of 2MiB, so it can transfer all 1048576 bytes from the host buffer to the NVMe controller in one go.
Let’s review the metrics:
Q2C - 0.000325948 or 325.948us
Q2G - 0.000001564 or 1.564us
D2C - 0.000320658 or 320.658us (This is MUCH faster than the previous ~38.92ms!)
So we can see the longest portion of this I/O operation is the device service time itself. 320 microseconds is comfortably sub-millisecond and can support plenty of write-heavy throughput use cases!
The following awk script tracks the traces, locates the 'Q' and 'C' events for each requested sector, stores some data about the requests, and calculates the delta between the two points:
awk '
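# NOTE: field numbers ($2, $8, $14, $16, $18) correspond to the custom blkparse -f format used above.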
/Action: Q/ {
startTime[$18] = $2 # Store start time using Sector as the key
firstCpu[$18] = $8
}
/Action: C/ {
if ($18 in startTime) {
endTime = $2 # Store end time
sector = $18 # Store sector
rwbs = $14
endCpu = $8
bytes = $16
diff = endTime - startTime[sector]
print "RWBS:", rwbs, "Sector:", sector, "Bytes:", bytes, "CPU Q:", firstCpu[sector], "CPU-C", endCpu, "Start Time:", startTime[sector], "End Time:", endTime, "Time Difference:", diff
delete startTime[sector] # Clean up the start time record
}
}'
Using the blkparse format above, we can chain the awk script onto the parsed output and then sort numerically by the time-difference column to find the top 10 latencies for this disk:
❯ blkparse -f "Time: %5T.%9t, PID: %p, Device: %M,%m CPU: %c, Command: %C, Action: %a, RWBS: %3d, Bytes: %N, Sector: %S, Blocks: %n, SequenceNum: %s\n" -i nvme0n1.blktrace.1 | awk '
/Action: Q/ {
startTime[$18] = $2 # Store start time using Sector as the key
firstCpu[$18] = $8
}
/Action: C/ {
if ($18 in startTime) {
endTime = $2 # Store end time
sector = $18 # Store sector
rwbs = $14
endCpu = $8
bytes = $16
diff = endTime - startTime[sector]
print "RWBS:", rwbs, "Sector:", sector, "Bytes:", bytes, "CPU Q:", firstCpu[sector], "CPU-C", endCpu, "Start Time:", startTime[sector], "End Time:", endTime, "Time Difference:", diff
delete startTime[sector] # Clean up the start time record
}
}' | sort -n -k 20 | tail
RWBS: WS, Sector: 318633984, Bytes: 262144, CPU Q: 15, CPU-C 15, Start Time: 3.940327605, End Time: 4.585299929, Time Difference: 0.644972
RWBS: WS, Sector: 318636032, Bytes: 262144, CPU Q: 15, CPU-C 15, Start Time: 3.951697538, End Time: 4.596854173, Time Difference: 0.645157
RWBS: WS, Sector: 318314496, Bytes: 262144, CPU Q: 15, CPU-C 15, Start Time: 3.821926007, End Time: 4.468323510, Time Difference: 0.646398
RWBS: WS, Sector: 318312448, Bytes: 262144, CPU Q: 15, CPU-C 15, Start Time: 3.736457873, End Time: 4.390702661, Time Difference: 0.654245
RWBS: WS, Sector: 318300160, Bytes: 262144, CPU Q: 15, CPU-C 15, Start Time: 3.599359513, End Time: 4.254486814, Time Difference: 0.655127
RWBS: WS, Sector: 318302208, Bytes: 262144, CPU Q: 15, CPU-C 15, Start Time: 3.616588073, End Time: 4.274781820, Time Difference: 0.658194
RWBS: WS, Sector: 318304256, Bytes: 262144, CPU Q: 15, CPU-C 15, Start Time: 3.651924493, End Time: 4.331813004, Time Difference: 0.679889
RWBS: WS, Sector: 318306304, Bytes: 262144, CPU Q: 15, CPU-C 15, Start Time: 3.663468052, End Time: 4.345605867, Time Difference: 0.682138
RWBS: WS, Sector: 318310400, Bytes: 262144, CPU Q: 15, CPU-C 15, Start Time: 3.691804780, End Time: 4.380341164, Time Difference: 0.688536
RWBS: WS, Sector: 318308352, Bytes: 262144, CPU Q: 15, CPU-C 15, Start Time: 3.670144309, End Time: 4.359024591, Time Difference: 0.68888
[1] blktrace(8) man page: https://linux.die.net/man/8/blktrace