Netty's IO_URING based transport was recently released as an incubator project. Several sources have reported substantial performance gains with it, including
the Netty team's TCP echo test results that demonstrated 2x better throughput compared to EPOLL.
The goal of this post is a CPU and memory performance evaluation of IO_URING in an RPC library implemented with Netty on top of TCP, and a comparison against EPOLL IO.
It focuses on two use cases:
- backend service: a few dozen connections, each serving ~1000 concurrent streams of a few dozen messages each
- frontend service: thousands of connections, each serving dozens of long-running streams, where streams are periodic message batches
IO_URING
IO_URING is the newest API for doing async I/O on Linux.
It has two ring-buffer backed queues shared between the kernel and the application process: one for submission of syscalls that will eventually be executed by the kernel, and one for completion events.
The io_uring_enter syscall is used to notify the kernel about pending entries on the submission queue.
IO_URING may reduce user-kernel space switch overhead, proportionally to the number of entries on the submission queue per io_uring_enter.
Netty: IO_URING based IO
Netty creates one IO_URING instance per event loop; its submission queue can hold 4096 entries (the allowed maximum) by default.
The transport relies on normal syscalls - pollin, pollout for network buffer readability/writability notifications, followed by read and writev for reads/writes. Syscalls are scheduled as entries on the submission queue, and results are consumed on the completion queue.
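For reference, a server wired to this transport looks roughly like the sketch below. The class names come from the netty-incubator-transport-native-io_uring artifact (io.netty.incubator.channel.uring package); the empty initializer is a placeholder for application handlers, not the RPC library's actual setup.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.incubator.channel.uring.IOUringEventLoopGroup;
import io.netty.incubator.channel.uring.IOUringServerSocketChannel;

public final class IOUringServer {
  public static void main(String[] args) throws Exception {
    // One IO_URING instance (submission + completion ring) per event loop thread.
    EventLoopGroup group = new IOUringEventLoopGroup(1);
    try {
      ServerBootstrap bootstrap = new ServerBootstrap()
          .group(group)
          .channel(IOUringServerSocketChannel.class)
          .childHandler(new ChannelInitializer<SocketChannel>() {
            @Override
            protected void initChannel(SocketChannel ch) {
              // application / RPC protocol handlers go here
            }
          });
      bootstrap.bind(8080).sync().channel().closeFuture().sync();
    } finally {
      group.shutdownGracefully();
    }
  }
}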
The transport maintains 1 outstanding read & writev submission per channel; channel writability state and available outbound buffer capacity are updated asynchronously relative to Channel.flush() - only after the corresponding writev result is received on the completion queue.
For this reason there may be only 1 outstanding flush() per Channel - subsequent flushes are ignored until the next write result.
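Since writability flips only after the writev result arrives, application code should not assume that flush() freed outbound buffer space. A minimal sketch of the usual Netty-level pattern - gate writes on Channel.isWritable() and resume from channelWritabilityChanged() - is shown below; hasNext()/produceNext() are hypothetical stand-ins for whatever feeds the stream.

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public final class WritabilityAwareWriter extends ChannelInboundHandlerAdapter {
  @Override
  public void channelActive(ChannelHandlerContext ctx) {
    drain(ctx);
    ctx.fireChannelActive();
  }

  @Override
  public void channelWritabilityChanged(ChannelHandlerContext ctx) {
    // Fires once the pending writev completes and the outbound buffer drains.
    if (ctx.channel().isWritable()) {
      drain(ctx);
    }
    ctx.fireChannelWritabilityChanged();
  }

  private void drain(ChannelHandlerContext ctx) {
    while (ctx.channel().isWritable() && hasNext()) {
      ctx.write(produceNext());
    }
    ctx.flush();
  }

  private boolean hasNext() { return false; }      // hypothetical data source
  private Object produceNext() { return null; }    // hypothetical data source
}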
Comparison with EPOLL
EPOLL is purely an IO notification mechanism - Netty sets it in edge-triggered mode so clients are notified when inbound/outbound buffers become readable/writable for a set of socket descriptors.
Contrary to IO_URING, Channel.flush() does a synchronous writev syscall - channel writability and available outbound buffer capacity are also updated synchronously.
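Edge-triggered mode is the default for Netty's native epoll transport; the option below only makes it explicit. The bootstrap mirrors the IO_URING sketch above and is likewise illustrative rather than the library's actual wiring.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.epoll.EpollChannelOption;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollMode;
import io.netty.channel.epoll.EpollServerSocketChannel;
import io.netty.channel.socket.SocketChannel;

public final class EpollServer {
  public static ServerBootstrap bootstrap() {
    EventLoopGroup group = new EpollEventLoopGroup(1);
    return new ServerBootstrap()
        .group(group)
        .channel(EpollServerSocketChannel.class)
        // already the default; shown to contrast with the IO_URING transport
        .childOption(EpollChannelOption.EPOLL_MODE, EpollMode.EDGE_TRIGGERED)
        .childHandler(new ChannelInitializer<SocketChannel>() {
          @Override
          protected void initChannel(SocketChannel ch) {
            // application / RPC protocol handlers go here
          }
        });
  }
}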
Setup
The test setup is 1 client and 1 server sharing a single host - a commodity box with 12 vCPU/32GB/1GBit ethernet.
The RPC library is part of the not yet publicly available project com.jauntsdn.rsocket.
Kernel version: 5.10.1
Java version: 1.8.0_282 OpenJDK 64-Bit Server VM
Netty version: 4.1.59.Final
Transport IO_URING version: 0.0.4.Final
Backend service: streams throughput
1K concurrent finite streams per connection, 32 messages per stream, message size of 128 bytes, 30 seconds ramp up.
Connection count varies in the range 16-64.
Connection count   IO transport   Host CPU usage, Server/Client
16                 EPOLL          0.05 / 0.05
                   IO_URING       0.05 / 0.06
32                 EPOLL          0.10 / 0.13
                   IO_URING       0.11 / 0.16
48                 EPOLL          0.15 / 0.24
                   IO_URING       0.16 / 0.26
64                 EPOLL          0.20 / 0.37
                   IO_URING       0.22 / 0.38
Frontend service: periodic message batches
An infinite stream of message batches: 20K connections, 30 streams per connection, 10 messages every 10 seconds, 100 seconds ramp up.
Message size varies in the range 128 - 512 bytes.
Message size, bytes   IO transport   Host CPU usage, Server/Client
128                   EPOLL          0.09 / 0.22
                      IO_URING       0.10 / 0.22
256                   EPOLL          0.09 / 0.23
                      IO_URING       0.11 / 0.23
512                   EPOLL          0.09 / 0.23
                      IO_URING       0.11 / 0.24
The numbers follow the backend service use case: a slight IO_URING disadvantage for both reads and writes.
For both cases flame graphs showed no interesting discrepancies between EPOLL and IO_URING on client and server, other than 7% of total CPU time spent in native code of IO_URING (I omit the graphs for brevity as they are not very readable in the context of a blog).
This correlates with the observed measurements. It seems the number of IO syscalls scheduled per io_uring_enter is so small that the cost of io_uring_enter itself outweighs the context switch and JNI overhead of individual reads/writes.
The next step would be measuring the number of entries per submit and the writev iovec size, but the transport does not offer any metrics, and I don't have cycles to implement them at the moment.
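A crude application-side approximation, without touching the transport, is to count outbound write and flush calls with a pipeline handler like the hypothetical one below. It cannot see submission queue batching or iovec sizes, but the writes-per-flush ratio gives a rough feel for how much work each syscall carries.

import java.util.concurrent.atomic.LongAdder;

import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;

// Rough proxy only: counts write()/flush() calls in this pipeline,
// not actual submission queue entries or writev iovec sizes.
public final class WriteFlushCounter extends ChannelDuplexHandler {
  private final LongAdder writes = new LongAdder();
  private final LongAdder flushes = new LongAdder();

  @Override
  public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
    writes.increment();
    ctx.write(msg, promise);
  }

  @Override
  public void flush(ChannelHandlerContext ctx) {
    flushes.increment();
    ctx.flush();
  }

  public double writesPerFlush() {
    long f = flushes.sum();
    return f == 0 ? 0.0 : (double) writes.sum() / f;
  }
}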
Frontend service: periodic messages
An infinite stream of messages: 1 message every 1 second, 100 streams per connection, 8K connections, 200 seconds ramp up.
Message size varies in the range 128 - 512 bytes.
Message size, bytes   IO transport   Host CPU usage, Server/Client
128                   EPOLL          0.24 / 0.24
                      IO_URING       0.18 / 0.12
256                   EPOLL          0.26 / 0.24
                      IO_URING       0.18 / 0.13
512                   EPOLL          0.29 / 0.25
                      IO_URING       0.20 / 0.16
Now IO_URING demonstrates better CPU performance for both reads and writes.
Client flamegraphs for the periodic case have one peculiarity in the native call stack:
io_queue_async_work
-------------
io_queue_sqe
while for the batch use case there is:
io_read | io_write
-------------
io_issue_sqe
-------------
io_queue_sqe
The difference is caused by the IOSQE_ASYNC flag.
It is set by the transport if the submission queue size exceeds a given threshold on io_uring_enter, 25 by default.
Forcing the IOSQE_ASYNC flag off did not have an observable negative impact on CPU performance, but gave a valuable hint - the periodic case performs better mainly because more entries are enqueued per submit compared to the batch case.
To summarize, IO_URING transport gains vary by use case, and it may perform worse than EPOLL - profile your application to make an informed choice.
Single stream throughput
A single continuous infinite stream over a single connection, 1K message size, window size of 700K messages.
The window size is intentionally set to an extreme value to highlight IO_URING transport specifics.
This case is not realistic and is intended to demonstrate the async nature of flushes with IO_URING.
EPOLL
Throughput, messages per second
11:08:31.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2994233
11:08:32.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 3008608
11:08:33.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 3006733
11:08:34.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 3003608
11:08:35.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2902706
11:08:36.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 3002358
11:08:37.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2986733
11:08:38.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2974233
11:08:39.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 3002983
11:08:40.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2998608
IO_URING
Throughput, messages per second
11:36:30.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1462272
11:36:31.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 871424
11:36:32.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 777216
11:36:33.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2249728
11:36:34.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1446912
11:36:35.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1747968
11:36:36.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1555456
11:36:37.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1555456
11:36:38.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1555456
11:36:39.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1556478
11:36:40.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2057216
11:36:41.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1763328
11:36:42.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1173504
The single stream throughput test is not in favor of the IO_URING transport: 0.8 - 2.3 million messages per second versus 3 million with EPOLL.
To explain the instability, let's look into allocations & GC.
Garbage production
EPOLL
Server GC over 1 minute with EPOLL
GC time: 1 collections, 8.468 ms Last Cause: G1 Evacuation Pause
Server peak heap memory usage is around 35 MBytes.
Small allocation rates are expected given that the library under test creates next to no per message garbage, and the request bytebuffer is reused for the response stream.
The small heap size is a consequence of sync flushes under EPOLL IO - there is a limited number of in-flight ByteBufs and backing native memory, so they are pooled efficiently.
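One way to verify this without a profiler - assuming the default pooled allocator is used - is to sample its metrics during the run (standard Netty API):

import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

public final class AllocatorStats {
  public static void main(String[] args) {
    // Shows how much native (direct) and heap memory the pooled allocator currently holds.
    PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
    System.out.println("used direct memory, bytes: " + metric.usedDirectMemory());
    System.out.println("used heap memory, bytes: " + metric.usedHeapMemory());
    System.out.println("direct arenas: " + metric.numDirectArenas());
  }
}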
IO_URING
Server GC over 1 minute with IO_URING
GC time: 67 collections, 24.933 s Last Cause: G1 Evacuation Pause
Server peak heap memory usage is 3.7 GBytes.
Here are the top allocations:
bytes percent samples top
---------- ------- ------- ---
6177757184 64.87% 5898 io.netty.channel.ChannelOutboundBuffer$Entry
2543085568 26.70% 2432 io.netty.util.Recycler$DefaultHandle
793051136 8.33% 761 io.netty.buffer.PooledUnsafeDirectByteBuf
6279168 0.07% 6 io.netty.buffer.PoolSubpage
3141632 0.03% 3 io.netty.channel.unix.IovArray
The large number of ChannelOutboundBuffer entries for native buffers awaiting scheduling on the submission queue is a consequence of async flushing with IO_URING.
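If this growth is a concern, the standard Netty mitigation is to tighten the write-buffer water marks so Channel.isWritable() flips to false sooner and the writer backs off (as in the writability sketch earlier). The values below are illustrative only, not recommendations for this workload.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

public final class WaterMarks {
  static ServerBootstrap configure(ServerBootstrap bootstrap) {
    // Once pending outbound bytes exceed the high water mark, isWritable() returns false
    // until they drain below the low water mark.
    return bootstrap.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
        new WriteBufferWaterMark(64 * 1024, 128 * 1024));
  }
}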
Switching to a more realistic 70K message window still demonstrates the IO_URING transport's tendency toward elevated memory usage and GC, with a lower throughput limit:
13:01:16.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2021374
13:01:17.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2206720
13:01:18.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2254848
13:01:19.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2333696
13:01:20.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2410496
13:01:21.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2227200
13:01:22.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2284544
13:01:23.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2331648
13:01:24.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2255872
GC time: 35 collections, 2.196 s Last Cause: G1 Evacuation Pause
Peak heap usage is 850 MBytes - an order of magnitude more than with EPOLL.
Single stream throughput is 2.2 - 2.4 million messages per second, versus 3 million with EPOLL. While such a workload rarely appears in practice, the numbers are an indicator of per message round trip overhead that manifests itself in production workloads, but to a smaller extent.