IO_URING for Netty based RPC: quick analysis

March 2, 2021
io_uring netty

The Netty IO_URING based transport was recently released as an incubator project. Several sources have reported substantial performance gains with it, including the Netty team's TCP echo test results, which demonstrated 2x better throughput compared to EPOLL.

The goal of this post is to evaluate the CPU and memory performance of IO_URING in an RPC library implemented with Netty on top of TCP, and to compare it against EPOLL-based IO.

It focuses on two use cases: a backend service exchanging high-throughput streams over a small number of connections, and a frontend service delivering periodic messages over many connections.

IO_URING

IO_URING is the newest API for asynchronous I/O on Linux.

IO_URING has two ring-buffer-backed queues shared between the kernel and the application process: one for submission of operations that will eventually be executed by the kernel, and one for completion events.

The io_uring_enter syscall notifies the kernel about pending entries on the submission queue.

IO_URING may reduce user/kernel space switch overhead; the savings are proportional to the number of entries placed on the submission queue per io_uring_enter call.

Netty: IO_URING based IO

Netty creates one IO_URING instance per event loop; by default its submission queue can hold 4096 entries (the allowed maximum).

The transport relies on the usual operations - POLLIN/POLLOUT for network buffer readability/writability notifications, followed by read and writev for actual reads and writes. These are scheduled as entries on the submission queue, and their results are consumed from the completion queue.
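For orientation, wiring the incubator transport into a Netty server looks roughly like the sketch below. The class names come from the io.netty.incubator.channel.uring package of the 0.0.4.Final artifact; the port and the empty initializer are placeholders.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelInitializer;
import io.netty.incubator.channel.uring.IOUring;
import io.netty.incubator.channel.uring.IOUringEventLoopGroup;
import io.netty.incubator.channel.uring.IOUringServerSocketChannel;

public final class IOUringServer {
  public static void main(String[] args) throws Exception {
    if (!IOUring.isAvailable()) {
      throw new IllegalStateException("io_uring transport is not available on this kernel");
    }
    // One IO_URING instance (submission + completion ring) is created per event loop.
    IOUringEventLoopGroup group = new IOUringEventLoopGroup(1);
    try {
      ServerBootstrap bootstrap = new ServerBootstrap()
          .group(group)
          .channel(IOUringServerSocketChannel.class)
          .childHandler(new ChannelInitializer<Channel>() {
            @Override
            protected void initChannel(Channel ch) {
              // ch.pipeline().addLast(...); // application handlers go here
            }
          });
      bootstrap.bind(8080).sync().channel().closeFuture().sync();
    } finally {
      group.shutdownGracefully();
    }
  }
}

The rest of the pipeline is transport-agnostic; only the event loop group and channel class differ from the NIO or epoll variants.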

The transport maintains one outstanding read and one outstanding writev submission per channel; channel writability state and available outbound buffer capacity are updated asynchronously relative to Channel.flush() - only after the corresponding writev result is received on the completion queue.

For this reason there may be only one outstanding flush() per channel - subsequent flushes are ignored until the next write result arrives.

Comparison with EPOLL

EPOLL is purely an IO notification mechanism - Netty uses it in edge-triggered mode, so the application knows when inbound/outbound buffers become readable/writable for a set of socket descriptors.

Contrary to IO_URING, Channel.flush() performs a synchronous writev syscall - channel writability and available outbound buffer capacity are also updated synchronously.
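The epoll variant of the same bootstrap differs only in the event loop group and channel class; the edge-triggered mode mentioned above is Netty's default for this transport, but the sketch below sets it explicitly for clarity (again, port and handlers are placeholders).

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.epoll.EpollChannelOption;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollMode;
import io.netty.channel.epoll.EpollServerSocketChannel;

public final class EpollServer {
  public static void main(String[] args) throws Exception {
    EpollEventLoopGroup group = new EpollEventLoopGroup(1);
    try {
      ServerBootstrap bootstrap = new ServerBootstrap()
          .group(group)
          .channel(EpollServerSocketChannel.class)
          // Edge-triggered epoll is already the default; shown here only to make it explicit.
          .childOption(EpollChannelOption.EPOLL_MODE, EpollMode.EDGE_TRIGGERED)
          .childHandler(new ChannelInitializer<Channel>() {
            @Override
            protected void initChannel(Channel ch) {
              // ch.pipeline().addLast(...); // application handlers go here
            }
          });
      bootstrap.bind(8080).sync().channel().closeFuture().sync();
    } finally {
      group.shutdownGracefully();
    }
  }
}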

Setup

The test setup is one client and one server sharing a single host - a commodity box with 12 vCPUs, 32 GB RAM and 1 GBit ethernet.

The RPC library is part of the not yet publicly available project com.jauntsdn.rsocket.

Kernel version: 5.10.1
Java version: 1.8.0_282 OpenJDK 64-Bit Server VM
Netty version: 4.1.59.Final
Transport IO_URING version: 0.0.4.Final

Backend service: streams throughput

1K concurrent finite streams per connection, 32 messages per stream, message size is 128 bytes, 30 seconds ramp up.

Connection count varies in the range 16-64.

Connection    IO transport         Host CPU usage,
count                              Server/Client

16            EPOLL                0.05 / 0.05
              IO_URING             0.05 / 0.06

32            EPOLL                0.10 / 0.13
              IO_URING             0.11 / 0.16

48            EPOLL                0.15 / 0.24
              IO_URING             0.16 / 0.26

64            EPOLL                0.20 / 0.37
              IO_URING             0.22 / 0.38

Frontend service: periodic message batches

Infinite stream of message batches: 20K connections, 30 streams per connection, 10 messages every 10 seconds, 100 seconds ramp up.

Message size varies in the range 128-512 bytes.

Message       IO transport      Host CPU usage,
size, bytes                     Server/Client

128           EPOLL             0.09 / 0.22
              IO_URING          0.10 / 0.22

256           EPOLL             0.09 / 0.23
              IO_URING          0.11 / 0.23

512           EPOLL             0.09 / 0.23
              IO_URING          0.11 / 0.24

The numbers follow the backend service use case: a slight IO_URING disadvantage for both reads and writes.

For both cases, flame graphs showed no interesting discrepancies between EPOLL and IO_URING on either client or server, other than 7% of total CPU time spent in IO_URING native code (I omit the graphs for brevity as they are not very readable in the context of a blog).

This correlates with the observed measurements. It seems the number of IO operations scheduled per io_uring_enter is so small that the cost of io_uring_enter itself outweighs the context switch and JNI overhead of individual reads/writes.

The next step would be measuring the number of entries per submit and the writev iovec size, but the transport does not offer any metrics, and I don't have cycles to implement them at the moment.
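As a crude application-side proxy (not the kernel-side metric above), one could count write() and flush() calls with a duplex handler and look at the ratio; the handler below is my own sketch, not part of the transport, and only hints at how much work each flush - and hence each submission batch - carries.

import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import java.util.concurrent.atomic.LongAdder;

public final class WriteFlushCounter extends ChannelDuplexHandler {
  private final LongAdder writes = new LongAdder();
  private final LongAdder flushes = new LongAdder();

  @Override
  public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
    writes.increment();
    ctx.write(msg, promise);
  }

  @Override
  public void flush(ChannelHandlerContext ctx) {
    flushes.increment();
    ctx.flush();
  }

  // Average number of writes coalesced into each flush so far.
  public double writesPerFlush() {
    long f = flushes.sum();
    return f == 0 ? 0.0 : (double) writes.sum() / f;
  }
}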

Frontend service: periodic messages

Infinite stream of messages: 1 message every 1 second, 100 streams per connection, 8K connections, 200 seconds ramp up.

Message size varies in the range 128-512 bytes.

Message       IO transport      Host CPU usage,
size, bytes                     Server/Client

128           EPOLL             0.24 / 0.24
              IO_URING          0.18 / 0.12

256           EPOLL             0.26 / 0.24
              IO_URING          0.18 / 0.13

512           EPOLL             0.29 / 0.25
              IO_URING          0.20 / 0.16

Now IO_URING demonstrates better CPU performance for both reads and writes.

Client flame graphs for the periodic case have one peculiarity in the native call stack:

io_queue_async_work
-------------
io_queue_sqe

while for the batch use case there is:

io_read | io_write
-------------
io_issue_sqe
-------------
io_queue_sqe

The difference is caused by the IOSQE_ASYNC flag. The transport sets it if the submission queue size exceeds a given threshold at io_uring_enter time, 25 by default.

Forcing the IOSQE_ASYNC flag off did not have an observable negative impact on CPU performance, but it gave a valuable hint - the periodic case performs better mainly because more entries are enqueued per submit compared to the batch case.

To summarize, IO_URING transport gains vary by use case, and it may perform worse than EPOLL - profile your application to make an informed choice.

Single stream throughput

Single continuous infinite stream over single connection, 1K message size, window size is 700K messages.

The window size is intentionally set to an extreme value to highlight IO_URING transport specifics.

This case is not realistic; it is intended to demonstrate the asynchronous nature of flushes with IO_URING.

EPOLL

Throughput, messages per second

11:08:31.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2994233
11:08:32.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 3008608
11:08:33.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 3006733
11:08:34.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 3003608
11:08:35.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2902706
11:08:36.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 3002358
11:08:37.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2986733
11:08:38.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2974233
11:08:39.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 3002983
11:08:40.656 netty-io-transport-epoll-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2998608

IO_URING

Throughput, messages per second

11:36:30.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1462272
11:36:31.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 871424
11:36:32.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 777216
11:36:33.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2249728
11:36:34.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1446912
11:36:35.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1747968
11:36:36.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1555456
11:36:37.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1555456
11:36:38.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1555456
11:36:39.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1556478
11:36:40.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2057216
11:36:41.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1763328
11:36:42.598 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 1173504

The single stream throughput test is not in favor of the IO_URING transport: 0.8-2.3 million messages per second versus 3 million with EPOLL.

To explain the instability, let's look into allocations and GC.

Garbage production

EPOLL

Server GC over 1 minute with EPOLL
GC time: 1 collections, 8.468 ms Last Cause: G1 Evacuation Pause

Server peak heap memory usage is around 35 MBytes.

Small allocation rates are expected given that the library under test creates virtually no per-message garbage, and the request ByteBuf is reused for the response stream.

The small heap size is a consequence of synchronous flushes under EPOLL IO - there is a limited number of in-flight ByteBufs and backing native memory, so they are pooled efficiently.

IO_URING

Server GC over 1 minute with IO_URING
GC time: 67 collections, 24.933 s Last Cause: G1 Evacuation Pause

Server peak heap memory usage is 3.7 GBytes.

Here are the top allocations:

  bytes       percent  samples  top
  ----------  -------  -------  ---
  6177757184   64.87%     5898  io.netty.channel.ChannelOutboundBuffer$Entry
  2543085568   26.70%     2432  io.netty.util.Recycler$DefaultHandle
   793051136    8.33%      761  io.netty.buffer.PooledUnsafeDirectByteBuf
     6279168    0.07%        6  io.netty.buffer.PoolSubpage
     3141632    0.03%        3  io.netty.channel.unix.IovArray

The large number of ChannelOutboundBuffer entries for native buffers awaiting scheduling on the submission queue is a consequence of asynchronous flushing with IO_URING.
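If an application needs to bound this backlog, one option is to configure the channel's write-buffer water marks and only write while the channel reports itself writable. The sketch below uses illustrative limits and a placeholder message source; note that under this transport writability trails the writev completions, so the bound is softer than with EPOLL.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

public final class BoundedWriter extends ChannelInboundHandlerAdapter {

  // Illustrative limits: the channel becomes non-writable above 256 KiB of
  // queued outbound data and writable again below 64 KiB.
  static ServerBootstrap configure(ServerBootstrap bootstrap) {
    return bootstrap.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
        new WriteBufferWaterMark(64 * 1024, 256 * 1024));
  }

  private void writeWhileWritable(ChannelHandlerContext ctx) {
    // Produce only while queued outbound bytes are below the high water mark;
    // with io_uring this state trails the writev completion events.
    while (ctx.channel().isWritable()) {
      ctx.write(nextMessage());
    }
    ctx.flush();
  }

  @Override
  public void channelWritabilityChanged(ChannelHandlerContext ctx) {
    // Resume once queued bytes drop below the low water mark.
    if (ctx.channel().isWritable()) {
      writeWhileWritable(ctx);
    }
    ctx.fireChannelWritabilityChanged();
  }

  private Object nextMessage() {
    return Unpooled.EMPTY_BUFFER; // placeholder for the application's message source
  }
}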

Switching to a more realistic 70K message window still demonstrates the IO_URING transport's tendency toward elevated memory usage and GC, with a lower throughput limit:

13:01:16.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2021374
13:01:17.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2206720
13:01:18.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2254848
13:01:19.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2333696
13:01:20.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2410496
13:01:21.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2227200
13:01:22.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2284544
13:01:23.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2331648
13:01:24.665 netty-io-transport-io_uring-1-1 com.jauntsdn.rsocket.test.client.Main client received messages: 2255872

GC time: 35 collections, 2.196 s Last Cause: G1 Evacuation Pause

Peak heap usage is 850 MBytes - an order of magnitude more than with EPOLL.

Single stream throughput is 2.2-2.4 million messages per second, versus 3 million with EPOLL. While such a workload rarely appears in practice, the numbers are an indicator of per-message round-trip overhead that manifests itself on production workloads, though to a smaller extent.
