Serving one million streams. Part 1. RSocket limits on the JVM

June 24, 2020

RSocket Mstreams java

One of the interesting topics is estimation of RSocket limits in setup that approximates gateway serving tens of thousands clients.

There are numerous reports focusing on few streams over single connection, with emphasis on huge throughput. However this mode is not intrinsical for real server. Even though gateways are more likely to be cpu bound than IO bound, we can relax this assumption by making server logic trivial (just echo back incoming messages), to see how RSocket behaves under idealized conditions when major resource consumption is related to IO.

We use jauntsdn/rsocket to have http2 based transports, and metrics that are actually useful (RSocket/RSocket-java metrics do not distinguish inbound/outbound messages, generate redundant time series and seems were not designed with efficient querying in mind - at least for Prometheus case. Also numerous vulnerabilities (1 , 2, 3, 4, 5) that were not resolved despite it’s post 1.0.0 GA hint that library is not yet intended for internet facing services - use case we are interested in here).

Goals

Multiple transports: tcp, websocket, http2, websocket-over-http2.
Runs both locally and in data center (aws), ops/automation friendly.
Distributed load clients.
Keep-It-simple with test configuration options for load clients: number of connections per instance, number of streams per connection, ramp up interval.
Test only long-running request-stream interactions - dominating model for clients connecting over the internet - browsers & mobile applications.

Software

Software stack is comprised of stable & widely adopted components:

Docker for building client/server images & running containers. Test application images are created & saved to configured docker registry as part of project build process.
Nomad for containers scheduling with built-in docker driver.
Consul for service discovery. We register 2 services for load server instance

rsocket-stress-test-server for RSocket load itself.
rsocket-stress-test-server-metrics for http server exposing /metrics endpoint scraped by Prometheus.

Prometheus stores host, JVM & server metrics scraped from rsocket-stress-test-server-metrics endpoint.
Grafana does metrics visualizations used to evaluate performance.

Load server contains RSocket server parameterized with network/IO transports and security configurations. Responder request-stream echoes back batches of original message at given intervals (effectively 1 message per second).

Load clients are configured with number of connections/streams, ramp up interval. They start new streams monotonically until target count is reached. Request message contains random bytes, and is 1KB in size.

Both clients and servers use G1GC with default settings.

The application can be deployed locally, running load test with local infrastructure is a matter of changing few urls:

export DOCKER_REPOSITORY_ADDRESS=localhost:5000
export CONSUL_ADDRESS=localhost:8500
export NOMAD_ADDR=http://localhost:4646

Hardware

Server is equivalent of AWS c4.2xlarge - mainstream VM with 8 vCPU / 16Gb RAM / 1000MBit network.

Clients are c4.xlarge having 4 vCPU / 8Gb RAM / 1000MBit network each.

Comparable resources may be allocated on common developer’s box that helps with local experiments.

Test cases

Test application consists of 2 load clients connecting single load server instance.

Each load client is configured with env variables:

export CONNECTIONS=5000
export STREAMS=25
export RAMP_UP_PERIOD=250 #seconds

This results in 125k simultaneous streams per client, 250k streams per server over 10k connections.

Startup script generates Nomad’s job specification files from templates by substituting
configuration env variables, then asks Nomad to schedule test application.

First test is for TCP transport, configured for 250k, 500k, 750k and 1000k streams respectively.

We deliberately omit latency measurements until Part 2 - the goal is to determine maximum load before server starts to tip over.

Tcp, 250K simultaneous streams

250k streams over 10k connections, 125 seconds ramp up.

==> Client # of connections: 5000
==> Client # of streams: 25
==> Client ramp up period, seconds: 125

==> Starting server using scripts/server.nomad...

==> Monitoring evaluation "3b158360"
    Evaluation triggered by job "rsocket-stress-test-server"
    Allocation "2eae67ce" created: node "b4fd7677", group "servers"
    Evaluation within deployment: "69514dd5"
    Allocation "2eae67ce" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "3b158360" finished with status "complete"

==> Waiting 15s until server is started...

==> Starting clients using scripts/client.nomad...

==> Monitoring evaluation "00b0c5e7"
    Evaluation triggered by job "rsocket-stress-test-client"
    Allocation "38387f71" created: node "b4fd7677", group "clients"
    Allocation "453a9469" created: node "b4fd7677", group "clients"
    Evaluation within deployment: "4e7c3c74"
    Allocation "38387f71" status changed: "pending" -> "running" (Tasks are running)
    Allocation "453a9469" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "00b0c5e7" finished with status "complete"

==> Stress test started
==> It will be stopped automatically after 1h
==> Can be stopped manually with ./test_stop.sh

The base case has passed with moderate cpu usage: ~ 25%, total memory used peak (heap + non-heap) around 6GB.

Tcp, 500K simultaneous streams

500k streams over 10k connections, 250 seconds ramp up.

Cpu usage grows linearly with streams count: peaked at 50% for cpu, and same level of total memory peak: 6GB.

Tcp, 750K simultaneous streams

750k streams over 10k connections, 375 seconds ramp up.

Test demonstrates the server starts to tip over around 600k streams, 7 GB memory, 60% process cpu - but this time at 100% host cpu usage. The result is an artifact of scheduling both server and clients on same host, and actual streams peak is higher - around 700-750k will consume 70-75% cpu - practical limit for stable service operation.

The number is promising as there are few opportunities for optimizations - as will be shown in part 2 of the post.

Interesting observation - allocation rate peaks at 800MB/s even though load server application code does not allocate directly. Handful of objects instantiated by library per message are significant contribution to total GC pressure once number of messages becomes millions.