I really enjoyed reading how the phoenix-framework people managed to get to two million active websocket connections.
I’ve heard some very smart people say that Haskell has an amazing runtime with very cheap threads. I have no reason to disbelieve them, but we thought it’d be fun to see how Haskell fares in a comparable setup.
Test servers
Unlike the Phoenix people we don’t have Rackspace sponsorship, so we had to resort to the common man’s cheap machines: EC2 spot instances. We bid $0.10 on two m4.xlarge machines with 16 GB of RAM and 4 cores, which usually cost around $0.05 per hour in eu-west.
We’re using Nix to deploy tsung and a very simple Haskell chat program that just broadcasts incoming messages to everyone.
tsung is a TCP/web load generator written in Erlang, configured through an XML domain-specific language (website).
The core handler of our chat program looks like this (full source here):
-- Imports reconstructed for readability: the channel types point at the
-- unagi-chan package and the websocket functions at the websockets
-- package, but see the full source above for the authoritative list.
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan.Unagi (InChan, dupChan, readChan, writeChan)
import Control.Monad (forever)
import Data.ByteString.Lazy (ByteString)
import Network.WebSockets
  (DataMessage (Text), PendingConnection, acceptRequest,
   receiveDataMessage, sendTextData)

handleWS :: InChan ByteString -> PendingConnection -> IO ()
handleWS bcast pending = do
  -- every client dups its own read end of the broadcast channel
  localChan <- dupChan bcast
  connection <- acceptRequest pending

  -- Yes, we're leaking resources here because this forkIO
  -- doesn't terminate when the socket closes :)
  _ <- forkIO $ forever $ do
    message <- readChan localChan
    sendTextData connection message

  -- loop forever, pushing every incoming message onto the broadcast channel
  let loop = do
        Text message <- receiveDataMessage connection
        writeChan bcast message
        loop
  loop
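For context, wiring this up only takes a broadcast channel and a server loop. A minimal sketch, assuming unagi-chan’s newChan and the websockets package’s runServer (the full source linked above is authoritative; the port is the one that shows up in the kernel logs below):

import Control.Concurrent.Chan.Unagi (newChan)
import Network.WebSockets (runServer)

main :: IO ()
main = do
  -- the InChan write end is shared by all clients; each handler
  -- dups its own read end inside handleWS
  (bcast, _readEnd) <- newChan
  runServer "0.0.0.0" 8080 (handleWS bcast)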
To run the EC2 machines we’re using nixops, which also does the spot-price bidding for us:
$ nixops create '<nix/test-setup.nix>'
$ nixops deploy
(See here for the full configuration including kernel tuning).
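The bidding itself is just a couple of attributes on each machine. A rough sketch of the shape, with illustrative values (the real settings live in nix/test-setup.nix):

# Illustrative fragment only; see nix/test-setup.nix for the real thing.
websock-server = { ... }: {
  deployment.targetEnv = "ec2";
  deployment.ec2.region = "eu-west-1";
  deployment.ec2.instanceType = "m4.xlarge";
  # nixops takes the bid in dollar cents per hour, so 10 means $0.10
  deployment.ec2.spotInstancePrice = 10;
};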
tsung setup
Unfortunately I could not get distributed tsung going: the distributed testing uses an Erlang function called slave:start, which connects through SSH and spawns Erlang on the remote host. This failed for reasons I didn’t have time to debug.
But without distributed testing there’s a problem: a single tsung machine can only open ~65,000 connections to the server, because TCP source ports are limited to 16 bits. We want more connections than that!
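As an aside, the practical per-IP ceiling is set by net.ipv4.ip_local_port_range, from which the kernel allocates source ports. Widened to something like the following (an assumed value; the real one is in the linked kernel tuning), that gives roughly 64,000 outgoing connections per source IP:

$ sysctl -w net.ipv4.ip_local_port_range="1024 65535"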
Luckily tsung supports using multiple virtual IP addresses for a single network interface out of the box. We went to Amazon and clicked “Assign new IP” a few times to assign more private IPs to our tsung box.
Now we associate the new IPs with our network interface:
$ ip addr add 172.31.23.115/20 dev eth0
$ ip addr add 172.31.23.113/20 dev eth0
$ ip addr add 172.31.23.114/20 dev eth0
$ ip addr add 172.31.23.112/20 dev eth0
$ ip addr add 172.31.18.80/20 dev eth0
$ ip addr add 172.31.18.81/20 dev eth0
$ ip addr add 172.31.18.82/20 dev eth0
$ ip addr add 172.31.18.83/20 dev eth0
We have a slightly different tsung config from the Phoenix people, which we copy to our tsung box. The differences are that we use the recently added tsung websockets connection type, and that sessions terminate after a number of messages instead of after a delay.
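For those who haven’t seen tsung’s websocket support: sessions get the ts_websocket type, and the extra client IPs from above are listed per client machine. This is a hedged sketch from the tsung manual rather than our actual config (which is linked above); the path, message body, and limits are placeholders:

<clients>
  <client host="tsung-1" maxusers="64000" cpu="4">
    <ip value="172.31.23.115"/>
    <ip value="172.31.23.113"/>
    <!-- ...one <ip> element per assigned address... -->
  </client>
</clients>

<sessions>
  <session name="websocket" probability="100" type="ts_websocket">
    <request>
      <websocket type="connect" path="/"/>
    </request>
    <request>
      <websocket type="message">hello</websocket>
    </request>
  </session>
</sessions>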
$ nixops scp --to tsung-1 code/src/tsung-conf.xml tsung-conf.xml
code/src/tsung-conf.xml -> root@52.31.104.126:tsung-conf.xml
tsung-conf.xml 100% 1494 1.5KB/s 00:00
Running tsung
Our Nix config tuned the TCP stack and increased kernel limits, but we
still need to run ulimit
to make sure we’re not hitting the default
limit of 1024 file descriptors:
$ nixops ssh tsung-1
[root@tsung-1:~]# ulimit -n 2000000
[root@tsung-1:~]# tsung -f tsung-conf.xml start
Starting Tsung
Log directory is: /root/.tsung/log/20151104-1622
tsung exports some data via a web interface on port 8091. We use an extra SSH tunnel so we can access this data at http://127.0.0.1:8091:
$ ssh root@tsung-1 -L 8091:127.0.0.1:8091
Problem 1: The firewall
All our Nix boxes are configured with a firewall enabled. This is because I start from a template configuration instead of starting from scratch.
The firewall uses connection tracking to make decisions, and
connection tracking requires memory. When that memory is full the
dmesg
logs look like this:
[ 2960.570157] nf_conntrack: table full, dropping packet
[ 2960.575060] nf_conntrack: table full, dropping packet
[ 2960.629764] nf_conntrack: table full, dropping packet
[ 2960.678016] nf_conntrack: table full, dropping packet
[ 2992.936177] TCP: request_sock_TCP: Possible SYN flooding on port 8080. Sending cookies. Check SNMP counters.
[ 2998.005969] net_ratelimit: 364 callbacks suppressed
That log also shows that we triggered the kernel’s DoS protection against SYN flooding. We fixed that by increasing net.ipv4.tcp_max_syn_backlog and net.core.somaxconn.
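In NixOS terms these are boot.kernel.sysctl entries. A sketch with assumed values (the real numbers are in the linked kernel tuning), including the conntrack bump that deals with the table-full messages above:

boot.kernel.sysctl = {
  # accept more half-open connections before SYN cookies kick in
  "net.ipv4.tcp_max_syn_backlog" = 4096;        # assumed value
  "net.core.somaxconn" = 4096;                  # assumed value
  # one way to silence "nf_conntrack: table full" (assumed value)
  "net.netfilter.nf_conntrack_max" = 1048576;
};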
Now when running tsung we got up to about 120k connections on the Haskell websocket box:
[root@websock-server:~]# netstat -ntp | grep -v TIME_WAIT | wc -l
119748
Problem 2: The Erlang process limit
But then tsung’s web UI would suddenly throw 500 errors and drop all connections. Initially we could not figure out what was going on because tsung is really slow at writing logs. Waiting for five minutes and then checking the logs revealed the message:
=ERROR REPORT==== 4-Nov-2015::18:03:45 ===
Too many processes
We noticed that tsung supports changing the maximum number of internal Erlang processes and we tried this:
$ tsung -p 1250000 -f tsung-conf.xml start
But no luck: the same problem occurred. It turns out that the -p switch doesn’t actually work (we filed a bug, which was fixed within hours. Thanks!).
We patched tsung ourselves for now.
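Since tsung comes from nixpkgs in our deployment, carrying a local patch is straightforward. A hypothetical sketch (fix-p-switch.patch is a made-up name, and the exact override spelling depends on your nixpkgs version):

# Hypothetical override; fix-p-switch.patch is a placeholder name.
tsung-patched = pkgs.tsung.overrideAttrs (old: {
  patches = (old.patches or []) ++ [ ./fix-p-switch.patch ];
});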
Some performance numbers
So far we had spent most of our time fighting tsung and the slightly bizarre Erlang ecosystem. Here’s what 100k users look like in CPU and memory for the Haskell server:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1944 root 20 0 7210960 2.656g 22524 S 177.7 16.9 2:58.50 haskell-websock
2.6G resident, or roughly 28 kB per connection. Not bad! With all problems fixed we ran another test with 256k users:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2252 root 20 0 11.237g 4.714g 22532 S 128.3 30.1 6:58.25 haskell-websock
More addresses
In order to go higher we needed more IP addresses for tsung. This is where we learnt that EC2 limits the number of additional private IPs based on the instance type. You’ll see a message like this:
eni-5af8fa3d: Number of private addresses will exceed limit.
For m4.xlarge the limit is 15 addresses, so we added another 6:
$ ip addr add 172.31.26.100/20 dev eth0
$ ip addr add 172.31.26.99/20 dev eth0
$ ip addr add 172.31.18.106/20 dev eth0
$ ip addr add 172.31.30.220/20 dev eth0
$ ip addr add 172.31.18.240/20 dev eth0
$ ip addr add 172.31.30.188/20 dev eth0
With 15 addresses in total we should get close to one million connections:
>>> 15 * 64000
960000
But tsung needs more memory than our Haskell server, and it died at ~500k connections:
/run/current-system/sw/bin/tsung: line 60: 29721 Killed [...]
The Haskell server was still running quite comfortably below 10G at 500k connections:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2320 root 20 0 16.879g 9.395g 22300 S 0.0 59.9 14:38.75 haskell-websock
That was certainly a fun afternoon! Time to clean up:
$ nixops destroy
warning: are you sure you want to destroy EC2 machine ‘tsung-1’? (y/N) y
warning: are you sure you want to destroy EC2 machine ‘websock-server’? (y/N) y
The whole experiment took ~2.5 hours and cost us a grand total of $0.25: two machines at roughly $0.05 per hour for 2.5 hours. To get to 4x the connections (two million) we’d need two m4.4xlarge or r3.2xlarge instances, but that’s for another day.
Graphs
Our graphs show very nicely that we add a bit more than 1000 connections per second, and that the connection count follows the user count closely, i.e. the Haskell server accepts connections as fast as tsung opens them.
Some unscientific testing also showed that propagating a message to all 256k clients takes 10-50 milliseconds, so the 2 seconds quoted by the Phoenix team for two million users sounds about right.
256k connections:

~500k connections (tsung died):
