'Hello, world!' HTTP/1.1 servers in C
To learn more about implementing high-performance HTTP APIs, I've implemented a few trivial 'Hello, world!' servers and run some basic benchmarks.
These servers simply respond to any HTTP request with a text/plain response of 'Hello, world!'.
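Concretely, the raw response is something along these lines (the exact headers vary slightly between implementations):

    HTTP/1.1 200 OK
    Content-Type: text/plain
    Content-Length: 13

    Hello, world!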
In this post I'll cover three server implementations:
- Single-threaded
- Thread per connection
- Fixed-size pool of threads
Single-threaded
We start with a bare-bones HTTP socket server implementation: only one connection is handled at a time.
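A minimal sketch of this approach might look like the following (this isn't the actual code from this post; error handling is omitted and the port, headers, and names are illustrative):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static const char RESPONSE[] =
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/plain\r\n"
        "Content-Length: 13\r\n"
        "Connection: close\r\n"
        "\r\n"
        "Hello, world!";

    int main(void) {
        int listener = socket(AF_INET, SOCK_STREAM, 0);
        int yes = 1;
        setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(listener, (struct sockaddr *)&addr, sizeof(addr));
        listen(listener, SOMAXCONN);

        for (;;) {
            /* One connection at a time: accept, read the request, reply, close. */
            int conn = accept(listener, NULL, NULL);
            if (conn < 0)
                continue;

            char buf[4096];
            (void)read(conn, buf, sizeof(buf));                 /* drain (part of) the request */
            (void)write(conn, RESPONSE, sizeof(RESPONSE) - 1);  /* -1: don't send the NUL */
            close(conn);
        }
    }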
Thread per connection
This implementation spawns a new thread for each connection, and each thread is discarded after handling the connection.
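A sketch of the accept loop for this variant, reusing the same listener setup as above (again illustrative, with error handling omitted):

    #include <pthread.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static const char RESPONSE[] =
        "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n"
        "Content-Length: 13\r\nConnection: close\r\n\r\nHello, world!";

    /* Handle a single connection, then let the thread exit. */
    static void *handle_connection(void *arg) {
        int conn = (int)(intptr_t)arg;
        char buf[4096];
        (void)read(conn, buf, sizeof(buf));
        (void)write(conn, RESPONSE, sizeof(RESPONSE) - 1);
        close(conn);
        return NULL;
    }

    static void accept_loop(int listener) {
        for (;;) {
            int conn = accept(listener, NULL, NULL);
            if (conn < 0)
                continue;

            /* One short-lived, detached thread per connection. */
            pthread_t tid;
            pthread_create(&tid, NULL, handle_connection, (void *)(intptr_t)conn);
            pthread_detach(tid);
        }
    }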
Fixed-size pool of threads
This implementation spawns a fixed quantity of threads upon startup and uses them to handle all connections.
A stack is used to hand off the connection sockets from the main thread to the worker threads, protected by a pthread condition and mutex.
The main thread accepts connections and pushes the resulting file descriptors onto a stack, then signals the worker threads to wake up, consume the connections from the stack and reply to the incoming HTTP requests.
(Note: in retrospect, a stack is perhaps not a good choice, as the connections towards the bottom of the stack could suffer from starvation / excess latency).
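A rough sketch of that handoff is shown below (not the post's actual code; the worker count, stack capacity, and names are my own placeholder choices, and the listener setup and response string match the first sketch):

    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NUM_WORKERS 8
    #define STACK_CAP   1024

    static const char RESPONSE[] =
        "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n"
        "Content-Length: 13\r\nConnection: close\r\n\r\nHello, world!";

    /* Stack of accepted connection fds, shared between the main and worker threads. */
    static int             conn_stack[STACK_CAP];
    static int             conn_top;        /* number of fds currently on the stack */
    static pthread_mutex_t stack_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  stack_ready = PTHREAD_COND_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&stack_lock);
            while (conn_top == 0)
                pthread_cond_wait(&stack_ready, &stack_lock);  /* sleep until signalled */
            int conn = conn_stack[--conn_top];                 /* pop the newest connection */
            pthread_mutex_unlock(&stack_lock);

            char buf[4096];
            (void)read(conn, buf, sizeof(buf));
            (void)write(conn, RESPONSE, sizeof(RESPONSE) - 1);
            close(conn);
        }
        return NULL;
    }

    static void serve(int listener) {
        for (int i = 0; i < NUM_WORKERS; i++) {
            pthread_t tid;
            pthread_create(&tid, NULL, worker, NULL);
            pthread_detach(tid);
        }

        for (;;) {
            int conn = accept(listener, NULL, NULL);
            if (conn < 0)
                continue;

            /* Push the fd and wake one sleeping worker. */
            pthread_mutex_lock(&stack_lock);
            if (conn_top < STACK_CAP) {
                conn_stack[conn_top++] = conn;
                pthread_cond_signal(&stack_ready);
            } else {
                close(conn);   /* stack full: shed load */
            }
            pthread_mutex_unlock(&stack_lock);
        }
    }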
I ran the servers on the following hosts:
- Mac Mini (M1, gigabit Ethernet)
  - macOS Monterey
- Macbook Pro (M1, gigabit Ethernet)
  - macOS Ventura
- Dell Optiplex 7050 (i5-7500 @3.8GHz, gigabit Ethernet)
  - Ubuntu 22.10
- Thinkpad T500 (Core 2 Duo P8600 @2.4GHz, gigabit Ethernet)
  - Ubuntu 23.04
- Raspberry Pi 2 Model B (ARMv7 @900MHz, 100Mbit Ethernet)
  - Raspbian Bullseye
- Linksys NSLU2 (ARMv5 @266MHz, 100Mbit Ethernet)
  - Debian Jessie
- PowerMac G5 (PowerPC G5 @2.3GHz x2, gigabit Ethernet)
  - OS X Leopard
- eMac (PowerPC G4 @1.25GHz, 100Mbit Ethernet)
  - OS X Leopard
- PowerMac G4 (PowerPC G4 @500MHz x2, gigabit Ethernet)
  - OS X Tiger
- PowerMac G3 (PowerPC G3 @400MHz, 100Mbit Ethernet)
  - OS X Tiger
Max bandwidth (iperf)
Before we test the HTTP server implementations, let's first gauge the network bandwidth capabilities of these hosts using iperf.
Typical output from iperf:
    cell@indium(master)$ iperf -c plasma
    ------------------------------------------------------------
    Client connecting to plasma, TCP port 5001
    TCP window size:  129 KByte (default)
    ------------------------------------------------------------
    [  1] local 192.168.1.98 port 63271 connected with 192.168.1.176 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [  1] 0.00-10.04 sec  1.09 GBytes   935 Mbits/sec
See the full output: iperf.txt
Summary of results for all hosts:
    127.0.0.1: 68.6 Gbits/sec 🤯
    plasma:    935 Mbits/sec
    flouride:  937 Mbits/sec
    opti7050:  939 Mbits/sec
    thinkpad:  939 Mbits/sec
    pi2b-1:    94.0 Mbits/sec
    nslu2:     72.9 Mbits/sec
    pmacg5:    939 Mbits/sec
    emac3:     94.0 Mbits/sec
    graphite:  496 Mbits/sec
    pmacg3:    93.7 Mbits/sec
Of note are:
- the crazy-high loopback bandwidth to localhost!
- despite having gigabit Ethernet, the dual 500MHz G4 (graphite) can't quite saturate it
- the NSLU2 can't quite saturate 100Mbit with its 266MHz ARM processor
Requests per second (wrk)
I used wrk to load-test these HTTP servers. Typical output:
    cell@indium(master)$ wrk -t8 -c32 -d5s http://flouride:8080
    Running 5s test @ http://flouride:8080
      8 threads and 32 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   542.74us   18.81us   1.41ms   89.78%
        Req/Sec     1.84k    27.67     1.99k    90.20%
      9316 requests in 5.10s, 700.52KB read
    Requests/sec:   1826.70
    Transfer/sec:    137.36KB
I tested each host using -t8 -c32 (8 threads pumping 32 connections) as well as -t8 -c64 (8 threads pumping 64 connections).
First, let's look at the "fast" machines:
Across the board, we see that the threaded implementations far surpass the performance of the single-threaded implementation. No surprise there.
We also see that testing over loopback to the same host (127.0.0.1) is by far the winner. No surprise there either.
What was a bit surprising is that we see virtually no performance difference at all between the thread-per-connection and thread-pool implementations. I would guess this is due to my naivete with pthread conditions / mutexes.
The results for plasma (my work laptop) appear to be anomalous: iperf proved it could saturate gigabit Ethernet, so I have no idea why its HTTP performance here is so poor.
For the three x86_64 machines (including opti7050 and thinkpad), we see that bumping wrk from 32 to 64 connections increases performance by anywhere from 28% to 82%.
Now let's take a look at the "slower" machines:
Interestingly, here we see a slight to significant decrease in performance when jumping from 32 to 64 connections, especially with the G5 (pmacg5).
Coincidentally, the performance of an (older) Raspberry Pi (pi2b-1) very closely matches that of the two G4 machines (graphite and emac3).
It's also interesting to note that a dual-processor 500MHz G4 (graphite) matches the performance of a single-processor 1.25GHz G4 (emac3), and that the dual 500MHz G4 (graphite) is more than twice as fast as the 400MHz G3 (pmacg3).
Surprisingly, threading does not seem to help at all on the 266MHz NSLU2.