fdv/gist:8542918

## gistfile1.md

      
Every article about Nginx optimization talks about using the sendfile, tcp_nodelay and tcp_nopush settings. Unfortunately, none of them explains why they should be used or how they actually work.
A few weeks ago, as we were building the Botify SaaS platform, we started working on Web server performance. Since we rely heavily on peer review to improve the quality of our work, Greg left my pull request open with questions, lots of questions, most of them starting with "Why?".
As we didn't find any obvious answer, we started a journey inside the Linux kernel TCP stack, trying to understand Nginx internals and why we should combine two options as opposed as tcp_nopush and tcp_nodelay.
tcp_nodelay

How can I force a socket to send the data in its buffer? One answer to that tricky question lies in the TCP_NODELAY option described in the Linux tcp(7) man page. When you set the TCP_NODELAY option on a socket, the data written to it is sent immediately, whatever the size of the packet. The Nginx tcp_nodelay setting enables that option at the socket level.
To avoid network congestion, the Linux TCP stack has a mechanism that waits up to 200ms before sending a very small packet. That mechanism is based on Nagle's algorithm, which holds small segments back while previously sent data is still unacknowledged; 200ms is the delay of the Linux implementation.
Nagle's usefulness is hard to understand if you discovered the Internet during or after the first dot-com era. The Internet's main purpose has not always been to transfer Web pages and huge files. Imagine what happens when you connect to a remote computer using telnet. When you press ctrl+c, you send a 1 byte message to the remote machine. If only it were that simple: to your 1 byte ctrl+c, you need to add the IP header (20 bytes for IPv4, 40 for IPv6) and the TCP header (20 bytes). Pressing ctrl+c, you actually send 41 bytes over IPv4 and up to 61 bytes over IPv6. That's where Nagle is useful: instead of flooding your network with lots of headers and small messages, you send more data per packet without adding noticeable latency.
Unfortunately, Nagle is not really suited to the World Wide Web in general, and more precisely to data streams. The probability that a file fills an exact number of full packets is almost nil. That means Nagle will force a delay of up to 200ms before sending the last packet of a file. When you're downloading a Web page made of a few dozen files, this creates a very noticeable latency.
Using the TCP_NODELAY option on a socket disables Nagle, and the data is sent as soon as it is available.
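To make this concrete, here is a minimal sketch of how an application could set the flag itself on Linux. The helper names `set_tcp_nodelay` and `get_tcp_nodelay` are hypothetical and only serve the illustration; this is not Nginx's own code.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable (on = 1) or disable (on = 0) TCP_NODELAY.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_tcp_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Hypothetical helper: read the flag back.
 * Returns 1 if set, 0 if clear, -1 on error. */
int get_tcp_nodelay(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

Calling `set_tcp_nodelay(fd, 1)` on a TCP socket turns Nagle off for that socket only; other sockets keep the default behavior.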
Nginx applies TCP_NODELAY to HTTP keepalive connections. Keepalive connections are sockets kept open after they have sent their data, just in case, instead of opening a new connection and replaying the whole three-way handshake. They also save sockets, since these won't go to FIN_WAIT at the end of each transmission. Connection: Keep-Alive is an HTTP 1.0 option and the default behavior in HTTP 1.1.
When downloading a Web page, TCP_NODELAY saves you up to 200ms per file. That's not bad. When you're dealing with real-time games or high-frequency trading, such latency is a no-go given the small amount of data transferred.
tcp_nopush

The Nginx tcp_nopush setting is the opposite of tcp_nodelay. Instead of reducing network latency, it optimizes the amount of data sent in a single packet.
Since we're living in a perfectly logical world, tcp_nopush won't set the TCP_NOPUSH option of the socket, since that flag only exists on FreeBSD. Running Linux, we're using the well-named TCP_CORK option instead.
TCP_CORK prevents TCP from sending a packet until it is full, that is, until its payload reaches the MSS (maximum segment size). The MSS is equal to the MTU minus 40 to 60 bytes of TCP and IP headers, depending on whether you're running IPv4 or IPv6.
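As a worked example of that subtraction, the sketch below computes the MSS from an MTU. The function name `mss_for_mtu` is hypothetical, and the header sizes assume no IP or TCP options are present.

```c
/* Header sizes without options (assumption for this sketch). */
enum {
    IPV4_HEADER_BYTES = 20,
    IPV6_HEADER_BYTES = 40,
    TCP_HEADER_BYTES  = 20
};

/* MSS = MTU - IP header - TCP header. */
unsigned int mss_for_mtu(unsigned int mtu, int is_ipv6)
{
    unsigned int ip_header = is_ipv6 ? IPV6_HEADER_BYTES : IPV4_HEADER_BYTES;

    return mtu - ip_header - TCP_HEADER_BYTES;
}
```

With an Ethernet-style MTU of 1500 bytes (an assumed value, not one from this article), this gives 1460 bytes over IPv4 and 1440 bytes over IPv6.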
(Diagram to be inserted here.)
This is very well explained in the Linux kernel source:
```c
/* Return false, if packet can be sent now without violation Nagle's rules:
 * 1. It is full sized.
 * 2. Or it contains FIN. (already checked by caller)
 * 3. Or TCP_CORK is not set, and TCP_NODELAY is set.
 * 4. Or TCP_CORK is not set, and all sent packets are ACKed.
 *    With Minshall's modification: all sent small packets are ACKed.
 */

static inline bool tcp_nagle_check(const struct tcp_sock *tp,
                                   const struct sk_buff *skb,
                                   unsigned int mss_now, int nonagle)
{
    return skb->len < mss_now &&
        ((nonagle & TCP_NAGLE_CORK) ||
         (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
}
```
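To experiment with these rules outside the kernel, the same boolean logic can be modelled in user space. This is only a sketch: `model_nagle_check` and the `MODEL_*` constants are hypothetical names, the flag values are assumed to mirror the kernel's TCP_NAGLE_OFF and TCP_NAGLE_CORK bits, and the single `small_unacked` parameter stands in for `tp->packets_out && tcp_minshall_check(tp)`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this model (assumed to mirror the kernel's). */
enum {
    MODEL_NAGLE_OFF  = 1,  /* TCP_NODELAY is set */
    MODEL_NAGLE_CORK = 2   /* TCP_CORK is set */
};

/* Returns true when a segment of `len` bytes must be held back.
 * `small_unacked` means: small packets were sent and are not ACKed yet. */
bool model_nagle_check(size_t len, size_t mss, int nonagle, bool small_unacked)
{
    return len < mss &&
           ((nonagle & MODEL_NAGLE_CORK) ||
            (!nonagle && small_unacked));
}
```

In this model a 100 byte segment with both bits set is still held back, which matches rule 3: TCP_NODELAY only lets a partial packet through when TCP_CORK is not set.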
The most important thing you need to know about TCP_CORK is that it must be explicitly removed from the socket if you want a half-empty packet to be sent right away (or half full, or with an MSS twice as big as needed, take your pick). Otherwise, according to tcp(7), corked output is only flushed once a 200ms ceiling is reached.
The tcp(7) man page explains that TCP_NODELAY and TCP_CORK are mutually exclusive, but they can actually be combined since Linux 2.5.71.
In Nginx, tcp_nopush is only activated along with sendfile, and that's where things start to get interesting.
sendfile

This part deals with the Nginx setting called sendfile, and the Linux system call sendfile(2). To keep them apart, the setting is written sendfile and the system call sendfile(2).
Nginx is well known for its ability to serve static files. That ability comes from the sendfile setting combined with tcp_nodelay and tcp_nopush. sendfile enables the sendfile(2) system call for everything related to file transfer.
sendfile(2) transfers data from one file descriptor to another directly in kernel space. It saves a lot of resources:

sendfile(2) is a syscall, which means the copy is done entirely in kernel space, so the data never travels back and forth between kernel space and user space.
sendfile(2) is used in place of the read(2) + write(2) combination, so you're saving one syscall each time.
sendfile(2) allows zero copy: the data is written from the block device into the kernel buffer using DMA.

Unfortunately, sendfile(2) needs an input file descriptor that supports mmap(2)-like operations. This means you can't use sendfile(2) to read from a UNIX socket, for example to relay dynamic pages from an application server like Rails or Django. As the man page puts it:

The in_fd argument must correspond to a file which supports mmap(2)-like operations (i.e., it cannot be a socket).

sendfile can be totally useless or a must-have, depending on the way you're using Nginx.
If you're sending lots of locally stored static files, or in the case of micro caching, using sendfile is a must to optimize your Web site's performance.
On the other hand, if you're using Nginx as a reverse proxy in front of an application server, sendfile is totally useless.
Let's mix everything together

Things get interesting when you start mixing sendfile, tcp_nodelay and tcp_nopush. I was wondering how I could use two mutually exclusive and opposite options such as tcp_nodelay and tcp_nopush, and I found the answer in a Russian thread on the Nginx mailing list dating from 2005.
When using sendfile, tcp_nopush ensures that the packets are full before sending them. This reduces network overhead, and can reduce the time taken to send static files. When Nginx needs to send the last half packet, Nginx removes the TCP_CORK flag and TCP_NODELAY forces the transmission of the remaining data, saving up to 200ms per file.
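A minimal sketch of that sequence on Linux, under stated assumptions: this is not Nginx's actual code, `send_header_and_file` and `set_tcp_flag` are hypothetical names, and short writes of the header and EINTR are not handled.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set or clear one TCP-level socket option. */
static int set_tcp_flag(int fd, int option, int on)
{
    return setsockopt(fd, IPPROTO_TCP, option, &on, sizeof(on));
}

/* Hypothetical helper: send `header`, then `file_len` bytes of `file_fd`,
 * on the TCP socket `sock`. Returns 0 on success, -1 on error. */
int send_header_and_file(int sock, const char *header,
                         int file_fd, size_t file_len)
{
    off_t offset = 0;
    size_t header_len = strlen(header);

    /* TCP_NODELAY stays set; TCP_CORK takes precedence while it is set. */
    if (set_tcp_flag(sock, TCP_NODELAY, 1) < 0 ||
        set_tcp_flag(sock, TCP_CORK, 1) < 0)
        return -1;

    /* Header and file data are queued together so packets go out full. */
    if (write(sock, header, header_len) != (ssize_t)header_len)
        goto fail;
    while ((size_t)offset < file_len) {
        ssize_t sent = sendfile(sock, file_fd, &offset,
                                file_len - (size_t)offset);
        if (sent <= 0)
            goto fail;
    }

    /* Removing the cork flushes the last partial packet immediately. */
    return set_tcp_flag(sock, TCP_CORK, 0);

fail:
    set_tcp_flag(sock, TCP_CORK, 0);
    return -1;
}
```

The design point is the ordering: the cork is set before the first write and cleared only after the last byte has been queued, so the only partial packet is the final one.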
This behavior is confirmed in the sources of the Linux Kernel TCP stack about TCP_CORK:

When set indicates to always queue non-full frames. Later the user clears this option and we transmit any pending partial frames in the queue. This is meant to be used alongside sendfile() to get properly filled frames when the user (for example) must write out headers with a write() call first and then use sendfile to send out the data parts. TCP_CORK can be set together with TCP_NODELAY and it is stronger than TCP_NODELAY.

Everything was clear to us.
That's all folks! We didn't talk about writev(2) as an alternative to TCP_NOPUSH, as it was out of our scope. I hope we've been clear, and feel free to drop a comment if you think we've missed something.