HTTP Streaming (or Chunked vs Store & Forward)

The standard way of understanding the HTTP protocol is via the request-reply pattern. Each HTTP transaction consists of a finitely bounded HTTP request and a finitely bounded HTTP response.

However, it's also possible for both parts of an HTTP 1.1 transaction to stream possibly unbounded data. The advantage is that the sender can send data that exceeds the sender's memory limit, and the receiver can act on the data stream in chunks immediately instead of waiting for the entire body to arrive. Basically you're either saving space or you're saving time. The advantages of streaming are elaborated in Wikipedia's Online algorithm article.

Note that HTTP streaming involves only the HTTP protocol and not WebSockets. Streaming is also the basis for HTML5 server-sent events.

So we're going to look at HTTP streaming architecture, and how to achieve streaming in a few different languages.

The first thing to understand is that HTTP streaming involves streaming within a single HTTP transaction. In a larger context, each HTTP transaction itself represents an event as part of a larger event stream. This shows that "streaming" is a context-specific concept: it is relative to what we consider the "stream" to be.

Firstly we have to consider the HTTP headers that support streaming. Open this up for reference:


The Content-Length header declares the byte length of the request/response body. If you neglect to specify the Content-Length header, most HTTP servers will implicitly add a Transfer-Encoding: chunked header. In that case the receiver has no idea what the total length of the body is and cannot estimate the download completion time. The Content-Length and Transfer-Encoding headers should not be used together. If you do add a Content-Length header, make sure it matches the entire body in bytes; if it is incorrect, the behaviour of receivers is undefined.

The Content-Length header will not allow streaming, but it is useful for large binary files, where you want to support partial content serving. This basically means resumable downloads, paused downloads, partial downloads, and multi-homed downloads. This requires the use of an additional header called Range. This technique is called Byte serving.
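As a rough illustration of byte serving, the sketch below resolves a single-range Range header against a known content length and builds the matching Content-Range value for a partial response. The function names resolve_range and content_range are hypothetical, and multi-range requests are deliberately left out.

```python
from typing import Optional, Tuple


def resolve_range(header: str, total: int) -> Optional[Tuple[int, int]]:
    """Resolve a single 'bytes=' range to inclusive (start, end) offsets.

    Returns None when the range is malformed, unsupported, or unsatisfiable.
    """
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        return None  # this sketch only handles one byte range
    first, dash, last = spec.strip().partition("-")
    if not dash:
        return None
    if first == "":
        # Suffix form "bytes=-N": the last N bytes of the body.
        if not last.isdigit() or int(last) == 0:
            return None
        start, end = max(total - int(last), 0), total - 1
    else:
        if not first.isdigit() or (last and not last.isdigit()):
            return None
        start = int(first)
        # An open-ended or oversized range is clamped to the final byte.
        end = min(int(last), total - 1) if last else total - 1
    if start > end:
        return None
    return start, end


def content_range(byte_range: Tuple[int, int], total: int) -> str:
    """Format the Content-Range header value for a partial response."""
    start, end = byte_range
    return f"bytes {start}-{end}/{total}"
```

A resumed download would typically send something like Range: bytes=500- and receive the remainder of the file, with Content-Length set to the size of the returned slice rather than the whole file.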


The use of Transfer-Encoding: chunked is what allows streaming within a single request or response. This means that the data is transmitted in a chunked manner, and does not impact the representation of the content.

Officially, an HTTP client can send a TE header field in its request to specify which transfer encodings, besides chunked, it is willing to accept. This is not always sent, but every HTTP 1.1 client is required to understand chunked encoding, so servers can assume that clients are able to process it.

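The chunked transfer encoding makes better use of persistent TCP connections, which HTTP 1.1 assumes by default: because the end of the body is signalled in-band, the connection does not have to be closed to mark the end of a response of unknown length.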

Chunked data is represented in this manner:


Each chunk starts with its byte length expressed as a hexadecimal number, followed by optional parameters (chunk extensions) and a terminating CRLF sequence, followed by the chunk data and another CRLF. The body ends with a final chunk of length zero, optionally followed by trailer header fields, and a closing CRLF sequence.
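The wire format can be sketched with a small encoder and decoder. This is a minimal illustration rather than a complete implementation: encode_chunked and decode_chunked are hypothetical names, the decoder expects the whole body in memory, ignores chunk extensions, and does not parse trailers. Note that the body is closed by a zero-length chunk.

```python
from typing import Iterable, Iterator

CRLF = b"\r\n"


def encode_chunked(parts: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each non-empty part as one chunk, then emit the terminator."""
    for part in parts:
        if not part:
            continue  # a zero-length chunk would end the body early
        size = format(len(part), "x").encode("ascii")  # hexadecimal length
        yield size + CRLF + part + CRLF
    yield b"0" + CRLF + CRLF  # last chunk, no trailers


def decode_chunked(data: bytes) -> Iterator[bytes]:
    """Yield the payload of each chunk from a complete chunked body."""
    pos = 0
    while True:
        line_end = data.index(CRLF, pos)
        # Anything after ';' on the size line is a chunk extension.
        size_field = data[pos:line_end].split(b";", 1)[0]
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + len(CRLF)
        if size == 0:
            return
        yield data[pos:pos + size]
        pos += size + len(CRLF)  # skip the CRLF that follows the data


parts = [b"Hello, ", b"streaming over HTTP"]
body = b"".join(encode_chunked(parts))
decoded = list(decode_chunked(body))
```

Here the second part is 19 bytes long, so its size line reads 13 in hexadecimal.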

Chunk extensions can be used to indicate a message digest or an estimated progress. They are just custom metadata that your layer 7 receiver needs to parse. There's no standardised format for it. Because of this, it's probably better to just add your metadata (if any) into the chunk itself for your layer 7.5 application to parse.

For your application to send out chunked data, you must first send out the Transfer-Encoding header, and then you must flush content in chunks according to the chunk format. If you don't have an appropriate HTTP server that handles this, then you need to implement the syntax generator yourself. Sometimes you can use a library to provide an abstract interface.

For example, in PHP there's the Symfony HTTP Foundation StreamedResponse, and in NodeJS the native HTTP module chunks responses by default whenever no Content-Length is set.

Chunking is a two-way street. The HTTP protocol also allows the client to chunk HTTP requests, which lets the client stream the request body. This is useful for uploading large files. However, not many servers (except NGINX) support this feature, and most streaming upload implementations rely on JavaScript libraries to cut up a binary file and send it to the server in pieces. Using JavaScript gives you more control over the uploading experience, but relying on the HTTP protocol would be the simplest.

Browsers natively support chunked data. So if your server sends chunked data, they will start rendering data as soon as they receive it. However, browsers need to receive a minimum amount of data before they start rendering it. This limit is different for each browser, but generally it's around 1KB. You can see the limits for various browsers here:

If however you want to consume an API that supports streaming, you need to be aware of how your HTTP library handles chunked data. In most cases, you'll need to attach a callback handler that executes upon each chunk of data. This should mean that your API will need to frame each chunk in a useful manner. If the API is doing too many chunks, you may end up needing to buffer the data up into a "semantic protocol data unit" (PDU) before you can work on it. This of course defeats the purpose of chunking in the first place. For example in PHP, you can use the Guzzle library or curl.
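The buffering of chunks into a semantic PDU can be sketched as follows, assuming (purely for illustration) that the API frames its records with a newline delimiter. The name iter_records is hypothetical, and the chunks argument stands in for whatever per-chunk callback or iterator your HTTP library exposes.

```python
from typing import Iterable, Iterator


def iter_records(chunks: Iterable[bytes], delimiter: bytes = b"\n") -> Iterator[bytes]:
    """Reassemble delimiter-framed records from arbitrarily split chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while True:
            record, found, rest = buffer.partition(delimiter)
            if not found:
                break  # incomplete record, wait for the next chunk
            yield record
            buffer = rest
    if buffer:
        yield buffer  # trailing record without a final delimiter


# Chunk boundaries do not line up with record boundaries here.
incoming = [b'{"a":1}\n{"b"', b':2}\n{"c":3}']
records = list(iter_records(incoming))
```

When the sender already flushes exactly one record per chunk, the buffer never holds more than one record and each record is yielded as soon as its chunk arrives.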

In considering performance, you want to make sure that you're not producing data that is too chunky. The more "chunking" you do, the more overhead there is in both producing and parsing the chunks. Furthermore, it also results in more executions of buffering functions if the receiver can't make immediate use of the chunks. Chunking isn't always the right answer, because it adds extra complexity on the recipient. So if you're sending small units of things that won't gain much from streaming, don't bother with it!

Do note that byte serving is compatible with chunked encoding. This applies where you know the total content length and want to allow partial or resumable downloads, but also want to stream each partial response to the client.


It is also possible to compress chunked or non-chunked data. In practice this is done via the Content-Encoding header.

Note that the Content-Length is equal to the length of the body after the Content-Encoding. This means if you have gzipped your response, then the length calculation happens after compression. You will need to be able to load the entire body in memory if you want to calculate the length (unless you have that information elsewhere).
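A small sketch of the non-streaming case, where the whole body is compressed in memory and the length is measured afterwards. The body and the headers dictionary are made-up examples.

```python
import gzip

# Hypothetical response body, repetitive enough to compress well.
body = b"<p>hello</p>\n" * 200

# The whole body has to be compressed before its final length is known.
compressed = gzip.compress(body)

headers = {
    "Content-Type": "text/html",
    "Content-Encoding": "gzip",
    # Length of the bytes on the wire, not of the original body.
    "Content-Length": str(len(compressed)),
}
```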

When streaming using chunked encoding, the compression algorithm must also support online processing. Thankfully, gzip supports stream compression. I believe that the content gets compressed first and then cut up into chunks. That way, the chunks are received, reassembled, and then decompressed to acquire the real content. If it were the other way around, the receiver would get a compressed stream and decompressing it would yield chunks, which doesn't make sense.
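Online gzip compression can be sketched with Python's zlib module. The gzip_stream helper is a hypothetical name. It flushes the compressor after every part so each compressed piece can be sent (and later chunk-framed) immediately, at some cost in compression ratio. The receiving side is simulated with a streaming decompressor.

```python
import zlib
from typing import Iterable, Iterator

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tells zlib to use the gzip container


def gzip_stream(parts: Iterable[bytes]) -> Iterator[bytes]:
    """Compress parts incrementally, yielding data that can be sent at once."""
    compressor = zlib.compressobj(wbits=GZIP_WBITS)
    for part in parts:
        data = compressor.compress(part)
        # A sync flush pushes out everything consumed so far, so the
        # receiver can decompress this part without waiting for the next.
        data += compressor.flush(zlib.Z_SYNC_FLUSH)
        if data:
            yield data
    yield compressor.flush()  # finish the stream and write the gzip trailer


parts = [b"first row\n", b"second row\n", b"third row\n"]
pieces = list(gzip_stream(parts))

# Receiver side: feed each piece to a streaming decompressor as it arrives.
decompressor = zlib.decompressobj(wbits=GZIP_WBITS)
received = [decompressor.decompress(piece) for piece in pieces]
```

Each element of pieces would become the payload of one or more HTTP chunks, which matches the order described above: compress first, then chunk.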

A typical compressed stream response may have these headers:

Content-Type: text/html
Content-Encoding: gzip
Transfer-Encoding: chunked

Semantically, the usage of Content-Encoding indicates an "end to end" encoding scheme, which means only the final client or final server is supposed to decode the content. Proxies in the middle are not supposed to decode the content.

If you want to allow proxies in the middle to decode the content, the correct header to use is in fact the Transfer-Encoding header. If the HTTP request carried a TE: gzip header, then it is legal to respond with Transfer-Encoding: gzip, chunked.

However this is very rarely supported. So you should only use Content-Encoding for your compression right now.

Buffering Problem

The biggest problem when implementing HTTP streaming is understanding the effect of buffering. Buffering is the practice of accumulating reads or writes into a temporary fixed memory space. The advantages of buffering include reducing read or write call overhead. For example instead of writing 1KB 4096 times, you can just write 4096KB at once. This means your program can create a write buffer holding 4096KB of temporary data (which can be aligned to the disk blocksize), and once the space limit is reached, the buffer is flushed to disk.
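The write-call saving can be made concrete with a toy write buffer. CountingSink and WriteBuffer are hypothetical classes written for this illustration. The sink only counts calls instead of touching a disk.

```python
class CountingSink:
    """Stand-in for a disk or socket that records how often it is written to."""

    def __init__(self) -> None:
        self.writes = 0
        self.bytes_written = 0

    def write(self, data: bytes) -> None:
        self.writes += 1
        self.bytes_written += len(data)


class WriteBuffer:
    """Accumulate writes and pass them on once the capacity is reached."""

    def __init__(self, sink: CountingSink, capacity: int) -> None:
        self.sink = sink
        self.capacity = capacity
        self._pending = bytearray()

    def write(self, data: bytes) -> None:
        self._pending += data
        if len(self._pending) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self.sink.write(bytes(self._pending))
            self._pending.clear()


KB = 1024

# Unbuffered: every 1KB write reaches the sink.
unbuffered = CountingSink()
for _ in range(4096):
    unbuffered.write(b"x" * KB)

# Buffered: the same writes are collected into one 4096KB write.
buffered_sink = CountingSink()
buffer = WriteBuffer(buffered_sink, capacity=4096 * KB)
for _ in range(4096):
    buffer.write(b"x" * KB)
buffer.flush()  # no-op here, but needed whenever data is left over
```

The flip side, which matters for streaming, is that nothing reaches the sink until the buffer fills or flush() is called explicitly.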

Typical HTTP architectures include these components:

Client <--> Proxy <--> HTTP Server <--> Application Server <--> Database Server

Each one of these components can possess adjustable and varied buffering styles and limits.

To correctly perform streaming, you have to know and adjust the buffering limits at each component.

For example, let's investigate a typical PHP stack such as:

Browser <--> Proxy <--> NGINX <--> PHP <--> MySQL

The Client

Firstly browsers have a rendering buffer limit. You must send as much data as the limit before the browsers will render the content. Having chunks smaller than the buffer will just make the browser hold the data until either the buffer is full or when the connection is closed (or after some time limit).

The Proxies

At the proxy level, this could be your ISP or some custom proxy. If the proxy buffers data, your streamed data from upstream will be stored in the proxy's buffer before being sent to the browser. Some mobile wireless ISPs buffer traffic and you won't be able to control this behaviour. This is a violation of the end-to-end principle, and technically there's nothing you can do about it.

The Web Server

At the NGINX level, buffering is dependent upon the type of the upstream connection. There are 3 common connection types for HTTP: "proxy", "uwsgi", "fastcgi". If you want your NGINX server to respect streaming, you can either switch off buffering for your connection type, or match the buffer size with the upstream chunk size. Switching off buffering can be done using a buffering directive (proxy_buffering, uwsgi_buffering, fastcgi_buffering), or you can use a special header X-Accel-Buffering: no which tells NGINX to not buffer the response. The special header is more flexible, as this allows NGINX to buffer responses that don't need streaming. It also works for all 3 connection types.
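As a sketch of the per-response header approach, here is a minimal application that opts out of NGINX buffering for one streamed response. It is written as a Python WSGI callable rather than PHP, purely for illustration. The app and fake_start_response names are hypothetical, and it assumes the WSGI server in front of it forwards each yielded block without buffering of its own.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def app(environ: dict, start_response: Callable) -> Iterator[bytes]:
    """Stream a few lines and ask NGINX not to buffer this response."""
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            # No Content-Length is set, since the body is produced lazily.
            ("X-Accel-Buffering", "no"),
        ],
    )

    def generate() -> Iterator[bytes]:
        for number in range(3):
            yield f"event {number}\n".encode("utf-8")

    return generate()


# Exercise the app without a real server by capturing what it sends.
captured: Dict[str, object] = {}


def fake_start_response(status: str, headers: List[Tuple[str, str]], exc_info=None) -> None:
    captured["status"] = status
    captured["headers"] = dict(headers)


blocks = list(app({}, fake_start_response))
```

Responses from the same application that omit the header keep whatever buffering behaviour NGINX is configured with, which is the flexibility mentioned above.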

If you instead try to match the buffer size with the chunk size, you have to make sure that the number of buffers multiplied by the buffer size (equal to a system memory page) is equal to a single chunk size. If it is greater than a single chunk from upstream, then this means your chunks will be accumulated before they are sent downstream. If it is less than the chunk size, this would result in NGINX buffering to disk, you want to avoid this as this results in extra overhead when streaming. For more information on buffer size see this gist.

Just a note on buffering optimisation: the larger the total buffer size, the more memory each connection is likely to use. If each buffer is large, there's a chance that you will not use the buffer efficiently, which can cause memory fragmentation. In the end, each buffer size should match the system memory page size. The number of buffers is what can be dynamically allocated. If your total buffer size across all connections exceeds your OS's memory limit, you're either going to hit an OOM error or start paging to disk. To maintain your NGINX's availability, you have to consider the theoretical number of connections that a single NGINX server can handle before it exhausts your server's memory limit.
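The capacity estimate is simple arithmetic, sketched below. All figures are assumptions for illustration (a 4096-byte page, 8 buffers per connection, a 1 GiB memory budget), and the estimate ignores every other per-connection cost.

```python
PAGE_SIZE = 4096  # assumed system memory page size in bytes


def per_connection_bytes(buffers: int, buffer_size: int = PAGE_SIZE) -> int:
    """Worst-case buffer memory a single connection can hold."""
    return buffers * buffer_size


def worst_case_total(connections: int, buffers: int, buffer_size: int = PAGE_SIZE) -> int:
    """Worst-case buffer memory when every connection fills all its buffers."""
    return connections * per_connection_bytes(buffers, buffer_size)


def max_connections(memory_budget: int, buffers: int, buffer_size: int = PAGE_SIZE) -> int:
    """How many fully buffered connections fit into the memory budget."""
    return memory_budget // per_connection_bytes(buffers, buffer_size)


one_connection = per_connection_bytes(buffers=8)             # 32768 bytes
ten_thousand = worst_case_total(connections=10_000, buffers=8)
ceiling = max_connections(memory_budget=1024 ** 3, buffers=8)
```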

Be aware of the real chunk size after compression. If your upstream is compressing the content, the resulting chunk size will be different. In most cases NGINX should be doing the compression, and it does support compressing each chunk that arrives from upstream. You just need gzip on. This means your application layer should not be compressing or chunking the content; it should just flush raw data. NGINX will automatically compress the data it receives from upstream, format it into chunks, and flush them downstream.

There's an advantage in keeping buffers available or having a larger buffer size than the chunk size: it helps when dealing with slow clients. NGINX as a reverse proxy is very fast and can read the response from your upstream application server very quickly. NGINX itself can then deal with any slow browser that has a slower read rate than your upstream's write rate. Because NGINX is very lightweight (asynchronous IO), the cost of holding a connection open in NGINX is far smaller than holding open a process (that is waiting for the client to finish reading) in your application server. This is of course relative, as your application server might also be very lightweight and rely on either green threads or asynchronous IO. This problem does reveal an interesting property of streaming systems: any stream will only be as quick as the slowest link (reader or writer) in the chain. It is related to the network back pressure issue in distributed systems.

To take advantage of NGINX's ability to handle slow clients while still streaming data as fast as possible, some tuning of both the buffer size and potentially the *_busy_buffers_size option is needed. You cannot just increase the total buffer size, as that will just make NGINX wait until the buffer is full. What you need is some buffer space that is allocated only for slow clients. This has something to do with *_busy_buffers_size, but this is poorly documented currently, so I do not know how to make this work.

Here are 2 quotes about *_busy_buffers_size:

When buffering of responses from the * server is enabled, limits the total size of buffers that can be busy sending a response to the client while the response is not yet fully read. In the meantime, the rest of the buffers can be used for reading the response and, if needed, buffering part of the response to a temporary file. By default, size is limited by the size of two buffers set by the *_buffer_size and *_buffers directives.

  • NGINX documentation

proxy_busy_buffers_size: This directive sets the maximum size of buffers that can be marked "client-ready" and thus busy. While a client can only read the data from one buffer at a time, buffers are placed in a queue to send to the client in bunches. This directive controls the size of the buffer space allowed to be in this state.

The Application Server

At the PHP level, global buffers can be set inside the php.ini configuration file. There are 3 relevant options: output_buffering, output_handler and implicit_flush. They are explained in the output control section of the PHP documentation. It is interesting to note that for CLI applications, output buffering is off by default. This is so that your CLI application can show you results as it's running. This buffer is controlled by the server application programming interface (SAPI). You can control it from inside your application by calling flush(), which will flush the entire SAPI buffer.

During runtime, custom buffers can also be created using ob_start(). Once you have added content to the buffer, you can then flush your custom buffer using ob_flush(). This only flushes the buffer that you created using ob_start(). Think of the ob_start() as a kind of PHP specific manual memory management. You're basically asking for some block of memory (fixed or variable), which you then can only use for your output statements and functions: echo and print.

If you have entered both levels of buffers, you need to call the flush functions in this order: ob_flush(); flush();.

Both the global SAPI buffer and the custom application buffer have settings that enable automatic flushing. This can depend on hitting the buffer limit, or on some function call. Check the documentation for more.

The Upstream Data Source

Finally we reach the MySQL level. This can be replaced with any upstream data source that you are calling in order to prepare a response. By default all SQL queries are buffered. There are 2 options to achieve unbuffered queries (writes and reads). The first is the unbuffered query option. This lets you read large result sets and process each row as it arrives (including flushing it to the client). The second option works with just a single column of data. This is useful where a single column contains large binary or textual content and you want to work with a stream over that data specifically. This involves the large object option. You can also stream-write large binary or textual content into the database using the large object option. Streaming the writing of rows is just done by running multiple insert queries.
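The row-at-a-time idea can be sketched as below. Note the substitution: this uses Python's built-in sqlite3 module with an in-memory database instead of MySQL and PHP, only to show the shape of processing each row as it is fetched rather than loading the whole result set first. The events table and the stream_rows helper are made up for the example.

```python
import sqlite3
from typing import Iterator


def stream_rows(connection: sqlite3.Connection, query: str) -> Iterator[bytes]:
    """Yield one encoded line per row while iterating the cursor lazily."""
    cursor = connection.execute(query)
    for row in cursor:  # rows are fetched one by one, not via fetchall()
        line = ",".join(str(value) for value in row) + "\n"
        yield line.encode("utf-8")


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO events (name) VALUES (?)",
    [("start",), ("tick",), ("stop",)],
)

# Each yielded line could be flushed to the client as soon as it is produced.
lines = list(stream_rows(connection, "SELECT id, name FROM events ORDER BY id"))
```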

With regards to the second method, there are some peculiarities you have to keep in mind:

A Note About NodeJS

NodeJS has great support for streaming. In fact its entire native HTTP module does streaming by default for both incoming requests and outgoing responses. Every time you call response.writeHead or response.write, it is just writing a chunk of data. There may still be a buffer size inside NodeJS, which is probably the highWaterMark setting, but I have not looked into this further.

NodeJS has a native stream module that serves as a base object for all other IO modules.


ujasur commented Jan 11, 2017

Very informative. Thanks


xxxxlr commented Mar 29, 2017

Like it!

One thing: browsers might not be doing 'end-to-end' decompression, according to this.


WHK102 commented Feb 10, 2018

Saved my day.


ytitov commented Jun 7, 2018

Thank you for an excellent write up. Very useful for me.


amrendra007 commented Jun 14, 2018

Great content, well written thank you so much


arslan-ahmad commented Jun 29, 2018

nice work


cloudbow commented Aug 27, 2018

Great! I have a question. Did you try chunking the response as early as the first chunk you receive from the server side? Will it work?


goophile commented Sep 10, 2018

Is it possible that a client sends a chunked request, and before all the requested data was delivered, the server starts to send the chunked response? If it's possible, we can do bidirectional real time communication in a single TCP connection without websockets.


CMCDragonkai commented Sep 12, 2018

HTTP 2 allows sending the response before fully receiving the request.

