Persistent "pipes" in Linux

In a project I'm working on I ran into the requirement of having some sort of persistent FIFO buffer or pipe in Linux, i.e. something file-like that can accept writes from one process and persist them to disk until a second process reads (and acknowledges) them. The persistence should survive both process restarts and OS restarts.

AFAICT unfortunately in the Linux world such a primitive does not exist (named pipes/FIFOs do not persist their contents to disk and - being in-memory - have low limits on how much data they can contain). The thing that comes closest to this is the not-so-fancy mechanism of log rotation:

  • process 1 appends to a regular file
  • something else (normally logrotate) checks the above file for certain conditions and if needed swaps the old file with a new, empty, file
  • process 2 must be rotation-aware to be able to tail the file (and this becomes really complicated if process 2 depends on external systems to make progress)

So, because I was too lazy to implement rotation-aware tailing in process 2, and because yak-shaving (and necessity) is the mother of most advances in IT, I figured I could try to come up with something better.

To make a file work as a pipe I basically need to be able to do two things with it:

  • push (append) new data at the end of the file
  • pop (read+truncate) data at the beginning of the file

As we all know, it is easy to truncate at the end of a file, not so much at the beginning. If only there were a way to truncate at the beginning, we could easily satisfy all the requirements:

  • process 1 appends to a regular file
  • process 2 reads from the beginning of the file (N bytes); when the read is acknowledged N bytes are truncated from the beginning of the file

Enter hole punching (no, not of the network variety).

Hole punching

Recent Linux kernels support a feature called hole punching that allows user space to deallocate selected byte ranges of a file on certain filesystems (ext4 and xfs are probably the most complete implementations, others are catching up) using [fallocate](http://man7.org/linux/man-pages/man2/fallocate.2.html):

fallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, offset, length);

The above line tells the filesystem that we don't need the contents of the file in the byte range [offset, offset+length). If this call completes successfully the filesystem will free up the corresponding blocks and "logically" replace their contents with zeros. The logical size of the file will not change. So, assuming we originally have a 14kB file:

block  0       1       2       3       4
bytes  0       4k      8k      12k     16k
       |       |       |       |       |
data   XXXXXXXXXXXXXXXXXXXXXXXXXXXX
blocks ================================

after telling fallocate to punch a hole with offset 0 and length 6144 (6kB) we should end up with the following situation, where the first block has been deallocated and only the second half of the second block still holds data (the punched half of that block is zeroed, but the block itself stays allocated):

block  0       1       2       3       4
bytes  0       4k      8k      12k     16k
       |       |       |       |       |
data   000000000000XXXXXXXXXXXXXXXX
blocks         ========================

What this means for our use case is that we can first read some bytes from the beginning of the file and then, once we have processed them, tell the filesystem to discard those bytes and deallocate the space they occupied. As noted before, the logical size of the file does not change; only the physical size (the number of allocated blocks) changes.
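This is easy to observe from user space: after punching a hole, st_size stays the same while st_blocks shrinks. A quick check might look like the following sketch (fd is assumed to be an open descriptor for the file):

#include <stdio.h>
#include <sys/stat.h>

void report_sizes(int fd)
{
    struct stat st;
    if (fstat(fd, &st) == 0)
        printf("logical: %lld bytes, physical: %lld bytes\n",
               (long long)st.st_size,
               (long long)st.st_blocks * 512); /* st_blocks counts 512-byte units */
}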

Hole punching is then complemented by the ability of the [lseek()](http://linux.die.net/man/2/lseek) call to skip any holes that have been punched in the file:

lseek(fd, offset, SEEK_DATA);

If offset falls inside a hole, the call above moves the file position to the first byte after the end of the hole; if offset is not inside a hole, it does nothing and simply returns offset.
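One wrinkle worth handling: on Linux, SEEK_DATA fails with ENXIO when there is no data at or after offset (for instance, when everything from offset onwards is one big hole). A sketch of such a helper (the name seek_next_data matches the pseudocode used later):

#define _GNU_SOURCE /* for SEEK_DATA */
#include <errno.h>
#include <unistd.h>

/* Returns the offset of the next data byte at or after `offset`,
 * or -1 when nothing but holes remain (lseek sets errno to ENXIO). */
static off_t seek_next_data(int fd, off_t offset)
{
    off_t pos = lseek(fd, offset, SEEK_DATA);
    if (pos == (off_t)-1 && errno == ENXIO)
        return -1; /* no data at or past `offset`: everything consumed */
    return pos;
}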

By putting all of this together we can then build process 2 as follows:

open_file_for_consuming
seek_data
while !eof {
  read_chunk_from_file
  process_chunk
  delete_chunk_from_file
}
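For concreteness, here is a minimal C sketch of that loop; process_chunk is a hypothetical application-defined hook, and error handling is mostly omitted:

#define _GNU_SOURCE /* for SEEK_DATA, fallocate() and FALLOC_FL_* */
#include <fcntl.h>
#include <unistd.h>

#define CHUNK 4096

extern int process_chunk(const char *buf, ssize_t len); /* hypothetical hook */

int consume(const char *path)
{
    int fd = open(path, O_RDWR); /* fallocate() needs write access */
    if (fd < 0)
        return -1;

    off_t pos = lseek(fd, 0, SEEK_DATA); /* skip any leading hole */
    if (pos == (off_t)-1) {              /* e.g. ENXIO: the file is all hole */
        close(fd);
        return 0;
    }

    char buf[CHUNK];
    ssize_t n;
    while ((n = pread(fd, buf, sizeof buf, pos)) > 0) {
        process_chunk(buf, n);
        /* acknowledge: deallocate the bytes we just consumed */
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos, n);
        pos += n;
    }
    close(fd);
    return 0;
}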

With this in place we have successfully built a basic working persistent pipe between the two processes. Note that the consumer must open the file for writing as well as reading, since fallocate() requires a descriptor opened for writing.

Extensions

The basic version is all well and good, but it has a minor annoyance: the logical file length (the one reported by ls) will keep growing indefinitely, even if the physical size is 0. Luckily fallocate also allows us to trim the logical size, albeit with the caveat that the trimming happens on block boundaries (i.e. both offset and length must be multiples of the block size):

fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, length);

So we can modify our new tool to perform trimming, if possible:

open_file_for_consuming
seek_data
while !eof {
  read_chunk_from_file
  process_chunk
  delete_chunk_from_file
  try_trim_file
}
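try_trim_file can be a best-effort helper along these lines (a sketch; the name and the bookkeeping of how many leading bytes have already been punched are assumptions of mine):

#define _GNU_SOURCE /* for fallocate() and FALLOC_FL_COLLAPSE_RANGE */
#include <fcntl.h>
#include <sys/stat.h>

/* Collapse as many fully-consumed leading blocks as possible.
 * `consumed` tracks how many bytes have been punched at the start. */
static void try_trim_file(int fd, off_t *consumed)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return;
    off_t blksz = st.st_blksize;             /* usually the fs block size */
    off_t len = (*consumed / blksz) * blksz; /* round down to a block boundary */
    if (len == 0)
        return;
    /* COLLAPSE_RANGE removes [0, len) and shifts the rest of the file down;
     * it fails with EOPNOTSUPP on filesystems that don't support it. */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len) == 0)
        *consumed -= len;
}

Note that after a successful collapse every offset the consumer tracks (including its read position) must be shifted down by len, since the whole file contents move towards the beginning.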

Another extension is the ability to leave behind data that cannot be processed right now (we depend on an external system, remember?): instead of stopping the whole pipeline, we can simply skip over the problematic data without deleting it from the file:

open_file_for_consuming
while !eof {
  seek_next_data
  read_chunk_from_file
  if process_chunk {
    delete_chunk_from_file
  }
  try_trim_file
}

In this case, when we reach the end of the file we may have left data chunks scattered between the holes. Running the same loop again will first seek to the first remaining data chunk; read, process and eventually delete it; then seek to the next data chunk, and so on.

The good thing about this approach is that all writes and reads are linear, and therefore fast even on conventional (spinning) drives. To help with this it makes sense to use posix_fadvise to declare the access pattern; note that the advice argument is a single value, not a bitmask, so each hint needs its own call:

posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

(In case you have frequent processing failures it might be a good idea to mark the byte ranges of failed chunks with POSIX_FADV_WILLNEED.)

tail -f-like behavior

The last piece left to implement is obviously the ability to "wait" for new content to be appended at the end of the file. AFAIK the only way to do this (short of polling to check whether the file's mtime or size has changed) is to use the platform-specific filesystem notification mechanism (inotify on Linux). The specifics of how to do this are definitely outside the scope of this post, but it would look something like this:

open_file_for_consuming
do {
  while !eof {
    seek_next_data
    read_chunk_from_file
    if process_chunk {
      delete_chunk_from_file
    }
    try_trim_file
  }
  wait_for_file_changes_or_timer
  seek_start
}
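On Linux the obvious candidate is inotify; a minimal sketch of wait_for_file_changes_or_timer could look like this (the function name and the timeout convention are mine, not a standard API):

#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Block until `path` is modified or `timeout_ms` elapses.
 * A real implementation would keep the watch alive across calls and
 * also handle IN_MOVE_SELF/IN_DELETE_SELF. */
static int wait_for_file_changes_or_timer(const char *path, int timeout_ms)
{
    int ifd = inotify_init1(IN_NONBLOCK);
    if (ifd < 0)
        return -1;
    if (inotify_add_watch(ifd, path, IN_MODIFY) < 0) {
        close(ifd);
        return -1;
    }
    struct pollfd pfd = { .fd = ifd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms); /* 0 = timeout, 1 = change event */
    close(ifd);
    return rc < 0 ? -1 : 0;
}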

Implementation

I have tested the idea locally with a hastily-hacked implementation and verified that the approach is sound and viable (at least on ext4 with kernel 3.19). I'm currently in the process of properly implementing this in [ActiveState/tail](https://github.com/ActiveState/tail); I will update this post as soon as I'm done.

Notes

Some warnings:

  • Linux-only (some BSDs have support, YMMV), but other OSes have similar capabilities (e.g. using [FSCTL_SET_SPARSE, FSCTL_SET_ZERO_DATA and FSCTL_QUERY_ALLOCATED_RANGES](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365566(v=vs.85).aspx) on Windows)
  • I'm relying on implementation-specific behavior in certain cases: e.g. lseek(fd, offset, SEEK_DATA) is not required to actually find the next data chunk; it could very well simply return offset, and the application would then have to skip over any zero-filled ranges itself. From a cursory look at the ext4 source code it appears that at least ext4 does the right thing, so I didn't handle the case in which we have to skip over zero-filled ranges in seek_next_data.
  • At least in my testing, it appears that fallocate requires explicit fsyncs for its effects to actually reach the disk.