Persistent "pipes" in Linux

In a project I'm working on I ran into the requirement of having some sort of persistent FIFO buffer or pipe in Linux, i.e. something file-like that can accept writes from one process and persist them to disk until a second process reads (and acknowledges) them. The persistence should survive both process restarts and OS restarts.

AFAICT unfortunately in the Linux world such a primitive does not exist (named pipes/FIFOs do not persist their contents to disk and - being in-memory - have low limits on how much data they can contain). The thing that comes closest to this is the not-so-fancy mechanism of log rotation:

  • process 1 appends to a regular file
  • something else (normally logrotate) checks the above file for certain conditions and if needed swaps the old file with a new, empty, file
  • process 2 must be rotation-aware to be able to tail the file (and this becomes really complicated if process 2 depends on external systems to make progress)

So, because I was too lazy to implement rotation-aware tailing in process 2, and because yak-shaving (and necessity) is the mother of most advances in IT, I figured I could try to come up with something better.

To make a file work as a pipe I basically need to be able to do two things with it:

  • push (append) new data at the end of the file
  • pop (read+truncate) data at the beginning of the file

As we all know, it is easy to truncate at the end of a file, not so much at the beginning. If only there were a way to truncate at the beginning, we could easily satisfy all the requirements:

  • process 1 appends to a regular file
  • process 2 reads from the beginning of the file (N bytes); when the read is acknowledged N bytes are truncated from the beginning of the file

Enter hole punching (no, not of the network variety).

Hole punching

Recent Linux kernels support a feature called hole punching that allows user space to deallocate selected byte ranges of a file on certain filesystems (ext4 and xfs are probably the most complete implementations, others are catching up) using [fallocate](http://man7.org/linux/man-pages/man2/fallocate.2.html):

fallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, offset, length);

The above line tells the filesystem that we don't need the contents of the file in the byte range [offset, offset+length). If this call completes successfully the filesystem will free up the corresponding blocks and "logically" replace their contents with zeros. The logical size of the file will not change. So, assuming we originally have a 14kB file:

block  0       1       2       3       4
bytes  0       4k      8k      12k     16k
       |       |       |       |       |
data   XXXXXXXXXXXXXXXXXXXXXXXXXXXX
blocks ================================

after telling fallocate to punch a hole with offset 0 and length 6144 (6kB) we should end up with the following situation, where the first block has been deallocated and only the second half of the second block still holds data (the punched half of that block is zeroed, but the block itself stays allocated):

block  0       1       2       3       4
bytes  0       4k      8k      12k     16k
       |       |       |       |       |
data   000000000000XXXXXXXXXXXXXXXX
blocks         ========================

What this means for our use case is that we can first read some bytes from the beginning of the file and then, once we have processed them, tell the filesystem to discard those bytes and deallocate the space they occupied. As noted before, the logical size of the file does not change; only the physical size (the number of allocated blocks) changes.
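This is easy to observe from user space: after punching a hole, st_size stays the same while st_blocks shrinks. A quick check might look like the following sketch (fd is assumed to be an open descriptor for the file):

#include <stdio.h>
#include <sys/stat.h>

void report_sizes(int fd)
{
    struct stat st;
    if (fstat(fd, &st) == 0)
        printf("logical: %lld bytes, physical: %lld bytes\n",
               (long long)st.st_size,
               (long long)st.st_blocks * 512); /* st_blocks counts 512-byte units */
}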

Hole punching is then complemented by the ability of the [lseek()](http://linux.die.net/man/2/lseek) call to skip any holes that have been punched in the file:

lseek(fd, offset, SEEK_DATA);

If offset falls inside a hole, the call above moves the file position to the first byte after the end of the hole; if offset is not inside a hole, it does nothing and simply returns offset.
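One wrinkle worth handling: on Linux, SEEK_DATA fails with ENXIO when there is no data at or after offset (for instance, when everything from offset onwards is one big hole). A sketch of such a helper (the name seek_next_data matches the pseudocode used later):

#define _GNU_SOURCE /* for SEEK_DATA */
#include <errno.h>
#include <unistd.h>

/* Returns the offset of the next data byte at or after `offset`,
 * or -1 when nothing but holes remain (lseek sets errno to ENXIO). */
static off_t seek_next_data(int fd, off_t offset)
{
    off_t pos = lseek(fd, offset, SEEK_DATA);
    if (pos == (off_t)-1 && errno == ENXIO)
        return -1; /* no data at or past `offset`: everything consumed */
    return pos;
}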

By putting all of this together we can then build process 2 as follows:

open_file_for_consuming
seek_data
while !eof {
  read_chunk_from_file
  process_chunk
  delete_chunk_from_file
}
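For concreteness, here is a minimal C sketch of that loop; process_chunk is a hypothetical application-defined hook, and error handling is mostly omitted:

#define _GNU_SOURCE /* for SEEK_DATA, fallocate() and FALLOC_FL_* */
#include <fcntl.h>
#include <unistd.h>

#define CHUNK 4096

extern int process_chunk(const char *buf, ssize_t len); /* hypothetical hook */

int consume(const char *path)
{
    int fd = open(path, O_RDWR); /* fallocate() needs write access */
    if (fd < 0)
        return -1;

    off_t pos = lseek(fd, 0, SEEK_DATA); /* skip any leading hole */
    if (pos == (off_t)-1) {              /* e.g. ENXIO: the file is all hole */
        close(fd);
        return 0;
    }

    char buf[CHUNK];
    ssize_t n;
    while ((n = pread(fd, buf, sizeof buf, pos)) > 0) {
        process_chunk(buf, n);
        /* acknowledge: deallocate the bytes we just consumed */
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos, n);
        pos += n;
    }
    close(fd);
    return 0;
}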

With this in place we have successfully built a basic working persistent pipe between the two processes. Note that the consumer must open the file for writing as well as reading, since fallocate() requires a descriptor opened for writing.

Extensions

The basic version is all well and good, but it has a minor annoyance: the logical file length (the one reported by ls) will keep growing indefinitely, even if the physical size is 0. Luckily fallocate also allows us to trim the logical size, albeit with the caveat that the trimming happens on block boundaries (i.e. both offset and length must be multiples of the block size):

fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, length);

So we can modify our new tool to perform trimming, if possible:

open_file_for_consuming
seek_data
while !eof {
  read_chunk_from_file
  process_chunk
  delete_chunk_from_file
  try_trim_file
}
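try_trim_file can be a best-effort helper along these lines (a sketch; the name and the bookkeeping of how many leading bytes have already been punched are assumptions of mine):

#define _GNU_SOURCE /* for fallocate() and FALLOC_FL_COLLAPSE_RANGE */
#include <fcntl.h>
#include <sys/stat.h>

/* Collapse as many fully-consumed leading blocks as possible.
 * `consumed` tracks how many bytes have been punched at the start. */
static void try_trim_file(int fd, off_t *consumed)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return;
    off_t blksz = st.st_blksize;             /* usually the fs block size */
    off_t len = (*consumed / blksz) * blksz; /* round down to a block boundary */
    if (len == 0)
        return;
    /* COLLAPSE_RANGE removes [0, len) and shifts the rest of the file down;
     * it fails with EOPNOTSUPP on filesystems that don't support it. */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len) == 0)
        *consumed -= len;
}

Note that after a successful collapse every offset the consumer tracks (including its read position) must be shifted down by len, since the whole file contents move towards the beginning.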

Another extension is the ability to leave behind data that cannot be processed right now (we depend on an external system, remember?): instead of stopping the whole pipeline, we can simply skip over the problematic data without deleting it from the file:

open_file_for_consuming
while !eof {
  seek_next_data
  read_chunk_from_file
  if process_chunk {
    delete_chunk_from_file
  }
  try_trim_file
}

In this case, when we reach the end of the file we may have left data chunks scattered between the holes. Running the same loop again will first seek to the first remaining data chunk; read, process and eventually delete it; then seek to the next data chunk, and so on.

The good thing about this approach is that all writes and reads are linear, and therefore fast even on conventional (spinning) drives. To help with this it makes sense to use posix_fadvise to declare the access pattern; note that the advice argument is a single value, not a bitmask, so each hint needs its own call:

posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

(In case you have frequent processing failures it might be a good idea to mark the byte ranges of failed chunks with POSIX_FADV_WILLNEED.)

tail -f-like behavior

The last piece left to implement is obviously the ability to "wait" for new content to be appended at the end of the file. AFAIK the only way to do this (short of polling to check whether the file's mtime or size has changed) is to use the platform-specific filesystem notification mechanism (inotify on Linux). The specifics of how to do this are definitely outside the scope of this post, but it would look something like this:

open_file_for_consuming
do {
  while !eof {
    seek_next_data
    read_chunk_from_file
    if process_chunk {
      delete_chunk_from_file
    }
    try_trim_file
  }
  wait_for_file_changes_or_timer
  seek_start
}
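On Linux the obvious candidate is inotify; a minimal sketch of wait_for_file_changes_or_timer could look like this (the function name and the timeout convention are mine, not a standard API):

#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Block until `path` is modified or `timeout_ms` elapses.
 * A real implementation would keep the watch alive across calls and
 * also handle IN_MOVE_SELF/IN_DELETE_SELF. */
static int wait_for_file_changes_or_timer(const char *path, int timeout_ms)
{
    int ifd = inotify_init1(IN_NONBLOCK);
    if (ifd < 0)
        return -1;
    if (inotify_add_watch(ifd, path, IN_MODIFY) < 0) {
        close(ifd);
        return -1;
    }
    struct pollfd pfd = { .fd = ifd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms); /* 0 = timeout, 1 = change event */
    close(ifd);
    return rc < 0 ? -1 : 0;
}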

Implementation

I have tested the idea locally with a hastily-hacked implementation and verified that the approach is sound and viable (at least on ext4 with kernel 3.19). I'm currently in the process of properly implementing this in [ActiveState/tail](https://github.com/ActiveState/tail); I will update this post as soon as I'm done.

Notes

Some warnings:

  • Linux-only (some BSDs have support, YMMV), but other OSes have similar capabilities (e.g. using [FSCTL_SET_SPARSE, FSCTL_SET_ZERO_DATA and FSCTL_QUERY_ALLOCATED_RANGES](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365566(v=vs.85).aspx) on Windows)
  • I'm relying on implementation-specific behavior in certain cases: e.g. lseek(fd, offset, SEEK_DATA) is not required to actually find the next data chunk; it could very well simply return offset, and the application would then have to skip over any zero-filled ranges itself. From a cursory look at the ext4 source code it appears that at least ext4 does the right thing, so I didn't handle the case in which we have to skip over zero-filled ranges in seek_next_data.
  • At least in my testing, it appears that fallocate requires explicit fsyncs for its effects to actually reach the disk.