Persistent pipes/circular buffers for Linux

📂 Persistent "pipes" in Linux

In a project I'm working on I ran into the requirement of having some sort of persistent FIFO buffer or pipe in Linux, i.e. something file-like that could accept writes from a process and persist them to disk until a second process reads (and acknowledges) them. The persistence should survive both process restarts and OS restarts.

AFAICT, unfortunately, such a primitive does not exist in the Linux world (named pipes/FIFOs do not persist their contents to disk and, being in-memory, have low limits on how much data they can contain). The thing that comes closest is the not-so-fancy mechanism of log rotation:

  • process 1 appends to a regular file
  • something else (normally logrotate) checks the above file for certain conditions and if needed swaps the old file with a new, empty, file
  • process 2 must be logrotation-aware to be able to tail from the file (and this becomes really complicated if process 2 depends on external systems to make progress)

So because I was too lazy to implement logrotation-aware tailing in process 2 and because yak-shaving (and necessity) are the mother of most advances in IT, I figured I could try to come up with something better.

To have a file work as a pipe I would basically need a file that can be used to do two things:

  • push (append) new data at the end of the file
  • pop (read+truncate) data at the beginning of the file

As we all know, it is easy to truncate at the end of a file, but not so much at the beginning. If only there were a way to truncate at the beginning, we could easily satisfy all the requirements:

  • process 1 appends to a regular file
  • process 2 reads from the beginning of the file (N bytes); when the read is acknowledged N bytes are truncated from the beginning of the file

Enter hole punching (no, not of the network variety).

Hole punching

Recent Linux kernels support a feature called hole punching that allows user space to deallocate selected byte ranges from files on certain filesystems (ext4 and xfs probably have the most complete implementations; others are catching up) by using [fallocate](http://man7.org/linux/man-pages/man2/fallocate.2.html):

fallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, offset, length);

The above line tells the filesystem that we don't need the contents of the file in the byte range [offset, offset+length). If this call completes successfully the filesystem will free up the corresponding blocks and "logically" replace their contents with 0s. The logical size of the file will not change. So assuming we originally have a 14kB file:

block  0       1       2       3       4
bytes  0       4k      8k      12k     16k
       |       |       |       |       |
data   XXXXXXXXXXXXXXXXXXXXXXXXXXXX
blocks ================================

after telling fallocate to punch a hole at offset 0 with length 6144 (the first 4kB block plus half of the second), we should end up with the following situation, where the first block has been deallocated and only the second half of the second block is used:

block  0       1       2       3       4
bytes  0       4k      8k      12k     16k
       |       |       |       |       |
data   000000000000XXXXXXXXXXXXXXXX
blocks         ========================

What this means for our use-case is that we can first read some bytes from the beginning of the file and then, once we have processed them, tell the filesystem to discard those bytes and deallocate the space they used. As said before, the logical size of the file does not change; only the physical size (the number of blocks) may change.

Hole punching is then complemented by the ability of the [lseek()](http://linux.die.net/man/2/lseek) call to skip any holes that have been punched in the file:

lseek(fd, offset, SEEK_DATA);

If offset is a position inside a hole, the call above moves the file position to the first position after the end of the hole (if offset is not inside a hole, it does nothing and simply returns offset).

By putting all of this together we can then build process 2 as follows:

open_file_for_consuming
seek_data
while !eof {
  read_chunk_from_file
  process_chunk
  delete_chunk_from_file
}

With this in place we have a working basic persistent pipe between processes. Note that the consuming process must open the file for writing as well, since punching holes modifies the file.

Extensions

The basic version is all well and good, but it has a minor annoyance: the logical file length (the one reported by ls) will keep growing indefinitely, even if the physical size is 0. Luckily fallocate also lets us trim the logical size, albeit with two caveats: the trimming happens on block boundaries (i.e. both offset and length must be multiples of the block size), and collapsing a range shifts all following data toward the start of the file, so any saved offsets must be adjusted accordingly:

fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, length);

So we can modify our new tool to perform trimming, if possible:

open_file_for_consuming
seek_data
while !eof {
  read_chunk_from_file
  process_chunk
  delete_chunk_from_file
  try_trim_file
}

Another extension is the ability to leave behind data that cannot be processed right now (we depend on an external system, remember?): instead of stopping the whole processing, we can simply skip the problematic data without deleting it from the file:

open_file_for_consuming
while !eof {
  seek_next_data
  read_chunk_from_file
  if process_chunk {
    delete_chunk_from_file
  }
  try_trim_file
}

In this case when we reach the end of the file we might have left data chunks scattered between the holes. Running the same process again will first seek to the first data chunk; will read, process and eventually delete it; then seek to the second data chunk and so on.

The good thing about this approach is that all writes and reads are linear, and therefore fast even on conventional (spinning) drives. To help with this it makes sense to use fadvise to inform the system:

posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

(note that posix_fadvise takes a single advice value per call, so the two hints cannot be OR'd together)

(in case you have frequent processing failures it might be a good idea to mark the byte ranges of failed chunks with POSIX_FADV_WILLNEED, since we will come back to them)

tail -f-like behavior

The last piece left to implement is obviously the ability to "wait" for new content to be appended at the end of the file. AFAIK the only way to do this (short of polling the file mtime or size for changes) is to use the platform-specific filesystem notification mechanism (e.g. inotify on Linux). The specifics are definitely outside the scope of this post, but it would look something like this:

open_file_for_consuming
do {
  while !eof {
    seek_next_data
    read_chunk_from_file
    if process_chunk {
      delete_chunk_from_file
    }
    try_trim_file
  }
  wait_for_file_changes_or_timer
  seek_start
}

Implementation

I have tested the idea locally using a hastily-hacked implementation and verified that the approach is sound and viable (at least on ext4 on kernel 3.19). I'm currently in the process of properly implementing this in [ActiveState/tail](https://github.com/ActiveState/tail): I'll update this post as soon as I'm done.

Notes

Some warnings:

  • Linux-only (some BSDs have support, YMMV), but other OSes have similar capabilities (e.g. using [FSCTL_SET_SPARSE, FSCTL_SET_ZERO_DATA and FSCTL_QUERY_ALLOCATED_RANGES](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365566(v=vs.85).aspx) on Windows)
  • I'm relying on implementation-specific behavior in certain cases: e.g. lseek(fd, offset, SEEK_DATA) is not required to actually find the next data chunk; it could very well simply return offset, leaving the application to skip over any 0-filled ranges itself. From a cursory look at the ext4 source code it appears that at least ext4 does the proper thing, so I didn't handle skipping over 0-ranges in seek_next_data.
  • At least in my testing, fallocate appears to require explicit fsyncs to actually do its thing.
@cralso cralso commented Aug 25, 2015

Nice solution, I enjoyed reading it!

@lonetwin lonetwin commented Sep 10, 2015

Erm, any reason why message passing using message queues wasn't used? http://beej.us/guide/bgipc/output/html/multipage/mq.html
Update: Just saw that you also needed persistence across OS restarts. Well, ok then.

@sixtus sixtus commented Sep 10, 2015

If you consider this anything but a hack, I strongly recommend looking at Kafka instead

@anko anko commented Sep 10, 2015

For reference, this Gist is on Hacker News.

@sixtus I'm unfamiliar with Kafka (beyond a little reading just now): could you elaborate? In what ways is it better? How can I use it to create a UNIX pipe that persists over reboots?

@polobo polobo commented Sep 10, 2015

@anko the point is to consider whether the Unix pipe is really a hard requirement or whether the task-oriented nature of the flow is what matters. You seem to be writing a custom low-level solution to an already solved problem with multiple high-level implementations.

@b3h3moth b3h3moth commented Sep 10, 2015

Great solution, many thanks!

@anko anko commented Sep 10, 2015

@polobo Ah, I see—we're thinking about different layers of abstraction. Your use-case assumes the data being transmitted have a higher-level meaning, like messages or tasks. My use-case specifically assumes arbitrary data, so a low-level UNIX-pipe-ish stream would be a useful tool despite its simplicity. Kafka indeed seems a better tool for bigger high-level use-cases.

Just to clarify, I'm not involved in the implementation effort. I'm just a curious person with a bunch of shell scripts which this would be useful for.

@CAFxX CAFxX (author) commented Oct 2, 2015

@sixtus funny: our main use-case for this is exactly as a persistent buffer before sending to Kafka. Because, you know, networks are unreliable.

@fsaintjacques fsaintjacques commented May 10, 2016

Is the implementation open sourced?

@BruceBuckland BruceBuckland commented Mar 31, 2017

How about emlog? It's a kernel module, usually used on embedded systems, that creates a file that can be written to forever but never exceeds a particular size, and always contains the last stuff you wrote.

Not really sure that's what you want...

http://www.circlemud.org/jelson/software/emlog/
which begat
https://github.com/nicupavel/emlog

@drydenp drydenp commented May 12, 2017

@BruceBuckland

That would destroy older information.
