Skip to content

Instantly share code, notes, and snippets.

@glv
Last active January 19, 2024 19:41
Show Gist options
  • Save glv/6b32924c3718f916e065 to your computer and use it in GitHub Desktop.
Save glv/6b32924c3718f916e065 to your computer and use it in GitHub Desktop.
Comparing monadic and non-monadic styles for managing state in Clojure

The Clojure State Monad: A Non-Trivial Example

When I started writing this in early 2013, I intended it to be a series of blog posts about effective use of the state monad (and the `algo.monads` library in general) in Clojure. Along the way, I learned that `algo.monads` is both somewhat buggy and extremely slow, and I decided that the most effective way to use monads in Clojure was simply not to use them at all. Therefore, I abandoned work on the post. But it's still probably useful as a good way of explaining the state monad by example, so I've spruced the old draft up slightly and am posting it here.

TL;DR

There are too many monad tutorials, and not enough practical examples of solving real problems with monads. This article shows how to use the state monad to solve a sizable, real-world problem.

Introduction

This is not a monad tutorial.

This is a state monad tutorial. There's a difference.

I mostly know how monads work. It kind of clicked with the 43rd monad tutorial I read, and I got the idea. Most importantly, I knew when to reach for a monad to solve a problem. I was ready.

Then of course, I needed to use a state monad for a real problem. That's when I realized how little I knew.

Each monad has its own peculiarities in terms of how it's used in practice. The state monad has more such peculiarities than the other common monads. But what really makes it challenging is that all of the state monad examples I could find on the web—literally all of them, and I looked pretty hard—are trivial, leaving many practical questions unanswered.

I needed to figure it out, though, because I thought the state monad might improve the DVI output code for my Cló project. So I gave it a try, and quickly hit roadblocks. I happen to know a few prominent Clojure programmers, so I asked for help … and was told “I haven't had much luck with the state monad either” (and this was from the author of a truly excellent monad tutorial). So I rolled up my sleeves and started really digging in.

Here's what I needed to learn that I wished the existing state monad examples had shown me:

  • how to feed a starter state into the monadic computation
  • how to use the helper functions to use a map as your state—essentially, how to use a map as a sort of local mutable variable space for a sizable computation
  • how to factor a complex monadic computation across multiple functions.

The rest of this article will explain all of those things, in the context of a substantial problem that is a realistic use case for the state monad, much too large to be contained in a single function, and requiring the tracking of multiple, independent bits of state.

The Problem

The program I'm writing needs to write a binary file, of the type that includes back-references by byte offset. Examples of that kind of file include some object file formats, PDF files, and Zip files. A Zip file consists of an introductory header, sections for each file in the archive (each with its own descriptive header), and an end-of-file directory. The directory contains information about each file in the archive, including the byte offset (from the beginning of the archive) of the start of that file's section.

Such a structure is very practical:

  • it can be written in a single pass;
  • it's quick to simply list the contents, by scanning back from the end to find the directory;
  • it's also quick to extract a single file from the middle of the archive, by first finding the file's directory entry and then seeking to the byte offset where that file is stored;
  • new files can be added by reading the directory into memory, overwriting the directory with the new file data, and then rewriting the directory at the end (augmented with information about the new files).

Building such a file, though, is an inherently stateful problem. You have to keep track of the byte offsets and other information about the stored files as they are written, so that info can be included in the end-of-file directory.

I've considered three ways of managing that state.

The first is the conventional functional-language style, using program structure and control flow to accumulate the state on the program stack in parameters or return values. And for a file as simple as a Zip archive, that could work: a vector of file directory info could be built as the result of mapping over the list of files to compress and write them into the archive. Unfortunately, for more complex files such as PDFs, the file structure doesn't match the state that must be gathered quite as closely, so this strategy would be impractical. Additionally, it would require passing state variables around all over the place. So even though a simple, pure functional approach will work for this particular file format, I want to use this example to explore techniques for cases when that straightforward approach is not sufficient.

The second strategy uses a var to hold a map of state values, and changing the var with set! when some of the state needs to be updated. Apart from just not feeling very clean, this strategy has some serious drawbacks. For one thing, it sacrifices some performance, because updating a var requires locking and synchronization. And that brings up the other big issue: that it would be difficult to simultaneously build and write multiple files in different threads. (If it seems unlikely that you would ever want to do that, consider the "Download this repository as a zip file" link that's on every repository page on GitHub.)

The third strategy is to use the state monad, and that's the approach that I explore in the rest of this article.

A Simplified Problem

There's a lot of stuff involved in building a Zip file that isn't related to the state management problem. So, to avoid getting bogged down in compression algorithms and such things, I've defined a simplified archive format called NoZip. It's similar to the Zip file format, but leaves out almost every interesting feature. File data is not compressed in a NoZip archive. You can think of it as similar to a tar file, except designed for disks and random access rather than the sequential, tape-oriented format of tar.

Here's the structure of a NoZip archive:

HEADER:
    "NOZA"          magic number
    4-byte int      number of files in archive

FILE ENTRIES (one per file)
    4-byte int      length of file name
    string          file name
    4-byte int      length of file data
    data            file data, uncompressed

TRAILER
    "NOZADIR"       directory intro
    4-byte int      number of directory entries

  DIRECTORY ENTRIES (one per file)
    4-byte int      length of file name
    string          file naem
    4-byte int      byte offset of file entry from start of archive

  TAIL
    "NOZATAIL"      tail intro
    4-byte int      byte offset of start of trailer from start of archive

All of the ints in the file are stored in big-endian format.

(The intro strings at the start of the trailer and the tail are for error detection. Starting from the end of the file, you should be able to find the beginning of the tail by seeking backward 12 bytes, and then read the offset to the start of the trailer and seek directly there. But a program reading this kind of file should check for those intro strings and fail fast if they aren't at the expected locations.)

The Solution

The Top Level: Orchestrating the Process

We want to define a function like this:

(defn write-archive [output-channel files]
  ...)

The output-channel argument should be an instance of java.nio.channels.WritableByteChannel, and files should be an array of java.io.File objects. We don't have to worry about finding the list of files to archive, or opening the output file. It's all about building and writing the archive itself.

So let's sketch out a first draft of what this function should look out, without using monads for now:

(defn write-archive [output-channel files]
  (write-archive-header output-channel (count files))
  (let [file-info (map (partial write-file-entry output-channel) files)]
    (write-archive-trailer output-channel file-info)))

Assuming reasonable implementations of write-archive-header, write-file-entry, and write-archive-trailer, that seems OK. For a file format this simple, we don't even need to use the state monad.

But we missed something. In Java's NIO facility, byte channels aren't always buffered. If we want our I/O to be efficient, we need to write into a buffer, and flush that buffer into the channel occasionally. So we need the buffer as well:

(defn write-archive [output-channel files]
  (let [buffer (.clear (java.nio.ByteBuffer/allocate 2048))]
    (write-archive-header output-channel buffer (count files))
    (let [file-info (map (partial write-file-entry output-channel buffer) files)]
      (write-archive-trailer output-channel buffer file-info))))

That's starting to look a little ugly. We need to pass both the buffer and the channel around all over the place, and the number of functions involved in building the whole file could be reasonably large. That'll get cumbersome.

So we have another reason to use the state monad: making the buffer and channel available throughout the process, without having to pass them explicitly on every function call. It's time to shift to a monadic style.

My first version looked something like this:

(defn write-archive [output-channel files]
  (let [buffer (.clear (java.nio.ByteBuffer/allocate 2048))]
    (domonad state-m
             [_ (write-archive-header (count files))
              _ (map write-file-entry files)
              _ (write-archive-trailer)]
              nil)))

There's still some extra complexity there, but instead of being threaded through each function call, it's around the core function calls. It's a little less obtrusive.

A lot of monad tutorials that mention the state monad stop with this. But there are two problems. One is kind of obvious: how does the state get initialized with the buffer and channel? That's not here anywhere.

The other problem is that calling this function doesn't actually do anything. Sure, the domonad actually, you know, does the monad, but that just means that it rearranges everything and returns you a "monadic value". And monadic values for the state monad are functions.

When I first tried this, I recognized the first problem, but I thought “let's just see what happens.” I wrote stub implementations of the write-* functions that just wrote messages to stdout. And nothing happened.

A monadic value in the state monad is a function that is passed the state, operates on it, and returns a vector: [result new-state]. And domonad simply builds and returns such a value, so the actual work doesn't actually happen until you call that function. That answers the question of how to seed the initial state to start the process, so let's do that now.

(defn write-archive [output-channel files]
  (with-monad state-m
    (let [buffer (.clear (java.nio.ByteBuffer/allocate 2048))
          builder (domonad
                   [_ (write-archive-header (count files))
                    _ (map write-file-entry files)
                    _ (write-archive-trailer)]
                   nil)
          ]
      (builder {:channel output-channel
                :buffer buffer
                :entries []})
      nil)))

That actually does something! We capture the function returned by domonad In the builder variable and then call that function, seeding the process using a map as the state. That map holds the three values we'll need to access throughout the process. (:channel and :buffer will be read-only state variables, but :entries will be updated as we go.) And this time, the functions we call as part of the monadic computation actually get executed!

There's still one problem, though: every step in the monadic computation must return a monadic value. I've written the write-* functions so that they do that properly (as we'll discuss shortly). But map is not equipped to either call state-monadic functions, or to return a state-monadic value. For that, I need to call the special monadic version of map, m-map. So here's our final version:

(defn write-archive [output-channel files]
  (with-monad state-m
    (let [buffer (.clear (java.nio.ByteBuffer/allocate 2048))
          builder (domonad
                   [_ (write-archive-header (count files))
                    _ (m-map write-file-entry files)
                    _ (write-archive-trailer)]
                   nil)
          ]
      (builder {:channel output-channel
                :buffer buffer
                :entries []})
      nil)))

Writing the Header

At this point we have a "working" monadic computation, but we need to write the real versions of the functions that do all the work. We'll have to understand two things: how to write a function that returns a proper state-monadic value, and how to access the state from within that function.

To decompose a monadic function into separate functions, you should use the defmonadfn macro, and wrap the function bodies in either m-result or domonad, like this:

(defmonadfn write-archive-header [number-of-files]
  (domonad [] true))

What does that accomplish? The defmonadfn macro arranges for the definition of the current monad (the m-bind, m-result, m-zero, and m-plus functions) to be propagated from the caller into this function. The domonad wrapped around the body allows you to continue the monadic computation in this function, and ensure that the return value of write-archive-header is a state-monadic value that fits right into the calling computation.

But what should we actually do in this function? We need to extract the buffer from the state map, write a couple things into it, and then get the channel from the state map and flush the buffer into the channel.

Getting the values out of the state map is easy: the algo.monads library includes a few functions that are designed to help you work with maps as the monad state, and one of them is called fetch-val. Using that, we can flesh out the body of our function:

(defmonadfn write-archive-header [number-of-files]
  (domonad 
   [buffer (fetch-val :buffer)
    channel (fetch-val :channel)
    _ ( ... )]  ; write header somehow
   (flush-buffer buffer channel)))

As you'll recall from the many general monad tutorials, a domonad call works like a let, with a set of local variable bindings. Often we simply assign to _ to indicate (by convention) that we are ignoring the value. But in some cases we do want to use the return value later on in the function, and here we see two of those cases: where we're fetching buffer and channel from the state.

We'll need to flush the buffer at other times as well, so I made the decision right away to put that in its own function. But writing the actual header bytes is a one-off, so it might be nice to do it inline right here. But we can't simply use a do or a doto on the right-hand side of that binding, because those don't return monadic values. What to do?

The algo.monads library has quite a few general functions to use in monads, but none of them are suitable for simply wrapping up some non-monadic function calls at a particular point. So I wrote a macro to do the job. (This turned out to be unnecessary, because there's a different way of doing the trick using :let, as we'll see in the write-file-entry function. But I'm leaving it in here because I think it's instructive about what's going on.)

Remember what a state-monadic value is: a function that takes the current state as a parameter, does something, and then returns a vector of the result and the new state (which, of course, might be exactly the same as the old state). So here's the macro, called >>>:

(defmacro >>> [& forms]
  `(fn [s#] (let [result# (do ~@forms)] [result# s#])))

With that in place, we can make the right-hand side of the third binding in write-archive-header look like this:

(>>> (doto buffer
       (.put (.getBytes "NOZA"))
       (.putInt number-of-files)))

That will expand to (essentially) this:

(fn [s]
  (let [result (do (doto buffer
                     (.put (.getBytes "NOZA"))
                     (.putInt number-of-files)))]
       [result s]))

So here's the full write-archive-header function:

(defmonadfn write-archive-header [number-of-files]
  (domonad 
   [buffer (fetch-val :buffer)
    channel (fetch-val :channel)
    _ (>>> (doto buffer
             (.put (.getBytes "NOZA"))
             (.putInt number-of-files)))]
   (flush-buffer buffer channel)))

Flushing the buffer into the channel

The flush-buffer channel gets everything it needs—buffer and channel—as parameters, and it doesn't need to change the state. So it doesn't need to call domonad.

But it still needs to be a monadic value. So we define it with defmonadfn and wrap the body in m-result instead of domonad. The core of the function is mostly java.nio stuff. We return the number of bytes written to the channel.

(defmonadfn flush-buffer [buffer channel]
  (m-result
   (let [result (.write channel (.flip buffer))]
     (.clear buffer)
     result)))

Writing file entries and updating the state

At this point, writing write-file-entry is pretty straightforward. There are a couple of new things, though.

(defmonadfn write-file-entry [file]
  (domonad
   [buffer (fetch-val :buffer)
    channel (fetch-val :channel)
    :let [file-name (.getName file)
          byte-offset (.position channel)
          _ (doto buffer
              (.putInt (.length file-name))
              (.put (.getBytes file-name))
              (.putInt (.length file)))]
    _ (flush-buffer buffer channel)
    _ (write-file-contents file)
    _ (update-val :entries #(conj % [file-name byte-offset]))]
   true))

First, what's that :let thing? It turns out that that's the way to inject non-monadic computations into the middle of a monad comprehension. Those bindings are processed just as in an ordinary let, and the bound variables are avaliable through the rest of the domonad form. So there we simply gather file-name and byte-offset because we need them in the rest of the function, and we also throw in the steps to write the file header (the same kind of thing that we used >>> for in write-archive-header).

After writing the file header and flushing the buffer, we call write-file-contents to put the file data into the archive. Finally, it's time to update the state. The state map started with an :entries value of []. So we can use the update-val function (a sibling of fetch-val, along with set-val) to update :entries by conjing a new [file-name byte-offset] pair onto the end.

Writing the file contents

The write-file-contents method is really just java.nio stuff, but the first version I wrote didn't work, and it's instructive to see why.

Here's what I tried:

(defmonadfn write-file-contents-wrong [file]
  (with-open [from-file (java.io.FileInputStream. file)
              from-channel (.getChannel from-file)]
    (domonad
     [to-channel (fetch-val :channel)]
     (.transferTo from-channel 0 (.length file) to-channel))))

Looks reasonable, right? But when I ran the code with this version, I got an error for trying to read from a closed channel. Can you see why?

Remember that domonad rearranges its body into a monadic value---a function---that it returns, and it's actually evaluated by the caller. So when the with-open call returns (closing the channel and the output stream) nothing in the body of domonad has actually been evaluated yet; that's done in the caller, write-file-entry. The file is opened and immediately closed before we try to do anything with it.

To fix it, all that's necessary is to move with-open inside of domonad:

(defmonadfn write-file-contents [file]
  (domonad
   [to-channel (fetch-val :channel)]
   (with-open [from-file (java.io.FileInputStream. file)
               from-channel (.getChannel from-file)]
     (.transferTo from-channel 0 (.length file) to-channel))))

Wrapping it up: writing the trailer

Let's get a simple little function, write-trailer-entry, out of the way first:

(defmonadfn write-trailer-entry [buffer [file-name byte-offset]]
  (m-result 
   (doto buffer
     (.putInt (.length file-name))
     (.put (.getBytes file-name))
     (.putInt byte-offset))))

Now there's just one piece left, and it's where we make use of the :entries state value that we accumulated while writing out the file entries. At this point, though, there are no surprises:

(defmonadfn write-archive-trailer []
  (domonad
   [buffer (fetch-val :buffer)
    channel (fetch-val :channel)
    entries (fetch-val :entries)
    :let [trailer-offset (.position channel)
          _ (doto buffer
              (.put (.getBytes "NOZADIR"))
              (.putInt (count entries)))]
    _ (m-map (partial write-trailer-entry buffer) entries)
    :let [_ (doto buffer
              (.put (.getBytes "NOZATAIL"))
              (.putInt trailer-offset))]
    _ (flush-buffer buffer channel)]
   true))

As usual, we need to fetch :buffer and :channel from the state map, but this time we fetch :entries as well. We collect the byte offset of the start of the trailer (for use in the tail) and then write the trailer intro in the usual fashion.

To write the trailer entries, all we have to do is m-map over the list of entries. (Remember, a regular map won't work in a monad.)

Finally we can write the tail, flush the buffer, and we're done!

Closing Thoughts

The rest of this article was originally going to show alternative, non-monadic strategies, and then compare and contrast them. For reasons explained above, though, I abandoned this as a blog post idea. The rest of this gist includes three versions of the nozip code: the monadic version described in the text, a version that maintains the state in a var, and a third version that doesn't need either of those tricks, managing the necessary state in a straightforward, purely functional way. (But note that, while this simplified example allows such a solution, the real problem that led me down this path does not. As near as I can tell, a purely functional solution to that problem requires the use of the state monad.)

I have a fourth version of the code that uses a custom monad specialized for this task. If I have time soon, I will clean that up a bit and add it here.

;; A nozip writer that uses the state monad to track state.
(ns nozip.archive-monad
(:use clojure.algo.monads))
;; Note: this file contains many notes to myself, remnants of false
;; starts, and exploratory/diagnosis code that I used while trying
;; to figure out what was going on. I'll clean that up shortly.
;; Structure of a NoZip archive (big-endian):
;;
;; HEADER:
;; "NOZA" magic number
;; 4-byte int number of files in archive
;;
;; FILE ENTRIES (one per file)
;; 4-byte int length of file name
;; string file name
;; 4-byte int length of file data
;; data file data, uncompressed
;;
;; TRAILER
;; "NOZADIR" directory intro
;; 4-byte int number of directory entries
;;
;; DIRECTORY ENTRIES (one per file)
;; 4-byte int length of file name
;; string file naem
;; 4-byte int byte offset of file entry from start of archive
;;
;; TAIL
;; "NOZATAIL" tail intro
;; 4-byte int byte offset of start of trailer from start of archive
(comment
;; This is a discussion of the problems with m-seq and an
;; exploration of how to produce one that works properly.
;; Here's the algo.monads definition of `m-seq`:
(defmonadfn m-seq
"'Executes' the monadic values in ms and returns a sequence of the
basic values contained in them."
[ms]
(reduce (fn [q p]
(m-bind p (fn [x]
(m-bind q (fn [y]
(m-result (cons x y)))) )))
(m-result '())
(reverse ms)))
;; This definition is clearly based on Haskell's `sequence`
;; function, which is defined so (slightly reformatted):
;;
;; ```haskell
;; sequence = foldr mcons (return [])
;; where
;; mcons p q = p >>= \x ->
;; q >>= \y ->
;; return (x : y)
;; ```
;;
;; But in Clojure, that definition has some problems:
;;
;; 1. Clojure doesn't have foldr, so instead we do a reduce (with
;; swapped arguments) over the reversed list. Probably better
;; and more natural to do a reduce seeded with an array and use
;; conj instead of cons.
;; 2. Haskell's `foldr` is lazy, but Clojure's `reduce` is not. So
;; if this is used in a deferred-execution monad like `state-m`,
;; then the reduce happens when `domonad` is being processed,
;; rather than later. As far as I can tell this *shouldn't* have
;; any real practical effects, since the result of the reduction
;; is a giant recursive function that then has to be evaluated
;; later. But that leads to the next point:
;; 3. Because Haskell is lazy and has proper tail-call elimination,
;; returning a giant recursive function from this doesn't have
;; negative consequences in terms of either stack depth or (I
;; think) code size. But in Clojure, with `reduce` being strict
;; and no TCE, it's bad in both ways.
;;
;; In Clojure, this should return a monadic value that (in
;; `state-m`) performs the reduce when it is called, in a more
;; straightforward way. (For that matter, some of the other
;; functions---definitely `m-map`, and probably `m-chain`, and
;; `m-reduce`, should be rewritten similarly. (And also, looking at
;; it more closely, `m-until`, `m-when`, and `m-when-not` might have
;; a different kind of problem, where their conditions are evaluated
;; at domonad time rather than being deferred.)
;;
(defmonadfn m-seq
[ms]
(reduce (fn [q p]
(m-bind p (fn [x]
(m-bind q (fn [y]
(m-result (cons x y)))))))
(m-result '())
(reverse ms)))
;; for reference:
(defn m-result [v] ; in state-m
(fn [s] [v s]))
(defn m-bind [mv f] ; in state-m
(fn [s]
(let [[v ss] (mv s)]
((f v) ss))))
(defmonadfn m-fmap
[f mv]
(m-bind mv (fn [x] (m-result (f x)))))
;; m-result: (fn m-result-state [v]
;; (fn [s] [v s])
(defmonadfn m-seq
[ms]
(m-bind (m-result ms)
(fn [_] (m-bind )
(reduce (fn [q p]
;; Here I need to
;;
;; 1. bind q to unwrap it from the result
;;
;; 2. force evaluation of p somehow in a way that
;; doesn't assume it's a state-m value
;;
;; 3. conj the two and wrap in m-result
;;
;; I *think* the first step is basically
;; m-bind, except that in the state monad it
;; returns a function. How do I force that to
;; be evaluated? It almost seems as though,
;; in a language without TCE, we need a third
;; required monadic primitive: something to
;; force evaluation, so that an iterative
;; implementation of some things can work
;; properly.
;;
)
(m-result [])
ms))))
)
(comment
;; This explains how the current m-map and m-seq functions evaluate
;; in the context of state-m.
;; assuming
(defmonadfn twice [n]
(m-result (+ n n)))
;; then, in the state monad, (twice 2) returns
(fn [s] [4 s])
;; so (m-map twice [1 2]) evaluates to this:
(m-seq '((fn [s] [2 s]) (fn [s] [4 s])))
;; which in turn evaluates to the following. (Not exactly; I'm
;; using `let` to show the effects of closure scope. But this is
;; the basic idea.
(fn [s] ; result of second iteration
(let [p (fn [s] [2 s])
[v ss] (p s)]
(((fn [x]
(let [q (fn [s] ; q on reduce's second iteration
; (i.e., result of first iteration)
(let [p (fn [s] [4 s])
[v ss] (p s)]
(((fn [x]
(let [q (fn [s] ['() s])] ; <-- q on reduce's first iteration
(m-bind q (fn [y]
(m-result (cons x y))))))
v)
ss)))]
(m-bind q (fn [y]
(m-result (cons x y))))))
v)
ss)))
;; It would be better if it evaluated to this:
(fn [s]
(m-result
(reduce (fn [[s v] mv]
(let [[r ss] (mv s)]
[ss (conj v r)]))
[s []]
'((fn [s] [2 s]) (fn [s] [4 s])))))
)
(defmonadfn state-m-seq
[ms]
(fn [s]
(reduce (fn [[v s] mv]
(let [[r ss] (mv s)]
[(conj v r) ss]))
[[] s]
ms)))
(defmonadfn state-m-map
[f xs]
(state-m-seq (map f xs)))
(defmacro >>>
"Wraps forms in a state-monad function that returns the result of
the last form and does not modify the state."
[& forms]
`(fn [s#] (let [result# (do ~@forms)] [result# s#])))
(defn stack-depth [label]
(let [trace (-> (Throwable.)
(.fillInStackTrace)
(.getStackTrace))
depth (count trace)]
(println (str label ": " depth))
depth))
(defmonadfn flush-buffer
"Flushes the buffer to the channel and clears the buffer.
Returns the number of bytes written."
[buffer channel]
(m-result
(let [result (.write channel (.flip buffer))]
(.clear buffer)
result)))
(defmonadfn write-archive-header [number-of-files]
(domonad
[buffer (fetch-val :buffer)
_ (>>> (doto buffer
(.put (.getBytes "NOZA"))
(.putInt number-of-files)))
channel (fetch-val :channel)]
(flush-buffer buffer channel)))
;; This doesn't work, because the work of the monad doesn't happen
;; here. This function:
;; 1. opens the file and channel
;; 2. builds a monadic value (the function returned from the domonad
;; call) but DOES NOT EVALUATE IT!!
;; 3. closes the channel and file
;; 4. returns the monadic value
;;
;; Finally, that monadic value is evaluated as part of the builder
;; evaluation in write-archive. By the time we call .transferTo,
;; from-file and from-channel have already been closed.
(defmonadfn write-file-contents-wrong [file]
(with-open [from-file (java.io.FileInputStream. file)
from-channel (.getChannel from-file)]
(domonad
[to-channel (fetch-val :channel)]
(.transferTo from-channel 0 (.length file) to-channel))))
(defmonadfn write-file-contents
"Copies the contents of file (uncompressed) directly to the output channel.
Returns the number of bytes written."
[file]
(domonad
[to-channel (fetch-val :channel)]
(with-open [from-file (java.io.FileInputStream. file)
from-channel (.getChannel from-file)]
(.transferTo from-channel 0 (.length file) to-channel))))
(defmonadfn write-file-entry [file]
(domonad
[buffer (fetch-val :buffer)
channel (fetch-val :channel)
:let [file-name (.getName file)
byte-offset (.position channel)
_ (doto buffer
(.putInt (.length file-name))
(.put (.getBytes file-name))
(.putInt (.length file)))]
_ (flush-buffer buffer channel)
_ (write-file-contents file)
_ (update-val :entries #(conj % [file-name byte-offset]))]
true))
(defmonadfn write-trailer-entry [channel buffer [file-name byte-offset]]
(m-result
(do
(when (< (.remaining buffer) (+ 8 (count file-name)))
(flush-buffer buffer channel))
(doto buffer
(.putInt (.length file-name))
(.put (.getBytes file-name))
(.putInt byte-offset)))))
(defmonadfn write-archive-trailer []
(domonad
[buffer (fetch-val :buffer)
channel (fetch-val :channel)
entries (fetch-val :entries)
:let [trailer-offset (.position channel)
_ (doto buffer
(.put (.getBytes "NOZADIR"))
(.putInt (count entries)))]
_ (state-m-map (partial write-trailer-entry channel buffer) entries)
:let [_ (when (< (.remaining buffer) 12)
(flush-buffer buffer channel))
_ (doto buffer
(.put (.getBytes "NOZATAIL"))
(.putInt trailer-offset))]
_ (flush-buffer buffer channel)]
true))
(defn write-archive
"Writes a NoZip archive containing files on the (already open)
writable channel."
[output-channel files]
(with-monad state-m
(let [buffer (.clear (java.nio.ByteBuffer/allocate 2048))
builder (domonad
[_ (write-archive-header (count files))
_ (state-m-map write-file-entry files)
_ (write-archive-trailer)]
nil)
]
(builder {:channel output-channel
:buffer buffer
:entries []})
nil)))
;; A nozip writer that uses functional decomposition to track state (on the stack).
(ns nozip.archive-simple)
(defn flush-buffer [buffer channel]
(let [result (.write channel (.flip buffer))]
(.clear buffer)
result))
(defn write-archive-header [channel buffer number-of-files]
(doto buffer
(.put (.getBytes "NOZA"))
(.putInt number-of-files))
(flush-buffer buffer channel))
(defn write-file-contents [to-channel file]
(with-open [from-file (java.io.FileInputStream. file)
from-channel (.getChannel from-file)]
(.transferTo from-channel 0 (.length file) to-channel)))
(defn write-file-entry [channel buffer file]
(let [file-name (.getName file)
byte-offset (.position channel)]
(doto buffer
(.putInt (.length file-name))
(.put (.getBytes file-name))
(.putInt (.length file)))
(flush-buffer buffer channel)
(write-file-contents channel file)
(vector file-name byte-offset)))
(defn write-trailer-entry [channel buffer [file-name byte-offset]]
(when (< (.remaining buffer) (+ 8 (count file-name)))
(flush-buffer buffer channel))
(doto buffer
(.putInt (.length file-name))
(.put (.getBytes file-name))
(.putInt byte-offset)))
(defn write-archive-trailer [channel buffer file-info-entries]
(let [trailer-offset (.position channel)]
(doto buffer
(.put (.getBytes "NOZADIR"))
(.putInt (count file-info-entries)))
(doall (map (partial write-trailer-entry channel buffer) file-info-entries))
(when (< (.remaining buffer) 12)
(flush-buffer buffer channel))
(doto buffer
(.put (.getBytes "NOZATAIL"))
(.putInt trailer-offset))
(flush-buffer buffer channel)))
(defn write-archive [output-channel files]
(let [buffer (.clear (java.nio.ByteBuffer/allocate 2048))]
(write-archive-header output-channel buffer (count files))
(let [file-info-entries (doall (map (partial write-file-entry output-channel buffer) files))]
(write-archive-trailer output-channel buffer file-info-entries))
nil))
;; A nozip writer that uses a var to track state.
(ns nozip.archive-var)
(def ^:dynamic archive-state {})
(defn flush-buffer [buffer channel]
(let [result (.write channel (.flip buffer))]
(.clear buffer)
result))
(defn write-archive-header [number-of-files]
(let [buffer (:buffer archive-state)
channel (:channel archive-state)]
(doto buffer
(.put (.getBytes "NOZA"))
(.putInt number-of-files))
(flush-buffer buffer channel)))
(defn write-file-contents [file]
(let
[to-channel (:channel archive-state)]
(with-open [from-file (java.io.FileInputStream. file)
from-channel (.getChannel from-file)]
(.transferTo from-channel 0 (.length file) to-channel))))
(defn write-file-entry [file]
(let [buffer (:buffer archive-state)
channel (:channel archive-state)
file-name (.getName file)
byte-offset (.position channel)]
(doto buffer
(.putInt (.length file-name))
(.put (.getBytes file-name))
(.putInt (.length file)))
(flush-buffer buffer channel)
(write-file-contents file)
(set! archive-state
(update-in archive-state [:entries] conj [file-name byte-offset]))))
(defn write-trailer-entry [channel buffer [file-name byte-offset]]
(when (< (.remaining buffer) (+ 8 (count file-name)))
(flush-buffer buffer channel))
(doto buffer
(.putInt (.length file-name))
(.put (.getBytes file-name))
(.putInt byte-offset)))
(defn write-archive-trailer []
(let [buffer (:buffer archive-state)
channel (:channel archive-state)
entries (:entries archive-state)
trailer-offset (.position channel)]
(doto buffer
(.put (.getBytes "NOZADIR"))
(.putInt (count entries)))
(doall (map (partial write-trailer-entry channel buffer) entries))
(when (< (.remaining buffer) 12)
(flush-buffer buffer channel))
(doto buffer
(.put (.getBytes "NOZATAIL"))
(.putInt trailer-offset))
(flush-buffer buffer channel)))
(defn write-archive [output-channel files]
(let [buffer (.clear (java.nio.ByteBuffer/allocate 2048))]
(binding [archive-state {:channel output-channel
:buffer buffer
:entries []}]
(write-archive-header (count files))
(doall (map write-file-entry files))
(write-archive-trailer)))
nil)
;; A driver program that allows testing any of the three nozip writers.
(ns nozip.core
(:require nozip.archive-monad
nozip.archive-simple
nozip.archive-var))
(defn choose-version [version-name]
(case version-name
"monad" nozip.archive-monad/write-archive
"simple" nozip.archive-simple/write-archive
"var" nozip.archive-var/write-archive
))
(defn basic-run [write-archive-fn files]
(with-open [archive (java.io.FileOutputStream. "archive.nozip")
channel (.getChannel archive)]
(write-archive-fn channel files)))
(defn benchmark-run [write-archive-fn files]
(with-open [archive (java.io.FileOutputStream. "/dev/null")
channel (.getChannel archive)]
(time
(write-archive-fn channel (take 10000 (cycle files))))))
(defn run-with [run-fn args]
;; work around dangerous default behaviour in Clojure
(alter-var-root #'*read-eval* (constantly false))
(let [write-archive-fn (choose-version (first args))
dir (java.io.File. (second args))
file-filter (reify java.io.FileFilter
(accept [this path]
(and (not (.isDirectory path))
(not (.isHidden path))
(.canRead path))))
files (.listFiles dir file-filter)]
(run-fn write-archive-fn files)))
(defn -main [& args]
(run-with basic-run args))
(defn benchmark [& args]
(run-with benchmark-run args))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment