@kisom
Last active May 31, 2016 13:00
"Why Purity Matters" blog post.
Purity is a useful construct that forces programmers to consider that the
environment they are operating in is unclean, and that this uncleanliness
introduces barriers to formally defining the behaviour of the system
comprising the program and its environment.
This file is a [literate Haskell
file](https://gist.github.com/kisom/3863f17636d99b4f8401) run through
[pandoc](http://johnmacfarlane.net/pandoc/) to produce a Markdown
post. There might be a few glitches as I'm still developing a workflow
in this style.
> import System.IO
> import System.IO.Error
I've spent the majority of my career thus far as an embedded Linux
engineer writing primarily C targeting our boxes. In this post, this
type of environment is what I have in mind for the target system, but it
really applies to most Linux systems as well. I can't speak to the others.
Consider the following program fragment in C; this is a common pattern
I've encountered working as a systems engineer.
```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

typedef struct {
    uint8_t *data;
    size_t   length;
} FileData;

FileData *
read_file(const char *path)
{
    struct stat  sb;
    FileData    *fdata = NULL;
    FILE        *file = NULL;

    /* Ask the filesystem how large the file is. */
    if (stat(path, &sb) == -1) {
        return NULL;
    }

    if (NULL == (fdata = calloc(1, sizeof(FileData)))) {
        return NULL;
    }

    /* Allocate a buffer sized from the stat(2) result. */
    fdata->length = (size_t)sb.st_size;
    fdata->data = calloc(fdata->length, 1);
    if (NULL == fdata->data) {
        free(fdata);
        return NULL;
    }

    file = fopen(path, "r");
    if (NULL == file) {
        free(fdata->data);
        free(fdata);
        return NULL;
    }

    /* Anything short of the expected length is treated as a failure. */
    if (fdata->length != fread(fdata->data, 1, fdata->length, file)) {
        fclose(file); /* don't leak the stream on the error path */
        free(fdata->data);
        free(fdata);
        return NULL;
    }

    fclose(file);
    return fdata;
}
```
In what ways can `read_file` fail?
#### The obvious
The most obvious ways this can fail are:
1. The file doesn't exist (which is picked up in the `stat(2)` call).
2. The program doesn't have permissions to read the file (picked up in
   the `stat(2)` or `fopen(3)` call).
3. The program cannot allocate memory, either kernel memory for the
   call to `stat(2)` or user memory for the calls to `calloc(3)`.
4. The program cannot read the entire file into memory.
I've noticed that my first tendency is to think of this function as
running on a "snapshot" of the system: that is, the state of the
system remains consistent throughout this function. Functions are
fast, right? And this shouldn't take so long to run that the world can
change, right?
It turns out the answer is more subtle than this.
#### Abandon every hope
In order to understand the complexities of what is actually going on,
let's consider how the previous four failure modes occur. We also need
to understand that the scheduler can sleep this process at any time and
give another process control.
##### ENOENT
In the case of `ENOENT`, it turns out that this can occur in both places
the file is accessed. That is, if the process yields control between the
`stat(2)` and `fopen(3)` calls, the file may no longer exist by the time
it is opened. The caller sees the same behaviour as in the case of a
permissions failure: `read_file` returns `NULL`.
This might occur, for example, during a log rotation:
* Process A runs a check through the system logs and determines that
"server.log" needs to be rotated.
* The scheduler puts A to sleep and wakes up process B.
* Process B enters `read_file` for "server.log".
* Process B calls `stat(2)` inside `read_file` and determines that
"server.log" is `L` bytes.
* The scheduler puts B to sleep and wakes up A.
* Process A renames "server.log" to "server.log.1".
* The scheduler puts A to sleep and wakes up process B.
* Process B allocates `L` bytes and attempts to read "server.log".
* "server.log" no longer exists, and `read_file` fails. The allocated
memory is returned back to the system.
This can put some churn on the memory allocator, which might lead to
performance problems.
The parent directories can be removed or renamed as well.
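
One common way to narrow this window is to open the file first and then query its size from the open descriptor with `fstat(2)`, rather than calling `stat(2)` on the path. The sketch below illustrates the idea; `open_with_size` is a hypothetical helper and not part of the original `read_file`. Once `open(2)` has succeeded, renaming or unlinking the path no longer changes which file the descriptor refers to, although the file can still be truncated underneath the reader.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption, not from the original post):
 * open the file, then ask the descriptor for its size.
 * Returns the descriptor (>= 0) and stores the size in *size, or
 * returns -1 with errno set by the call that failed.
 */
int
open_with_size(const char *path, size_t *size)
{
    struct stat sb;
    int         fd;

    assert(path != NULL && size != NULL);

    fd = open(path, O_RDONLY);
    if (fd == -1) {
        return -1;
    }

    if (fstat(fd, &sb) == -1) {
        int saved = errno; /* close(2) may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }

    *size = (size_t)sb.st_size;
    return fd;
}
```

A caller would then read from the returned descriptor (or wrap it with `fdopen(3)`) instead of reopening the path.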
##### EACCES
A file's (or its parent's) permissions can change during the course of
its lifetime; while it's been rarer, in my experience, it might make
for an interesting debugging session.
##### ENOMEM
Linux malloc [can never fail](http://scvalex.net/posts/6/), except
when it can. Usually, it won't be the malloc itself that fails, but
the effects of the memory pressure will be felt elsewhere in the system
causing turbulence like heavy paging or OOM kills. Memory pressure can
also affect scheduling.
##### EOF/FERROR
The most common case where the entire file can't be read into memory is
when it has been truncated. In the example for `ENOENT`, imagine that
process A manages to create a new "server.log" before process B resumes
execution. In this case, process B expects to read `L` bytes, but
"server.log" is now `L'` bytes. The `fread(3)` call doesn't know to expect
a smaller file, so it returns a short count and `read_file` fails.
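
A reader that tolerates a shrinking file can treat a short read as data rather than as a failure. The sketch below assumes a hypothetical helper, `read_up_to`, which is not part of the original fragment. It reports how many bytes actually arrived and uses `ferror(3)` to separate a genuine I/O error from plain end-of-file.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (an assumption, not from the original post):
 * read up to `want` bytes into buf and store the number of bytes
 * actually read in *got.
 * Returns 0 on success, including a short read caused by end-of-file,
 * and -1 if the stream reported an error.
 */
int
read_up_to(FILE *file, uint8_t *buf, size_t want, size_t *got)
{
    assert(file != NULL && buf != NULL && got != NULL);

    *got = 0;
    while (*got < want) {
        size_t n = fread(buf + *got, 1, want - *got, file);

        *got += n;
        if (n == 0) {
            break; /* end-of-file or error; decided below */
        }
    }

    return ferror(file) ? -1 : 0;
}
```

The caller can compare `*got` with the size it expected and decide whether a smaller file is acceptable for its purposes.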
#### Other failure modes
There are other ways the disk can fail: hardware failures or filesystem
corruption, for example. If the filesystem is a network filesystem,
all the failure modes of a network enter the mix as well. Some of
these calls will fail if the path name is too long (`ENAMETOOLONG`) or if
the program was built with 32-bit file offsets (i.e. without
-D_FILE_OFFSET_BITS=64 on a 32-bit platform) and the file is too large
for them to describe (`EOVERFLOW`).
#### Purity
Pure functions are those that do not rely on the outside world for
their answer; they do not rely on side effects or some state. In Haskell,
pure functions are the default, and impure functions (such as those that
access the disk) must be handled distinctly. The main function of every
program is wrapped in an IO pipeline; pure functions can split off from
this and operate on data, but they must always return to the IO
pipeline. This means that interactions with the outside world are always
marked as impure, and require special handling. There are ways to
circumvent this, but they require explicitly doing so and are frowned
upon. Furthermore, adding type annotations to mark where it's appropriate
to handle impure interactions and explicitly marking the pure code paths
allows one to arguably better reason about the behaviour of their code.
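
The same split can be pictured in C terms, even though the C compiler will not enforce it. In the sketch below, `count_newlines` is a hypothetical helper (it does not appear in the original fragment) that is pure in this sense: its result depends only on its arguments, so none of the failure modes listed for `read_file` apply to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example (an assumption, not from the original post):
 * count the newline bytes in a buffer. The function touches no global
 * state, performs no I/O and allocates nothing, so the same arguments
 * always give the same answer.
 */
size_t
count_newlines(const uint8_t *data, size_t length)
{
    size_t count = 0;

    /* Precondition check only; a NULL buffer is allowed when empty. */
    assert(data != NULL || length == 0);

    for (size_t i = 0; i < length; i++) {
        if (data[i] == '\n') {
            count++;
        }
    }

    return count;
}
```

All of the uncertainty stays in the code that produces the buffer; a function like this can be tested with nothing more than byte arrays.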
#### Haskell example
The following code sample actually uses two layers of monads to mark
the code paths.
The function operates on a data structure similar to the `FileData`
structure in the C fragment above.
> data FileData = FileData String Integer
The data structure will be showable in the REPL, but I'd rather not see
all the file's contents when I see the file.
> instance Show FileData where
>   show (FileData _ l) = "file of " ++ (show l) ++ " bytes"
Here is the Haskell `read_file`: it takes a file path and returns
an `IO (Either IOErrorType FileData)`. What is an `IO (Either IOErrorType
FileData)`? The `IO` part marks the output as being part of the `IO`
monad; it is in a pipeline of impure code that interacts with the outside
world. Any function that operates on the result of this function must
be prepared to handle such code. The `Either IOErrorType FileData` monad
inside the `IO` pipeline means that the result of this function is a
value that might be either an `IOErrorType` or a `FileData`. Functions
that handle the contents of the `IO` pipeline should be prepared to
handle both kinds of value: the error as well as the actual data.
> read_file :: FilePath -> IO (Either IOErrorType FileData)
> read_file path = do
>     catchIOError (hf path) exHandler
>   where hf p = do
>           -- Open once to query the size.
>           handle <- openFile p ReadMode
>           fileSize <- hFileSize handle
>           hClose handle
>           -- Open again to read the contents.
>           handle <- openFile p ReadMode
>           fileData <- hGetContents handle
>           -- hGetContents is lazy: force the whole string before
>           -- closing the handle, or the contents are lost.
>           length fileData `seq` hClose handle
>           let fdata = Right $ FileData fileData fileSize
>           return fdata
>         exHandler e = return $ Left (ioeGetErrorType e)
Unlike the C version, the error information is returned directly as part
of the result, instead of being extracted from `errno` after a failure
has been reported (which is idiomatic in C).
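
For comparison, the C idiom looks roughly like the sketch below. `open_for_reading` is a hypothetical wrapper and not part of the original fragment. The failure itself is signalled by a `NULL` return, and the reason has to be copied out of `errno` straight away, before another library call has a chance to overwrite it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical wrapper (an assumption, not from the original post):
 * returns the stream on success and stores 0 in *err, or returns NULL
 * and stores the errno value left behind by fopen(3) in *err.
 */
FILE *
open_for_reading(const char *path, int *err)
{
    FILE *file;

    assert(path != NULL && err != NULL);

    errno = 0;
    file = fopen(path, "r");
    *err = (file == NULL) ? errno : 0;

    return file;
}
```

Nothing in the types forces the caller to look at `*err`, which is the contrast with the `Either` result above.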
Coming from this embedded C background, I've grown to like this
explicitness about the world my programs operate in.