@e2
Last active August 29, 2015 13:59
# (For developers) How to PROPERLY think about watching files and directories for changes
Goal: saving people the frustration of getting tools to "correctly" respond to
changes to files or directories.
## How to think about directories
Here's an example TODO file (/home/me/Documents/todo.txt):
```
jack work 10:00am discuss plans (check agenda notes)
jane sales 1:00pm make a call (call 123 456 789)
me home 7:00pm go shopping (take shopping list)
```
and here is a (simplified) directory listing in Linux (/home/me/Documents):
```
-rw-rw-r-- jack team Feb 2 2014 9:34 agenda.txt
-rw-rw-r-- jane sales Feb 3 2014 10:22 phonebook.txt
-rw-r----- me me Feb 4 2014 3:22 shopping_list.txt
```
Do you see it? There's almost NO DIFFERENCE!!!
What does this mean? DIRECTORIES ARE FILES!!!
Specifically, a directory is a SPECIAL FILE that contains a list of other
files.
A directory doesn't contain files, it contains INFORMATION ABOUT files.
## How to think about files
Let's take the above shopping_list.txt:
```
2 cucumbers
3 forks
5 spoons
```
and let's remove a line:
```
2 cucumbers
5 spoons
```
What happened? The CONTENTS of the file changed.
In other words, the file was MODIFIED.
How? A line was REMOVED.
Now let's go back to the list of files (/home/me/Documents):
```
-rw-rw-r-- jack team Feb 2 2014 9:34 agenda.txt
-rw-rw-r-- jane sales Feb 3 2014 10:22 phonebook.txt
-rw-r----- me me Feb 4 2014 3:22 shopping_list.txt
```
and let's remove one:
```
-rw-rw-r-- jack team Feb 2 2014 9:34 agenda.txt
-rw-r----- me me Feb 4 2014 3:22 shopping_list.txt
```
What happened? The directory CONTENT changed.
The file didn't change...
The directory didn't change ...
The directory CONTENT changed.
Which means: the DIRECTORY CONTENT was modified.
How? A file was deleted FROM the directory.
The DIRECTORY itself wasn't changed, because ...
... it's an entry in its PARENT directory.
(i.e. the directory itself didn't change, because
its name is the same and its attributes are the
same).
Physically, the file is still likely there.
Along with all the data in it.
It never changed. The DIRECTORY CONTENT changed.
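One way to see this for yourself is the sketch below. It assumes a POSIX-style filesystem with hard link support, and `contacts.txt` is a made-up second name for illustration. A hard link is just a second directory entry for the same data, so removing the first name changes the listing while the data stays readable.

```python
import os
import tempfile


def remove_name_keep_data():
    """Remove one directory entry while the data stays reachable via another."""
    with tempfile.TemporaryDirectory() as directory:
        original = os.path.join(directory, "phonebook.txt")
        # Hypothetical second name, used only for this illustration.
        second_name = os.path.join(directory, "contacts.txt")

        with open(original, "w") as handle:
            handle.write("jane 123 456 789\n")

        # A hard link adds a second directory entry for the same data.
        os.link(original, second_name)
        before = sorted(os.listdir(directory))

        # "Deleting the file" only removes one entry from the directory.
        os.remove(original)
        after = sorted(os.listdir(directory))

        # The data itself was never touched.
        with open(second_name) as handle:
            data = handle.read()

    return before, after, data
```

The only thing that differs between `before` and `after` is the directory listing.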
## How to think about names and attributes
What happens if I rename shopping_list.txt to my_shopping_list.txt?
The DIRECTORY CONTENTS changed. Not the file.
That's because files are just containers for DATA.
They have no "name".
It's the DIRECTORIES that have names of files.
Directories contain ATTRIBUTES of files and other directories.
This means names, timestamps and other attributes.
So, changing the modification date of a file means ...
... yes, changing the DIRECTORY.
(Technically "under the hood" it's not (always) true - but this
is about how to think to avoid design problems)
What about changing the name of a directory?
Answer: doing so changes the CONTENT of the PARENT directory.
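A small sketch of the rename case, assuming a POSIX-style filesystem where a rename within one directory keeps the inode (the "under the hood" part varies by platform). It compares the directory listing and the file's own metadata before and after the rename.

```python
import os
import tempfile


def rename_and_compare():
    """Rename a file and report what changed: the listing or the file."""
    with tempfile.TemporaryDirectory() as directory:
        old_path = os.path.join(directory, "shopping_list.txt")
        new_path = os.path.join(directory, "my_shopping_list.txt")
        with open(old_path, "w") as handle:
            handle.write("2 cucumbers\n3 forks\n5 spoons\n")

        listing_before = os.listdir(directory)
        file_before = os.stat(old_path)

        os.rename(old_path, new_path)

        listing_after = os.listdir(directory)
        file_after = os.stat(new_path)

    # Same inode, same size, same modification time: the data container
    # is untouched, only the name stored in the directory is different.
    same_file = (
        file_before.st_ino == file_after.st_ino
        and file_before.st_size == file_after.st_size
        and file_before.st_mtime_ns == file_after.st_mtime_ns
    )
    return listing_before, listing_after, same_file
```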
## How to think about changes
Tracking changes to files or directories makes NO SENSE!
Instead, you probably want to track:
- file content AS A WHOLE (e.g. version control, diff, md5, lines)
- directory content AS A WHOLE (e.g. for syncing, cron.d, config.d, etc.)
What you probably DON'T want to track:
- file/directory names
- file/directory attributes
- added files/directories
- removed files/directories
- changed files/directories
- symlinks
... because those changes happen in the directory and that kind
of information makes little sense to respond to, because the
STATE of the directory can be very different than the changes you
got notified about.
Trying to track those is a recipe for a world of headaches.
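Tracking directory content AS A WHOLE could look roughly like this. It's a minimal sketch, not a definitive implementation: `snapshot` and `content_changed` are names made up here, and the choice of mode, size and modification time as "the attributes" is an assumption.

```python
import os


def snapshot(directory):
    """Return the directory content as a whole: {name: attributes}."""
    state = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                info = entry.stat(follow_symlinks=False)
            except FileNotFoundError:
                # The entry vanished between listing and stat: the STATE
                # is what matters, so it is simply not part of it.
                continue
            state[entry.name] = (info.st_mode, info.st_size, info.st_mtime_ns)
    return state


def content_changed(old_state, new_state):
    """Answer one question only: is the content different or not?"""
    return old_state != new_state
```

Note there's no "added", "removed" or "renamed" here. You compare two STATES and respond if they differ.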
## How to think about symlinks
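A symlink is a FILE or DIRECTORY within another DIRECTORY.
And that's regardless of whether you think of the symlink as a file
or the TARGET as a file.
Which means a symlink can be seen as:
1. a FILE (the target, when it points to a file)
2. a DIRECTORY (the target, when it points to a directory)
3. a FILE (the symlink itself, as an entry in its directory)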
So you want to track the CONTENT of the symlinked DIRECTORY
or symlinked FILE or directory CONTAINING the symlink.
But you do NOT want to "track the symlink", because that
doesn't make sense.
## How to think about TRACKING changes
First, decide WHAT you want to DO after a change happens.
THEN, decide what EXACTLY has to change for the action to
make sense.
THEN, based on that, you'll know if you want to track:
1. Directory content
or
2. File content
E.g. if you want to run rsync, you want to respond to DIRECTORY CONTENT CHANGES.
## How to think of file descriptors
File descriptors are isolated TRANSACTIONS related to SINGLE files or directories.
Which means you can continue to read a file that was meanwhile deleted.
You can enter a directory while it gets deleted by something else.
This means you shouldn't make ANY assumptions about the contents or existence
of a file or directory except WHILE you have it open (after you open it and before you close it).
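Here's a sketch of the "read a deleted file" case. It assumes POSIX behaviour (on Windows, removing an open file usually fails instead).

```python
import os
import tempfile


def read_after_delete():
    """Delete a file by name while it is open, then read it anyway."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "agenda.txt")
        with open(path, "w") as handle:
            handle.write("10:00am discuss plans\n")

        with open(path) as handle:       # the descriptor is opened here
            os.remove(path)              # the name disappears from the directory
            exists_by_name = os.path.exists(path)
            data = handle.read()         # the open descriptor still works

    return exists_by_name, data
```

The name is gone (`exists_by_name` is `False`), yet the read succeeds, because the descriptor doesn't care about the directory entry anymore.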
## How to track content (low level)
Of course, checking file contents for changes every few microseconds makes no sense.
Same with directories.
The solution is to track the CONTAINING DIRECTORY (for file attribute changes)
or the PARENT DIRECTORY (for directory attribute changes).
Note that this is just a way to think to avoid design problems and bugs, because
e.g. technically timestamps are stored in inodes (not as content) and tools provide
abstractions, e.g. inotify lets you "track files" (e.g. for close_write, attrib, etc.).
So if you're designing LOW LEVEL tools, you want to PROVIDE content tracking by
WATCHING directories for file/directory ATTRIBUTE changes or DESCRIPTOR related
changes, e.g. close_write.
## How to track content (high level)
If you're interested in creating HIGH LEVEL tools, it's best to allow ONLY
tracking CONTENT (file content, or directory-as-a-list-of-files content).
This means NOT allowing users to track additions, removals, etc., because the
users will STILL have to verify those changes themselves.
E.g. even if a file is added, that "extra" information makes no more sense for
rsync than when a file is modified. It just makes the high level interface
extremely confusing and complex for users.
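As a rough illustration of such a high level interface, here's a polling sketch. `DirectoryContentWatcher` and its `changed()` method are hypothetical names, and polling with `os.scandir` is just the simplest stand-in for a real notification backend.

```python
import os


class DirectoryContentWatcher:
    """Expose a single question: did the directory content change?"""

    def __init__(self, directory):
        self.directory = directory
        self._last = self._snapshot()

    def _snapshot(self):
        # Names plus a few attributes stand in for "directory content".
        state = {}
        with os.scandir(self.directory) as entries:
            for entry in entries:
                try:
                    info = entry.stat(follow_symlinks=False)
                except FileNotFoundError:
                    continue  # vanished while we were looking
                state[entry.name] = (info.st_mode, info.st_size, info.st_mtime_ns)
        return state

    def changed(self):
        """Return True once per change; no added/removed/renamed details."""
        current = self._snapshot()
        if current == self._last:
            return False
        self._last = current
        return True
```

A caller would do something like `if watcher.changed(): run_rsync()` (where `run_rsync` is whatever action you decided on) and never has to interpret individual events.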
## How to track trees
Tracking whole trees means tracking the ROOT directory CONTAINING the tracked
directories for changes, making sure that newly created subdirectories are
automatically watched.
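With a polling approach, one way to get "newly created subdirectories are automatically watched" for free is to re-walk the tree on every check. A minimal sketch, where `snapshot_tree` is a made-up name and symlinked directories are deliberately not followed:

```python
import os


def snapshot_tree(root):
    """Return the state of a whole tree: {relative path: attributes}."""
    state = {}
    # os.walk does not follow symlinked directories by default.
    for current, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(current, name)
            try:
                info = os.lstat(path)
            except FileNotFoundError:
                continue  # vanished while walking
            relative = os.path.relpath(path, root)
            state[relative] = (info.st_mode, info.st_size, info.st_mtime_ns)
    return state
```

Comparing two such snapshots tells you whether anything under the ROOT changed, including inside subdirectories that didn't exist last time.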
When symlinks are present, this probably means (depending on use case) tracking
BOTH:
1. the directory containing the symlink (if we're tracking symlinking)
2. the directory containing the target (if we're tracking file/dir content - the target)
It does NOT make sense to track for file changes in the directory containing
the symlink!
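The two directories from the list above can be worked out like this. It's only a sketch: `directories_to_watch` is a hypothetical helper, and which of the two results you actually need depends on your use case.

```python
import os


def directories_to_watch(link_path):
    """Return (directory containing the symlink, directory containing the target)."""
    link_path = os.path.abspath(link_path)
    # 1. Where the symlink itself lives (for tracking symlinking).
    link_directory = os.path.dirname(link_path)
    # 2. Where the fully resolved target lives (for tracking its content).
    target = os.path.realpath(link_path)
    target_directory = os.path.dirname(target)
    return link_directory, target_directory
```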
## How to think about renaming/moving
Basically, renaming a file or directory is the same as copying, deleting and
creating, except it's faster (which is an implementation detail - and not
something anyone tracking changes will want to really know about).
Tracking renames is an optimization, and it's only useful if you have
reliable notifications - which means having a robust integrated solution.
So if you're not using something like AMQP for reliability, you probably
DON'T want to track renames - unless you have a RELIABLE way of detecting if a
moved file CHANGED or not.
More importantly, without knowing the INTENTION behind renaming, you can't
respond accordingly.
Consider shopping_list.txt renamed to my_shopping_list.txt.
Should tools detect the file was renamed and use the new file from now on?
Probably not.
Renaming means things RELYING on that name are SUPPOSED to NO LONGER USE THAT
FILE.
Which effectively from the tool's perspective means ... the file was DELETED!
And from another tool's perspective (one tracking my_shopping_list.txt), the file
APPEARED.
Conclusion: you DO NOT want to track renames as something separate. You want to
JUST track DIRECTORY CONTENT CHANGES or INDIVIDUAL FILE CONTENTS.
## How to think about adding/deleting
Removing an expected file results in an error. But that's useful to know
ONLY WHILE opening the file is attempted. So it's up to the application
USING the file to CARE whether a file exists or not.
So the application tracking the files DOESN'T CARE if a file is ADDED, MODIFIED
OR DELETED.
E.g. when a file is deleted 2 things change:
1. The file content is no longer accessible (it "disappears", e.g. there's an error)
2. The directory content changes - the NAME and ATTRIBUTES disappear.
This means the same content is no longer available UNDER THE SAME NAME.
Which means trying to access the content UNDER THAT NAME results in something
OTHER than the previous content ... which means ... the file CONTENT changed.
So deleting my_shopping_list.txt means my_shopping_list.txt CONTENT has
changed.
This means adding/deleting files is JUST A CONTENT CHANGE - nothing more.
That means either the content of the file or the directory - or the parent
directory containing them.
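To make the "deleting is just a content change" idea concrete, here's a sketch that treats "what's available under this name" as the only thing tracked. `content_under_name` is a made-up helper, and SHA-256 is simply one convenient way to compare content as a whole.

```python
import hashlib


def content_under_name(path):
    """Return a digest of whatever is readable under this name, or None."""
    try:
        with open(path, "rb") as handle:
            return hashlib.sha256(handle.read()).hexdigest()
    except FileNotFoundError:
        # Nothing under that name: still just "a different content".
        return None
```

A tracker only compares the previous result with the current one. Whether the difference came from an edit, a delete or a newly added file doesn't matter to it.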
## Summary
If you're "tracking files or directories" you're really interested in either:
1. File contents
2. Directory content, meaning ONLY file/directory names and attributes
(Conceptually, in both cases this means tracking the parent directory.
Technically, this varies.)
If you expect to "track" adding/removing/moving/symlinks/renames ...
... your approach is likely wrong.