@andris9
Created March 5, 2012 13:15
git-cache-meta
#!/bin/sh -e
#git-cache-meta -- simple file metadata caching and applying.
#Simpler than etckeeper, metastore, setgitperms, etc.
#From http://www.kerneltrap.org/mailarchive/git/2009/1/9/4654694
#Modified by n1k:
# - save metadata for all files, not only files owned by other users
# - save numeric uid and gid
# 2012-03-05 - added file times, andris9
: ${GIT_CACHE_META_FILE=.git_cache_meta}
case $@ in
--store|--stdout)
    case $1 in --store) exec > $GIT_CACHE_META_FILE; esac
    find $(git ls-files) \
        \( -printf 'chown %U %p\n' \) \
        \( -printf 'chgrp %G %p\n' \) \
        \( -printf 'touch -c -d "%AY-%Am-%Ad %AH:%AM:%AS" %p\n' \) \
        \( -printf 'chmod %#m %p\n' \) ;;
--apply) sh -e $GIT_CACHE_META_FILE;;
*) 1>&2 echo "Usage: $0 --store|--stdout|--apply"; exit 1;;
esac

On the source machine:

git-cache-meta --store

On the destination machine:

git-cache-meta --apply
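
Note that --apply reads the .git_cache_meta file written by --store, so the file has to travel with the repository. A minimal round trip might look like this (a sketch; the remote and branch names are assumptions):

git-cache-meta --store
git add .git_cache_meta
git commit -m "update file metadata cache"
git push origin master

# on the destination
git pull origin master
git-cache-meta --apply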

Download jgit.sh

Config

cat > ~/.jgit
accesskey: aws access key
secretkey: aws secret access key
<Ctrl-D>

Setup repo

git remote add origin amazon-s3://.jgit@bucket.name/repo-name.git

Push

jgit push origin master

Clone

jgit clone amazon-s3://.jgit@bucket.name/repo-name.git

Pull

jgit fetch
git merge origin/master
@Explorer09

Hello @DaniellMesquita,
Thanks for integrating the code, but I haven't had time to test the new version yet.
However, at first glance the new code does not yet meet the portability I expected. The first issue I found is that your version includes bashisms while the shebang line says /bin/sh and not /bin/bash. Either the script should be POSIX compatible, or it should use /bin/bash as the shebang.
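For example (an illustrative snippet, not a line from your script; $mode is a made-up variable), a test like this is a bashism that fails under a strict POSIX /bin/sh such as dash:

if [[ "$mode" = store ]]; then echo "storing"; fi

while the POSIX-compatible equivalent would be:

if [ "$mode" = store ]; then echo "storing"; fi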
If I see another issue, I might report it later.

@danimesq

@andris9

Completely forgot that this thing even exists 😀

Congrats on starting this project.
If there is ever a layer 2 for git in the future, this will surely be as useful to others as it was to me when I needed it.

@danny0838

Though it's nice to have revisions integrated, I switched to git-store-meta many years ago, as it's much more performant, more secure, and supports more features. 😅

Haven't heard of it. Could you share?

@Explorer09

Thanks for integrating the code, but I haven't had time to test the new version yet.
However, at first glance the new code does not yet meet the portability I expected. The first issue I found is that your version includes bashisms while the shebang line says /bin/sh and not /bin/bash. Either the script should be POSIX compatible, or it should use /bin/bash as the shebang.

Despite looking very bashy, it works with a simple ./git-meta.sh --store

Should I still change?

If I see another issue, I might report it later.

Thank you. You're very welcome, as the first person to report issues since this project started.

@danny0838

@DaniellMesquita As mentioned above.

@danimesq

@danny0838

If you port it from Perl to Rust, then I'll switch to it.

@danny0838

danny0838 commented Oct 17, 2021

@DaniellMesquita I don't write Rust. Even if I do, I won't recommend that.

The reason for using Perl is that Perl is (normally) a component of Git core and must therefore be supported by any platform that can run Git.

Using another popular language would inevitably introduce an additional dependency and make installation more difficult.

@Explorer09

@DaniellMesquita As I said, either make the script POSIX compatible or use /bin/bash as the shebang.
Not everyone uses bash as the default shell and you can run into compatibility problems.
Which choice to make is up to you. It's your project so you can make up your own policy.

@danimesq

@danny0838

The reason for using Perl is that Perl is (normally) a component of Git core and must therefore be supported by any platform that can run Git.

Using another popular language would inevitably introduce an additional dependency and make installation more difficult.

It makes sense.

Will git-store-meta support git hooks to automatically version changes in file metadata on every commit?

@danimesq

@Explorer09

As I said, either make the script POSIX compatible or use /bin/bash as the shebang.
Not everyone uses bash as the default shell and you can run into compatibility problems.
Which choice to make is up to you. It's your project so you can make up your own policy.

Democracy is way better than seeing this community effort as "my project".

Done: 01VCS/git-meta@810d5ff

And issues/PRs are welcome.

@danny0838

@DaniellMesquita

Will git-store-meta support git hooks to automatically version changes in file metadata on every commit?

Yes. Read the manual for details, bro.

@danimesq

danimesq commented Oct 18, 2021

@danny0838

Yes. Read the manual for details, bro.

Interesting. A native version, in the same language as git, makes more sense.

Although I'll personally stick with the sh/bash version for simplicity (and for diversification).

Now it can be set up in a repo to automatically version metadata on every commit: 01VCS/git-meta@cf30ef0

Maybe the next step is to keep a separate git repository for the metadata inside .git/meta (which would make things more organized and magical)?
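
A minimal sketch of what such a per-commit hook could look like, assuming the script is available on PATH as git-cache-meta (illustrative only, not the exact 01VCS/git-meta implementation):

#!/bin/sh -e
# .git/hooks/pre-commit
# Refresh the metadata cache and stage it so it is included in the commit.
git-cache-meta --store
git add .git_cache_meta

A matching post-checkout or post-merge hook could then run git-cache-meta --apply to restore the metadata after cloning or pulling.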

@Arcitec

Arcitec commented Oct 5, 2022

I was inspired by the script but saw some severe issues in it.

  1. %p is just the path without any special quoting of special characters in filenames, such as leading - which would be interpreted as a parameter, or spaces in the filename which would be interpreted as separate parameters. This breaks SEVERELY if the filenames are weird in any way whatsoever.
  2. The %A is the ACCESS TIME of the file. Why the F is it being tracked? I think you meant to use %T which is the last MODIFICATION TIME of the file. Most people these days don't even use access times anymore, and disable them completely or make them relative to some other time. It's definitely NOT what you intended to copy over.
  3. You're writing the time in human format while totally ignoring a little thing known as TIME ZONES. The dates it restores will be totally wrong.
  4. Why on earth are you using chmod at all? GIT ALREADY PRESERVES THE EXECUTABLE BIT (at least if core.filemode in the Git config is true, which it is by default), and that's the only mode bit most people actually care about. Saving mode bits via your script is largely pointless.
  5. You're grabbing %U and %G which are the NUMERIC USER/GROUP IDs. You should be using %u and %g which are the HUMAN-READABLE user/group, which is way more portable to other machines.
  6. Instead of outputting separate chown and chgrp commands, you should output ONE chown command, since it's able to take chown user:group -- FILE as parameters (see the sketch just after this list).
  7. Speaking of --... You should be using -- in every command, to tell them that there are no more flags, and that the rest of the command is the arguments. This is necessary to avoid the risk of filenames being interpreted as parameters.
  8. The "metadata restoration script" you generate has no error-checking whatsoever. So it gives a false sense of security, since it runs but might fail to do anything, but it will just happily continue executing all lines even if there are severe errors (such as not having any write-permissions to the directory it's running in).
  9. Pretty much all of the "variants" above suffer the exact same bugs.
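
To illustrate points 5-7, the generated commands could instead look something like this (a sketch only; it still does not solve the quoting problem from point 1):

find . -type f -printf 'chown %u:%g -- %p\n'
# one chown per file, with human-readable user:group names and a -- guard,
# instead of separate chown and chgrp commands with numeric IDs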

Anyway that's just a few of the issues, there are probably more, but I was only really focused on the "file time" aspect which is what I am interested in saving/restoring in my repo...

So, I was investigating how to rapidly produce quoted (safe) filenames, in universal UNIX TIME format.

The following techniques are what I came up with:

  1. TERRIBLE: find . -type f -printf '%p %T@\n': This outputs the %T@ (Unix modification time with sub-second precision, which on most filesystems leads to trailing .000000). The reason it's terrible is that %p is not quoted, and the trailing zeroes after every timestamp are just stupid and wasteful.
  2. TERRIBLE: find . -type f -printf '%P %T@\n': Almost the same as the previous one, but I wanted to mention that %P outputs the paths without the leading folder (the . argument in this case), which is very useful if you're trying to be portable. But we still have the HUGE issue that filenames are not quoted. And no, we can't simply slap "%P" around it, since quoting DOESN'T WORK THAT WAY.
  3. KINDA GOOD: stat --printf='touch -mcd "@%Y" -- %N\n' **/*: Alright, now we're getting somewhere. This uses stat which supports %N which is the properly quoted/escaped path to the file. And its %Y outputs the Unix timestamp without ridiculous trailing milliseconds. That's a pretty nice evolution. But the globbing **/* is bad because it CAN'T HANDLE INVISIBLE FILES and also grabs every file and FOLDER, rather than just files.
  4. GREAT BUT SLOW: cd "somefolder" && find . -type f -exec stat --printf='touch -mcd "@%Y" -- %N\n' "{}" \; && cd ..: Alright this is getting close to perfection. It enters a folder, uses find to only look at files, executes stat on the file to get the Unix timestamp and quoted filename. So why is it bad? Well, it's super slow due to spawning stat once per file. Even small collections take a long time. But we can improve this...
  5. PERFECTION: cd "somefolder" && find . -type f -print0 | xargs -0 stat --printf='touch -mcd "@%Y" -- %N\n' && cd ..: With this we've finally achieved perfection. We're using find to discover all files rapidly (and since we're using find, you can add other conditions like "all files ending in .x" or "skip all files named foobar"), and the -print0 argument outputs them with a NUL separator (so that we support complex filenames, including spaces and even special characters such as newlines in the filename). Next, we use xargs with the matching -0 option to pass ALL of the discovered files SIMULTANEOUSLY into ONE execution of stat. This gives INSTANT RESULTS, which are all perfectly formatted and escaped.

TL;DR: Solution 5 is BY FAR the best way to back up modification times of files.

Oh and if you're wondering how we're setting the date: Type man date to read about supported DATE formats. Specifically, we're using Unix timestamps which are supported by prepending an @ before the numbers, as seen in this DATE manual example:

Convert seconds since the Epoch (1970-01-01 UTC) to a date

$ date --date='@2147483647'

Here's the "core" of what we're going to do:

cd "Parent Folder" && find . -type f -not -name "metadata-cache" -print0 | sort -z | xargs -0 stat --printf='touch -mcd "@%Y" -- %N\n' > "./metadata-cache" && cd ..

This enters the parent folder to ensure that all paths become relative to that parent. (In my actual command this is the full path to my parent folder; I just changed it to "Parent Folder" for this demo.)

Next, it lists all regular files except any named "metadata-cache", to avoid listing the cache itself.

Then it sorts the NUL-terminated filenames to ensure that they end up in a nice order (this just makes the metadata file easier to diff and compare).

Then it executes "stat" to safely print their UNIX timestamp commands and their quoted paths.

It pipes that output into a file named "metadata-cache" which ends up inside the parent folder.

The end result is a very clean file which can now be executed to apply all modification times, when necessary.
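
For reference, the generated metadata-cache file simply contains one touch command per file, roughly like this (hypothetical filenames; the exact quoting style produced by stat's %N depends on your coreutils version):

touch -mcd "@1343749766" -- ./README.md
touch -mcd "@1343749770" -- './notes/some file with spaces.txt'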


The reason for && between all commands is to make the whole sequence stop and report failure as soon as any part of the command-chain fails. This means that you can check $? after executing this one-liner, to see if any part of the chain failed. So something like if [[ $? -ne 0 ]]; then echo "OH NO IT FAILED"; fi will catch the error.

But you MUST do that check immediately after this one-liner, because if you run any other commands first, then the value of $? (last command status) will change. Keep that in mind! :)

Also keep in mind that if the chain fails before the cd .. then you will be stuck in the "Parent Folder" location that you cd-ed into. But personally I don't care since my script will exit if any part failed.

But... to make things even better, it's possible to save the result of $PWD (Bash's always-up-to-date pwd equivalent variable), before we cd at all, which will allow us to restore the current working dir at the end no matter where you came from originally. That's what we'll do in the final functions below.

Final, reliable functions, hereby placed in the Public Domain:

#!/usr/bin/env bash

function write_metadata() {
    # Writes a robust metadata file containing sorted, fully-escaped paths, with
    # the full UNIX modification timestamp of each file.
    CURRENT_PWD="${PWD}"
    cd "${WHATEVER_DIR}" && find . -type f -not -name "metadata-cache" -print0 | sort -z | xargs -0 stat --printf='touch -mcd "@%Y" -- %N || exit 1\n' > "./metadata-cache"
    if [[ $? -ne 0 ]]; then echo "Error while writing metadata cache. Aborting..."; exit 1; fi
    cd "${CURRENT_PWD}"
    if [[ $? -ne 0 ]]; then echo "Error while accessing previous working directory. Aborting..."; exit 1; fi
}

function read_metadata() {
    # Applying the metadata again is a simple matter of going into the target
    # folder if it exists, and then executing the metadata file as a script.
    if [[ ! -f "${WHATEVER_DIR}/metadata-cache" ]]; then return 0; fi
    CURRENT_PWD="${PWD}"
    cd "${WHATEVER_DIR}" && env bash -- "./metadata-cache"
    if [[ $? -ne 0 ]]; then echo "Error while reading metadata cache. Aborting..."; exit 1; fi
    cd "${CURRENT_PWD}"
    if [[ $? -ne 0 ]]; then echo "Error while accessing previous working directory. Aborting..."; exit 1; fi
}

The "${WHATEVER_DIR}" is just whatever folder you want to scan/restore. Replace that with whatever your own variable is called, where you store the path to the target directory.

If you want to make things harder for yourself, you may even decide to make the functions modular by taking $1 as a dynamic parameter of what directory to scan. But then you'll have to call the functions with parameters everywhere in your code, so the choice is yours. :) I personally don't think anyone is going to need modularity enough to warrant the risks/drawbacks of taking an arbitrary parameter instead, so I went with the hardcoded path variables.

One thing to be aware of is that we're using if [[ ... ]]; then ...; fi instead of the [[ ... ]] && { ... } shorthand that many people like. The shorthand is just a command, not an if-statement, and Bash functions automatically return the value of the last executed command. So if everything was successful and the "check for errors" therefore "failed", the function returns a non-zero status and looks like it gave an error even though it didn't. To avoid that return-value bug, we must explicitly use if-statements for all checks in our functions (see the small illustration below).
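
A tiny illustration of that pitfall (some_command is a made-up stand-in for whatever command you want to check):

function do_it_wrong() {
    some_command
    # If some_command succeeds, this check "fails", and since it is the last
    # command executed, the whole function returns 1 even though nothing broke.
    [[ $? -ne 0 ]] && { echo "failed"; exit 1; }
}

function do_it_right() {
    some_command
    # An if-statement whose condition is false (and which has no else branch)
    # returns 0, so the function's return status stays clean on success.
    if [[ $? -ne 0 ]]; then echo "failed"; exit 1; fi
}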

There's another small but important thing to be aware of in read_metadata(): The metadata restoration script is executed in a sub-shell, and the metadata-cache script is written to contain || exit 1 after each statement, so that it exits with an error code if any of its commands fails. This means that we'll detect if anything went wrong while restoring metadata. If you prefer letting the metadata file ignore errors, you can remove that part of the lines created by write_metadata(). :) However, you most likely WANT to keep these error checks, because they let you discover when touch found a file but failed to modify its timestamp (such as when it lacks permission to modify the folder). If a file doesn't exist, touch -c simply returns success, so you don't have to worry about missing files triggering those error handlers. They will only trigger on actual errors in the metadata restoration process!

Enjoy!

@danimesq

Hi @Bananaman! Which version is your implementation based on?

@spoelstraethan

@Bananaman if you used pushd "${WHATEVER_DIR}" >/dev/null and then popd >/dev/null if any error occurred, you would still end up in the "original" directory without the extra output messages from those commands polluting the output.

@Explorer09

Explorer09 commented Jan 10, 2023

@spoelstraethan

if you used pushd "${WHATEVER_DIR}" >/dev/null and then popd >/dev/null if any error occurred, you would still end up in the "original" directory without the extra output messages from those commands polluting the output.

Just to note that pushd and popd are bashisms and thus unportable. Another problem with pushd is that you need to handle the case where pushd itself fails (e.g. when the directory doesn't exist); if you are not careful, you end up popping one more directory from the stack than needed (which could become a security vulnerability in certain applications).

If you need to return to the original directory, the safest as well as the simplest approach is to cd in a subshell, and exit the subshell when you are done or when an error occurs in that directory. Like this:

func1 () {
    # Process current directory
    (
        set -e
        cd "$new_dir"
        # Process $new_dir
        # When an error occurs, this subshell exits because of the "set -e" command
    ) || return
    # Back to the directory that was current when entering the function
}
