@warpfork
Created November 25, 2023 21:44
A bash Zappifier. (Not the latest version, due to... computer ownership shenanigans.)
#!/bin/bash
set -euo pipefail
#set -x
## HOW TO HOLD IT:
##
## Give the program you want to zapp as the first argument.
##
## For the couple of different usage patterns:
## - If you want to use directory shard conventions, set SPLAY_BASE to something sensible.
## - If you want to use file shard conventions, set SPLAY_BASE="-".
## - If you want just the whole files in output, do neither of the above -- it's the default.
##
## If you want to bundle multiple programs into the same bin dir,
## just set OUT_DIR to the same thing and run the whole script repeatedly.
##
## If you want to set OUT_DIR *and* SPLAY_BASE
##
## Most of these other variables you only set if you have to.
## (We'll try to find an "ldshim" on your PATH -- or you can just tell us where it is, via SHIM_BIN.
## We'll assume you have the usual host ELF interp -- override ZAPPIT_ELFINTERP if it lives elsewhere.
## Stuff like that.)
target_program="${1?"must provide target program as first argument"}"
ZAPPIT_ELFINTERP="${ZAPPIT_ELFINTERP:-"/usr/lib64/ld-linux-x86-64.so.2"}"
OUT_DIR="${OUT_DIR:-"/tmp/zappme/app/$(basename "$target_program")"}"
SHIM_BIN="${SHIM_BIN:-"$(which ldshim)"}"
SPLAY_BASE=${SPLAY_BASE:-""} ## Can be a pattern.
# Spicy: the elf interp segfaults on a static binary. Maybe that's bad. Geesh.
readarray -t liblines < <(LD_TRACE_LOADED_OBJECTS=1 "$ZAPPIT_ELFINTERP" "$target_program")
declare -A libs
for line in "${liblines[@]}"; do
  #echo
  #echo "${line@Q}"
  ## This regexp hits on a couple points:
  ## - The output lines always start with a tab.
  ## - The library name is whatever comes before a "=>".
  ## - ... except when it's an in-memory only library; then the "=>" doesn't appear at all.
  ## For our purposes: we just don't match that line. We don't need it anyway.
  ## - There's a file path after that. It's always absolute.
  ## - A memory address comes in parens at the end.
  ## We don't need this, so we don't report it, but we do match on it, just to be exhaustive.
  ##
  ## Is that enough?
  ##
  ## Well, I don't know. It depends on how loosey-goosey your ELF interp is
  ## about handling any library names or filesystem paths that have wonky characters in them.
  ## (It looks to me like it's typical for there to be no escaping at all on the file path part,
  ## which is why our regexp is so complete about handling the full line.)
  ## (If the path to the library contains a linebreak, things are truly impossible to control.
  ## This is unfortunate, because it's possible. I see no way to address this in our script;
  ## the ELF interp would have to do some escaping or validation of its output, and it... doesn't.)
  ## (I haven't tested at all what happens if a library name has madness in it.
  ## I suspect it's unhandled in the typical ELF interp as well, and thus uncorrectable here.)
  ##
  ## I've seen other parsers of this data outright assume no spaces are present in any names,
  ## and we've managed to do a bit better than that. But ultimately we're parsing a format
  ## that has no escaping that is sufficient for the range of data it's willing to pass through,
  ## and there's simply no way to secure that. The regexp isn't the problem; the data is.
  ##
  ## So! Moving along... as best we can...
  ##
  ## N.B.: POSIX bracket expressions don't honor backslash escapes, so something like `[^\0]`
  ## would exclude the character "0" (as in "libfoo.so.0") rather than a NUL byte.
  ## Bash strings can't hold NUL anyway, so we only exclude "/" from the library name.
  if [[ $line =~ ^$'\t'([^/]+)\ =\>\ (/.+)\ \([x0-9a-f]+\)$ ]]; then
    #declare -p BASH_REMATCH
    libs["${BASH_REMATCH[1]}"]="${BASH_REMATCH[2]}"
  fi
  ## One more interesting bit of that regexp: we don't match if any slashes are in the library name.
  ## As far as I know, the only time a slash appears in the library name is itself a special case:
  ## it's when the ELF interp reports *itself*: it does this with a full path, including a leading slash.
  ## Excluding that from our consideration is generally correct for our purposes here.
  ## TODO: if something uses the mad/unsafe $ORIGIN format for the ELF interp itself, I wonder if that shows up here instead.
done
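## A minimal sketch (not part of the original script) of the same parse as a
## standalone function, so it can be tried on sample lines by hand.
## The function name is hypothetical. It prints "name<TAB>path" on a match, and
## returns nonzero for lines we skip (in-memory libraries, the ELF interp itself).
parse_ldtrace_line() {
  local line="$1"
  ## Same shape as the regexp in the loop above: tab, name, " => ", absolute path, address.
  local re=$'^\t([^/]+) => (/.+) \\([x0-9a-f]+\\)$'
  [[ $line =~ $re ]] || return 1
  printf '%s\t%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
}
## e.g.: parse_ldtrace_line $'\tlibz.so.1 => /usr/lib64/libz.so.1 (0x00007f0000000000)'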
echo ---
declare -p libs
## N.B., Right now, this tool is only focused on dynamic libraries
## (e.g. `.so` files). We detected those above by asking the ELF interp to list what it would load.
##
## In the future, there's no reason the attention targets can't grow:
## explicit inputs, or files detected by strace, could both be attended to;
## and the output dirs of those might be "data" instead of "lib".
## There's two angles of approach to getting content addressed paths.
##
## - We can say that packaging controls this, and things "should" already be
## in directories, where the path name contains the hashes;
## - Or, we can say "screw it" and apply a hash ourselves and declare that's
## how it's going to be now.
##
## If we're building tools that are meant to live in a hostile world, and bring
## things into the fold the instant we encounter them, wherever they're coming from:
## then the second approach is the more powerful, because it demands nothing in advance,
## and also leaves nothing to be decided (the hashes are file granularity: done).
## The downside is: the only well-defined choice is hashes of file granularity;
## and that means we've pretty much removed any way to preserve any concepts of
## organizational intention. (Whether that matters for shared libs is debatable!)
##
## The approach of asking for directories to already be content-addressed is...
## well, it's asking for more. However, it also provides more.
## Specifically, if there's multiple library files from the same package,
## they get to stay as siblings on the filesystem (with each other, and perhaps,
## although it is rare for this to matter, also staying adjacent to other data files).
## Also, if you have a system managing packages... it gets to manage packages,
## as opposed to managing individual files -- and the former is a bit smaller of a number.
##
## Hybridizations are possible. You can *always* construct the file-scale CA index
## for a heap of libraries you already have available.
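##
## A minimal sketch of building that file-scale index (not part of the original script).
## Assumptions, all mine: SHA-256 via coreutils `sha256sum` as the hash, an
## "<index_dir>/<hash>/<name>" layout, and the hypothetical function name below.
ca_store_file() {
  local src="$1" index_dir="$2" name hash
  name="$(basename -- "$src")"
  ## Hash from stdin, so the output is always "<hex>  -" regardless of odd filenames.
  hash="$(sha256sum < "$src")"
  hash="${hash%% *}"  ## keep only the hex digest
  mkdir -p -- "$index_dir/$hash"
  cp -- "$src" "$index_dir/$hash/$name"
  ## Report where the file landed.
  printf '%s\n' "$index_dir/$hash/$name"
}
## e.g.: ca_store_file "${libs["libz.so.1"]}" "$OUT_DIR/lib"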
## For auto-detecting content-address-friendly paths, we do the following:
##
## Step 1: resolve (all) symlinks and clean the path.
##
## Step 2: look for our configured sharding path hunk.
## This can be at any depth in the path.
## It's a very simple match. It's just the string literal (there's no
## imaginable call for globbing or patterns here), anywhere within the path
## (as long as it's not the last two segments -- the last is the filename,
## and the second-to-last should be the content-addressing mangle-named dir.)
## (There are no further checks we can really do: We don't assume
## anything about the format of the content-addressing mangle that's
## presumed to be the subsequent path segment, so we've got nothing to check
## there; and the directory depth inside that dir can be arbitrarily high.)
##
## Step 3: for any path where we did find shard-looking paths,
## TODO FINISH
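##
## A minimal sketch of Steps 1 and 2 only (not part of the original script; Step 3 stays open).
## Assumptions, all mine: `realpath` from coreutils does the symlink resolution, the
## sharding hunk is passed as a literal string such as "/warehouse/", and the
## hypothetical function below prints whatever follows the hunk
## (the content-addressing dir, any subdirs, and the filename).
splay_remainder() {
  local path="$1" hunk="$2" rest
  [[ -n $hunk ]] || return 1
  ## Step 1: resolve (all) symlinks and clean the path.
  path="$(realpath -- "$path")" || return 1
  ## Step 2: literal substring match, at any depth (the quotes disable globbing).
  [[ $path == *"$hunk"* ]] || return 1
  rest="${path#*"$hunk"}"
  rest="${rest#/}"
  ## At least two segments must remain: the mangle-named dir and the filename.
  [[ $rest == */* ]] || return 1
  printf '%s\n' "$rest"
}
## e.g.: splay_remainder "${libs["libz.so.1"]}" "/warehouse/"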
## Set up the directory skeleton for our packaged output.
mkdir -p -- "$OUT_DIR"
mkdir -p -- "$OUT_DIR/bin"
mkdir -p -- "$OUT_DIR/dynbin"
mkdir -p -- "$OUT_DIR/lib"
## In all cases, we copy the binary to the dynbin folder, and the shim to the bin folder:
cp -- "$SHIM_BIN" "$OUT_DIR/bin/$(basename "$target_program")"
cp -- "$target_program" "$OUT_DIR/dynbin/$(basename "$target_program")"
case "$SPLAY_BASE" in
  "")
    >&2 printf "copying library files...\n"
    for lib in "${!libs[@]}"; do
      ## For the simplest story, where we're just copying files entirely:
      cp -- "${libs["$lib"]}" "$OUT_DIR/lib/$lib"
    done
  ;;
  "-")
    >&2 printf "creating file-hash splay of library files...\n"
    ## TODO
  ;;
  *)
    >&2 printf "searching for splay patterns in paths to library files...\n"
    ## TODO
  ;;
esac
echo "----"
find "$OUT_DIR"
echo "----"
echo "size by parts:"
du -sh "$OUT_DIR"/*
echo "----"
echo "size in total:"
du -sh "$OUT_DIR"