@tomsaleeba
Last active May 16, 2018 00:32
Concatenating RDF Turtle files

ttlcat

When you have a series of *.ttl files in a directory and you want to cat them all together, you'll want to strip the @prefix lines out of every file and prepend the de-duplicated set of @prefix lines once at the top of the output.

Use the following commands:

# run *in* the directory with the TTL files
head -n 50 -q *.ttl | grep '^@prefix' | sort -u > header
time cat *.ttl | grep -v '^@prefix' | cat header - | gzip > $(basename $(pwd)).ttl.gz
rm header
echo "output is $(basename $(pwd)).ttl.gz"
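To make the effect of the commands above concrete, here is a small sketch that runs the same pipeline on two throwaway files. The scratch directory, the file names one.ttl and two.ttl, and the sample triples are all made up for illustration, and gzip is left out so the result stays readable. Note that it changes into the scratch directory.

```shell
# Illustrative only: the scratch directory and sample triples are made up
workdir="$(mktemp -d)/demo"
mkdir -p "$workdir" && cd "$workdir"
printf '@prefix ex: <http://example.org/> .\nex:a ex:knows ex:b .\n' > one.ttl
printf '@prefix ex: <http://example.org/> .\n@prefix foaf: <http://xmlns.com/foaf/0.1/> .\nex:b foaf:name "B" .\n' > two.ttl

# same pipeline as above, minus gzip, captured in a variable
head -n 50 -q *.ttl | grep '^@prefix' | sort -u > header
result="$(cat *.ttl | grep -v '^@prefix' | cat header -)"
rm header
echo "$result"
```

The output should be the two distinct @prefix lines (ex: and foaf:) followed by the two triples, with the duplicate ex: declaration collapsed into one.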

Or, if you want a one-liner (formatted over multiple lines) that can pipe to S3, use:

head -n 50 -q *.ttl | grep '^@prefix' | sort -u > header && \
time cat *.ttl | grep -v '^@prefix' | cat header - | gzip | aws s3 cp - s3://<bucket>/$(basename $(pwd)).ttl.gz; \
rm -f header

...just be sure to replace the <bucket> placeholder with your S3 bucket name.

Limitations

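  1. Doesn't support spaces in the directory name, because $(basename $(pwd)) is unquoted and the shell splits the result into separate words.
  2. Only the first 50 lines of each file are scanned for @prefix lines. A @prefix declared further down is still stripped from the body by grep -v, but it never reaches the header, so it is lost. Just bump up the -n arg to head if you need more.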
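If either limitation bites, one possible workaround is sketched below. It is a variation on the commands above and not part of the original gist: the function name ttlcat_dir is hypothetical, it assumes a grep that supports -h (GNU and BSD grep both do), and it scans whole files for @prefix lines instead of only the first 50 lines. Like the original, it only looks at lines that start with @prefix.

```shell
# Hypothetical helper (not from the original gist): concatenate every *.ttl
# file in the current directory into "<dirname>.ttl.gz", keeping each
# distinct @prefix line exactly once at the top.
# The body runs in a subshell, so its variables don't leak into your shell.
ttlcat_dir() (
  header="$(mktemp)" || exit 1
  out="$(basename "$PWD").ttl.gz"   # quoted, so spaces in the directory name survive

  # collect @prefix lines from the whole of every file, de-duplicated
  grep -h '^@prefix' -- *.ttl | sort -u > "$header"

  # emit the header once, then every non-@prefix line from every file
  { cat "$header"; grep -h -v '^@prefix' -- *.ttl; } | gzip > "$out"

  rm -f "$header"
  echo "output is $out"
)
```

Run ttlcat_dir from inside the directory that holds the TTL files, the same as the original commands. The header goes to a temporary file from mktemp, so nothing extra is left behind in the working directory.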
@meliyahu

Great chief thanks.
