@blahah
Last active April 11, 2016 16:25
UNIX one-liner to split a EuropePMC XML archive into a stream of articles
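# reads the archive XML on stdin; \001 is a temporary one-byte end-of-article marker (not a legal XML 1.0 character)
sed '1d;$d' | sed "s/<\/article>/&$(printf '\001')/g" | tr -d '\n' | tr '\001' '\n' | less -S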
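The splitting step can be tried on a tiny input before pointing it at a real archive. The following is a minimal sketch, not the gist author's code: the function name `split_articles` and the sample XML are made up for illustration. It uses the single byte `\001` as the temporary marker, on the assumption that some `tr` implementations work byte by byte and could split a multi-byte marker.

```shell
# split_articles: hypothetical helper name, not part of the original gist.
# Reads a EuropePMC-style archive on stdin and writes one <article> per line.
split_articles() {
  # \001 is not a legal XML 1.0 character, so it cannot collide with article text.
  sep=$(printf '\001')
  sed '1d;$d' |                      # drop the first and last line (the <articles> wrapper)
    sed "s/<\/article>/&${sep}/g" |  # append the marker after every closing tag
    tr -d '\n' |                     # join the whole stream into a single line
    tr '\001' '\n'                   # turn each marker into a line break
}

# Made-up sample input, for illustration only.
printf '%s\n' '<articles>' \
  '<article>' '<title>A</title>' '</article>' \
  '<article>' '<title>B</title>' '</article>' \
  '</articles>' | split_articles
```

Each output line then holds one complete article, so line-oriented tools such as `wc -l`, `head` or `grep` can work per article.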
# e.g. download every EuropePMC archive and convert them into a single stream of articles
# (\001 is the temporary end-of-article marker: one byte, so tr handles it portably, and not a legal XML 1.0 character)
curl --silent http://europepmc.org/ftp/oa/ | \
grep -o 'PMC[0-9]*_PMC[0-9]*\.xml\.gz' | \
sort -u | \
sed 's/^/http:\/\/europepmc.org\/ftp\/oa\//' | \
xargs -n 1 curl --silent | gunzip | grep -vE '</?articles>' | \
sed "s/<\/article>/&$(printf '\001')/g" | tr -d '\n' | tr '\001' '\n' | less -S
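The URL-building part of the download example can be checked offline by feeding it a saved or hand-written directory listing. This is a rough sketch under stated assumptions: `archive_urls` is a hypothetical helper name, and the listing is assumed to be HTML in which each archive name may appear more than once (for instance in both the `href` and the link text), which is why duplicates are removed.

```shell
# archive_urls: hypothetical helper, not part of the original gist.
# Reads an HTML directory listing on stdin and prints one absolute URL
# per distinct archive file name.
archive_urls() {
  base='http://europepmc.org/ftp/oa/'        # base URL taken from the one-liner above
  grep -o 'PMC[0-9]*_PMC[0-9]*\.xml\.gz' |   # keep only the archive file names
    sort -u |                                # drop repeated names
    sed "s|^|${base}|"                       # prefix each name with the base URL
}

# Made-up listing fragment, for illustration only.
printf '%s\n' \
  '<a href="PMC1_PMC2.xml.gz">PMC1_PMC2.xml.gz</a>' \
  '<a href="PMC3_PMC4.xml.gz">PMC3_PMC4.xml.gz</a>' | archive_urls
```

Piping `curl --silent http://europepmc.org/ftp/oa/` into `archive_urls` should give the same list of URLs that the full pipeline hands to `xargs`.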