@blahah /demo.sh
Last active Apr 11, 2016

UNIX one-liner to split a EuropePMC XML archive into a stream of articles
sed '1d;$d' | sed 's/<\/article>/<\/article>♛/g' | tr -d '\n' | tr '♛' '\n' | less -S
# e.g. download every EuropePMC archive and convert them to a stream of all articles
curl --silent http://europepmc.org/ftp/oa/ | \
grep -o 'PMC[0-9]*_PMC[0-9]*\.xml\.gz' | \
sort | uniq | \
sed 's/^/http:\/\/europepmc.org\/ftp\/oa\//' | \
xargs -n 1 curl --silent | gunzip | grep -vP '<\/?articles>' | \
sed 's/<\/article>/<\/article>♛/g' | tr -d '\n' | tr '♛' '\n' | less -S
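
The splitting step can also be sketched without a placeholder character, which may help where `tr` treats a multibyte character as several separate bytes. This is only an illustration, not part of the original script: `split_articles` is a hypothetical helper name, and it assumes the opening and closing `<articles>` wrapper tags sit on the first and last lines of the input, as the `sed '1d;$d'` step above does.

```shell
# Hypothetical helper: read one archive on stdin, print one <article> per line.
split_articles() {
  # Drop the first and last lines (assumed to hold the <articles> wrapper tags).
  sed '1d;$d' |
    # Print every input line without its newline, adding a newline
    # only after each closing </article> tag.
    awk '{ gsub(/<\/article>/, "</article>\n"); printf "%s", $0 }'
}
```

For example, `gunzip -c archive.xml.gz | split_articles | wc -l` should count the articles in a single archive, where `archive.xml.gz` stands in for any one downloaded file.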