Skip to content

Instantly share code, notes, and snippets.

@sneumann
Created October 30, 2023 10:31
Show Gist options
  • Save sneumann/6cd525f83bc7c8d335169fa8dab6d552 to your computer and use it in GitHub Desktop.
Save sneumann/6cd525f83bc7c8d335169fa8dab6d552 to your computer and use it in GitHub Desktop.
Chemotion2JsonLD-DataDump
#!/bin/bash
BASEURL=https://www.chemotion-repository.net/api/v1/public/metadata/publications
LIMIT=500
MAXPARALLEL=25
MAXRECORDS=99999999
MAXRECORDS=2000
# There are currently ~6000 Samples, ~25000 Containers and ~5000 Reactions
#if /bin/false ; then
for OBJECTTYPE in Sample Container Reaction ; do
echo "Getting index for $OBJECTTYPE"
OFFSET=0
echo -n > all.$OBJECTTYPE.txt
while [ $OFFSET -lt $MAXRECORDS ] && curl -s -X GET --header 'Accept: application/json' \
"$BASEURL?type=$OBJECTTYPE&offset=$OFFSET&limit=$LIMIT" |\
jq "if (.publications | length > 0) then .publications[] else halt_error end" >>all.$OBJECTTYPE.txt ; do
echo -n "$OFFSET "
OFFSET=$((OFFSET+LIMIT))
done
echo "Now Downloading all $OBJECTTYPE"
## Download the individual records
cat all.$OBJECTTYPE.txt | tr -d \" |\
parallel --joblog /tmp/parallel.log wget -q --content-on-error --directory-prefix=./json-ld
done
#fi
#if /bin/false ; then
## Temporary hack: delete any broken downloads (wget content-on-error)
#find json-ld/ -type f |\
xargs grep 'sorry, but something went wrong .500.' |\
cut -d: -f 1 |\
xargs --no-run-if-empty rm
## concatenate Download the individual records
echo "[" >chemotion-datadump.json
i=0
find ./json-ld -type f |\
while read F ; do
if [ $i != "0" ] ; then
echo "," >>chemotion-datadump.json
i=1
fi
cat "$F" | jq . >>chemotion-datadump.json || echo "Error in $F"
done
echo "]" >>chemotion-datadump.json
#fi
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment