@seandavi
Created January 29, 2022 19:08
Load Semantic Scholar JSON into BigQuery
#!/bin/bash
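# Requires about 200 GB of free disk space.
# Steps:
#   1. download the Semantic Scholar open corpus (2022-01-01 release)
#   2. create a disposable Cloud Storage bucket
#   3. upload the corpus files to the bucket
#   4. load them into BigQuery with bq load
#   5. remove the bucket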
mkdir -p ss
# Stop here if the directory is unusable, so nothing downloads into the wrong place
cd ss || exit 1
wget https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2022-01-01/manifest.txt
wget -B https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2022-01-01/ -i manifest.txt
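# Random 16-character lowercase hex bucket name; the old hexdump | base64 pipeline could emit "=" padding, which bucket names do not allow
RANDO=$(od -An -N8 -tx1 /dev/urandom | tr -dc '0-9a-f')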
gsutil mb "gs://$RANDO"
gsutil -m cp s2* "gs://$RANDO/"
bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect --replace omicidx_etl.s2_raw "gs://$RANDO/s2-*"
gsutil rm -rf "gs://$RANDO/"
cd ..
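
The script uploads whatever wget managed to fetch, so an interrupted download would only show up later as missing rows. Below is a minimal sketch of a pre-upload check, assuming manifest.txt lists one file name per line relative to the download directory. The helper name missing_from_manifest is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Sketch: print every manifest entry that has no matching local file.
# missing_from_manifest is a hypothetical helper, not part of the original gist.
# Usage: missing_from_manifest MANIFEST_FILE DOWNLOAD_DIR
missing_from_manifest() {
    local manifest=$1 dir=$2 name
    # "|| [ -n "$name" ]" keeps a final line that lacks a trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        # Skip blank lines.
        [ -n "$name" ] || continue
        # Report the entry if the file is absent from the download directory.
        [ -f "$dir/$name" ] || printf '%s\n' "$name"
    done < "$manifest"
}
```

Run from inside the ss directory after the wget step, `missing_from_manifest manifest.txt .` prints nothing when every listed file is present. Any output names the files to fetch again before the gsutil upload.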