@andrewbattista
Last active September 17, 2020 17:09
Pull all records from invenio and pretty print them

Pull all records from invenio

Use this bash script to pull down all records from Invenio as individual JSON files, then use the steps that follow to pretty print them and save each one in its own uniquely named directory.

Create a file named invenio_recs.sh with these contents:

#!/bin/bash

# Create the output directory for the records (first argument to the script)
mkdir -p "$1"

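 # Fetch the most recent record; its ID is the upper bound for the loop below
 curl_resp1=`curl -k -X GET -H "Content-Type: application/json" -H "Accept: application/json" "https://invenio-test.rc.it.nyu.edu/api/records/?sort=mostrecent&size=1"`
 id_rec=`echo $curl_resp1 | awk -F, '{ print $8 }'| awk -F: '{ print $2 }'`
 # Strip everything but the digits so that IDs of any length work
 id_rec_num=${id_rec//[^0-9]/}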
 i=1
while [  $i -le $id_rec_num ]  
do
 # Check whether the record exists
  status_code=$(curl -k --write-out '%{http_code}' --silent --output /dev/null "https://invenio-test.rc.it.nyu.edu/api/records/$i")

  if [[ $status_code == 429 ]] ; then
   # Rate limited: wait a minute, then retry the same record instead of skipping it
   sleep 60
   continue
  fi
  if [[ $status_code == 200 ]] ; then
   #if record exists and is not deleted e.g. has metadata save it as json
   curl_resp=`curl -k -X GET -H "Content-Type: application/json" -H "Accept: application/json"  "https://invenio-test.rc.it.nyu.edu/api/records/$i?prettyprint=1"`
   if ! [[  "$curl_resp" = *"metadata\": {}"* ]]; then
    if ! [[  "$curl_resp" = *"message"* ]]; then
      printf '%s\n' "$curl_resp" > "$1/record_$i.json"
    else
      echo "$curl_resp"
    fi
   fi
  fi
  let i=i+1
  echo $i
done
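
The 429 check in the script reflects the service's rate limits, which this guide gives as 1000 requests per hour and 30 per minute from one IP address. As a rough sketch of the arithmetic (written in Ruby, which this guide uses later), the hypothetical helper `min_delay_seconds` below estimates the smallest fixed pause between requests that respects both figures. It is an illustration only and is not part of the script.

```ruby
# Hypothetical helper: smallest fixed delay (in seconds) between requests
# that stays within every limit in the list.
# Each limit is a pair: [max_requests, window_in_seconds].
def min_delay_seconds(limits)
  limits.map { |max_requests, window| window.to_f / max_requests }.max
end

# The two limits quoted in this guide: 1000 per hour and 30 per minute.
LIMITS = [[1000, 3600], [30, 60]].freeze

delay = min_delay_seconds(LIMITS)
puts format('Wait at least %.1f seconds between requests', delay)
```

The hourly limit is the stricter of the two. Because the script makes up to two requests per record (a status check and then the fetch), a complete pull of many records can take a long time.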

Next, log on to the VPN and navigate to the directory where you want to run the script and download the records. Make the script executable:

chmod 775 invenio_recs.sh

Then run it, passing the output directory as the argument:

bash ./invenio_recs.sh /Users/staff/Desktop/inveniopull

The path is where you want the records to go. Note that the app throttles downloads: you are limited to 1000 requests per hour and 30 per minute from the same IP address, so you'll need to wait the appropriate amount of time when you hit a limit. After downloading, you should rename each file according to its Invenio ID and pretty print the records before committing them. For now, this is a two-step process:

Navigate to the directory where all of the records are and concatenate the individual JSON files into a single file with jq:

find . -name '*.json' -exec cat '{}' + | jq -s '.' > /Users/staff/Desktop/newjsonsinglefile.json
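
If jq is not available, the same concatenation can be approximated in Ruby. The sketch below is an assumption-laden alternative, not the method used above: `slurp_json` is a hypothetical helper name, and it assumes every `.json` file under the directory holds exactly one JSON document.

```ruby
require 'json'

# Hypothetical helper: parse every .json file under dir (including
# subdirectories) and return the documents as one array, similar to what
# `jq -s '.'` produces from the concatenated files.
def slurp_json(dir)
  Dir.glob(File.join(dir, '**', '*.json')).sort.map do |path|
    JSON.parse(File.read(path))
  end
end

# Example use (paths are placeholders):
#   records = slurp_json('/path/to/inveniopull')
#   File.write('/path/to/combined.json', JSON.pretty_generate(records))
```

Files are read in sorted path order, which may differ from the order `find` returns them in; the order does not matter for the splitting step that follows.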

Establish a Ruby session with irb and paste in the following:

require 'json'

irb_context.echo = false

allrecords_file = File.read('/Users/staff/Desktop/newjsonsinglefile.json')

## The path above points to the single .json file that contains all of the records (the output of the jq command). Make sure to include the full path

parsed_file = JSON.parse(allrecords_file)

## JSON.parse turns the file contents into a Ruby array with one hash per record; the loop below writes each record to its own pretty-printed file

parsed_file.each do |record|
  folder_name = record['id']
  
  full_folder = "/Users/andrewbattista/Desktop/inveniopull/revised"

  `mkdir -p #{full_folder}/#{folder_name}`


  File.open("#{full_folder}/#{folder_name}/invenio.json", "w") do |f|


    f.write(JSON.pretty_generate(record))
  end

end
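
After the loop finishes, it can be worth confirming that every record in the combined file ended up in its own folder. The following is a minimal sketch of such a check. `missing_record_ids` is a hypothetical helper, and it assumes the layout produced above: one folder per record ID containing a file named `invenio.json`.

```ruby
# Hypothetical check: return the IDs (as strings) of records that have no
# <output_root>/<id>/invenio.json file on disk.
def missing_record_ids(records, output_root)
  records
    .map { |record| record['id'].to_s }
    .reject { |id| File.exist?(File.join(output_root, id, 'invenio.json')) }
end

# Example use in the same irb session (parsed_file and full_folder come
# from the code above):
#   missing = missing_record_ids(parsed_file, full_folder)
#   puts missing.empty? ? 'All records written' : "Missing: #{missing.join(', ')}"
```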