andrewbattista/pull-invenio-records.md

## pull-invenio-records.md

      
Pull all records from Invenio

Use this bash script and the commands below to pull down all records from Invenio and, after a short Ruby post-processing step, save them as discrete files within uniquely named directories.
Create a file named invenio_recs.sh with these contents:
#!/bin/bash

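#create directory for records (no error if it already exists)
mkdir -p "$1"

#fetch the most recent record to find the highest record id
curl_resp1=$(curl -k -X GET -H "Content-Type: application/json" -H "Accept: application/json" "https://invenio-test.rc.it.nyu.edu/api/records/?sort=mostrecent&size=1")
#take the id from the 8th comma-separated field; echo is left unquoted so the response collapses to one line
id_rec=$(echo $curl_resp1 | awk -F, '{ print $8 }' | awk -F: '{ print $2 }')
#keep only the digits of that field so ids of any length work
id_rec_num=$(echo "$id_rec" | tr -cd '0-9')
i=1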
while [ "$i" -le "$id_rec_num" ]
do
  #check if record exists
  status_code=$(curl -k --write-out '%{http_code}' --silent --output /dev/null "https://invenio-test.rc.it.nyu.edu/api/records/$i")

  #if rate limited, wait a minute and retry the same record
  if [[ $status_code == 429 ]]; then
    sleep 60
    continue
  fi
  if [[ $status_code == 200 ]]; then
    #if record exists and is not deleted (i.e. has metadata), save it as json
    curl_resp=$(curl -k -X GET -H "Content-Type: application/json" -H "Accept: application/json" "https://invenio-test.rc.it.nyu.edu/api/records/$i?prettyprint=1")
    if ! [[ "$curl_resp" = *"metadata\": {}"* ]]; then
      if ! [[ "$curl_resp" = *"message"* ]]; then
        #quote the response so the pretty-printed line breaks are kept
        echo "$curl_resp" > "$1/record_$i.json"
      else
        echo "$curl_resp"
      fi
    fi
  fi
  i=$((i + 1))
  echo $i
done
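
The script sleeps for 60 seconds whenever the server answers with status 429 (rate limited). As a rough planning aid, the Ruby sketch below estimates how far apart requests need to be spaced, and how long a full pull might take, under the limits this document states (1000 requests per hour and 30 per minute from one IP address). The function names and the figure of 500 record ids are hypothetical, and the two-requests-per-id figure is an assumption based on the script making one status check and one fetch per id.

```ruby
# Rate limits as stated in this document (per IP address).
PER_HOUR = 1000
PER_MINUTE = 30

# Smallest gap between requests (in seconds) that satisfies both limits.
def min_interval_seconds(per_hour: PER_HOUR, per_minute: PER_MINUTE)
  [3600.0 / per_hour, 60.0 / per_minute].max
end

# Rough worst-case duration of a pull in hours, assuming two requests
# per record id (one status check plus one fetch).
def estimated_hours(record_ids, requests_per_id: 2)
  record_ids * requests_per_id * min_interval_seconds / 3600.0
end

puts min_interval_seconds   # seconds to leave between requests
puts estimated_hours(500)   # hypothetical pull of 500 record ids
```

With these numbers the hourly limit is the binding one: 1000 requests per hour averages out to fewer than 17 per minute, which is already under the cap of 30 per minute.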


Next, log on to the VPN, navigate to the directory where you saved the script, and make it executable:
chmod 775 invenio_recs.sh

Then run the script, passing the output directory as its only argument:
bash ./invenio_recs.sh /Users/staff/Desktop/inveniopull

The path argument is the directory where you want the records to go. Note that the app throttles downloads: you are limited to 1000 requests per hour and 30 requests per minute from the same IP address, so a large pull has to wait out those limits.

After downloading, you should rename each file according to the Invenio ID and then pretty print the records before committing them. For now, this is a two-step process:
Navigate to the directory where all of the records are and concatenate all individual JSONs into a single file with jq:
find . -name '*.json' -exec cat '{}' + | jq -s '.' > /Users/staff/Desktop/newjsonsinglefile.json
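
If jq is not available, roughly the same merge can be done in Ruby. This is a minimal sketch, not the method used above: `merge_json_files` is a hypothetical helper name, the paths in the usage comment are placeholders, and it assumes each .json file holds exactly one record.

```ruby
require 'json'

# Read every .json file under `dir` (recursively) and return the parsed
# records as one array, similar in effect to piping the files to `jq -s '.'`.
def merge_json_files(dir)
  Dir.glob(File.join(dir, '**', '*.json')).sort.map do |path|
    JSON.parse(File.read(path))
  end
end

# Hypothetical usage; replace the placeholder paths with real ones.
# records = merge_json_files('/path/to/records')
# File.write('/path/to/merged.json', JSON.pretty_generate(records))
```

As with the jq command, write the merged file somewhere outside the records directory so that a second run does not pick it up as if it were a record.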

Establish a Ruby session with irb and paste in the following:
require 'json'

irb_context.echo = false

allrecords_file = File.read('/Users/staff/Desktop/newjsonsinglefile.json')

## The line above reads the single .json file that contains all of the records (the file written by the jq command). Make sure to include the full path

parsed_file = JSON.parse(allrecords_file)

## JSON.parse turns the single file into an array of records; the loop below writes each record out as its own JSON file

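require 'fileutils'

## The base folder that will hold one subfolder per record
full_folder = "/Users/andrewbattista/Desktop/inveniopull/revised"

parsed_file.each do |record|
  ## Each record gets a folder named after its invenio ID
  folder_name = record['id'].to_s

  FileUtils.mkdir_p(File.join(full_folder, folder_name))

  ## Write the record, pretty printed, as invenio.json inside that folder
  File.open(File.join(full_folder, folder_name, "invenio.json"), "w") do |f|
    f.write(JSON.pretty_generate(record))
  end
end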
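
Before committing, it can be worth checking that every record in the merged file ended up with its own folder. The sketch below is one way to do that. `missing_record_ids` is a hypothetical helper, not part of the original workflow, and it assumes the `<base folder>/<id>/invenio.json` layout produced by the loop above.

```ruby
require 'json'

# Return the ids of records that have no <base_dir>/<id>/invenio.json
# file on disk. `records` is an array of hashes that each carry an 'id'.
def missing_record_ids(records, base_dir)
  records.map { |record| record['id'].to_s }.reject do |id|
    File.exist?(File.join(base_dir, id, 'invenio.json'))
  end
end

# Hypothetical usage inside the same irb session; the path is a placeholder.
# puts missing_record_ids(parsed_file, '/path/to/inveniopull/revised')
```

An empty result means every record was written. Any ids that are listed can be re-run through the loop on their own.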