@ravivmg
Created April 21, 2011 13:44
#!/bin/bash
# Usage: pass the years you want to import as arguments (e.g. sh inputPatents.sh 2001 2002).
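# A minimal sketch, not part of the original script: check_years is a
# hypothetical helper that rejects an empty argument list or any argument
# that is not a four-digit year. That every year is written with four digits
# is an assumption based on the usage example, not something the source states.
check_years() {
  if [ "$#" -eq 0 ]; then
    echo "usage: inputPatents.sh YEAR [YEAR ...]" >&2
    return 1
  fi
  for year in "$@"; do
    case "$year" in
      [0-9][0-9][0-9][0-9]) ;;   # looks like a year, keep checking the rest
      *) echo "not a four-digit year: $year" >&2; return 1 ;;
    esac
  done
}
# To enforce the check before anything is downloaded, uncomment:
# check_years "$@" || exit 1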
wget -r -H -l1 -np -nd -P google --reject=txt,css,html,js --wait=3 http://www.google.com/googlebooks/uspto-patents-applications-biblio.html
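# Remove page assets that slipped past --reject. The globs must stay unquoted to expand; -f ignores missing files.
rm -f google/*.txt google/*.css google/*.js google/*.html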
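# Strip anything wget left after ".zip" in a file name; skip names that are already clean.
for file in google/*.zip* ; do [ -e "$file" ] || continue ; target="${file%.zip*}.zip" ; [ "$file" = "$target" ] || mv "$file" "$target" ; done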
for var in "$@"
do
mkdir -p "google/$var"
echo "*ab$var*.xml"
unzip "google/*ab$var*.zip" -d google
# Split the concatenated dump into one file per XML declaration (output_0.xml, output_1.xml, ...).
awk -v outputname="google/$var/output_" '/<\?xml version="1\.0"( encoding="UTF-8")?\?>/{if (n) close(output); output=outputname n++ ".xml"} n {print >> output }' google/*ab"$var"*.xml
"$EXIST_HOME/bin/client.sh" -ouri=xmldb:exist:// -m "/db/patents/applications/$var" -p "$(pwd)/google/$var"
rm -rf "google/$var"
rm -f google/*.xml google/*.txt google/*.html
done