@jswrenn
Last active June 2, 2021 16:35
Monasticon Scraper
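# Start from an empty output file.
: > database-raw.json

# Walk the 58 browse pages (0-57), follow every community link, and append
# one JSON record per community to database-raw.json.
seq 0 57 \
| xargs -I{} printf 'https://arts.st-andrews.ac.uk/monasticmatrix/monasticon/browse?page=%s\n' {} \
| while IFS= read -r url ; do
    # Extract the relative link of each community listed on this browse page.
    xidel -se '//*[@class="view-content"]//a/@href' "$url"
  done \
| xargs -I{} printf 'https://arts.st-andrews.ac.uk%s\n' {} \
| while IFS= read -r url ; do
    # Scrape the community name plus every title/value row on its page.
    xidel --ignore-namespaces --output-format json-wrapped -se '
      [
        {"Name"://span[@class="monasticon-comm-title"]},
        //*[contains(@class,"monasticon-rows")]/{
          ./*[@class="monasticon-title-column"]/strong :
          ./*[@class="monasticon-value-column"]/inner-html()
        }
      ]' "$url" \
    | jq -rc --arg URL "$url" '.[0] | add + {"URL": $URL}' >> database-raw.json \
    || echo "jq error: $url" >&2
    # Emit one line per processed page so pv can count progress.
    printf '%s\n' "$url"
  done | pv -l -s 2873 >/dev/null  # 2873 is the expected number of community pages

# Drop duplicate records, then wrap the remaining lines in a single JSON array.
sort -u database-raw.json > database-unique.json
jq -sc '.' database-unique.json > database-array.json

# Finally, upload database-array.json to this tool:
# https://www.convertcsv.com/json-to-csv.htm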
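
As an alternative to the web converter mentioned above, the CSV could also be produced locally with jq. The following is a minimal sketch and not part of the original gist: the function name json_array_to_csv and the output file database.csv are hypothetical, and it assumes the input is a JSON array of flat objects like the records written by the scraper. Columns are the sorted union of all keys, and fields missing from a record become empty cells.

```shell
# Hypothetical helper (not in the original gist): convert a JSON array of
# flat objects into CSV on stdout. Requires jq.
json_array_to_csv() {
  jq -r '
    # Column list: sorted union of every key that appears in any record.
    (map(keys_unsorted) | add // [] | unique) as $cols
    # Header row first, then one row per record in column order.
    | $cols,
      (.[] as $row
       | $cols
       | map($row[.]
             # @csv only accepts scalars, so serialise nested values as JSON text.
             | if type == "array" or type == "object" then tojson else . end))
    | @csv
  ' "$1"
}

# Assumed usage, once database-array.json exists:
#   json_array_to_csv database-array.json > database.csv
```

Because records for different communities may carry different fields, taking the union of keys keeps every column rather than only those present in the first record.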