@dshorthouse
Last active August 4, 2020 17:55
Scrape Zootaxa Issues for ORCID IDs
#!/usr/bin/env ruby
# encoding: utf-8

require 'rest_client'
require 'csv'
require 'nokogiri'
require 'colorize'

ORCID_PREFIX = "https://orcid.org/"

# Pages of the Zootaxa issue archive to walk through
page_range = 1..10

# Fetch a page and return the href of every anchor matched by the XPath
def get_doc_urls(url, xpath)
  html = RestClient.get(url).body
  doc = Nokogiri::HTML.parse(html)
  doc.xpath(xpath).map { |a| a["href"] }.compact
end

CSV.open("orcids.csv", "w") do |csv|
  page_range.each do |i|
    issue_urls = get_doc_urls("https://www.mapress.com/j/zt/issue/archive?issuesPage=#{i}", "//*[@id=\"issues\"]//h4/a")
    issue_urls.each do |issue_url|
      article_urls = get_doc_urls(issue_url, "//*[@class=\"tocTitle\"]/a")
      article_urls.each do |article_url|
        orcid_urls = get_doc_urls(article_url, "//a[contains(@href, 'orcid')]")
        orcid_urls.each do |orcid_url|
          # Skip links without the https://orcid.org/ prefix (the original
          # relied on sub! returning nil for these)
          next unless orcid_url.include?(ORCID_PREFIX)

          orcid = orcid_url.sub(ORCID_PREFIX, "")
          csv << [orcid]
          puts orcid.green
        end
      end
    end
  end
end
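
The script writes whatever follows the `https://orcid.org/` prefix, so malformed links can end up in `orcids.csv`. A possible follow-up, sketched here as an assumption rather than as part of the original gist, is to check each value against the ORCID check digit (ISO 7064 MOD 11-2) before writing it. The helper name `valid_orcid?` is hypothetical.

```ruby
# Hypothetical helper, not part of the original gist.
# Returns true when the string has the 0000-0000-0000-000X shape and its
# final character matches the ISO 7064 MOD 11-2 check digit.
def valid_orcid?(orcid)
  return false unless orcid.is_a?(String)
  return false unless orcid.match?(/\A\d{4}-\d{4}-\d{4}-\d{3}[\dX]\z/)

  chars = orcid.delete("-").chars
  # Fold the first 15 digits: add each digit to the running total, then double
  total = chars[0, 15].reduce(0) { |sum, c| (sum + c.to_i) * 2 }
  result = (12 - total % 11) % 11
  expected = result == 10 ? "X" : result.to_s
  chars.last == expected
end
```

In the loop above, this could be used as `next unless valid_orcid?(orcid)` just before `csv << [orcid]`, at the cost of silently dropping any identifier that fails the check.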