@waynegraham
Created June 16, 2021 17:38
Parse postdoc URIs from sitemap
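require 'nokogiri'
require 'open-uri'

# Sitemap files to scan; the first one becomes the base document.
sitemaps = [
  'https://www.clir.org/page-sitemap1.xml',
  'https://www.clir.org/page-sitemap2.xml'
]

# Fetch and parse the first sitemap.
xml = Nokogiri::XML(URI.open(sitemaps[0]))

# urls = xml.search('url')
# puts "before: #{urls.count}"

# Merge the <url> entries from the remaining sitemaps into the first <urlset>.
sitemaps.drop(1).each do |sitemap|
  s = Nokogiri::XML(URI.open(sitemap))
  url = s.search('url')
  xml.at('urlset').add_child(url)
end

# Select every <loc> whose text contains "postdoc" and print its URL.
pages = xml.search('loc:contains("postdoc")')
pages.each do |page|
  puts page.text
end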
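
The `:contains("postdoc")` selector in the script is case-sensitive and tests the whole `<loc>` text, host name included. If a looser or stricter match is wanted, one option is to post-filter the collected strings (for example `pages.map(&:text)`) in plain Ruby. The sketch below is only an illustration: `filter_by_path` is a hypothetical helper that is not part of the gist, and the `example.org` URLs are made-up sample data. It matches the keyword against the URL path only, ignores case, and drops duplicates.

```ruby
require 'uri'

# Hypothetical helper (not in the original gist): keep only URLs whose
# *path* contains `keyword`, ignoring case, with duplicates removed.
def filter_by_path(urls, keyword = 'postdoc')
  needle = keyword.downcase
  urls.map(&:strip).uniq.select do |url|
    begin
      URI.parse(url).path.to_s.downcase.include?(needle)
    rescue URI::InvalidURIError
      false # skip entries that cannot be parsed as URLs
    end
  end
end

# Made-up sample values standing in for pages.map(&:text)
sample = [
  'https://example.org/fellowships/postdoc/',
  'https://example.org/fellowships/Postdoc/applicants/',
  'https://example.org/about/',
  'https://example.org/fellowships/postdoc/'
]

puts filter_by_path(sample)
```

Matching on the path alone means a host such as `postdoc.example.org` does not count as a hit by itself; whether that is the desired behavior depends on how the site names its pages.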