@henkm
Created August 16, 2013 15:31
# start with empty list
list = []
# split pdf file into many text files (one for each page)
# `docsplit text jaarboek.pdf --no-ocr --pages all`
# define pattern for email address (TLD of two or more letters, so longer TLDs also match)
regex = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i
# loop through all the pages
(1..705).each do |page|
  # read content of page (File.read opens and closes the file for us)
  content_of_page = File.read("jaarboek_#{page}.txt")
  # find addresses on page
  addresses_found_on_page = content_of_page.scan(regex)
  # append results to the list
  list.concat(addresses_found_on_page)
end
# write list to txt-file
File.open("adressen.txt", 'w') {|f| f.write(list.join(", ")) }
puts "#{list.count} addresses found"