Skip to content

Instantly share code, notes, and snippets.

@markgraf
Last active May 20, 2018 12:07
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save markgraf/eca353dfd1253d0daf1002d1dffbf23c to your computer and use it in GitHub Desktop.
Save markgraf/eca353dfd1253d0daf1002d1dffbf23c to your computer and use it in GitHub Desktop.
So you need to extract email adresses from hundreds of excel-files and vcards?

Extracting email addresses

So your boss just jumped in, because he remembered that things need to be GDPR-compliant within the next few hours...
And of course he forgot to send the necessary mailing to your newsletter-subscribers.
And of course it's all properly documented in a haphazard mix of excel-files (each of which has a different layout), text-files, v-cards and the like...

So now you need to extract those email addresses from all those files, because doing it manually will never finish in time.

Steps to be taken

You will need a linux shell for this. You need "libreoffice" and "rename" installed.

  • create two directories: "infiles" and "outfiles"
  • copy all the excels to "infiles"
  • copy the vcards, csv and txt files to outfiles directly
  • now rename those files to get rid of spaces in the filenames.
    #> find infiles/ -depth -name "* *" -execdir rename 's/ /_/g' "{}" \;
    
  • convert the excel-files to csv:
    #> find infiles/ -type f \( -name '*.xls' -o -name '*.xlsx' \) -exec libreoffice --headless --convert-to csv {} --outdir outfiles/ \;
    
    Don't worry if you see messages about empty files. Those are the empty sheets in the workbook...
  • Now get those addresses out of there:
    #> touch unsorted.txt
    #> find outfiles/ -type f -exec grep -i -o '[A-Z0-9._%+-]\+@[A-Z0-9.-]\+\.[A-Z]\{2,4\}' {} \; | tee -a unsorted.txt
    #> sort -fu unsorted.txt -o gotcha.txt
    

The result

You now have a list of unique email-addresses in gotcha.txt, one address per line.

A word of caution: this will not catch email-adresses with ',' or '"' in them. If you're worried about that, you may want to use a more complex regex, as described at...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment