markgraf/extract_email.md

## extract_email.md

      
    Raw
  

              extract_email.md
            
          
    Extracting email addresses

So your boss just jumped in, because he remembered that things need to be GDPR-compliant within the next few hours...

And of course he forgot to send the necessary mailing to your newsletter-subscribers.

And of course it's all properly documented in a haphazard mix of excel-files (each of which has a different layout),
text-files, v-cards and the like...
So now you need to extract those email addresses from all those files, because doing it manually will never finish in time.
Steps to be taken

You will need a linux shell for this.
You need "libreoffice" and "rename" installed.

create two directories: "infiles" and "outfiles"
copy all the excels to "infiles"
copy the vcards, csv and txt files to outfiles directly
now rename those files to get rid of spaces in the filenames.
#> find infiles/ -depth -name "* *" -execdir rename 's/ /_/g' "{}" \;


convert the excel-files to csv:
#> find infiles/ -type f \( -name '*.xls' -o -name '*.xlsx' \) -exec libreoffice --headless --convert-to csv {} --outdir outfiles/ \;

Don't worry if you see messages about empty files. Those are the empty sheets in the workbook...
Now get those addresses out of there:
#> touch unsorted.txt
#> find outfiles/ -type f -exec grep -i -o '[A-Z0-9._%+-]\+@[A-Z0-9.-]\+\.[A-Z]\{2,4\}' {} \; | tee -a unsorted.txt
#> sort -fu unsorted.txt -o gotcha.txt


The result

You now have a list of unique email-addresses in gotcha.txt, one address per line.
A word of caution: this will not catch email-adresses with ',' or '"' in them.
If you're worried about that, you may want to use a more complex regex,
as described at...

https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression
http://emailregex.com/
https://www.w3.org/TR/html5/forms.html#valid-e-mail-address