So your boss just jumped in, because he remembered that things need to be GDPR-compliant within the next few hours...
And of course he forgot to send the necessary mailing to your newsletter subscribers.
And of course it's all properly documented in a haphazard mix of Excel files (each of which has a different layout),
text files, vCards and the like...
So now you need to extract those email addresses from all those files, because doing it manually will never finish in time.
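You will need a Linux shell for this, with "libreoffice" and "rename" installed. The rename used below is the Perl one, which takes an expression such as 's/ /_/g'; the util-linux tool of the same name will not understand it.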
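Before starting, it may be worth checking that both tools are actually on your PATH. This is a minimal sketch: the function name "missing_tools" is made up for this example, and it only relies on the standard `command -v` lookup.

```shell
#!/bin/sh
# Print the name of every given command that is NOT found on the PATH.
# "missing_tools" is a hypothetical helper name, not part of any package.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Prints nothing if both tools are installed.
missing_tools libreoffice rename
```

Anything it prints still needs to be installed through your distribution's package manager.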
- create two directories: "infiles" and "outfiles"
- copy all the Excel files to "infiles"
- copy the vCards, CSV and text files to "outfiles" directly
- now rename those files to get rid of spaces in the filenames:
#> find infiles/ -depth -name "* *" -execdir rename 's/ /_/g' "{}" \;
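If the Perl rename is not available, the same renaming can probably be done with plain find and mv. The following is only a sketch under a few assumptions: "underscore_names" is a made-up function name, filenames contain no newlines, and no file with the underscored name already exists (mv would overwrite it).

```shell
#!/bin/sh
# Replace spaces with underscores in every file and directory name below "$1".
# "underscore_names" is a hypothetical helper name.
underscore_names() {
    # -depth lists a directory's contents before the directory itself,
    # so each path is still valid at the moment it gets renamed.
    find "$1" -depth -name '* *' | while IFS= read -r path; do
        dir=$(dirname "$path")
        base=$(basename "$path")
        new=$(printf '%s' "$base" | tr ' ' '_')
        mv -- "$path" "$dir/$new"
    done
}
```

Calling `underscore_names infiles` would then stand in for the find/rename line.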
- convert the Excel files to CSV:
#> find infiles/ -type f \( -name '*.xls' -o -name '*.xlsx' \) -exec libreoffice --headless --convert-to csv {} --outdir outfiles/ \;
Don't worry if you see messages about empty files. Those are the empty sheets in the workbook.
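With many workbooks it is easy to miss one that failed to convert. A quick check could look like the sketch below. It assumes that each converted file keeps the spreadsheet's base name (report.xlsx becomes report.csv in the output directory); "unconverted" is a made-up function name.

```shell
#!/bin/sh
# List spreadsheets under "$1" that have no matching .csv file in "$2".
# "unconverted" is a hypothetical helper name; it assumes the converted file
# keeps the spreadsheet's base name (report.xlsx -> report.csv).
unconverted() {
    find "$1" -type f \( -name '*.xls' -o -name '*.xlsx' \) | while IFS= read -r sheet; do
        name=$(basename "$sheet")
        [ -f "$2/${name%.*}.csv" ] || printf '%s\n' "$sheet"
    done
}
```

Running `unconverted infiles outfiles` should print nothing if every workbook made it across.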
- Now get those addresses out of there:
#> touch unsorted.txt
#> find outfiles/ -type f -exec grep -i -o '[A-Z0-9._%+-]\+@[A-Z0-9.-]\+\.[A-Z]\{2,4\}' {} \; | tee -a unsorted.txt
#> sort -fu unsorted.txt -o gotcha.txt
You now have a list of unique email addresses in gotcha.txt, one address per line.
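The extraction can also be sketched as a single pipeline without the intermediate unsorted.txt. This variant is an assumption-laden alternative, not the method above: "extract_emails" is a made-up function name, it relies on GNU grep's -r, -h and -o options, and it lowercases every address before sorting instead of using a case-insensitive sort.

```shell
#!/bin/sh
# Print every unique email address found in the files below "$1", lowercased.
# "extract_emails" is a hypothetical helper name; -r, -h and -o are GNU grep options.
extract_emails() {
    grep -rhio '[A-Z0-9._%+-]\+@[A-Z0-9.-]\+\.[A-Z]\{2,4\}' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | LC_ALL=C sort -u
}
```

Something like `extract_emails outfiles > gotcha.txt` would then produce the same kind of list, with all addresses in lowercase.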
A word of caution: this will not catch email addresses with ',' or '"' in them. If you're worried about that, you may want to use a more complex regex, as described at...
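One more limitation of the pattern used in the extraction step is easy to check for yourself: the final `\{2,4\}` accepts at most four letters, so addresses under longer top-level domains get cut off. The sketch below compares it with a relaxed variant; the variable names and the relaxed pattern are assumptions made for this example.

```shell
#!/bin/sh
# The pattern from the extraction step: the last part allows 2 to 4 letters.
strict='[A-Z0-9._%+-]\+@[A-Z0-9.-]\+\.[A-Z]\{2,4\}'
# A relaxed variant (an assumption of this sketch): 2 or more letters.
relaxed='[A-Z0-9._%+-]\+@[A-Z0-9.-]\+\.[A-Z]\{2,\}'

printf 'info@example.museum\n' | grep -io "$strict"    # prints info@example.muse
printf 'info@example.museum\n' | grep -io "$relaxed"   # prints info@example.museum
```

If your subscriber list might contain such domains, swapping in the relaxed pattern is probably the safer choice.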