Convert a list of German last names from PDF to plain text
# Using pdftotext we will convert an extensive list of German last names from PDF to plain text
# Deutscher Familiennamenatlas (DFA) is available from https://www.namenforschung.net
#
# depends on pdftotext from poppler package (available via Homebrew)
# currently 54152 names (May 2021)
curl https://www.namenforschung.net/fileadmin/user_upload/dfa/Inhaltsverzeichnisse_etc/Index_Band_I-V_Gesamt_Stand_September_2016.pdf > temp.pdf
pdftotext temp.pdf
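# strip the volume/page references after each name, then drop page titles,
# lines containing digits (page numbers), empty lines and form-feed lines
# ($'\f' is ANSI-C quoting for a literal form feed; needs bash or zsh)
gsed -E 's/ [IV]+:.*//' temp.txt |\
ggrep -v 'Deutscher Familiennamenatlas' |\
ggrep -v '[0-9]' |\
ggrep -v '^$' |\
ggrep -v $'\f' > dfa.txt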
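The filter chain can be tried without downloading the PDF. The following is a minimal sketch, not part of the original script: `clean_names` is a hypothetical helper name, the sample input is invented to imitate what `pdftotext` might emit (it is not real DFA data), and plain `sed`/`grep` stand in for `gsed`/`ggrep` so it also runs where the GNU tools are the default.

```shell
#!/bin/sh
# Hypothetical helper: applies the same filters as the pipeline above to stdin.
clean_names() {
  sed -E 's/ [IV]+:.*//' |           # cut " I: ...", " IV: ..." references
    grep -v 'Deutscher Familiennamenatlas' |  # drop page titles
    grep -v '[0-9]' |                # drop lines that still contain digits
    grep -v '^$' |                   # drop empty lines
    grep -v "$(printf '\f')"         # drop lines containing a form feed
}

# Invented sample: a title line, two index entries, an empty line,
# a bare page number and a line starting with a form feed.
sample_output=$(printf 'Deutscher Familiennamenatlas Index\nAbel I: 12, 34\n\n17\n\fRegister\nZimmermann IV: 5\n' | clean_names)
printf '%s\n' "$sample_output"
```

Only the two name lines should survive. Running `wc -l < dfa.txt` on the real output gives a quick check against the name count stated in the header comment.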