@stefanschmidt
Created May 8, 2021 03:24
Convert a list of German last names from PDF to plain text
# Using pdftotext, we convert an extensive list of German last names from PDF to plain text
# Deutscher Familiennamenatlas (DFA) is available from https://www.namenforschung.net
#
# depends on pdftotext from the poppler package, plus GNU sed (gsed) and GNU grep (ggrep)
# (available via Homebrew as poppler, gnu-sed and grep)
# currently 54152 names (May 2021)
curl -fL -o temp.pdf "https://www.namenforschung.net/fileadmin/user_upload/dfa/Inhaltsverzeichnisse_etc/Index_Band_I-V_Gesamt_Stand_September_2016.pdf"
pdftotext temp.pdf
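# remove page titles, page numbers, etc. from the text with regular expressions
# (grep matches line by line, so the title pattern needs no trailing \n; GNU grep -E does
# not read \f as a form feed, so a literal one is passed via bash/zsh $'\f' quoting)
gsed -E 's/ [IV]+:.*//' temp.txt |\
ggrep -v 'Deutscher Familiennamenatlas' |\
ggrep -v '[0-9]' |\
ggrep -v '^$' |\
ggrep -v $'\f' > dfa.txt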
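
The cleanup steps can be tried on a few lines without downloading the PDF. The following is a minimal sketch, not part of the original gist: `clean_names` and `sample` are hypothetical helper names, the sample lines are made up, and the entry layout ("Name I: 12") is only inferred from the sed expression. It uses plain `sed` and `grep` (assuming a `sed` that accepts `-E`, as both GNU and BSD sed do) and builds the form feed with `printf` so it does not depend on shell-specific quoting.

```shell
#!/bin/sh
# Sketch: the cleanup pipeline as a filter from stdin to stdout.
# clean_names is a hypothetical helper, not part of the original script.
clean_names() {
  sed -E 's/ [IV]+:.*//' |              # strip volume/page references
    grep -v 'Deutscher Familiennamenatlas' |  # drop page titles
    grep -v '[0-9]' |                   # drop lines with digits (page numbers)
    grep -v '^$' |                      # drop empty lines
    grep -v "$(printf '\f')"            # drop lines containing a form feed
}

# Made-up sample input imitating pdftotext output (assumed layout).
sample() {
  printf '\fDeutscher Familiennamenatlas\n'
  printf 'Hoffmann I: 12, 34\n'
  printf '\n'
  printf '217\n'
  printf 'Meier II: 5\n'
  printf 'Schmidt V: 301\n'
  printf '\f\n'
}

sample | clean_names
# prints:
# Hoffmann
# Meier
# Schmidt
```

Note that "Hoffmann" survives even though it contains the letter f: the last filter matches an actual form feed character, not the two-character sequence `\f`.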