@bbenno
Last active January 9, 2020 20:17
Convert text from Wikipedia to easily readable, easily processable text
# citation marks
s/\[[[:digit:]]\+\]//g
s/\[citation needed\]//g
# print non-ASCII
#s/[^[:print:]\n\r\t]\+/sign: &/w errorfile
# empty lines
/^[[:space:]]*$/d
# quotation marks
## U+2018-U+201B
s/[‘’‚‛]/'/g
## U+201C-U+201E
s/[“„”]/"/g
# whitespaces
## leading whitespaces
s/^[ \t]*//g
## trailing whitespaces
s/[[:space:]]\+$//
## double whitespaces
s/ \{2,\}/ /g
# dashes and minus sign
## U+2013-U+2015, U+2212
s/[—―–−]/-/g
# misc unicode symbols
## U+203C
s/‼/!!/g
## U+2026
s/…/.../g
## U+03C0
s/π/Pi/g
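
A minimal usage sketch, assuming a POSIX shell with `sed` available. The function name `clean_wiki` is hypothetical and not part of the gist. It inlines only the ASCII-only rules from the script above (citation marks, empty lines, whitespace), written in portable basic regular expressions so the result does not depend on the locale or on GNU extensions.

```shell
# Hypothetical helper: applies the ASCII-only rules of the script to stdin.
clean_wiki() {
  sed \
    -e 's/\[[0-9][0-9]*\]//g' \
    -e 's/\[citation needed\]//g' \
    -e '/^[[:space:]]*$/d' \
    -e 's/^[[:space:]]*//' \
    -e 's/[[:space:]]*$//' \
    -e 's/ \{2,\}/ /g'
}

# Example: strip citation marks, drop the empty line, normalize spacing.
printf '  Pi is irrational.[12][citation needed]  \n\nIt   never  repeats.\n' | clean_wiki
```

To apply the complete script instead, it can be saved to a file and passed to `sed -f`, for example `sed -f wikipedia.sed article.txt`, where both filenames are placeholders. The full script relies on GNU sed features such as `\+` and on a UTF-8 locale for the Unicode rules.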