@jeffjohnson9046
Created April 8, 2019 05:25
Replace UTF-8 "extended" characters with ASCII equivalents
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Replace selected non-ASCII characters in a UTF-8 text file with ASCII equivalents.

An input file might look something like this:

    $ cat input.txt
    some ñ thing
    foo ñññ

When the script is done, output.txt will look like this:

    $ cat output.txt
    some x thing
    foo xxx

There are other libraries that will do this automagically (e.g. unidecode), but in my case I wanted
control over what gets mapped to what.
"""
import io

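# Keys are Unicode code points (what translate() expects); values are the replacement text.
replacement_map = {
    ord(u'ñ'): u'x',
    # ... other mappings here...
}

# Writing with encoding='ascii' raises UnicodeEncodeError if a non-ASCII character
# that is missing from replacement_map survives translation.
with io.open('input.txt', encoding='utf-8') as data:
    with io.open('output.txt', 'w', encoding='ascii') as out:
        for row in data:
            out.write(row.translate(replacement_map))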
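
Because output.txt is opened with encoding='ascii', any non-ASCII character that is not covered by replacement_map stops the script with a UnicodeEncodeError. The sketch below is one possible way to deal with that and is not part of the original script: the helper names `to_ascii` and `unmapped_chars` and the `fallback` parameter are made up for illustration. The first helper substitutes a placeholder for anything left over after translation; the second lists the characters that still need a mapping.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_ascii(text, mapping, fallback='?'):
    """Translate text with mapping, then swap any leftover non-ASCII character for fallback.

    Hypothetical helper: `fallback` is an assumption, not something the original script has.
    """
    translated = text.translate(mapping)
    return ''.join(ch if ord(ch) < 128 else fallback for ch in translated)


def unmapped_chars(text, mapping):
    """Return the sorted non-ASCII characters in text that have no entry in mapping."""
    return sorted({ch for ch in text if ord(ch) > 127 and ord(ch) not in mapping})


replacement_map = {ord('ñ'): 'x'}

print(to_ascii('some ñ thing', replacement_map))
print(unmapped_chars('foo ñññ é', replacement_map))
```

Running `unmapped_chars` over the input first shows which entries are still missing from the map, which keeps the "control over what gets mapped to what" idea intact; `to_ascii` is only a safety net for cases where a placeholder is acceptable.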