How to change encoding from Non-ISO extended-ASCII text, with CRLF line terminators to UTF-8?

file tells you “Non-ISO extended-ASCII text” because it detects that this is:

  • most likely a “text” file from the lack of control characters (byte values 0–31) other than line breaks;
  • “extended-ASCII” because there are characters outside the ASCII range (byte values ≥128);
  • “non-ISO” because there are characters in the 128–159 range (ISO 8859 reserves this range for control characters).
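
The three checks above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the actual logic of file: the function name classify, the set of tolerated control bytes, and the returned labels are assumptions, and the sketch skips the UTF-8 and UTF-16 detection that file also performs.

```python
def classify(data: bytes) -> str:
    """Rough imitation of the three checks described above (hypothetical helper)."""
    # Assumed set of control bytes that still count as "text": tab, LF, FF, CR.
    allowed_controls = {0x09, 0x0A, 0x0C, 0x0D}

    # Any other byte in 0-31 suggests the file is not plain text.
    if any(b < 0x20 and b not in allowed_controls for b in data):
        return "data"

    # Only bytes below 128: plain ASCII.
    if all(b < 0x80 for b in data):
        return "ASCII text"

    # Bytes in 128-159 are control characters in ISO 8859, so a text file
    # that uses them is probably in some other 8-bit encoding.
    if any(0x80 <= b <= 0x9F for b in data):
        return "Non-ISO extended-ASCII text"

    return "ISO-8859 text"


if __name__ == "__main__":
    print(classify(b"plain\r\n"))
    print(classify("coś trwałego\r\n".encode("cp1250")))
```

In CP1250 the letter ś is byte 0x9C, which falls in the 128–159 range, so the second call lands in the “non-ISO” branch even though ł (0xB3) alone would not.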

You have to figure out which encoding this file seems to be in. You can try Enca's automatic recognition. You might need to nudge it in the right direction by telling it in what language the text is.

enca x.txt
enca -L polish x.txt

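To convert the file, pass the -x option and feed the file on standard input (given a file name, Enca converts that file in place instead of writing to standard output):

enca -L polish -x utf8 <x.txt >x.utf8.txt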

If you can't or don't want to use Enca, you can guess the encoding manually. A bit of looking around told me that this is Polish text and the words are trwały, stały, usuwać, so we're looking for a translation where ³ → ł and æ → ć. This looks like latin-2 or latin-10 or, more likely (given the “non-ISO”), CP1250, which you're viewing as latin1. To convert the file to UTF-8, you can use recode or iconv.

recode CP1250..utf8 <x.txt >x.utf8.txt
iconv -f CP1250 -t UTF-8 <x.txt >x.utf8.txt
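
The manual guess can also be checked mechanically. The following is a minimal Python sketch, assuming the bytes for the three words above; the helper names plausible_encodings and to_utf8 and the extra test bytes for “coś” are made up for illustration. It decodes the same bytes under each candidate encoding and rejects any candidate that produces control characters, which is what separates CP1250 from latin-2 and latin-10 once a byte in the 128–159 range shows up.

```python
import unicodedata

# Bytes as they might appear in the file: the three words encoded in CP1250.
sample = b"trwa\xb3y, sta\xb3y, usuwa\xe6"


def plausible_encodings(data, candidates):
    """Return the candidates under which data decodes without stray control characters."""
    good = []
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # some byte is undefined in this encoding
        # Unicode category "Cc" covers both the C0 and C1 control ranges.
        if all(ch in "\t\r\n" or unicodedata.category(ch) != "Cc" for ch in text):
            good.append(name)
    return good


def to_utf8(data, encoding="cp1250"):
    """Re-encode bytes from the guessed encoding to UTF-8."""
    return data.decode(encoding).encode("utf-8")


if __name__ == "__main__":
    candidates = ["cp1250", "iso8859_2", "iso8859_16"]
    print(sample.decode("latin-1"))  # what a latin1 viewer displays
    print(sample.decode("cp1250"))   # the intended Polish text
    # All three candidates agree on these particular bytes.
    print(plausible_encodings(sample, candidates))
    # Byte 0x9C is ś in CP1250 but a control character in ISO 8859.
    print(plausible_encodings(b"co\x9c", candidates))
```

The bytes 0xB3 and 0xE6 map to ł and ć in all three candidates, so those words alone cannot tell the encodings apart; only a byte in the 128–159 range does.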