Last active April 30, 2017 09:48
How to identify malformed characters or illegal byte sequence in files

Legacy Data

When dealing with legacy data it's been pretty common to run into malformed / illegal byte sequences in files. Figuring out what's causing the issue is often really difficult, especially when the file has thousands of rows.

Here's a trick I pretty much stumbled upon:

nl file.txt | sort

sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `  9009\tThis is a line without errors\r' and `  9010\tLine\222s got strange chars\r'.

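nl adds line numbers to the file, and sort blows up while comparing the line with the malformed characters in question. The output from the explosion contains the line number in question, in this case 9010. As the error message itself hints, this depends on the locale: with LC_ALL='C' sort compares raw bytes and the error does not appear.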

At the very least this should give you a good starting point for troubleshooting upstream.
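
Once sort has pointed at a line, dumping that line's raw bytes makes the offending character visible. Here's a minimal sketch, assuming a POSIX shell with sed and od available; sample.txt is a hypothetical stand-in for file.txt, and its line 2 plays the role of line 9010:

```shell
# Build a small stand-in file; line 2 contains the raw byte \222 (0x92).
printf 'This is a line without errors\r\nLine\222s got strange chars\r\n' > sample.txt

# Print only the suspect line and dump it byte by byte.
# LC_ALL=C makes both tools treat every byte as a single character.
LC_ALL=C sed -n '2p' sample.txt | LC_ALL=C od -c
```

In the od output, bytes that aren't printable ASCII show up as three-digit octal values, so the stray 222 stands out between the "e" and the "s".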

cjolly commented Aug 5, 2014

You can also use iconv file.txt and it will give the relevant line and character sequence that's causing issues.

iconv: file.txt:8904:209: cannot convert
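
A sketch of the same check with the encodings spelled out, assuming the file is supposed to be UTF-8. iconv exits non-zero when it hits a sequence it cannot convert. The wording of its message differs between iconv implementations, so this example relies only on the exit status. sample.txt is a hypothetical file:

```shell
# Line 2 contains the raw byte \222, which is not valid UTF-8 on its own.
printf 'This is a line without errors\nLine\222s got strange chars\n' > sample.txt

# Convert UTF-8 to UTF-8 purely as a validity check; discard the converted text.
if iconv -f UTF-8 -t UTF-8 sample.txt > /dev/null 2>&1; then
  status="valid"
else
  status="invalid"
fi
echo "sample.txt is $status UTF-8"
```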

yamoinza commented Feb 1, 2017

the nl tip didn't work for me... that seems to remove the error from sort appearing
iconv does report ":3:157: cannot convert" though on "pequeño" which seems strange
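
One possible explanation, offered as an assumption since the file itself isn't shown: if the file is ISO-8859-1 (or Windows-1252) rather than UTF-8, the ñ is stored as the single byte 0xF1, which is not a valid UTF-8 sequence. Naming the source encoding explicitly lets iconv convert it. The file names below are hypothetical:

```shell
# "pequeño" with the ñ stored as the single Latin-1 byte \361 (0xF1).
printf 'peque\361o\n' > latin1.txt

# Read the bytes as ISO-8859-1 and write them back out as UTF-8.
iconv -f ISO-8859-1 -t UTF-8 latin1.txt > utf8.txt

# The ñ is now the two-byte UTF-8 sequence \303\261.
LC_ALL=C od -c utf8.txt
```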
