cjolly/sort.md

Legacy Data

When dealing with legacy data, it's pretty common to run into malformed or illegal byte sequences in files. Figuring out which line is causing the issue is often really difficult, especially when the file has thousands of rows.
Here's a trick I pretty much stumbled upon:

    nl file.txt | sort
    sort: string comparison failed: Illegal byte sequence
    sort: Set LC_ALL='C' to work around the problem.
    sort: The strings compared were `  9009\tThis is a line without errors\r' and `  9010\tLine\222s got strange chars\r'.

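nl (see its man page) adds line numbers to the file, and sort (see its man page) blows up when it compares the line containing the malformed characters. The error message quotes both lines being compared, so it shows the number of the offending line, in this case 9010. Note that nl skips blank lines by default, so if the file contains empty lines, use nl -ba to get numbers that match the real line numbers.
At the very least, this should give you a good starting point for troubleshooting upstream.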
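
Whether sort fails like this depends on the sort implementation and the locale, so the trick may not reproduce everywhere. As a rough alternative, here is a minimal sketch that checks each line with iconv and prints the numbers of the lines it rejects. It assumes the file is meant to be UTF-8 (the original does not say which encoding is expected), and `find_bad_lines` is a hypothetical helper name, not something from the commands above. It starts one iconv process per line, so it is slow on very large files.

```shell
#!/bin/sh
# Hypothetical helper: print the number of every line in FILE that is not
# valid UTF-8. Assumption: the file is supposed to be UTF-8.
# The body runs in a subshell, so LC_ALL, n and line do not leak to the caller.
find_bad_lines() (
    LC_ALL=C          # make the shell treat the input as plain bytes
    export LC_ALL
    n=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        # iconv exits non-zero when it meets an invalid input sequence.
        if ! printf '%s' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            printf '%s\n' "$n"
        fi
    done < "$1"
)
```

Calling `find_bad_lines file.txt` prints one line number per offending line. Blank lines are counted, so the numbers match what a text editor shows.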