@GwynethLlewelyn
Created November 2, 2023 13:20
One-liner shell script to convert every file in a large directory tree from ISO-Latin-1 to UTF-8

Convert files from ISO-8859-1 (Latin-1) to UTF-8, recursively

This requires recode to be installed (brew install recode or apt install recode).

The example shows HTML files only. Adjust as required.

Process:

  1. (Recursively) find all files whose names end in .htm or .html.
  2. For each match, check its file type using file.
  3. Keep only the replies that carry the ISO-8859 tag and extract the filename from each, using grep and cut. (If you're using a more sophisticated version of grep, such as ugrep, you might be able to format the result directly and skip the pipe to cut; "modern" grep versions may also have some formatting options these days, but I have not checked.)
  4. This will give you a list of the paths (relative to the current working directory) of all files that file identifies as Latin-1.
  5. Pipe the result to cat. (cat only copies its input to its output, so this stage should be removable without changing the result; why it seemed to be needed baffled me for quite a while.)
  6. Feed the generated list through the usual while read line; do ...; done shell loop, taking each filename in turn.
  7. Feed each filename to recode for converting it from Latin-1 to UTF-8, while preserving all timestamps and other attributes.

Is this the best solution? Probably not. It has the advantage of scaling linearly: it takes two passes, so roughly 2*N steps for N = number of files in the directory tree, which is still O(N). find does a single pass to extract all filenames; these are then fed (as if they were just one list) into a loop that does the conversion, one by one. By that stage they have already been filtered (i.e. no binaries, only text files with ISO-8859-1 encoding, etc.).

find . \( -name "*.htm" -o -name "*.html" \) -exec sh -c 'file "$1" | grep ISO-8859 | cut -d: -f 1' sh {} \; | cat | while IFS= read -r line; do recode Latin-1..UTF-8 "$line"; done
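
Before converting anything, it can be useful to run only the detection half of the pipeline and look at the resulting list. The following is a minimal sketch, assuming a POSIX shell. The function name list_latin1_html is made up for this example, and file -b (brief output, without the filename prefix) stands in for the grep-and-cut step so that filenames containing a colon are not truncated.

```shell
# Hypothetical helper: list the HTML files under a directory that file(1)
# reports as ISO-8859, without modifying anything.
list_latin1_html() {
    # $1 is the directory to scan; defaults to the current directory
    find "${1:-.}" -type f \( -name '*.htm' -o -name '*.html' \) \
        -exec sh -c 'file -b "$1" | grep -q ISO-8859 && printf "%s\n" "$1"' sh {} \;
}
```

Calling list_latin1_html . prints one path per line. If the list looks right, the same paths are what the conversion loop would receive.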

It's possible to do everything in a single pass (still O(N), but visiting each file only once), but the exact command eluded me.
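
One way the single-pass version might look: let find hand batches of filenames to a small inline script that tests each file and converts it on the spot, so no intermediate list is needed. This is a sketch under assumptions rather than a tested drop-in replacement. convert_latin1_tree is a hypothetical name, and file -b (brief output, no filename prefix) replaces the grep-and-cut step.

```shell
# Hypothetical helper: detect and convert in one traversal of the tree.
convert_latin1_tree() {
    # $1 is the directory to scan; defaults to the current directory
    find "${1:-.}" -type f \( -name '*.htm' -o -name '*.html' \) -exec sh -c '
        for f in "$@"; do
            # Convert only the files that file reports as ISO-8859
            if file -b "$f" | grep -q ISO-8859; then
                recode Latin-1..UTF-8 "$f"
            fi
        done
    ' sh {} +
}
```

Called as convert_latin1_tree ., it recodes the matching files in place. The {} + form starts one sh per batch of files instead of one per file, which should also cut down on process creation in a large tree.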

You can also take a different approach: use find just to retrieve the directory names and give you a tree of those. Then feed those to grep, which will evaluate all the entries in each directory. The theory here is that grep (and especially ugrep!) might be considerably faster than find on each directory. And it's even possible that a few tweaks might allow ugrep (which works recursively by default!) to do all the work, and pipe the results to the while loop. Or even execute recode directly. Hmm. I should look more into that possibility...
