GwynethLlewelyn/convert-from-Latin-1-to-UTF-8.md

## convert-from-Latin-1-to-UTF-8.md

      
## Convert files from ISO-8859-1 (Latin-1) to UTF-8, recursively

This requires `recode` to be installed (`brew install recode` or `apt install recode`).
The example shows HTML files only. Adjust as required.
Process:


1. (Recursively) find all files whose names end in `.htm` or `.html`.
2. For each match, check its file type using `file`.
3. From the output, keep the lines tagged ISO-8859 using `grep`, and reduce them to the path alone using `cut`. (Note: with a more sophisticated version of grep, such as ugrep, you might be able to format the result directly and skip piping to `cut` to show only selected fields; "modern" grep versions may also have some formatting options these days, but I have not checked.)
4. This gives you a list of the full paths (relative to the current working directory) of all files currently detected as Latin-1.
5. Pipe the result to `cat`. Why exactly this is needed is a bit beyond me; it seemed to be some sort of shell-y requirement which baffled me for quite a while. Since `cat` only copies its input to its output, the pipeline should behave the same without it, but it is harmless.
6. Feed the generated list through the usual `while read line; do ...; done` shell loop, which takes each filename in turn.
7. Feed each filename to `recode` to convert it from Latin-1 to UTF-8, while preserving all timestamps and other attributes.
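
Steps 1 to 4 can be run on their own as a dry run, to see which files would be converted before anything is touched. The following is a minimal sketch and not part of the original recipe: `list_latin1_html` is a hypothetical helper name, and it hands batches of files to `file` via `-exec ... {} +` rather than starting one shell per file. It assumes that paths contain no colons or newlines and that the string "ISO-8859" does not appear in any file name.

```shell
#!/bin/sh
# list_latin1_html [DIR]: hypothetical helper, not part of the original recipe.
# Prints the paths of *.htm / *.html files under DIR (default: .) that
# file(1) describes as ISO-8859 text. Nothing is modified.
list_latin1_html() {
  find "${1:-.}" -type f \( -name '*.htm' -o -name '*.html' \) -exec file {} + |
    grep 'ISO-8859' |   # keep only the lines tagged as ISO-8859
    cut -d: -f1         # file prints "path: description"; keep the path
}
```

Calling `list_latin1_html .` prints one path per line; an empty result means that `file` found nothing it considers Latin-1.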

Is this the best solution? Probably not. It has the advantage of needing just two linear passes over the N files in the directory tree (roughly 2*N steps, which is still O(N)): `find` does a single pass to extract all filenames; these are then fed (as if they were just one list) into a loop to do the conversion, one by one. By that stage they have already been filtered (i.e. no binaries, only text files detected as ISO-8859-1, etc.).

    # List HTML files that file(1) tags as ISO-8859, then recode each in place.
    find . \( -name '*.htm' -o -name '*.html' \) \
      -exec sh -c 'file "$1" | grep ISO-8859 | cut -d: -f1' sh {} \; |
      cat | while IFS= read -r line; do recode Latin-1..UTF-8 "$line"; done

It's possible to do everything in a single loop (i.e. one pass over the files instead of two), but the exact command eluded me.
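
One way the single-loop variant might look is sketched below, with the `file` check moved inside the loop so that each file is examined and converted in the same pass. This is an assumption about what such a command could look like, not the author's tested method: `convert_tree` is a hypothetical helper name, and `file -b` (brief output, without the file name) is used so that no `cut` is needed.

```shell
#!/bin/sh
# convert_tree [DIR]: hypothetical helper sketching a single-pass variant.
# find hands batches of HTML files to one inline sh script, which checks
# each file and recodes it in place only if file(1) reports ISO-8859.
convert_tree() {
  find "${1:-.}" -type f \( -name '*.htm' -o -name '*.html' \) -exec sh -c '
    for f; do
      # -b prints only the description, so the path cannot confuse grep
      if file -b "$f" | grep -q "ISO-8859"; then
        recode Latin-1..UTF-8 "$f"
      fi
    done
  ' sh {} +
}
```

`convert_tree .` would process the current directory. Because `find` passes the file names as arguments (`{} +`), names containing spaces survive intact. It is probably wise to try it on a copy of the tree first.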
You can also take a different approach: use `find` just to retrieve the directory names and give you a tree of those, then feed those to `grep`, which will evaluate all the entries in each directory. The theory here is that grep (and especially ugrep!) might be considerably faster than `find` on each directory. It's even possible that a few tweaks might allow ugrep, which works recursively by default, to do all the work and pipe the results to the `while` loop, or even to execute `recode` directly. Hmm. I should look more into that possibility...