@leonelag
Created March 1, 2013 11:58
;;
;; Utility script to convert files between character encodings.
;;
;; I used this to fix the character encoding in a messy project whose files
;; had been saved in a mix of encodings. As written, it only reports
;; suspicious characters; convert-encoding is defined but never called.
;;
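(require '[clojure.java.io :as io])

;; True if the name of file f ends with any of the given suffixes.
(defn has-suffix [f suffixes]
  (some #(.endsWith (.getName f) %)
        suffixes))

;; Every .java and .ui.xml file under the project source tree.
(def files
  (filter #(has-suffix % [".java" ".ui.xml"])
          (file-seq (io/file "C:/MyProject/src"))))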
;;
;; Basic Latin and Control characters
;; http://en.wikipedia.org/wiki/C0_Controls_and_Basic_Latin
;;
;; Latin1 Supplement
;; http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement
;;
;; Latin characters in Unicode
;; http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
;;
(defn valid-char? [character]
  (let [ch (int character)]
    (or (#{\newline \tab \return} character)
        (<= 0x20 ch 0x3F) ; Space, punctuation marks, digits.
        (<= 0x40 ch 0x7D) ; @, letters and most other ASCII symbols; the tilde (0x7E) is not included.
        (#{\à \á \â \ã \ç \è \é \ê \í   ; Non-exhaustive list of letters with accents.
           \À \Á \Â \Ã \Ç \È \É \Ê \Í
           \õ \ó \ô \ú \û \ü
           \Õ \Ó \Ô \Ú \Û} character))))

;; Returns the characters of f (read as UTF-8) that fail valid-char?.
(defn invalid-chars [f]
  (let [contents (slurp f :encoding "utf-8")]
    (remove valid-char? contents)))
;; Not used: rewrites f in place, decoding it with from-encoding and
;; encoding it with to-encoding.
(defn convert-encoding [f from-encoding to-encoding]
  (let [contents (slurp f :encoding from-encoding)]
    (spit f contents :encoding to-encoding)))
;;
;; Prints the codes of invalid characters in a file tree
;;
(doseq [f files]
  (let [invalid (invalid-chars f)]
    (when (seq invalid)
      (println (.getAbsolutePath f)
               (map (fn [ch]
                      [(Integer/toHexString (int ch))
                       ch])
                    invalid)))))

leonelag commented Mar 1, 2013

Clojure script to convert files to a different character encoding.

I used this in a project whose files were written in Brazilian Portuguese, so the allowed characters are the ones most common in Portuguese, with special attention to accented characters.
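
The accented-letter list in `valid-char?` is hand-written and non-exhaustive. One possible alternative, assuming every letter in the Latin-1 Supplement block is acceptable (which is broader than a Portuguese-only list), is to ask the JVM which Unicode block a character belongs to. This is only a sketch; `latin1-letter?` is a hypothetical name and is not part of the script.

```clojure
(import '(java.lang Character$UnicodeBlock))

;; Hypothetical helper: true when c is a letter from the
;; Latin-1 Supplement block (U+0080 to U+00FF).
;; Assumption: every letter in that block is acceptable.
(defn latin1-letter? [c]
  (let [ch (char c)]
    (and (Character/isLetter ch)
         (= Character$UnicodeBlock/LATIN_1_SUPPLEMENT
            (Character$UnicodeBlock/of ch)))))
```

Inside `valid-char?`, a call such as `(latin1-letter? character)` could take the place of the literal set of accented letters. Note that this also accepts letters that are not used in Portuguese, such as `ß` or `ñ`.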

When writing this script, the following pages were useful:

Basic Latin and Control characters
http://en.wikipedia.org/wiki/C0_Controls_and_Basic_Latin

Latin1 Supplement
http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement

Latin characters in Unicode
http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
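
The `convert-encoding` function in the script is defined but never called. Below is a minimal sketch of how it might be applied, assuming that any file which does not decode cleanly as UTF-8 was originally saved as ISO-8859-1. That is an assumption about the project, not something the script verifies. `decodes-cleanly?` and `fix-file!` are hypothetical helpers; the check relies on the JVM reader substituting U+FFFD for byte sequences that are not valid UTF-8.

```clojure
(require '[clojure.java.io :as io])

;; Same as the script: rewrites f in place with a new encoding.
(defn convert-encoding [f from-encoding to-encoding]
  (let [contents (slurp f :encoding from-encoding)]
    (spit f contents :encoding to-encoding)))

;; Hypothetical helper: true if reading f as UTF-8 produces no
;; U+FFFD replacement characters. A file that legitimately contains
;; U+FFFD would be misjudged by this check.
(defn decodes-cleanly? [f]
  (not-any? #(= \uFFFD %) (slurp f :encoding "UTF-8")))

;; Hypothetical helper: converts f to UTF-8 only when it does not
;; already decode cleanly. ASSUMPTION: such files are ISO-8859-1.
;; Returns true if the file was rewritten, nil otherwise.
(defn fix-file! [f]
  (when-not (decodes-cleanly? f)
    (convert-encoding f "ISO-8859-1" "UTF-8")
    true))
```

Because the conversion overwrites files in place, it is safer to try something like `(doseq [f files] (fix-file! f))` on a copy of the source tree, or on a tree under version control, before trusting the result.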
