Skip to content

Instantly share code, notes, and snippets.

@ericnormand
Last active September 7, 2020 15:25
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ericnormand/5c89a9211a97af48c50c14a01671ed41 to your computer and use it in GitHub Desktop.
Save ericnormand/5c89a9211a97af48c50c14a01671ed41 to your computer and use it in GitHub Desktop.

Better filename sort

Typically, strings are sorted by their unicode value. That means you get some weird behavior. We see this with naive sorts of filenames.

  • "Zoo.txt" is sorted before "academy.zip"
  • "12 Final chapter.txt" is sorted before "2 Second chapter.txt"

We should be able to find something a bit smarter. Let's try two rules:

  • Case insensitivity — a and A should be equivalent
  • Numbers in order — strings of digits should sort in numerical order, not string order

Write a function that you can pass to sort-by that will sort filenames correctly.

Examples

(sort-by filename-order ["Zoo.txt" "academy.zip"]) ;=> ("academy.zip" "Zoo.txt")
(sort-by filename-order ["12 Final chapter.txt" "2 Second chapter.txt"]) ;=> ("2 Second chapter.txt" "12 Final chapter.txt")

Bonus

It's also unfortunate that "elephant.txt" is sorted so far away from "éléphant.txt" (the French version). Make them sort next to each other. "e" should come before "é".

Email submissions to eric@purelyfunctional.tv before September 06, 2020. You can discuss the submissions in the comments below.

(defn filename-order [filename]
(-> filename
;; case insensitive
(str/lower-case)
;; proper number ordering?
;; I couldn't find a canonical form to which this could be transformed
;; so that numbers are ordered properly
#_(str/replace #"(\d+)" "0$1") ; this clearly doesn't work
;; accent insensitive
;; https://stackoverflow.com/questions/51731574/removing-accents-and-diacritics-in-kotlin
;; https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Normalizer.Form.html
(Normalizer/normalize Normalizer$Form/NFD)))
;; return starting number else nil
(defn startwnum? [s]
(let [nn (re-find #"^\d+" s)]
(if nn (Integer/parseInt nn) nil)))
;; custom comparator
(defn filename-order [a b]
(let [anum (startwnum? a)
bnum (startwnum? b)]
(if (and anum bnum)
(compare anum bnum)
(compare (s/lower-case a) (s/lower-case b)))))
(sort filename-order ["Zoo.txt" "academy.zip"])
;;=> ("academy.zip" "Zoo.txt")
(sort filename-order ["12 Final chapter.txt" "2 Second chapter.txt"])
;;=> ("2 Second chapter.txt" "12 Final chapter.txt")
;; Bonus
;; map of regexes of common accented chars => ascii chars. NOT COMPLETE
(def accent-map
{#"[èéêë]" "e"
#"[àáâãäå]" "a"
#"[ç]" "c"
#"[ìíîï]" "i"
#"[ñ]" "n"
#"[òôõö]" "o"
#"[ùúûü]" "u"
#"[ýÿ]" "y"})
(defn remove-accents [s]
(reduce #(s/replace %1 (key %2) (val %2)) (s/lower-case s) accent-map))
;; accent-insensitive custom comparator
(defn filename-order-ai [a b]
(let [anum (startwnum? a) aremac (remove-accents a)
bnum (startwnum? b) bremac (remove-accents b)]
(if (and anum bnum)
(compare anum bnum)
(compare aremac bremac))))
;; note the incorrect sorting of original comparator
(sort filename-order ["éléphant.txt" "èlide.txt" "elope.txt"])
;; ("elope.txt" "èlide.txt" "éléphant.txt")
;; correct sort order using accent-insensitive comparator
(sort filename-order-ai ["éléphant.txt" "èlide.txt" "elope.txt"])
;; ("éléphant.txt" "èlide.txt" "elope.txt")
(defn filename-order [s]
(letfn [(f [[_ digits non-digits]] (if (seq digits) (Integer/parseInt digits) non-digits))]
(->> s clojure.string/lower-case (str " ") (re-seq #"(\d+)|(\D+)") (mapv f))))
(defn filename-order [s]
(-> s
clojure.string/lower-case
(clojure.string/replace #"\d+"
#(format "%09d" (Integer/parseInt %)))
(clojure.string/replace "e" "e0")
(clojure.string/replace "é" "e1")))
@steffan-westcott
Copy link

steffan-westcott commented Sep 7, 2020

Here's a variant of my solution which attempts to account for accented characters:

(defn filename-order [s]
  (->> (java.text.Normalizer/normalize s java.text.Normalizer$Form/NFD)
       clojure.string/lower-case
       (str " ")
       (re-seq #"(\d+)|(\D+)")
       (mapv (fn [[_ digits non-digits]] (if digits (Integer/parseInt digits) non-digits)))))

The spec is a little vague on how to sort strings that have runs of digits other than at the start. My approach partitions a string to a vector of alternating string and integer elements. For example, "Chapter 4 Part 3" becomes [" chapter " 4 " part " 3]. I add a single space character at the start to ensure the first element in the vector is a string, which allows the vectors to be compared (during sorting) without mishap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment