ericnormand/00 Better filename sort.md

## 00 Better filename sort.md

      
    Raw
  

              00 Better filename sort.md
            
          
    Better filename sort
Typically, strings are sorted by their unicode value. That means you get some weird behavior. We see this with naive sorts of filenames.

"Zoo.txt" is sorted before "academy.zip"
"12 Final chapter.txt" is sorted before "2 Second chapter.txt"

We should be able to find something a bit smarter. Let's try two rules:

Case insensitivity — a and A should be equivalent
Numbers in order — strings of digits should sort in numerical order, not string order

Write a function that you can pass to sort-by that will sort filenames correctly.
Examples
(sort-by filename-order ["Zoo.txt" "academy.zip"]) ;=> ("academy.zip" "Zoo.txt")
(sort-by filename-order ["12 Final chapter.txt" "2 Second chapter.txt"]) ;=> ("2 Second chapter.txt" "12 Final chapter.txt")
Bonus
It's also unfortunate that "elephant.txt" is sorted so far away from "éléphant.txt" (the French version). Make them sort next to each other. "e" should come before "é".
Email submissions to eric@purelyfunctional.tv before September 06, 2020. You can discuss the submissions in the comments below.

  
## Juraj Martinka.clj
(defn filename-order [filename]
  (-> filename
      ;; case insensitive
      (str/lower-case)

      ;; proper number ordering?
      ;; I couldn't find a canonical form to which this could be transformed
      ;; so that numbers are ordered properly
      #_(str/replace #"(\d+)" "0$1") ; this clearly doesn't work

      ;; accent insensitive
      ;; https://stackoverflow.com/questions/51731574/removing-accents-and-diacritics-in-kotlin
      ;; https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Normalizer.Form.html
      (Normalizer/normalize Normalizer$Form/NFD)))

## Mark Champine.clj
;; return starting number else nil
(defn startwnum? [s]
  (let [nn (re-find #"^\d+" s)]
    (if nn (Integer/parseInt nn) nil)))

;; custom comparator
(defn filename-order [a b]
  (let [anum (startwnum? a)
        bnum (startwnum? b)]
    (if (and anum bnum)
        (compare anum bnum)
        (compare (s/lower-case a) (s/lower-case b)))))

(sort filename-order ["Zoo.txt" "academy.zip"])
;;=> ("academy.zip" "Zoo.txt")

(sort filename-order ["12 Final chapter.txt" "2 Second chapter.txt"])
;;=> ("2 Second chapter.txt" "12 Final chapter.txt")

;; Bonus

;; map of regexes of common accented chars => ascii chars. NOT COMPLETE
(def accent-map
  {#"[èéêë]" "e"
   #"[àáâãäå]" "a"
   #"[ç]" "c"
   #"[ìíîï]" "i"
   #"[ñ]" "n"
   #"[òôõö]" "o"
   #"[ùúûü]" "u"
   #"[ýÿ]" "y"})

(defn remove-accents [s]
  (reduce #(s/replace %1 (key %2) (val %2)) (s/lower-case s) accent-map))

;; accent-insensitive custom comparator
(defn filename-order-ai [a b]
  (let [anum (startwnum? a) aremac (remove-accents a)
        bnum (startwnum? b) bremac (remove-accents b)]
    (if (and anum bnum)
        (compare anum bnum)
        (compare aremac bremac))))

;; note the incorrect sorting of original comparator
(sort filename-order ["éléphant.txt" "èlide.txt" "elope.txt"])
;; ("elope.txt" "èlide.txt" "éléphant.txt")

;; correct sort order using accent-insensitive comparator
(sort filename-order-ai ["éléphant.txt" "èlide.txt" "elope.txt"])
;; ("éléphant.txt" "èlide.txt" "elope.txt")

## Steffan Westcott.clj
(defn filename-order [s]
  (letfn [(f [[_ digits non-digits]] (if (seq digits) (Integer/parseInt digits) non-digits))]
    (->> s clojure.string/lower-case (str " ") (re-seq #"(\d+)|(\D+)") (mapv f))))

## Willy Heng.clj
(defn filename-order [s]
  (-> s
      clojure.string/lower-case
      (clojure.string/replace #"\d+"
                              #(format "%09d" (Integer/parseInt %)))
      (clojure.string/replace "e" "e0")
      (clojure.string/replace "é" "e1")))
	(defn filename-order [filename]
	(-> filename
	;; case insensitive
	(str/lower-case)

	;; proper number ordering?
	;; I couldn't find a canonical form to which this could be transformed
	;; so that numbers are ordered properly
	#_(str/replace #"(\d+)" "0$1") ; this clearly doesn't work

	;; accent insensitive
	;; https://stackoverflow.com/questions/51731574/removing-accents-and-diacritics-in-kotlin
	;; https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Normalizer.Form.html
	(Normalizer/normalize Normalizer$Form/NFD)))
	;; return starting number else nil
	(defn startwnum? [s]
	(let [nn (re-find #"^\d+" s)]
	(if nn (Integer/parseInt nn) nil)))

	;; custom comparator
	(defn filename-order [a b]
	(let [anum (startwnum? a)
	bnum (startwnum? b)]
	(if (and anum bnum)
	(compare anum bnum)
	(compare (s/lower-case a) (s/lower-case b)))))

	(sort filename-order ["Zoo.txt" "academy.zip"])
	;;=> ("academy.zip" "Zoo.txt")

	(sort filename-order ["12 Final chapter.txt" "2 Second chapter.txt"])
	;;=> ("2 Second chapter.txt" "12 Final chapter.txt")

	;; Bonus

	;; map of regexes of common accented chars => ascii chars. NOT COMPLETE
	(def accent-map
	{#"[èéêë]" "e"
	#"[àáâãäå]" "a"
	#"[ç]" "c"
	#"[ìíîï]" "i"
	#"[ñ]" "n"
	#"[òôõö]" "o"
	#"[ùúûü]" "u"
	#"[ýÿ]" "y"})

	(defn remove-accents [s]
	(reduce #(s/replace %1 (key %2) (val %2)) (s/lower-case s) accent-map))

	;; accent-insensitive custom comparator
	(defn filename-order-ai [a b]
	(let [anum (startwnum? a) aremac (remove-accents a)
	bnum (startwnum? b) bremac (remove-accents b)]
	(if (and anum bnum)
	(compare anum bnum)
	(compare aremac bremac))))

	;; note the incorrect sorting of original comparator
	(sort filename-order ["éléphant.txt" "èlide.txt" "elope.txt"])
	;; ("elope.txt" "èlide.txt" "éléphant.txt")

	;; correct sort order using accent-insensitive comparator
	(sort filename-order-ai ["éléphant.txt" "èlide.txt" "elope.txt"])
	;; ("éléphant.txt" "èlide.txt" "elope.txt")
	(defn filename-order [s]
	(letfn [(f [[_ digits non-digits]] (if (seq digits) (Integer/parseInt digits) non-digits))]
	(->> s clojure.string/lower-case (str " ") (re-seq #"(\d+)\|(\D+)") (mapv f))))
	(defn filename-order [s]
	(-> s
	clojure.string/lower-case
	(clojure.string/replace #"\d+"
	#(format "%09d" (Integer/parseInt %)))
	(clojure.string/replace "e" "e0")
	(clojure.string/replace "é" "e1")))