Word segmentation

One of the issues with domain names is that spaces aren't allowed. So we get domain names like this:

  • penisland.com (Pen Island)
  • expertsexchange.com (Experts Exchange)

Now we also have the problem with #hashtags on social media platforms.

We want to be able to take a string without spaces and insert the spaces so that the words are separated and our grade-school teacher can be happy again.

Your task is to write a function that takes a string without spaces and a dictionary of known words and returns all possible ways it could be segmented (i.e., insert spaces) into those words. If it can't be segmented, it should return an empty sequence.

(segmentations "hellothere" ["hello" "there"]) ;=> ("hello there")
(segmentations "fdsfsfdsjkljf" ["the" "he" "she" "it"...]) ;=> ()
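To make the spec concrete, here is a minimal, non-lazy sketch of one possible approach (illustrative only, not any of the submissions below): try each dictionary word as a prefix and recurse on the remainder.

```clojure
;; Minimal sketch: try every dictionary word as a prefix of s,
;; then recursively segment whatever is left over.
(defn segmentations [s words]
  (if (empty? s)
    [[]]                                   ; one segmentation: no words left
    (for [w     words
          :when (clojure.string/starts-with? s w)
          tail  (segmentations (subs s (count w)) words)]
      (cons w tail))))

(map #(clojure.string/join " " %)
     (segmentations "hellothere" ["hello" "there"]))
;;=> ("hello there")

(segmentations "fdsfsfdsjkljf" ["the" "he" "she" "it"])
;;=> ()
```

Exponential in the worst case, and not lazy, but it satisfies the contract: all segmentations, or an empty sequence when none exist.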

Bonus: use a dictionary file and some text from somewhere and do a real test.

Super bonus: make it lazy.

Thanks to this site for the challenge idea, where it is rated Expert level in JavaScript.

Email submissions to eric@purelyfunctional.tv until May 31, 2020. You can discuss the submissions in the comments below.

(defn tokenize [tag]
  (map #(list (apply str (take % tag))
              (apply str (drop % tag)))
       (range 1 (inc (count tag)))))

(defn segmentations
  ([tag dict] (->> (segmentations tag dict [])
                   (map (partial interpose " "))
                   (map (partial apply str))))
  ([tag dict acc]
   (cond
     (empty? tag) (list acc)
     :else (mapcat (fn [[a b]]
                     (if (some (partial = a) dict)
                       (segmentations b dict (conj acc a))))
                   (tokenize tag)))))
(ns purelyfun.379)
;; My solution to the weekly challenge here:
;; https://purelyfunctional.tv/issues/purelyfunctional-tv-newsletter-379-get-produces-unexpected-behavior-as-expected/
(comment
;; My approach is to use a very basic LR parser.
;; First, I create a nested map datastructure to act as a lookup
;; table. For example:
(index-dict ["hello" "hey"])
{:h {:e {:y {:end true}, :l {:l {:o {:end true}}}}}}
;; Next, I parse one character at a time. Each time I parse a
;; character, I can check the lookup table to test whether a
;; sequence of characters exists in the dictionary. For example, if
;; I've parsed "h" and "e", I can do this:
(get-in (index-dict ["hello" "hey"]) [:h :e])
;; I use the :end keyword in my lookup table to know when I've found
;; a word. For example, this returns `nil`:
(get-in (index-dict ["hello" "hey"]) [:h :e :end])
;; ... and this returns `true`:
(get-in (index-dict ["hello" "hey"]) [:h :e :y :end])
;; The `parse` function continues to look at each character until it
;; finds a word or can't find a match in the lookup table for the
;; characters it's seen so far.
;; If `parse` finds a word, then there are two possibilities: 1. That's
;; the only match possible, or 2. It might be possible to find a
;; longer match.
;; So whenever a word is found, `parse` will branch to try these two
;; possibilities.
;; Of course, this branching creates a tree (nested vector)
;; result. So the final step is to traverse the tree and remove the
;; nested lists.
)
(defn deep-merge
  "Deep merge maps"
  [& maps]
  (if (every? map? maps)
    (apply merge-with deep-merge maps)
    (last maps)))

(defn index-word
  "Create a recursive map structure representing characters in a word"
  [word]
  (reduce (fn [r c] (hash-map (keyword (str c)) r))
          {:end true}
          (reverse word)))

(defn index-dict
  [dict]
  (apply deep-merge (map index-word dict)))

(defn parse
  "Parse `text` one character at a time, building a tree of possible
  lists of words"
  [lkup c text path result]
  (let [k        (keyword (str c))
        new-path (conj path k)
        matched  (get-in lkup new-path)]
    (cond
      ;; no match for next character
      (nil? matched)
      (if (get-in lkup (conj path :end))
        ;; found a word! Add it and continue parsing
        (parse lkup c text [] (conj result (apply str (map name path))))
        ;; no more matches can be found, so return results
        result)

      ;; found a match for the next character
      :else
      (if (get-in lkup (conj path :end))
        ;; found a word at this ending, but might also find a longer
        ;; word if we keep parsing
        [(parse lkup c text [] (conj result (apply str (map name path))))
         (parse lkup (first text) (rest text) (conj path k) result)]
        ;; no word here, keep going
        (parse lkup (first text) (rest text) (conj path k) result)))))

(defn segmentations
  "Create a tree of possible matches. The possible match lists will be
  the leaves of the tree. Filter the tree to return the leaves."
  [text dictionary]
  (let [lkup        (index-dict dictionary)
        results     (parse lkup (first text) (rest text) [] [])
        result-tree (tree-seq (partial every? vector?) seq results)]
    (map #(apply str (interpose " " %))
         (filter (complement (partial every? vector?))
                 result-tree))))
(comment
  ;; test against 1000 words
  (def words
    (with-open [rdr (clojure.java.io/reader "./resources/10000words.txt")]
      (let [words (line-seq rdr)]
        (segmentations "hellothere" words))))

  ;; penisland ... I only read this as "pen" "island" at first, but I
  ;; get it now ;-)
  (segmentations "penisland" ["pen" "is" "land" "island" "penis"])
  ;;=> ("pen is land" "pen island" "penis land")
  )
(defn append-words [a b]
  (cond
    (empty? a) b
    (empty? b) a
    :else (str a " " b)))

(defn segfirst [s words]
  (for [w words
        :when (clojure.string/starts-with? s w)]
    [w (.substring s (.length w))]))

(defn segment [string words]
  (if (empty? string)
    [""]
    (for [[word rest] (segfirst string words)
          segmentation (segment rest words)]
      (append-words word segmentation))))
(ns functional-tv-puzzles.-2020.segmentations-379
  (:require [clojure.string :refer [starts-with? join]]))

(defn head-matches [txt words]
  (for [w words
        :when (starts-with? txt w)
        :let [more (subs txt (count w))]]
    [w more]))

(defn tries [txt words]
  (for [[match more] (head-matches txt words)
        :when (seq match)
        :let [tails (tries more words)]
        :when (not (and (seq more)
                        (empty? tails)))]
    (cons match tails)))

(defn ->legs [[head & tails :as trie]]
  (if (< (count tails) 2)
    (list trie)
    (map #(cons head %) (map ->legs tails))))

(defn ->strs [legs]
  (->> legs (map flatten) (map #(join " " %))))

(defn segmentations [txt words]
  (->> (tries txt (set words))
       (mapcat ->legs)
       ->strs))
(defn segmentations [s words]
  (map #(clojure.string/join " " (cons % (segmentations (subs s (count %)) words)))
       (filter (set words)
               (map #(subs s 0 %) (range 1 (inc (count s)))))))
(defn segmentations [s l]
  (if (empty? s)
    [""]
    (for [pm (filter #(.startsWith s %) l)
          segrest (segmentations (subs s (count pm)) l)]
      (str pm (if (empty? segrest) nil " ") segrest))))

;;;; lazy implementation
(defn segmentations* [s words]
  (let [segs (for [first-word-length (range 1 (.length s)) ;; do not include last char
                   :let [first-word (.substring s 0 first-word-length)]
                   :when (contains? words first-word)
                   other-words (segmentations* (.substring s first-word-length) words)]
               (str first-word " " other-words))]
    (if (contains? words s)
      (lazy-seq (cons s segs)) ;; non-recursive case
      segs)))

(defn segmentations [s words]
  (segmentations* s (set words)))
(require '[clojure.string :as str])

;; Load dictionary online. Removes all one-letter words because they're noisy.
(def words (->> (slurp "http://www-personal.umich.edu/~jlawler/wordlist")
                (str/split-lines)
                (filter #(> (count %) 1))))

;; Returns a seq of seqs.
;; Example: (segmentations "helloworld") => (("hello" "world") ("hell" "ow" "or" "ld"))
(defn segmentations
  [s]
  (if (str/blank? s)
    [""]
    (mapcat (fn [word]
              (if (str/starts-with? s word)
                (map #(cons word %) (segmentations (subs s (count word))))
                []))
            words)))
@KingCode commented May 27, 2020

Here is a tentative test suite for anyone interested. Please comment if anything is amiss... Thanks!
EDIT: caioaao's large input has been added for performance testing, as well as MarkChampine's test input, which is shorter than any dictionary word.

(def fixtures 
  [["helloworld"          ;;input string
    ["hello" "world"]      ;; input dictionary 
    ["hello world"]],       ;; expected output, in any order if more than one string

   ["abcdefghijklmnop"
    ["a" "ab" "abc" "bcd" "cdef" "ef" "def" "defg" "ghijkl" "mnop"]
    ["abc def ghijkl mnop"
     "ab cdef ghijkl mnop"
     "a bcd ef ghijkl mnop"]],

   ["penisland"
    ["pen", "penis", "land", "island"]
    ["pen island"
     "penis land"]],

   ["penisland"
    ["pen" "island"]
    ["pen island"]],

   ["expertsexchange"
    ["experts" "expert" "sex" "exchange" "change"]
    ["experts exchange"
     "expert sex change"]],

   ["expertsexchange"
    ["expert" "experts" "exchange" "ex" "change"]
    ["experts exchange"
     "experts ex change"]],

   ["expertsexchange"
    ["expert" "experts" "exchange"]
    ["experts exchange"]],

   ["hellolasdkfjlaskdjfslkj"
    ["hello" "world"]
    []],

   ["fdsfsfdsjkljf"
    ["the" "she" "a" "it"]
    []],

   ["errorex"
    ["exercise" "error"]
    []]])

(def large-fixture ["ifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptome", 
["about", "be", "hell", "if", "is", "it", "me", "other", "outer", "people", "the", "to", "up", "where"],
["if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me"]
])
@caioaao commented May 28, 2020

Solution using a trie (prefix tree) and memoization. Amortized time is O(n * M), where n is the number of letters in s and M is the sum of the lengths of the words in the dictionary.

(def empty-node
  {:children {}})

(defn add-word [trie word]
  (if-let [char-seq (seq word)]
    (update-in trie [:children (first char-seq)]
               (fnil add-word empty-node)
               (rest char-seq))
    (assoc trie :leaf? true)))

(defn dictionary->trie [words]
  (reduce add-word empty-node words))

(def segmentations-memo
  (memoize
   (fn [trie-root {:keys [children leaf?]} char-seq]
     (cond
       (and (empty? char-seq) leaf?)
       '(())

       (or (empty? char-seq) (and (not leaf?) (empty? children)))
       '()

       leaf?
       (into (->> (segmentations-memo trie-root trie-root char-seq)
                  (map (partial cons \ )))
             (->> (segmentations-memo trie-root
                                      (children (first char-seq))
                                      (rest char-seq))
                  (map (partial cons (first char-seq)))))

       :default
       (->> (segmentations-memo trie-root
                                (children (first char-seq))
                                (rest char-seq))
            (map (partial cons (first char-seq))))))))

(defn segmentations [words s]
  (let [trie (dictionary->trie words)]
    (->> (segmentations-memo trie trie (seq s))
         (map str/join))))

This will blow up the stack if the string is too long though. I couldn't think of an easy way to fix it. Here's an example that gives a stack overflow:

(segmentations ["about", "be", "hell", "if", "is", "it", "me", "other", "outer", "people", "the", "to", "up", "where"] "ifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptome")
@KingCode commented May 29, 2020

Here is my first crack at it - lazy, but it takes a few seconds to produce a result for the large BOT input.
EDIT: Fixed the performance issue by iterating over words rather than characters, as is done in other solutions :)

(defn head-matches [txt words]
  (for [w words
        :when (starts-with? txt w)
        :let [more (subs txt (count w))]]
    [w more]))

(defn tries [txt words]
  (for [[match more :as segs] (head-matches txt words)
        :when (seq match)
        :let [tail (tries (apply str more) words)]
        :when (not (and (seq more) (empty? tail)))]
    (cons match tail)))

(defn expand [[head & tails :as trie]]
  (if (empty? tails) 
    (list trie)
    (map #(cons head %) (map expand tails))))

(defn segmentations [txt words]
  (->> (tries txt words)
       (mapcat expand)
       (map flatten)
       (map #(clojure.string/join " " %))))
@steffan-westcott commented May 29, 2020

A lazy, tail-recursive solution:

(defn append-segment [segments s dict]
  (for [len (range 1 (inc (.length s)))
        :let [segment (subs s 0 len)]
        :when (contains? dict segment)]
    [(conj segments segment) (subs s len)]))

(defn segmentations* [items dict]
  (lazy-seq
    (loop [[item & more] items]
      (if-let [[segments s] item]
        (if (empty? s)
          (cons (clojure.string/join " " segments) (segmentations* more dict))
          (recur (concat (append-segment segments s dict) more)))))))

(defn segmentations [s dict]
  (segmentations* [[[] s]] (set dict)))
@caioaao commented May 29, 2020

A non-memoized version that uses trampoline to circumvent the stack size limitation. Runs the big test in ~2ms (the memoized version ran in ~50ms, so the overhead apparently isn't worth it):

(ns playground
  (:require [clojure.string :as str]))

(def empty-node
  {:children {}})

(defn add-word [trie word]
  (if-let [char-seq (seq word)]
    (update-in trie [:children (first char-seq)]
               (fnil add-word empty-node)
               (rest char-seq))
    (assoc trie :leaf? true)))

(defn dictionary->trie [words]
  (reduce add-word empty-node words))

(defn segmentations* [trie-root {:keys [children leaf?]} char-seq]
  (fn []
    (cond
      (and (empty? char-seq) leaf?)
      '(())

      (or (empty? char-seq) (and (not leaf?) (empty? children)))
      '()

      leaf?
      (into (->> (segmentations* trie-root trie-root char-seq)
                 trampoline
                 (map (partial cons \ )))
            (->> (segmentations* trie-root
                                     (children (first char-seq))
                                     (rest char-seq))
                 trampoline
                 (map (partial cons (first char-seq)))))

      :default
      (->> (segmentations* trie-root
                           (children (first char-seq))
                           (rest char-seq))
           trampoline
           (map (partial cons (first char-seq)))))))

(defn segmentations [words s]
  (let [trie (dictionary->trie words)]
    (->> (segmentations* trie trie (seq s))
         trampoline
         (map str/join))))

EDIT: managed to do a tail recursive solution to avoid relying on trampoline:

(def empty-node
  {:children {}})

(defn add-word [trie word]
  (if-let [char-seq (seq word)]
    (update-in trie [:children (first char-seq)]
               (fnil add-word empty-node)
               (rest char-seq))
    (assoc trie :leaf? true)))

(defn dictionary->trie [words]
  (reduce add-word empty-node words))

(defn all-suffixes [char-seq]
  (rest (reductions conj nil (reverse char-seq))))

(defn suffixes-segmentations [trie-root suffixes-results char-seq]
  (loop [{:keys [children leaf?] :as node} (get-in trie-root [:children (first char-seq)])
         [curr-char & char-seq]            char-seq
         curr-suffixes-results             suffixes-results
         prev-prefix                       []
         result                            '()]
    (if (or (not curr-char) (not node))
      (conj suffixes-results result)
      (recur (children (first char-seq))
             char-seq
             (rest curr-suffixes-results)
             (conj prev-prefix curr-char)
             (if leaf?
               (->> (first curr-suffixes-results)
                    (map #(conj % (conj prev-prefix curr-char)))
                    (into result))
               result)))))

(defn char-seq->result-str [char-seq]
  (->> (interpose \space char-seq)
       flatten
       (apply str)))

(defn segmentations [words s]
  (->> (all-suffixes (seq s))
       (reduce (partial suffixes-segmentations (dictionary->trie words))
               '((())))
       first
       (map char-seq->result-str)))
@mchampine commented May 29, 2020

Oops, Eric caught a bug in my code! My "starts-with" test passed even when the string to be segmented was shorter than the test word. So here's a fix:

(defn segmentations [s l]
  (if (empty? s) [""]
      (for [pm (filter #(.startsWith s %) l)
            segrest (segmentations (subs s (count pm)) l)]
        (str pm (if (empty? segrest) nil " ") segrest))))

Btw, I wanted to avoid using clojure.string or interop, but the best I can come up with for starts-with? is:

(defn starts-with? [[fs & rs] [fw & rw]]
  (if (and fs (= fs fw)) (starts-with? rs rw) (nil? fw)))

Pretty compact, but it still makes my single-function implementation kind of ugly if I incorporate it as a lambda or letfn. Oh well.
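For what it's worth, folding it in with letfn looks something like this (just a sketch of the combination, with the recursive starts-with? renamed sw? to avoid shadowing):

```clojure
;; Hypothetical combination: the interop-free starts-with? tucked
;; into the single-function solution via letfn.
(defn segmentations [s l]
  (letfn [(sw? [[fs & rs] [fw & rw]]
            ;; strings destructure as char seqs; match char by char
            (if (and fs (= fs fw)) (sw? rs rw) (nil? fw)))]
    (if (empty? s)
      [""]
      (for [pm (filter #(sw? s %) l)
            segrest (segmentations (subs s (count pm)) l)]
        (str pm (if (empty? segrest) nil " ") segrest)))))

(segmentations "hellothere" ["hello" "there"])
;;=> ("hello there")
```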

@mchampine commented May 29, 2020

Here is a tentative test suite for anyone interested. Please comment if anything is amiss...Thanks
EDIT: caioaao's large input has been added for performance testing.

Thanks, this is great! Is this a fixture for any specific testing framework?
I just made a (not general purpose!) testrunner for this challenge, which seems to work fine:

(defn testrunner [tf fxs]
  (let [testex (fn [[s l r]] (= (set (tf s l)) (set r)))
        res (map testex fxs)]
    {:results res
     :suite (if (every? true? res) "PASS" "FAIL")}))

(testrunner segmentations fixtures)
;; {:results (true true true true true true true true true), :suite "PASS"}

;; note: large-fixture needs to be wrapped in a vector to have the same structure as fixtures.
(testrunner segmentations [large-fixture])
;; {:results (true), :suite "PASS"}

Also, this additional test would have caught my implementation bug:

["errorex"
    ["exercise" "error"]
    []]
@KingCode commented May 29, 2020

@MarkChampine, thanks for the extra test and your excellent, lightning-fast and PGA-golfing solution!... I added the test in my fixtures.

I don't use any testing framework specifically, other than clojure.test. Because there are so many solutions to test and learn from, I also find it easiest to put each one in its own namespace, and refer each namespace with different aliases in my test namespace, then simply "multiplex" them into a single local symbol which is resolved against a chosen alias:

;;... implementation namespaces with aliases declared here, etc...;;
(let [alias-str "sut"   ;; single edit to point to another implementation
      sym (symbol (str alias-str "/segmentations"))]
  (def seg (resolve sym)))

(defn passing? [[input dict exp]]
  (is (= (set exp) (set (seg input dict)))))

(deftest segmentations-test
  (testing "small text"
    (doall (map passing? fixtures)))
  (testing "larger text"
    (passing? large-bot-fixture)))

As an aside, I am working on a utility (which I prototyped in the previous challenge) that generates reports for all solutions at once and pretty-prints them into a 'scoreboard' using clojure.pprint/print-table. It is currently being refactored to be more general - kind of fun work with metadata and runtime loading, and with controlling the width of each field (which print-table doesn't do, causing mangled output on narrow screens).

@uosl commented May 31, 2020

Eric's solution reminded me of the existence of starts-with?, so I managed to shorten my very lazy (personality-wise) solution:

(defn segmentations [s words]
  (map #(clojure.string/join " " (cons % (segmentations (subs s (count %)) words)))
    (filter #(clojure.string/starts-with? s %) words)))

Runs caioaao's big test in ~0.05ms, which seems to be standard for the other solutions with a similar approach.

@mchampine commented May 31, 2020

@uosl commented May 31, 2020

Thanks mchampine. I believe test 2 fails because the return vector is in reverse order, and the remaining fail because I missed the "If it can't be segmented, it should return an empty sequence" requirement, and instead return partial segments. I definitely don't think I can fix the latter while keeping it tight. [=

@mchampine commented May 31, 2020

@uosl commented Jun 1, 2020

Uff, running test 6, it's now obvious what's wrong. Since I'm using map, it won't handle branches further into the string correctly.
Thanks for pointing it out! Analysing the other solutions properly has been a great learning opportunity. Very interesting use of list comprehension!
