Word segmentation

One of the issues with domain names is that spaces aren't allowed. So we get domain names like this:

  • penisland.com (Pen Island)
  • expertsexchange.com (Experts Exchange)

Now we also have the problem with #hashtags on social media platforms.

We want to be able to take a string without spaces and insert the spaces so that the words are separated and our grade-school teacher can be happy again.

Your task is to write a function that takes a string without spaces and a dictionary of known words and returns all possible ways it could be segmented (i.e., insert spaces) into those words. If it can't be segmented, it should return an empty sequence.

(segmentations "hellothere" ["hello" "there"]) ;=> ("hello there")
(segmentations "fdsfsfdsjkljf" ["the" "he" "she" "it"...]) ;=> ()
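To make the spec concrete, here is a minimal, non-lazy sketch of one possible approach (illustrative only, not any of the submissions below): try each dictionary word as a prefix and recurse on the remainder.

```clojure
;; Minimal sketch: try every dictionary word as a prefix of s,
;; then recursively segment whatever is left over.
(defn segmentations [s words]
  (if (empty? s)
    [[]]                                   ; one segmentation: no words left
    (for [w     words
          :when (clojure.string/starts-with? s w)
          tail  (segmentations (subs s (count w)) words)]
      (cons w tail))))

(map #(clojure.string/join " " %)
     (segmentations "hellothere" ["hello" "there"]))
;;=> ("hello there")

(segmentations "fdsfsfdsjkljf" ["the" "he" "she" "it"])
;;=> ()
```

Exponential in the worst case, and not lazy, but it satisfies the contract: all segmentations, or an empty sequence when none exist.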

Bonus: use a dictionary file and some text from somewhere and do a real test.

Super bonus: make it lazy.

Thanks to this site for the challenge idea, where it is rated Expert level in JavaScript.

Email submissions to eric@purelyfunctional.tv until May 31, 2020. You can discuss the submissions in the comments below.

(defn tokenize [tag]
  (map #(list (apply str (take % tag))
              (apply str (drop % tag)))
       (range 1 (inc (count tag)))))

(defn segmentations
  ([tag dict] (->> (segmentations tag dict [])
                   (map (partial interpose " "))
                   (map (partial apply str))))
  ([tag dict acc]
   (cond
     (empty? tag) (list acc)
     :else (mapcat (fn [[a b]]
                     (if (some (partial = a) dict)
                       (segmentations b dict (conj acc a))))
                   (tokenize tag)))))
(ns purelyfun.379)
;; My solution to the weekly challenge here:
;; https://purelyfunctional.tv/issues/purelyfunctional-tv-newsletter-379-get-produces-unexpected-behavior-as-expected/
(comment
;; My approach is to use a very basic LR parser.
;; First, I create a nested map datastructure to act as a lookup
;; table. For example:
(index-dict ["hello" "hey"])
{:h {:e {:y {:end true}, :l {:l {:o {:end true}}}}}}
;; Next, I parse one character at a time. Each time I parse a
;; character, I can check the lookup table to test whether a
;; sequence of characters exists in the dictionary. For example, if
;; I've parsed "h" and "e", I can do this:
(get-in (index-dict ["hello" "hey"]) [:h :e])
;; I use the :end keyword in my lookup table to know when I've found
;; a word. For example, this returns `nil`:
(get-in (index-dict ["hello" "hey"]) [:h :e :end])
;; ... and this returns `true`:
(get-in (index-dict ["hello" "hey"]) [:h :e :y :end])
;; The `parse` function continues to look at each character until it
;; finds a word or can't find a match in the lookup table for the
;; characters it's seen so far.
;; If `parse` finds a word, then there are two possibilities: 1. That's
;; the only match possible, or 2. It might be possible to find a
;; longer match.
;; So whenever a word is found, `parse` will branch to try these two
;; possibilities.
;; Of course, this branching creates a tree (nested vector)
;; result. So the final step is to traverse the tree and remove the
;; nested lists.
)
(defn deep-merge
  "Deep merge maps"
  [& maps]
  (if (every? map? maps)
    (apply merge-with deep-merge maps)
    (last maps)))

(defn index-word
  "Create a recursive map structure representing characters in a word"
  [word]
  (reduce (fn [r c] (hash-map (keyword (str c)) r))
          {:end true}
          (reverse word)))

(defn index-dict
  [dict]
  (apply deep-merge (map index-word dict)))

(defn parse
  "Parse `text` one character at a time, building a tree of possible
  lists of words"
  [lkup c text path result]
  (let [k        (keyword (str c))
        new-path (conj path k)
        matched  (get-in lkup new-path)]
    (cond
      ;; no match for next character
      (nil? matched)
      (if (get-in lkup (conj path :end))
        ;; found a word! Add it and continue parsing
        (parse lkup c text [] (conj result (apply str (map name path))))
        ;; no more matches can be found, so return results
        result)

      ;; found a match for the next character
      :else
      (if (get-in lkup (conj path :end))
        ;; found a word at this ending, but might also find a longer
        ;; word if we keep parsing
        [(parse lkup c text [] (conj result (apply str (map name path))))
         (parse lkup (first text) (rest text) (conj path k) result)]
        ;; no word here, keep going
        (parse lkup (first text) (rest text) (conj path k) result)))))

(defn segmentations
  "Create a tree of possible matches. The possible match lists will be
  the leaves of the tree. Filter the tree to return the leaves."
  [text dictionary]
  (let [lkup        (index-dict dictionary)
        results     (parse lkup (first text) (rest text) [] [])
        result-tree (tree-seq (partial every? vector?) seq results)]
    (map #(apply str (interpose " " %))
         (filter (complement (partial every? vector?))
                 result-tree))))
(comment
  ;; test against 1000 words
  (def words
    (with-open [rdr (clojure.java.io/reader "./resources/10000words.txt")]
      (let [words (line-seq rdr)]
        (segmentations "hellothere" words))))

  ;; penisland ... I only read this as "pen" "island" at first, but I
  ;; get it now ;-)
  (segmentations "penisland" ["pen" "is" "land" "island" "penis"])
  ;;=> ("pen is land" "pen island" "penis land")
  )
(defn append-words [a b]
  (cond
    (empty? a) b
    (empty? b) a
    :else (str a " " b)))

(defn segfirst [s words]
  (for [w words
        :when (clojure.string/starts-with? s w)]
    [w (.substring s (.length w))]))

(defn segment [string words]
  (if (empty? string)
    [""]
    (for [[word rest] (segfirst string words)
          segmentation (segment rest words)]
      (append-words word segmentation))))
(ns functional-tv-puzzles.-2020.segmentations-379
  (:require [clojure.string :refer [starts-with? join]]))

(defn head-matches [txt words]
  (for [w words
        :when (starts-with? txt w)
        :let [more (subs txt (count w))]]
    [w more]))

(defn tries [txt words]
  (for [[match more] (head-matches txt words)
        :when (seq match)
        :let [tails (tries more words)]
        :when (not (and (seq more)
                        (empty? tails)))]
    (cons match tails)))

(defn ->legs [[head & tails :as trie]]
  (if (< (count tails) 2)
    (list trie)
    (map #(cons head %) (map ->legs tails))))

(defn ->strs [legs]
  (->> legs (map flatten) (map #(join " " %))))

(defn segmentations [txt words]
  (->> (tries txt (set words))
       (mapcat ->legs)
       ->strs))
(defn segmentations [s words]
  (map #(clojure.string/join " " (cons % (segmentations (subs s (count %)) words)))
       (filter (set words)
               (map #(subs s 0 %) (range 1 (inc (count s)))))))
(defn segmentations [s l]
  (if (empty? s)
    [""]
    (for [pm (filter #(.startsWith s %) l)
          segrest (segmentations (subs s (count pm)) l)]
      (str pm (if (empty? segrest) nil " ") segrest))))

;;;; lazy implementation
(defn segmentations* [s words]
  (let [segs (for [first-word-length (range 1 (.length s)) ;; do not include last char
                   :let [first-word (.substring s 0 first-word-length)]
                   :when (contains? words first-word)
                   other-words (segmentations* (.substring s first-word-length) words)]
               (str first-word " " other-words))]
    (if (contains? words s)
      (lazy-seq (cons s segs)) ;; non-recursive case
      segs)))

(defn segmentations [s words]
  (segmentations* s (set words)))
(require '[clojure.string :as str])

;; Load dictionary online. Removes all one-letter words because they're noisy.
(def words (->> (slurp "http://www-personal.umich.edu/~jlawler/wordlist")
                (str/split-lines)
                (filter #(> (count %) 1))))

;; Returns a seq of seqs.
;; Example: (segmentations "helloworld") => (("hello" "world") ("hell" "ow" "or" "ld"))
(defn segmentations
  [s]
  (if (str/blank? s)
    [""]
    (mapcat (fn [word]
              (if (str/starts-with? s word)
                (map #(cons word %) (segmentations (subs s (count word))))
                []))
            words)))
@KingCode commented May 27, 2020

Here is a tentative test suite for anyone interested. Please comment if anything is amiss... Thanks!
EDIT: caioaao's large input has been added for performance testing, as well as MarkChampine's test input, which is shorter than any dictionary word.

(def fixtures 
  [["helloworld"          ;;input string
    ["hello" "world"]      ;; input dictionary 
    ["hello world"]],       ;; expected output, in any order if more than one string

   ["abcdefghijklmnop"
    ["a" "ab" "abc" "bcd" "cdef" "ef" "def" "defg" "ghijkl" "mnop"]
    ["abc def ghijkl mnop"
     "ab cdef ghijkl mnop"
     "a bcd ef ghijkl mnop"]],

   ["penisland"
    ["pen", "penis", "land", "island"]
    ["pen island"
     "penis land"]],

   ["penisland"
    ["pen" "island"]
    ["pen island"]],

   ["expertsexchange"
    ["experts" "expert" "sex" "exchange" "change"]
    ["experts exchange"
     "expert sex change"]],

   ["expertsexchange"
    ["expert" "experts" "exchange" "ex" "change"]
    ["experts exchange"
     "experts ex change"]],

   ["expertsexchange"
    ["expert" "experts" "exchange"]
    ["experts exchange"]],

   ["hellolasdkfjlaskdjfslkj"
    ["hello" "world"]
    []],

   ["fdsfsfdsjkljf"
    ["the" "she" "a" "it"]
    []],

   ["errorex"
    ["exercise" "error"]
    []]])

(def large-fixture ["ifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptome", 
["about", "be", "hell", "if", "is", "it", "me", "other", "outer", "people", "the", "to", "up", "where"],
["if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me if it is to be it is up to me"]
])
@caioaao commented May 28, 2020

Solution using a trie (prefix tree) and memoization. Amortized time is O(n * M), where n is the number of letters in s and M is the sum of the lengths of the words in the dictionary.

(def empty-node
  {:children {}})

(defn add-word [trie word]
  (if-let [char-seq (seq word)]
    (update-in trie [:children (first char-seq)]
               (fnil add-word empty-node)
               (rest char-seq))
    (assoc trie :leaf? true)))

(defn dictionary->trie [words]
  (reduce add-word empty-node words))

(def segmentations-memo
  (memoize
   (fn [trie-root {:keys [children leaf?]} char-seq]
     (cond
       (and (empty? char-seq) leaf?)
       '(())

       (or (empty? char-seq) (and (not leaf?) (empty? children)))
       '()

       leaf?
       (into (->> (segmentations-memo trie-root trie-root char-seq)
                  (map (partial cons \ )))
             (->> (segmentations-memo trie-root
                                      (children (first char-seq))
                                      (rest char-seq))
                  (map (partial cons (first char-seq)))))

       :default
       (->> (segmentations-memo trie-root
                                (children (first char-seq))
                                (rest char-seq))
            (map (partial cons (first char-seq))))))))

(defn segmentations [words s]
  (let [trie (dictionary->trie words)]
    (->> (segmentations-memo trie trie (seq s))
         (map str/join))))

This will blow up the stack if the string is too long though. I couldn't think of an easy way to fix it. Here's an example that gives a stack overflow:

(segmentations ["about", "be", "hell", "if", "is", "it", "me", "other", "outer", "people", "the", "to", "up", "where"] "ifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptomeifitistobeitisuptome")
@KingCode commented May 29, 2020

Here is my first crack at it - lazy, but it takes a few seconds to produce a result for the large BOT input.
EDIT: Fixed the performance issue by iterating over words rather than characters, as is done in other solutions :)

(defn head-matches [txt words]
  (for [w words
        :when (starts-with? txt w)
        :let [more (subs txt (count w))]]
    [w more]))

(defn tries [txt words]
  (for [[match more :as segs] (head-matches txt words)
        :when (seq match)
        :let [tail (tries (apply str more) words)]
        :when (not (and (seq more) (empty? tail)))]
    (cons match tail)))

(defn expand [[head & tails :as trie]]
  (if (empty? tails) 
    (list trie)
    (map #(cons head %) (map expand tails))))

(defn segmentations [txt words]
  (->> (tries txt words)
       (mapcat expand)
       (map flatten)
       (map #(clojure.string/join " " %))))
@steffan-westcott commented May 29, 2020

A lazy, tail-recursive solution:

(defn append-segment [segments s dict]
  (for [len (range 1 (inc (.length s)))
        :let [segment (subs s 0 len)]
        :when (contains? dict segment)]
    [(conj segments segment) (subs s len)]))

(defn segmentations* [items dict]
  (lazy-seq
    (loop [[item & more] items]
      (if-let [[segments s] item]
        (if (empty? s)
          (cons (clojure.string/join " " segments) (segmentations* more dict))
          (recur (concat (append-segment segments s dict) more)))))))

(defn segmentations [s dict]
  (segmentations* [[[] s]] (set dict)))
@caioaao commented May 29, 2020

A non-memoized version that uses trampoline to circumvent the stack size limitation. Runs the big test in ~2ms (the memoized version ran in ~50ms, so the overhead apparently isn't worth it):

(ns playground
  (:require [clojure.string :as str]))

(def empty-node
  {:children {}})

(defn add-word [trie word]
  (if-let [char-seq (seq word)]
    (update-in trie [:children (first char-seq)]
               (fnil add-word empty-node)
               (rest char-seq))
    (assoc trie :leaf? true)))

(defn dictionary->trie [words]
  (reduce add-word empty-node words))

(defn segmentations* [trie-root {:keys [children leaf?]} char-seq]
  (fn []
    (cond
      (and (empty? char-seq) leaf?)
      '(())

      (or (empty? char-seq) (and (not leaf?) (empty? children)))
      '()

      leaf?
      (into (->> (segmentations* trie-root trie-root char-seq)
                 trampoline
                 (map (partial cons \ )))
            (->> (segmentations* trie-root
                                     (children (first char-seq))
                                     (rest char-seq))
                 trampoline
                 (map (partial cons (first char-seq)))))

      :default
      (->> (segmentations* trie-root
                           (children (first char-seq))
                           (rest char-seq))
           trampoline
           (map (partial cons (first char-seq)))))))

(defn segmentations [words s]
  (let [trie (dictionary->trie words)]
    (->> (segmentations* trie trie (seq s))
         trampoline
         (map str/join))))

EDIT: managed to do a tail recursive solution to avoid relying on trampoline:

(def empty-node
  {:children {}})

(defn add-word [trie word]
  (if-let [char-seq (seq word)]
    (update-in trie [:children (first char-seq)]
               (fnil add-word empty-node)
               (rest char-seq))
    (assoc trie :leaf? true)))

(defn dictionary->trie [words]
  (reduce add-word empty-node words))

(defn all-suffixes [char-seq]
  (rest (reductions conj nil (reverse char-seq))))

(defn suffixes-segmentations [trie-root suffixes-results char-seq]
  (loop [{:keys [children leaf?] :as node} (get-in trie-root [:children (first char-seq)])
         [curr-char & char-seq]            char-seq
         curr-suffixes-results             suffixes-results
         prev-prefix                       []
         result                            '()]
    (if (or (not curr-char) (not node))
      (conj suffixes-results result)
      (recur (children (first char-seq))
             char-seq
             (rest curr-suffixes-results)
             (conj prev-prefix curr-char)
             (if leaf?
               (->> (first curr-suffixes-results)
                    (map #(conj % (conj prev-prefix curr-char)))
                    (into result))
               result)))))

(defn char-seq->result-str [char-seq]
  (->> (interpose \space char-seq)
       flatten
       (apply str)))

(defn segmentations [words s]
  (->> (all-suffixes (seq s))
       (reduce (partial suffixes-segmentations (dictionary->trie words))
               '((())))
       first
       (map char-seq->result-str)))
@mchampine commented May 29, 2020

Oops, Eric caught a bug in my code! My "starts-with" test passed even when the string to be segmented was shorter than the test word. So here's a fix:

(defn segmentations [s l]
  (if (empty? s) [""]
      (for [pm (filter #(.startsWith s %) l)
            segrest (segmentations (subs s (count pm)) l)]
        (str pm (if (empty? segrest) nil " ") segrest))))

Btw, I wanted to avoid using clojure.string or interop, but the best I can come up with for starts-with? is:

(defn starts-with? [[fs & rs] [fw & rw]]
  (if (and fs (= fs fw)) (starts-with? rs rw) (nil? fw)))

Pretty compact, but it still makes my single-function implementation kind of ugly if I incorporate it as a lambda or letfn. Oh well.
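For what it's worth, folding it in with letfn looks something like this (just a sketch of the combination, with the recursive starts-with? renamed sw? to avoid shadowing):

```clojure
;; Hypothetical combination: the interop-free starts-with? tucked
;; into the single-function solution via letfn.
(defn segmentations [s l]
  (letfn [(sw? [[fs & rs] [fw & rw]]
            ;; strings destructure as char seqs; match char by char
            (if (and fs (= fs fw)) (sw? rs rw) (nil? fw)))]
    (if (empty? s)
      [""]
      (for [pm (filter #(sw? s %) l)
            segrest (segmentations (subs s (count pm)) l)]
        (str pm (if (empty? segrest) nil " ") segrest)))))

(segmentations "hellothere" ["hello" "there"])
;;=> ("hello there")
```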

@mchampine commented May 29, 2020

Here is a tentative test suite for anyone interested. Please comment if anything is amiss...Thanks
EDIT: caioaao's large input has been added for performance testing.

Thanks, this is great! Is this a fixture for any specific testing framework?
I just made a (not general purpose!) testrunner for this challenge, which seems to work fine:

(defn testrunner [tf fxs]
  (let [testex (fn [[s l r]] (= (set (tf s l)) (set r)))
        res (map testex fxs)]
    {:results res
     :suite (if (every? true? res) "PASS" "FAIL")}))

(testrunner segmentations fixtures)
;; {:results (true true true true true true true true true), :suite "PASS"}

;; note: large-fixture needs to be wrapped in a vector to have the same structure as fixtures.
(testrunner segmentations [large-fixture])
;; {:results (true), :suite "PASS"}

Also, this additional test would have caught my implementation bug:

["errorex"
    ["exercise" "error"]
    []]
@KingCode commented May 29, 2020

@MarkChampine, thanks for the extra test and your excellent, lightning-fast and PGA-golfing solution!... I added the test in my fixtures.

I don't use any testing framework specifically, other than clojure.test. Because there are so many solutions to test and learn from, I also find it easiest to put each one in its own namespace, and refer each namespace with different aliases in my test namespace, then simply "multiplex" them into a single local symbol which is resolved against a chosen alias:

;;... implementation namespaces with aliases declared here, etc...;;
(let [alias-str "sut"   ;; single edit to point to another implementation
      sym (symbol (str alias-str "/segmentations"))]
  (def seg (resolve sym)))

(defn passing? [[input dict exp]]
  (is (= (set exp) (set (seg input dict)))))

(deftest segmentations-test
  (testing "small text"
    (doall (map passing? fixtures)))
  (testing "larger text"
    (passing? large-bot-fixture)))

As an aside, I am working on a utility (which I prototyped in the previous challenge) that generates reports for all solutions at once and pretty-prints them into a 'scoreboard' using clojure.pprint/print-table. It is currently being refactored to be more general - kind of fun work with metadata and runtime loading, and with controlling the width of each field (which print-table doesn't do, causing mangled output on narrow screens).

@uosl commented May 31, 2020

Eric's solution reminded me of the existence of starts-with?, so I managed to shorten my very lazy (personality-wise) solution:

(defn segmentations [s words]
  (map #(clojure.string/join " " (cons % (segmentations (subs s (count %)) words)))
    (filter #(clojure.string/starts-with? s %) words)))

Runs caioaao's big test in ~0.05ms, which seems to be standard for the other solutions with a similar approach.

@mchampine commented May 31, 2020

@uosl commented May 31, 2020

Thanks mchampine. I believe test 2 fails because the return vector is in reverse order, and the remaining fail because I missed the "If it can't be segmented, it should return an empty sequence" requirement, and instead return partial segments. I definitely don't think I can fix the latter while keeping it tight. [=

@mchampine commented May 31, 2020

@uosl commented Jun 1, 2020

Uff, running test 6, it's now obvious what's wrong. Since I'm using map, it won't handle branches further into the string correctly.
Thanks for pointing it out! Analysing the other solutions properly has been a great learning opportunity. Very interesting use of list comprehension!
