@ericnormand
Last active October 7, 2021 01:47
434 PurelyFunctional.tv Newsletter

Sentence searcher

Sometimes I want to find a word in a document, but I want the context for the word. Write a function that takes a document and a word and returns the sentences that contain that word. The sentences should be returned in the order they appear in the document.

Examples

(search "This is my document." "Hello") ;=> nil
(search "This is my document. It has two sentences." "sentences") ;=> ["It has two sentences."]
(search "I like to write. Do you like to write?" "Write") ;=> ["I like to write." "Do you like to write?"]

Sentences end with \., \!, or \?.

The search should be case insensitive.

Return nil if the word is not found.
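The three examples above can be turned into a quick check for any submitted solution. This is only a sketch: `passes-examples?` is a hypothetical helper name, not part of the challenge, and it covers nothing beyond the examples listed here.

```clojure
;; Hypothetical helper (not part of the challenge): returns true when
;; `search-fn` reproduces the three examples from the problem statement.
(defn passes-examples? [search-fn]
  (and (nil? (search-fn "This is my document." "Hello"))
       ;; Sequential equality, so a vector, list, or lazy seq all pass.
       (= ["It has two sentences."]
          (search-fn "This is my document. It has two sentences." "sentences"))
       (= ["I like to write." "Do you like to write?"]
          (search-fn "I like to write. Do you like to write?" "Write"))))
```

Call it as `(passes-examples? search)` after defining `search`.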

Thanks to this site for the problem idea, where it is rated Hard in Python. The problem has been modified.

Please submit your solutions as comments on this gist.

To subscribe: https://purelyfunctional.tv/newsletter/

@souenzzo

souenzzo commented Jul 13, 2021

(require '[clojure.string :as string])

(letfn [(search [{::keys [document word
                          ends-with]
                  :or    {ends-with #{"." "!" "?"}}}]
          (let [re-escape (fn [s]
                            (str "\\Q" (string/escape (str s)
                                         {\\ "\\\\"})
                              "\\E"))
                sentences (string/split document
                            (re-pattern (string/join "|"
                                          (for [end ends-with]
                                            (str "(" (re-escape end) ")")))))]
            (filter (partial re-find
                      (re-pattern
                        (str "(?i)" (re-escape word))))
              sentences)))]
  (for [v [{::document "I like to write. Do you like to write?"
            ::word     "write"}
           {::document "I'm learning [a-z]+. Trying to inject reg-ex. aa. [A-Z]+"
            ::word     "[a-z]+"}]]
    (assoc v
      ::search (search v))))

=>
({:user/document "I like to write. Do you like to write?",
  :user/word "write",
  :user/search ("I like to write" " Do you like to write")}
 {:user/document "I'm learning [a-z]+. Trying to inject reg-ex. aa. [A-Z]+",
  :user/word "[a-z]+",
  :user/search ("I'm learning [a-z]+" " [A-Z]+")})

@jonasseglare

jonasseglare commented Jul 13, 2021

(require '[clojure.string :refer [lower-case index-of trim]])

(defn search [document x]
  (let [x (lower-case x)
        empty-to-nil #(if (empty? %) nil %)]
    (empty-to-nil
     (into []
           (comp (partition-by
                  (let [sentence-counter (atom 0)]
                    #(first (swap-vals! sentence-counter
                                        + (if (#{\. \? \!} %) 1 0)))))
                 (map #(->> % (apply str) trim))
                 (filter #(index-of (lower-case %) x)))
           document))))

... or using a custom transducer to segment the sentences:

(defn sentence-segmenter [step]
  (let [b (StringBuilder.)]
    (fn
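      ;; Completion: flush any trailing text, then call the downstream completion arity.
      ([result] (step (if (pos? (.length b))
                        (unreduced (step result (str b)))
                        result)))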
      ([result x] (do
                    (.append b x)
                    (if (#{\. \! \?} x)
                       (let [result (step result (str b))]
                         (.setLength b 0)
                         result)
                       result))))))

(defn search [document x]
  (let [x (lower-case x)
        empty-to-nil #(if (empty? %) nil %)]
    (empty-to-nil
     (into []
           (comp sentence-segmenter
                 (map trim)
                 (filter #(index-of (lower-case %) x)))
           document))))

@grierson

(ns printer.core
  (:require [clojure.string :as str]))

(defn foo [state curr [w & ws]]
  (cond
    (nil? w) state
    (or
      (str/ends-with? w ".")
      (str/ends-with? w "!")
      (str/ends-with? w "?")) (recur (conj state (str/join " " (conj curr w))) [] ws)
    :else (recur state (conj curr w) ws)))

(defn get-sentence [s]
  (foo [] [] (str/split s #" ")))

(defn search [s w]
  (let [sentences (get-sentence s)
        w (str/lower-case w)
        out (filter #(str/includes? (str/lower-case %) w) sentences)]
    (if (empty? out)
      nil
      out)))

@steffan-westcott

(defn search [doc word]
  (when-let [matches (-> (str "(?i)[^.!?]*?\\b\\Q" word "\\E\\b.*?[.!?]")
                         re-pattern
                         (re-seq doc))]
    (map clojure.string/trim matches)))

@steffan-westcott

I've noticed that some answers here are not taking word boundaries into account. For example:

(search "Should not appear." "no") ;; should evaluate to nil
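To illustrate the difference, here is a minimal sketch that compares a plain substring pattern with a whole-word pattern. `word-pattern` is a hypothetical helper name; it assumes the search term begins and ends with a word character, since `\b` only marks a boundary next to one.

```clojure
(import 'java.util.regex.Pattern)

;; Hypothetical helper: builds a case-insensitive, whole-word pattern.
;; Pattern/quote keeps regex metacharacters in `word` literal.
(defn word-pattern [word]
  (re-pattern (str "(?i)\\b" (Pattern/quote word) "\\b")))

(re-find #"(?i)no" "Should not appear.")            ;=> "no" (matches inside "not")
(re-find (word-pattern "no") "Should not appear.")  ;=> nil
(re-find (word-pattern "no") "There is No match.")  ;=> "No"
```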

@mcuervoe

(defn search [document word]
  (let [word-lc (.toLowerCase word)]
    (->> 
      document
      (re-seq #"(\s*)(.*?(?:\.|\?|!))")
      (map (fn [groups] (nth groups 2)))
      (filter (fn [sentence] (.contains (.toLowerCase sentence) word-lc)))
      seq)))

@safehammad

An alternative approach that uses a single regex to split the document into sentences and trim the surrounding whitespace, while retaining \., \! and \? by using a regex lookbehind, i.e. (?<=...).

(defn search [doc word]
  (->> (clojure.string/split doc #"(\s*(?<=[\.\!\?])\s*)")
       (filter (partial re-find (re-pattern (str "(?i)\\b" word "\\b"))))
       seq))
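For anyone unfamiliar with lookbehind, this small sketch shows what it changes about the split. It uses a simplified pattern rather than the exact one above, and `sample` is just an illustrative name.

```clojure
(require '[clojure.string :as str])

(def sample "I like to write. Do you like to write?")

;; Splitting on the terminator itself consumes the punctuation.
(str/split sample #"[.!?]")
;=> ["I like to write" " Do you like to write"]

;; A lookbehind matches the position just after a terminator, so only the
;; whitespace that follows is consumed and the punctuation is kept.
(str/split sample #"(?<=[.!?])\s*")
;=> ["I like to write." "Do you like to write?"]
```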

@mchampine

This solution matches substrings; it does not restrict matches to whole words.

(require '[clojure.string :as s])

(defn splitter [s]
  ;; Alternate between runs of non-terminators and single terminators.
  (let [re #"[^.?!]+|[.?!]"]
    (->> (re-seq re s )
         (partition 2)
         (map (partial apply str))
         (map s/triml))))

(defn search [d w]
  (let [splits (splitter d)
        lcw (s/lower-case w)]
    (seq (filter #(s/includes? (s/lower-case %) lcw) splits))))

@diavoletto76

(defn search [xs x]  
  (->> (clojure.string/split xs #"(?<=[.!?])\s+")
       (filter (fn [y] (re-find (re-pattern (clojure.string/lower-case x)) (clojure.string/lower-case y))))
       (seq)))

@dfuenzalida

I had to work a little to return nil instead of an empty vector for the base case:

(def same-and-uppercase
  (juxt identity clojure.string/upper-case))

(defn search [sentences word]
  (let [WORD (clojure.string/upper-case word)]
    (when-let [result (->> (re-seq #".*?[.?!]" sentences)
                           (map (comp same-and-uppercase clojure.string/trim))
                           (filter #(clojure.string/includes? (second %) WORD))
                           (map first)
                           seq)]
      (vec result))))

;; (search "This is my document." "Hello")
;; => nil

;; (search "This is my document. It has two sentences." "sentences")
;; => ["It has two sentences."]

;; (search "I like to write. Do you like to write?" "Write")
;; => ["I like to write." "Do you like to write?"]
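On the nil-versus-empty point, two core functions cover it directly. A small sketch of how each behaves:

```clojure
;; not-empty returns the collection unchanged, or nil when it is empty,
;; so a vector stays a vector.
(not-empty [])                         ;=> nil
(not-empty ["It has two sentences."])  ;=> ["It has two sentences."]

;; seq also returns nil for an empty collection, but yields a seq otherwise.
(seq (filter odd? [2 4]))              ;=> nil
(seq (filterv odd? [1 2 3]))           ;=> (1 3)
```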

@vpetruchok

(defn search [text word]
  (let [sentence-endings #"\.|\!|\?"
        word             (clojure.string/lower-case word)]
    (->> (clojure.string/split text sentence-endings)
         (filter (fn [s] (-> s clojure.string/lower-case (.contains word))))
         (map clojure.string/trim)
         ((fn [result]
            (if (empty? result)
              nil
              (vec result)))))))

@rrrnld

rrrnld commented Jul 14, 2021

This version respects word boundaries and returns the full sentences, including punctuation:

(defn search [doc phrase]
  (->>
    (re-seq #"[^\s].*?[.!?]" doc)
    (filter #(re-find (re-pattern (str "(?i)\\b" phrase "\\b")) %))
    (seq)))

@alex-gerdom

alex-gerdom commented Jul 14, 2021

(defn escape-re [s]
  #?(:clj (java.util.regex.Pattern/quote s)
     :cljs (.replace s (js/RegExp. "[.*+?^${}()|[\\]\\\\]" "g") "\\$&")))

(defn split-sentences [s]
  (if (empty? s) (list s)
      (re-seq #"[^.!?]+[.!?]?" s)))

(defn search [s substr]
  (let [pattern (re-pattern (str "(?i)" (escape-re substr)))
        contains-substr? #(some? (re-find pattern %))
        matching-sentences (->> s
                                split-sentences
                                (filter contains-substr?)
                                (map clojure.string/trim))]
    (if (empty? matching-sentences) nil
        matching-sentences)))

@jumarko

jumarko commented Jul 15, 2021

https://github.com/jumarko/clojure-experiments/blob/master/src/clojure_experiments/purely_functional/puzzles/0434_sentence_searcher.clj#L1

(require '[clojure.string :as str])

(defn sentences [document]
  (mapv str/trim
        (str/split document #"(?<=[.?!])")))

(defn contains-word? [sentence word]
  ;; Split on non-word characters so trailing punctuation ("sentences.") does not block a match.
  (let [sentence-words (str/split (str/lower-case sentence) #"\W+")]
    ((set sentence-words) (str/lower-case word))))

(defn search [document word]
  (not-empty (filterv #(contains-word? % word)
                      (sentences document))))

(search "This is my document." "Hello")
;; => nil

(search "I like to write. Do you like to write?" "like")
;; => ["I like to write." "Do you like to write?"]

(search "This is not my document. It has No two sentences." "no")
;; => ["It has No two sentences."]

@sztamas

sztamas commented Jul 15, 2021

(defn search [sentences word]
  (let [ci-word       (re-pattern (str "(?i)\\b" word "\\b"))
        matches-word? (partial re-find ci-word)]
    (->> sentences
         (re-seq #"[^\.\!\?]+[\.\!\?]+")
         (filter matches-word?)
         seq)))

@javierrweiss

javierrweiss commented Jul 16, 2021

(require '[clojure.string :as st])

(defn process-str
  [text]
  (as-> text t
    (st/split-lines t)
    (remove st/blank? t)
    ;; Split after each sentence terminator, keeping the terminator.
    (map #(st/split % #"(?<=[.?!])") t)
    (flatten t)
    (map st/trim t)))

(defn matching-indexes
  [xs word]
  (let [coll (map-indexed
              (fn [idx itm]
                (if-not (nil? (re-find (re-pattern (str "(?i)" word)) itm))
                  idx))
              xs)]
    (remove nil? coll)))

(defn search [text word]
  (let [data    (process-str text)
        indexes (matching-indexes data word)]
    (if (empty? indexes)
      nil
      (vec (for [x indexes] (nth data x))))))

@KingCode

KingCode commented Oct 7, 2021

(require '[clojure.string :as str])

(defn parse-ends [txt]
  (for [m (repeat (re-matcher #"([^.!?]+[.!?])" txt))
        :let [finds (re-find m)]
        :while finds]
    (->> finds rest (filter identity) first last str)))

(defn search [txt word]
  (let [word (str/lower-case word) 
        ends (parse-ends txt)]
    (->> (str/split txt #"[.!?]")
         (map vector ends)
         (sequence (comp
                    (filter (fn [[end sent]]
                              (->> (str/split (str/lower-case sent) #"\s+")
                                   (some #{word}))))
                    (map (fn [[end sent]]
                           (.concat sent end)))))
         seq)))
