Last active October 7, 2021 01:47
434 Newsletter

Sentence searcher

Sometimes I want to find a word in a document, but I want the context for the word. Write a function that takes a document and a word and returns the sentences that contain that word. The sentences should be returned in the order they appear in the document.


(search "This is my document." "Hello") ;=> nil
(search "This is my document. It has two sentences." "sentences") ;=> ["It has two sentences."]
(search "I like to write. Do you like to write?" "Write") ;=> ["I like to write." "Do you like to write?"]

Sentences end with \., \!, or \?.

The search should be case insensitive.

Return nil if the word is not found.
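
The three rules above can be put together in a short reference sketch. This is only one possible reading of the problem, not an official solution: the name `search-sketch` is made up for illustration, matching is by substring (so word boundaries are ignored), and trailing text without a terminator is skipped.

```clojure
(require '[clojure.string :as str])

;; Hypothetical name, used here only for illustration.
(defn search-sketch [document word]
  (let [needle (str/lower-case word)]
    (->> (re-seq #"[^.!?]+[.!?]" document)   ; one match per terminated sentence
         (map str/trim)                      ; drop the space left between sentences
         (filter #(str/includes? (str/lower-case %) needle))
         vec
         not-empty)))                        ; [] becomes nil

(search-sketch "This is my document." "Hello")
;=> nil
(search-sketch "I like to write. Do you like to write?" "Write")
;=> ["I like to write." "Do you like to write?"]
```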

Thanks to this site for the problem idea, where it is rated Hard in Python. The problem has been modified.

Please submit your solutions as comments on this gist.

To subscribe:

jonasseglare commented Jul 13, 2021

(require '[clojure.string :refer [lower-case index-of trim]])

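(defn search [document x]
  (let [x (lower-case x)
        empty-to-nil #(if (empty? %) nil %)]
    (empty-to-nil
     (into []
           (comp (partition-by
                  ;; Key each character by the number of terminators seen before it,
                  ;; so a terminator stays with the sentence it closes.
                  (let [sentence-counter (atom 0)]
                    #(first (swap-vals! sentence-counter
                                        + (if (#{\. \? \!} %) 1 0)))))
                 (map #(->> % (apply str) trim))
                 (filter #(index-of (lower-case %) x)))
           document))))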

... or using a custom transducer to segment the sentences:

(defn sentence-segmenter [step]
  (let [b (StringBuilder.)]
    (fn
      ([] (step))
      ;; Completion: flush any trailing text, then complete the downstream step.
      ([result] (step (unreduced (step result (str b)))))
      ([result x]
       (.append b x)
       (if (#{\. \! \?} x)
         (let [result (step result (str b))]
           (.setLength b 0)
           result)
         result)))))

(defn search [document x]
  (let [x (lower-case x)
        empty-to-nil #(if (empty? %) nil %)]
    (empty-to-nil
     (into []
           (comp sentence-segmenter
                 (map trim)
                 (filter #(index-of (lower-case %) x)))
           document))))

(ns printer.core
  (:require [clojure.string :as str]))

;; state: finished sentences; curr: words of the sentence being built.
(defn foo [state curr [w & ws]]
  (cond
    (nil? w) state
    (or (str/ends-with? w ".")
        (str/ends-with? w "!")
        (str/ends-with? w "?")) (foo (conj state (str/join " " (conj curr w))) [] ws)
    :else (foo state (conj curr w) ws)))

(defn get-sentence [s]
  (foo [] [] (str/split s #" ")))

(defn search [s w]
  (let [sentences (get-sentence s)
        w (str/lower-case w)
        out (filter #(str/includes? (str/lower-case %) w) sentences)]
    (if (empty? out)
      nil
      out)))

(defn search [doc word]
  (when-let [matches (-> (str "(?i)[^.!?]*?\\b\\Q" word "\\E\\b.*?[.!?]")
                         re-pattern ; re-seq needs a compiled Pattern, not a string
                         (re-seq doc))]
    (map clojure.string/trim matches)))

I've noticed that some answers here are not taking word boundaries into account. For example:

(search "Should not appear." "no") ;; should evaluate to nil
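
One way to respect word boundaries is to wrap the search term in `\b` anchors and quote it, so that regex metacharacters in the term are taken literally. A rough sketch for JVM Clojure only; `contains-word?` is a hypothetical helper name, not taken from any answer in this thread:

```clojure
(import 'java.util.regex.Pattern)

;; Hypothetical helper: true when `word` occurs in `sentence` as a whole word,
;; ignoring case. Pattern/quote keeps characters such as "." or "?" literal.
(defn contains-word? [sentence word]
  (let [pattern (re-pattern (str "(?i)\\b" (Pattern/quote word) "\\b"))]
    (boolean (re-find pattern sentence))))

(contains-word? "Should not appear." "no")  ;=> false
(contains-word? "No, it should not." "no")  ;=> true
```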

(defn search [document word]
  (let [word-lc (.toLowerCase word)]
    (->> document
         (re-seq #"(\s*)(.*?(?:\.|\?|!))")
         (map (fn [groups] (nth groups 2)))
         (filter (fn [sentence] (.contains (.toLowerCase sentence) word-lc)))
         seq))) ; nil rather than () when nothing matches

An alternative approach that uses a single regex to split the document into sentences and trim the surrounding whitespace while keeping \., \! and \? in place, by means of a regex lookbehind, i.e. (?<=...).

(defn search [doc word]
  (->> (clojure.string/split doc #"(\s*(?<=[\.\!\?])\s*)")
       (filter (partial re-find (re-pattern (str "(?i)\\b" word "\\b"))))
       seq))
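
To see what the lookbehind buys here, it can help to compare it with a plain split on the terminators. A small illustration (the input string `doc` is made up for the example):

```clojure
(require '[clojure.string :as str])

(def doc "I like to write. Do you like to write?")

;; Splitting on the terminators themselves consumes them.
(str/split doc #"[.!?]\s*")
;=> ["I like to write" "Do you like to write"]

;; (?<=[.!?]) matches the empty position just after a terminator, so only
;; the following whitespace is consumed and the punctuation stays in place.
(str/split doc #"(?<=[.!?])\s*")
;=> ["I like to write." "Do you like to write?"]
```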

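Note that this solution matches substrings, not whole words only.

(require '[clojure.string :as s])

(defn splitter [s]
  ;; Alternate between runs of non-terminators and single terminators,
  ;; then glue each run to the terminator that follows it.
  (->> (re-seq #"[^.?!]+|[.?!]" s)
       (partition 2)
       (map (partial apply str))
       (map s/triml)))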

(defn search [d w]
  (let [splits (splitter d)
        lcw (s/lower-case w)]
    (seq (filter #(s/includes? (s/lower-case %) lcw) splits))))

(defn search [xs x]
  (->> (clojure.string/split xs #"(?<=[.!?])\s+")
       (filter (fn [y] (re-find (re-pattern (clojure.string/lower-case x)) (clojure.string/lower-case y))))
       seq))

I had to work a little to return nil instead of an empty vector for the base case:

(def same-and-uppercase
  (juxt identity clojure.string/upper-case))

(defn search [sentences word]
  (let [WORD (clojure.string/upper-case word)]
    (when-let [result (->> (re-seq #".*?[.?!]" sentences)
                           (map (comp same-and-uppercase clojure.string/trim))
                           ;; each item is [original upper-cased]; test the upper-cased copy
                           (filter #(clojure.string/includes? (second %) WORD))
                           (map first)
                           seq)]
      (vec result))))

;; (search "This is my document." "Hello")
;; => nil

;; (search "This is my document. It has two sentences." "sentences")
;; => ["It has two sentences."]

;; (search "I like to write. Do you like to write?" "Write")
;; => ["I like to write." "Do you like to write?"]
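
For the empty case, the core functions `seq` and `not-empty` already turn an empty collection into nil, which may save a manual check. A small sketch of the idiom, independent of any particular solution in this thread:

```clojure
;; seq returns nil for an empty collection, otherwise a seq of its items.
(seq [])                          ;=> nil
(seq ["It has two sentences."])   ;=> ("It has two sentences.")

;; not-empty also returns nil when empty, but otherwise returns the
;; collection itself, so a vector stays a vector.
(not-empty [])                          ;=> nil
(not-empty ["It has two sentences."])   ;=> ["It has two sentences."]

;; filterv builds the vector eagerly, so the two combine naturally.
(not-empty (filterv even? [1 3 5]))     ;=> nil
```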

(defn search [text word]
  (let [sentence-endings #"\.|\!|\?"
        word             (clojure.string/lower-case word)]
    (->> (clojure.string/split text sentence-endings)
         (filter (fn [s] (-> s clojure.string/lower-case (.contains word))))
         (map clojure.string/trim)
         ((fn [result]
            (if (empty? result)
              nil
              (vec result)))))))

heyarne commented Jul 14, 2021

This version respects word boundaries and returns the full sentences, including punctuation:

(defn search [doc phrase]
  (->> (re-seq #"[^\s].*?[.!?]" doc)
       (filter #(re-find (re-pattern (str "(?i)\\b" phrase "\\b")) %))
       seq))

alex-gerdom commented Jul 14, 2021

(defn escape-re [s]
  #?(:clj (java.util.regex.Pattern/quote s)
     :cljs (.replace s (js/RegExp. "[.*+?^${}()|[\\]\\\\]" "g") "\\$&")))

(defn split-sentences [s]
  (if (empty? s) (list s)
      (re-seq #"[^.!?]+[.!?]?" s)))

(defn search [s substr]
  (let [pattern (re-pattern (str "(?i)" (escape-re substr)))
        contains-substr? #(some? (re-find pattern %))
        matching-sentences (->> s
                                split-sentences
                                (filter contains-substr?)
                                (map clojure.string/trim))]
    (if (empty? matching-sentences) nil
        matching-sentences)))

jumarko commented Jul 15, 2021

(require '[clojure.string :as str])

(defn sentences [document]
  (mapv str/trim
        (str/split document #"(?<=[.?!])")))

(defn contains-word? [sentence word]
  ;; Split on non-word characters so that "sentences." still yields "sentences".
  (let [sentence-words (str/split (str/lower-case sentence) #"\W+")]
    ((set sentence-words) (str/lower-case word))))

(defn search [document word]
  (not-empty (filterv #(contains-word? % word)
                      (sentences document))))

(search "This is my document." "Hello")
;; => nil

(search "I like to write. Do you like to write?" "like")
;; => ["I like to write." "Do you like to write?"]

(search "This is not my document. It has No two sentences." "no")
;; => ["It has No two sentences."]

sztamas commented Jul 15, 2021

(defn search [sentences word]
  (let [ci-word       (re-pattern (str "(?i)" "\\b" word "\\b"))
        matches-word? (partial re-find ci-word)]
    (->> sentences
         (re-seq #"[^\.\!\?]+[\.\!\?]+")
         (filter matches-word?)
         (map clojure.string/trim)
         seq)))

javierrweiss commented Jul 16, 2021

(require '[clojure.string :as st])

(defn process-str
  [text]
  (as-> text t
    (st/split-lines t)
    (remove st/blank? t)
    ;; Split after each terminator; the lookbehind keeps the punctuation.
    (map #(st/split % #"(?<=(\.|\?|!))") t)
    (flatten t)
    (map #(st/trim %) t)))

(defn matching-indexes
  [xs word]
  (let [coll (map-indexed
              (fn [idx itm]
                (if-not (nil? (re-find (re-pattern (str "(?i)" word)) itm))
                  idx))
              xs)]
    (remove nil? coll)))

(defn search [text word]
  (let [data (process-str text)
        indexes (matching-indexes data word)]
    (if (empty? indexes)
      nil
      (vec (for [x indexes] (nth data x))))))

KingCode commented Oct 7, 2021

(require '[clojure.string :as str])

;; Returns the terminator ("." "!" or "?") of each sentence, in order.
(defn parse-ends [txt]
  (for [m (repeat (re-matcher #"([^.!?]+[.!?])" txt))
        :let [finds (re-find m)]
        :while finds]
    (->> finds rest (filter identity) first last str)))

(defn search [txt word]
  (let [word (str/lower-case word)
        ends (parse-ends txt)]
    (->> (str/split txt #"[.!?]")
         (map vector ends)
         (sequence (comp
                    (filter (fn [[end sent]]
                              (->> (str/split (str/lower-case sent) #"\s+")
                                   (some #{word}))))
                    (map (fn [[end sent]]
                           (.concat sent end)))))
         (map str/trim)
         seq)))
