- A running Qdrant instance (local or remote).
- Clojure project with dependencies for HTTP requests and JSON handling.
- URLs stored in a Qdrant collection’s payload under a field like
url.
Add the following dependencies to your project.clj (if using Leiningen):
:dependencies [[org.clojure/clojure "1.11.1"]
[clj-http "3.12.0"] ;; For HTTP requests
[cheshire "5.12.0"]] ;; For JSON parsingOr, if you’re using deps.edn:
{:deps {org.clojure/clojure {:mvn/version "1.11.1"}
clj-http/clj-http {:mvn/version "3.12.0"}
cheshire/cheshire {:mvn/version "5.12.0"}}}We’ll implement two versions, similar to the Python code:
- Individual URL Check: Check each URL one by one (simpler but slower for many URLs).
- Batch URL Check: Check all URLs in a single query using
MatchAny(faster for large lists).
This version queries Qdrant for each URL individually using the /collections/{collection}/points/scroll endpoint.
(ns qdrant-checker
(:require [clj-http.client :as http]
[cheshire.core :as json]))
(def qdrant-config
{:host "http://localhost:6333"
:collection "scraped_pages"
:url-field "url"})
(defn build-url-filter
"Build a Qdrant filter for a single URL."
[url url-field]
{:must [{:key url-field
:match {:value url}}]})
(defn check-url
"Check if a single URL exists in the Qdrant collection."
[{:keys [host collection url-field]} url]
(let [endpoint (str host "/collections/" collection "/points/scroll")
payload {:filter (build-url-filter url url-field)
:limit 1
:with_payload true
:with_vector false}
response (http/post endpoint
{:body (json/generate-string payload)
:headers {"Content-Type" "application/json"}
:as :json})
points (get-in response [:body :result :points])]
(if (seq points)
{:url url :exists? true}
{:url url :exists? false})))
(defn check-urls
"Check which URLs exist in the Qdrant collection. Returns {:existing [], :non-existing []}."
[config urls]
(let [results (map #(check-url config %) urls)
existing (map :url (filter :exists? results))
non-existing (map :url (remove :exists? results))]
{:existing existing
:non-existing non-existing}))
;; Example usage
(def urls-to-check
["https://example.com/page1"
"https://example.com/page2"
"https://example.com/page3"])
(let [{:keys [existing non-existing]} (check-urls qdrant-config urls-to-check)]
(println "Existing URLs:" existing)
(println "Non-existing URLs:" non-existing)
(if (seq non-existing)
(println "Ready to insert" (count non-existing) "new URLs into Qdrant.")
(println "All URLs already exist in the collection.")))This version queries all URLs in a single request using a MatchAny filter, which is more efficient for large lists.
(ns qdrant-checker
(:require [clj-http.client :as http]
[cheshire.core :as json]))
(def qdrant-config
{:host "http://localhost:6333"
:collection "scraped_pages"
:url-field "url"})
(defn build-batch-url-filter
"Build a Qdrant filter for multiple URLs using MatchAny."
[urls url-field]
{:must [{:key url-field
:match {:any urls}}]})
(defn check-urls-batch
"Check which URLs exist in the Qdrant collection in a single query.
Returns {:existing [], :non-existing []}."
[{:keys [host collection url-field]} urls]
(let [endpoint (str host "/collections/" collection "/points/scroll")
payload {:filter (build-batch-url-filter urls url-field)
:limit (count urls)
:with_payload true
:with_vector false}
response (http/post endpoint
{:body (json/generate-string payload)
:headers {"Content-Type" "application/json"}
:as :json})
points (get-in response [:body :result :points])
existing (map #(get-in % [:payload url-field]) points)
non-existing (remove (set existing) urls)]
{:existing existing
:non-existing non-existing}))
;; Example usage
(def urls-to-check
["https://example.com/page1"
"https://example.com/page2"
"https://example.com/page3"])
(let [{:keys [existing non-existing]} (check-urls-batch qdrant-config urls-to-check)]
(println "Existing URLs:" existing)
(println "Non-existing URLs:" non-existing)
(if (seq non-existing)
(println "Ready to insert" (count non-existing) "new URLs into Qdrant.")
(println "All URLs already exist in the collection.")))- Qdrant Config: The
qdrant-configmap holds the Qdrant host, collection name, and payload field name (url-field). Update these to match your setup (e.g.,hostfor Qdrant Cloud,url-fieldif you use a different key likepage_url). - HTTP Requests: We use
clj-httpto send POST requests to Qdrant’s REST API (/collections/{collection}/points/scroll). The:as :jsonoption ensures the response is parsed as JSON. - Filter Construction:
- For individual checks,
build-url-filtercreates amustfilter with a singlematchcondition for one URL. - For batch checks,
build-batch-url-filterusesmatch.anyto match any URL in the list.
- For individual checks,
- Response Handling:
- The
scrollendpoint returns a:result.pointsarray. If it’s non-empty, the URL exists. - In the batch version, we extract all URLs from the points’ payloads and compute non-existing URLs by set difference.
- The
- Output: Both functions return a map with
:existingand:non-existingkeys, containing lists of URLs.
For urls-to-check ["https://example.com/page1" "https://example.com/page2"]:
- If
page1exists andpage2doesn’t:Existing URLs: (https://example.com/page1) Non-existing URLs: (https://example.com/page2) Ready to insert 1 new URLs into Qdrant.
- Batch Efficiency: The batch version (
check-urls-batch) is preferred for large URL lists because it minimizes HTTP requests. TheMatchAnyfilter checks all URLs in one go. - Payload Field: Assumes URLs are stored in a payload field named
url. If your field is different (e.g.,metadata.url), update:url-fieldinqdrant-configand ensure the filter key matches (e.g.,metadata.urlin the filter). - Error Handling: For production, add error handling for network issues or Qdrant errors:
(defn check-urls-batch [config urls] (try (let [endpoint ...] ;; Same as above ...) (catch Exception e (println "Error querying Qdrant:" (.getMessage e)) {:existing [] :non-existing urls})))
- Collection Existence: To verify the collection exists, you can query
/collections/{collection}before running checks:(defn collection-exists? [{:keys [host collection]}] (let [response (http/get (str host "/collections/" collection))] (= 200 (:status response))))
- REST vs. gRPC: This uses the REST API for simplicity. If you’re using gRPC, you’d need to interop with the
qdrant-java-client. I can provide that version if needed. - Inserting Non-existing URLs: After identifying
:non-existingURLs, you’ll need to:- Generate vectors for the scraped pages (e.g., using a Java/Clojure-compatible embedding library like
sentence-transformersvia interop). - Upsert points to Qdrant using the
/collections/{collection}/pointsendpoint. Let me know if you want code for this part!
- Generate vectors for the scraped pages (e.g., using a Java/Clojure-compatible embedding library like
- Qdrant Cloud: If using Qdrant Cloud, update
:hostto your cluster URL (e.g.,https://your-cluster.qdrant.io) and add an API key:(def qdrant-config {:host "https://your-cluster.qdrant.io" :collection "scraped_pages" :url-field "url" :api-key "your-api-key"}) ;; Add to http/post :headers {"Content-Type" "application/json" "api-key" (:api-key config)}
- Custom Payload: If URLs are nested (e.g.,
{:metadata {:url "..."}}), use a dotted key in the filter:{:must [{:key "metadata.url" :match {:value url}}]}
- Save the code in a file (e.g.,
src/qdrant_checker.clj). - Run with Leiningen:
lein run(orclojure -M -m qdrant-checkerfordeps.edn). - Adjust
urls-to-checkandqdrant-configto match your data.
- If you’re ready to insert the
:non-existingURLs, I can provide Clojure code to generate vectors and upsert points. - If you have a specific embedding model or Qdrant schema, share details for tailored code.
- If you prefer the gRPC client or have other constraints (e.g., async HTTP), let me know.
Let me know how this works or if you need further tweaks!