@doingandlearning
Last active December 5, 2025 09:52
Lab 2 — Semantic Search: Finding Books by Meaning

  • Time: 30–40 minutes
  • Goal: Learn how to perform semantic similarity search using vector embeddings.

In this lab, you will:

  • Convert a natural-language query into an embedding
  • Compare that embedding against book embeddings in your database
  • Rank results by semantic similarity
  • Interpret what the similarity scores mean

You will use the embeddings you generated yesterday.

Make sure that:

  • Your Linux containers are running (docker-compose up -d)
  • Your virtual environment is active (.\.venv\Scripts\activate)

1. Write a Semantic Search Function

Open your Python file from yesterday and add this function:

def search_books(query: str, k: int = 5):
    embedding = get_embedding(query)

    with psycopg.connect(**DB_CONFIG) as conn:
        with conn.cursor() as cur:
            cur.execute("""
                WITH q AS (SELECT %s::vector AS v)
                SELECT 
                    name,
                    item_data->>'subject' AS subject,
                    1 - (embedding <=> q.v) AS similarity
                FROM items, q
                ORDER BY embedding <=> q.v
                LIMIT %s;
            """, (embedding, k))

            return cur.fetchall()

What this query does:

  • Embeds your search text
  • Uses the <=> cosine distance operator
  • Orders rows so the most similar books appear first
  • Converts distance to similarity (1 - distance) for easier interpretation
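
To make the distance-to-similarity conversion concrete, here is a minimal pure-Python sketch of the quantity that a cosine distance operator such as <=> is understood to compute. The helper name cosine_distance and the toy 2-D vectors are assumptions made for illustration; they are not part of the lab code, and real embeddings have many more dimensions.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, for illustration only).
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

for label, dist in [
    ("same direction", same_direction),
    ("orthogonal", orthogonal),
    ("opposite", opposite),
]:
    # similarity = 1 - distance, as in the SQL query above
    print(f"{label}: distance={dist:.3f}, similarity={1 - dist:.3f}")
```

Vector length does not matter here, only direction: the distance runs from 0 (same direction) to 2 (opposite), so the derived similarity runs from 1 down to -1.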

2. Try Out Some Search Queries

Add this test harness to the bottom of your file:

if __name__ == "__main__":
    tests = [
        "how do I build a website?",
        "machine learning basics",
        "history of computing",
        "clean code and design principles",
        "art and design theory",
    ]

    for t in tests:
        print(f"\nQuery: {t}")
        for name, subject, sim in search_books(t):
            print(f"  {name} [{subject}] — similarity={sim:.3f}")

Run your script.

Task

For each query:

  • Read the book titles
  • Note the subjects
  • Note the similarity scores

3. Interpret the Results

Think about these questions:

  1. Do the results match what you expected for each query?

  2. For the highest-ranked item:

    • What was its similarity score?
    • Does it “feel” close in meaning?
  3. Compare the 1st and 3rd items for any query:

    • How different are their scores?
    • Does that difference match the content difference?

4. Add a Similarity Threshold

By default, your semantic search always returns the top k results, even if some of them aren’t very similar. In this step, you’ll filter results so that only items above a chosen similarity level are returned.

Since <=> returns cosine distance, and similarity = 1 − distance, requiring similarity > 0.5 is the same as requiring distance < 0.5 (because 1 − 0.5 = 0.5).

Add this filter inside your search_books SQL query, between the FROM line and the ORDER BY line:

WHERE (embedding <=> q.v) < 0.5

Then re-run your searches.
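
If you would rather not hard-code the threshold, one possible approach is to pass it as a query parameter. The sketch below mirrors the search_books query; the names SEARCH_SQL, threshold_params, and min_similarity are assumptions introduced here, not part of the lab code.

```python
# Same query as search_books, with the distance bound supplied as a parameter.
SEARCH_SQL = """
    WITH q AS (SELECT %s::vector AS v)
    SELECT
        name,
        item_data->>'subject' AS subject,
        1 - (embedding <=> q.v) AS similarity
    FROM items, q
    WHERE (embedding <=> q.v) < %s
    ORDER BY embedding <=> q.v
    LIMIT %s;
"""


def threshold_params(embedding, min_similarity=0.5, k=5):
    """Build the parameter tuple for SEARCH_SQL (hypothetical helper).

    The SQL filters on distance, so the similarity threshold is
    converted first: distance bound = 1 - similarity threshold.
    """
    max_distance = 1 - min_similarity
    return (embedding, max_distance, k)
```

Inside search_books, the call would then look like cur.execute(SEARCH_SQL, threshold_params(embedding, 0.5, k)), with the parameters in the same order as the %s placeholders.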


Task

  1. What happens to your result list when you require similarity > 0.5?
  2. Do some queries now return fewer items?
  3. Try different similarity thresholds (e.g., 0.3, 0.7), remembering that the distance bound in the WHERE clause is 1 − threshold. How sensitive is the system to these changes?
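
One way to compare thresholds without editing the SQL each time is to fetch a larger result list once and filter it in Python. This is only a sketch: the rows below are invented placeholders shaped like the search_books output (name, subject, similarity), not real results, and filter_by_similarity is a hypothetical helper.

```python
# Placeholder rows shaped like search_books() output: (name, subject, similarity).
# Titles and scores are invented for illustration, not real results.
rows = [
    ("Example Book A", "programming", 0.72),
    ("Example Book B", "web_development", 0.58),
    ("Example Book C", "computer_science", 0.41),
    ("Example Book D", "software_engineering", 0.27),
]


def filter_by_similarity(rows, min_similarity):
    """Keep rows whose similarity is strictly above the threshold."""
    return [row for row in rows if row[2] > min_similarity]


# Count how many rows survive each candidate threshold.
counts = {t: len(filter_by_similarity(rows, t)) for t in (0.3, 0.5, 0.7)}
for threshold, count in counts.items():
    print(f"similarity > {threshold}: {count} of {len(rows)} rows kept")
```

With real data you would replace rows with something like search_books(query, k=20) and watch how quickly the counts fall as the threshold rises.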

5. Stretch Exercise (Optional)

Try one or more of the following:

A. Write your own query

Choose a topic you’re familiar with (e.g., design patterns, databases, art history).

Run:

search_books("your query here")

Record the top result and similarity score.


B. Test a totally unrelated query

Try:

"cooking recipes"

What happens? Why?


C. Compare two queries

Try two closely related queries:

"machine learning basics"
"introduction to neural networks"

Do the results overlap? Do the similarity scores behave as you expect?
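
To put a number on the overlap, you could compare the titles returned by the two queries. The sketch below assumes rows in the search_books shape (name, subject, similarity); title_overlap is a hypothetical helper and the sample rows are invented placeholders.

```python
def title_overlap(results_a, results_b):
    """Return the shared titles and the Jaccard overlap of two result lists."""
    titles_a = {row[0] for row in results_a}
    titles_b = {row[0] for row in results_b}
    shared = titles_a & titles_b
    union = titles_a | titles_b
    # Jaccard overlap: shared titles divided by all distinct titles.
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Placeholder rows (invented for illustration, not real results).
results_a = [("Example Book A", "artificial_intelligence", 0.70),
             ("Example Book B", "computer_science", 0.61)]
results_b = [("Example Book B", "computer_science", 0.66),
             ("Example Book C", "artificial_intelligence", 0.59)]

shared, jaccard = title_overlap(results_a, results_b)
print(f"shared titles: {sorted(shared)}, overlap: {jaccard:.2f}")
```

In the lab you would call it as title_overlap(search_books("machine learning basics"), search_books("introduction to neural networks")). An overlap near 1 means the two queries retrieve almost the same books.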


What You've Learned

By completing this lab, you have:

  • Performed semantic similarity search
  • Used cosine distance to compare vectors
  • Interpreted similarity scores
  • Observed how meaning, not keywords, drives ranking
  • Prepared the foundation for using these search results inside an LLM in the next lab
import requests
from time import sleep
import psycopg
import json

OLLAMA_URL = "http://nat-lin7.neueda.com:11434/api/embed"
DB_CONFIG = {
    "host": "nat-lin7.neueda.com",
    "port": 5432,
    "user": "postgres",
    "password": "postgres",
    "dbname": "pgvector"
}


def get_embedding(text):
    response = requests.post(OLLAMA_URL,
                             json={
                                 "model": "bge-m3",
                                 "input": text})
    data = response.json()
    embedding = data["embeddings"][0]
    return embedding


def fetch_books():
    """Fetch books across various subjects from Open Library."""
    categories = [
        "programming",
        "web_development",
        "artificial_intelligence",
        "computer_science",
        "software_engineering",
    ]
    all_books = []
    for category in categories:
        url = f"https://openlibrary.org/subjects/{category}.json?limit=10"
        response = requests.get(url)
        response.raise_for_status()  # Raises an error for a bad response
        data = response.json()
        books = data.get("works", [])
        # Format each book
        for book in books:
            book_data = {
                "title": book.get("title", "Untitled"),
                "authors": [
                    author.get("name", "Unknown Author")
                    for author in book.get("authors", [])
                ],
                "first_publish_year": book.get("first_publish_year", "Unknown"),
                "subject": category,
            }
            all_books.append(book_data)
        print(f"Successfully processed {len(books)} books for {category}")
    if not all_books:
        print("No books were fetched from any category.")
    return all_books


def load_books_to_db():
    """Load books with embeddings into PostgreSQL."""
    # Wait for the database to be ready
    sleep(5)
    # Connect to the database
    conn = psycopg.connect(**DB_CONFIG)
    cur = conn.cursor()
    # Fetch data from the Open Library
    books = fetch_books()
    for book in books:
        description = (
            f"Book titled '{book['title']}' by {', '.join(book['authors'])}. "
            f"Published in {book['first_publish_year']}. "
            f"This is a book about {book['subject']}."
        )
        # Generate embedding
        # embedding = "[" + ",".join(["0"] * 1536) + "]"  # Placeholder embedding
        sleep(1)
        embedding = get_embedding(description)
        cur.execute(
            """
            INSERT INTO items (name, item_data, embedding)
            VALUES (%s, %s, %s)
            """,
            (book["title"], json.dumps(book), embedding),
        )
    # Commit and close
    conn.commit()
    cur.close()
    conn.close()


if __name__ == "__main__":
    pass