- Time: 30–40 minutes
- Goal: Learn how to perform semantic similarity search using vector embeddings.
In this lab, you will:
- Convert a natural-language query into an embedding
- Compare that embedding against book embeddings in your database
- Rank results by semantic similarity
- Interpret what the similarity scores mean
You will use the embeddings you generated yesterday.
Make sure that:
- Your Linux containers are running (docker-compose up -d)
- Your virtual environment is active (.\.venv\Scripts\activate)
Open your Python file from yesterday and add this function:
def search_books(query: str, k: int = 5):
    embedding = get_embedding(query)
    with psycopg.connect(**DB_CONFIG) as conn:
        with conn.cursor() as cur:
            cur.execute("""
                WITH q AS (SELECT %s::vector AS v)
                SELECT
                    name,
                    item_data->>'subject' AS subject,
                    1 - (embedding <=> q.v) AS similarity
                FROM items, q
                ORDER BY embedding <=> q.v
                LIMIT %s;
            """, (embedding, k))
            return cur.fetchall()

What this query does:
- Embeds your search text
- Uses the <=> cosine distance operator
- Orders rows so the most similar books appear first
- Converts distance to similarity (1 - distance) for easier interpretation
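To make the distance-to-similarity conversion concrete, here is a minimal pure-Python sketch of what cosine distance measures. It is an illustration only and not how pgvector computes the operator internally; the function names cosine_distance and cosine_similarity are made up for this example.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # 1 - cos(angle): 0 means same direction, 1 means orthogonal, 2 means opposite.
    return 1.0 - dot / (norm_a * norm_b)


def cosine_similarity(a, b):
    """Similarity as used in the lab query: 1 - distance."""
    return 1.0 - cosine_distance(a, b)
```

Vectors pointing the same way give a similarity of 1, unrelated (orthogonal) vectors give 0, and opposite vectors give -1. Vector length does not matter, only direction.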
Add this test harness to the bottom of your file:
if __name__ == "__main__":
    tests = [
        "how do I build a website?",
        "machine learning basics",
        "history of computing",
        "clean code and design principles",
        "art and design theory",
    ]
    for t in tests:
        print(f"\nQuery: {t}")
        for name, subject, sim in search_books(t):
            print(f" {name} [{subject}] — similarity={sim:.3f}")

Run your script.
For each query:
- Read the book titles
- Note the subjects
- Note the similarity scores
Think about these questions:
- Do the results match what you expected for each query?
- For the highest-ranked item:
  - What was its similarity score?
  - Does it “feel” close in meaning?
- Compare the 1st and 3rd items for any query:
  - How different are their scores?
  - Does that difference match the content difference?
By default, your semantic search always returns the top k results, even if some of them aren’t very similar. In this step, you’ll filter results so that only items above a chosen similarity level are returned.
Since <=> returns cosine distance, and similarity = 1 − distance, requiring similarity > 0.5 is the same as requiring distance < 0.5:

WHERE (embedding <=> q.v) < 0.5

Add that line inside your search_books SQL query, between the FROM and ORDER BY clauses.
Then re-run your searches.
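If you would rather pass the threshold as a parameter than hard-code it, one possible sketch is shown below. The names similarity_to_max_distance, THRESHOLD_SQL, and threshold_params are hypothetical helpers invented for this example; the SQL is the lab query with a WHERE clause added between FROM and ORDER BY.

```python
def similarity_to_max_distance(min_similarity: float) -> float:
    """Convert a minimum similarity into the equivalent maximum cosine distance."""
    return 1.0 - min_similarity


# Same query as search_books, plus a distance filter (hypothetical constant name).
THRESHOLD_SQL = """
    WITH q AS (SELECT %s::vector AS v)
    SELECT
        name,
        item_data->>'subject' AS subject,
        1 - (embedding <=> q.v) AS similarity
    FROM items, q
    WHERE (embedding <=> q.v) < %s
    ORDER BY embedding <=> q.v
    LIMIT %s;
"""


def threshold_params(embedding, min_similarity: float, k: int = 5):
    """Build the parameters in the same order as the %s placeholders."""
    return (embedding, similarity_to_max_distance(min_similarity), k)
```

Inside search_books you could then call cur.execute(THRESHOLD_SQL, threshold_params(embedding, 0.5, k)). Note that LIMIT still applies, so you get at most k rows and possibly fewer.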
- What happens to your result list when you require similarity > 0.5?
- Do some queries now return fewer items?
- Try different similarity thresholds (e.g., 0.3, 0.7). Remember to convert each one to a distance: similarity > 0.7 means distance < 0.3. How sensitive is the system to these changes?
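One way to explore threshold sensitivity without editing the SQL each time is to fetch a larger unfiltered list once and count how many rows survive each cutoff in Python. The helper below is a hypothetical sketch; it assumes rows shaped like the (name, subject, similarity) tuples that search_books returns.

```python
def count_above(rows, thresholds):
    """Return {threshold: number of rows whose similarity exceeds it}."""
    return {
        t: sum(1 for _name, _subject, sim in rows if sim > t)
        for t in thresholds
    }


# Example usage (requires the lab database):
# rows = search_books("machine learning basics", k=20)
# print(count_above(rows, [0.3, 0.5, 0.7]))
```

If the counts drop sharply between two nearby thresholds, most of your results are clustered in that score range.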
Try one or more of the following:
Choose a topic you’re familiar with (e.g., design patterns, databases, art history).
Run:
search_books("your query here")
Record the top result and similarity score.
Try:
"cooking recipes"
What happens? Why?
Try two closely related queries:
"machine learning basics"
"introduction to neural networks"
Do the results overlap? Do the similarity scores behave as you expect?
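To check overlap systematically instead of by eye, you could compare the titles returned for the two queries. The function below is a small sketch with a made-up name (title_overlap); it assumes each row is a (name, subject, similarity) tuple as returned by search_books.

```python
def title_overlap(rows_a, rows_b):
    """Return the shared titles and their share of all distinct titles."""
    titles_a = {name for name, _subject, _sim in rows_a}
    titles_b = {name for name, _subject, _sim in rows_b}
    shared = titles_a & titles_b
    union = titles_a | titles_b
    # Jaccard index: size of the intersection divided by size of the union.
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# Example usage (requires the lab database):
# shared, score = title_overlap(
#     search_books("machine learning basics"),
#     search_books("introduction to neural networks"),
# )
```

A score near 1 means the two queries retrieve almost the same books; a score near 0 means the embeddings treat them as different topics.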
By completing this lab, you have:
- Performed semantic similarity search
- Used cosine distance to compare vectors
- Interpreted similarity scores
- Observed how meaning, not keywords, drives ranking
- Prepared the foundation for using these search results inside an LLM in the next lab