@JonathanLoscalzo
Last active July 14, 2024
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/jloscalzo/Projects/llm-zoomcamp/venv/lib/python3.12/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py:11: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from tqdm.autonotebook import tqdm, trange\n"
]
}
],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"\n",
"model_name = \"multi-qa-distilbert-cos-v1\"\n",
"embedding_model = SentenceTransformer(model_name)\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"user_question = \"I just discovered the course. Can I still join it?\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Q1) What's the first value of the resulting vector?: 0.0782226100564003\n"
]
}
],
"source": [
"print(\n",
"    f\"Q1) What's the first value of the resulting vector?: {embedding_model.encode(user_question)[0]}\"\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import requests \n",
"\n",
"base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'\n",
"relative_url = '03-vector-search/eval/documents-with-ids.json'\n",
"docs_url = f'{base_url}/{relative_url}?raw=1'\n",
"docs_response = requests.get(docs_url)\n",
"documents = docs_response.json()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"948"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(documents)"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"#documents"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"documents = list(filter(lambda d: d['course']=='machine-learning-zoomcamp', documents))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"assert len(documents) == 375"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now for each document, we will create an embedding for both question and answer fields.\n",
"\n",
"We want to put all of them into a single matrix X:\n",
"\n",
"- Create a list `embeddings`\n",
"- Iterate over each document\n",
"- `qa_text = f'{question} {text}'`\n",
"- Compute the embedding for `qa_text` and append it to `embeddings`\n",
"- At the end, let `X = np.array(embeddings)` (`import numpy as np`)\n",
"\n",
"What's the shape of X? (`X.shape`). Include the parentheses."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Q2) What's the shape of the resulting matrix?: (375, 768)\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"embeddings = []\n",
"\n",
"for doc in documents:\n",
"    qa_text = f'{doc[\"question\"]} {doc[\"text\"]}'\n",
"    qa_text_vect = embedding_model.encode(qa_text)\n",
"    embeddings.append(qa_text_vect)\n",
"\n",
"X = np.array(embeddings)\n",
"\n",
"print(\"Q2) What's the shape of the resulting matrix?:\", X.shape)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Q3) What's the highest score in the results?: 14 0.6506574\n"
]
}
],
"source": [
"v = embedding_model.encode(user_question)\n",
"scores = X.dot(v)\n",
"\n",
"print(\"Q3) What's the highest score in the results?:\", scores.argmax(), scores.max())\n"
]
},
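{
"cell_type": "markdown",
"metadata": {},
"source": [
"`multi-qa-distilbert-cos-v1` is trained for cosine similarity, so its embeddings should come out (approximately) unit-normalized, which is why the plain dot product above works as a similarity score. A quick check (a sketch reusing `v` and `X` from the cells above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# if the embeddings are unit vectors, dot product == cosine similarity\n",
"print(np.linalg.norm(v), np.linalg.norm(X, axis=1).min(), np.linalg.norm(X, axis=1).max())"
]
},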
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(375,)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scores.shape"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('I just discovered the course. Can I still join it?',\n",
" {'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',\n",
" 'section': 'General course-related questions',\n",
" 'question': 'The course has already started. Can I still join it?',\n",
" 'course': 'machine-learning-zoomcamp',\n",
" 'id': 'ee58a693'})"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_question, documents[scores.argmax()]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',\n",
" 'section': 'General course-related questions',\n",
" 'question': 'The course has already started. Can I still join it?',\n",
" 'course': 'machine-learning-zoomcamp',\n",
" 'id': 'ee58a693'},\n",
" {'text': 'Welcome to the course! Go to the course page (http://mlzoomcamp.com/), scroll down and start going through the course materials. Then read everything in the cohort folder for your cohort’s year.\\nClick on the links and start watching the videos. Also watch office hours from previous cohorts. Go to DTC youtube channel and click on Playlists and search for {course yyyy}. ML Zoomcamp was first launched in 2021.\\nOr you can just use this link: http://mlzoomcamp.com/#syllabus',\n",
" 'section': 'General course-related questions',\n",
" 'question': 'I just joined. What should I do next? How can I access course materials?',\n",
" 'course': 'machine-learning-zoomcamp',\n",
" 'id': '0a278fb2'},\n",
" {'text': \"The process is automated now, so you should receive the email eventually. If you haven’t, check your promotions tab in Gmail as well as spam.\\nIf you unsubscribed from our newsletter, you won't get course related updates too.\\nBut don't worry, it’s not a problem. To make sure you don’t miss anything, join the #course-ml-zoomcamp channel in Slack and our telegram channel with announcements. This is enough to follow the course.\",\n",
" 'section': 'General course-related questions',\n",
" 'question': \"I filled the form, but haven't received a confirmation email. Is it normal?\",\n",
" 'course': 'machine-learning-zoomcamp',\n",
" 'id': '6ba259b1'},\n",
" {'text': 'Technically, yes. Advisable? Not really. Reasons:\\nSome homework(s) asks for specific python library versions.\\nAnswers may not match in MCQ options if using different languages other than Python 3.10 (the recommended version for 2023 cohort)\\nAnd as for midterms/capstones, your peer-reviewers may not know these other languages. Do you want to be penalized for others not knowing these other languages?\\nYou can create a separate repo using course’s lessons but written in other languages for your own learnings, but not advisable for submissions.\\ntx[source]',\n",
" 'section': 'Miscellaneous',\n",
" 'question': 'Can I do the course in other languages, like R or Scala?',\n",
" 'course': 'machine-learning-zoomcamp',\n",
" 'id': '9f261648'},\n",
" {'text': 'We won’t re-record the course videos. The focus of the course and the skills we want to teach remained the same, and the videos are still up-to-date.\\nIf you haven’t taken part in the previous iteration, you can start watching the videos. It’ll be useful for you and you will learn new things. However, we recommend using Python 3.10 now instead of Python 3.8.',\n",
" 'section': 'General course-related questions',\n",
" 'question': 'The course videos are from the previous iteration. Will you release new ones or we’ll use the videos from 2021?',\n",
" 'course': 'machine-learning-zoomcamp',\n",
" 'id': 'e7ba6b8a'}]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"class VectorSearchEngine():\n",
"    def __init__(self, documents, embeddings):\n",
"        self.documents = documents\n",
"        self.embeddings = embeddings\n",
"\n",
"    def search(self, v_query, num_results=10):\n",
"        scores = self.embeddings.dot(v_query)\n",
"        idx = np.argsort(-scores)[:num_results]\n",
"        return [self.documents[i] for i in idx]\n",
"\n",
"search_engine = VectorSearchEngine(documents=documents, embeddings=X)\n",
"search_engine.search(v, num_results=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hit-rate for our search engine"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'\n",
"relative_url = '03-vector-search/eval/ground-truth-data.csv'\n",
"ground_truth_url = f'{base_url}/{relative_url}?raw=1'\n",
"\n",
"df_ground_truth = pd.read_csv(ground_truth_url)\n",
"df_ground_truth = df_ground_truth[df_ground_truth.course == 'machine-learning-zoomcamp']\n",
"ground_truth = df_ground_truth.to_dict(orient='records')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now use the code from the module to calculate the hitrate of VectorSearchEngine with num_results=5.\n",
"\n",
"What did you get?"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"def hit_rate(relevance_total):\n",
"    cnt = 0\n",
"\n",
"    for line in relevance_total:\n",
"        if True in line:\n",
"            cnt = cnt + 1\n",
"\n",
"    return cnt / len(relevance_total)\n",
"\n",
"def mrr(relevance_total):\n",
"    total_score = 0.0\n",
"\n",
"    for line in relevance_total:\n",
"        for rank in range(len(line)):\n",
"            if line[rank] == True:\n",
"                total_score = total_score + 1 / (rank + 1)\n",
"                break  # MRR counts only the first relevant result\n",
"\n",
"    return total_score / len(relevance_total)"
]
},
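{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of these metrics on a tiny hand-made relevance list (toy values, just to illustrate the formulas):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hit_rate: fraction of queries with at least one relevant result\n",
"# mrr: mean reciprocal rank of the first relevant result (0 if none)\n",
"toy = [[True, False, False], [False, False, True], [False, False, False]]\n",
"assert hit_rate(toy) == 2 / 3\n",
"assert abs(mrr(toy) - (1 + 1 / 3) / 3) < 1e-9"
]
},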
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'Where can I sign up for the course?',\n",
" 'course': 'machine-learning-zoomcamp',\n",
" 'document': '0227b872'}"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ground_truth[0]"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"results = []\n",
"relevances = []\n",
"\n",
"def get_qa_embeddings(q):\n",
"    # embed only the question text of a ground-truth record\n",
"    return embedding_model.encode(q['question'])\n",
"\n",
"for q in ground_truth:\n",
"    doc_id = q['document']\n",
"    text_vect = get_qa_embeddings(q)\n",
"    result = search_engine.search(text_vect, num_results=5)\n",
"    relevance = [r['id'] == doc_id for r in result]\n",
"    results.append(result)\n",
"    relevances.append(relevance)\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Q4) Hit-rate for our search engine? hit_rate=0.9398907103825137, mrr=0.8516484517304189\n"
]
}
],
"source": [
"print(f\"Q4) Hit-rate for our search engine? hit_rate={hit_rate(relevance_total=relevances)}, mrr={mrr(relevance_total=relevances)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Q5. Indexing with Elasticsearch\n",
"Now let's index these documents with Elasticsearch.\n",
"\n",
"- Create the index with the same settings as in the module (but change the dimensions)\n",
"- Index the embeddings (note: you've already computed them)\n",
"- After indexing, let's perform the search of the same query from Q1.\n",
"\n",
"What's the ID of the document with the highest score?"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions-hw3'})"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from elasticsearch import Elasticsearch\n",
"\n",
"es_client = Elasticsearch('http://localhost:9200') \n",
"\n",
"index_settings = {\n",
"    \"settings\": {\n",
"        \"number_of_shards\": 1,\n",
"        \"number_of_replicas\": 0\n",
"    },\n",
"    \"mappings\": {\n",
"        \"properties\": {\n",
"            \"text\": {\"type\": \"text\"},\n",
"            \"section\": {\"type\": \"text\"},\n",
"            \"question\": {\"type\": \"text\"},\n",
"            \"course\": {\"type\": \"keyword\"},\n",
"            \"id\": {\"type\": \"keyword\"},\n",
"            \"question_vector\": {\n",
"                \"type\": \"dense_vector\",\n",
"                \"dims\": 768,\n",
"                \"index\": True,\n",
"                \"similarity\": \"cosine\"\n",
"            },\n",
"            \"text_vector\": {\n",
"                \"type\": \"dense_vector\",\n",
"                \"dims\": 768,\n",
"                \"index\": True,\n",
"                \"similarity\": \"cosine\"\n",
"            },\n",
"            \"question_text_vector\": {\n",
"                \"type\": \"dense_vector\",\n",
"                \"dims\": 768,\n",
"                \"index\": True,\n",
"                \"similarity\": \"cosine\"\n",
"            }\n",
"        }\n",
"    }\n",
"}\n",
"\n",
"index_name = \"course-questions-hw3\"\n",
"\n",
"es_client.indices.delete(index=index_name, ignore_unavailable=True)\n",
"es_client.indices.create(index=index_name, body=index_settings)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"for doc in documents:\n",
"    question = doc['question']\n",
"    text = doc['text']\n",
"    qt = question + ' ' + text\n",
"\n",
"    doc['question_vector'] = embedding_model.encode(question)\n",
"    doc['text_vector'] = embedding_model.encode(text)\n",
"    doc['question_text_vector'] = embedding_model.encode(qt)"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"#documents[0]"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"# Index the embeddings (note: you've already computed them)\n",
"for doc in documents:\n",
"    es_client.index(index=index_name, document=doc)"
]
},
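{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indexing one document per HTTP request is slow for larger collections. The `elasticsearch.helpers.bulk` helper batches requests; a sketch using the same `index_name` and `documents` as above (assuming the vector fields serialize the same way they do for `es_client.index`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from elasticsearch.helpers import bulk\n",
"\n",
"# alternative to the one-by-one loop above; running both would index each document twice\n",
"actions = (\n",
"    {'_index': index_name, '_source': doc}\n",
"    for doc in documents\n",
")\n",
"bulk(es_client, actions)"
]
},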
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'I just discovered the course. Can I still join it?'"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_question"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"v = embedding_model.encode(user_question)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"def elastic_search_knn(v):\n",
"    es_query_dict = {\n",
"        \"field\": \"question_text_vector\",\n",
"        \"query_vector\": v,\n",
"        \"k\": 5,\n",
"        \"num_candidates\": 10000\n",
"    }\n",
"\n",
"    es_results = es_client.search(\n",
"        index=index_name,\n",
"        knn=es_query_dict\n",
"    )\n",
"\n",
"    result_docs = []\n",
"    # hits are already ordered by score, descending\n",
"    for hit in es_results['hits']['hits']:\n",
"        result_docs.append(hit['_source'])\n",
"\n",
"    return es_results, result_docs\n"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"es_results, results = elastic_search_knn(v)"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Q5) What's the ID of the document with the highest score? document=ee58a693, score=0.8255229\n"
]
}
],
"source": [
"print(f\"Q5) What's the ID of the document with the highest score? document={results[0]['id']}, score={es_results['hits']['max_score']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hit-rate for Elasticsearch\n",
"\n",
"The search engine we used in Q4 computed the similarity between the query and ALL the vectors in our database. \n",
"Usually this is not practical, as we may have a lot of data.\n",
"\n",
"Elasticsearch uses approximate techniques to make it faster.\n",
"\n",
"- Let's evaluate how much worse the results get when we switch from exact search (as in Q4) to approximate search with Elastic.\n",
"\n",
"What's the hit rate for our dataset for Elastic?"
]
},
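{
"cell_type": "markdown",
"metadata": {},
"source": [
"Elasticsearch's kNN search is approximate (HNSW-based); `num_candidates` controls how many candidates are considered per shard, trading speed for recall. A sketch of the same query with a smaller candidate pool (the value 100 is illustrative, not tuned):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# same kNN query as in elastic_search_knn, but with a smaller candidate pool;\n",
"# a lower num_candidates is faster and may miss some true nearest neighbours\n",
"res = es_client.search(index=index_name, knn={\n",
"    \"field\": \"question_text_vector\",\n",
"    \"query_vector\": v,\n",
"    \"k\": 5,\n",
"    \"num_candidates\": 100\n",
"})\n",
"print([hit['_source']['id'] for hit in res['hits']['hits']])"
]
},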
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Q6) What's hitrate for our dataset for Elastic? hit_rate=0.6190561315695388, mrr=0.5592182099868911\n"
]
}
],
"source": [
"# reset the accumulators so the Q4 relevances don't leak into the Q6 metrics\n",
"results = []\n",
"relevances = []\n",
"\n",
"for q in ground_truth:\n",
"    doc_id = q['document']\n",
"    text_vect = get_qa_embeddings(q)\n",
"    _, result = elastic_search_knn(text_vect)\n",
"    relevance = [r['id'] == doc_id for r in result]\n",
"    results.append(result)\n",
"    relevances.append(relevance)\n",
"\n",
"print(f\"Q6) What's hitrate for our dataset for Elastic? hit_rate={hit_rate(relevance_total=relevances)}, mrr={mrr(relevance_total=relevances)}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}