7-monitor-youtube-assistant.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/caleb-kaiser/b6701f17e8b061ffc625792ddeca24d6/7-monitor-youtube-assistant.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"<img src=\"https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/opik-logo.svg\" width=\"250\"/>"
],
"metadata": {
"id": "JyRiHeF6_doD"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "zAHuR_PSVMhP"
},
"source": [
"# Build & Monitor a YouTube Search Assistant\n"
]
},
{
"cell_type": "markdown",
"source": [
"In this exercise, you'll build a YouTube search assistant and implement monitoring with Opik. You can use either OpenAI or LiteLLM as your LLM API. The basic architecture of the application looks like this:\n",
"\n",
"- Users submit a question\n",
"- Your application searches YouTube for relevant videos\n",
"- Your application pulls the transcripts for those videos and uses SentenceTransformers embeddings to select the most relevant one\n",
"- Finally, your application passes that context along with the question to your LLM API and returns the answer to the user"
],
"metadata": {
"id": "03DjzjB8hDGB"
}
},
{
"cell_type": "markdown",
"source": [
"# Imports & Configuration"
],
"metadata": {
"id": "FY3ONLhb_fPO"
}
},
{
"cell_type": "code",
"source": [
"! pip install opik openai litellm pytube youtube-transcript-api sentence-transformers --quiet"
],
"metadata": {
"id": "cArXTpDBlaJz"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-udSJ-rHVMhT"
},
"outputs": [],
"source": [
"import opik\n",
"from opik import track\n",
"from opik.integrations.openai import track_openai\n",
"import openai\n",
"import os\n",
"import litellm\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"OPIK_PROJECT_NAME\"] = \"youtube_search_assistant\""
]
},
{
"cell_type": "code",
"source": [
"# Opik configuration\n",
"if \"OPIK_API_KEY\" not in os.environ:\n",
"    os.environ[\"OPIK_API_KEY\"] = getpass(\"Enter your Opik API key: \")\n",
"\n",
"opik.configure()"
],
"metadata": {
"id": "ot5-tWOXd8X0"
},
"execution_count": null,
"outputs": []
},
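{
"cell_type": "markdown",
"source": [
"Before building the application, here is a minimal smoke test of Opik's `@track` decorator (a sketch, not part of the original exercise). Decorating a function logs its inputs, output, and latency as a trace; since `OPIK_PROJECT_NAME` is set above, the trace lands in the `youtube_search_assistant` project."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional sketch: confirm traces are being logged before wiring up the real pipeline.\n",
"# The function name and argument below are arbitrary examples, not part of the assignment.\n",
"@track\n",
"def hello_opik(name: str):\n",
"    return f\"Hello, {name}!\"\n",
"\n",
"hello_opik(\"Opik\")  # this call should show up as a trace in your Opik project"
],
"metadata": {},
"execution_count": null,
"outputs": []
},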
{
"cell_type": "code",
"source": [
"# OpenAI configuration (ignore if you're using LiteLLM)\n",
"if \"OPENAI_API_KEY\" not in os.environ:\n",
"    os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API key: \")\n"
],
"metadata": {
"id": "yyjDSp7uh3zx"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# LLM Application"
],
"metadata": {
"id": "4xxz5mOv_joo"
}
},
{
"cell_type": "code",
"source": [
"# Simple client class for using different LLM APIs (OpenAI or LiteLLM)\n",
"class LLMClient:\n",
"    def __init__(self, client_type: str = \"openai\", model: str = \"gpt-4\"):\n",
"        self.client_type = client_type\n",
"        self.model = model\n",
"\n",
"        if self.client_type == \"openai\":\n",
"            self.client = track_openai(openai.OpenAI())\n",
"\n",
"        else:\n",
"            self.client = None\n",
"\n",
"    # LiteLLM query function - **kwargs can pass arguments like temperature\n",
"    def _get_litellm_response(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
"        messages = [\n",
"            {\"role\": \"system\", \"content\": system},\n",
"            {\"role\": \"user\", \"content\": query}\n",
"        ]\n",
"\n",
"        response = litellm.completion(\n",
"            model=self.model,\n",
"            messages=messages,\n",
"            **kwargs\n",
"        )\n",
"\n",
"        return response.choices[0].message.content\n",
"\n",
"    # OpenAI query function - use **kwargs to pass arguments like temperature\n",
"    def _get_openai_response(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
"        messages = [\n",
"            {\"role\": \"system\", \"content\": system},\n",
"            {\"role\": \"user\", \"content\": query}\n",
"        ]\n",
"\n",
"        response = self.client.chat.completions.create(\n",
"            model=self.model,\n",
"            messages=messages,\n",
"            **kwargs\n",
"        )\n",
"\n",
"        return response.choices[0].message.content\n",
"\n",
"    def query(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
"        if self.client_type == \"openai\":\n",
"            return self._get_openai_response(query, system, **kwargs)\n",
"\n",
"        else:\n",
"            return self._get_litellm_response(query, system, **kwargs)\n"
],
"metadata": {
"id": "PBIyVFjKeJJ5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Initialize your client!\n",
"# Use client_type=\"openai\" with an OpenAI model name if you'd rather call the OpenAI API directly.\n",
"\n",
"client = LLMClient(client_type=\"litellm\", model=\"huggingface/TinyLlama/TinyLlama-1.1B-Chat-v1.0\")"
],
"metadata": {
"id": "stqRZOGwgVQV"
},
"execution_count": null,
"outputs": []
},
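{
"cell_type": "markdown",
"source": [
"As a quick sanity check (a sketch, assuming the client above is configured and the chosen model is reachable from your environment), you can send a single prompt before building the rest of the pipeline. Keyword arguments such as `temperature` are forwarded to the underlying completion call."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional: send a simple prompt to verify the client works end to end.\n",
"# The question and temperature value below are arbitrary examples.\n",
"print(client.query(\"In one sentence, what is a video transcript?\", temperature=0.2))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},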
{
"cell_type": "code",
"source": [
"from pytube import Search\n",
"\n",
"def search_youtube(query: str):\n",
"    # Use PyTube's Search class to perform the search\n",
"    search = Search(query)\n",
"\n",
"    # Get the first 5 video results\n",
"    videos = search.results[:5]\n",
"\n",
"    # Extract the video URLs\n",
"    video_urls = [f\"https://www.youtube.com/watch?v={video.video_id}\" for video in videos]\n",
"\n",
"    return video_urls\n"
],
"metadata": {
"id": "OPnT9he0a8Gt"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from youtube_transcript_api import YouTubeTranscriptApi\n",
"\n",
"def get_video_transcripts(video_urls: list):\n",
"    transcripts = []\n",
"    for url in video_urls:\n",
"        # Extract the video ID from the URL\n",
"        video_id = url.split(\"v=\")[1]\n",
"        try:\n",
"            # Fetch the transcript and join its text segments into one string\n",
"            transcript = YouTubeTranscriptApi.get_transcript(video_id)\n",
"            full_transcript = \" \".join([entry['text'] for entry in transcript])\n",
"            transcripts.append(full_transcript)\n",
"        except Exception as e:\n",
"            # Some videos have no transcript; record the error instead of failing\n",
"            transcripts.append(f\"Error retrieving transcript for {url}: {str(e)}\")\n",
"\n",
"    return transcripts\n"
],
"metadata": {
"id": "T0Hzup0qa8DW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from sklearn.metrics.pairwise import cosine_similarity\n",
"from sentence_transformers import SentenceTransformer\n",
"\n",
"def find_relevant_context(query: str, transcripts: list, model_name: str = \"all-MiniLM-L6-v2\"):\n",
"    # Embed the query once\n",
"    model = SentenceTransformer(model_name)\n",
"    query_embedding = model.encode([query])\n",
"\n",
"    # Embed each transcript and keep the one most similar to the query\n",
"    best_match = \"\"\n",
"    highest_similarity = -1\n",
"    for transcript in transcripts:\n",
"        transcript_embedding = model.encode([transcript])\n",
"        similarity = cosine_similarity(query_embedding, transcript_embedding)[0][0]\n",
"        if similarity > highest_similarity:\n",
"            highest_similarity = similarity\n",
"            best_match = transcript\n",
"\n",
"    return best_match\n"
],
"metadata": {
"id": "m5rCq_Mda8CG"
},
"execution_count": null,
"outputs": []
},
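{
"cell_type": "markdown",
"source": [
"`find_relevant_context` compares whole transcripts, and `all-MiniLM-L6-v2` truncates long inputs (its maximum sequence length is 256 tokens), so most of a long transcript is ignored when it is embedded. One possible refinement (a sketch, not required for the exercise) is to split each transcript into smaller chunks and return the most similar chunk; the function name and chunk size below are illustrative choices, and the cell reuses the imports from the cell above."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional variation: chunk transcripts before embedding and return the best-matching chunk.\n",
"# find_relevant_chunk and chunk_words are illustrative names/values, not part of the assignment.\n",
"def find_relevant_chunk(query: str, transcripts: list, model_name: str = \"all-MiniLM-L6-v2\", chunk_words: int = 200):\n",
"    model = SentenceTransformer(model_name)\n",
"    query_embedding = model.encode([query])\n",
"\n",
"    # Split every transcript into fixed-size word windows\n",
"    chunks = []\n",
"    for transcript in transcripts:\n",
"        words = transcript.split()\n",
"        for i in range(0, len(words), chunk_words):\n",
"            chunks.append(\" \".join(words[i:i + chunk_words]))\n",
"\n",
"    if not chunks:\n",
"        return \"\"\n",
"\n",
"    # Embed all chunks in one batch and pick the one closest to the query\n",
"    chunk_embeddings = model.encode(chunks)\n",
"    similarities = cosine_similarity(query_embedding, chunk_embeddings)[0]\n",
"    return chunks[similarities.argmax()]"
],
"metadata": {},
"execution_count": null,
"outputs": []
},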
{
"cell_type": "code",
"source": [
"# @track logs each call to this function (inputs, output, latency) as a trace in Opik\n",
"@track\n",
"def query_llm_with_context(query: str, context: str):\n",
"    prompt = f\"Given the following context: {context}\\nAnswer the question: {query}\"\n",
"\n",
"    return client.query(prompt)\n"
],
"metadata": {
"id": "2ykve-j3a7_V"
},
"execution_count": null,
"outputs": []
},
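{
"cell_type": "markdown",
"source": [
"Only `query_llm_with_context` is decorated above, so each question currently logs a trace for the LLM call alone. If you also want the search, transcript, and retrieval steps logged, one option (a sketch, not required for the exercise) is to wrap those helpers with the same `track` decorator; if your `question_answer_system` is itself decorated with `@track`, each step then appears as a nested span inside a single trace."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional: wrap the helper functions so each pipeline step is also logged to Opik.\n",
"# track is the same decorator used above, applied here as a plain function wrapper.\n",
"search_youtube = track(search_youtube)\n",
"get_video_transcripts = track(get_video_transcripts)\n",
"find_relevant_context = track(find_relevant_context)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},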
{
"cell_type": "markdown",
"source": [
"# Exercise"
],
"metadata": {
"id": "h_3pEqkK_mS5"
}
},
{
"cell_type": "code",
"source": [
"# Exercise time! Try completing the missing sections in the function below:\n",
"\n",
"def question_answer_system(user_query: str):\n",
"    # Step 1: Search YouTube with the phrase\n",
"    video_urls =\n",
"\n",
"    # Step 2: Pull transcripts for the videos\n",
"    transcripts =\n",
"\n",
"    # Step 3: Find relevant context\n",
"    relevant_context =\n",
"\n",
"    # Step 4: Query the LLM with the context\n",
"    final_answer =\n",
"\n",
"    return final_answer\n"
],
"metadata": {
"id": "ajJE8NL3a751"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Let's test it out!\n",
"\n",
"user_questions = [\n",
"    \"Who is Moo Deng?\",\n",
"    # Add your own questions\n",
"]"
],
"metadata": {
"id": "d6x3Fe8Ca72z"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"for question in user_questions:\n",
"    answer = question_answer_system(question)\n",
"    print(answer)"
],
"metadata": {
"id": "_4-eV77ea7qW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Implemented question_answer_system()"
],
"metadata": {
"id": "OcTnXvbqug9G"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OIsLRDhpVMhW"
},
"outputs": [],
"source": [
"def question_answer_system(user_query: str):\n",
"    # Step 1: Search YouTube with the phrase\n",
"    video_urls = search_youtube(user_query)\n",
"\n",
"    # Step 2: Pull transcripts for the videos\n",
"    transcripts = get_video_transcripts(video_urls)\n",
"\n",
"    # Step 3: Find relevant context\n",
"    relevant_context = find_relevant_context(user_query, transcripts)\n",
"\n",
"    # Step 4: Query the LLM with the context\n",
"    final_answer = query_llm_with_context(user_query, relevant_context)\n",
"\n",
"    return final_answer\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GufQWF_CVMhW"
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FLS1-c35VMhX"
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "comet-eval",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
},
"colab": {
"provenance": [],
"collapsed_sections": [
"FY3ONLhb_fPO",
"4xxz5mOv_joo",
"h_3pEqkK_mS5",
"OcTnXvbqug9G"
],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}