@caleb-kaiser
Created October 24, 2024 00:48
7-monitor-youtube-assistant.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/caleb-kaiser/b6701f17e8b061ffc625792ddeca24d6/7-monitor-youtube-assistant.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"<img src=\"https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/opik-logo.svg\" width=\"250\"/>"
],
"metadata": {
"id": "JyRiHeF6_doD"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "zAHuR_PSVMhP"
},
"source": [
"# Build & Monitor a YouTube Search Assistant\n"
]
},
{
"cell_type": "markdown",
"source": [
"In this exercise, you're going to build a YouTube search assistant and implement monitoring with Opik. You can use OpenAI or LiteLLM for your LLM API. The basic architecture for your application looks like this:\n",
"\n",
"- Users submit a question\n",
"- Your application searches YouTube for relevant videos\n",
"- Your application uses SentenceTransformers to extract relevant information from the video transcripts\n",
"- Finally, your application passes the relevant information + question to your LLM API and returns the answer to the user"
],
"metadata": {
"id": "03DjzjB8hDGB"
}
},
{
"cell_type": "markdown",
"source": [
"# Imports & Configuration"
],
"metadata": {
"id": "FY3ONLhb_fPO"
}
},
{
"cell_type": "code",
"source": [
"! pip install opik openai litellm pytube youtube-transcript-api sentence-transformers --quiet"
],
"metadata": {
"id": "cArXTpDBlaJz"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-udSJ-rHVMhT"
},
"outputs": [],
"source": [
"import opik\n",
"from opik import track\n",
"from opik.integrations.openai import track_openai\n",
"import openai\n",
"import os\n",
"import litellm\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"OPIK_PROJECT_NAME\"] = \"youtube_search_assistant\""
]
},
{
"cell_type": "code",
"source": [
"# Opik configuration\n",
"if \"OPIK_API_KEY\" not in os.environ:\n",
" os.environ[\"OPIK_API_KEY\"] = getpass(\"Enter your Opik API key: \")\n",
"\n",
"opik.configure()"
],
"metadata": {
"id": "ot5-tWOXd8X0"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# OpenAI configuration (ignore if you're using LiteLLM)\n",
"if \"OPENAI_API_KEY\" not in os.environ:\n",
" os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API key: \")\n"
],
"metadata": {
"id": "yyjDSp7uh3zx"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# LLM Application"
],
"metadata": {
"id": "4xxz5mOv_joo"
}
},
{
"cell_type": "code",
"source": [
"# Simple little client class for using different LLM APIs (OpenAI or LiteLLM)\n",
"class LLMClient:\n",
"    def __init__(self, client_type: str = \"openai\", model: str = \"gpt-4\"):\n",
"        self.client_type = client_type\n",
"        self.model = model\n",
"\n",
"        if self.client_type == \"openai\":\n",
"            # Wrap the OpenAI client so Opik logs every call\n",
"            self.client = track_openai(openai.OpenAI())\n",
"        else:\n",
"            self.client = None\n",
"\n",
"    # LiteLLM query function\n",
"    def _get_litellm_response(self, query: str, system: str = \"You are a helpful assistant.\"):\n",
"        messages = [\n",
"            {\"role\": \"system\", \"content\": system},\n",
"            {\"role\": \"user\", \"content\": query}\n",
"        ]\n",
"\n",
"        response = litellm.completion(\n",
"            model=self.model,\n",
"            messages=messages\n",
"        )\n",
"\n",
"        return response.choices[0].message.content\n",
"\n",
"    # OpenAI query function - use **kwargs to pass arguments like temperature\n",
"    def _get_openai_response(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
"        messages = [\n",
"            {\"role\": \"system\", \"content\": system},\n",
"            {\"role\": \"user\", \"content\": query}\n",
"        ]\n",
"\n",
"        response = self.client.chat.completions.create(\n",
"            model=self.model,\n",
"            messages=messages,\n",
"            **kwargs\n",
"        )\n",
"\n",
"        return response.choices[0].message.content\n",
"\n",
"    def query(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
"        if self.client_type == \"openai\":\n",
"            return self._get_openai_response(query, system, **kwargs)\n",
"        else:\n",
"            return self._get_litellm_response(query, system)\n"
],
"metadata": {
"id": "PBIyVFjKeJJ5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Initialize your client!\n",
"\n",
"client = LLMClient(client_type=\"litellm\", model=\"huggingface/TinyLlama/TinyLlama-1.1B-Chat-v1.0\")"
],
"metadata": {
"id": "stqRZOGwgVQV"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from pytube import Search\n",
"\n",
"def search_youtube(query: str):\n",
" # Use PyTube's Search class to perform the search\n",
" search = Search(query)\n",
"\n",
" # Get the first 5 video results\n",
" videos = search.results[:5]\n",
"\n",
" # Extract the video URLs\n",
" video_urls = [f\"https://www.youtube.com/watch?v={video.video_id}\" for video in videos]\n",
"\n",
" return video_urls\n"
],
"metadata": {
"id": "OPnT9he0a8Gt"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from youtube_transcript_api import YouTubeTranscriptApi\n",
"\n",
"def get_video_transcripts(video_urls: list):\n",
" transcripts = []\n",
" for url in video_urls:\n",
" video_id = url.split(\"v=\")[1]\n",
" try:\n",
" transcript = YouTubeTranscriptApi.get_transcript(video_id)\n",
" full_transcript = \" \".join([entry['text'] for entry in transcript])\n",
" transcripts.append(full_transcript)\n",
" except Exception as e:\n",
" transcripts.append(f\"Error retrieving transcript for {url}: {str(e)}\")\n",
"\n",
" return transcripts\n"
],
"metadata": {
"id": "T0Hzup0qa8DW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from sklearn.metrics.pairwise import cosine_similarity\n",
"from sentence_transformers import SentenceTransformer\n",
"\n",
"def find_relevant_context(query: str, transcripts: list, model_name: str = \"all-MiniLM-L6-v2\"):\n",
" model = SentenceTransformer(model_name)\n",
" query_embedding = model.encode([query])\n",
"\n",
" best_match = \"\"\n",
" highest_similarity = -1\n",
" for transcript in transcripts:\n",
" transcript_embedding = model.encode([transcript])\n",
" similarity = cosine_similarity(query_embedding, transcript_embedding)[0][0]\n",
" if similarity > highest_similarity:\n",
" highest_similarity = similarity\n",
" best_match = transcript\n",
"\n",
" return best_match\n"
],
"metadata": {
"id": "m5rCq_Mda8CG"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"@track\n",
"def query_llm_with_context(query: str, context: str):\n",
" prompt = f\"Given the following context: {context}\\nAnswer the question: {query}\"\n",
"\n",
" return client.query(prompt)\n"
],
"metadata": {
"id": "2ykve-j3a7_V"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Exercise"
],
"metadata": {
"id": "h_3pEqkK_mS5"
}
},
{
"cell_type": "code",
"source": [
"# Exercise time! Try completing the missing sections in the below function:\n",
"\n",
"def question_answer_system(user_query: str):\n",
"    # Step 1: Search YouTube with the phrase\n",
"    video_urls = ...  # TODO: fill this in\n",
"\n",
"    # Step 2: Pull transcripts for the videos\n",
"    transcripts = ...  # TODO: fill this in\n",
"\n",
"    # Step 3: Find relevant context\n",
"    relevant_context = ...  # TODO: fill this in\n",
"\n",
"    # Step 4: Query the LLM with the context\n",
"    final_answer = ...  # TODO: fill this in\n",
"\n",
"    return final_answer\n"
],
"metadata": {
"id": "ajJE8NL3a751"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Let's test it out!\n",
"\n",
"user_questions = [\n",
" \"Who is Moo Deng?\",\n",
" # Add your own questions\n",
"]"
],
"metadata": {
"id": "d6x3Fe8Ca72z"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"for question in user_questions:\n",
" answer = question_answer_system(question)\n",
" print(answer)"
],
"metadata": {
"id": "_4-eV77ea7qW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Implemented question_answer_system()"
],
"metadata": {
"id": "OcTnXvbqug9G"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OIsLRDhpVMhW"
},
"outputs": [],
"source": [
"# Decorate the top-level function so Opik records the whole pipeline as one trace\n",
"@track\n",
"def question_answer_system(user_query: str):\n",
" # Step 1: Search YouTube with the phrase\n",
" video_urls = search_youtube(user_query)\n",
"\n",
" # Step 2: Pull transcripts for the videos\n",
" transcripts = get_video_transcripts(video_urls)\n",
"\n",
" # Step 3: Find relevant context\n",
" relevant_context = find_relevant_context(user_query, transcripts)\n",
"\n",
" # Step 4: Query the LLM with the context\n",
" final_answer = query_llm_with_context(user_query, relevant_context)\n",
"\n",
" return final_answer\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "comet-eval",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
},
"colab": {
"provenance": [],
"collapsed_sections": [
"FY3ONLhb_fPO",
"4xxz5mOv_joo",
"h_3pEqkK_mS5",
"OcTnXvbqug9G"
],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}