Created
September 9, 2022 23:56
jina-ai.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"provenance": [], | |
"collapsed_sections": [], | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"language_info": { | |
"name": "python" | |
}, | |
"accelerator": "GPU", | |
"gpuClass": "standard" | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/linkerlin/bd27168101a5ec0775086e2b7d4741ae/jina-ai.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# ⏰ Install & Import Dependencies" | |
], | |
"metadata": { | |
"id": "VZGABxkAge3q" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "lkyGStg_gZKY", | |
"outputId": "1d5c3f2c-d0da-42c4-9162-2a5ada537a8c" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", | |
"Collecting docarray\n", | |
" Downloading docarray-0.16.5.tar.gz (641 kB)\n", | |
"\u001b[K |████████████████████████████████| 641 kB 15.0 MB/s \n", | |
"\u001b[?25hRequirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from docarray) (1.21.6)\n", | |
"Collecting rich>=12.0.0\n", | |
" Downloading rich-12.5.1-py3-none-any.whl (235 kB)\n", | |
"\u001b[K |████████████████████████████████| 235 kB 67.0 MB/s \n", | |
"\u001b[?25hRequirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich>=12.0.0->docarray) (2.6.1)\n", | |
"Collecting commonmark<0.10.0,>=0.9.0\n", | |
" Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)\n", | |
"\u001b[K |████████████████████████████████| 51 kB 8.6 MB/s \n", | |
"\u001b[?25hRequirement already satisfied: typing-extensions<5.0,>=4.0.0 in /usr/local/lib/python3.7/dist-packages (from rich>=12.0.0->docarray) (4.1.1)\n", | |
"Building wheels for collected packages: docarray\n", | |
" Building wheel for docarray (setup.py) ... \u001b[?25l\u001b[?25hdone\n", | |
" Created wheel for docarray: filename=docarray-0.16.5-py3-none-any.whl size=693606 sha256=171ec80ed7b03d48ef6805080b7542371d91b9cb4decd82580d131774e0c95be\n", | |
" Stored in directory: /root/.cache/pip/wheels/03/f6/c0/fc82dc37bab5edfd37220de5689e9bc5667c2bbc290374a1d4\n", | |
"Successfully built docarray\n", | |
"Installing collected packages: commonmark, rich, docarray\n", | |
"Successfully installed commonmark-0.9.1 docarray-0.16.5 rich-12.5.1\n" | |
] | |
} | |
], | |
"source": [ | |
"!pip install docarray" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Importing necessary dependencies\n", | |
"from docarray import Document, DocumentArray" | |
], | |
"metadata": { | |
"id": "eRDNOSTFg_ps" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# 🪡 Data Pre-processing" | |
], | |
"metadata": { | |
"id": "GqA2yqFIh4rv" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from docarray import Document, DocumentArray\n", | |
"# uri=\"https://www.gutenberg.org/files/1342/1342-0.txt\"\n", | |
"uri=\"https://basoss.oss-ap-southeast-1.aliyuncs.com/ebooks/Pride_and_Prejudice.txt\"\n", | |
"doc = Document(uri=uri).load_uri_to_text()\n", | |
"\n", | |
"\n", | |
"# break large text into smaller chunks\n", | |
"docs = DocumentArray(Document(text = s.strip()) for s in doc.text.split('\\n') if s.strip())" | |
], | |
"metadata": { | |
"id": "JWQKqkrDhm4L" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# 🏗 Generate Vector Embeddings \n", | |
"\n", | |
"We use **feature hashing** to generate the vector embeddings because it is fast and space-efficient. It works by applying a hash function to each feature (for text, its tokens) and using the hash value as an index into a fixed-length vector." | |
], | |
"metadata": { | |
"id": "_TJIHs6eiLrw" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# apply feature hashing to embed the DocumentArray\n", | |
"docs.apply(lambda doc: doc.embed_feature_hashing())" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 305 | |
}, | |
"id": "4glBnUHBiAwp", | |
"outputId": "a1056c04-36d0-41ac-8005-97674ac10429" | |
}, | |
"execution_count": null, | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"data": { | |
"text/plain": [ | |
"\n" | |
], | |
"text/html": [ | |
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n", | |
"</pre>\n" | |
] | |
}, | |
"metadata": {} | |
}, | |
{ | |
"output_type": "display_data", | |
"data": { | |
"text/plain": [ | |
"╭────────────────── Documents Summary ───────────────────╮\n", | |
"│ │\n", | |
"│ Type DocumentArrayInMemory │\n", | |
"│ Length \u001b[1;36m12153\u001b[0m │\n", | |
"│ Homogenous Documents \u001b[3;92mTrue\u001b[0m │\n", | |
"│ Common Attributes \u001b[1m(\u001b[0m\u001b[32m'id'\u001b[0m, \u001b[32m'text'\u001b[0m, \u001b[32m'embedding'\u001b[0m\u001b[1m)\u001b[0m │\n", | |
"│ Multimodal dataclass \u001b[3;91mFalse\u001b[0m │\n", | |
"│ │\n", | |
"╰────────────────────────────────────────────────────────╯\n", | |
"╭────────────────────── Attributes Summary ───────────────────────╮\n", | |
"│ │\n", | |
"│ \u001b[1m \u001b[0m\u001b[1mAttribute\u001b[0m\u001b[1m \u001b[0m \u001b[1m \u001b[0m\u001b[1mData type \u001b[0m\u001b[1m \u001b[0m \u001b[1m \u001b[0m\u001b[1m#Unique values\u001b[0m\u001b[1m \u001b[0m \u001b[1m \u001b[0m\u001b[1mHas empty value\u001b[0m\u001b[1m \u001b[0m │\n", | |
"│ ───────────────────────────────────────────────────────────── │\n", | |
"│ embedding \u001b[1m(\u001b[0m\u001b[32m'ndarray'\u001b[0m,\u001b[1m)\u001b[0m \u001b[1;36m12153\u001b[0m \u001b[3;91mFalse\u001b[0m │\n", | |
"│ id \u001b[1m(\u001b[0m\u001b[32m'str'\u001b[0m,\u001b[1m)\u001b[0m \u001b[1;36m12153\u001b[0m \u001b[3;91mFalse\u001b[0m │\n", | |
"│ text \u001b[1m(\u001b[0m\u001b[32m'str'\u001b[0m,\u001b[1m)\u001b[0m \u001b[1;36m12062\u001b[0m \u001b[3;91mFalse\u001b[0m │\n", | |
"│ │\n", | |
"╰─────────────────────────────────────────────────────────────────╯\n" | |
], | |
"text/html": [ | |
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭────────────────── Documents Summary ───────────────────╮\n", | |
"│ │\n", | |
"│ Type DocumentArrayInMemory │\n", | |
"│ Length <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">12153</span> │\n", | |
"│ Homogenous Documents <span style=\"color: #00ff00; text-decoration-color: #00ff00; font-style: italic\">True</span> │\n", | |
"│ Common Attributes <span style=\"font-weight: bold\">(</span><span style=\"color: #008000; text-decoration-color: #008000\">'id'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'text'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'embedding'</span><span style=\"font-weight: bold\">)</span> │\n", | |
"│ Multimodal dataclass <span style=\"color: #ff0000; text-decoration-color: #ff0000; font-style: italic\">False</span> │\n", | |
"│ │\n", | |
"╰────────────────────────────────────────────────────────╯\n", | |
"╭────────────────────── Attributes Summary ───────────────────────╮\n", | |
"│ │\n", | |
"│ <span style=\"font-weight: bold\"> Attribute </span> <span style=\"font-weight: bold\"> Data type </span> <span style=\"font-weight: bold\"> #Unique values </span> <span style=\"font-weight: bold\"> Has empty value </span> │\n", | |
"│ ───────────────────────────────────────────────────────────── │\n", | |
"│ embedding <span style=\"font-weight: bold\">(</span><span style=\"color: #008000; text-decoration-color: #008000\">'ndarray'</span>,<span style=\"font-weight: bold\">)</span> <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">12153</span> <span style=\"color: #ff0000; text-decoration-color: #ff0000; font-style: italic\">False</span> │\n", | |
"│ id <span style=\"font-weight: bold\">(</span><span style=\"color: #008000; text-decoration-color: #008000\">'str'</span>,<span style=\"font-weight: bold\">)</span> <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">12153</span> <span style=\"color: #ff0000; text-decoration-color: #ff0000; font-style: italic\">False</span> │\n", | |
"│ text <span style=\"font-weight: bold\">(</span><span style=\"color: #008000; text-decoration-color: #008000\">'str'</span>,<span style=\"font-weight: bold\">)</span> <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">12062</span> <span style=\"color: #ff0000; text-decoration-color: #ff0000; font-style: italic\">False</span> │\n", | |
"│ │\n", | |
"╰─────────────────────────────────────────────────────────────────╯\n", | |
"</pre>\n" | |
] | |
}, | |
"metadata": {} | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# 🪄 Querying the Data \n", | |
"\n", | |
"Let's take the query sentence \"**she likes the young man**\" and see which lines of Pride and Prejudice match it most closely." | |
], | |
"metadata": { | |
"id": "JdV6P4vQiciB" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# query sentence \n", | |
"query = (Document(text=\"she likes the young man\").embed_feature_hashing().match(docs, limit=3, exclude_self=True, \n", | |
"metric=\"jaccard\", use_scipy=True))" | |
], | |
"metadata": { | |
"id": "hJIctI21ibak" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# fetch the output\n", | |
"output = query.matches[:, ('text', 'scores__jaccard')]" | |
], | |
"metadata": { | |
"id": "5IZXv3rRijY6" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# print the results: row 1 holds the matched texts, row 2 their Jaccard scores\n", | |
"for num, row in enumerate(output, start=1):\n", | |
"    print(num, row)" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "iF7nVdn0kChe", | |
"outputId": "bb002d86-58d2-4dd2-9875-5b68bb80c14d" | |
}, | |
"execution_count": null, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": [ | |
"1 ['turned her eyes on the daughter, she could almost have joined in', 'young man.', 'condescension, expressed what she felt on the occasion; when it']\n", | |
"2 [{'value': 0.6666666666666666}, {'value': 0.6666666666666666}, {'value': 0.6666666666666666}]\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# Next Steps\n", | |
"\n", | |
"### Building into a real-world application\n", | |
"\n", | |
"In a future notebook we'll use **[Jina's neural search framework](https://github.com/jina-ai/jina/)** and **[Jina Hub Executors](https://hub.jina.ai)** to build a [real-world fashion search engine](http://examples.jina.ai/fashion) with minimal lines of code.\n", | |
"\n", | |
"![](https://github.com/alexcg1/jina-multimodal-fashion-search/raw/main/demo.gif)" | |
], | |
"metadata": { | |
"id": "IGSPWBYVllzM" | |
} | |
} | |
] | |
} |
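
The embed-and-match steps in the notebook above can be illustrated without any library. The following is a minimal sketch under stated assumptions, not docarray's actual implementation: the helper names `hash_embed`, `jaccard_distance`, and `top_matches`, the MD5-based bucket choice, the regex tokenizer, and the 256-dimension default are all hypothetical choices made for illustration. It hashes each token to an index in a fixed-length count vector, then ranks candidate lines by a Jaccard distance computed over the non-zero positions.

```python
import hashlib
import re
from typing import List, Tuple


def hash_embed(text: str, n_dim: int = 256) -> List[int]:
    """Map text to a fixed-length count vector via the hashing trick."""
    vec = [0] * n_dim
    for token in re.findall(r"\w+", text.lower()):
        # Assumption: MD5 is used only as a stable, seed-free hash function.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_dim
        vec[index] += 1
    return vec


def jaccard_distance(a: List[int], b: List[int]) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the sets of non-zero positions."""
    either = sum(1 for x, y in zip(a, b) if x or y)
    both = sum(1 for x, y in zip(a, b) if x and y)
    return 1.0 - both / either if either else 0.0


def top_matches(query: str, lines: List[str], limit: int = 3) -> List[Tuple[float, str]]:
    """Return the `limit` lines closest to the query (smallest distance first)."""
    q = hash_embed(query)
    scored = [(jaccard_distance(q, hash_embed(line)), line) for line in lines]
    return sorted(scored)[:limit]


if __name__ == "__main__":
    # Three lines taken from the notebook's printed output, plus one filler line.
    corpus = [
        "turned her eyes on the daughter, she could almost have joined in",
        "young man.",
        "condescension, expressed what she felt on the occasion; when it",
        "It is a truth universally acknowledged",
    ]
    for distance, line in top_matches("she likes the young man", corpus):
        print(round(distance, 3), line)
```

Because the vector length is fixed, distinct tokens can collide in the same bucket; a larger `n_dim` makes collisions rarer at the cost of more memory. Smaller distances mean closer matches, so the results are sorted in ascending order.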