@linkerlin
Created September 9, 2022 23:56
jina-ai.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU",
"gpuClass": "standard"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/linkerlin/bd27168101a5ec0775086e2b7d4741ae/jina-ai.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# ⏰ Install & Import Dependencies"
],
"metadata": {
"id": "VZGABxkAge3q"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "lkyGStg_gZKY",
"outputId": "1d5c3f2c-d0da-42c4-9162-2a5ada537a8c"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
"Collecting docarray\n",
" Downloading docarray-0.16.5.tar.gz (641 kB)\n",
"\u001b[K |████████████████████████████████| 641 kB 15.0 MB/s \n",
"\u001b[?25hRequirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from docarray) (1.21.6)\n",
"Collecting rich>=12.0.0\n",
" Downloading rich-12.5.1-py3-none-any.whl (235 kB)\n",
"\u001b[K |████████████████████████████████| 235 kB 67.0 MB/s \n",
"\u001b[?25hRequirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich>=12.0.0->docarray) (2.6.1)\n",
"Collecting commonmark<0.10.0,>=0.9.0\n",
" Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)\n",
"\u001b[K |████████████████████████████████| 51 kB 8.6 MB/s \n",
"\u001b[?25hRequirement already satisfied: typing-extensions<5.0,>=4.0.0 in /usr/local/lib/python3.7/dist-packages (from rich>=12.0.0->docarray) (4.1.1)\n",
"Building wheels for collected packages: docarray\n",
" Building wheel for docarray (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for docarray: filename=docarray-0.16.5-py3-none-any.whl size=693606 sha256=171ec80ed7b03d48ef6805080b7542371d91b9cb4decd82580d131774e0c95be\n",
" Stored in directory: /root/.cache/pip/wheels/03/f6/c0/fc82dc37bab5edfd37220de5689e9bc5667c2bbc290374a1d4\n",
"Successfully built docarray\n",
"Installing collected packages: commonmark, rich, docarray\n",
"Successfully installed commonmark-0.9.1 docarray-0.16.5 rich-12.5.1\n"
]
}
],
"source": [
"!pip install docarray"
]
},
{
"cell_type": "code",
"source": [
"# Importing necessary dependencies\n",
"from docarray import Document, DocumentArray"
],
"metadata": {
"id": "eRDNOSTFg_ps"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# 🪡 Data Pre-processing"
],
"metadata": {
"id": "GqA2yqFIh4rv"
}
},
{
"cell_type": "code",
"source": [
"from docarray import Document, DocumentArray\n",
"# uri=\"https://www.gutenberg.org/files/1342/1342-0.txt\"\n",
"uri=\"https://basoss.oss-ap-southeast-1.aliyuncs.com/ebooks/Pride_and_Prejudice.txt\"\n",
"doc = Document(uri=uri).load_uri_to_text()\n",
"\n",
"\n",
"# break large text into smaller chunks\n",
"docs = DocumentArray(Document(text = s.strip()) for s in doc.text.split('\\n') if s.strip())"
],
"metadata": {
"id": "JWQKqkrDhm4L"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# 🏗 Generate Vector Embeddings \n",
"\n",
"We use **feature hashing** to generate the vecor embeddings as its the faster and space-efficient way. It works by taking the features and applying a hash function that can hash the values and return them as indices."
],
"metadata": {
"id": "_TJIHs6eiLrw"
}
},
{
"cell_type": "code",
"source": [
"# apply feature hashing to embed the DocumentArray\n",
"docs.apply(lambda doc: doc.embed_feature_hashing())"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 305
},
"id": "4glBnUHBiAwp",
"outputId": "a1056c04-36d0-41ac-8005-97674ac10429"
},
"execution_count": null,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"\n"
],
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
]
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"╭────────────────── Documents Summary ───────────────────╮\n",
"│ │\n",
"│ Type DocumentArrayInMemory │\n",
"│ Length \u001b[1;36m12153\u001b[0m │\n",
"│ Homogenous Documents \u001b[3;92mTrue\u001b[0m │\n",
"│ Common Attributes \u001b[1m(\u001b[0m\u001b[32m'id'\u001b[0m, \u001b[32m'text'\u001b[0m, \u001b[32m'embedding'\u001b[0m\u001b[1m)\u001b[0m │\n",
"│ Multimodal dataclass \u001b[3;91mFalse\u001b[0m │\n",
"│ │\n",
"╰────────────────────────────────────────────────────────╯\n",
"╭────────────────────── Attributes Summary ───────────────────────╮\n",
"│ │\n",
"│ \u001b[1m \u001b[0m\u001b[1mAttribute\u001b[0m\u001b[1m \u001b[0m \u001b[1m \u001b[0m\u001b[1mData type \u001b[0m\u001b[1m \u001b[0m \u001b[1m \u001b[0m\u001b[1m#Unique values\u001b[0m\u001b[1m \u001b[0m \u001b[1m \u001b[0m\u001b[1mHas empty value\u001b[0m\u001b[1m \u001b[0m │\n",
"│ ───────────────────────────────────────────────────────────── │\n",
"│ embedding \u001b[1m(\u001b[0m\u001b[32m'ndarray'\u001b[0m,\u001b[1m)\u001b[0m \u001b[1;36m12153\u001b[0m \u001b[3;91mFalse\u001b[0m │\n",
"│ id \u001b[1m(\u001b[0m\u001b[32m'str'\u001b[0m,\u001b[1m)\u001b[0m \u001b[1;36m12153\u001b[0m \u001b[3;91mFalse\u001b[0m │\n",
"│ text \u001b[1m(\u001b[0m\u001b[32m'str'\u001b[0m,\u001b[1m)\u001b[0m \u001b[1;36m12062\u001b[0m \u001b[3;91mFalse\u001b[0m │\n",
"│ │\n",
"╰─────────────────────────────────────────────────────────────────╯\n"
],
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭────────────────── Documents Summary ───────────────────╮\n",
"│ │\n",
"│ Type DocumentArrayInMemory │\n",
"│ Length <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">12153</span> │\n",
"│ Homogenous Documents <span style=\"color: #00ff00; text-decoration-color: #00ff00; font-style: italic\">True</span> │\n",
"│ Common Attributes <span style=\"font-weight: bold\">(</span><span style=\"color: #008000; text-decoration-color: #008000\">'id'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'text'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'embedding'</span><span style=\"font-weight: bold\">)</span> │\n",
"│ Multimodal dataclass <span style=\"color: #ff0000; text-decoration-color: #ff0000; font-style: italic\">False</span> │\n",
"│ │\n",
"╰────────────────────────────────────────────────────────╯\n",
"╭────────────────────── Attributes Summary ───────────────────────╮\n",
"│ │\n",
"│ <span style=\"font-weight: bold\"> Attribute </span> <span style=\"font-weight: bold\"> Data type </span> <span style=\"font-weight: bold\"> #Unique values </span> <span style=\"font-weight: bold\"> Has empty value </span> │\n",
"│ ───────────────────────────────────────────────────────────── │\n",
"│ embedding <span style=\"font-weight: bold\">(</span><span style=\"color: #008000; text-decoration-color: #008000\">'ndarray'</span>,<span style=\"font-weight: bold\">)</span> <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">12153</span> <span style=\"color: #ff0000; text-decoration-color: #ff0000; font-style: italic\">False</span> │\n",
"│ id <span style=\"font-weight: bold\">(</span><span style=\"color: #008000; text-decoration-color: #008000\">'str'</span>,<span style=\"font-weight: bold\">)</span> <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">12153</span> <span style=\"color: #ff0000; text-decoration-color: #ff0000; font-style: italic\">False</span> │\n",
"│ text <span style=\"font-weight: bold\">(</span><span style=\"color: #008000; text-decoration-color: #008000\">'str'</span>,<span style=\"font-weight: bold\">)</span> <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">12062</span> <span style=\"color: #ff0000; text-decoration-color: #ff0000; font-style: italic\">False</span> │\n",
"│ │\n",
"╰─────────────────────────────────────────────────────────────────╯\n",
"</pre>\n"
]
},
"metadata": {}
}
]
},
{
"cell_type": "markdown",
"source": [
"# 🪄 Querying the Data \n",
"\n",
"Let's take the query sentence \"**she entered the room**\" from Pride and Prejudice and see what response we get."
],
"metadata": {
"id": "JdV6P4vQiciB"
}
},
{
"cell_type": "code",
"source": [
"# query sentence \n",
"query = (Document(text=\"she likes the young man\").embed_feature_hashing().match(docs, limit=3, exclude_self=True, \n",
"metric=\"jaccard\", use_scipy=True))"
],
"metadata": {
"id": "hJIctI21ibak"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# fetch the output\n",
"output = query.matches[:, ('text', 'scores__jaccard')]"
],
"metadata": {
"id": "5IZXv3rRijY6"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# print the results\n",
"num=0\n",
"for i in (output):\n",
" num+=1\n",
" print(num,i)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "iF7nVdn0kChe",
"outputId": "bb002d86-58d2-4dd2-9875-5b68bb80c14d"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"1 ['turned her eyes on the daughter, she could almost have joined in', 'young man.', 'condescension, expressed what she felt on the occasion; when it']\n",
"2 [{'value': 0.6666666666666666}, {'value': 0.6666666666666666}, {'value': 0.6666666666666666}]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"# Next Steps\n",
"\n",
"### Building into a real world application\n",
"\n",
"In a future notebook we'll use **[Jina's neural search framework](https://github.com/jina-ai/jina/)** and **[Jina Hub Executors](https://hub.jina.ai)** to build a [real world fashion search engine](http://examples.jina.ai/fashion) with minimal lines of code.\n",
"\n",
"![](https://github.com/alexcg1/jina-multimodal-fashion-search/raw/main/demo.gif)"
],
"metadata": {
"id": "IGSPWBYVllzM"
}
}
]
}