@lychrel
Created August 6, 2020 01:05
arXiv Citation Recommendations via Collaborative Filtering
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "ArxivCitationRecommender.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "FgiPvjubRjfz",
"colab_type": "text"
},
"source": [
"## Citation Recommendations via Collaborative Filtering\n",
"\n",
"Download the [internal citation data](https://www.kaggle.com/Cornell-University/arxiv?select=internal-citations.json) from the Kaggle arXiv dataset and upload the zip file using the upload cell below.\n",
"\n",
"*Thanks to [this fastai.collab tutorial](https://towardsdatascience.com/collaborative-filtering-using-fastai-a2ec5a2a4049)*"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YQLo0E9TRtgA",
"colab_type": "text"
},
"source": [
"### Imports"
]
},
{
"cell_type": "code",
"metadata": {
"id": "A9BP2qm9LHWX",
"colab_type": "code",
"colab": {}
},
"source": [
"import fastai\n",
"from google.colab import files\n",
"import json\n",
"import pandas as pd\n",
"from tqdm import tqdm"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "vLeyIORDRzbG",
"colab_type": "text"
},
"source": [
"### Upload/unzip citation data"
]
},
{
"cell_type": "code",
"metadata": {
"id": "hblNi-B1LJRw",
"colab_type": "code",
"colab": {}
},
"source": [
"files.upload()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "qAQjUy4-LUXl",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 51
},
"outputId": "4a60f36a-f9bf-4e4c-ae59-81da03acae0b"
},
"source": [
"# -o overwrites an existing internal-citations.json instead of stopping at the replace prompt\n",
"!unzip -o 612177_1135627_compressed_internal-citations.json.zip"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "UyXB-TztR67B",
"colab_type": "text"
},
"source": [
"### Get citation JSON"
]
},
{
"cell_type": "code",
"metadata": {
"id": "q0fxHmpoLhG3",
"colab_type": "code",
"colab": {}
},
"source": [
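"# Load the mapping from each paper ID to the list of arXiv papers it cites\n",
"with open(\"internal-citations.json\", \"r\") as fp:\n",
"    citations = json.load(fp)\n",
"\n",
"# Peek at the first ten paper IDs\n",
"print(list(citations.keys())[:10])"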
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "z-bWdHr9MHAB",
"colab_type": "code",
"colab": {}
},
"source": [
"# Show the first ten papers with their citation lists\n",
"for paper, list_of_citations in list(citations.items())[:10]:\n",
"    print(f\"{paper}: {list_of_citations}\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "M0zzxz1VR9ql",
"colab_type": "text"
},
"source": [
"### Generate Citation DF"
]
},
{
"cell_type": "code",
"metadata": {
"id": "6P6Vo0mqMQYF",
"colab_type": "code",
"colab": {}
},
"source": [
"# Flatten the mapping into one row per (citing paper, cited paper) pair\n",
"citing_papers = []\n",
"citees = []\n",
"for paper, list_of_citations in tqdm(citations.items()):\n",
"    for citation in list_of_citations:\n",
"        citing_papers.append(paper)\n",
"        citees.append(citation)\n",
"\n",
"# Every observed citation is a positive example, so the target is a constant 1.0\n",
"citation_df = pd.DataFrame({'paperID': citing_papers, 'citationID': citees, 'target': 1.0})"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "3gORPN8aNeTk",
"colab_type": "code",
"colab": {}
},
"source": [
"citation_df"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "gRGwYtRmSDGj",
"colab_type": "text"
},
"source": [
"### CF via fastai"
]
},
{
"cell_type": "code",
"metadata": {
"id": "VnSnzTqqNnd_",
"colab_type": "code",
"colab": {}
},
"source": [
"from fastai.collab import *\n",
"from fastai.tabular import *"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "ESwlE3vJNv-W",
"colab_type": "code",
"colab": {}
},
"source": [
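"# fastai v1 API: by default the first three columns are read as user, item, and rating,\n",
"# i.e. paperID, citationID, and target here; 20% of the rows are held out for validation\n",
"data = CollabDataBunch.from_df(citation_df, seed=42, valid_pct=0.2)"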
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "ylhKS5PAN39l",
"colab_type": "code",
"colab": {}
},
"source": [
"y_range = [0.0, 1.0]\n",
"learn = collab_learner(data, n_factors=50, y_range=y_range, wd=1e-1)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "8uUDFUfZOHHy",
"colab_type": "code",
"colab": {}
},
"source": [
"learn.fit_one_cycle(5, 5e-3)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "7ZnyuIYiOPRx",
"colab_type": "code",
"colab": {}
},
"source": [
""
],
"execution_count": null,
"outputs": []
}
]
}
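
The "Generate Citation DF" cell in the notebook turns a mapping from paper ID to cited IDs into one row per citation. A minimal stand-alone sketch of that flattening step follows, written in plain Python so it runs without Colab, pandas, or fastai. The function name `flatten_citations` and the toy IDs are made up for illustration and do not come from the notebook or the dataset.

```python
from typing import Dict, List, Tuple


def flatten_citations(citations: Dict[str, List[str]]) -> List[Tuple[str, str, float]]:
    """Turn {paper: [cited, ...]} into (paperID, citationID, target) rows.

    Hypothetical helper: mirrors the notebook's loop, with a constant
    target of 1.0 for every observed citation.
    """
    rows = []
    for paper, cited_ids in citations.items():
        for cited in cited_ids:
            rows.append((paper, cited, 1.0))
    return rows


# Toy input with invented IDs; a paper with no citations yields no rows.
toy = {
    "paper-A": ["paper-B", "paper-C"],
    "paper-B": [],
}
rows = flatten_citations(toy)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows, columns=['paperID', 'citationID', 'target'])` should give a frame with the same columns as `citation_df`.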