afender/multi_gpu_pagerank.ipynb

## multi_gpu_pagerank.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Multi-GPU PageRank workflow on Twitter\n",
    "##### Alex Fender\n",
    "\n",
    "In this notebook, we will show how to use multi-GPU features in cuGraph to compute the PageRank of each user in Twitter's dataset.\n",
    "\n",
    "Please be aware that your system may be different, and you may need to modify the code or install packages to run the below examples. If you think you have found a bug or an error, please file an issue in [cuGraph](https://github.com/rapidsai/cugraph/issues)\n",
    "\n",
    "This notebook was run on 2 NVIDIA Tesla V100 GPUs (connected with nvlink) using RAPIDS 0.9.0 and CUDA 10.0. \n",
    "\n",
    "## Introduction\n",
    "Pagerank is measure of the relative importance of a vertex based on the relative importance of its neighbors.  PageRank was invented by Google Inc. and is (was) used to rank its search results. PageRank uses the connectivity information of a graph to rank the importance of each vertex. See [Wikipedia](https://en.wikipedia.org/wiki/PageRank) for more details on the algorithm.\n",
    "\n",
    "CuGraph's multi-GPU features leverage Dask. RAPIDS has other projects based on Dask such as dask-cudf and dask-cuda. These products will also be used in this example. Check out [RAPIDS.ai](https://rapids.ai/) to learn more about these technologies.\n",
    "\n",
    "---\n",
    "\n",
    "To compute the Pagerank with cuGraph we use:<br>\n",
    "\n",
    "```python\n",
    "cugraph.dask.pagerank.pagerank(edge_list, alpha=0.85, max_iter=30)\n",
    "```\n",
    "Parameters\n",
    "\n",
    "*  *edge_list* : `dask_cudf.DataFrame`<br>\n",
    "Contain the connectivity information as an edge list. Source 'src' and destination 'dst' columns must be of type 'int32'. Edge weights are not used for this algorithm. Indices must be in the range [0, V-1], where V is the global number of vertices. The input edge list should be provided in dask-cudf DataFrame with one partition per GPU.\n",
    "*  *alpha* : `float`<br>\n",
    "The damping factor alpha represents the probability to follow an outgoing edge, standard value is 0.85. Thus, 1.0-alpha is the probability to “teleport” to a random vertex. Alpha should be greater than 0.0 and strictly lower than 1.0.\n",
    "* *max_iter* : `int`<br>\n",
    "The maximum number of iterations before an answer is returned. If this value is lower or equal to 0 cuGraph will use the default value, which is 30. In this notebook, we will use 20 to compare against published results.<br>\n",
    "\n",
    "Returns\n",
    "\n",
    "* *PageRank* : `dask_cudf.DataFrame`<br>\n",
    "Dask GPU DataFrame containing two columns of size V: the vertex identifiers and the corresponding PageRank values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "We will be analyzing 41.7 million user profiles and 1.47 billion social relations from the Twitter dataset.  The CSV file is 26GB and was collected in :<br>\n",
    "*What is Twitter, a social network or a news media? Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010.*<br> \n",
    "\n",
    "---\n",
    "\n",
    "The fastest way to obtain the dataset is to run :\n",
    "```bash\n",
    "sh ./get_data.sh\n",
    "```\n",
    "\n",
    "Please refer to the README for further information and more options on how to obtain this dataset.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-GPU PageRank with cuGraph\n",
    "### Basic setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's check out our hardware setup\n",
    "!nvidia-smi\n",
    "\n",
    "# GPUs should be connected with NVlink\n",
    "!nvidia-smi nvlink --status\n",
    "\n",
    "# For best performance, we can limit the number of available devices\n",
    "# For this dataset, we can use only 2 Tesla V100(32GB) or 4 Tesla P100 (16GB)\n",
    "import os\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0,1\"\n",
    "\n",
    "# List available devices\n",
    "!echo Available GPUs: $CUDA_VISIBLE_DEVICES"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import needed libraries. We recommend using cugraph_dev env through conda\n",
    "import time\n",
    "from dask.distributed import Client, wait\n",
    "import dask_cudf\n",
    "from dask_cuda import LocalCUDACluster\n",
    "import cugraph.dask.pagerank as dcg"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup multi-GPU and Dask\n",
    "\n",
    "Before we get started, we need to setup a Dask local cluster of workers to execute our work and a client to coordinate and schedule work for that cluster. As we see below, we can initiate a `cluster` and `client` using only 2 lines of code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster = LocalCUDACluster(threads_per_worker=1)\n",
    "client = Client(cluster)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read the data from disk\n",
    "cuGraph depends on dask-cudf for data loading and the initial DataFrame creation. The CSV data file contains an edge list, which represents the connection of a vertex to another. The source to destination pairs is what is known as Coordinate Format (COO). In this test case, the data is just two columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# File path, assuming Notebook directory\n",
    "input_data_path = r\"twitter-2010.csv\"\n",
    "\n",
    "# Helper function to set the reader chunk size to automatically get one partition per GPU  \n",
    "chunksize = dcg.get_chunksize(input_data_path)\n",
    "\n",
    "# Start timer\n",
    "t_start = time.time()\n",
    "\n",
    "# Multi-GPU CSV reader\n",
    "e_list = dask_cudf.read_csv(input_data_path, chunksize = chunksize, delimiter=' ', names=['src', 'dst'], dtype=['int32', 'int32'])\n",
    "\n",
    "# Wait for the lazy reader\n",
    "tmp = wait(client.compute(e_list.to_delayed()))\n",
    "\n",
    "# Print time\n",
    "print(time.time()-t_start, \"s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Call the Multi-GPU PageRank algorithm\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Start timer\n",
    "t_start = time.time()\n",
    "\n",
    "# Get the pagerank scores\n",
    "pr_ddf = dcg.pagerank(e_list, max_iter=20)\n",
    "\n",
    "# Print time\n",
    "print(time.time()-t_start, \"s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It was that easy! PageRank should only take a few seconds to run on this 26GB input with 2 Tesla V100 GPUs.<br>\n",
    "Check out how it compares to published results in the [Annex](#annex_cell)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Further analysis on the Pagerank result\n",
    "\n",
    "We can now identify the most influent users in the network.<br>\n",
    "Notice that the PageRank result can fit in one GPU. Hence, we can gather it in a regular `cudf.DataFrame`. We will then sort by PageRank value and print the *Top 3*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Start timer\n",
    "t_start = time.time()\n",
    "\n",
    "# Dask Data Frame to regular cuDF Data Frame \n",
    "pr_df = pr_ddf.compute()\n",
    "\n",
    "# Sort, descending order\n",
    "pr_sorted_df = pr_df.sort_values('pagerank',ascending=False)\n",
    "\n",
    "# Print time\n",
    "print(time.time()-t_start, \"s\")\n",
    "\n",
    "# Print the Top 3\n",
    "print(pr_sorted_df.head(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now use the [map](https://s3.us-east-2.amazonaws.com/rapidsai-data/cugraph/benchmark/twitter-2010-ids.csv.gz) to convert Vertex ID into to Twitter's numeric ID. The user name can also be retrieved using the [TwitterID](https://tweeterid.com/) web app.<br>\n",
    "The table below shows more information on our *Top 3*. Notice that this ranking is much better at capturing network influence compared the number of followers for instance. Further analysis of this dataset was published [here](https://doi.org/10.1145/1772690.1772751).\n",
    "\n",
    "| Vertex ID\t| Twitter ID\t| User name\t| Description |\n",
    "| --------- |  ---------   | --------   |   ----------  |\n",
    "| 21513299\t| 813286\t| barackobama\t| US President (2009-2017) |\n",
    "| 23933989\t| 14224719\t| 10DowningStreet | UK Prime Minister office |\n",
    "| 23933986\t| 15131310\t| WholeFoods\t| Food store from Austin |\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Close the multi-GPU environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "client.close()\n",
    "cluster.close()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Annex\n",
    "<a id='annex_cell'></a>\n",
    "An experiment comparing various porducts for this workflow was published in *GraphX: Graph Processing in a Distributed Dataflow Framework,OSDI, 2014*. They used 16 m2.4xlarge worker nodes on Amazon EC2. There was a total of 128 CPU cores and 1TB of memory in this 2014 setup.\n",
    "\n",
    "![twitter-2010-spark.png](https://github.com/rapidsai/notebooks-contrib/blob/master/intermediate_notebooks/examples/cugraph/twitter-2010-spark.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "Copyright (c) 2019, NVIDIA CORPORATION.\n",
    "\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n",
    "___"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Multi-GPU PageRank workflow on Twitter\n",
	"##### Alex Fender\n",
	"\n",
	"In this notebook, we will show how to use multi-GPU features in cuGraph to compute the PageRank of each user in Twitter's dataset.\n",
	"\n",
	"Please be aware that your system may be different, and you may need to modify the code or install packages to run the below examples. If you think you have found a bug or an error, please file an issue in [cuGraph](https://github.com/rapidsai/cugraph/issues)\n",
	"\n",
	"This notebook was run on 2 NVIDIA Tesla V100 GPUs (connected with nvlink) using RAPIDS 0.9.0 and CUDA 10.0. \n",
	"\n",
	"## Introduction\n",
	"Pagerank is measure of the relative importance of a vertex based on the relative importance of its neighbors. PageRank was invented by Google Inc. and is (was) used to rank its search results. PageRank uses the connectivity information of a graph to rank the importance of each vertex. See [Wikipedia](https://en.wikipedia.org/wiki/PageRank) for more details on the algorithm.\n",
	"\n",
	"CuGraph's multi-GPU features leverage Dask. RAPIDS has other projects based on Dask such as dask-cudf and dask-cuda. These products will also be used in this example. Check out [RAPIDS.ai](https://rapids.ai/) to learn more about these technologies.\n",
	"\n",
	"---\n",
	"\n",
	"To compute the Pagerank with cuGraph we use:<br>\n",
	"\n",
	"```python\n",
	"cugraph.dask.pagerank.pagerank(edge_list, alpha=0.85, max_iter=30)\n",
	"```\n",
	"Parameters\n",
	"\n",
	"* edge_list : `dask_cudf.DataFrame`<br>\n",
	"Contain the connectivity information as an edge list. Source 'src' and destination 'dst' columns must be of type 'int32'. Edge weights are not used for this algorithm. Indices must be in the range [0, V-1], where V is the global number of vertices. The input edge list should be provided in dask-cudf DataFrame with one partition per GPU.\n",
	"* alpha : `float`<br>\n",
	"The damping factor alpha represents the probability to follow an outgoing edge, standard value is 0.85. Thus, 1.0-alpha is the probability to “teleport” to a random vertex. Alpha should be greater than 0.0 and strictly lower than 1.0.\n",
	"* max_iter : `int`<br>\n",
	"The maximum number of iterations before an answer is returned. If this value is lower or equal to 0 cuGraph will use the default value, which is 30. In this notebook, we will use 20 to compare against published results.<br>\n",
	"\n",
	"Returns\n",
	"\n",
	"* PageRank : `dask_cudf.DataFrame`<br>\n",
	"Dask GPU DataFrame containing two columns of size V: the vertex identifiers and the corresponding PageRank values."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Data\n",
	"We will be analyzing 41.7 million user profiles and 1.47 billion social relations from the Twitter dataset. The CSV file is 26GB and was collected in :<br>\n",
	"What is Twitter, a social network or a news media? Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010.<br> \n",
	"\n",
	"---\n",
	"\n",
	"The fastest way to obtain the dataset is to run :\n",
	"```bash\n",
	"sh ./get_data.sh\n",
	"```\n",
	"\n",
	"Please refer to the README for further information and more options on how to obtain this dataset.\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Multi-GPU PageRank with cuGraph\n",
	"### Basic setup"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# Let's check out our hardware setup\n",
	"!nvidia-smi\n",
	"\n",
	"# GPUs should be connected with NVlink\n",
	"!nvidia-smi nvlink --status\n",
	"\n",
	"# For best performance, we can limit the number of available devices\n",
	"# For this dataset, we can use only 2 Tesla V100(32GB) or 4 Tesla P100 (16GB)\n",
	"import os\n",
	"os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0,1\"\n",
	"\n",
	"# List available devices\n",
	"!echo Available GPUs: $CUDA_VISIBLE_DEVICES"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# Import needed libraries. We recommend using cugraph_dev env through conda\n",
	"import time\n",
	"from dask.distributed import Client, wait\n",
	"import dask_cudf\n",
	"from dask_cuda import LocalCUDACluster\n",
	"import cugraph.dask.pagerank as dcg"
	]
	},
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Setup multi-GPU and Dask\n",
	"\n",
	"Before we get started, we need to setup a Dask local cluster of workers to execute our work and a client to coordinate and schedule work for that cluster. As we see below, we can initiate a `cluster` and `client` using only 2 lines of code."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"cluster = LocalCUDACluster(threads_per_worker=1)\n",
	"client = Client(cluster)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Read the data from disk\n",
	"cuGraph depends on dask-cudf for data loading and the initial DataFrame creation. The CSV data file contains an edge list, which represents the connection of a vertex to another. The source to destination pairs is what is known as Coordinate Format (COO). In this test case, the data is just two columns. "
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# File path, assuming Notebook directory\n",
	"input_data_path = r\"twitter-2010.csv\"\n",
	"\n",
	"# Helper function to set the reader chunk size to automatically get one partition per GPU \n",
	"chunksize = dcg.get_chunksize(input_data_path)\n",
	"\n",
	"# Start timer\n",
	"t_start = time.time()\n",
	"\n",
	"# Multi-GPU CSV reader\n",
	"e_list = dask_cudf.read_csv(input_data_path, chunksize = chunksize, delimiter=' ', names=['src', 'dst'], dtype=['int32', 'int32'])\n",
	"\n",
	"# Wait for the lazy reader\n",
	"tmp = wait(client.compute(e_list.to_delayed()))\n",
	"\n",
	"# Print time\n",
	"print(time.time()-t_start, \"s\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Call the Multi-GPU PageRank algorithm\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"# Start timer\n",
	"t_start = time.time()\n",
	"\n",
	"# Get the pagerank scores\n",
	"pr_ddf = dcg.pagerank(e_list, max_iter=20)\n",
	"\n",
	"# Print time\n",
	"print(time.time()-t_start, \"s\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"It was that easy! PageRank should only take a few seconds to run on this 26GB input with 2 Tesla V100 GPUs.<br>\n",
	"Check out how it compares to published results in the [Annex](#annex_cell)."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Further analysis on the Pagerank result\n",
	"\n",
	"We can now identify the most influent users in the network.<br>\n",
	"Notice that the PageRank result can fit in one GPU. Hence, we can gather it in a regular `cudf.DataFrame`. We will then sort by PageRank value and print the Top 3."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"# Start timer\n",
	"t_start = time.time()\n",
	"\n",
	"# Dask Data Frame to regular cuDF Data Frame \n",
	"pr_df = pr_ddf.compute()\n",
	"\n",
	"# Sort, descending order\n",
	"pr_sorted_df = pr_df.sort_values('pagerank',ascending=False)\n",
	"\n",
	"# Print time\n",
	"print(time.time()-t_start, \"s\")\n",
	"\n",
	"# Print the Top 3\n",
	"print(pr_sorted_df.head(3))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We can now use the [map](https://s3.us-east-2.amazonaws.com/rapidsai-data/cugraph/benchmark/twitter-2010-ids.csv.gz) to convert Vertex ID into to Twitter's numeric ID. The user name can also be retrieved using the [TwitterID](https://tweeterid.com/) web app.<br>\n",
	"The table below shows more information on our Top 3. Notice that this ranking is much better at capturing network influence compared the number of followers for instance. Further analysis of this dataset was published [here](https://doi.org/10.1145/1772690.1772751).\n",
	"\n",
	"\| Vertex ID\t\| Twitter ID\t\| User name\t\| Description \|\n",
	"\| --------- \| --------- \| -------- \| ---------- \|\n",
	"\| 21513299\t\| 813286\t\| barackobama\t\| US President (2009-2017) \|\n",
	"\| 23933989\t\| 14224719\t\| 10DowningStreet \| UK Prime Minister office \|\n",
	"\| 23933986\t\| 15131310\t\| WholeFoods\t\| Food store from Austin \|\n",
	"\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Close the multi-GPU environment"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"client.close()\n",
	"cluster.close()"
	]
	},
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Annex\n",
	"<a id='annex_cell'></a>\n",
	"An experiment comparing various porducts for this workflow was published in GraphX: Graph Processing in a Distributed Dataflow Framework,OSDI, 2014. They used 16 m2.4xlarge worker nodes on Amazon EC2. There was a total of 128 CPU cores and 1TB of memory in this 2014 setup.\n",
	"\n",
	"![twitter-2010-spark.png](https://github.com/rapidsai/notebooks-contrib/blob/master/intermediate_notebooks/examples/cugraph/twitter-2010-spark.png)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"___\n",
	"Copyright (c) 2019, NVIDIA CORPORATION.\n",
	"\n",
	"Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n",
	"\n",
	"Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n",
	"___"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.7"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}