{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Large and distributed datasets in Multi-GPU PageRank\n",
"##### Alex Fender\n",
"\n",
"In this notebook, we will show how to use multi-GPU features in cuGraph to run the PageRank on a 300GB dataset on a DGX2.\n",
"\n",
"Please be aware that your system may be different, and you may need to modify the code or install packages to run the below examples. If you think you have found a bug or an error, please file an issue in [cuGraph](https://github.com/rapidsai/cugraph/issues)\n",
"\n",
"This notebook was run using RAPIDS 0.10 and CUDA 10.0. \n",
"\n",
"## Introduction\n",
"Pagerank is measure of the relative importance of a vertex based on the relative importance of its neighbors. PageRank was invented by Google Inc. and is (was) used to rank its search results. PageRank uses the connectivity information of a graph to rank the importance of each vertex. See [Wikipedia](https://en.wikipedia.org/wiki/PageRank) for more details on the algorithm.\n",
"\n",
"CuGraph's multi-GPU features leverage Dask. RAPIDS has other projects based on Dask such as dask-cudf and dask-cuda. These products will also be used in this example. Check out [RAPIDS.ai](https://rapids.ai/) to learn more about these technologies.\n",
"\n",
"---\n",
"\n",
"To compute the Pagerank with cuGraph we use:<br>\n",
"\n",
"```python\n",
"cugraph.dask.pagerank.pagerank(edge_list, alpha=0.85, max_iter=30)\n",
"```\n",
"Parameters\n",
"\n",
"* *edge_list* : `dask_cudf.DataFrame`<br>\n",
"Contain the connectivity information as an edge list. Source 'src' and destination 'dst' columns must be of type 'int32'. Edge weights are not used for this algorithm. Indices must be in the range [0, V-1], where V is the global number of vertices. The input edge list should be provided in dask-cudf DataFrame with one partition per GPU.\n",
"* *alpha* : `float`<br>\n",
"The damping factor alpha represents the probability to follow an outgoing edge, standard value is 0.85. Thus, 1.0-alpha is the probability to “teleport” to a random vertex. Alpha should be greater than 0.0 and strictly lower than 1.0.\n",
"* *max_iter* : `int`<br>\n",
"The maximum number of iterations before an answer is returned. If this value is lower or equal to 0 cuGraph will use the default value, which is 30. In this notebook, we will use 20 to compare against published results.<br>\n",
"\n",
"Returns\n",
"\n",
"* *PageRank* : `dask_cudf.DataFrame`<br>\n",
"Dask GPU DataFrame containing two columns of size V: the vertex identifiers and the corresponding PageRank values."
]
},
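{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small, hypothetical illustration of the expected input (not part of the benchmark below), an edge list can be built as a dask_cudf DataFrame with 'int32' source and destination columns and one partition per GPU. The column names 'src' and 'dst' and the single partition are assumptions for this toy sketch, and a Dask client (set up later in this notebook) must already exist before calling the algorithm:\n",
"```python\n",
"import cudf\n",
"import dask_cudf\n",
"import cugraph.dask.pagerank as dcg\n",
"\n",
"# Toy directed cycle 0->1, 1->2, 2->0; indices are in [0, V-1]\n",
"gdf = cudf.DataFrame()\n",
"gdf['src'] = cudf.Series([0, 1, 2], dtype='int32')\n",
"gdf['dst'] = cudf.Series([1, 2, 0], dtype='int32')\n",
"\n",
"# One partition per GPU; a single partition for this single-GPU sketch\n",
"edge_list = dask_cudf.from_cudf(gdf, npartitions=1)\n",
"\n",
"# Requires a running Dask client (see the setup section below)\n",
"pr = dcg.pagerank(edge_list, alpha=0.85, max_iter=30)\n",
"```"
]
},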
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"We will be analyzing 400 million nodes and 16 billion links from an artificial HiBench dataset (Zipfian distribution). The CSV edge list file is 300GB and split into 32 partitions.\n",
"\n",
"---\n",
"\n",
"We recommend that you download and decompress the data on a ***local*** directory on the DGX-2 machine. Notice that on clusters, most user sessions are NFS-based which would be a huge performance bottleneck if the dataset is stored there. Your system may be different, and you may want to talk to your cluster administrator to identify the best storage option.\n",
"\n",
"---\n",
"\n",
"The fastest way to obtain the dataset is to run :\n",
"```\n",
"wget https://rapidsai-data.s3.us-east-2.amazonaws.com/cugraph/benchmark/hibench/HiBench_300GB.tar.gz\n",
"tar -xzf HiBench_300GB.tar.gz\n",
"```\n"
]
},
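{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check is to confirm that all 32 CSV partitions are present before loading. This is only a sketch; the directory below is an assumption, so adjust it to wherever you extracted the archive:\n",
"```python\n",
"import glob\n",
"import os\n",
"\n",
"# Hypothetical location of the decompressed dataset\n",
"data_dir = '/datasets/'\n",
"\n",
"parts = sorted(glob.glob(os.path.join(data_dir, 'file-*.csv')))\n",
"print(len(parts), 'partitions found')  # expect 32\n",
"```"
]
},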
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-GPU PageRank with cuGraph\n",
"### Basic setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's check out our hardware setup\n",
"!nvidia-smi\n",
"\n",
"# GPUs should be connected with NVlink\n",
"!nvidia-smi nvlink --status\n",
"\n",
"# List available devices (we need all 16 GPUs)\n",
"# If empty, all GPUs are available by default\n",
"!echo Available GPUs: $CUDA_VISIBLE_DEVICES"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import needed libraries. We recommend using cugraph_dev env via conda\n",
"import time\n",
"from dask.distributed import Client, wait\n",
"import dask_cudf\n",
"from dask_cuda import LocalCUDACluster\n",
"import cugraph.dask.pagerank as dcg"
]
},
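{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, confirm that the installed packages match the versions this notebook was tested with (RAPIDS 0.10). This is only a sketch; your environment may differ:\n",
"```python\n",
"import cugraph\n",
"import cudf\n",
"\n",
"print('cugraph', cugraph.__version__)\n",
"print('cudf   ', cudf.__version__)\n",
"```"
]
},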
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup multi-GPU and Dask\n",
"\n",
"Before we get started, we need to setup a Dask local cluster of workers to execute our work and a client to coordinate and schedule work for that cluster. As we see below, we can initiate a `cluster` and `client` using only 2 lines of code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# By default, Dask will use the current directory to store temporary information\n",
"cluster = LocalCUDACluster(threads_per_worker=1)\n",
"# If your working directory is on a NFS, you will want to change the line above to\n",
"# specify a local directory instead :\n",
"# cluster = LocalCUDACluster(local_dir=\"/some_local_path_on the current_machine/\",threads_per_worker=1)\n",
"\n",
"client = Client(cluster)"
]
},
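{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional check (a sketch, not required for the rest of the notebook), the client can report how many Dask-CUDA workers were started; on a DGX-2 we expect one worker per GPU, i.e. 16:\n",
"```python\n",
"workers = client.scheduler_info()['workers']\n",
"print(len(workers), 'Dask-CUDA workers')  # expect 16 on a DGX-2\n",
"```"
]
},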
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read the data from disk\n",
"cuGraph depends on dask-cudf for data loading and the initial DataFrame creation. The CSV data file contains an edge list, which represents the connection of a vertex to another. The source to destination pairs is what is known as Coordinate Format (COO). In this test case, the data is just two columns. \n",
"\n",
"CuGraph has a special `read_split_csv` for large datasets which cannot be read directly by dask-cudf due to memory requirements. This function takes large input split into smaller files (number of input files > number of gpus), reads two or more csv per gpu/worker and concatenates them into a single dataframe. Additional parameters (delimiter, names and dtype) can be specified for reading the csv file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# File path\n",
"# *** edit this ***\n",
"input_data_path = r\"/datasets/\"\n",
"\n",
"# Files\n",
"input_files = ['file-00000.csv','file-00001.csv','file-00002.csv','file-00003.csv',\n",
" 'file-00004.csv','file-00005.csv','file-00006.csv','file-00007.csv',\n",
" 'file-00008.csv','file-00009.csv','file-00010.csv','file-00011.csv',\n",
" 'file-00012.csv','file-00013.csv','file-00014.csv','file-00015.csv',\n",
" 'file-00016.csv','file-00017.csv','file-00018.csv','file-00019.csv',\n",
" 'file-00020.csv','file-00021.csv','file-00022.csv','file-00023.csv',\n",
" 'file-00024.csv','file-00025.csv','file-00026.csv','file-00027.csv',\n",
" 'file-00028.csv','file-00029.csv','file-00030.csv','file-00031.csv']\n",
"\n",
"# Concatenate file paths\n",
"files = [input_data_path+f for f in input_files]\n",
"\n",
"# Start timer\n",
"t_start = time.time()\n",
"\n",
"# Special Multi-GPU CSV reader for splited files\n",
"e_list = dcg.read_split_csv(files)\n",
"\n",
"# Wait for the lazy reader\n",
"tmp = wait(client.compute(e_list.to_delayed()))\n",
"\n",
"# Print time\n",
"print(time.time()-t_start, \"s\")"
]
},
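{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running PageRank, a few quick checks can confirm the edge list was loaded as expected. This is a sketch; the exact column names depend on the CSV reader parameters:\n",
"```python\n",
"print(e_list.npartitions)   # ideally one partition per GPU\n",
"print(e_list.columns)       # source and destination columns\n",
"e_list.head()               # peek at a few edges\n",
"```"
]
},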
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Call the Multi-GPU PageRank algorithm\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Start timer\n",
"t_start = time.time()\n",
"\n",
"# Get the pagerank scores\n",
"pr_ddf = dcg.pagerank(e_list, max_iter=20)\n",
"\n",
"# Print time\n",
"print(time.time()-t_start, \"s\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It was that easy! cuGraph should only take a few seconds per iteration to run on this 300GB input with 16 Tesla V100 GPUs.<br>"
]
},
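{
"cell_type": "markdown",
"metadata": {},
"source": [
"To look at the scores, the distributed result can be gathered and sorted. The column names below ('vertex', 'pagerank') are assumptions for this sketch, so check `pr_ddf.columns` on your system first; the result is only two columns of size V, so it should fit on a single V100:\n",
"```python\n",
"pr_df = pr_ddf.compute()  # gather the distributed result into a single cudf DataFrame\n",
"top10 = pr_df.sort_values('pagerank', ascending=False).head(10)\n",
"print(top10)\n",
"```"
]
},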
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Close the multi-GPU environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client.close()\n",
"cluster.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"Copyright (c) 2019, NVIDIA CORPORATION.\n",
"\n",
"Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0\n",
"\n",
"Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n",
"___"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}