@hanskyy
Created April 20, 2022 00:18
{
"cells": [
{
"cell_type": "markdown",
"id": "ed6e76f4",
"metadata": {},
"source": [
"## PyTorch Geometric Homework\n",
"\n",
"PyTorch Geometric (PyG) provides two modules for storing graphs and converting them to tensor format. One is `torch_geometric.datasets`, which contains a variety of common graph datasets. The other is `torch_geometric.data`, which provides the data handling for graphs as PyTorch tensors.\n"
]
},
{
"cell_type": "markdown",
"id": "4faf72fc",
"metadata": {},
"source": [
"### PyG Datasets\n",
"\n",
"The `torch_geometric.datasets` module provides many common graph datasets. Here we explore its usage through one example dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "369fbd2c",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from torch_geometric.datasets import TUDataset\n",
"\n",
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    root = './enzymes'\n",
"    name = 'ENZYMES'\n",
"\n",
"    # The ENZYMES dataset\n",
"    pyg_dataset = TUDataset(root, name)\n",
"\n",
"    # You will find that there are 600 graphs in this dataset\n",
"    print(pyg_dataset)"
]
},
{
"cell_type": "markdown",
"id": "67e74090",
"metadata": {},
"source": [
"#### Question 1: What is the number of classes and number of features in the ENZYMES dataset? (2 points)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e7c5d8a",
"metadata": {},
"outputs": [],
"source": [
"def get_num_classes(pyg_dataset):\n",
"    # TODO: Implement a function that takes a PyG dataset object\n",
"    # and returns the number of classes for that dataset.\n",
"\n",
"    num_classes = 0\n",
"\n",
"    ############# Your code here ############\n",
"    ## (~1 line of code)\n",
"\n",
"    #########################################\n",
"\n",
"    return num_classes\n",
"\n",
"def get_num_features(pyg_dataset):\n",
"    # TODO: Implement a function that takes a PyG dataset object\n",
"    # and returns the number of features for that dataset.\n",
"\n",
"    num_features = 0\n",
"\n",
"    ############# Your code here ############\n",
"    ## (~1 line of code)\n",
"\n",
"    #########################################\n",
"\n",
"    return num_features\n",
"\n",
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    num_classes = get_num_classes(pyg_dataset)\n",
"    num_features = get_num_features(pyg_dataset)\n",
"    print(\"{} dataset has {} classes\".format(name, num_classes))\n",
"    print(\"{} dataset has {} features\".format(name, num_features))"
]
},
{
"cell_type": "markdown",
"id": "2742e42c",
"metadata": {},
"source": [
"### PyG Data\n",
"\n",
"Each PyG dataset stores a list of `torch_geometric.data.Data` objects, where each `torch_geometric.data.Data` object represents a graph. We can easily get the `Data` object by indexing into the dataset.\n",
"\n",
"For more information such as what is stored in the `Data` object, please refer to the [documentation](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data)."
]
},
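{
"cell_type": "markdown",
"id": "ex-toy-data",
"metadata": {},
"source": [
"As a hedged illustration of the `Data` object described above, the sketch below builds a hypothetical two-node graph by hand. The tensors and the names `edge_index`, `x`, `y`, and `toy` are made up for this example and are not part of ENZYMES.\n",
"\n",
"```python\n",
"import torch\n",
"from torch_geometric.data import Data\n",
"\n",
"# One undirected edge between nodes 0 and 1, stored as two directed entries\n",
"edge_index = torch.tensor([[0, 1],\n",
"                           [1, 0]], dtype=torch.long)\n",
"x = torch.tensor([[1.0, 0.0, 2.0],\n",
"                  [0.0, 1.0, 3.0]])  # 2 nodes, 3 features each\n",
"y = torch.tensor([1])  # a single graph-level label\n",
"\n",
"toy = Data(x=x, edge_index=edge_index, y=y)\n",
"print(toy.num_nodes, toy.num_node_features, toy.num_edges)\n",
"print(toy.is_undirected())\n",
"```\n",
"\n",
"Note that `num_edges` counts the directed entries in `edge_index`, so an undirected edge contributes two."
]
},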
{
"cell_type": "markdown",
"id": "bb3ad669",
"metadata": {},
"source": [
"#### Question 2: What is the label of the graph with index 100 in the ENZYMES dataset? (1 point)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0de3ddd",
"metadata": {},
"outputs": [],
"source": [
"def get_graph_class(pyg_dataset, idx):\n",
"    # TODO: Implement a function that takes a PyG dataset object,\n",
"    # an index of a graph within the dataset, and returns the class/label\n",
"    # of the graph (as an integer).\n",
"\n",
"    label = -1\n",
"\n",
"    ############# Your code here ############\n",
"    ## (~1 line of code)\n",
"\n",
"    #########################################\n",
"\n",
"    return label\n",
"\n",
"# Here pyg_dataset is a dataset for graph classification\n",
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    graph_0 = pyg_dataset[0]\n",
"    print(graph_0)\n",
"    idx = 100\n",
"    label = get_graph_class(pyg_dataset, idx)\n",
"    print('Graph with index {} has label {}'.format(idx, label))"
]
},
{
"cell_type": "markdown",
"id": "c44c9cc1",
"metadata": {},
"source": [
"#### Question 3: How many edges does the graph with index 200 have? (1 point)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d6026a6",
"metadata": {},
"outputs": [],
"source": [
"def get_graph_num_edges(pyg_dataset, idx):\n",
"    # TODO: Implement a function that takes a PyG dataset object,\n",
"    # the index of a graph in the dataset, and returns the number of\n",
"    # edges in the graph (as an integer). You should not count an edge\n",
"    # twice if the graph is undirected. For example, in an undirected\n",
"    # graph G, if two nodes v and u are connected by an edge, this edge\n",
"    # should only be counted once.\n",
"\n",
"    num_edges = 0\n",
"\n",
"    ############# Your code here ############\n",
"    ## Note:\n",
"    ## 1. You can't return data.num_edges directly\n",
"    ## 2. We assume the graph is undirected\n",
"    ## 3. Look at the PyG dataset built-in functions\n",
"    ## (~4 lines of code)\n",
"\n",
"    #########################################\n",
"\n",
"    return num_edges\n",
"\n",
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    idx = 200\n",
"    num_edges = get_graph_num_edges(pyg_dataset, idx)\n",
"    print('Graph with index {} has {} edges'.format(idx, num_edges))"
]
},
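{
"cell_type": "markdown",
"id": "ex-undirected-count",
"metadata": {},
"source": [
"The counting rule described in the TODO above (an undirected edge counts once even when it is stored in both directions) can be sketched in plain Python. This is only an illustration on a made-up edge list; `count_undirected_edges` and `pairs` are hypothetical names, and it is not necessarily the approach the assignment intends.\n",
"\n",
"```python\n",
"def count_undirected_edges(directed_pairs):\n",
"    # Map (u, v) and (v, u) to the same canonical key, then count unique keys\n",
"    unique = {(min(u, v), max(u, v)) for u, v in directed_pairs}\n",
"    return len(unique)\n",
"\n",
"# Hypothetical triangle 0-1-2 with every edge listed in both directions\n",
"pairs = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]\n",
"n_edges = count_undirected_edges(pairs)\n",
"print(n_edges)\n",
"```\n",
"\n",
"The six directed entries collapse to three undirected edges because each pair shares one canonical key."
]
},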
{
"cell_type": "markdown",
"id": "405ca900",
"metadata": {},
"source": [
"### Open Graph Benchmark (OGB)\n",
"\n",
"The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. Its datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can then be evaluated by using the OGB Evaluator in a unified manner."
]
},
{
"cell_type": "markdown",
"id": "00bc9944",
"metadata": {},
"source": [
"#### Dataset and Data\n",
"\n",
"OGB also supports PyG dataset and data classes. Here we take a look at the `ogbn-arxiv` dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bbf1a1be",
"metadata": {},
"outputs": [],
"source": [
"import torch_geometric.transforms as T\n",
"from ogb.nodeproppred import PygNodePropPredDataset\n",
"\n",
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    dataset_name = 'ogbn-arxiv'\n",
"    # Load the dataset and convert its adjacency to a sparse tensor\n",
"    dataset = PygNodePropPredDataset(name=dataset_name,\n",
"                                     transform=T.ToSparseTensor())\n",
"    print('The {} dataset has {} graph'.format(dataset_name, len(dataset)))\n",
"\n",
"    # Extract the graph\n",
"    data = dataset[0]\n",
"    print(data)"
]
},
{
"cell_type": "markdown",
"id": "4090910d",
"metadata": {},
"source": [
"#### Question 4: How many features are in the ogbn-arxiv graph? (1 point)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8982174",
"metadata": {},
"outputs": [],
"source": [
"def graph_num_features(data):\n",
"    # TODO: Implement a function that takes a PyG data object,\n",
"    # and returns the number of features in the graph (as an integer).\n",
"\n",
"    num_features = 0\n",
"\n",
"    ############# Your code here ############\n",
"    ## (~1 line of code)\n",
"\n",
"    #########################################\n",
"\n",
"    return num_features\n",
"\n",
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    num_features = graph_num_features(data)\n",
"    print('The graph has {} features'.format(num_features))"
]
},
{
"cell_type": "markdown",
"id": "8fde3b0c",
"metadata": {},
"source": [
"### GNN: Node Property Prediction\n",
"\n",
"In this section we will build our first graph neural network using PyTorch Geometric. Then we will apply it to the task of node property prediction (node classification).\n",
"\n",
"Specifically, we will use GCN as the foundation for our graph neural network ([Kipf & Welling (2017)](https://arxiv.org/pdf/1609.02907.pdf)). To do so, we will work with PyG's built-in `GCNConv` layer."
]
},
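{
"cell_type": "markdown",
"id": "ex-gcn-rule",
"metadata": {},
"source": [
"For reference, the layer-wise propagation rule from the GCN paper linked above can be written as follows. The symbols follow the paper's notation and are not names used in this notebook.\n",
"\n",
"$$H^{(l+1)} = \\sigma\\left(\\tilde{D}^{-1/2}\\,\\tilde{A}\\,\\tilde{D}^{-1/2}\\,H^{(l)}\\,W^{(l)}\\right), \\qquad \\tilde{A} = A + I, \\qquad \\tilde{D}_{ii} = \\sum_j \\tilde{A}_{ij}$$\n",
"\n",
"Here $A$ is the adjacency matrix, $H^{(l)}$ holds the node representations at layer $l$ (with $H^{(0)} = X$), $W^{(l)}$ is a learnable weight matrix, and $\\sigma$ is a nonlinearity such as ReLU. Roughly speaking, each layer averages a node's features with those of its neighbors, using degree-based weights, and then applies a linear map."
]
},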
{
"cell_type": "code",
"execution_count": null,
"id": "fd4812b0",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import pandas as pd\n",
"import torch.nn.functional as F\n",
"print(torch.__version__)\n",
"\n",
"# The PyG built-in GCNConv\n",
"from torch_geometric.nn import GCNConv\n",
"\n",
"import torch_geometric.transforms as T\n",
"from ogb.nodeproppred import PygNodePropPredDataset, Evaluator"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ded8b4c7",
"metadata": {},
"outputs": [],
"source": [
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    dataset_name = 'ogbn-arxiv'\n",
"    dataset = PygNodePropPredDataset(name=dataset_name,\n",
"                                     transform=T.ToSparseTensor())\n",
"    data = dataset[0]\n",
"\n",
"    # Make the adjacency matrix symmetric\n",
"    data.adj_t = data.adj_t.to_symmetric()\n",
"\n",
"    device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
"\n",
"    # If you use GPU, the device should be cuda\n",
"    print('Device: {}'.format(device))\n",
"\n",
"    data = data.to(device)\n",
"    split_idx = dataset.get_idx_split()\n",
"    train_idx = split_idx['train'].to(device)"
]
},
{
"cell_type": "markdown",
"id": "7517363b",
"metadata": {},
"source": [
"#### GCN Model\n",
"\n",
"Now we will implement our GCN model!\n",
"\n",
"Please follow the figure below to implement the `forward` function.\n",
"\n",
"\n",
"![test](https://drive.google.com/uc?id=128AuYAXNXGg7PIhJJ7e420DoPWKb-RtL)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5820a611",
"metadata": {},
"outputs": [],
"source": [
"class GCN(torch.nn.Module):\n",
"    def __init__(self, input_dim, hidden_dim, output_dim, num_layers,\n",
"                 dropout, return_embeds=False):\n",
"        # TODO: Implement a function that initializes self.convs,\n",
"        # self.bns, and self.softmax.\n",
"\n",
"        super(GCN, self).__init__()\n",
"\n",
"        # A list of GCNConv layers\n",
"        self.convs = None\n",
"\n",
"        # A list of 1D batch normalization layers\n",
"        self.bns = None\n",
"\n",
"        # The log softmax layer\n",
"        self.softmax = None\n",
"\n",
"        ############# Your code here ############\n",
"        ## Note:\n",
"        ## 1. You should use torch.nn.ModuleList for self.convs and self.bns\n",
"        ## 2. self.convs has num_layers GCNConv layers\n",
"        ## 3. self.bns has num_layers - 1 BatchNorm1d layers\n",
"        ## 4. You should use torch.nn.LogSoftmax for self.softmax\n",
"        ## 5. The parameters you can set for GCNConv include 'in_channels' and\n",
"        ## 'out_channels'. For more information please refer to the documentation:\n",
"        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv\n",
"        ## 6. The only parameter you need to set for BatchNorm1d is 'num_features'\n",
"        ## For more information please refer to the documentation:\n",
"        ## https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html\n",
"        ## (~10 lines of code)\n",
"\n",
"        #########################################\n",
"\n",
"        # Probability of an element getting zeroed\n",
"        self.dropout = dropout\n",
"\n",
"        # Skip classification layer and return node embeddings\n",
"        self.return_embeds = return_embeds\n",
"\n",
"    def reset_parameters(self):\n",
"        for conv in self.convs:\n",
"            conv.reset_parameters()\n",
"        for bn in self.bns:\n",
"            bn.reset_parameters()\n",
"\n",
"    def forward(self, x, adj_t):\n",
"        # TODO: Implement a function that takes the feature tensor x and\n",
"        # the sparse adjacency tensor adj_t and returns the output tensor\n",
"        # as shown in the figure.\n",
"\n",
"        out = None\n",
"\n",
"        ############# Your code here ############\n",
"        ## Note:\n",
"        ## 1. Construct the network as shown in the figure\n",
"        ## 2. torch.nn.functional.relu and torch.nn.functional.dropout are useful\n",
"        ## For more information please refer to the documentation:\n",
"        ## https://pytorch.org/docs/stable/nn.functional.html\n",
"        ## 3. Don't forget to set F.dropout training to self.training\n",
"        ## 4. If return_embeds is True, then skip the last softmax layer\n",
"        ## (~7 lines of code)\n",
"\n",
"        #########################################\n",
"\n",
"        return out"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76af550c",
"metadata": {},
"outputs": [],
"source": [
"def train(model, data, train_idx, optimizer, loss_fn):\n",
"    # TODO: Implement a function that trains the model by\n",
"    # using the given optimizer and loss_fn.\n",
"    model.train()\n",
"    loss = 0\n",
"\n",
"    ############# Your code here ############\n",
"    ## Note:\n",
"    ## 1. Zero grad the optimizer\n",
"    ## 2. Feed the data into the model\n",
"    ## 3. Slice the model output and label by train_idx\n",
"    ## 4. Feed the sliced output and label to loss_fn\n",
"    ## (~4 lines of code)\n",
"\n",
"    #########################################\n",
"\n",
"    loss.backward()\n",
"    optimizer.step()\n",
"\n",
"    return loss.item()"
]
},
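{
"cell_type": "markdown",
"id": "ex-logsoftmax-nll",
"metadata": {},
"source": [
"The model ends in a log-softmax layer and the training loop in this notebook uses `F.nll_loss`. Taken together, the two amount to cross-entropy on the raw class scores. The plain-Python sketch below shows that relationship for a single node; the helper names and the example scores are hypothetical and only meant as an illustration.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def log_softmax(logits):\n",
"    # Subtract the max before exponentiating for numerical stability\n",
"    m = max(logits)\n",
"    log_z = m + math.log(sum(math.exp(v - m) for v in logits))\n",
"    return [v - log_z for v in logits]\n",
"\n",
"def nll(log_probs, target):\n",
"    # Negative log-likelihood of the true class\n",
"    return -log_probs[target]\n",
"\n",
"logits = [2.0, 0.5, -1.0]  # made-up scores for one node and 3 classes\n",
"log_probs = log_softmax(logits)\n",
"loss = nll(log_probs, target=0)\n",
"print(round(loss, 4))\n",
"```\n",
"\n",
"Exponentiating `log_probs` gives probabilities that sum to one, and the loss shrinks as the score of the true class grows relative to the others."
]
},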
{
"cell_type": "code",
"execution_count": null,
"id": "f7c90267",
"metadata": {},
"outputs": [],
"source": [
"# Test function here\n",
"@torch.no_grad()\n",
"def test(model, data, split_idx, evaluator, save_model_results=False):\n",
"    # TODO: Implement a function that tests the model by\n",
"    # using the given split_idx and evaluator.\n",
"    model.eval()\n",
"\n",
"    # The output of model on all data\n",
"    out = None\n",
"\n",
"    ############# Your code here ############\n",
"    ## (~1 line of code)\n",
"    ## Note:\n",
"    ## 1. No index slicing here\n",
"\n",
"    #########################################\n",
"\n",
"    y_pred = out.argmax(dim=-1, keepdim=True)\n",
"\n",
"    train_acc = evaluator.eval({\n",
"        'y_true': data.y[split_idx['train']],\n",
"        'y_pred': y_pred[split_idx['train']],\n",
"    })['acc']\n",
"    valid_acc = evaluator.eval({\n",
"        'y_true': data.y[split_idx['valid']],\n",
"        'y_pred': y_pred[split_idx['valid']],\n",
"    })['acc']\n",
"    test_acc = evaluator.eval({\n",
"        'y_true': data.y[split_idx['test']],\n",
"        'y_pred': y_pred[split_idx['test']],\n",
"    })['acc']\n",
"\n",
"    if save_model_results:\n",
"        print(\"Saving Model Predictions\")\n",
"\n",
"        data = {}\n",
"        data['y_pred'] = y_pred.view(-1).cpu().detach().numpy()\n",
"\n",
"        df = pd.DataFrame(data=data)\n",
"        # Save locally as csv\n",
"        df.to_csv('ogbn-arxiv_node.csv', sep=',', index=False)\n",
"\n",
"    return train_acc, valid_acc, test_acc"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6adcf21f",
"metadata": {},
"outputs": [],
"source": [
"# Please do not change the args\n",
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    args = {\n",
"        'device': device,\n",
"        'num_layers': 3,\n",
"        'hidden_dim': 256,\n",
"        'dropout': 0.5,\n",
"        'lr': 0.01,\n",
"        'epochs': 100,\n",
"    }\n",
"    args"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c45d1d2",
"metadata": {},
"outputs": [],
"source": [
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    model = GCN(data.num_features, args['hidden_dim'],\n",
"                dataset.num_classes, args['num_layers'],\n",
"                args['dropout']).to(device)\n",
"    evaluator = Evaluator(name='ogbn-arxiv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73b71a8f",
"metadata": {},
"outputs": [],
"source": [
"import copy\n",
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    # Reset the parameters to their initial random values\n",
"    model.reset_parameters()\n",
"\n",
"    optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])\n",
"    loss_fn = F.nll_loss\n",
"\n",
"    best_model = None\n",
"    best_valid_acc = 0\n",
"\n",
"    for epoch in range(1, 1 + args[\"epochs\"]):\n",
"        loss = train(model, data, train_idx, optimizer, loss_fn)\n",
"        result = test(model, data, split_idx, evaluator)\n",
"        train_acc, valid_acc, test_acc = result\n",
"        # Keep a snapshot of the model with the best validation accuracy\n",
"        if valid_acc > best_valid_acc:\n",
"            best_valid_acc = valid_acc\n",
"            best_model = copy.deepcopy(model)\n",
"        print(f'Epoch: {epoch:02d}, '\n",
"              f'Loss: {loss:.4f}, '\n",
"              f'Train: {100 * train_acc:.2f}%, '\n",
"              f'Valid: {100 * valid_acc:.2f}%, '\n",
"              f'Test: {100 * test_acc:.2f}%')"
]
},
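{
"cell_type": "markdown",
"id": "ex-deepcopy-snapshot",
"metadata": {},
"source": [
"The loop above stores `copy.deepcopy(model)` rather than `model` itself. A plain assignment would only create a second name for the same object, so later training steps would also change the saved model. The sketch below illustrates the difference with a hypothetical dictionary standing in for the model parameters.\n",
"\n",
"```python\n",
"import copy\n",
"\n",
"weights = {'w': [0.1, 0.2]}         # made-up stand-in for model parameters\n",
"alias = weights                     # plain assignment: same underlying object\n",
"snapshot = copy.deepcopy(weights)   # independent copy of the current state\n",
"\n",
"weights['w'][0] = 9.9               # simulate a later training update\n",
"print(alias['w'][0], snapshot['w'][0])\n",
"```\n",
"\n",
"Only the deep copy still holds the value from before the update, which is the behavior wanted when keeping the best model seen so far."
]
},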
{
"cell_type": "markdown",
"id": "04bf0db2",
"metadata": {},
"source": [
"#### Question 5: What are your `best_model` validation and test accuracies? (5 points)\n",
"\n",
"Fill in the code above and run the cell below to see the results of your best model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a797404e",
"metadata": {},
"outputs": [],
"source": [
"if 'IS_GRADESCOPE_ENV' not in os.environ:\n",
"    best_result = test(best_model, data, split_idx, evaluator, save_model_results=True)\n",
"    train_acc, valid_acc, test_acc = best_result\n",
"    print(f'Best model: '\n",
"          f'Train: {100 * train_acc:.2f}%, '\n",
"          f'Valid: {100 * valid_acc:.2f}%, '\n",
"          f'Test: {100 * test_acc:.2f}%')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}