{
"cells": [
{
"cell_type": "markdown",
"id": "8b7e51d8-f81e-4c5f-8a48-7159ead65e99",
"metadata": {
"tags": []
},
"source": [
"# 10. Convolutional neural networks\n",
"\n",
"In this assignment you'll use a photo-Z dataset acquired from the observations of the SDSS telescope located in New Mexico. The goal is to predict redshifts from multiband images of galaxies. As a warmup you'll work with the SVHN dataset.\n",
"\n",
"**<font color='red'>[WARN]:</font> For this assignment you'll need significantly more computational power compared to previous assignments! If you don't have a CUDA-capable GPU with >4Gb VRAM and >8Gb RAM, then you're advised to work on Google Colab!**"
]
},
{
"cell_type": "markdown",
"id": "7fec15c7-def2-45b9-8456-881564d76398",
"metadata": {
"tags": []
},
"source": [
"### 1. Load the Street View House Numbers (SVHN) dataset\n",
"\n",
"- Download the SVHN database and load the train and test datasets!\n",
" There are multiple ways to do this. The easiest one is probably to install\n",
" and use the `extra-keras-datasets` Python package. You need to use the\n",
" standard/normal SVHN dataset only and NOT the one titled as `extra`!\n",
" (Of course, if you have enough RAM and VRAM, you can work with that one\n",
" too, if you want...)\n",
"- Preprocess the downloaded data if needed to be able to use it for training\n",
" and testing!\n",
"- Normalize the pixel values into the interval of [0,1]!\n",
"- How many and what classes do we have in the dataset? How many train and test\n",
" examples do we have?\n",
"- What are the dimensions of the images?\n",
"- Show some images randomly from the dataset!\n",
"- Make one-hot encoding for the labels!"
]
},
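{
"cell_type": "markdown",
"id": "f3a1c2b4-1a2b-4c3d-9e4f-0a1b2c3d4e01",
"metadata": {},
"source": [
"A minimal sketch of how the steps above could look with `extra-keras-datasets` and `tf.keras`. The `type='normal'` keyword is how the package distinguishes the standard set from the `extra` one (double-check against the package docs for your installed version), and the variable names are of course up to you:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import tensorflow as tf\n",
"from extra_keras_datasets import svhn\n",
"\n",
"# Load the standard (cropped) SVHN train and test splits\n",
"(X_train, y_train), (X_test, y_test) = svhn.load_data(type='normal')\n",
"\n",
"# Normalize the pixel values into [0, 1]\n",
"X_train = X_train.astype('float32') / 255.0\n",
"X_test = X_test.astype('float32') / 255.0\n",
"print('Train:', X_train.shape, ' Test:', X_test.shape)\n",
"\n",
"# Inspect the classes; if the loader returns labels 1..10\n",
"# (with 10 standing for the digit 0), remap them to 0..9\n",
"classes = np.unique(y_train)\n",
"print('Classes:', classes)\n",
"if classes.max() == 10:\n",
"    y_train, y_test = y_train % 10, y_test % 10\n",
"\n",
"# Show a few random images\n",
"idx = np.random.choice(len(X_train), size=10, replace=False)\n",
"fig, axes = plt.subplots(1, 10, figsize=(15, 2))\n",
"for ax, i in zip(axes, idx):\n",
"    ax.imshow(X_train[i])\n",
"    ax.set_title(int(y_train[i]))\n",
"    ax.axis('off')\n",
"plt.show()\n",
"\n",
"# One-hot encode the labels\n",
"y_train_oh = tf.keras.utils.to_categorical(y_train, num_classes=10)\n",
"y_test_oh = tf.keras.utils.to_categorical(y_test, num_classes=10)\n",
"```"
]
},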
{
"cell_type": "markdown",
"id": "d72057e3-894b-44e1-bef8-5a7ef8b811e2",
"metadata": {},
"source": [
"### 2. Create a convolutional neural network for the SVHN dataset\n",
"\n",
"- Train the following network on the training set and generate\n",
" prediction for the test images:\n",
" ```\n",
" > conv2D, 16 kernels, kernel size = (3,3), valid padding, relu actvation\n",
" > conv2D, 16 kernels, kernel size = (3,3), valid padding, relu actvation\n",
" > maxpooling kernel size = (2,2)\n",
" > conv2D, 32 kernels, kernel size = (3,3), valid padding, relu actvation\n",
" > conv2D, 32 kernels, kernel size = (3,3), valid padding, relu actvation\n",
" > maxpooling pool size = (2,2) strides = (2,2)\n",
" > flatten\n",
" > dense, 10 neurons, softmax activation\n",
" ```\n",
" - Use Adam optimizer with default parameters\n",
" - Use categorical crossentropy as loss function\n",
" - Compile the model\n",
" - Print out a summary of the model\n",
" - Train the CNN on the training data for 25 epochs with batch size\n",
" of 64\n",
" - Use the test data as validation data\n",
"\n",
"- Calculate the categorical cross-entropy loss and the accuracy!\n",
" **<font color='green'>[HINT]:</font>** you should get at least $\\approx$ 80-90%\n",
" accuracy.\n",
"- Plot the training and the validation loss and accuracy on the same plot!\n",
" Do we experience overfitting?\n",
"- Show the confusion matrix of the predictions!"
]
},
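{
"cell_type": "markdown",
"id": "f3a1c2b4-1a2b-4c3d-9e4f-0a1b2c3d4e02",
"metadata": {},
"source": [
"One way the listed architecture could be assembled with `tf.keras` (the layer order and hyperparameters follow the listing; `X_train`, `y_train_oh`, etc. are assumed from the previous step):\n",
"\n",
"```python\n",
"from tensorflow import keras\n",
"from tensorflow.keras import layers\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"model = keras.Sequential([\n",
"    keras.Input(shape=X_train.shape[1:]),   # e.g. (32, 32, 3)\n",
"    layers.Conv2D(16, (3, 3), padding='valid', activation='relu'),\n",
"    layers.Conv2D(16, (3, 3), padding='valid', activation='relu'),\n",
"    layers.MaxPooling2D(pool_size=(2, 2)),\n",
"    layers.Conv2D(32, (3, 3), padding='valid', activation='relu'),\n",
"    layers.Conv2D(32, (3, 3), padding='valid', activation='relu'),\n",
"    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),\n",
"    layers.Flatten(),\n",
"    layers.Dense(10, activation='softmax')\n",
"])\n",
"\n",
"model.compile(optimizer='adam',\n",
"              loss='categorical_crossentropy',\n",
"              metrics=['accuracy'])\n",
"model.summary()\n",
"\n",
"history = model.fit(X_train, y_train_oh,\n",
"                    epochs=25, batch_size=64,\n",
"                    validation_data=(X_test, y_test_oh))\n",
"\n",
"# Confusion matrix of the test predictions\n",
"y_pred = model.predict(X_test).argmax(axis=1)\n",
"cm = confusion_matrix(y_test.ravel(), y_pred)\n",
"print(cm)\n",
"```"
]
},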
{
"cell_type": "markdown",
"id": "bd6b8f6a-83a7-4fba-bfae-7fe501d934a9",
"metadata": {
"tags": []
},
"source": [
"### 3. Load the Sload Digital Sky Survey (SDSS) Dataset\n",
"\n",
"You can download the dataset from Kaggle via\n",
"[this link](https://www.kaggle.com/masterdesky/multiband-photoz-sdss-dr16).\n",
"\n",
"- Download the images from Kaggle (~10'000 images total) \n",
"- Preprocess the images similarly to the SVHN dataset if needed! (Normalize\n",
" pixel values to [0,1], etc.)\n",
"- What are the dimensions of the images?\n",
"- Show 15 images randomly from the dataset!\n",
"- Create a train-test-validation split using `train_test_split` from `sklearn`\n",
" where the test size is $0.33$ and the validation size is $0.2$\n",
" - Set a random seed\n",
" - Print the number of images in each of these sets after you've created\n",
" them"
]
},
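{
"cell_type": "markdown",
"id": "f3a1c2b4-1a2b-4c3d-9e4f-0a1b2c3d4e03",
"metadata": {},
"source": [
"A possible outline for these steps. The file names below (`images.npy`, `redshifts.npy`) are only placeholders: adapt the loading code to however the Kaggle dataset actually stores the images. Only the first three bands are shown as an RGB composite:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Placeholder file names: adapt to the actual layout of the Kaggle dataset\n",
"X = np.load('images.npy').astype('float32')\n",
"y = np.load('redshifts.npy').astype('float32')\n",
"\n",
"# Normalize the pixel values into [0, 1]\n",
"X /= X.max()\n",
"print('Image dimensions:', X.shape[1:])\n",
"\n",
"# Show 15 randomly selected images (first three bands as RGB)\n",
"idx = np.random.choice(len(X), size=15, replace=False)\n",
"fig, axes = plt.subplots(3, 5, figsize=(12, 7))\n",
"for ax, i in zip(axes.ravel(), idx):\n",
"    ax.imshow(X[i][..., :3])\n",
"    ax.axis('off')\n",
"plt.show()\n",
"\n",
"# Train-test-validation split with a fixed random seed\n",
"seed = 42\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X, y, test_size=0.33, random_state=seed)\n",
"X_train, X_val, y_train, y_val = train_test_split(\n",
"    X_train, y_train, test_size=0.2, random_state=seed)\n",
"\n",
"for name, arr in [('train', X_train), ('validation', X_val), ('test', X_test)]:\n",
"    print(f'{name}: {len(arr)} images')\n",
"```"
]
},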
{
"cell_type": "markdown",
"id": "d3aa9ad0-06e9-4ce3-81d3-29b35e1cd6ff",
"metadata": {
"tags": []
},
"source": [
"### 4. Create a convolutional neural network for the SDSS dataset\n",
"\n",
"- Train the following network on the training set and generate\n",
" prediction for the test images:\n",
" ```\n",
" > conv2D, 32 kernels, kernel size = (3,3), same padding\n",
" > batch normalization\n",
" > relu actvation\n",
" > conv2D, 32 kernels, kernel size = (3,3), same padding\n",
" > batch normalization\n",
" > relu actvation\n",
" > maxpooling pool size = (2,2), strides = (2,2)\n",
" \n",
" > conv2D, 64 kernels, kernel size = (3,3), same padding\n",
" > batch normalization\n",
" > relu actvation\n",
" > conv2D, 64 kernels, kernel size = (1,1), same padding\n",
" > batch normalization\n",
" > relu actvation\n",
" > conv2D, 64 kernels, kernel size = (3,3), same padding\n",
" > batch normalization\n",
" > relu actvation\n",
" > maxpooling pool size = (2,2), strides = (2,2)\n",
" \n",
" > global pooling\n",
" > dense, 1 neuron, no activation\n",
" ```\n",
"\n",
" - Use Adam optimizer with default parameters\n",
" - Use mean squared error as loss function\n",
" - Compile the model\n",
" - Print out a summary of the model\n",
" - Use the created validation set as validation during the training\n",
" - Train the CNN on the training data for 25 epochs with batch size\n",
" of 64\n",
"\n",
"- Calculate and print out the final accuracy using the R2 score!"
]
},
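{
"cell_type": "markdown",
"id": "f3a1c2b4-1a2b-4c3d-9e4f-0a1b2c3d4e04",
"metadata": {},
"source": [
"One possible `tf.keras` translation of the architecture above. Here `global pooling` is read as global *average* pooling, and `X_train`, `X_val`, `X_test`, etc. come from the split created in the previous step:\n",
"\n",
"```python\n",
"from tensorflow import keras\n",
"from tensorflow.keras import layers\n",
"from sklearn.metrics import r2_score\n",
"\n",
"def conv_bn_relu(x, filters, kernel_size):\n",
"    # conv2D -> batch normalization -> relu, as in the listing above\n",
"    x = layers.Conv2D(filters, kernel_size, padding='same')(x)\n",
"    x = layers.BatchNormalization()(x)\n",
"    return layers.Activation('relu')(x)\n",
"\n",
"inputs = keras.Input(shape=X_train.shape[1:])\n",
"x = conv_bn_relu(inputs, 32, (3, 3))\n",
"x = conv_bn_relu(x, 32, (3, 3))\n",
"x = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(x)\n",
"\n",
"x = conv_bn_relu(x, 64, (3, 3))\n",
"x = conv_bn_relu(x, 64, (1, 1))\n",
"x = conv_bn_relu(x, 64, (3, 3))\n",
"x = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(x)\n",
"\n",
"x = layers.GlobalAveragePooling2D()(x)\n",
"outputs = layers.Dense(1)(x)   # single neuron, no activation\n",
"\n",
"model = keras.Model(inputs, outputs)\n",
"model.compile(optimizer='adam', loss='mse')\n",
"model.summary()\n",
"\n",
"history = model.fit(X_train, y_train,\n",
"                    epochs=25, batch_size=64,\n",
"                    validation_data=(X_val, y_val))\n",
"\n",
"# R2 score on the test set\n",
"y_pred = model.predict(X_test).ravel()\n",
"print('R2 score:', r2_score(y_test, y_pred))\n",
"```"
]
},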
{
"cell_type": "markdown",
"id": "bc00f33e-fcb8-45e8-a27f-3378cb40523e",
"metadata": {},
"source": [
"### 5. Evaluate performance\n",
"\n",
"- Plot the training and the validation loss on the same plot!\n",
"- Show the predicted values vs the actual labels on a plot!\n",
"- Where does the model make mistakes? Try to plot the images corresponding to\n",
" some outlier values!"
]
},
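{
"cell_type": "markdown",
"id": "f3a1c2b4-1a2b-4c3d-9e4f-0a1b2c3d4e05",
"metadata": {},
"source": [
"A sketch of the requested plots, reusing `history`, `y_pred` and the test arrays from the previous section:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Training vs. validation loss\n",
"plt.figure(figsize=(6, 4))\n",
"plt.plot(history.history['loss'], label='training loss')\n",
"plt.plot(history.history['val_loss'], label='validation loss')\n",
"plt.xlabel('epoch')\n",
"plt.ylabel('MSE loss')\n",
"plt.legend()\n",
"plt.show()\n",
"\n",
"# Predicted vs. true redshift\n",
"plt.figure(figsize=(5, 5))\n",
"plt.scatter(y_test, y_pred, s=4, alpha=0.5)\n",
"lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]\n",
"plt.plot(lims, lims, color='red')   # the ideal y = x line\n",
"plt.xlabel('true redshift')\n",
"plt.ylabel('predicted redshift')\n",
"plt.show()\n",
"\n",
"# Images with the largest absolute errors (outliers)\n",
"err = np.abs(y_pred - y_test)\n",
"worst = np.argsort(err)[-5:]\n",
"fig, axes = plt.subplots(1, 5, figsize=(15, 3))\n",
"for ax, i in zip(axes, worst):\n",
"    ax.imshow(X_test[i][..., :3])\n",
"    ax.set_title(f'true={y_test[i]:.2f}, pred={y_pred[i]:.2f}')\n",
"    ax.axis('off')\n",
"plt.show()\n",
"```"
]
},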
{
"cell_type": "markdown",
"id": "5904d4c7-00c4-404f-a683-cee4dd6cccdb",
"metadata": {},
"source": [
"### 6. Train an other CNN\n",
"\n",
"- The previous architecture can be further improved.\n",
"- Come up with an architecture that can achieve more than 80-85% accuracy on \n",
" the test set if the accuracy metric is the R2 score!\n",
" - You can use any tool for this task.\n",
" - Remember that there are different losses and optimizers, early stopping,\n",
" regularization, etc. that can be useful for you. You can find more about \n",
" these eg. in the\n",
" [tensorflow/keras documentation](https://www.tensorflow.org/api_docs/python/tf/all_symbols).\n",
" - Don't forget that you can add more layers to the model and train for\n",
" more epochs too... :)\n",
"- Print out the summary of this model!\n",
"- Plot the loss curves for this model too!"
]
}
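,
{
"cell_type": "markdown",
"id": "f3a1c2b4-1a2b-4c3d-9e4f-0a1b2c3d4e06",
"metadata": {},
"source": [
"One possible direction (certainly not the only one): a somewhat deeper network with dropout and early stopping. Treat it as a starting point and tune the depth, regularization and number of epochs to the actual image size and your hardware:\n",
"\n",
"```python\n",
"from tensorflow import keras\n",
"from tensorflow.keras import layers\n",
"\n",
"def conv_block(x, filters):\n",
"    # two conv/BN/relu stages followed by downsampling\n",
"    for _ in range(2):\n",
"        x = layers.Conv2D(filters, (3, 3), padding='same')(x)\n",
"        x = layers.BatchNormalization()(x)\n",
"        x = layers.Activation('relu')(x)\n",
"    return layers.MaxPooling2D((2, 2))(x)\n",
"\n",
"inputs = keras.Input(shape=X_train.shape[1:])\n",
"x = inputs\n",
"for filters in (32, 64, 128):\n",
"    x = conv_block(x, filters)\n",
"x = layers.GlobalAveragePooling2D()(x)\n",
"x = layers.Dense(128, activation='relu')(x)\n",
"x = layers.Dropout(0.25)(x)\n",
"outputs = layers.Dense(1)(x)\n",
"\n",
"model2 = keras.Model(inputs, outputs)\n",
"model2.compile(optimizer='adam', loss='mse')\n",
"model2.summary()\n",
"\n",
"early_stop = keras.callbacks.EarlyStopping(\n",
"    monitor='val_loss', patience=10, restore_best_weights=True)\n",
"\n",
"history2 = model2.fit(X_train, y_train,\n",
"                      epochs=100, batch_size=64,\n",
"                      validation_data=(X_val, y_val),\n",
"                      callbacks=[early_stop])\n",
"```"
]
}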
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}