{"cells":[{"cell_type":"markdown","id":"498f1479","metadata":{"id":"498f1479"},"source":["## Decision tree exercises\n","\n","### 1) Preprocess the data\n","\n","- Load the Income dataset from the page of the course (train.csv).\n","\n","- Search for missing values and if needed, handle them!\n","\n","- Drop the finalweight (fnlwgt) and Educational_num columns!\n","\n","- Encode the non numeric variables into numeric ones! For the binary features simply encode them as (0/1). Do not create two separate columns for them!\n","\n","- Make some exploration of the categorical columns, plot the frequencies of each categories. Based on the finding, drop all entries that are outside the US.\n","\n","### 2) Train & visualize a decision tree classifier\n","\n","- Train a decision tree classifier using the sklearn API.\n","\n","- Use its default parameters and use all the data.\n","\n","- Visualize the decision tree, with the Gini impurities also showing on the plot. The plot_tree function in sklearn will be really helpful. You may or may not need to tune its arguments to get a reasonable result.\n","\n","- Manually check for two cases if the returned Gini impurities are correct.\n","\n","- In a few sentences, discuss the results.\n","\n","\n","### 3) Random forest feature importance vs Lasso features\n","- Train a random forest classifier on all the data using the sklearn API.\n","\n","- Use default values again, but fix the random_state to 137!\n","\n","- Plot the importance values of the most important features.\n","\n","- Create a bar plot where the height of the bar is the feature importance.\n","\n","- The feature_importances_ attribute is helpful.\n","\n","- Fit a Lasso regression with a hand tuned hyperparameter to end up with only approx. 10 non-zero coefficients (or use the same approach as before). What are the important columns here? Compare them with the ones you got before.\n","\n","### 4) Evaluation\n","\n","- Generate prediction probabilities with a decision tree and with a random forest model:\n","\n"," * Use 5-fold cross validation for both models.\n"," * Use default parameters for both models.\n"," \n","- Compare the two models with ROC curves.\n","\n","- Why does the shape of the decision tree's ROC curve looks different?\n","\n","### 5) Model tuning\n","\n","- Using 70% - 30% train-test split generate predictions for a random forest model.\n","- Set the random_state parameter for every run to for the train-test split and for the Random Forest Classifier as well!\n","- Plot the AUC as the function of the number of trees in the forest for both the traing and the test data!\n","- Do we experience overfitting if we use too many trees?\n","\n","\n"]}],"metadata":{"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.6"},"colab":{"provenance":[{"file_id":"1SBSG3k2I_XGNByNZWfxhvpU4BGiLx_Po","timestamp":1667336333242}]}},"nbformat":4,"nbformat_minor":5} |