{"cells":[{"cell_type":"markdown","id":"498f1479","metadata":{"id":"498f1479"},"source":["## Decision tree exercises\n","\n","### 1) Preprocess the data\n","\n","- Load the Income dataset from the page of the course (train.csv).\n","\n","- Search for missing values and if needed, handle them!\n","\n","- Drop the finalweight (fnlwgt) and Educational_num columns!\n","\n","- Encode the non numeric variables into numeric ones! For the binary features simply encode them as (0/1). Do not create two separate columns for them!\n","\n","- Make some exploration of the categorical columns, plot the frequencies of each categories. Based on the finding, drop all entries that are outside the US.\n","\n","### 2) Train & visualize a decision tree classifier\n","\n","- Train a decision tree classifier using the sklearn API.\n","\n","- Use its default parameters and use all the data.\n","\n","- Visualize the decision tree, with the Gini impurities also showing on the plot. The plot_tree function in sklearn will be really helpful. You may or may not need to tune its arguments to get a reasonable result.\n","\n","- Manually check for two cases if the returned Gini impurities are correct.\n","\n","- In a few sentences, discuss the results.\n","\n","\n","### 3) Random forest feature importance vs Lasso features\n","- Train a random forest classifier on all the data using the sklearn API.\n","\n","- Use default values again, but fix the random_state to 137!\n","\n","- Plot the importance values of the most important features.\n","\n","- Create a bar plot where the height of the bar is the feature importance.\n","\n","- The feature_importances_ attribute is helpful.\n","\n","- Fit a Lasso regression with a hand tuned hyperparameter to end up with only approx. 10 non-zero coefficients (or use the same approach as before). What are the important columns here? Compare them with the ones you got before.\n","\n","### 4) Evaluation\n","\n","- Generate prediction probabilities with a decision tree and with a random forest model:\n","\n"," * Use 5-fold cross validation for both models.\n"," * Use default parameters for both models.\n"," \n","- Compare the two models with ROC curves.\n","\n","- Why does the shape of the decision tree's ROC curve looks different?\n","\n","### 5) Model tuning\n","\n","- Using 70% - 30% train-test split generate predictions for a random forest model.\n","- Set the random_state parameter for every run to for the train-test split and for the Random Forest Classifier as well!\n","- Plot the AUC as the function of the number of trees in the forest for both the traing and the test data!\n","- Do we experience overfitting if we use too many trees?\n","\n","\n"]}],"metadata":{"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.6"},"colab":{"provenance":[{"file_id":"1SBSG3k2I_XGNByNZWfxhvpU4BGiLx_Po","timestamp":1667336333242}]}},"nbformat":4,"nbformat_minor":5} |