In this project I built supervised learning models to classify employee records. The classes to predict are:
- Active - the employee is still in their role
- Non-active - the employee has resigned
I pre-processed the data in Excel, removing one outlier and producing new features, as the data set was small at 1,056 rows. Some categorical features were also converted to numeric values in Excel. For example, Gender was originally "M" or "F", which I converted to 0 and 1 respectively. I also removed the employee number, as it provides no value as a feature and could compromise privacy.
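The same encoding and column removal could also be done in code rather than in Excel. The following is only a minimal sketch using pandas: the column names (`EmployeeNumber`, `Gender`, `Status`) and the sample rows are hypothetical and are not taken from the real data set.

```python
import pandas as pd

# Hypothetical sample rows; column names and values are assumptions,
# not the actual employee data.
df = pd.DataFrame({
    "EmployeeNumber": [101, 102, 103],
    "Gender": ["M", "F", "M"],
    "Status": ["Active", "Non-active", "Active"],
})

# Encode Gender as described in the text: "M" -> 0, "F" -> 1.
df["Gender"] = df["Gender"].map({"M": 0, "F": 1})

# Drop the identifier column: it carries no predictive value
# and could compromise privacy.
df = df.drop(columns=["EmployeeNumber"])

print(df)
```

Doing this step in code keeps the pre-processing repeatable if the source spreadsheet changes.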
After doing some research (see References), I found that scikit-learn's Decision Trees do not handle categorical (string) features correctly when they are encoded using the above approach. When added, these features provided no increase in accuracy, so I removed them. For example, Department: some departments have a highe