christinach/dataMethodology.md

## dataMethodology.md

      
Can we support the ticketing system by automatically classifying incoming emails?

Business understanding:

The goals:

Create a better way to triage incoming emails.
Provide better customer support.
Take the customer's account type into consideration.

The company has a large number of emails from customers. Each incoming email represents an open or closed ticket.

Analytic approach:

We will group the emails based on:

subject,
body,
type of account.

There is no known target variable, so we are dealing with an unsupervised problem. A clustering algorithm is the natural approach. We will create three clusters.
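
Clustering algorithms work on numeric vectors, so the subject, body and account type have to be encoded first. The following is a minimal sketch of one possible encoding, assuming each email is a dict with the keys `subject`, `body` and `account_type`. The account type names in `ACCOUNT_TYPES` and the helper functions are hypothetical; the real account types are not listed here.

```python
import re
from collections import Counter

# Hypothetical account types; replace with the company's real ones.
ACCOUNT_TYPES = ["free", "standard", "premium"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(emails, max_terms=1000):
    """Collect the most frequent terms across all subjects and bodies."""
    counts = Counter()
    for email in emails:
        counts.update(tokenize(email["subject"] + " " + email["body"]))
    return [term for term, _ in counts.most_common(max_terms)]


def to_vector(email, vocabulary):
    """Term counts for subject + body, followed by a one-hot account type."""
    counts = Counter(tokenize(email["subject"] + " " + email["body"]))
    term_part = [float(counts[term]) for term in vocabulary]
    account_part = [
        1.0 if email["account_type"] == account else 0.0
        for account in ACCOUNT_TYPES
    ]
    return term_part + account_part
```

Plain term counts are the simplest choice; a weighting scheme such as TF-IDF may separate the clusters better on real data.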

Data requirements


We will use emails from the last year.
We want emails spread across all seasons of the year, not concentrated in specific months.
We want emails from all the account types.
We will use 800,000 emails from the last year's historical data.
We want emails from different time zones.

Data Collection


Work with the DB administrator or the developers to run SQL queries based on the requirements defined in the previous step. Collect 800,000 emails covering winter, spring, summer and autumn.
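
Once the query results are available, the seasonal requirement can be enforced by sampling an equal share from each season. This is a sketch only: the equal split, the northern-hemisphere season boundaries, the `sample_by_season` helper and the `date` key are all assumptions, and it does not balance account types or time zones.

```python
import random
from collections import defaultdict
from datetime import date

# Meteorological seasons for the northern hemisphere (an assumption).
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def sample_by_season(emails, total, seed=0):
    """Draw up to total // 4 emails from each season, without replacement."""
    by_season = defaultdict(list)
    for email in emails:
        by_season[SEASON_BY_MONTH[email["date"].month]].append(email)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    per_season = total // 4
    sample = []
    for season in ("winter", "spring", "summer", "autumn"):
        pool = by_season[season]
        sample.extend(rng.sample(pool, min(per_season, len(pool))))
    return sample


# Illustrative data: 10 made-up emails on the first day of every month.
example_emails = [
    {"id": f"{month}-{n}", "date": date(2023, month, 1)}
    for month in range(1, 13)
    for n in range(10)
]
picked = sample_by_season(example_emails, total=40)
```

For the real data set the call would be `sample_by_season(emails, total=800_000)`, which takes 200,000 emails per season when enough are available.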

Data Understanding and Preparation


Import the data into a table with 4 variables:


Date
subject
body
account type


Look for missing values.
Calculate the percentage of missing values against the 800,000 emails.
Run descriptive statistics against the data columns.
Create histograms to understand the distribution of each variable and to see if any trends exist.
Use pairwise correlations to see how closely the variables are related and which, if any, are very highly correlated. Where variables are very highly correlated, keep only one of them for the model.
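
The first checks in this list can be sketched with the standard library alone. The example assumes each row is a dict keyed by the four variables, with `account_type` standing in for "account type", and that a missing value is stored as `None` or an empty string. The sample rows are made up for illustration.

```python
from collections import Counter
from datetime import date

COLUMNS = ["date", "subject", "body", "account_type"]


def missing_percentages(rows):
    """Percentage of rows where each column is None or an empty string."""
    total = len(rows)
    result = {}
    for column in COLUMNS:
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        result[column] = 100.0 * missing / total if total else 0.0
    return result


def emails_per_month(rows):
    """Histogram data: number of emails per calendar month (1 to 12)."""
    return Counter(row["date"].month for row in rows if row.get("date"))


# Made-up rows, only to show the shape of the input.
sample_rows = [
    {"date": date(2023, 1, 5), "subject": "Hi", "body": "text", "account_type": "free"},
    {"date": date(2023, 1, 9), "subject": "", "body": "text", "account_type": None},
    {"date": None, "subject": "x", "body": "y", "account_type": "free"},
    {"date": date(2023, 7, 1), "subject": "x", "body": "y", "account_type": "free"},
]
missing = missing_percentages(sample_rows)
per_month = emails_per_month(sample_rows)
```

The per-month counts can be passed to any plotting tool to draw the histogram and check that no month dominates the sample.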

Modeling and Evaluation

We will try different clustering algorithms, starting with k-means. We will split the data into 70% training and 30% test sets so that we can check that the model is not over-fitted.
The purpose of unsupervised learning with clustering is to find meaningful relationships in the data.
We will also calculate the within-cluster sum of squares for each cluster to see how good the fit is.
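
The modeling steps can be sketched as below. This is a plain-Python version of k-means (Lloyd's algorithm) for illustration, assuming the emails have already been converted to numeric vectors; a production run on 800,000 emails would more likely use a library implementation. All function names here are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def sum_of_squares_by_cluster(points, centroids):
    """Within-cluster sum of squared distances, one value per cluster."""
    totals = [0.0] * len(centroids)
    for p in points:
        i = nearest(p, centroids)
        totals[i] += squared_distance(p, centroids[i])
    return totals


def split_70_30(points, seed=0):
    """Shuffle a copy of the points and split it 70% / 30%."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]
```

A typical use is to fit the centroids on the training set with `kmeans(train, k=3)` and then compare the average squared distance per point on the training and test sets. A much larger value on the test set suggests the clusters do not generalize.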