Skip to content

Instantly share code, notes, and snippets.

@christinach
Last active March 9, 2021 03:30
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save christinach/990bc053d19180ec3b7b3f1409d44513 to your computer and use it in GitHub Desktop.
Save christinach/990bc053d19180ec3b7b3f1409d44513 to your computer and use it in GitHub Desktop.
Can we support the ticketing system by automatically classifying incoming emails?

Can we support the ticketing system by automatically classifying incoming emails?

Business understanding:

The goals:

  • Create a better way to triage incoming emails.
  • Provide a better customer support
  • Take into consideration the customer account.
    The company has a large amount of emails from customers. Each incoming email represents an open or closed ticket.

Analytic approach:

We will group the emails based on the

  1. subject,
  2. body,
  3. type of account The target is unknown so we are dealing with an unsupervised problem. The best approach is to use a cluster algorithm. We will create three clusters.

Data requirements

  • We will use emails from last year.
  • We want emails across the annual seasons and not to specific months.
  • We want emails from all the account types.
  • We will use 800,000 emails from historical data from the last year.
  • We want emails from different time zones.

Data Collection

  • Work with the DB administrator or the developers to run SQL queries based on the requirements defined in the previous step. Collect 800,000 emails considering winter, autumn, spring and summer.

Data Understanding and Preparation

  • Import the data in a table with 4 variables:
  1. Date
  2. subject
  3. body
  4. account type
  • Look for missing values
  • Calculate the percentage of the missing values against the 800,000 emails.
  • Run descriptive statistics agianst the data columns.
  • Create histograms to see if any trends exist and to understand their distribution
  • Use pairwise correlations, to see how closely certain variables were related, and which ones, if any, were very highly correlated then adjust the variables and use one of the them for the model.

Modeling and Evaluation

We will use different cluster algorithms. We will start with kmeans We will use 70% training and 30% test sets.We want to make sure that we avoid over-fitting the model.

The purpose of the unsupervised learning with clustering is to find meaningful relationships in the data.

We also calculate the sum of squares by cluster to see if it is a good fit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment