
@jyfeather
Created December 2, 2015 07:19

slide 1

Good morning, everyone. My name is Yan Jin, and I am a PhD student in the Industrial Engineering department at the University of Washington. Today I will introduce our project, which we call the Insight project. I am working on it together with my advisor, Dr. Shuai Huang, and Dr. Guan Wang from LinkedIn.

slide 2

The Insight project is web-based; this is a screenshot of it. The screenshot is from an old version, because the current version is under development and I could not get a screenshot of the newer one.

slide 3

This is the outline for today. First, I will talk about what this project can do, and then how it does that work. Basically it has two main functionalities: one goes from a given dataset to answers that users might be interested in, and the other goes from natural language questions that users ask to answers. I will cover some technical details at the end.

slide 4

We want the Insight project to automatically provide insights when users upload a dataset. So the Insight project can smartly match a data mining algorithm to a given dataset and then provide answers. Its features include: it supports formatted data for regression, classification, and clustering tasks, and also plain text data for text mining; and it supports answering natural language questions.

slide 5

This is the first important functionality: how to get answers that users might be interested in from a given dataset. This is the framework. Users upload raw data, which is preprocessed into processed data. Based on the processed data, we do profiling to obtain a set of data characteristics, and we use these characteristics to match a data mining algorithm. After we find the best-fitting algorithm for the uploaded data, we use predefined mappings between algorithms and technical questions, and between technical questions and general questions, to find a few answers that users might be interested in. Now I will introduce each step separately.
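As a rough illustration, the framework can be read as a chain of functions. Here is a minimal sketch in R; every name below is a hypothetical placeholder, and each step is only a stub whose real logic is discussed on the slides that follow:

    # Sketch of the Insight pipeline; each step is a stub standing in
    # for the real implementation described on later slides.
    preprocess      <- function(raw) raw                                # slide 7
    profile_data    <- function(data) list(n_rows = nrow(data))         # slide 9
    match_algorithm <- function(traits) "random forest"                 # slides 11-12
    lookup_answers  <- function(algo) c("question 13", "question 14")   # slide 14

    generate_insights <- function(raw_data) {
      processed <- preprocess(raw_data)        # clean the uploaded data
      traits    <- profile_data(processed)     # extract data characteristics
      algorithm <- match_algorithm(traits)     # pick the best-fitting algorithm
      lookup_answers(algorithm)                # map algorithm to questions/answers
    }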

slide 6

First, let's take a look at the preprocessing part.

slide 7

Since the input of the Insight project includes two parts, the whole dataset and a user-specified variable, we need to do preprocessing for both.

In terms of preprocessing, for formatted data we do missing data imputation, normalization, and so on; for plain text we do text preprocessing such as removing stopwords. We also defined a set of rules to check the raw data; if any of these checks fail, the system notifies users and helps them debug their dataset manually.
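For the formatted-data case, here is a minimal R sketch of what imputation, normalization, and one rule check might look like; the median imputation and the specific rule shown here are my assumptions, not the project's actual rules:

    # Sketch: impute missing numeric values with the column median, then
    # normalize; one example rule check rejects columns that are all NA.
    preprocess_formatted <- function(df) {
      all_na <- vapply(df, function(col) all(is.na(col)), logical(1))
      if (any(all_na))
        stop("columns with no data: ", paste(names(df)[all_na], collapse = ", "))
      num <- vapply(df, is.numeric, logical(1))
      df[num] <- lapply(df[num], function(col) {
        col[is.na(col)] <- median(col, na.rm = TRUE)  # missing data imputation
        as.numeric(scale(col))                        # normalization (z-score)
      })
      df
    }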

slide 8

Second is how to do profiling to get data characteristics from the processed data. Because it is not easy to match the dataset to a data mining algorithm directly, we need this step to extract some data characteristics first.

slide 9

We defined some general data characteristics, listed in this table. Some of them are easy to compute, like the mean and max; some of them require more computation, like PCA results and the determinant. One thing I have to mention here is that the first 9 features are for the user-specified variable, and the rest are for the whole dataset.
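To give a flavor of the profiling step, here is a minimal R sketch computing a few representative characteristics of the kinds named above (mean, max, PCA results, determinant); the full feature set is the one in the table:

    # Sketch: a few representative data characteristics for a numeric dataset.
    profile_dataset <- function(df) {
      X   <- as.matrix(df[vapply(df, is.numeric, logical(1))])
      pca <- prcomp(X, scale. = TRUE)
      list(
        n_rows  = nrow(X),
        n_cols  = ncol(X),
        means   = colMeans(X),
        maxima  = apply(X, 2, max),
        cov_det = det(cov(X)),                       # determinant of covariance
        pc1_var = summary(pca)$importance[2, 1]      # variance explained by PC1
      )
    }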

slide 10

Then comes the main part of this framework: matching data characteristics to a data mining algorithm.

slide 11

The first step uses some heuristics to match the dataset to one of four groups: text mining, classification, regression, and clustering. Then within each group, a second matching selects a specific algorithm. In the first matching, because we have two inputs, we first check the data type of the uploaded dataset: if it is in plain text format, we treat it as a text mining problem; otherwise we further check the variable of interest: if it is categorical, we do a classification task; if it is numerical, regression; and if it is not specified, clustering. For the second matching, I will take the classification task as an example.
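This first matching is essentially a small decision rule; here is a minimal R sketch, assuming the text-format flag and the variable of interest are supplied as shown:

    # Sketch: first-stage matching heuristic.
    match_group <- function(data, target = NULL, is_plain_text = FALSE) {
      if (is_plain_text) return("text mining")
      if (is.null(target)) return("clustering")      # no variable of interest
      v <- data[[target]]
      if (is.factor(v) || is.character(v)) return("classification")
      "regression"                                   # numeric variable of interest
    }

    match_group(iris, "Species")   # returns "classification"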

slide 12

In the second matching stage, we need to train a classification model to do the matching, so we used UCI datasets as training sets. We screened out 39 datasets from 335 UCI datasets, but this was not enough, so we kept all columns and resampled over rows to get 4556 datasets in total for training.

Then we applied profiling over the training sets to get data characteristics as predictors. On the other hand, we ran all classification algorithms over all datasets to get the performance, such as accuracy, of each algorithm, and chose the algorithm with the best performance as the response variable. Once we had predictors and response, we trained a random forest to do the second matching for new incoming datasets. This is a screenshot of the performance of each algorithm on each dataset.
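Conceptually, this second matching is a meta-learning step. Here is a minimal sketch using the randomForest R package; the 'profiles' data frame below is toy data standing in for the real 4556 profiled training datasets, and its column names are hypothetical:

    # Sketch: train a random forest that maps dataset characteristics to
    # the best-performing classification algorithm.
    library(randomForest)

    # Toy stand-in: one row per training dataset, with profiled
    # characteristics plus 'best_algo', the winning algorithm (a factor).
    set.seed(1)
    profiles <- data.frame(
      n_rows    = sample(100:10000, 60),
      n_cols    = sample(2:50, 60, replace = TRUE),
      best_algo = factor(sample(c("random forest", "svm", "naive bayes"),
                                60, replace = TRUE))
    )

    meta_model <- randomForest(best_algo ~ ., data = profiles)

    # For a new incoming dataset: profile it the same way, then predict.
    predict(meta_model, newdata = profiles[1:3, ])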

slide 13

Then we check the predefined tables to look up the related technical questions and general questions.

slide 14

Here, table 1 shows some of the algorithms that the Insight project supports, and table 2 shows some of the supported questions. In table 1, each algorithm (each row) has some predefined questions, listed in the last column; for example, algorithm 13, random forest, can answer questions 13, 14, 15, and 28, and these questions are defined in table 2.
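The lookup itself is a simple table join; here is a minimal sketch, where only the random forest row and its question IDs come from the slide, and the question texts are hypothetical placeholders:

    # Sketch: predefined algorithm-to-question mapping (tables 1 and 2).
    algo_questions <- list("random forest" = c(13, 14, 15, 28))  # table 1 row
    question_bank  <- c("13" = "Which variables matter most?",   # placeholder texts
                        "14" = "How accurate is the prediction?",
                        "15" = "What does the model predict for new data?",
                        "28" = "Which records look unusual?")

    lookup_questions <- function(algo)
      unname(question_bank[as.character(algo_questions[[algo]])])

    lookup_questions("random forest")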

slide 15

This is the second important functionality of the Insight project: it can handle natural language questions. The problem is, given a question from a user, how can we find a question in our database that is close to the user's question? To do this, we convert each question into a numerical vector in Euclidean space, and then we can compute their cosine similarity.

So we use a word2vec model to achieve this goal. Here Q1 to Qn are the questions in the database; we extract some keywords and query them in the word2vec model to get a vector for each question, like this. For each question, the vector contains words and their values. For the user's question, we follow the same procedure to get a vector. Then we compute the similarities to find the top x questions closest to the user's question.
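The similarity computation itself is straightforward; here is a minimal R sketch, assuming each question has already been mapped to a numeric vector in a shared space (the toy vectors below are just for illustration):

    # Sketch: rank database questions by cosine similarity to a user question.
    cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

    # 'db_vectors' is a list of question vectors; return the top x matches.
    top_matches <- function(user_vec, db_vectors, x = 3) {
      sims <- vapply(db_vectors, cosine_sim, numeric(1), b = user_vec)
      sort(sims, decreasing = TRUE)[seq_len(min(x, length(sims)))]
    }

    db_vectors <- list(q1 = c(1, 0, 1), q2 = c(0, 1, 1))  # toy vectors
    top_matches(c(1, 0, 0), db_vectors, x = 1)            # q1 is closest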

slide 16

The performance of this procedure is sensitive to the quality of the word2vec model. Training a good word2vec model requires a large, domain-specific dataset, so we got our training data from Stack Exchange. This is a screenshot of a Stack Exchange question; typically, it contains a title, content, and some answers following the question. Stack Exchange has many branches; I used all the questions from the data science and statistical analysis branches.

And I used Google's word2vec to train the model; this is the original word2vec toolbox. This is a screenshot of the word2vec model. After a query, it returns a table with two columns: one is the word, and the other is the distance value.

slide 17

Here are some technical details. Basically, we used Go and R for programming.

slide 18

Thank you for listening.
