Updated at 3:46PM, June 4
We are going to implement this model using SINGA, following the QA-LSTM reference implementation: https://github.com/jojonki/QA-LSTM. Please ignore the text below for now.
Create a question answering model for customer support.
Dataset: https://www.kaggle.com/thoughtvector/customer-support-on-twitter/data
Preprocess the data to:
- build a table where each row pairs a query tweet with its response tweet, as in https://www.kaggle.com/psbots/customer-support-meets-spacy-universe
- delete non-English tweets (https://www.kaggle.com/psbots/customer-support-meets-spacy-universe) and replace some words (https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing), etc.
- get a clean table for the dataset stored in a pandas DataFrame, with columns for the query text, response text, company, time, and one additional column for the emoji (if any)
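The preprocessing steps above can be sketched with pandas. This is a minimal sketch, not the final pipeline: the column names (`tweet_id`, `author_id`, `inbound`, `created_at`, `text`, `in_response_to_tweet_id`) are assumed from the Kaggle CSV, the language check is a crude ASCII-ratio heuristic standing in for a real detector (e.g. spaCy or langdetect), and the emoji regex only covers the common Unicode ranges.

```python
import re
import pandas as pd

# Unicode ranges covering most emoji; an approximation, not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def looks_english(text: str) -> bool:
    """Crude placeholder for a real language detector:
    treat a tweet as English if most characters are ASCII."""
    if not text:
        return False
    return sum(c.isascii() for c in text) / len(text) > 0.9

def build_pairs(tweets: pd.DataFrame) -> pd.DataFrame:
    """Join each inbound customer query with the company response
    that answers it, then keep only the columns we need."""
    queries = tweets[tweets["inbound"]]
    responses = tweets[~tweets["inbound"]]
    pairs = responses.merge(
        queries,
        left_on="in_response_to_tweet_id",
        right_on="tweet_id",
        suffixes=("_response", "_query"),
    )
    out = pd.DataFrame({
        "query": pairs["text_query"],
        "response": pairs["text_response"],
        "company": pairs["author_id_response"],
        "time": pairs["created_at_query"],
    })
    # Extract any emoji from the query text; None if there are none.
    out["emoji"] = out["query"].apply(
        lambda t: "".join(EMOJI_RE.findall(t)) or None)
    # Drop pairs whose query does not look English.
    return out[out["query"].apply(looks_english)].reset_index(drop=True)
```

Joining responses back onto the queries they answer gives one (query, response) row per pair, matching the table layout described above.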
Two candidate approaches:
1. Information-retrieval based. Search the stored queries of the same company for tweets similar to the new query and return their response tweets. The matching is done by a trained DL model: Bi-LSTM + average-pooling over time + a linear layer, with cosine similarity as the matching score. If we cannot find a Python library for the retrieval step, we can use Elasticsearch. Reference: https://cloud.tencent.com/developer/article/1196826
2. Generate the response with a seq2seq model. Reference: https://www.kaggle.com/soaxelbrooke/twitter-basic-seq2seq
Approach 1 is easier for the DL model development, but it requires the extra retrieval step.
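The matching score in the retrieval approach can be sketched independently of the encoder. In the sketch below the Bi-LSTM is stubbed out: `encode()` takes an already-computed (seq_len, hidden) matrix of per-token states (in the real model these would come from the Bi-LSTM trained in SINGA), average-pools over time, and applies the linear layer; stored queries are then ranked by cosine similarity. The weights `W`, `b` are hypothetical placeholders for trained parameters.

```python
import numpy as np

def encode(token_states: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pool a (seq_len, hidden) matrix of per-token states (stand-in for
    Bi-LSTM outputs) into one vector: average over time, then linear layer."""
    pooled = token_states.mean(axis=0)   # average-pooling over time
    return W @ pooled + b                # linear projection

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, used as the matching score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_queries(new_query, stored_queries, W, b):
    """Return indices of stored queries sorted by matching score, best first.
    The response tweet of the top-ranked stored query would be returned."""
    q = encode(new_query, W, b)
    scores = [cosine(q, encode(s, W, b)) for s in stored_queries]
    return np.argsort(scores)[::-1]
```

In production this linear scan would be replaced by the retrieval backend (a Python ANN library or Elasticsearch, as noted above).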
Development is on the panda cluster; the tutorial runs on Colab. Note that we may not be able to upload a large dataset to Colab; in that case, we need to select a subset of the original dataset.
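One way to cut the dataset down for Colab is to sample a fixed number of pairs per company, so the subset still covers every brand. A sketch assuming the pairs table built earlier (the `company` column name is an assumption):

```python
import pandas as pd

def sample_subset(df: pd.DataFrame, per_company: int, seed: int = 42) -> pd.DataFrame:
    """Take up to `per_company` rows from each company's group,
    keeping companies with fewer rows intact."""
    parts = [g.sample(min(per_company, len(g)), random_state=seed)
             for _, g in df.groupby("company")]
    return pd.concat(parts).reset_index(drop=True)
```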