Updated at 3:46PM, June 4
We are going to implement this model using SINGA, following the QA-LSTM reference implementation: https://github.com/jojonki/QA-LSTM. Please ignore the text below for now.
Create a question answering model for customer support.
Dataset: https://www.kaggle.com/thoughtvector/customer-support-on-twitter/data
Preprocess the data to:
- build a table where each row pairs a query tweet with its response tweet, as in https://www.kaggle.com/psbots/customer-support-meets-spacy-universe
- delete non-English tweets (https://www.kaggle.com/psbots/customer-support-meets-spacy-universe) and replace some words (https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing), etc.
- get a clean table for the dataset stored in a pandas DataFrame, with columns for the query text, response text, company, time, and one additional column for the emoji (if any)
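The preprocessing steps above can be sketched with pandas. This is a minimal sketch, not the final pipeline: the column names (`tweet_id`, `author_id`, `inbound`, `created_at`, `text`, `in_response_to_tweet_id`) are assumed from the Kaggle CSV, the language check is a crude ASCII-ratio heuristic standing in for a real detector (e.g. spaCy or langdetect), and the emoji regex only covers the common Unicode ranges.

```python
import re
import pandas as pd

# Unicode ranges covering most emoji; an approximation, not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def looks_english(text: str) -> bool:
    """Crude placeholder for a real language detector:
    treat a tweet as English if most characters are ASCII."""
    if not text:
        return False
    return sum(c.isascii() for c in text) / len(text) > 0.9

def build_pairs(tweets: pd.DataFrame) -> pd.DataFrame:
    """Join each inbound customer query with the company response
    that answers it, then keep only the columns we need."""
    queries = tweets[tweets["inbound"]]
    responses = tweets[~tweets["inbound"]]
    pairs = responses.merge(
        queries,
        left_on="in_response_to_tweet_id",
        right_on="tweet_id",
        suffixes=("_response", "_query"),
    )
    out = pd.DataFrame({
        "query": pairs["text_query"],
        "response": pairs["text_response"],
        "company": pairs["author_id_response"],
        "time": pairs["created_at_query"],
    })
    # Extract any emoji from the query text; None if there are none.
    out["emoji"] = out["query"].apply(
        lambda t: "".join(EMOJI_RE.findall(t)) or None)
    # Drop pairs whose query does not look English.
    return out[out["query"].apply(looks_english)].reset_index(drop=True)
```

Joining responses back onto the queries they answer gives one (query, response) row per pair, matching the table layout described above.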
Two candidate approaches:
1. Information-retrieval based. Search the stored queries of the same company for tweets similar to the new query and return their response tweets. The matching is done by a trained DL model: Bi-LSTM + average-pooling over time + a linear layer, with cosine similarity as the matching score. If we cannot find a Python library for the retrieval step, we can use Elasticsearch. Reference: https://cloud.tencent.com/developer/article/1196826
2. Generate the response with a seq2seq model. Reference: https://www.kaggle.com/soaxelbrooke/twitter-basic-seq2seq
Approach 1 is easier for the DL model development, but it requires the extra retrieval step.
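The matching score in the retrieval approach can be sketched independently of the encoder. In the sketch below the Bi-LSTM is stubbed out: `encode()` takes an already-computed (seq_len, hidden) matrix of per-token states (in the real model these would come from the Bi-LSTM trained in SINGA), average-pools over time, and applies the linear layer; stored queries are then ranked by cosine similarity. The weights `W`, `b` are hypothetical placeholders for trained parameters.

```python
import numpy as np

def encode(token_states: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pool a (seq_len, hidden) matrix of per-token states (stand-in for
    Bi-LSTM outputs) into one vector: average over time, then linear layer."""
    pooled = token_states.mean(axis=0)   # average-pooling over time
    return W @ pooled + b                # linear projection

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, used as the matching score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_queries(new_query, stored_queries, W, b):
    """Return indices of stored queries sorted by matching score, best first.
    The response tweet of the top-ranked stored query would be returned."""
    q = encode(new_query, W, b)
    scores = [cosine(q, encode(s, W, b)) for s in stored_queries]
    return np.argsort(scores)[::-1]
```

In production this linear scan would be replaced by the retrieval backend (a Python ANN library or Elasticsearch, as noted above).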
Development is on the panda cluster; the tutorial runs on Colab. Note that we may not be able to upload a large dataset to Colab; in that case, we need to select a subset of the original dataset.
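One way to cut the dataset down for Colab is to sample a fixed number of pairs per company, so the subset still covers every brand. A sketch assuming the pairs table built earlier (the `company` column name is an assumption):

```python
import pandas as pd

def sample_subset(df: pd.DataFrame, per_company: int, seed: int = 42) -> pd.DataFrame:
    """Take up to `per_company` rows from each company's group,
    keeping companies with fewer rows intact."""
    parts = [g.sample(min(per_company, len(g)), random_state=seed)
             for _, g in df.groupby("company")]
    return pd.concat(parts).reset_index(drop=True)
```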