Ömer Kırnap kirnap
@kirnap
kirnap / acl17.md
Last active July 31, 2017 03:26
A user manual and summary of the ACL 2017 conference

Day 1


There were several tutorials; I attended two of them:

- NLP for precision medicine

They apply machine learning algorithms together with NLP techniques to the task of detecting cancer cures. They borrow sequence tagging, dependency parsing, and word embeddings and apply them to their current research. As usual, they talked about graph LSTMs and biLSTMs. (The presenter)

- Deep Learning for Dialogue systems

They briefly introduced end-to-end conversational personal assistants such as Apple's Siri and Amazon's Alexa. In other words, this tutorial was a detailed overview of how to build dialogue systems on top of deep learning. The emphasis was on RL and various structured LSTM-like architectures. Their slides (at least take a look at the references between slides 100 and 120) are quite explanatory, and the cited papers keep up with current research in that area.