@kirnap
Last active May 4, 2023 17:33
Training data

We use the concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion. We sampled each dataset with a weighted probability; the sampling configuration is detailed in the data_config.json file.

| Dataset | Paper | Number of training tuples |
|---|---|---|
| Reddit comments (2015-2018) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | paper | 116,288,806 |
| WikiAnswers Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | paper | 52,603,982 |
| S2ORC (Title, Abstract) | paper | 41,769,185 |
| Stack Exchange (Title, Body) pairs | - | 25,316,456 |
| Stack Exchange (Title+Body, Answer) pairs | - | 21,396,559 |
| Stack Exchange (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| Yahoo Answers (Title, Answer) | paper | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | paper | 828,395 |
| SPECTER citation triplets | paper | 684,100 |
| Yahoo Answers (Question, Answer) | paper | 681,164 |
| Yahoo Answers (Title, Question) | paper | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | paper | 325,475 |
| Flickr 30k | paper | 317,695 |
| Stack Exchange Duplicate questions (titles) | | 304,525 |
| AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
| Stack Exchange Duplicate questions (bodies) | | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | | 250,460 |
| Sentence Compression | paper | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | paper | 112,696 |
| Quora Question Triplets | - | 103,663 |
| Simple Wikipedia | paper | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| SQuAD2.0 | paper | 87,599 |
| TriviaQA | - | 73,346 |
| Total | | 1,170,060,424 |
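
To illustrate what sampling each dataset with a weighted probability can look like, here is a minimal Python sketch. It assumes the weights are simply proportional to the tuple counts listed above; the actual weights live in data_config.json and may differ. The dictionary keys and the helper names `sampling_probabilities` and `sample_dataset_names` are hypothetical and chosen only for this example.

```python
import random

# Tuple counts copied from three rows of the table above (subset for brevity).
# ASSUMPTION: weights proportional to dataset size; the real weights are
# defined in data_config.json and are not reproduced here.
dataset_weights = {
    "reddit_comments": 726_484_430,
    "s2orc_citation_abstracts": 116_288_806,
    "wikianswers_duplicates": 77_427_422,
}


def sampling_probabilities(weights):
    """Normalize raw weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_dataset_names(weights, k, seed=0):
    """Draw k dataset names, each chosen with probability proportional to its weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


if __name__ == "__main__":
    for name, p in sampling_probabilities(dataset_weights).items():
        print(f"{name}: {p:.3f}")
    # Each drawn name says which dataset the next training batch would come from.
    print(sample_dataset_names(dataset_weights, k=5))
```

With size-proportional weights the largest dataset dominates the draws, which is one reason a configuration file might assign weights that differ from the raw counts.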