Skip to content

Instantly share code, notes, and snippets.

@thistleknot
Last active March 25, 2024 07:14
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save thistleknot/9aa7d6d2ef5f85da3cbda7cc6f55c18a to your computer and use it in GitHub Desktop.
Save thistleknot/9aa7d6d2ef5f85da3cbda7cc6f55c18a to your computer and use it in GitHub Desktop.
datasets
Target
Phi 1 - 7 Billion
#https://clarifai.com/microsoft/text-generation/models/phi-1_5
Phi-1.5 was trained on 150 billion tokens, with 20% from phi-1's training data(7B tokens) and 80% from the newly created synthetic, “textbook-like” data (roughly 20B tokens) for the purpose of teaching common sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.).
Base Model
X marksverdhei/wordnet-definitions-en-2021
X Wiki-text
X idioms
X sep
X iep
14996118
X english_quotes
X az quotes
X gracious quotes
X AyoubChLin/CNN_News_Articles_2011-2022
X open-web-math/open-web-math
#https://gist.github.com/thistleknot/442f9b92a1100374f2a498bf6a32e0e6
1000
#X books
https://huggingface.co/datasets/suolyer/pile_books3
#54,200,654
X Brown Corpus
#1393837
-Bookcorpus?
(single strings)
Textbooks
X open-phi/textbooks
(gpt-4)
3,785,702 tokens
?open-phi/programming_books_llama
X Lyrics (lyrics.jsonl)
X chloeliu/lyrics
X Santarabantoosoo/small_lyrics_dataset
-sheacon/song_lyrics
(embeddings only)
Essays/Papers
x qwedsacf/ivypanda-essays
datajuicer/the-pile-philpaper-refined-by-data-juicer
(100)
X CShorten/ML-ArXiv-Papers
146,034,774
X Sampled RedPajama
https://gist.github.com/thistleknot/442f9b92a1100374f2a498bf6a32e0e6
n = 1000
increased c4 to account for inability to get common_crawl
- = derived dataset
Fine-tune
Reasoning 1
tasksource-instruct-v0?row=0
icl-symbol-tuning-instruct
math
open-web-math/open-web-math
1000
math_qa
Problem
Options
Correct
Rationale
annotate_formula
qwedsacf/competition_math
qwedsacf/grade-school-math-instructions
vietgpt/OIG_mathqa_flanv2_en
ArtifactAI/arxiv-math-instruct-50k
ccdv/arxiv-summarization
4GB
Coding
Instruct
Coding
codeparrot/self-instruct-starcoder
Nan-Do/reason_code-search-net-python
mhhmm/leetcode-solutions-python
mlabonne/Evol-Instruct-Python-1k
jamescalam/llama-2-arxiv-papers-chunked
(summary)
mlabonne/Evol-Instruct-Python-1k
Nan-Do/reason_code-search-net-python
iamtarun/python_code_instructions_18k_alpaca
Reasoning 2
OpenOrca
scientific_and_creative_analogy
Sciq
Cosmos QA
commonsense_qa
supernatural
subjqa
piqa
qwedsacf/story-generation
Instruction
EvolInstruct
Dolly
hakurei/open-instruct-v1
LinkSoul/instruction_merge_set
search_qa
TLDR
CarperAI/openai_summarize_tldr
JulesBelveze/tldr_news
Prompt Engineered
-cod on wiki
-cod on news
-spo triplets on cod
w RAG
-spo triplets on wiki
-'unpack' on quotes
Sentiment
tyqiangz/multilingual-sentiments (english)
CoT
iamketan25/open-assistant-instructions
SirNeural/flan_v2
https://github.com/google-research/FLAN/tree/main/flan/v2/cot_data
ostapeno/flanv2_100k_2
LogiCOT
X flanv2
#https://github.com/google-research/FLAN/tree/main/flan/v2
User-AI loop
Collective Cognition
gpt 4 llm cleaned
acrastt/EverythingLM-V3-ShareGPT
Conversational
- Reddit
?https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments/viewer/default/explainlikeimfive?row=1
oa-conversation
AI-Conversations
chatbot_arena_conversations
samantha-data
ehartford/samantha-data
datasets/HuggingFaceH4/ultrachat_200k
Adverserial
supernaturalz
Misc
Investopdia
DPO
HuggingFaceH4/ultrafeedback_binarized
Dahoas/instruct_helpful_preferences
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment