Last active
March 25, 2024 07:14
datasets
Target | |
Phi 1 - 7 Billion | |
#https://clarifai.com/microsoft/text-generation/models/phi-1_5 | |
Phi-1.5 was trained on 150 billion tokens: 20% drawn from phi-1's training data (7B tokens) and 80% from newly created synthetic, “textbook-like” data (roughly 20B tokens), intended to teach common-sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.). | |
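As a rough check on the figures quoted above, the sketch below turns the 20%/80% split of 150 billion training tokens into tokens seen per source. The "passes" column is an assumption: it reads the 7B and 20B figures as the unique size of each source, which the quoted description suggests but does not state outright.

```python
# Figures taken from the quoted Phi-1.5 description; the "passes"
# interpretation (tokens seen / unique tokens) is an assumption.
TOTAL_TRAINING_TOKENS = 150e9

SOURCES = {
    "phi-1 training data": {"share": 0.20, "unique_tokens": 7e9},
    "synthetic textbook-like data": {"share": 0.80, "unique_tokens": 20e9},
}


def token_budget(total_tokens, sources):
    """Return tokens seen and approximate passes over each source."""
    budget = {}
    for name, spec in sources.items():
        tokens_seen = total_tokens * spec["share"]
        budget[name] = {
            "tokens_seen": tokens_seen,
            "approx_passes": tokens_seen / spec["unique_tokens"],
        }
    return budget


budget = token_budget(TOTAL_TRAINING_TOKENS, SOURCES)
for name, row in budget.items():
    print(
        f"{name}: {row['tokens_seen'] / 1e9:.0f}B tokens seen, "
        f"about {row['approx_passes']:.1f} passes"
    )
```

The same helper can be pointed at a smaller target budget to estimate how much of each source in this list would be needed.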
Base Model | |
X marksverdhei/wordnet-definitions-en-2021 | |
X Wiki-text | |
X idioms | |
X sep | |
X iep | |
14996118 | |
X english_quotes | |
X az quotes | |
X gracious quotes | |
X AyoubChLin/CNN_News_Articles_2011-2022 | |
X open-web-math/open-web-math | |
#https://gist.github.com/thistleknot/442f9b92a1100374f2a498bf6a32e0e6 | |
1000 | |
#X books | |
https://huggingface.co/datasets/suolyer/pile_books3 | |
#54,200,654 | |
X Brown Corpus | |
#1393837 | |
-Bookcorpus? | |
(single strings) | |
Textbooks | |
X open-phi/textbooks | |
(gpt-4) | |
3,785,702 tokens | |
?open-phi/programming_books_llama | |
X Lyrics (lyrics.jsonl) | |
X chloeliu/lyrics | |
X Santarabantoosoo/small_lyrics_dataset | |
-sheacon/song_lyrics | |
(embeddings only) | |
Essays/Papers | |
x qwedsacf/ivypanda-essays | |
datajuicer/the-pile-philpaper-refined-by-data-juicer | |
(100) | |
X CShorten/ML-ArXiv-Papers | |
146,034,774 | |
X Sampled RedPajama | |
https://gist.github.com/thistleknot/442f9b92a1100374f2a498bf6a32e0e6 | |
n = 1000 | |
increased c4 to account for inability to get common_crawl | |
- = derived dataset | |
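The RedPajama note above (n = 1000, with c4 increased because common_crawl could not be fetched) can be sketched as a weight reallocation. The percentages below are hypothetical placeholders, not the mix used in the linked gist, and the sketch assumes n is the total number of sampled documents rather than a per-subset count.

```python
# Hypothetical subset weights in percent; NOT the proportions from the gist.
HYPOTHETICAL_WEIGHTS = {
    "common_crawl": 60,
    "c4": 20,
    "github": 5,
    "arxiv": 5,
    "book": 4,
    "wikipedia": 3,
    "stackexchange": 3,
}


def reallocate(weights, missing, receiver):
    """Move the weight of an unavailable subset onto another subset."""
    adjusted = dict(weights)
    adjusted[receiver] += adjusted.pop(missing)
    return adjusted


def allocate_counts(weights, n):
    """Split n samples across subsets in proportion to their weights."""
    total = sum(weights.values())
    raw = {name: n * w / total for name, w in weights.items()}
    counts = {name: int(value) for name, value in raw.items()}
    # Hand out any samples lost to rounding, largest remainder first.
    leftover = n - sum(counts.values())
    by_remainder = sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = reallocate(HYPOTHETICAL_WEIGHTS, missing="common_crawl", receiver="c4")
counts = allocate_counts(weights, n=1000)
print(counts)
```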
Fine-tune | |
Reasoning 1 | |
tasksource-instruct-v0 | |
icl-symbol-tuning-instruct | |
math | |
open-web-math/open-web-math | |
1000 | |
math_qa | |
Problem | |
Options | |
Correct | |
Rationale | |
annotate_formula | |
qwedsacf/competition_math | |
qwedsacf/grade-school-math-instructions | |
vietgpt/OIG_mathqa_flanv2_en | |
ArtifactAI/arxiv-math-instruct-50k | |
ccdv/arxiv-summarization | |
4GB | |
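The math_qa fields listed above (Problem, Options, Correct, Rationale) map naturally onto a prompt/completion pair for fine-tuning. The sketch below is one possible layout, not the author's template: the key names follow the spelling used in this list and may differ from the dataset's actual column names, and the example record is made up for illustration.

```python
def format_math_qa(record):
    """Turn one math_qa-style record into a prompt/completion pair.

    Key names follow this list's spelling; the real dataset's column
    names may differ (assumption).
    """
    prompt = (
        f"Problem: {record['Problem']}\n"
        f"Options: {record['Options']}\n"
        "Answer:"
    )
    completion = f" {record['Correct']}\nRationale: {record['Rationale']}"
    return {"prompt": prompt, "completion": completion}


# Hypothetical record, written only to exercise the function.
example = {
    "Problem": "What is 15% of 200?",
    "Options": "a) 20, b) 30, c) 35, d) 40, e) 45",
    "Correct": "b",
    "Rationale": "15% of 200 is 0.15 * 200 = 30.",
}

pair = format_math_qa(example)
print(pair["prompt"] + pair["completion"])
```

Putting the answer letter before the rationale keeps the completion easy to score; swapping the order would give a rationale-first, chain-of-thought style target instead.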
Coding | |
Instruct | |
Coding | |
codeparrot/self-instruct-starcoder | |
Nan-Do/reason_code-search-net-python | |
mhhmm/leetcode-solutions-python | |
mlabonne/Evol-Instruct-Python-1k | |
jamescalam/llama-2-arxiv-papers-chunked | |
(summary) | |
mlabonne/Evol-Instruct-Python-1k | |
Nan-Do/reason_code-search-net-python | |
iamtarun/python_code_instructions_18k_alpaca | |
Reasoning 2 | |
OpenOrca | |
scientific_and_creative_analogy | |
Sciq | |
Cosmos QA | |
commonsense_qa | |
supernatural | |
subjqa | |
piqa | |
qwedsacf/story-generation | |
Instruction | |
EvolInstruct | |
Dolly | |
hakurei/open-instruct-v1 | |
LinkSoul/instruction_merge_set | |
search_qa | |
TLDR | |
CarperAI/openai_summarize_tldr | |
JulesBelveze/tldr_news | |
Prompt Engineered | |
-cod on wiki | |
-cod on news | |
-spo triplets on cod | |
w RAG | |
-spo triplets on wiki | |
-'unpack' on quotes | |
Sentiment | |
tyqiangz/multilingual-sentiments (english) | |
CoT | |
iamketan25/open-assistant-instructions | |
SirNeural/flan_v2 | |
https://github.com/google-research/FLAN/tree/main/flan/v2/cot_data | |
ostapeno/flanv2_100k_2 | |
LogiCOT | |
X flanv2 | |
#https://github.com/google-research/FLAN/tree/main/flan/v2 | |
User-AI loop | |
Collective Cognition | |
gpt 4 llm cleaned | |
acrastt/EverythingLM-V3-ShareGPT | |
Conversational | |
?https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments/viewer/default/explainlikeimfive?row=1 | |
oa-conversation | |
AI-Conversations | |
chatbot_arena_conversations | |
samantha-data | |
ehartford/samantha-data | |
HuggingFaceH4/ultrachat_200k | |
Adversarial | |
supernaturalz | |
Misc | |
Investopedia | |
DPO | |
HuggingFaceH4/ultrafeedback_binarized | |
Dahoas/instruct_helpful_preferences |