thistleknot/datasets.txt

## datasets.txt
Target
    Phi 1 - 7 Billion
    #https://clarifai.com/microsoft/text-generation/models/phi-1_5
    Phi-1.5 was trained on 150 billion tokens: 20% from phi-1's training data (7B tokens) and 80% from newly created synthetic, “textbook-like” data (roughly 20B tokens), intended to teach common-sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.).
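
The figures quoted above imply that each source was repeated during training. Here is a minimal sketch of that arithmetic. It assumes the percentages apply to the 150B-token training budget and that the parenthesised counts are the unique tokens available in each source. The names `token_budget` and `sources` are illustrative only.

```python
# Back-of-the-envelope check of the Phi-1.5 token mix quoted above.
# Assumption: shares apply to the 150B training budget; the 7B and 20B
# figures are the unique tokens available in each source.
TOTAL_TRAIN_TOKENS = 150e9

sources = {
    "phi-1 data": {"share": 0.20, "unique_tokens": 7e9},
    "synthetic textbook-like": {"share": 0.80, "unique_tokens": 20e9},
}


def token_budget(total, mix):
    """Return tokens seen and implied number of passes for each source."""
    out = {}
    for name, spec in mix.items():
        seen = total * spec["share"]
        out[name] = {
            "tokens_seen": seen,
            "passes": seen / spec["unique_tokens"],
        }
    return out


budget = token_budget(TOTAL_TRAIN_TOKENS, sources)
for name, b in budget.items():
    print(f"{name}: {b['tokens_seen'] / 1e9:.0f}B tokens seen, "
          f"~{b['passes']:.1f} passes")
```

Under these assumptions the phi-1 data accounts for about 30B of the tokens seen (roughly 4.3 passes over 7B) and the synthetic data for about 120B (6 passes over 20B).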

Base Model
    X marksverdhei/wordnet-definitions-en-2021
    X Wiki-text
    X idioms
    X sep

    X iep
        14996118
    X english_quotes
    X az quotes
    X gracious quotes
    X AyoubChLin/CNN_News_Articles_2011-2022
    X open-web-math/open-web-math
        #https://gist.github.com/thistleknot/442f9b92a1100374f2a498bf6a32e0e6
            1000

    #X books
        https://huggingface.co/datasets/suolyer/pile_books3
        #54,200,654

        X Brown Corpus
        #1393837

        -Bookcorpus?
            (single strings)

        Textbooks
            X open-phi/textbooks
                (gpt-4)
                3,785,702 tokens
            ?open-phi/programming_books_llama

    X Lyrics (lyrics.jsonl)
        X chloeliu/lyrics
        X Santarabantoosoo/small_lyrics_dataset
        -sheacon/song_lyrics
            (embeddings only)

    Essays/Papers
        X qwedsacf/ivypanda-essays
        datajuicer/the-pile-philpaper-refined-by-data-juicer
            (100)
        X CShorten/ML-ArXiv-Papers
            146,034,774


    X Sampled RedPajama
        https://gist.github.com/thistleknot/442f9b92a1100374f2a498bf6a32e0e6
        n = 1000
        increased c4 to account for inability to get common_crawl
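
One way to read the note about c4 and common_crawl: when a subset cannot be fetched, its share of the sample goes to another subset so the total stays at n. The sketch below makes several assumptions. The subset weights are hypothetical integers chosen for illustration (the actual proportions are in the gist linked above and are not reproduced here), n = 1000 is treated as the total sample size, and `reassign_share` and `allocate` are made-up helper names rather than library functions.

```python
def reassign_share(weights, missing, receiver):
    """Drop `missing` from the mix and add its weight to `receiver`."""
    new = dict(weights)
    moved = new.pop(missing)
    new[receiver] = new.get(receiver, 0) + moved
    return new


def allocate(weights, n):
    """Split n samples across subsets in proportion to integer weights.

    Uses the largest-remainder method so the counts always sum to n.
    """
    total = sum(weights.values())
    counts = {k: n * w // total for k, w in weights.items()}
    remainders = {k: n * w % total for k, w in weights.items()}
    leftover = n - sum(counts.values())
    ranked = sorted(remainders, key=remainders.get, reverse=True)
    for k in ranked[:leftover]:
        counts[k] += 1
    return counts


# Hypothetical weights (percent-like integers), for illustration only.
weights = {"common_crawl": 50, "c4": 20, "github": 10,
           "book": 10, "wikipedia": 10}

adjusted = reassign_share(weights, missing="common_crawl", receiver="c4")
counts = allocate(adjusted, n=1000)
print(counts)
```

Integer weights keep the split exact. With floating-point shares, flooring can lose a sample to rounding error.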

- = derived dataset
Fine-tune

    Reasoning 1
        tasksource-instruct-v0
        icl-symbol-tuning-instruct

    math
        open-web-math/open-web-math
            1000
        math_qa
            Problem
            Options
            Correct
            Rationale
            annotate_formula
        qwedsacf/competition_math
        qwedsacf/grade-school-math-instructions


        vietgpt/OIG_mathqa_flanv2_en
        ArtifactAI/arxiv-math-instruct-50k
        ccdv/arxiv-summarization
            4GB
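
The math_qa fields listed above (Problem, Options, Correct, Rationale) map naturally onto a prompt/response pair for fine-tuning. Below is a minimal sketch of one possible formatting. It assumes records arrive as dicts keyed exactly as written in this list; the key casing in the hosted dataset may differ. The example record is made up for illustration, and `format_math_qa` is a hypothetical helper.

```python
def format_math_qa(record):
    """Turn one math_qa-style record into a prompt/response pair.

    Assumes the keys Problem, Options, Correct and Rationale as listed
    in these notes; adjust the names to match the actual dataset schema.
    """
    prompt = (
        "Solve the problem and choose the correct option.\n\n"
        f"Problem: {record['Problem'].strip()}\n"
        f"Options: {record['Options'].strip()}\n"
    )
    response = (
        f"Rationale: {record['Rationale'].strip()}\n"
        f"Answer: {record['Correct'].strip()}"
    )
    return {"prompt": prompt, "response": response}


# Made-up record, for illustration only.
example = {
    "Problem": "What is 15% of 200?",
    "Options": "a) 20, b) 25, c) 30, d) 35, e) 40",
    "Correct": "c",
    "Rationale": "15% of 200 is 0.15 * 200 = 30, so the answer is c.",
}

pair = format_math_qa(example)
print(pair["prompt"])
print(pair["response"])
```

Placing the rationale before the answer keeps the target in a reason-then-answer order, which fits the CoT datasets listed further down.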

    Coding
        Instruct

        Coding
            codeparrot/self-instruct-starcoder
            Nan-Do/reason_code-search-net-python
            mhhmm/leetcode-solutions-python
            mlabonne/Evol-Instruct-Python-1k
            jamescalam/llama-2-arxiv-papers-chunked
                (summary)
            iamtarun/python_code_instructions_18k_alpaca

    Reasoning 2
        OpenOrca
        scientific_and_creative_analogy
        Sciq
        Cosmos QA
        commonsense_qa
        supernatural
        subjqa
        piqa
        qwedsacf/story-generation

    Instruction
        EvolInstruct
        Dolly
        hakurei/open-instruct-v1
        LinkSoul/instruction_merge_set
        search_qa

    TLDR
        CarperAI/openai_summarize_tldr
        JulesBelveze/tldr_news

    Prompt Engineered
        -cod on wiki
        -cod on news
        -spo triplets on cod

        w/ RAG
            -spo triplets on wiki
            -'unpack' on quotes

    Sentiment
        tyqiangz/multilingual-sentiments (english)

    CoT
        iamketan25/open-assistant-instructions
        SirNeural/flan_v2
            https://github.com/google-research/FLAN/tree/main/flan/v2/cot_data
        ostapeno/flanv2_100k_2
        LogiCoT
        X flanv2
            #https://github.com/google-research/FLAN/tree/main/flan/v2

    User-AI loop
        Collective Cognition
        gpt 4 llm cleaned
        acrastt/EverythingLM-V3-ShareGPT

    Conversational
        - Reddit
            ?https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments/viewer/default/explainlikeimfive?row=1
        oa-conversation

    AI-Conversations
        chatbot_arena_conversations
        samantha-data
        ehartford/samantha-data
        HuggingFaceH4/ultrachat_200k

    Adversarial
        supernaturalz

    Misc
        Investopedia

    DPO
        HuggingFaceH4/ultrafeedback_binarized
        Dahoas/instruct_helpful_preferences