
@danielrosehill
Created May 4, 2025 18:11
Have real, user-generated collections of recorded AI outputs been systematically gathered?

Overview of LLM Prompt-Response Datasets and Guidance for Creating Your Own

If you've been compiling a personal collection of prompts and responses from your interactions with AI tools, you're part of a growing practice in the AI community. Several large-scale datasets with similar structures have been developed to train, fine-tune, and evaluate language models. Below is an overview of some notable datasets, followed by guidance on how you might structure and share your own.


Notable LLM Prompt-Response Datasets

1. ShareGPT

  • Description: A collection of user-shared ChatGPT conversations.
  • Usage: Used to train models such as Vicuna-13B.
  • Access: Available on Hugging Face.

2. LaMini-Instruction

  • Description: Contains 2.8 million instruction-response pairs derived from models like GPT and various instruction datasets.
  • Purpose: Enhances models' capabilities in responding to human-like instructions.
  • Access: Hosted on Hugging Face.

3. WizardLM Evol-Instruct

  • Description: Features evolved instructions generated by LLMs to increase complexity and diversity.
  • Purpose: Improves instruction-following abilities of models.
  • Access: Available at Hugging Face.

4. UltraChat

  • Description: Comprises 1.5 million multi-turn dialogues generated by AI models.
  • Purpose: Aims to improve the naturalness of dialogue-based responses.
  • Access: Dataset details can be found on Hugging Face.

5. OpenAssistant Conversations (OASST1)

  • Description: A human-generated, human-annotated assistant-style conversation corpus.
  • Purpose: Supports research on large-scale alignment of language models.
  • Access: Available on Hugging Face.

Structuring and Sharing Your Own Dataset

If you're considering sharing your collection:

  • Data Structure: Organize your data with clear fields such as:

    • prompt: The input given to the AI.
    • response: The AI's output.
    • timestamp: When the interaction occurred.
    • model: The AI model used.
    • metadata: Any additional relevant information.
  • Anonymization: Ensure all personal or sensitive information is removed to protect privacy.

  • Licensing: Choose an appropriate open-source license that specifies how others can use your dataset.

  • Hosting: Platforms like Hugging Face Datasets are ideal for sharing datasets with the community.

  • Documentation: Provide a comprehensive README file that explains the dataset's structure, purpose, and any preprocessing steps taken.
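The structuring and anonymization steps above can be sketched as a small export script. This is a minimal illustration, not a fixed standard: the field names follow the suggested schema, the example values are invented, and the email-redaction regex stands in for a much broader PII-scrubbing pass you would need in practice. JSON Lines (one JSON object per line) is a common interchange format for prompt-response datasets and is well supported by dataset-hosting platforms.

```python
import json
import re
from datetime import datetime, timezone

# Naive anonymization: redact email addresses only. A real dataset needs a
# broader pass (names, phone numbers, account IDs, etc.).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub(text):
    """Replace email addresses with a redaction placeholder."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def make_record(prompt, response, model, metadata=None):
    """Build one record using the suggested fields: prompt, response,
    timestamp, model, metadata."""
    return {
        "prompt": scrub(prompt),
        "response": scrub(response),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "metadata": metadata or {},
    }

# Example values are hypothetical, for illustration only.
records = [
    make_record(
        prompt="Summarize this email from jane.doe@example.com",
        response="The email from jane.doe@example.com asks for a status update.",
        model="gpt-4o",
        metadata={"source": "personal-archive"},
    )
]

# Write one JSON object per line (JSON Lines).
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

A file in this shape can typically be uploaded directly to a hosting platform such as Hugging Face Datasets, with the README documenting each field.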

By following these guidelines, your dataset can become a valuable resource for researchers and developers working on language models.


This gist was generated with the help of OpenAI based on information provided by the user.
