Goal: not to become an enterprisey tool - so probably no fancy collaboration or auto-evaluation by the LLM itself. Focus instead on locally hosted open-source LLM models, and on use cases like tool use (retrieval-augmented chatbots), document ingestion, and agents.
- Automatic output parsing and logging
- Prompt version control
- Variations - try concurrently
- Experiment - models and params... dataset?
- Allow rerunning/sampling multiple outputs to check robustness/reliability (a sketch of the per-run record follows this list)
- Show rendered prompt
- Set default choice for inputs
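
A rough sketch of what one logged run could store, covering the auto-parsing/logging and multi-sample ideas above. All names here are hypothetical, not an existing API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RunRecord:
    """One logged LLM call; an experiment stores many of these."""
    prompt_version: str   # pinned version of the prompt template
    rendered_prompt: str  # the final prompt actually sent to the model
    model: str            # e.g. "llama-2-13b" served locally
    params: dict          # temperature, max_tokens, ...
    raw_output: str
    parsed_output: dict   # result of automatic output parsing
    sample_index: int = 0 # nth sample when rerunning for robustness
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```

Sampling the same configuration several times then just means several records that differ only in `sample_index`.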
How about these (in other products):
- Compare results side by side
- Prompt Version diff
- Generate dataset from chat (interactive discovery)
- Input values
- Rich input data structure with editor UI
- Components/inheritance
- Version pinning, and/or specifying multiple versions to instantiate an experiment
- Quick extract
- Guidance
- Continuation
- Union of sets
  - Each set covers one type of adjustable thing
  - Each set can be "Try all", "Constant", or "Try subset" (see the sketch after this list)
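
The "Try all" / "Constant" / "Try subset" sets map naturally onto a Cartesian product over the adjustable axes. A minimal sketch with hypothetical names (in this reading, "Try subset" is just "Try all" over a user-picked subset):

```python
from itertools import product

def expand(axes: dict) -> list[dict]:
    """axes maps a name to ("all", values), ("subset", values),
    or ("constant", value); returns one dict of choices per run."""
    names, pools = [], []
    for name, (mode, value) in axes.items():
        names.append(name)
        pools.append([value] if mode == "constant" else list(value))
    return [dict(zip(names, combo)) for combo in product(*pools)]

runs = expand({
    "prompt": ("subset", ["v3", "v4-branch"]),
    "model": ("all", ["llama-2-7b", "mistral-7b"]),
    "params": ("constant", {"temperature": 0.7}),
})
assert len(runs) == 4  # 2 prompts x 2 models x 1 param set
```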
Typical authoring workflow:
- Draft initial prompt
  - Can extract logical units
  - Structured inputs
- First manual run
- Try variations
- Branch (version tree sketched after this list)
- Then "Run Experiment"
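
Branching and version pinning suggest a parent-pointer tree over prompt versions, roughly like lightweight git commits. A sketch under that assumption (the hashing scheme is illustrative only):

```python
from __future__ import annotations
from dataclasses import dataclass
from hashlib import sha256

@dataclass(frozen=True)
class PromptVersion:
    text: str
    parent: PromptVersion | None = None  # None for the initial draft
    tag: str | None = None               # e.g. "accepted", "v1.2"

    @property
    def version_id(self) -> str:
        """Content-addressed id, so experiments can pin exact versions."""
        parent_id = self.parent.version_id if self.parent else ""
        return sha256((parent_id + self.text).encode()).hexdigest()[:12]

root = PromptVersion("Summarize the document:\n{document}")
branch_a = PromptVersion(root.text + "\nUse bullet points.", parent=root)
branch_b = PromptVersion(root.text + "\nUse one paragraph.", parent=root)
# Two branches share a parent; an experiment stores a version_id.
```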
Later during review:
- Can view past experiments
- Experiment pinned to specific version of prompt
- Can choose a "good" result and resume from there (load all states)
- Can modify based on experiment results
  - e.g. token limit reached -> ask the model to extend the answer (i.e. relax the token limit; sketched after this list)
- Can manually intervene/edit
- Quick extract for continuation
- Can save a dataset for fine-tuning
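
The token-limit case above could work by checking the finish reason and re-asking with the partial answer fed back in. A sketch against an OpenAI-compatible chat endpoint, which many local runners expose; the URL and model name are assumptions:

```python
import requests

API = "http://localhost:8000/v1/chat/completions"  # assumed local server

def complete(messages: list[dict], max_tokens: int) -> dict:
    r = requests.post(API, json={
        "model": "local-model",  # assumption: whatever model is loaded
        "messages": messages,
        "max_tokens": max_tokens,
    })
    r.raise_for_status()
    return r.json()["choices"][0]

def run_with_continuation(messages: list[dict],
                          max_tokens: int = 256, max_rounds: int = 3) -> str:
    """Ask for a continuation whenever the model stops on the token limit."""
    text = ""
    for _ in range(max_rounds):
        choice = complete(messages, max_tokens)
        text += choice["message"]["content"]
        if choice["finish_reason"] != "length":  # finished normally
            break
        # Token limit reached: feed the partial answer back and continue.
        messages = messages + [
            {"role": "assistant", "content": text},
            {"role": "user", "content": "Continue the answer."},
        ]
    return text
```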
- Prompt suggestion ability
  - Search on prompthub
  - AI suggestion based on a magic prompt
  - Insert prompt snippet
- Data auto-generation
  - Let the LLM do it (!) (see the sketch after this list)
- Data import
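
For "let the LLM do it", one workable pattern is to ask the model for one JSON object per line and keep only the rows that parse, since LLM output is noisy. A sketch; the keys and the surrounding generation prompt are made up for illustration:

```python
import json

GEN_PROMPT = """Generate {n} test inputs for the prompt below.
Output one JSON object per line with keys "document" and "question".

Prompt under test:
{prompt}"""

def parse_generated_rows(raw: str) -> list[dict]:
    """Keep only the lines that parse as JSON objects."""
    rows = []
    for line in raw.splitlines():
        line = line.strip()
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            rows.append(obj)
    return rows
```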
Once we're satisfied with the result:
- Publish results in a fixed prompt + dataset.
- Have a separate UI (playground) that lets other users try it out quickly.
- Export to download a file containing all relevant data
  - So that it can be used elsewhere, e.g. in an LLM application development framework (one possible format sketched below).
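
The export could be one self-describing JSON file bundling the pinned prompt, defaults, and dataset so another framework can replay it. One possible shape (the fields are assumptions, not a standard):

```python
import json

bundle = {
    "schema_version": 1,
    "prompt": {
        "version_id": "a1b2c3d4e5f6",  # pinned prompt version
        "template": "Summarize:\n{document}",
        "input_variables": ["document"],
    },
    "default_model": "llama-2-13b",
    "default_params": {"temperature": 0.7, "max_tokens": 512},
    "dataset": [
        {"document": "First example text..."},
        {"document": "Second example text..."},
    ],
}

with open("prompt_bundle.json", "w") as f:
    json.dump(bundle, f, indent=2)
```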
When running larger experiments, we may want to outsource evaluation to a team of human raters. It would be nice to have a separate part of the tool to manage this.
(Integration with Scale AI?)
UI ideas:
- Separate page
  - Category and keyword
  - Display versions + branches
  - List tagged versions
- New Prompt
  - Basic options (empty, from template)
  - Wizard to guide beginners (choose a base template + content from AI or prompthub)
- Two-column panel
  - Left: Edit prompts or data (more later)
  - Right: Experiment results
    - Filtered list of experiments (can change filter criteria)
    - Accordion to expand individual experiments
    - Alternative: competition view of experiments
      - Current accepted choice on top
      - Candidates at bottom
- Has other panels
- Run Experiment
  - Using defaults
  - Use last config
  - New Experiment (full config)
- Set Models and Params here
  - Can have multiple named instances
  - Must have one default model and default params (used for quick runs; sketched at the end of this section)
- Root Prompt and subprompts with tree view
- List view of variations
  - Preview of the rendered prompt?
- List view of all input variables
  - Natural text or structured data
  - Advanced JSON editor for structured data
- Dataset (with a default; also list view)
- Big table
  - Each row is an adjustable input type (prompt variants, dataset, model, params)
  - First column: run default, run specifics (ad hoc value?)
  - Second column: filterable dropdown to choose (can also enter text + autocomplete)
- Show the compiled prompt template
  - Realtime update showing the actual rendered final prompt as the user enters values for the variables? (see the sketch below)
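
For the realtime preview, Python's string.Template.safe_substitute is a handy model of the behavior we'd want: variables the user hasn't filled in yet stay visible as placeholders instead of raising. A minimal sketch (the tool's actual template syntax may differ):

```python
from string import Template

template = Template("You are a $role.\nAnswer using $document.")

# Partial input: the preview still renders, leaving $document as-is.
print(template.safe_substitute({"role": "helpful librarian"}))
# You are a helpful librarian.
# Answer using $document.
```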
- Simplified input
  - Each variable has a textfield/JSON editor
  - Dropdown to choose an example value
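
Finally, a sketch of the "named model/param instances, exactly one default" rule from the Run Experiment config above (all names hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ModelInstance:
    name: str                 # user-chosen label, e.g. "fast-draft"
    model: str                # e.g. "mistral-7b"
    params: dict              # temperature, max_tokens, ...
    is_default: bool = False  # the instance used for quick runs

def default_instance(instances: list[ModelInstance]) -> ModelInstance:
    """Quick runs need exactly one default instance to fall back on."""
    defaults = [i for i in instances if i.is_default]
    if len(defaults) != 1:
        raise ValueError(f"expected exactly 1 default, got {len(defaults)}")
    return defaults[0]
```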