The function `parse_instruction_fields` returns a tuple of `(instruction, input, response)`.
The key to supporting a new input format is making sure you can parse your data into these three parts and handle each of them.
prompt_tokenizers.InstructionPromptTokenizingStrategy
def tokenize_prompt(self, prompt):
    """Tokenize one example as prompt tokens followed by response tokens.

    The raw example is split by ``parse_instruction_fields`` into
    ``(instruction, input, response)``.  The rendered prompt half is
    tokenized without an EOS token; the response half is tokenized with
    EOS but without BOS, and the two halves are concatenated.  When
    ``train_on_inputs`` is false, the prompt half's labels are masked
    with ``IGNORE_INDEX`` so loss is computed on the response only.
    """
    (
        instruction,
        input,  # pylint: disable=redefined-builtin
        response,
    ) = self.parse_instruction_fields(prompt)

    # build_prompt yields rendered text; only the first item is needed.
    rendered_prompt = next(iter(self.prompter.build_prompt(instruction, input)))

    tokenized = self._tokenize(rendered_prompt, add_eos_token=False)
    if not self.train_on_inputs:
        # Mask every prompt token so the loss ignores the input half.
        # TODO this could be sped up using numpy array slicing
        tokenized["labels"] = [IGNORE_INDEX] * len(tokenized["input_ids"])

    response_tokens = self._tokenize(
        response, strip_bos_token=True, add_eos_token=True
    )
    tokenized["input_ids"] += response_tokens["input_ids"]
    tokenized["attention_mask"] += response_tokens["attention_mask"]
    # Response tokens are always trained on, so labels mirror input_ids.
    tokenized["labels"] += response_tokens["input_ids"]
In Axolotl, many prompt strategies override the `parse_instruction_fields`
method like this:
prompt_tokenizers.AlpacaPromptTokenizingStrategy
class AlpacaPromptTokenizingStrategy(InstructionPromptTokenizingStrategy):
    """
    Tokenizing strategy for Alpaca prompts.
    """

    def parse_instruction_fields(self, prompt) -> Tuple[str, str, str]:
        """Map an Alpaca-format example onto (instruction, input, response).

        The "input" field is optional in Alpaca data, so it falls back to
        an empty string when absent; dict.get does this in a single lookup
        instead of the membership-test-then-index double lookup.
        """
        return (
            prompt["instruction"],
            prompt.get("input", ""),
            prompt["output"],
        )
The strategy above is constructed by the `load` function in `prompt_strategies.alpaca_chat`:
def load(tokenizer, cfg, ds_cfg: Optional[Dict[str, Any]] = None):
    """Build an ``AlpacaPromptTokenizingStrategy`` for alpaca_chat datasets.

    The conversation style defaults to ``PromptStyle.CHAT`` and may be
    overridden via the dataset config's ``"conversation"`` key.
    """
    conversation = (ds_cfg or {}).get("conversation", PromptStyle.CHAT.value)
    prompter = AlpacaPrompter(conversation)
    return AlpacaPromptTokenizingStrategy(
        prompter,
        tokenizer,
        cfg.train_on_inputs,
        cfg.sequence_len,
    )