@joecummings
Created June 4, 2024 22:21
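from torchtune.models.llama3 import llama3_tokenizer
from torchtune.datasets import instruct_dataset

# Load the Llama 3 tokenizer from a local copy of the model files.
tokenizer = llama3_tokenizer("./model/original/tokenizer.model")

# Build an instruction-tuning dataset from WebInstructSub on the Hugging Face Hub.
dataset = instruct_dataset(
    tokenizer=tokenizer,
    source="TIGER-Lab/WebInstructSub",
    template="torchtune.data.AlpacaInstructTemplate",
    # Map the template's field names to this dataset's column names.
    column_map={
        "instruction": "question",
        "output": "answer",
    },
    max_seq_len=3072,
    packed=True,  # pack multiple samples into sequences of up to max_seq_len tokens
    split="train",
)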
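
To illustrate what the `column_map` argument above is for, here is a minimal sketch in plain Python. It is not torchtune's implementation: `ALPACA_NO_INPUT` and `build_example` are hypothetical names, and the prompt text is the commonly used Alpaca no-input prompt, assumed here to approximate what the template produces.

```python
# Hypothetical stand-in for an Alpaca-style template (assumption, not torchtune code).
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_example(row, column_map):
    """Return (prompt, target) for one dataset row.

    column_map translates the template's field names ("instruction",
    "output") into the column names the dataset actually uses.
    """
    instruction_key = column_map.get("instruction", "instruction")
    output_key = column_map.get("output", "output")
    prompt = ALPACA_NO_INPUT.format(instruction=row[instruction_key])
    return prompt, row[output_key]


# A made-up row shaped like the "question"/"answer" columns mapped above.
row = {"question": "What is 2 + 2?", "answer": "4"}
prompt, target = build_example(
    row, {"instruction": "question", "output": "answer"}
)
```

Without the mapping, the lookup would fall back to the keys `instruction` and `output`, which this row does not have, so it would raise a `KeyError`.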