@Elijah-Bodden
Last active October 5, 2024 15:02
How to make an LLM clone of yourself

Wanna create and play with an AI clone of yourself or someone else (my lawyer says please don't) [1], like this one? You're in luck, because it's super easy!

Step one: get you some datas

This step really varies depending on your data sources, but the end goal is to turn some of real-you's conversations (from your platforms of choice) into a ShareGPT-format dataset with you as the "gpt". Here's what your file should end up looking like (JSON Lines: one JSON object per line, each holding one conversation):

{"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello"}]} 
{"conversations": [{"from": "human", "value": "What's up "}, {"from": "gpt", "value": "not much, you?"}, {"from": "human", "value": "Just thinking, what if you're a robot and I don't realize it?"}, {"from": "gpt", "value": "hahaha don't be crazy"}]}
...
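
Before going further, it can help to sanity-check a file against the shape shown above. The following is a minimal sketch, not part of the original workflow: `check_line` and `check_file` are hypothetical helper names, and the only rules it enforces are the ones visible in the example (a `conversations` list of `from`/`value` turns, with `human` or `gpt` as the sender, opening with a `human` message).

```python
import json

# The two sender labels used in the example above.
ALLOWED_ROLES = ("human", "gpt")


def check_line(line):
    """Return a list of problems found in one line of the dataset file."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]

    turns = record.get("conversations") if isinstance(record, dict) else None
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]

    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i} is not an object")
            continue
        if turn.get("from") not in ALLOWED_ROLES:
            problems.append(f"turn {i} has an unexpected 'from' value")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i} has a non-string 'value'")

    # Each line should open with the other person's message.
    if isinstance(turns[0], dict) and turns[0].get("from") != "human":
        problems.append("first message is not from 'human'")
    return problems


def check_file(path):
    """Print every problem found in the file, prefixed by its line number."""
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            for problem in check_line(line):
                print(f"line {number}: {problem}")
```

Calling `check_file("dataset.jsonl")` (the filename is just a placeholder) prints nothing when every line passes.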

NOTE: Make sure every line starts with a message from the other person ("human"). You'll probably want to split long (more than a few messages) conversations into several lines. I've tried a few different things, but 20-ish messages (10-ish them-you pairs) per line seems to work well.
If one person sometimes sends multiple messages in a row, I recommend joining them with newlines and putting them in the dataset as a single message.
Finally, upload the file to a new Hugging Face dataset.
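
The rules in the note can be sketched roughly as follows. This is an illustration under assumptions, not the script used for the original clone: the input is assumed to be a chronological list of `(sender, text)` pairs, all function names are hypothetical, and the choice to drop a trailing unanswered message is mine. The upload helper uses `HfApi.create_repo` and `HfApi.upload_file` from the third-party `huggingface_hub` package; the in-repo filename is simply taken from the local file.

```python
import json
import os


def merge_runs(turns):
    """Merge consecutive turns from the same role, joined by newlines."""
    merged = []
    for role, text in turns:
        if merged and merged[-1][0] == role:
            merged[-1] = (role, merged[-1][1] + "\n" + text)
        else:
            merged.append((role, text))
    return merged


def to_sharegpt_lines(messages, my_name, max_messages=20):
    """Turn (sender, text) pairs into ShareGPT JSON lines.

    Everyone except `my_name` is treated as "human". Each line opens with a
    "human" turn and is closed on a "gpt" turn once it reaches `max_messages`.
    """
    roles = [("gpt" if sender == my_name else "human", text)
             for sender, text in messages]
    chunks, current = [], []
    for role, text in merge_runs(roles):
        if not current and role != "human":
            continue  # a line must start with the other person
        current.append({"from": role, "value": text})
        if len(current) >= max_messages and role == "gpt":
            chunks.append(current)
            current = []
    # Assumption: a trailing message with no reply is not useful, so drop it.
    if current and current[-1]["from"] == "human":
        current.pop()
    if current:
        chunks.append(current)
    return [json.dumps({"conversations": chunk}, ensure_ascii=False)
            for chunk in chunks]


def write_dataset(lines, path):
    """Write one conversation per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


def upload_dataset(path, repo_id):
    """Upload the file to a Hugging Face dataset repo (needs a logged-in token)."""
    from huggingface_hub import HfApi  # third-party: pip install huggingface_hub

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(path_or_fileobj=path,
                    path_in_repo=os.path.basename(path),
                    repo_id=repo_id,
                    repo_type="dataset")
```

Closing a line only on a "gpt" turn means the next line automatically starts with the other person, which is why a line can run one message past `max_messages` when that value is odd.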

Discord

I used only Discord messages for my clone. Here's what I did to get my dataset:

  1. Used Discord Chat Exporter to export the conversations I wanted
  2. Put the resulting file names into filenames in this script, set myname to my Discord username as it appears in the exported chats, and ran it. Boom, that easy. Now upload the result to Hugging Face.
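
If you want to see roughly what reading an export involves, here is a minimal sketch. It is not the script linked above. It assumes the chats were exported in Discord Chat Exporter's JSON format and that each export has a top-level `messages` list, in chronological order, whose entries carry `content` and `author.name`; check your own export if the fields differ. The function names are hypothetical.

```python
import json


def messages_from_export(export):
    """Pull (author name, text) pairs out of one parsed JSON export.

    Assumes the export dict has a "messages" list whose items contain
    "content" and "author" -> "name". Empty messages (for example
    attachment-only ones) are skipped.
    """
    pairs = []
    for message in export["messages"]:
        text = (message.get("content") or "").strip()
        if text:
            pairs.append((message["author"]["name"], text))
    return pairs


def load_discord_export(path):
    """Read one exported chat file and return its (author name, text) pairs."""
    with open(path, encoding="utf-8") as f:
        return messages_from_export(json.load(f))
```

The author names returned here are the ones to compare against your own username when deciding which messages become "gpt" turns.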

Step two: train

This is the easy part! Make a copy of this colab notebook and edit the config. I think the options are self-explanatory. It uses Llama 3 8B for the MODEL, but you should be able to use any other. I'd be curious to see how other base models do.
Then run all the cells, and your model should show up on Hugging Face in TRAINED_REPO.

Step three: profit

This is the easy-est part. Just duplicate this space, swap out model_id for your new model's repo, and it should deploy.
Have fun with your new mini-you!

Footnotes

  1. Seriously don't be that guy.