
@izzymiller
Created April 12, 2023 15:47
sess_dict = sessionized.to_dict('records')
items = []
counter = 0
for row in sess_dict:
    context = []
    cstring = ''
    # Collect up to 10 preceding messages from the same chat session.
    for i in range(10, 0, -1):
        if counter - i < 0:
            continue  # skip negative indices, which would wrap to the end of the list
        try:
            if sess_dict[counter - i]['chat_session_id'] == row['chat_session_id']:
                msg = f"{sess_dict[counter - i]['sender']}: {sess_dict[counter - i]['text']}"
                if len(context) > 0:
                    cstring += '\n'
                context.append(msg)
                cstring += msg
        except KeyError:
            # my redacted data doesn't work here
            print('too little data =(')
    # Fall back to the 5 most recent messages if the session context is thin.
    if len(context) < 2:
        for i in range(5, 0, -1):
            if counter - i < 0:
                continue
            msg = f"{sess_dict[counter - i]['sender']}: {sess_dict[counter - i]['text']}"
            context.append(msg)
            cstring += '\n'
            cstring += msg
    items.append(cstring)
    counter += 1
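To make the snippet above concrete, here is a minimal, hypothetical stand-in for its input: `sessionized.to_dict('records')` is assumed to yield a list of dicts with `chat_session_id`, `sender`, and `text` keys. The sample messages are invented, and this condensed version shows only the same-session context gathering, not the short-context fallback:

```python
# Hypothetical stand-in for sessionized.to_dict('records'): a list of
# message dicts keyed the way the snippet above indexes into them.
sess_dict = [
    {'chat_session_id': 1, 'sender': 'Izzy', 'text': 'hey'},
    {'chat_session_id': 1, 'sender': 'Ben', 'text': 'yo'},
    {'chat_session_id': 1, 'sender': 'Izzy', 'text': 'wanna play?'},
    {'chat_session_id': 2, 'sender': 'Ben', 'text': 'new topic'},
]

items = []
for counter, row in enumerate(sess_dict):
    # Gather up to 10 preceding messages from the same session as
    # "sender: text" lines, joined with newlines into one context string.
    lines = [
        f"{sess_dict[counter - i]['sender']}: {sess_dict[counter - i]['text']}"
        for i in range(10, 0, -1)
        if counter - i >= 0
        and sess_dict[counter - i]['chat_session_id'] == row['chat_session_id']
    ]
    items.append('\n'.join(lines))

print(repr(items[2]))  # 'Izzy: hey\nBen: yo'
```

Note that the first message of each session gets an empty context string, since it has no same-session predecessors.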
@sensori commented Apr 17, 2023

Hi, awesome post, definitely gonna try it haha. Question: I didn't quite understand the promptgen part in your blog post. "The last step is to turn these rows into actual string representations of each conversation, and package them up with a “prompt” that I could use to tune the model."

Does this mean that you grouped messages into conversations (with that 4 hour threshold), and then manually created the "instruction" in this data structure for each conversation?

{
  "instruction": "...",
  "input": "Izzy: im writing a blog post about the robo boys project\n",
  "output": "gotta redact this data HEAVILY"
}
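For what it's worth, one plausible reading of that packaging step is a single fixed instruction reused for every example, with the preceding context as `input` and the target reply as `output`. This is a guess at the shape, not the author's actual code: the instruction wording and the `make_record` helper are both made up here.

```python
import json

# A guess at the packaging step: one fixed instruction reused for every
# example, the preceding context as "input", the target reply as "output".
# INSTRUCTION's wording and make_record are hypothetical.
INSTRUCTION = "Respond as a member of this group chat."

def make_record(context_lines, reply):
    # Join the context messages with newlines; the trailing "\n" matches
    # the example record quoted above.
    return {
        "instruction": INSTRUCTION,
        "input": "\n".join(context_lines) + "\n",
        "output": reply,
    }

record = make_record(
    ["Izzy: im writing a blog post about the robo boys project"],
    "gotta redact this data HEAVILY",
)
print(json.dumps(record, indent=2))
```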

@stephendaimler commented

@izzymiller +1 would love some more detail on how you generated the prompts. Thanks for the great post!

@sensori commented Apr 23, 2023

I tried to walk through the code again and I think I figured it out... looks like it breaks things up by conversation, and each prompt is per user, with the other users' messages as the 'context'.

I'm trying to use a WhatsApp chat export, which is just a text file, so I'm trying to figure out how to get it into the same data structure output while skipping all that SQL stuff.
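In case it helps, a rough sketch of parsing one common WhatsApp export layout into that same list-of-dicts shape, plus the 4-hour gap rule from the post. The regex, date format, and function names are all assumptions; WhatsApp's export format varies by locale and platform, so adjust them to match your file.

```python
import re
from datetime import datetime

# Rough parser for one common WhatsApp export layout,
# "DD/MM/YYYY, HH:MM - Sender: text". The regex and date format are
# assumptions; exports vary by locale and platform.
LINE_RE = re.compile(r'^(\d{1,2}/\d{1,2}/\d{4}, \d{1,2}:\d{2}) - ([^:]+): (.*)$')

def parse_whatsapp(lines):
    messages = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts, sender, text = m.groups()
            messages.append({
                'timestamp': datetime.strptime(ts, '%d/%m/%Y, %H:%M'),
                'sender': sender,
                'text': text,
            })
        elif messages:
            # No timestamp prefix: continuation of a multi-line message.
            messages[-1]['text'] += '\n' + line
    return messages

def sessionize(messages, gap_hours=4):
    # Start a new chat_session_id whenever the gap since the previous
    # message exceeds gap_hours (the 4-hour rule from the post).
    sid = 0
    for prev, cur in zip([None] + messages, messages):
        if prev and (cur['timestamp'] - prev['timestamp']).total_seconds() > gap_hours * 3600:
            sid += 1
        cur['chat_session_id'] = sid
    return messages
```

From there the records should slot into the context-building loop in the gist above.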

@igorcantele commented

@sensori I'm trying to do the same, would you mind discussing it together?

@sensori commented Jun 19, 2025

> @sensori I'm trying to do the same, would you mind discussing it together?

hi @igorcantele sorry I never got a notification for your reply, did you ever continue to work on this? I'm revisiting it to refresh my dev skills

@izzymiller (Author) commented

yo! sorry i hadn't seen this. This tutorial / blog from a few years ago is sooo out of date in the fast moving world of AI.

The message sessionization logic should still be sound, but there's WAY better ways to train LLMs these days. I'd recommend checking out unsloth's sample notebooks to anyone reading this https://github.com/unslothai/notebooks/?tab=readme-ov-file

@sensori commented Jun 19, 2025

> yo! sorry i hadn't seen this. This tutorial / blog from a few years ago is sooo out of date in the fast moving world of AI.
>
> The message sessionization logic should still be sound, but there's WAY better ways to train LLMs these days. I'd recommend checking out unsloth's sample notebooks to anyone reading this https://github.com/unslothai/notebooks/?tab=readme-ov-file

@izzymiller yooo thanks for the reply. Thanks for the tip on the training, I'm still using a WhatsApp chat so going to modify this one https://github.com/LatentMindAI/perzonalized-ai-chatbot for a group chat using some inspo from your stuff. If you don't mind I've a few questions:

  1. Looks like Llama 3.2 is still a good option as the model?
  2. I looked into Modal and it seriously still seems like magic. Would you still recommend hosting the trained model on it?
  3. I'm not too familiar with web apps...would you be able to quickly mention (even off the top of your head), how your web app frontend is actually calling the model that's running with Modal? I couldn't figure out where it's happening
  4. Back to our original question about how to set up the processed chat data to simulate multiple users: how does this work exactly? Is the idea that after parsing into 'conversations' (split at 4 hours since the last message), the first message of the convo goes into the 'input' field, everything else in the session goes into the 'output' field, and you use the same prompt for all of them? And is each 4-hour convo session a separate JSON file?

Sorry that's actually a lot of questions lol, if you don't have too much time I'd prefer your help with #4 the most :D
