```python
sess_dict = sessionized.to_dict('records')
items = []
counter = 0
for row in sess_dict:
    context = []
    cstring = ''
    # look back at up to 10 preceding messages from the same chat session
    for i in range(10, 0, -1):
        try:
            # guard against negative indices, which would wrap to the end of the list
            if counter - i >= 0 and sess_dict[counter - i]['chat_session_id'] == row['chat_session_id']:
                msg = f"{sess_dict[counter - i]['sender']}: {sess_dict[counter - i]['text']}"
                if len(context) > 0:
                    cstring += '\n'
                context.append(msg)
                cstring += msg
        except KeyError:
            # my redacted data doesn't work here
            print('too little data =(')
    # fall back to the last 5 messages if the session gave too little context
    if len(context) < 2:
        for i in range(5, 0, -1):
            msg = f"{sess_dict[counter - i]['sender']}: {sess_dict[counter - i]['text']}"
            context.append(msg)
            cstring += '\n'
            cstring += msg
    items.append(cstring)
    counter += 1
```
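To make the windowing above concrete, here is a condensed, self-contained version of the same idea run on made-up rows (toy data, not the author's): each entry in `items` is the newline-joined string of prior messages from the same session.

```python
# Toy illustration of the context-window logic above, using made-up rows.
sess_dict = [
    {'chat_session_id': 1, 'sender': 'Izzy', 'text': 'hey'},
    {'chat_session_id': 1, 'sender': 'Ben', 'text': 'yo'},
    {'chat_session_id': 1, 'sender': 'Izzy', 'text': 'whats up'},
]
items = []
for counter, row in enumerate(sess_dict):
    # gather up to 10 prior messages from the same session
    lines = [
        f"{sess_dict[j]['sender']}: {sess_dict[j]['text']}"
        for j in range(max(0, counter - 10), counter)
        if sess_dict[j]['chat_session_id'] == row['chat_session_id']
    ]
    items.append('\n'.join(lines))
# items[2] == 'Izzy: hey\nBen: yo'
```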
@izzymiller +1 would love some more detail on how you generated the prompts. Thanks for the great post!
i tried to walk through the code again and i think i figured it out..... looks like it breaks the data up by conversation, and each prompt is per user, with the other users' messages as the 'context'.
i'm trying to use whatsapp chat export which is just a text file, so trying to figure out how to get it into the same data structure output while skipping all that SQL stuff
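In case it helps anyone else doing the WhatsApp route: here's a rough sketch of parsing the export text file into the same row shape (`chat_session_id`, `sender`, `text`), sessionizing with the 4-hour gap in Python instead of SQL. The line format and date pattern here are assumptions for one common export locale, so adjust the regex and `strptime` format to match your own file.

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern for one common WhatsApp export line:
#   "12/31/23, 9:15 - Izzy: hey"
# The date/time format varies by phone locale; adjust the regex and strptime format.
LINE_RE = re.compile(r'^(\d{1,2}/\d{1,2}/\d{2}), (\d{1,2}:\d{2}) - ([^:]+): (.*)$')

def parse_whatsapp(lines, session_gap_hours=4):
    rows, session_id, last_ts = [], 0, None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            # continuation of a multi-line message
            if rows:
                rows[-1]['text'] += '\n' + line
            continue
        date, time, sender, text = m.groups()
        ts = datetime.strptime(f'{date} {time}', '%m/%d/%y %H:%M')
        # start a new conversation session after a 4-hour silence
        if last_ts is None or ts - last_ts > timedelta(hours=session_gap_hours):
            session_id += 1
        last_ts = ts
        rows.append({'chat_session_id': session_id, 'sender': sender,
                     'text': text, 'timestamp': ts})
    return rows

sample = [
    '12/31/23, 9:15 - Izzy: hey',
    '12/31/23, 9:16 - Ben: yo',
    '12/31/23, 18:00 - Izzy: new convo',
]
rows = parse_whatsapp(sample)
# rows[0] and rows[1] share session 1; the 18:00 message starts session 2
```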
@sensori I'm trying to do the same, would you mind discussing it together?
hi @igorcantele sorry I never got a notification for your reply, did you ever continue to work on this? I'm revisiting it to refresh my dev skills
yo! sorry i hadn't seen this. This tutorial / blog from a few years ago is sooo out of date in the fast moving world of AI.
The message sessionization logic should still be sound, but there's WAY better ways to train LLMs these days. I'd recommend checking out unsloth's sample notebooks to anyone reading this https://github.com/unslothai/notebooks/?tab=readme-ov-file
@izzymiller yooo thanks for the reply. Thanks for the tip on the training, I'm still using a WhatsApp chat so going to modify this one https://github.com/LatentMindAI/perzonalized-ai-chatbot for a group chat using some inspo from your stuff. If you don't mind I've a few questions:
- Looks like Llama 3.2 is still a good option as the model?
- I looked into Modal and it seriously still seems like magic. Would you still recommend hosting the trained model on it?
- I'm not too familiar with web apps...would you be able to quickly mention (even off the top of your head), how your web app frontend is actually calling the model that's running with Modal? I couldn't figure out where it's happening
- Back to our original question about how to set up the processed chat data to simulate multiple users - how does this work exactly? Is the idea that after parsing into 'conversations' (by 4 hours since the last message), the first message of that convo goes into the 'input' field and everything else in the convo session goes into the 'output' field? And you use the same prompt for all of them? And then each 4-hour convo session is a separate JSON file?
Sorry that's actually a lot of questions lol, if you don't have too much time I'd prefer your help with #4 the most :D
Hi, awesome post definitely gonna try haha. Question - I didn't quite understand the promptgen part in your blog post. "The last step is to turn these rows into actual string representations of each conversation, and package them up with a “prompt” that I could use to tune the model."
Does this mean that you grouped messages into conversations (with that 4 hour threshold), and then manually created the "instruction" in this data structure for each conversation?
```json
{
  "instruction": "...",
  "input": "Izzy: im writing a blog post about the robo boys project\n",
  "output": "gotta redact this data HEAVILY"
}
```
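For what it's worth, one way to produce records in that shape from sessionized rows - a sketch based on my reading of the post, not the author's actual code. The instruction text here is a placeholder, and the "first message → input, rest → output" split is the interpretation discussed above:

```python
import json

INSTRUCTION = "This is a group chat between friends."  # hypothetical prompt text

def sessions_to_records(rows):
    # group messages by conversation session
    sessions = {}
    for r in rows:
        sessions.setdefault(r['chat_session_id'], []).append(r)
    records = []
    for msgs in sessions.values():
        lines = [f"{m['sender']}: {m['text']}" for m in msgs]
        records.append({
            'instruction': INSTRUCTION,       # same prompt for every record
            'input': lines[0] + '\n',         # opening message of the session
            'output': '\n'.join(lines[1:]),   # the rest of the conversation
        })
    return records

rows = [
    {'chat_session_id': 1, 'sender': 'Izzy', 'text': 'im writing a blog post'},
    {'chat_session_id': 1, 'sender': 'Ben', 'text': 'nice'},
]
records = sessions_to_records(rows)
print(json.dumps(records[0], indent=2))
```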