@miagkyi
Created October 17, 2023 17:34

Fine-tuning llama 2 7B to analyze financial reports and write “funny” tweets

Sharing some insights from a recent weekend fun project where I tried to analyze and summarize financial reports using a fine-tuned LLM.

My initial goal was to train a model to summarize the annual/quarterly financial reports of public companies (aka 10-K / 10-Q). But, realizing that straightforward financial summaries are boring, I thought of tuning an LLM to generate sarcastic summaries of these reports instead. Something short I could post on Twitter.

Data exploration and dataset prep

Working with financial reports ain’t easy. You download them in HTML format, and they’re pretty dense: ~100 pages filled with tables that can be tough to parse, lots of legal disclaimers, and various useless info. I knew I wanted to get 3-5 funny tweets as output from a report, but I spent quite some time figuring out what data to actually feed in to get that result - a page, a section, a table?


To find the right format, I started experimenting with ChatGPT, manually feeding it different data as input and asking it to produce funny tweets. I tried:

  • Single report pages - weren’t useful as they didn’t contain much useful info.
  • Entire report sections - hit the token limit or confused the model.
  • Financial tables - on their own they were missing too much context.

After experimenting, I realized I needed to preprocess those reports first. So I decided (a) to focus on a single section of the report, “Management discussion & analysis”, which contained most of the important info, and ignore the rest; (b) to summarize this section and feed the summaries to the model instead of raw text/tables. I did that and ended up with 3-5 summaries per report. After feeding those to the model, I was satisfied with the result: gpt-4 was spitting out some decent tweets that I was comfortable using in my training dataset.


More dataset prep

Now I had to sit down and actually build a pipeline to prepare the dataset for fine-tuning. Here is what it looked like (a rough code sketch follows the list):

  1. Extract and parse the “Management Discussion and Analysis” section from the reports: a) automatically download the financial reports from SEC EDGAR - I used a cool API called sec-api; b) split the section into ~6k-token chunks so ChatGPT can summarize it (stole some recipes from openai-cookbook).
  2. Summarize each chunk using the gpt-3.5-turbo-16k API.
  3. Generate 5 funny tweets based on each summary with gpt-4 API.
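
The actual code is linked at the end of this post; just for orientation, a minimal version of this pipeline could look like the sketch below. It assumes the sec-api Python package for section extraction and the pre-1.0 openai client that was current at the time; the prompts, the section id, and the environment variable names are placeholders of mine, not the exact setup used here.

```python
# Rough sketch of the dataset pipeline (illustrative, not the exact code used in this project).
# Assumes: `pip install sec-api openai tiktoken` and the pre-1.0 openai client.
import os
import openai
import tiktoken
from sec_api import ExtractorApi  # extracts individual 10-K/10-Q sections

openai.api_key = os.environ["OPENAI_API_KEY"]
extractor = ExtractorApi(os.environ["SEC_API_KEY"])
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def get_mda(filing_url: str) -> str:
    # Item 7 of a 10-K is "Management's Discussion and Analysis"
    return extractor.get_section(filing_url, "7", "text")

def chunk(text: str, max_tokens: int = 6000) -> list[str]:
    # Split the section into ~6k-token pieces so each fits in one summarization call
    toks = enc.encode(text)
    return [enc.decode(toks[i:i + max_tokens]) for i in range(0, len(toks), max_tokens)]

def summarize(chunk_text: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=[
            {"role": "system", "content": "Summarize the key financial facts in a few bullet points."},
            {"role": "user", "content": chunk_text},
        ],
    )
    return resp.choices[0].message.content

def funny_tweets(summary: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Write 5 short, sarcastic tweets about this financial summary."},
            {"role": "user", "content": summary},
        ],
    )
    return resp.choices[0].message.content

# For each filing: extract MD&A -> chunk -> summarize -> generate tweets,
# then store (summary, tweets) pairs as rows of the fine-tuning dataset.
```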

I ran this pipeline to pull the financial reports of ~50 public companies over ~3 years and ended up with a dataset of ~1k rows. Doing this with OpenAI’s APIs was not cheap, but it was the fastest way, and tbh I liked gpt-4’s humor.


Model training and even more dataset prep

Once the dataset was ready, I jumped into training right away. I opted for the llama2 7b model with LoRA for simplicity and speed. To fine-tune, I used llmhome.io, which my folks created, so I wouldn’t have to ask them stupid questions about which version of torch to install on the GPU machine to make everything work.
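
Since llmhome.io handled the actual training, there is no training code in this post; for orientation, a roughly equivalent open-source setup with Hugging Face transformers + peft is sketched below. The base checkpoint name, LoRA rank/alpha, and target modules are assumptions on my part, not what llmhome.io uses under the hood.

```python
# Sketch of an equivalent LoRA setup with transformers + peft (not the llmhome.io internals).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora = LoraConfig(
    r=8,                                  # rank: assumed default, not stated in the post
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common choice for llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # LoRA trains only a small fraction of the weights
```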

Initial results were not great. I tried different hyperparameters but still couldn’t get good enough output. In situations like this, I tend to blame the model, but deep inside I know the truth — my dataset is crap 💩

Once I looked deeper at the dataset, I had to admit that was true. It was pretty weak — it had a bunch of irrelevant summaries, some weird tweets created by ChatGPT, etc. I tried to write tools to clean it up automatically, but they didn’t work well, so I had to sit down and clean it up manually. Tbh, labelling and cleaning data is some sort of full-distance Ironman type of challenge.

I was able to cut the dataset from ~1,000 down to ~400 rows. It still felt a bit raw, but it was much better than the initial one. Two other tweaks I made: (a) wrapping my summaries into a prompt, (b) generating 3 instead of 5 tweets from a single summary.
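
The exact prompt template isn’t shown in this post, so here is a purely hypothetical example of what one training row (a summary wrapped in a prompt, plus the target tweets) might look like after those tweaks:

```python
# Hypothetical shape of one training row after the tweaks (prompt wrapping, 3 tweets per summary).
example = {
    "prompt": (
        "Below is a summary of a company's quarterly report. "
        "Write 3 short, sarcastic tweets about it.\n\n"
        "Summary:\n<one of the 3-5 MD&A summaries goes here>\n\nTweets:"
    ),
    "completion": "1. <tweet one>\n2. <tweet two>\n3. <tweet three>",
}
```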

I did a couple more fine-tuning runs. The results were better — now my new model was actually cracking some awkward jokes! 🥸 It was also doing pretty well at interpreting the data from the summaries, which I found really cool for a 7b LoRA trained on 400 datapoints. It was still slightly worse than gpt-3.5, though, and pretty far from the quality of gpt-4.

[Screenshot: an LLM joke.]

Takeaways

Will try this dataset with llama2-70b and mistral-7b and will report back. But in the meantime:

  • Bad news: data prep is everything; it’s where I spent 95% of my time.
  • Good news: even a small, 400-row but (relatively) high-quality dataset can yield satisfactory results.
  • Something I didn’t realize is that a 7b model can be fine-tuned on pretty complex, long inputs of 2-3k+ characters; it can also learn to interpret this data and output something as non-trivial as a joke.
  • gpt-4 can create really helpful datasets, but it’s expensive, and you have to clean them up manually, as I haven’t found any good tools for that.
  • llama2 7b fine-tuned with LoRA is not great but sometimes does okay.
  • Hyperparameters that worked best: batch size 16, 8 epochs, lr 1e-4; I haven’t tried playing with LoRA’s own parameters (a rough config sketch is below).
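
For reference, those hyperparameters expressed as a Hugging Face TrainingArguments config would look roughly like this; llmhome.io’s own config format may differ, and the output directory name is made up.

```python
# The reported hyperparameters mapped onto Hugging Face TrainingArguments (illustrative only).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama2-7b-finance-tweets-lora",  # hypothetical name
    per_device_train_batch_size=16,  # "batch size 16"
    num_train_epochs=8,              # "8 epochs"
    learning_rate=1e-4,              # "lr 1e-4"
    logging_steps=10,
)
```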

Links to dataset and code for this project.
