@glebmikha
glebmikha / eda_prompt.md
Last active March 20, 2024 14:41
EDA Prompt for ChatGPT (and Humans)
  1. Calculate the percentage of missing values in each column and sort them in descending order.
    1. Missing values and outliers are not problems to be fixed! They are facts.
    2. During EDA you must not “fix” them because you have to deal with your data and problem as it is.
    3. If you see missing values, just report them.
  2. Identify and understand your target variable.
    1. Understand the type of the target variable: binary, categorical, or numeric.
    2. Examine the distribution of the target variable.
      1. For a binary variable (convert it to 0s and 1s first if it is stored as strings), the mean, which is the proportion of 1s, is used.
      2. For a categorical variable, value counts are used.
      3. For a numeric variable, a histogram or pandas' `describe` table is used.
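
The checklist above can be sketched in pandas roughly as follows. This is a minimal illustration, not the author's own code: the helper names `missing_percentages` and `summarize_target`, the `kind` and `positive_label` parameters, and the toy DataFrame are all assumptions made for the example. Missing values are only reported, never filled or dropped from the data.

```python
import pandas as pd


def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, sorted in descending order."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


def summarize_target(target: pd.Series, kind: str, positive_label=None):
    """Summarize the target by its type (hypothetical helper).

    kind is one of "binary", "categorical", "numeric".
    """
    if kind == "binary":
        if positive_label is not None:
            # Convert string labels to 0/1; missing targets are left out
            # of the proportion rather than counted as 0.
            target = (target.dropna() == positive_label).astype(int)
        return target.mean()  # proportion of 1s
    if kind == "categorical":
        return target.value_counts()
    if kind == "numeric":
        return target.describe()
    raise ValueError(f"unknown target kind: {kind!r}")


# Toy data, invented for illustration only.
df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "city": ["Oslo", None, None, "Rome"],
    "churned": ["yes", "no", "no", "yes"],
})

missing = missing_percentages(df)
churn_rate = summarize_target(df["churned"], kind="binary", positive_label="yes")
print(missing)
print(churn_rate)
```

For a numeric target, `summarize_target(df["age"], kind="numeric")` returns the `describe` table; a histogram would need a plotting library and is left out of this sketch.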
@abhishekkrthakur
abhishekkrthakur / llm_training_sft.py
Created July 16, 2023 09:09
Train LLMs in 50 lines of code. This is reference code for the YouTube tutorial: https://www.youtube.com/watch?v=JNMVulH7fCo&ab_channel=AbhishekThakur
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
def train():
    train_dataset = load_dataset("tatsu-lab/alpaca", split="train")
    tokenizer = AutoTokenizer.from_pretrained("Salesforce/xgen-7b-8k-base", trust_remote_code=True)