ML classifier sanity checklist

SAVE YOUR SANITY

Things I know but need a checklist to ensure I address them systematically.

  1. Are the datasets balanced? Do dev, train, and test each have approximately balanced classes? (See the balance check after this list.)
  2. Is each dataset distinct? Have you checked for duplicates within and across datasets? (See the duplicate check after this list.)
  3. Is the data shuffled? (A shuffle one-liner follows the list.)
  4. Spot-check the data. Are the labels consistent and in line with the objective?
  5. Score the model before training. Is its accuracy close to random? (See the chance-baseline sketch after this list.)
  6. Train the simplest available model (no pretrained vectors) on a small subset of the data and try to overfit it. Does its loss improve? Does its accuracy climb to something better than random? (See the overfitting sketch after this list.)
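
For item 1, a minimal sketch of the balance check, assuming each split is a pandas DataFrame with a `label` column (the column name and the toy splits here are hypothetical; substitute your own):

```python
import pandas as pd

# Hypothetical toy splits; replace with your real train/dev/test frames.
train = pd.DataFrame({"label": ["pos", "neg", "pos", "neg"]})
dev = pd.DataFrame({"label": ["pos", "neg"]})
test = pd.DataFrame({"label": ["pos", "pos", "neg"]})

for name, split in [("train", train), ("dev", dev), ("test", test)]:
    # value_counts(normalize=True) gives per-class proportions.
    proportions = split["label"].value_counts(normalize=True)
    print(f"{name}: {proportions.to_dict()}")
```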
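For item 2, a sketch of the duplicate check within and across splits, assuming each example is identified by a `text` column (again a hypothetical name):

```python
import pandas as pd

train = pd.DataFrame({"text": ["a", "b", "b"], "label": [0, 1, 1]})
test = pd.DataFrame({"text": ["a", "c"], "label": [0, 1]})

# Duplicates within a single split.
print("duplicates within train:", train["text"].duplicated().sum())

# Leakage across splits: examples that appear in both train and test.
overlap = set(train["text"]) & set(test["text"])
print("train/test overlap:", overlap)
```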
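For item 3, shuffling with pandas is a one-liner; a fixed seed keeps it reproducible:

```python
import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})
# Shuffle all rows, then reset the index so row order carries no signal.
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(df)
```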
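For item 5, scoring before training amounts to comparing against a chance baseline. One way to make that comparison explicit is scikit-learn's DummyClassifier (a sketch on synthetic data; your real features and labels go in its place):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Always predicts the most frequent class. Your untrained model's
# accuracy should sit in this neighbourhood, not well above it.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("chance baseline:", baseline.score(X_test, y_test))
```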
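For item 6, a sketch of the overfitting test using a bag-of-words pipeline; CountVectorizer and LogisticRegression are stand-ins for whatever your simplest model is, and the tiny dataset is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny hypothetical subset of the training data.
texts = ["great film", "terrible film", "loved it", "hated it"]
labels = [1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# A simple model should drive training accuracy close to 1.0 on a
# tiny subset; if it can't even overfit, something upstream is broken.
print("training accuracy:", model.score(texts, labels))
```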