ML classifier sanity checklist

SAVE YOUR SANITY

Things I know but need a checklist to ensure I address them systematically.

  1. Are the datasets balanced? Do dev, train, and test each have approximately balanced classes? (See the balance check after this list.)
  2. Is each dataset distinct? Have you checked for duplicates within and across datasets? (See the duplicate check after this list.)
  3. Is the data shuffled? (A shuffle one-liner follows the list.)
  4. Spot-check the data. Are the labels consistent and in line with the objective?
  5. Score the model before training. Is its accuracy close to random? (See the chance-baseline sketch after this list.)
  6. Train the simplest available model (no pretrained vectors) on a small subset of the data and try to overfit it. Does its loss improve? Does its accuracy climb to something better than random? (See the overfitting sketch after this list.)
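
For item 1, a minimal sketch of the balance check, assuming each split is a pandas DataFrame with a `label` column (the column name and the toy splits here are hypothetical; substitute your own):

```python
import pandas as pd

# Hypothetical toy splits; replace with your real train/dev/test frames.
train = pd.DataFrame({"label": ["pos", "neg", "pos", "neg"]})
dev = pd.DataFrame({"label": ["pos", "neg"]})
test = pd.DataFrame({"label": ["pos", "pos", "neg"]})

for name, split in [("train", train), ("dev", dev), ("test", test)]:
    # value_counts(normalize=True) gives per-class proportions.
    proportions = split["label"].value_counts(normalize=True)
    print(f"{name}: {proportions.to_dict()}")
```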
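For item 2, a sketch of the duplicate check within and across splits, assuming each example is identified by a `text` column (again a hypothetical name):

```python
import pandas as pd

train = pd.DataFrame({"text": ["a", "b", "b"], "label": [0, 1, 1]})
test = pd.DataFrame({"text": ["a", "c"], "label": [0, 1]})

# Duplicates within a single split.
print("duplicates within train:", train["text"].duplicated().sum())

# Leakage across splits: examples that appear in both train and test.
overlap = set(train["text"]) & set(test["text"])
print("train/test overlap:", overlap)
```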
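For item 3, shuffling with pandas is a one-liner; a fixed seed keeps it reproducible:

```python
import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})
# Shuffle all rows, then reset the index so row order carries no signal.
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(df)
```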
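For item 5, scoring before training amounts to comparing against a chance baseline. One way to make that comparison explicit is scikit-learn's DummyClassifier (a sketch on synthetic data; your real features and labels go in its place):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Always predicts the most frequent class. Your untrained model's
# accuracy should sit in this neighbourhood, not well above it.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("chance baseline:", baseline.score(X_test, y_test))
```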
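For item 6, a sketch of the overfitting test using a bag-of-words pipeline; CountVectorizer and LogisticRegression are stand-ins for whatever your simplest model is, and the tiny dataset is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny hypothetical subset of the training data.
texts = ["great film", "terrible film", "loved it", "hated it"]
labels = [1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# A simple model should drive training accuracy close to 1.0 on a
# tiny subset; if it can't even overfit, something upstream is broken.
print("training accuracy:", model.score(texts, labels))
```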