@Dref360
Created November 21, 2021 17:25
Code for the blog post "Improving trust in text classification using HF and BaaL"

The code should run as is with the following dependencies:

pip install transformers datasets baal matplotlib tqdm

{
"cells": [
{
"cell_type": "markdown",
"id": "fa54f9f8",
"metadata": {},
"source": [
"# Improve trust in text classification models using HuggingFace and BaaL\n",
"\n",
"As we introduce more deep learning models in production, it is essential that users trust decisions' made by our models. \n",
"\n",
"One of the worst experience for a user is when a model makes a wrong prediction with high confidence, they would stop trusting the model as the confidence does not match the expected performance.\n",
"\n",
"This is what we call **calibration**, we can compute how well a model is calibrated using the **Expected calibration error** or **ECE**. This metric can be summarized as the weighted average of the difference between a model's confidence and its accuracy at multiple bins of confidence. Below, we have a visual explanation coming from the excellent paper of Guo et al. 2017.\n",
"\n",
"![](https://i.imgur.com/WZCdroM.png)\n",
"\n",
"We want to minimize the gaps in this diagram. In this post, we will improve a model's calibration using HuggingFace and BaaL."
]
},
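{
"cell_type": "markdown",
"id": "ece-sketch",
"metadata": {},
"source": [
"To make the ECE definition concrete, here is a minimal pure-Python sketch. It is only an illustration, not BaaL's implementation: the function name `expected_calibration_error`, the equal-width bins, and the handling of bin edges are assumptions made for this example.\n",
"\n",
"```python\n",
"def expected_calibration_error(probs, labels, n_bins=10):\n",
"    # probs: one list of class probabilities per sample; labels: integer class ids.\n",
"    bins = [[] for _ in range(n_bins)]\n",
"    for p, y in zip(probs, labels):\n",
"        confidence = max(p)\n",
"        correct = 1.0 if p.index(confidence) == y else 0.0\n",
"        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes to the last bin.\n",
"        idx = min(int(confidence * n_bins), n_bins - 1)\n",
"        bins[idx].append((confidence, correct))\n",
"\n",
"    ece = 0.0\n",
"    for items in bins:\n",
"        if not items:\n",
"            continue\n",
"        avg_confidence = sum(c for c, _ in items) / len(items)\n",
"        avg_accuracy = sum(a for _, a in items) / len(items)\n",
"        # Weight the gap of each bin by the fraction of samples it holds.\n",
"        ece += len(items) / len(probs) * abs(avg_confidence - avg_accuracy)\n",
"    return ece\n",
"\n",
"# Two overconfident and two underconfident predictions.\n",
"toy_probs = [[0.9, 0.1], [0.9, 0.1], [0.6, 0.4], [0.6, 0.4]]\n",
"toy_labels = [0, 1, 0, 0]\n",
"print(expected_calibration_error(toy_probs, toy_labels))\n",
"```\n",
"\n",
"Each bin contributes the gap between its mean confidence and its accuracy, weighted by the fraction of samples that fall into it."
]
},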
{
"cell_type": "markdown",
"id": "95936ab2",
"metadata": {},
"source": [
"#### Load our HuggingFace Pipeline and Dataset.\n",
"\n",
"The HuggingFace ecosystem is simple to use and in just a few lines of code, we can have a pretrained model and its associated dataset. We will use the well-known SST2 dataset along with a DistilBERT model."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5abd21ca",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Reusing dataset glue (/home/fred/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a089f827c6be4730ae9a4a0121057a38",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from datasets import load_dataset, Dataset\n",
"from transformers import AutoTokenizer, TextClassificationPipeline, AutoModelForSequenceClassification\n",
"import numpy as np\n",
"\n",
"def load_model(checkpoint_path, use_cuda=False):\n",
" model = AutoModelForSequenceClassification.from_pretrained(\n",
" checkpoint_path\n",
" )\n",
" tokenizer = AutoTokenizer.from_pretrained(\n",
" checkpoint_path, use_fast=False\n",
" )\n",
" device = 1 if use_cuda else 0\n",
" # We set return_all_scores=True to get all softmax outputs\n",
" return TextClassificationPipeline(\n",
" model=model, tokenizer=tokenizer, device=device, return_all_scores=True\n",
" )\n",
"\n",
"pipeline = load_model(\"distilbert-base-uncased-finetuned-sst-2-english\", use_cuda=False)\n",
"\n",
"TEXT_COLUMN = \"sentence\"\n",
"LABEL_COLUMN = \"label\"\n",
"dataset = load_dataset(\"glue\", \"sst2\")[\"validation\"]\n",
"\n",
"def output_to_probs(predictions):\n",
" # Get the score directly\n",
" return np.array([[p['score'] for p in pred] for pred in predictions])\n",
" \n",
" "
]
},
{
"cell_type": "markdown",
"id": "4c707e50",
"metadata": {},
"source": [
"#### Use MC-Dropout for better predictions with BaaL.\n",
"\n",
"BaaL is a Bayesian active learning library that will help us improve ECE.\n",
"\n",
"To do so, we will use Bayesian deep learning to gather multiple predictions for the same input. The key idea is that by drawing multiple sets of weights from the posterior distribution, the average prediction will be better than a single. This is not unalike Ensembles, but without retraining, we will call this a Bayesian Ensemble. Generally Ensembles are better, but require more computational power.\n",
"\n",
"While we have ways to separate the model's uncertainty from the data's uncertainty, we will focus on the predictive uncertainty which is ultimately what will affect the calibration of the model.\n",
"\n",
"Next, we will compare the regular model's ECE and its Bayesian alternative. BaaL will help us prepare the model and compute the ECE.\n",
"\n",
"To prepare the model, we simply do:\n",
"\n",
"```python\n",
"from baal.bayesian.dropout import patch_module\n",
"pipeline.model = patch_module(pipeline.model)\n",
"```\n",
"\n",
"This will modify the model of our loaded pipeline to use Dropout at test time."
]
},
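{
"cell_type": "markdown",
"id": "mc-dropout-sketch",
"metadata": {},
"source": [
"To illustrate what keeping Dropout active at test time means, the sketch below applies it to a toy logistic model written in plain Python. Everything in it is hypothetical (the weights, the drop probability, the function names); for the real model, `patch_module` takes care of this step.\n",
"\n",
"```python\n",
"import math\n",
"import random\n",
"\n",
"rng = random.Random(0)\n",
"WEIGHTS = [1.5, -2.0, 0.5, 1.0]  # hypothetical weights of an already trained toy model\n",
"\n",
"def forward_with_dropout(features, drop_prob=0.1):\n",
"    # Dropout stays active at test time: zero each input unit with probability\n",
"    # drop_prob and rescale the surviving units.\n",
"    kept = [f / (1.0 - drop_prob) if rng.random() >= drop_prob else 0.0 for f in features]\n",
"    logit = sum(w * f for w, f in zip(WEIGHTS, kept))\n",
"    p_positive = 1.0 / (1.0 + math.exp(-logit))\n",
"    return [1.0 - p_positive, p_positive]\n",
"\n",
"def mc_dropout_predict(features, iterations=20):\n",
"    # Average the class probabilities over several stochastic forward passes.\n",
"    samples = [forward_with_dropout(features) for _ in range(iterations)]\n",
"    return [sum(s[c] for s in samples) / iterations for c in range(2)]\n",
"\n",
"print(mc_dropout_predict([1.0, 0.2, 0.5, 1.0]))\n",
"```\n",
"\n",
"Every call to `forward_with_dropout` uses a different random mask, so the averaged output plays the role of the Bayesian ensemble described above."
]
},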
{
"cell_type": "code",
"execution_count": 10,
"id": "733246b8",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"from baal.bayesian.dropout import patch_module\n",
"from baal.utils.metrics import ECE, Accuracy\n",
"from functools import partial\n",
"from pprint import pprint\n",
"\n",
"def evaluate_pipeline_on_dataset(pipeline: TextClassificationPipeline, dataset: Dataset):\n",
" predictions = output_to_probs(pipeline(dataset[TEXT_COLUMN]))\n",
" labels = np.array(dataset[LABEL_COLUMN])\n",
" ece = ECE(n_bins=20)\n",
" ece.update(output=torch.from_numpy(predictions), target=torch.from_numpy(labels))\n",
" accuracy = Accuracy()\n",
" accuracy.update(output=torch.from_numpy(predictions), target=torch.from_numpy(labels))\n",
" return {\"ece\":ece.value, \"accuracy\": accuracy.value}\n",
"\n",
"standard_metrics = evaluate_pipeline_on_dataset(pipeline, dataset)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "59062120",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 14%|███████████████████████████▎ | 1/7 [09:57<59:46, 597.69s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'bayesian': {'accuracy': 0.9036697149276733, 'ece': 0.0636752943077341},\n",
" 'standard': {'accuracy': 0.9105504155158997, 'ece': 0.08026097557686887}}\n"
]
}
],
"source": [
"def evaluate_pipeline_on_dataset(pipeline: TextClassificationPipeline, dataset: Dataset):\n",
" predictions = pipeline(dataset[TEXT_COLUMN])\n",
" labels = np.array(dataset[LABEL_COLUMN])\n",
" ece = ECE(n_bins=20)\n",
" ece.update(output=torch.from_numpy(predictions), target=torch.from_numpy(labels))\n",
" accuracy = Accuracy()\n",
" accuracy.update(output=torch.from_numpy(predictions), target=torch.from_numpy(labels))\n",
" return {\"ece\":ece.value, \"accuracy\": accuracy.value}\n",
"\n",
"# We now modify the model to do MC-Dropout\n",
"pipeline.model = patch_module(pipeline.model)\n",
"\n",
"# We need a custom inference function to aggregate predictions\n",
"def inference(sentences, pipeline, iterations, **kwargs):\n",
" preds = np.array([output_to_probs(pipeline(sentences, **kwargs)) for _ in range(iterations)])\n",
" return preds.mean(0)\n",
"\n",
"bayesian_pipeline = partial(inference, pipeline=pipeline, iterations=20)\n",
"bayesian_metrics = evaluate_pipeline_on_dataset(bayesian_pipeline, dataset)\n",
"\n",
"\n",
"pprint({\"standard\": standard_metrics,\n",
" \"bayesian\": bayesian_metrics})"
]
},
{
"cell_type": "markdown",
"id": "3bf49eb7",
"metadata": {},
"source": [
"## Impact of the iterations parameters\n",
"\n",
"Using 20 iterations, we improved our model's calibration by a significant margin. This is quite good!\n",
"\n",
"Let's investigate how more iterations means better ECE."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "3f79ccf1",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [27:11<00:00, 233.01s/it]\n"
]
}
],
"source": [
"# Get the metrics at different number of samplings, we get the average over 3 runs.\n",
"from tqdm import tqdm\n",
"metrics = {it: [evaluate_pipeline_on_dataset(partial(inference, pipeline=pipeline, iterations=it), dataset) for r in range(3)]\n",
" for it in tqdm([1, 5, 10, 20, 40, 60, 80])}"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "f77f7c12",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Merge metrics together.\n",
"metrics = {it: {m: (np.mean([v[m] for v in vals]),\n",
" np.std([v[m] for v in vals]))\n",
" for m in vals[0].keys()} for it, vals in metrics.items()}\n",
"\n",
"for met in [\"ece\", \"accuracy\"]:\n",
" iterations = list(metrics.keys())\n",
" means = [v[met][0] for v in metrics.values()]\n",
" std = [v[met][1] for v in metrics.values()]\n",
" plt.errorbar(iterations, means, yerr=std, label=\"Bayesian\")\n",
" plt.hlines(standard_metrics[met], xmin=0.0, xmax=max(iterations), color=\"red\", label=\"Standard\")\n",
" plt.xlabel(\"Iterations\")\n",
" plt.ylabel(met)\n",
" plt.title(f\"{met} between Bayesian sampling and frequentist\")\n",
" plt.legend()\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"id": "8434f62b",
"metadata": {},
"source": [
"#### Discussion\n",
"\n",
"Testing our ECE at multiple iterations, we see that it converges quickly after ~40 iterations. While the accuracy takes a hit in the beginning, it quickly comes back. Of course, sampling brings **noise** to the prediction, but it stabilizes quickly with enough iterations.\n",
"\n",
"\n",
"\n",
"### Conclusion\n",
"\n",
"Using a couple of line of code, we can improve our model's calibration. While we now require multiple predictions per input, the cost should not be too prohibitive for most cases. If you have access to large GPUs, I suggest duplicating your dataset and aggregate the predictions at end. \n",
"\n",
"I have gone quickly over Bayesian deep learning and MC-Dropout so here are some resources if you want to know more:\n",
"1. [BaaL background literature](https://baal.readthedocs.io/en/latest/literature/core-papers.html)\n",
"2. [BaaL user guide](https://baal.readthedocs.io/en/latest/user_guide/index.html)\n",
"\n",
"\n",
"Earlier, I mentioned model uncertainty versus data uncertainty, if you would like to know more I would recommend the following resources:\n",
"* [Bayesian active learning for production, a systematic study and a reusable library\n",
"](https://arxiv.org/abs/2006.09916) (Atighehchian et al. 2020)\n",
"* [Synbols: Probing Learning Algorithms with Synthetic Datasets (Section 3.3)\n",
"](https://nips.cc/virtual/2020/public/poster_0169cf885f882efd795951253db5cdfb.html) (Lacoste et al. 2020)\n",
"\n",
"\n",
"If you have any question or suggestion, please contact me at:\n",
"1. @Dref360 on [Slack](https://join.slack.com/t/baal-world/shared_invite/zt-z0izhn4y-Jt6Zu5dZaV2rsAS9sdISfg)\n",
"2. frederic.branchaud.charron@gmail.com\n",
"\n",
"I'm thinking of more blog posts combining HuggingFace and BaaL, let me know if that interest you!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7cc9c342",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "12130668",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}