arithmetic mean, geometric mean, and harmonic mean

It is never too late to learn things.

I was curious why the F1-score is needed in the first place. It turns out that accuracy is a poor metric for imbalanced data.

Imagine you're a medical researcher developing a test for a rare disease. This disease affects only 1% of the population, meaning 99% of tests will be for healthy individuals (negative) and only 1% for individuals with the disease (positive).

Scenario 1: Using Accuracy

  • Your test correctly identifies 99.9% of healthy individuals (true negatives). P(true_negative | negative_examples) = Specificity

  • However, it only correctly identifies 50% of individuals with the disease (true positives), meaning the other 50% are missed (false negatives). In other words, recall (also called sensitivity) is 50%. P(true_positive | positive_examples) = Recall

  • Overall accuracy (per 100 people: 99 healthy, 1 diseased): (0.999 * 99 true negatives + 0.5 * 1 true positives) / 100 = (98.901 + 0.5) / 100 ≈ 99.4%
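As a quick sanity check, here is a minimal Python sketch of that accuracy calculation (all variable names are illustrative, not from any library):

```python
# Accuracy on the imbalanced example: out of every 100 people,
# 99 are healthy and 1 has the disease.
specificity = 0.999  # P(test negative | healthy)
recall = 0.50        # P(test positive | diseased), a.k.a. sensitivity

healthy, diseased = 99, 1
true_negatives = specificity * healthy  # 98.901
true_positives = recall * diseased      # 0.5

accuracy = (true_negatives + true_positives) / (healthy + diseased)
print(f"accuracy = {accuracy:.1%}")     # 99.4%
```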

This seems like a high accuracy, suggesting a good test. However, in this context, it's misleading:

  • Missing half of the individuals with the disease is potentially dangerous and could lead to delayed diagnosis and treatment.
  • The high accuracy is mainly driven by correctly identifying the much larger group of healthy individuals, masking the poor performance for the group you actually care about (individuals with the disease).

Scenario 2: Computing F1 Score from Specificity and Recall

To compute the F1-score, we first need the test's precision and recall.

From the given conditions, we have:

  • Recall (sensitivity): 50%, the chance of a positive test result for a positive sample
  • Specificity: 99.9%, the chance of a negative test result for a negative sample

However, specificity is not directly used to compute the F1-score. Instead, we need precision:

Let's assume your test is applied to a population of 10,000 individuals (a number large enough to make the calculations easy and the counts close to whole numbers).

Given:

  • Disease prevalence is 1%: 100 individuals have the disease (actual positive), and 9,900 do not (actual negatives).
  • Recall: the test correctly identifies 50% of actual positives (true positives): 0.50 * 100 = 50 true positives, leaving 50 false negatives.
  • Specificity: the test correctly identifies 99.9% of actual negatives (true negatives): 0.999 * 9,900 = 9,890.1 true negatives, leaving 9.9 false positives (≈ 10).

Now, with true positives and false positives known, we can calculate precision:

Precision = True Positives / (True Positives + False Positives)

Precision = 50 / (50 + 10) = 50 / 60 ≈ 0.8333 (or 83.33%), rounding the 9.9 false positives up to 10

Now that we have both recall and precision, we can calculate the F1-score:

F1-score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.8333 * 0.5) / (0.8333 + 0.5) ≈ 2 * 0.41665 / 1.3333 ≈ 0.8333 / 1.3333 ≈ 0.625 (or 62.5%)

So, the F1-score of the test is approximately 0.625, or 62.5%, reflecting the balance between precision and recall in the context of this rare disease diagnosis.
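The whole derivation fits in a few lines of Python. Here is a sketch of the arithmetic above (variable names are mine, and the false positives are kept at the exact 9.9 rather than rounded to 10, so precision lands slightly above 83.33%):

```python
# Scenario 2 end to end: derive the confusion matrix from prevalence,
# recall, and specificity, then compute precision and the F1-score.
population = 10_000
prevalence = 0.01
recall = 0.50
specificity = 0.999

positives = prevalence * population  # 100 individuals with the disease
negatives = population - positives   # 9,900 healthy individuals

tp = recall * positives              # 50 true positives
fn = positives - tp                  # 50 false negatives
tn = specificity * negatives         # 9,890.1 true negatives
fp = negatives - tn                  # 9.9 false positives (~10)

precision = tp / (tp + fp)                          # ~0.835
f1 = 2 * precision * recall / (precision + recall)  # ~0.625
print(f"precision = {precision:.4f}, F1 = {f1:.4f}")
```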

Scenario 3: Computing F1 Score from Precision and Recall

Here we focus on the two metrics that describe how well the test finds true positives (among all positive predictions vs. among all positive samples) and combine them with the harmonic mean to obtain the F1-score.

Let's say the test has

  • a precision (P(true_positive | positive_predictions)) of 80% (meaning 80% of identified positives are truly diseased) and

  • a recall (P(true_positive | positive_samples)) of 50% (as mentioned before, it correctly identifies 50% of individuals with the disease).

  • F1-score = 2 * (precision * recall) / (precision + recall) = 2 * (0.80 * 0.50) / (0.80 + 0.50) = 0.80 / 1.30 ≈ 61.5%

This F1-score of roughly 61.5% paints a much clearer picture: while the test is good at identifying healthy individuals, it clearly struggles with accurately detecting the disease itself.
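A tiny Python sketch makes this harmonic-mean behavior concrete (the f1_score function below is an illustrative helper, not an import from any library):

```python
# F1 is the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f"{f1_score(0.80, 0.50):.1%}")  # 61.5%, Scenario 3's numbers

# The harmonic mean is pulled toward the smaller value, which is why
# F1 stays low when recall is poor, no matter how high precision gets:
print(f"{f1_score(0.99, 0.50):.1%}")  # still only ~66.4%
```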

Therefore, accuracy becomes a poor metric in imbalanced data situations, especially when the minority class (individuals with the disease) is crucial to identify correctly. Metrics like the F1-score, which consider both precision and recall, are more informative in such cases.

Remember, always choose the metric that best aligns with your specific problem and its priorities.
