Machine Learning

Precision and Recall

The terrorist detection task is an imbalanced classification problem: we have two classes we need to identify, terrorists and not terrorists, with one category (non-terrorists) representing the overwhelming majority of the data points. Another imbalanced classification problem occurs in disease detection when the rate of the disease in the public is very low. In both of these cases, the positive class (disease or terrorist) is greatly outnumbered by the negative class. These problems are examples of the fairly common case in data science where accuracy is not a good measure for assessing model performance.
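
To see the problem concretely, here is a minimal sketch (using scikit-learn and a made-up dataset where roughly 0.1% of points are positive): a model that simply predicts the negative class for everyone still scores near-perfect accuracy while catching no positives at all.

```python
# On a heavily imbalanced dataset, a model that predicts "not terrorist"
# for everyone still scores very high accuracy.
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.001).astype(int)  # ~0.1% positives (synthetic)
y_pred = np.zeros_like(y_true)                     # always predict the negative class

print("accuracy:", accuracy_score(y_true, y_pred))  # ~0.999, yet no positives are found
```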

Recall

Intuitively, we know that proclaiming all data points as negative (not a terrorist) in the terrorist detection problem isn’t helpful; instead, we should focus on identifying the positive cases. The metric our intuition tells us we should maximize is known in statistics as recall: the ability of a model to find all the relevant cases within a data set.

Recall = True Positives / (True Positives + False Negatives)
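
As a quick illustration (the labels and predictions below are made up), recall can be computed by hand from the true-positive and false-negative counts, or with scikit-learn's recall_score:

```python
# Recall = TP / (TP + FN): the fraction of actual positives the model finds.
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # three actual positives
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]   # the model finds only one of them

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp / (tp + fn))                # 0.333...
print(recall_score(y_true, y_pred))  # same value via scikit-learn
```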

Precision

While recall expresses the ability to find all relevant instances of a class in a data set, precision expresses what proportion of the data points the model assigned to the relevant class actually belong to it.

Precision = True Positives / (True Positives + False Positives)
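
A matching sketch for precision (again with made-up labels and predictions), computed by hand from the true-positive and false-positive counts and checked against scikit-learn's precision_score:

```python
# Precision = TP / (TP + FP): of everything the model flagged as positive,
# how much actually was positive.
from sklearn.metrics import precision_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]   # four flagged, two of them correct

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

print(tp / (tp + fp))                   # 0.5
print(precision_score(y_true, y_pred))  # same value via scikit-learn
```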

ROC Curve

One of the main visualization techniques for showing the performance of a classification model is the Receiver Operating Characteristic (ROC) curve. The idea is simple: the ROC curve shows how the relationship between the true positive rate (recall) and the false positive rate changes as we vary the threshold for identifying a positive data point in our model. The threshold represents the value above which we consider a data point to be in the positive class.

[Figure: ROC curves for two models, a blue curve and a red curve, plus a black diagonal line representing a random classifier.]
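
A rough sketch of the threshold sweep, using scikit-learn's roc_curve on made-up scores: each threshold yields one (false positive rate, true positive rate) point on the curve.

```python
# Sweep the decision threshold over the predicted scores and record the
# (FPR, TPR) pair at each threshold; plotting these points gives the ROC curve.
from sklearn.metrics import roc_curve

y_true  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]                       # made-up labels
y_score = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # made-up model scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"thr={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```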

AUC

We can quantify a model’s ROC curve by calculating the total Area Under the Curve (AUC), a metric that falls between zero and one, with a higher number indicating better classification performance. In the graph above, the AUC for the blue curve will be greater than that for the red curve, meaning the blue model achieves a better trade-off between the true positive rate and the false positive rate across thresholds. A random classifier (the black line) achieves an AUC of 0.5.
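
A small sketch of this comparison on synthetic data, mirroring the blue/red/black curves: the scorer whose outputs track the labels more closely gets the higher AUC, while a scorer with no signal lands near 0.5.

```python
# AUC comparison on synthetic scores with different amounts of noise.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)

strong_scores = y_true + rng.normal(0.0, 0.5, size=1000)   # scores track the labels closely
weak_scores   = y_true + rng.normal(0.0, 2.0, size=1000)   # much noisier signal
random_scores = rng.random(1000)                           # no signal at all

print("strong model AUC:", roc_auc_score(y_true, strong_scores))  # roughly 0.9
print("weak model AUC:  ", roc_auc_score(y_true, weak_scores))    # lower, but above 0.5
print("random AUC:      ", roc_auc_score(y_true, random_scores))  # close to 0.5
```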

Metric for Evaluating Recommendation Engines

Cumulative Gain (CG)

If every recommendation has a graded relevance score associated with it, CG is the sum of graded relevance values of all results in a search result list.

The Cumulative Gain at a particular rank position p is CG_p = rel_1 + rel_2 + ... + rel_p, where rel_i is the graded relevance of the result at position i. The problem with CG is that it does not take the rank of a result into consideration when determining the usefulness of a result set.

Discounted Cumulative Gain

To overcome this we introduce DCG. DCG penalizes highly relevant documents that appear lower in the result list by discounting the graded relevance value logarithmically in proportion to the position of the result. A common formulation, and the one the example values below correspond to, is DCG_p = Σ_{i=1}^{p} (2^rel_i - 1) / log2(i + 1).

Eg:

  • setA: [3, 1, 2, 3, 2, 0] CG setA: 11
  • setB: [3, 3, 2, 2, 1, 0] CG setB: 11
  • DCG setA: 13.306224081788834
  • DCG setB: 14.595390756454924
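
The snippet below reproduces these numbers; it assumes the exponential-gain form of DCG given above, which is what the example values correspond to.

```python
# Reproduces the CG/DCG numbers above using
# DCG_p = sum((2**rel_i - 1) / log2(i + 1)).
import math

def cg(rels):
    return sum(rels)

def dcg(rels):
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels))

set_a = [3, 1, 2, 3, 2, 0]
set_b = [3, 3, 2, 2, 1, 0]

print(cg(set_a), cg(set_b))   # 11 11
print(dcg(set_a))             # 13.306224081788834
print(dcg(set_b))             # 14.595390756454924
```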

Normalized Discounted Cumulative Gain

An issue arises with DCG when we want to compare a search engine’s performance from one query to the next, because result lists can vary in length depending on the query that has been provided.

Hence, by normalizing the cumulative gain at each position for a chosen value of p across queries, we arrive at NDCG: NDCG_p = DCG_p / IDCG_p. The denominator is obtained by sorting all the relevant documents by their relative relevance, producing the maximum possible DCG through position p (a.k.a. the Ideal Discounted Cumulative Gain, IDCG).

The ratio will always be in the range [0, 1], with 1 being a perfect score, meaning that the DCG is the same as the IDCG. NDCG values can therefore be averaged over all queries to obtain a measure of the average performance of a recommender system’s ranking algorithm.
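
A minimal NDCG sketch, assuming (as in the example above) that the ideal ordering is obtained by sorting the relevance scores of the returned list itself in descending order:

```python
# NDCG sketch: normalize DCG by the DCG of the ideally ordered list (IDCG).
import math

def dcg(rels):
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(rels):
    ideal = sorted(rels, reverse=True)   # best possible ordering of the same items
    return dcg(rels) / dcg(ideal)

print(ndcg([3, 1, 2, 3, 2, 0]))   # ~0.912: relevant items are not all ranked first
print(ndcg([3, 3, 2, 2, 1, 0]))   # 1.0: already in ideal order
```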

Limitations of NDCG

  • NDCG does not penalize bad (irrelevant) documents appearing in the results.
  • It does not penalize relevant documents that are missing from the results.
  • It may not be suitable for measuring the performance of queries that often have several equally good results.