A Python package for calculating precision-recall-gain

In "Precision-Recall-Gain Curves: PR Analysis Done Right" by Flach and Kull (2015), they argue that the standard way of measuring the area under the precion-recall curve (the AUPRC metric) is flawed. They show that this metric does not have the same properties as the area under an ROC curve. In particular, the scores calculated for random models shifts as you change class balance, straight lines between two points are not valid because the scale is curved, and the AUPRC is hard to interpret.

Let Ο€ be the fraction of positives (or class balance). Their fix is to move the origin to (Ο€,Ο€) and stretch the axes:

precision_gain = (precision βˆ’ Ο€) / ((1 βˆ’ Ο€) Β· precision)
recall_gain = (recall βˆ’ Ο€) / ((1 βˆ’ Ο€) Β· recall)

Now, random models score zero, lines between points are straight, and the Pareto frontier of models is a convex hull.
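To make the transform concrete, here is a minimal sketch in plain NumPy (the helper names below are my own, not part of any package's API):

```python
import numpy as np

def precision_gain(precision, pi):
    # Gain transform: pi is the positive-class rate (class balance).
    precision = np.asarray(precision, dtype=float)
    return (precision - pi) / ((1 - pi) * precision)

def recall_gain(recall, pi):
    recall = np.asarray(recall, dtype=float)
    return (recall - pi) / ((1 - pi) * recall)

pi = 0.1
# Chance level maps to 0, perfect maps to 1; below chance comes out negative.
print(precision_gain([0.1, 0.5, 1.0], pi))   # [0.  0.888...  1.]
print(recall_gain([0.05, 0.5, 1.0], pi))     # first entry is negative (below chance)
```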

Intuition on the origin shift

Let Ο€ be the fraction of real positives in the data. A "coin-flip" model that guesses positive with probability Ο€ keeps a fraction Ο€ of the items and, in expectation, has both precision Ο€ and recall Ο€. A model which always predicts positive also has precision Ο€. The authors therefore argue that Ο€ is the "natural zero": the score you get for no skill beyond chance.

So the raw precision and recall of a random baseline depend on the class distribution, which makes scores hard to compare across datasets. If you want to measure your "gain" over random chance, you should put that chance point at the origin, and that is exactly what subtracting Ο€ in the numerator does.
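As a quick check, plug the chance point precision = recall = Ο€ into the formulas above:

precision_gain = (Ο€ βˆ’ Ο€) / ((1 βˆ’ Ο€) Β· Ο€) = 0, and likewise recall_gain = 0,

so a no-skill model lands exactly at the new origin no matter what Ο€ is.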

Intuition on the stretching

Distances on the raw PR curve are deceptive. Moving from 0.02 to 0.04 precision is a 100% lift when Ο€ = 0.01, but the same absolute jump from 0.52 to 0.54 is tiny. We want a scale where "halfway to perfect" means the same thing everywhere: a model that scores 0.5 should represent the same relative ability whether the data are imbalanced or not.

Vanilla precision spans Ο€ (random chance) to 1, so the same absolute value can mean wildly different things as you vary Ο€. Rescaling so that Ο€ maps to 0 and 1 maps to 1 puts every dataset on the same footing.

However, the paper explains that this rescaling needs to be done on a harmonic scale (not a linear one), which is why the equations divide by precision and recall rather than by 1 βˆ’ Ο€ alone. In essence, the vanilla AUPRC metric takes an arithmetic mean of precision scores, which is not appropriate for quantities that naturally combine harmonically (as in the F-score).
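One way to see the harmonic scale at work: inverting the gain formula shows that a precision gain of 0.5 corresponds to a raw precision of 2Ο€/(1 + Ο€), the harmonic mean of Ο€ and 1, rather than the arithmetic midpoint (1 + Ο€)/2. A small sketch (again my own illustration, not the package's API):

```python
# Where does "halfway to perfect" (precision_gain = 0.5) sit on the raw precision scale?
# Inverting gain = (p - pi) / ((1 - pi) * p) gives p = pi / (1 - gain * (1 - pi)).
def precision_at_gain(gain, pi):
    return pi / (1 - gain * (1 - pi))

for pi in (0.01, 0.1, 0.5):
    halfway = precision_at_gain(0.5, pi)
    harmonic_mean = 2 / (1 / pi + 1)  # harmonic mean of pi and 1
    print(f"pi={pi}: gain 0.5 corresponds to precision {halfway:.4f} "
          f"(harmonic mean of pi and 1 = {harmonic_mean:.4f})")
```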

Calculating precision-recall-gain in Python

There's an official implementation of PRG in MATLAB, R, and Python here, but the Python implementation was very broken. Someone had opened a pull request to add it to scikit-learn, but unfortunately it looks like it won't get merged. So I copied their implementation into a stand-alone Python library, precision-recall-gain. You can find it here: https://github.com/crypdick/precision-recall-gain.
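If you want to see the transform end to end without any extra dependency, here is a rough sketch that applies the gain formulas directly to scikit-learn's precision_recall_curve output (the dataset and model are arbitrary placeholders, and this is not the package's own API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Arbitrary imbalanced toy problem, just to have scores to work with.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

precision, recall, _ = precision_recall_curve(y_test, scores)
precision, recall = precision[:-1], recall[:-1]  # drop the appended (1, 0) endpoint
pi = y_test.mean()  # class balance of the evaluation set

# The gain transform applied point-wise to the PR curve.
prec_gain = (precision - pi) / ((1 - pi) * precision)
rec_gain = (recall - pi) / ((1 - pi) * recall)

# These (rec_gain, prec_gain) pairs can be plotted in the unit square;
# points below chance come out negative and fall outside it.
```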

Copyright Ricardo Decal. richarddecal.com