[core] calibration

Metrics that measure model calibration.

Reference Papers:

[1] Lin, Zhen, Shubhendu Trivedi, and Jimeng Sun. “Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks.” ICLR 2023.

[2] Nixon, Jeremy, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. “Measuring Calibration in Deep Learning.” In CVPR workshops, vol. 2, no. 7. 2019.

[3] Patel, Kanil, William Beluch, Bin Yang, Michael Pfeiffer, and Dan Zhang. “Multi-class uncertainty calibration via mutual information maximization-based binning.” ICLR 2021.

[4] Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. “On calibration of modern neural networks.” ICML 2017.

[5] Kull, Meelis, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. “Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration.” Advances in neural information processing systems 32 (2019).

[6] Brier, Glenn W. “Verification of forecasts expressed in terms of probability.” Monthly weather review 78, no. 1 (1950): 1-3.

Expected Calibration Error (ECE).

We group samples into ‘bins’ based on the top-class prediction (the confidence). For each bin, we compute the absolute difference between the average top-class prediction and the frequency of the top class being correct (i.e. accuracy). ECE is the average of these absolute differences, weighted by the number of points in each bin. It can be expressed by the following formula, with $$B_m$$ denoting the m-th bin, M the number of bins, and N the total number of samples:

$ECE = \sum_{m=1}^M \frac{|B_m|}{N} |acc(B_m) - conf(B_m)|$

Example

>>> import numpy as np
>>> pred = np.asarray([[0.2, 0.2, 0.6], [0.2, 0.31, 0.49], [0.1, 0.1, 0.8]])
>>> label = np.asarray([2,1,2])
>>> ece_confidence_multiclass(pred, label, bins=2)
0.36333333333333334


Explanation of the example: The bins are [0, 0.5] and (0.5, 1]. The first bin contains one sample, with a top-class prediction of 0.49; its predicted class (2) differs from its label (1), so the accuracy is 0. The second bin contains the other two samples, with an average confidence of 0.7 and an accuracy of 1. Thus, the ECE is $$\frac{1}{3} \cdot |0 - 0.49| + \frac{2}{3}\cdot |1 - 0.7|=0.3633$$.
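The arithmetic above can be reproduced with a few lines of NumPy. This is a minimal sketch of top-label ECE with equal-width bins, not the library's implementation; the function name `ece_top_label_sketch` is hypothetical, and the right-closed bin convention is assumed from the bins [0, 0.5] and (0.5, 1] stated in the explanation.

```python
import numpy as np

def ece_top_label_sketch(prob, label, bins=20):
    """Top-label ECE with equal-width bins (illustrative sketch only)."""
    prob = np.asarray(prob, dtype=float)
    label = np.asarray(label)
    conf = prob.max(axis=1)                                  # top-class confidence
    correct = (prob.argmax(axis=1) == label).astype(float)   # 1 if top class is right
    edges = np.linspace(0.0, 1.0, bins + 1)
    # Assumed convention: right-closed bins (edges[i], edges[i+1]];
    # a confidence of exactly 0 is placed in the first bin.
    idx = np.clip(np.searchsorted(edges, conf, side="left") - 1, 0, bins - 1)
    ece = 0.0
    for b in range(bins):
        mask = idx == b
        if mask.any():
            # |B_m| / N * |acc(B_m) - conf(B_m)|
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

pred = np.asarray([[0.2, 0.2, 0.6], [0.2, 0.31, 0.49], [0.1, 0.1, 0.8]])
label = np.asarray([2, 1, 2])
print(ece_top_label_sketch(pred, label, bins=2))  # approximately 0.3633
```

Empty bins contribute nothing to the sum, which is why the loop skips them.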

Parameters
• prob (np.ndarray) – (N, C)

• label (np.ndarray) – (N,)

• bins (int, optional) – Number of bins. Defaults to 20.

• adaptive (bool, optional) – If False, bins are equal width ([0, 0.05, 0.1, …, 1]). If True, bin widths are adaptive such that each bin contains the same number of points. Defaults to False.
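To illustrate the difference between the two binning modes, here is a hedged sketch of how bin edges could be derived in each case. The helper `bin_edges_sketch` is hypothetical, and using evenly spaced quantiles for the adaptive case is an assumption about one common way to get equal-count bins; the library may construct its bins differently.

```python
import numpy as np

def bin_edges_sketch(conf, bins=20, adaptive=False):
    """Return bins + 1 bin edges for a 1-D array of confidences (sketch)."""
    conf = np.asarray(conf, dtype=float)
    if adaptive:
        # Assumed approach: equal-mass bins via evenly spaced quantiles,
        # so each bin holds roughly the same number of points.
        return np.quantile(conf, np.linspace(0.0, 1.0, bins + 1))
    # Equal-width bins over [0, 1].
    return np.linspace(0.0, 1.0, bins + 1)

conf = [0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
print(bin_edges_sketch(conf, bins=2, adaptive=False))  # edges at 0, 0.5, 1
print(bin_edges_sketch(conf, bins=2, adaptive=True))   # edges follow the data
```

With confidences concentrated near 1, as is typical for modern networks, equal-width bins leave the low bins nearly empty, while adaptive bins spread the samples evenly.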

Expected Calibration Error (ECE) for binary classification.

Similar to ece_confidence_multiclass(), but computed on the predicted probability of class 1 instead of the top prediction.

Parameters
• prob (np.ndarray) – (N, C)

• label (np.ndarray) – (N,)

• bins (int, optional) – Number of bins. Defaults to 20.

• adaptive (bool, optional) – If False, bins are equal width ([0, 0.05, 0.1, …, 1]). If True, bin widths are adaptive such that each bin contains the same number of points. Defaults to False.
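As a rough illustration of the binary variant, the sketch below bins samples by their class-1 probability and compares the mean probability in each bin with the observed frequency of label 1. The name `ece_binary_sketch` is hypothetical, and the input is assumed to be a 1-D array of class-1 probabilities, which may differ from the input layout the library expects.

```python
import numpy as np

def ece_binary_sketch(p1, label, bins=20):
    """ECE on class-1 probabilities with equal-width bins (sketch only)."""
    p1 = np.asarray(p1, dtype=float)        # assumed: P(class 1) per sample
    label = np.asarray(label, dtype=float)  # 0/1 labels
    edges = np.linspace(0.0, 1.0, bins + 1)
    # Assumed convention: right-closed bins, with 0 placed in the first bin.
    idx = np.clip(np.searchsorted(edges, p1, side="left") - 1, 0, bins - 1)
    ece = 0.0
    for b in range(bins):
        mask = idx == b
        if mask.any():
            # Gap between observed frequency of class 1 and mean predicted probability.
            ece += mask.mean() * abs(label[mask].mean() - p1[mask].mean())
    return ece

# Two bins: mean probability 0.2 vs frequency 0, and 0.8 vs frequency 1.
print(ece_binary_sketch([0.1, 0.3, 0.7, 0.9], [0, 0, 1, 1], bins=2))  # approximately 0.2
```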

Classwise Expected Calibration Error (ECE).

This is equivalent to applying ece_confidence_binary() to each class and taking the average.

Parameters
• prob (np.ndarray) – (N, C)

• label (np.ndarray) – (N,)

• bins (int, optional) – Number of bins. Defaults to 20.

• threshold (float) – Threshold used to filter out samples. If the number of classes C is very large, many classes receive predictions close to 0. Any prediction below the threshold is considered noise and ignored. In recent papers, this is typically set to a small number (such as 1/C).

• adaptive (bool, optional) – If False, bins are equal width ([0, 0.05, 0.1, …, 1]). If True, bin widths are adaptive such that each bin contains the same number of points. Defaults to False.
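The per-class averaging and the role of the threshold can be sketched as follows. This is an illustrative approximation under stated assumptions, not the library's code: `ece_classwise_sketch` is a hypothetical name, predictions below the threshold are dropped per class before binning, each class's ECE is normalized by the number of predictions kept for that class, and classes with no remaining predictions are skipped. The library may normalize or filter differently.

```python
import numpy as np

def ece_classwise_sketch(prob, label, bins=20, threshold=0.0):
    """Average of one-vs-rest ECEs over classes (illustrative sketch)."""
    prob = np.asarray(prob, dtype=float)
    label = np.asarray(label)
    edges = np.linspace(0.0, 1.0, bins + 1)
    per_class = []
    for k in range(prob.shape[1]):
        p_k = prob[:, k]                       # predicted probability of class k
        y_k = (label == k).astype(float)       # one-vs-rest 0/1 label
        keep = p_k >= threshold                # assumed: drop low predictions as noise
        p_k, y_k = p_k[keep], y_k[keep]
        if p_k.size == 0:
            continue                           # assumed: skip classes with nothing left
        idx = np.clip(np.searchsorted(edges, p_k, side="left") - 1, 0, bins - 1)
        ece_k = 0.0
        for b in range(bins):
            mask = idx == b
            if mask.any():
                ece_k += mask.mean() * abs(y_k[mask].mean() - p_k[mask].mean())
        per_class.append(ece_k)
    return float(np.mean(per_class)) if per_class else 0.0

pred = np.asarray([[0.2, 0.2, 0.6], [0.2, 0.31, 0.49], [0.1, 0.1, 0.8]])
label = np.asarray([2, 1, 2])
print(ece_classwise_sketch(pred, label, bins=2, threshold=1 / 3))
```

With threshold = 1/C, class 0 in this example has no predictions left, so only classes 1 and 2 enter the average.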

pyhealth.metrics.calibration.brier_top1(prob, label)

Brier score (i.e. mean squared error between prediction and 0-1 label) of the top prediction.
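Read literally, the description above amounts to the mean squared difference between the top-class confidence and a 0-1 indicator of whether the top class is correct. The sketch below follows that reading; `brier_top1_sketch` is a hypothetical name and this is not claimed to match the library's implementation line for line.

```python
import numpy as np

def brier_top1_sketch(prob, label):
    """Mean squared error between top-class confidence and correctness (sketch)."""
    prob = np.asarray(prob, dtype=float)
    label = np.asarray(label)
    conf = prob.max(axis=1)                                  # top-class confidence
    correct = (prob.argmax(axis=1) == label).astype(float)   # 0-1 label for the top class
    return float(np.mean((conf - correct) ** 2))

pred = np.asarray([[0.2, 0.2, 0.6], [0.2, 0.31, 0.49], [0.1, 0.1, 0.8]])
label = np.asarray([2, 1, 2])
# (0.4^2 + 0.49^2 + 0.2^2) / 3
print(brier_top1_sketch(pred, label))  # approximately 0.1467
```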