Chewing Through mmeval Metric Code: AveragePrecision

Posted 2023-07-24, updated 2023-07-25. Categories: performance metrics, mmeval.

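This post walks through how AveragePrecision in mmeval is computed, in order to understand the logic of the code. The worked example was reconstructed by setting breakpoints and recording a full debugging session, and the reading of the code logic is my own, so some wording may be imprecise. Discussion and corrections are welcome.

Official mmeval code: AveragePrecision (MMEval 0.2.1 documentation)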

AveragePrecision

class AveragePrecision(MultiLabelMixin, BaseMetric):
    """Calculate the average precision with respect of classes.

    Args:
        average (str, optional): The average method. It supports two modes:

            - `"macro"`: Calculate metrics for each category, and calculate
                the mean value over all categories.
            - `None`: Return scores of all categories.

        Defaults to "macro".

    References
    ----------
    .. [1] `Wikipedia entry for the Average precision
           <https://en.wikipedia.org/w/index.php?title=Information_retrieval&
           oldid=793358396#Average_precision>`_
    """

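Computes the average precision for each label.

The metric supports two modes:

macro (default): compute the metric for each class first, then take the mean over all classes;
None: return the score of every class (e.g. {'AP_classwise': [100.0, 83.33, 100.0, 0.0]})

In code: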

from mmeval import AveragePrecision
# By default the mode is macro
average_precision = AveragePrecision()
# Explicitly select the None mode
average_precision = AveragePrecision(average=None)

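Initialization function

average_options: a list holding the two modes AveragePrecision supports; an assert checks that the argument is one of them and reports an error message otherwise;
self.average: stores the requested mode, macro or None;
self.pred_is_onehot: set to False (I don't yet understand one-hot encoding well; in any case this attribute is fixed to False here)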

def __init__(self, average: Optional[str] = 'macro', **kwargs) -> None:
    super().__init__(**kwargs)
    average_options = ['macro', None]
    assert average in average_options, 'Invalid `average` argument, ' \
        f'please specify from {average_options}.'
    self.average = average
    self.pred_is_onehot = False

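The add function

Presumably a helper that accumulates results batch by batch, so that the metric can later be computed over everything seen so far.

It takes two Sequence arguments: preds and labels

preds: predictions produced by the model, holding a score for every class, with shape (N, C), where N is the number of samples and C is the number of classes.
labels: ground-truth labels, with shape (N,) in label format or (N, C) in one-hot format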

def add(self, preds: Sequence, labels: Sequence) -> None:  # type: ignore # yapf: disable # noqa: E501
    """Add the intermediate results to `self._results`.

    Args:
        preds (Sequence): Predictions from the model. It should
            be scores of every class (N, C).
        labels (Sequence): The ground truth labels. It should be (N, ) for
            label-format, or (N, C) for one-hot encoding.

    Examples:
        No.1
        >>> _results = []
        >>> preds = [0, 1, 2, 3, 1, 2]
        >>> labels = [0, 1, 2, 4, 2, 4]
        >>> for pred, target in zip(preds, labels):
        ...     _results.append((pred, target))
        >>> print(_results)
        [(0, 0), (1, 1), (2, 2), (3, 4), (1, 2), (2, 4)]

        No.2
        >>> _results = []
        >>> preds = np.array([[0.9, 0.8, 0.3, 0.2],
        ...                   [0.1, 0.2, 0.2, 0.1],
        ...                   [0.7, 0.5, 0.9, 0.3],
        ...                   [0.8, 0.1, 0.1, 0.2]])
        >>> labels = [np.array([0, 1]),
        ...           np.array([1]),
        ...           np.array([2]),
        ...           np.array([0])]
        >>> for pred, target in zip(preds, labels):
        ...     _results.append((pred, target))
        >>> print(_results)
        [(array([0.9, 0.8, 0.3, 0.2]), array([0, 1])),
         (array([0.1, 0.2, 0.2, 0.1]), array([1])),
         (array([0.7, 0.5, 0.9, 0.3]), array([2])),
         (array([0.8, 0.1, 0.1, 0.2]), array([0]))]
    """
    for pred, target in zip(preds, labels):
        self._results.append((pred, target))

What the function does: zip pairs each entry of preds with the corresponding entry of labels, and append stores each pair in the list (see the Examples in the code).
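The accumulation across batches can be sketched with a plain list. The module-level `results` list and the free function `add` below are stand-ins I made up for `self._results` and the method; they are not mmeval code.

```python
results = []

def add(preds, labels):
    """Stand-in for the method above, writing to a plain list."""
    for pred, target in zip(preds, labels):
        results.append((pred, target))

add([[0.9, 0.8], [0.1, 0.2]], [[0, 1], [1]])   # first batch: 2 samples
add([[0.7, 0.5]], [[0]])                       # second batch: 1 sample

print(len(results))  # 3
print(results[2])    # ([0.7, 0.5], [0])
```

Note that zip stops at the shorter of its inputs, so a batch whose preds and labels differ in length would be truncated silently at this stage.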

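Result formatting function

Purpose: format the computed metric into a dictionary

Input: ap: the average precision of each class, or the single macro result
Return value: result_metrics: a dict.

How it works:

A dict variable result_metrics is created, then the branch taken depends on self.average;
Mode None: the first element of ap (ap[0]) is converted to a list and stored in _result, and result_metrics['AP_classwise'] is set to those values, each rounded to 4 decimal places;
Mode macro: working backwards from the code, ap must be indexable with the scalar result in its first position (e.g. for [0.9, 0.8, 0.3, 0.2], ap[0].item() = 0.9); result_metrics['mAP'] is set to that first value, rounded to 4 decimal places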

def _format_metric_results(self, ap):
    """Format the given metric results into a dictionary.

    Args:
        ap (list): Results of average precision for each categories
            or the single macro result.

    Returns:
        dict: The formatted dictionary.
    """
    result_metrics = dict()

    if self.average is None:
        _result = ap[0].tolist()
        result_metrics['AP_classwise'] = [round(_r, 4) for _r in _result]
    else:
        result_metrics['mAP'] = round(ap[0].item(), 4)

    return result_metrics

tips:

round(x[, n]): x is a numeric expression, and n is a numeric expression giving the number of decimal places to keep.

e.g. round(_r, 4) rounds _r to 4 decimal places
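A quick check of round on values like the ones in this example. One caveat worth knowing (a fact about Python 3 itself, not something the mmeval source mentions): for exact ties, round goes to the nearest even digit rather than always rounding half up.

```python
# A value similar to the class-wise AP score produced later in this post.
ap_percent = 83.33333587646484

rounded = round(ap_percent, 4)
print(rounded)               # 83.3333
print(round(80.34567, 2))    # 80.35

# Python 3 rounds exact halves to the nearest even integer.
print(round(2.5), round(3.5))  # 2 4
```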

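Reconstructing the computation

A plain-language reading of the example:

Suppose we run detection on 4 images, and the ground truth for each image (given by labels) is [0, 1], [1], [2], [0]: the first image contains two targets, of class 0 and class 1; the second image contains one target, of class 1; and so on.

The first row of preds holds the predicted values for the first image (possibly confidence scores; I am not sure), the second row those for the second image, and so on. Each column corresponds to one class, so AveragePrecision computes the average precision of each column.

Because overloading is used throughout, the functions all share one name, which makes them a little awkward to refer to.

Example input (int/float type)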

preds = [[0.9, 0.8, 0.3, 0.2],
         [0.1, 0.2, 0.2, 0.1],
         [0.7, 0.5, 0.9, 0.3],
         [0.8, 0.1, 0.1, 0.2]]
labels = [[0, 1], [1], [2], [0]]
average_precision(preds, labels)
# {'mAP': 70.833..}

_compute_metric (first overload)

@overload
@dispatch
def _compute_metric(
  self, preds: Sequence[Union[int, Sequence[Union[int, float]]]],
  labels: Sequence[Union[int, Sequence[int]]]) -> List[List]:
  """A Builtin implementation that computes the metric."""

  return self._compute_metric([np.array(pred) for pred in preds],
                              [np.array(target) for target in labels])

[np.array(pred) for pred in preds] evaluates to

[array([0.9, 0.8, 0.3, 0.2]), array([0.1, 0.2, 0.2, 0.1]), array([0.7, 0.5, 0.9, 0.3]), array([0.8, 0.1, 0.1, 0.2])]

[np.array(target) for target in labels] evaluates to

[array([0, 1]), array([1]), array([2]), array([0])]

In other words, each element of preds and labels is converted to an np.array; the contents are unchanged.
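The label conversion can be checked the same way as preds. This is only a scratch check of the list comprehension used in the first overload, not mmeval code. Because the per-image label lists have different lengths, the result stays a Python list of four separate 1-D arrays rather than one rectangular array.

```python
import numpy as np

labels = [[0, 1], [1], [2], [0]]
label_arrays = [np.array(target) for target in labels]

# Each image keeps its own array; the first has two entries, the rest one.
for arr in label_arrays:
    print(arr, arr.shape)
```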

Scratch test (VSCode)

import numpy as np
preds = [[0.9, 0.8, 0.3, 0.2],
         [0.1, 0.2, 0.2, 0.1],
         [0.7, 0.5, 0.9, 0.3],
         [0.8, 0.1, 0.1, 0.2]]
predt = [np.array(pred) for pred in preds]
print(predt)
# [array([0.9, 0.8, 0.3, 0.2]), array([0.1, 0.2, 0.2, 0.1]), array([0.7, 0.5, 0.9, 0.3]), array([0.8, 0.1, 0.1, 0.2])]

_compute_metric (second overload)

The return statement jumps into the other _compute_metric function; the output of the first is the input of the second

# Input preds = [array([0.9, 0.8, 0.3, 0.2]),
#                array([0.1, 0.2, 0.2, 0.1]),
#                array([0.7, 0.5, 0.9, 0.3]),
#                array([0.8, 0.1, 0.1, 0.2])]
# Input labels = [array([0, 1]), array([1]), array([2]), array([0])]

@dispatch
def _compute_metric(
  self, preds: Sequence[Union[np.ndarray, np.number]],
  labels: Sequence[Union[np.ndarray, np.number]]) -> List[List]:
  """A NumPy implementation that computes the metric."""

  preds = np.stack(preds)
  num_classes = preds.shape[1]
  labels = format_data(labels, num_classes,
                       self._label_is_onehot).astype(np.int64)

  assert preds.shape[0] == labels.shape[0], \
  'Number of samples does not match between preds' \
  f'({preds.shape[0]}) and labels ({labels.shape[0]}).'

  return _average_precision(preds, labels, self.average)

preds = np.stack(preds) stacks the arrays in preds together, as follows:

# Input preds = [array([0.9, 0.8, 0.3, 0.2]),
#                array([0.1, 0.2, 0.2, 0.1]),
#                array([0.7, 0.5, 0.9, 0.3]),
#                array([0.8, 0.1, 0.1, 0.2])]
# After np.stack(preds):
# print(preds)
[[0.9 0.8 0.3 0.2]
 [0.1 0.2 0.2 0.1]
 [0.7 0.5 0.9 0.3]
 [0.8 0.1 0.1 0.2]]

num_classes = preds.shape[1] is, in this example, the number of columns

num_classes = 4

labels = format_data(labels, num_classes, self._label_is_onehot).astype(np.int64)

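# Input labels = [array([0, 1]), array([1]), array([2]), array([0])]
#   each label must be one of np.ndarray, 'torch.Tensor' or 'oneflow.Tensor', otherwise an error is raised
# Input num_classes = 4
# self._label_is_onehot (I could not find where it is defined; it defaults to None)

format_data comes from the inherited multi_label module. Its purpose is to format differently shaped inputs (prediction scores, label-format data, and one-hot encodings) into one common output shape (N, num_classes)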
# Output (labels):
# print(labels)
[[1 1 0 0]
[0 1 0 0]
[0 0 1 0]
[1 0 0 0]]
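To make the (N, num_classes) output concrete, here is a minimal sketch of the label-format to multi-hot conversion. The helper name `labels_to_multihot` is made up for illustration; it is not mmeval's format_data, which also handles tensors and one-hot input, and it assumes every label index lies in [0, num_classes).

```python
import numpy as np

def labels_to_multihot(labels, num_classes):
    """Hypothetical helper: one row per sample, 1 where the class is present."""
    out = np.zeros((len(labels), num_classes), dtype=np.int64)
    for row, class_ids in enumerate(labels):
        # Set every listed class index of this sample to 1.
        out[row, np.asarray(class_ids, dtype=np.int64)] = 1
    return out

labels = [np.array([0, 1]), np.array([1]), np.array([2]), np.array([0])]
multihot = labels_to_multihot(labels, num_classes=4)
print(multihot)
```

Each row now has a fixed width of 4, which is what allows the later column-wise sorting and cumulative sums.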

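assert (the statement is too long for inline text, so it is shown in the code block below): preds.shape[0] must equal labels.shape[0], otherwise an error is raised. In other words, preds and labels must have the same number of samples (rows).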

assert preds.shape[0] == labels.shape[0], \
            'Number of samples does not match between preds' \
            f'({preds.shape[0]}) and labels ({labels.shape[0]}).'

The next step is the return statement, return _average_precision(preds, labels, self.average)

# Input preds
# print(preds)
[[0.9 0.8 0.3 0.2]
 [0.1 0.2 0.2 0.1]
 [0.7 0.5 0.9 0.3]
 [0.8 0.1 0.1 0.2]]
# Input labels
# print(labels)
[[1 1 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [1 0 0 0]]

In short, the processing before this function's return converts the list-typed preds and labels into numpy.ndarray.

predt = [np.array(pred) for pred in preds]
labelt = [np.array(target) for target in labels]
print(type(predt))
print(type(labelt))
# <class 'list'>
# <class 'list'>

preds = np.stack(predt)
num_classes = preds.shape[1]
labels = format_data(labelt, num_classes, None).astype(np.int64)
print(type(preds))
print(type(labels))
# <class 'numpy.ndarray'>
# <class 'numpy.ndarray'>

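The return statement then enters the _average_precision function

The _average_precision function

Computes the average precision for NumPy inputs. AP summarizes the P-R curve as the weighted mean of the maximum precision obtained for any r' > r, where r is the recall.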

$\text{AP} = \sum_n (R_n - R_{n-1}) P_n$

def _average_precision(preds: np.ndarray, labels: np.ndarray,
                       average) -> np.ndarray:
    r"""Calculate the average precision for numpy.

    AP summarizes a precision-recall curve as the weighted mean of maximum
    precisions obtained for any r'>r, where r is the recall:

    .. math::
        \text{AP} = \sum_n (R_n - R_{n-1}) P_n

    Note that no approximation is involved since the curve is piecewise
    constant.

    Args:
        preds (np.ndarray): The model prediction with shape
            ``(N, num_classes)``.
        labels (np.ndarray): The target of predictions with shape
            ``(N, num_classes)``.

    Returns:
        np.ndarray: average precision result.
    """
    # sort examples along classes
    # np.argsort returns the indices that would sort the values in ascending order
    sorted_pred_inds = np.argsort(-preds, axis=0)
    sorted_target = np.take_along_axis(labels, sorted_pred_inds, axis=0)

    # get indexes when gt_true is positive
    pos_inds = sorted_target == 1

    # Calculate cumulative tp case numbers
    tps = np.cumsum(pos_inds, 0)
    total_pos = tps[-1].copy()  # the last of tensor may change later

    # Calculate cumulative tp&fp(pred_poss) case numbers
    pred_pos_nums = np.arange(1, len(sorted_target) + 1)

    tps[np.logical_not(pos_inds)] = 0
    precision = np.divide(
        tps, np.expand_dims(pred_pos_nums, -1), dtype=np.float32)
    ap = np.divide(
        np.sum(precision, 0), np.clip(total_pos, 1, np.inf), dtype=np.float32)

    if average == 'macro':
        return ap.mean() * 100.0
    else:
        return ap * 100

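np.argsort(-preds, axis=0): ranks each column of preds from largest to smallest (because preds is negated in the call) and returns the row indices in that order.

np.take_along_axis(labels, sorted_pred_inds, axis=0): the previous step produced the indices that sort the scores in descending order; this step uses that index matrix to build a new matrix whose elements are arranged in that order.

That is, in this call the source matrix is labels and the index matrix is sorted_pred_inds, matched element by element along each column (axis=0). For column 0, for example:
sorted_target[i][0] = labels[sorted_pred_inds[i][0]][0]
So the value at row 0, column 0 of sorted_target is the entry of labels whose row index is stored at row 0, column 0 of sorted_pred_inds. The goal is to check, going down the predictions in preds from highest score to lowest, whether each one lines up with a true label. If the prediction for an image does not match the ground truth for that class, the entry is 0.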

# Sort the examples within each class
# np.argsort returns the indices that would sort the values in ascending order
sorted_pred_inds = np.argsort(-preds, axis=0)
# print(sorted_pred_inds)
# [[0 0 2 2]
#  [3 2 0 0]
#  [2 1 1 3]
#  [1 3 3 1]]
# np.take_along_axis builds a new matrix from an index matrix
sorted_target = np.take_along_axis(labels, sorted_pred_inds, axis=0)
# print(sorted_target)
# [[1 1 1 0]
#  [1 0 0 0]
#  [0 1 0 0]
#  [0 0 0 0]]
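As a cross-check of the indexing rule, the same reordering can be written as an explicit loop. This sketch only re-derives the arrays from this example; `kind='stable'` is my addition so that the tied scores in the last column are ordered reproducibly.

```python
import numpy as np

preds = np.array([[0.9, 0.8, 0.3, 0.2],
                  [0.1, 0.2, 0.2, 0.1],
                  [0.7, 0.5, 0.9, 0.3],
                  [0.8, 0.1, 0.1, 0.2]])
labels = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [1, 0, 0, 0]])

sorted_pred_inds = np.argsort(-preds, axis=0, kind='stable')
sorted_target = np.take_along_axis(labels, sorted_pred_inds, axis=0)

# Explicit version: for each column j, pick the rows in ranked order.
manual = np.empty_like(labels)
for i in range(labels.shape[0]):
    for j in range(labels.shape[1]):
        manual[i, j] = labels[sorted_pred_inds[i, j], j]

print(np.array_equal(sorted_target, manual))  # True
```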

Entries of sorted_target equal to 1 become True, all others False

# get indexes when gt_true is positive
pos_inds = sorted_target == 1
# print(pos_inds)
# [[ True  True  True False]
#  [ True False False False]
#  [False  True False False]
#  [False False False False]]

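Count the cumulative number of True values along axis=0 (accumulating down the rows). total_pos, taken from the last row of tps, is therefore the total number of True values within each class.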

# Calculate cumulative tp case numbers
tps = np.cumsum(pos_inds, 0)
# print(tps)
# [[1 1 1 0]
#  [2 1 1 0]
#  [2 2 1 0]
#  [2 2 1 0]]
total_pos = tps[-1].copy()  # the last of tensor may change later
# print(total_pos)
# [2 2 1 0]

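pred_pos_nums: a NumPy array holding the cumulative number of predictions at each rank (1, 2, ..., N)

tps[np.logical_not(pos_inds)] = 0: np.logical_not(pos_inds) inverts every boolean in pos_inds, and the positions that are True after inversion are set to 0. After this step tps keeps a cumulative count only at the positions that are true positives.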

# Calculate cumulative tp&fp(pred_poss) case numbers
pred_pos_nums = np.arange(1, len(sorted_target) + 1)
# len(sorted_target): 4
# pred_pos_nums: [1 2 3 4]
tps[np.logical_not(pos_inds)] = 0
# print(np.logical_not(pos_inds))
# [[False False False  True]
#  [False  True  True  True]
#  [ True False  True  True]
#  [ True  True  True  True]]
# tps[np.logical_not(pos_inds)]: [0 0 0 0 0 0 0 0 0 0 0]
# print(tps)
# [[1 1 1 0]
#  [2 0 0 0]
#  [0 2 0 0]
#  [0 0 0 0]]

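np.divide: element-wise division with tps as the dividend; the output dtype is float32;
np.expand_dims(pred_pos_nums, -1): appends a trailing axis, turning the shape (4,) array into a (4, 1) column that serves as the divisor in np.divide;

In effect every row of tps is divided by the matching entry of np.expand_dims(pred_pos_nums, -1), i.e. the cumulative number of correctly predicted labels divided by the cumulative number of predictions at that rank.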

precision = np.divide(
    tps, np.expand_dims(pred_pos_nums, -1), dtype=np.float32)
# print(np.expand_dims(pred_pos_nums, -1))
# [[1]
#  [2]
#  [3]
#  [4]]
# print(precision)
# [[1.        1.        1.        0.       ]
#  [1.        0.        0.        0.       ]
#  [0.        0.6666667 0.        0.       ]
#  [0.        0.        0.        0.       ]]
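A short shape check of the divisor. It shows that np.expand_dims adds an axis (a plain transpose of a 1-D array would change nothing), and that broadcasting then divides row i of tps by i + 1. The tps values are copied from the masked array in this example.

```python
import numpy as np

pred_pos_nums = np.arange(1, 5)
column = np.expand_dims(pred_pos_nums, -1)
print(pred_pos_nums.shape, pred_pos_nums.T.shape, column.shape)  # (4,) (4,) (4, 1)

tps = np.array([[1, 1, 1, 0],
                [2, 0, 0, 0],
                [0, 2, 0, 0],
                [0, 0, 0, 0]])

# (4, 4) divided by (4, 1): the column is broadcast across all four classes.
precision = np.divide(tps, column, dtype=np.float32)
print(precision.shape)  # (4, 4)
```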

np.sum(precision, 0): sums each column;
np.inf: positive infinity;
np.clip(total_pos, 1, np.inf): restricts total_pos to the range from 1 to np.inf, i.e. every value below 1 becomes 1;

ap = np.divide(
        np.sum(precision, 0), np.clip(total_pos, 1, np.inf), dtype=np.float32)
# np.sum(precision, 0): [2.        1.6666667 1.        0.       ]
# total_pos: [2 2 1 0]
# np.clip(total_pos, 1, np.inf): [2. 2. 1. 1.]
# ap: [1.        0.8333334 1.        0.       ]
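The vectorized result can be checked against the formula AP = sum_n (R_n - R_{n-1}) P_n directly. The loop below is my own re-derivation for a single class (class 1, whose ranked targets are [1, 0, 1, 0] in this example), not code from mmeval.

```python
ranked_targets = [1, 0, 1, 0]   # class 1, ordered by descending score
total_pos = sum(ranked_targets)

ap = 0.0
tp = 0
prev_recall = 0.0
for rank, hit in enumerate(ranked_targets, start=1):
    tp += hit
    precision = tp / rank
    recall = tp / total_pos
    # The term is zero whenever recall did not move, i.e. at negatives.
    ap += (recall - prev_recall) * precision
    prev_recall = recall

print(round(ap, 4))  # 0.8333
```

Recall only increases at positive positions, each time by 1 / total_pos, which is why the vectorized code can zero out tps at the negatives and divide the summed precisions by total_pos.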

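Finally there is a branch: if the average argument is macro, the mean of ap multiplied by 100 is returned; if it is None, ap multiplied by 100 is returned directly, which gives the score of every class.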

if average == 'macro':
    return ap.mean() * 100.0
    # ap.mean() * 100.0: 70.83333730697632
else:
    return ap * 100
    # ap * 100: [100.        83.333336 100.         0.      ]

The value is then returned back up the call chain and passed to _format_metric_results for formatting, which gives the final result.

# When average = 'macro'
{'mAP': 70.833..}
# When average = None
{'AP_classwise': [100.0, 83.33, 100.00, 0.0]}
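Putting the steps together, the whole computation can be reproduced without installing mmeval. The function name `average_precision_np` is mine, and the body is re-typed from the steps traced in this post with two small deviations that I am flagging as assumptions: it uses a stable sort and float64 arithmetic, so the last digits differ slightly from the float32 output above.

```python
import numpy as np

def average_precision_np(preds, labels, average='macro'):
    """Re-typed sketch of the traced steps; not imported from mmeval."""
    order = np.argsort(-preds, axis=0, kind='stable')
    sorted_target = np.take_along_axis(labels, order, axis=0)
    pos_inds = sorted_target == 1
    tps = np.cumsum(pos_inds, 0)
    total_pos = tps[-1].copy()
    ranks = np.arange(1, len(sorted_target) + 1)
    tps[np.logical_not(pos_inds)] = 0          # keep counts only at positives
    precision = tps / ranks[:, None]
    ap = precision.sum(0) / np.clip(total_pos, 1, None)
    return ap.mean() * 100.0 if average == 'macro' else ap * 100.0

preds = np.array([[0.9, 0.8, 0.3, 0.2],
                  [0.1, 0.2, 0.2, 0.1],
                  [0.7, 0.5, 0.9, 0.3],
                  [0.8, 0.1, 0.1, 0.2]])
labels = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [1, 0, 0, 0]])

macro = average_precision_np(preds, labels)
classwise = average_precision_np(preds, labels, average=None)
print(round(macro, 4))   # 70.8333
print(classwise)
```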

tips

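@overload and @dispatch

These mark _compute_metric as overloaded: the method name is the same and the number of parameters is the same, but the parameter types differ.

The @overload decorator is really only an annotation/hint that the function accepts different combinations of argument types. On its own, every method decorated with @overload ends up overridden by one undecorated method. @dispatch is therefore added so that a call is routed to the _compute_metric implementation matching the types of the arguments passed in.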
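To see type-based routing in isolation, here is a small stand-in that uses the standard library's functools.singledispatch instead of the @dispatch decorator mmeval uses. It is a deliberately simpler technique: it looks only at the type of the first argument and cannot distinguish element types inside a list. The function name `compute` is made up for this sketch.

```python
from functools import singledispatch

import numpy as np

@singledispatch
def compute(preds):
    # Fallback for types with no registered implementation.
    raise TypeError(f'Unsupported type: {type(preds)}')

@compute.register
def _(preds: list):
    # Built-in input: convert, then let dispatch pick the NumPy version.
    return compute(np.array(preds))

@compute.register
def _(preds: np.ndarray):
    return float(preds.mean())

from_list = compute([1, 2, 3])
from_array = compute(np.array([1.0, 3.0]))
print(from_list, from_array)  # 2.0 2.0
```

The list version forwarding to the ndarray version mirrors how the first _compute_metric converts its inputs and then calls _compute_metric again.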

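The format_data function

Comes from the inherited multi_label module. Its purpose is to format differently shaped inputs (prediction scores, label-format data, and one-hot encodings) into one common output shape (N, num_classes)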

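Supplementary notes

torch.stack()

Joins a sequence of input tensors along a new dimension. All tensors in the sequence must have the same shape. In effect it packs several 2-D tensors into one 3-D tensor, several 3-D tensors into one 4-D tensor, and so on: it stacks by adding a new dimension. np.stack, which is what the NumPy code path traced in this post calls, behaves the same way.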
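A NumPy version of the same idea, since the code path traced in this post calls np.stack. The arrays are small made-up inputs for illustration.

```python
import numpy as np

# Two 1-D arrays become one 2-D array.
rows = [np.array([0.9, 0.8, 0.3, 0.2]),
        np.array([0.1, 0.2, 0.2, 0.1])]
stacked = np.stack(rows)
print(stacked.shape)  # (2, 4)

# Two 2-D arrays become one 3-D array; axis chooses where the new axis goes.
mats = [np.zeros((2, 3)), np.ones((2, 3))]
front = np.stack(mats)
back = np.stack(mats, axis=-1)
print(front.shape, back.shape)  # (2, 2, 3) (2, 3, 2)
```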

round(x, n)

Rounds the float x to n decimal places
round(80.34567, 2)
# 80.35

np.argsort(x, axis=n)

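Sorts the matrix x along the given axis in ascending order and returns the indices of the sorted elements (for descending order, negate every value of x)

For a 2-D array, axis=0 sorts within each column and axis=1 sorts within each row (by the same reasoning, axis=2 in a 3-D array presumably corresponds to depth)
# preds = [[0.9 0.8 0.3 0.2]
#          [0.1 0.2 0.2 0.1]
#          [0.7 0.5 0.9 0.3]
#          [0.8 0.1 0.1 0.2]]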
sorted_pred_inds = np.argsort(-preds, axis=0)
# print(sorted_pred_inds)
# [[0 0 2 2]
#  [3 2 0 0]
#  [2 1 1 3]
#  [1 3 3 1]]
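One detail the example glosses over is ties: rows 0 and 3 have the same score in the last column. As far as I know, the default sort kind does not promise any particular order among equal values; passing kind='stable' (an option of np.argsort, not something the mmeval code shown here uses) keeps tied entries in their original order.

```python
import numpy as np

scores = np.array([0.2, 0.1, 0.3, 0.2])   # last column of preds; rows 0 and 3 tie

# Descending rank with ties resolved by original position.
desc = np.argsort(-scores, kind='stable')
print(desc)  # [2 0 3 1]
```

In this example the tie is harmless because class 3 has no positive labels, but with tied scores on a class that does have positives, the order can change the AP.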

np.take_along_axis(x, y, axis=n)

Builds a new matrix from an index matrix. axis has the same meaning as in np.argsort: axis=0 picks elements within each column, axis=1 within each row.
# labels = [[1 1 0 0]
#           [0 1 0 0]
#           [0 0 1 0]
#           [1 0 0 0]]
sorted_target = np.take_along_axis(labels, sorted_pred_inds, axis=0)
# print(sorted_target)
# [[1 1 1 0]
#  [1 0 0 0]
#  [0 1 0 0]
#  [0 0 0 0]]

np.cumsum(x, n)

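Computes the cumulative sum along axis=n

axis=0: accumulate down the rows, i.e. each row = that row + the (already accumulated) row above

axis=1: accumulate across the columns, i.e. each column = that column + the (already accumulated) column to its left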
import numpy as np
a = np.asarray([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
b = np.cumsum(a, axis=0)
# b = [[ 1  2  3]
#      [ 5  7  9]
#      [12 15 18]]
c = np.cumsum(a, axis=1)
# c = [[ 1  3  6]
#      [ 4  9 15]
#      [ 7 15 24]]

np.clip(x, a, b)

Restricts x to the range from a to b (a closed interval, i.e. [a, b]): values below a are set to a, and values above b are set to b.
>>> import numpy as np
>>> a = [0,1,2,3,4,5,6,7,8,9]
>>> a = np.clip(a,1,8)
>>> a
array([1, 1, 2, 3, 4, 5, 6, 7, 8, 8])
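In _average_precision the clip has a specific job: a class with no positive samples has total_pos equal to 0, and clipping the denominator to at least 1 avoids a division by zero. The sketch below reuses the column sums and totals from the worked example in this post.

```python
import numpy as np

precision_sum = np.array([2.0, 5 / 3, 1.0, 0.0])
total_pos = np.array([2, 2, 1, 0])   # class 3 has no positive sample

# Lower bound 1, no effective upper bound.
safe_den = np.clip(total_pos, 1, np.inf)
ap = precision_sum / safe_den
print(safe_den)
print(ap)
```

For the empty class the numerator is already 0, so the clipped denominator simply yields an AP of 0 instead of nan.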