Chewing Through mmeval Metric Code: F1Score

Posted 2023-07-26 | Updated 2023-07-27 | Categories: Performance Metrics, mmeval

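This post works through how F1Score in mmeval is computed, as a hands-on way to learn the logic of the code. The worked examples were reconstructed by setting breakpoints and recording a full debug run. The reading of the code and the way I translate it into plain terms are my own, and some wording may not be rigorous, so discussion and corrections are welcome.

The goal: if you already understood it, you still will after reading; if you didn't, you might now. This post is a bit messy, but the code and the arithmetic are easier to follow than in the previous post on AveragePrecision.

Having gone through the code and the hand calculation once, here is a plain-language explanation of how the two modes compute their result. As before, this is my own summary (mainly so that I don't forget it later = =).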

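mode='macro': compute the F1 score of each class first, then take the mean of those scores.

tp: for each class, the number of samples whose predicted class matches the true class;

tp has shape 1xn, where n is num_classes, and each entry is the number of correct predictions for that class. For example, tp=[1,1,2,0,2] means num_classes=5, with 1 correct prediction for class 0, 1 for class 1, 2 for class 2, and so on.

fp: for each class, the number of predictions of that class that disagree with the label:

fp has shape 1xn, where n is num_classes, and each entry is the number of samples wrongly predicted as that class. For example, fp=[1,1,2,0,2] means num_classes=5, with 1 sample wrongly assigned to class 0, 1 wrongly assigned to class 1, 2 wrongly assigned to class 2, and so on.

fn: for each class, the number of labels of that class that the prediction got wrong:

fn has shape 1xn, where n is num_classes, and each entry is the number of samples of that class that were not predicted correctly. For example, fn=[1,2,3,0,0] means num_classes=5, with 1 sample of class 0 missed, 2 samples of class 1 missed, 3 samples of class 2 missed, and so on.

Applying the F1 formula then gives another 1xn array in which each value is the F1 score of one class. Its mean is the result for mode='macro'.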
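The plain-language description above can be sketched with ordinary Python loops. This is only an illustration of the counting idea, not the mmeval implementation (which uses NumPy broadcasting, shown later in this post); the helper name `f1_from_counts` is made up for the example.

```python
num_classes = 5
preds = [0, 1, 2]
labels = [0, 1, 4]

tp = [0] * num_classes
fp = [0] * num_classes
fn = [0] * num_classes
for p, g in zip(preds, labels):
    if p == g:
        tp[p] += 1   # correct prediction for class p
    else:
        fp[p] += 1   # class p was predicted, but the sample is not class p
        fn[g] += 1   # class g was the truth, but it was not predicted


def f1_from_counts(tp_c, fp_c, fn_c):
    """Hypothetical helper: F1 of one class, 0.0 when it has no true positives."""
    if tp_c == 0:
        return 0.0  # also avoids dividing by zero
    precision = tp_c / (tp_c + fp_c)
    recall = tp_c / (tp_c + fn_c)
    return 2 * precision * recall / (precision + recall)


per_class_f1 = [f1_from_counts(t, p, n) for t, p, n in zip(tp, fp, fn)]
macro_f1 = sum(per_class_f1) / num_classes
print(tp, fp, fn)  # [1, 1, 0, 0, 0] [0, 0, 1, 0, 0] [0, 0, 0, 0, 1]
print(macro_f1)    # 0.4
```

Classes that never appear still count in the mean with an F1 of 0, which is why two perfect classes out of five give 0.4.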

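mode='micro': compute tp, fp and fn as totals over all classes.

tp: the number of samples whose predicted class matches the true class

fp: counted from the prediction side, the number of predictions that disagree with the label

fn: counted from the label side, the number of labels that disagree with the prediction

Taken literally, fp and fn should be equal: each wrong prediction is one mismatch between prediction and label, seen once from each side, so the number of errors is the same. And they are indeed equal. This reasoning only applies to the totals used by mode='micro' (and only when every class is counted); per class, fp and fn generally differ.

The result then follows directly from the F1 formula.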
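A minimal sketch of the micro case on the same three samples, assuming all classes are counted. It is a hand calculation in plain Python, not the library code.

```python
preds = [0, 1, 2]
labels = [0, 1, 4]

# Every sample is either a hit or a miss.
tp = sum(1 for p, g in zip(preds, labels) if p == g)
misses = sum(1 for p, g in zip(preds, labels) if p != g)
# A miss counts once as an FP (for the predicted class) and once as an FN
# (for the true class), so the two totals coincide when all classes count.
fp = fn = misses

precision = tp / (tp + fp)
recall = tp / (tp + fn)
micro_f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn)          # 2 1 1
print(round(micro_f1, 4))  # 0.6667
```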

preds = np.asarray([[0, 1, 2]])
labels = np.asarray([[0, 1, 4]])
mode='macro'
    tp:[1. 1. 0. 0. 0.]
    fp:[0. 0. 1. 0. 0.]
    fn:[0. 0. 0. 0. 1.]
    precision: [1. 1. 0. 0. 0.]
    recall: [1. 1. 0. 0. 0.]
    'macro_f1': 0.4
mode='micro'
    tp:2.0
    fp:1.0
    fn:1.0
    precision: 0.6666666666666666
    recall: 0.6666666666666666
    'micro_f1': 0.6666666666666666

F1-Score

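I wanted to start by printing the output of the example from the official docs, and immediately hit a problem. Running the input given in the official code block raised an error in VSCode. I hadn't read the F1Score source yet, but I guessed the form of the input was wrong, so I made the modification shown below, and sure enough it ran. Both versions have the same Python type (numpy.ndarray), so the full explanation has to wait until the F1Score source has been read. One hint is already in the error message: the signature lists numpy.int64 elements, which means the 1-D array was iterated into scalars instead of arrays. (torch.Tensor input needs the same treatment before it runs.)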

from mmeval import F1Score
import numpy as np
import torch

f1 = F1Score(num_classes=5, mode=['macro', 'micro'])

labels = np.asarray([0, 1, 4])
preds = np.asarray([0, 1, 2])
print("labels'type: {},   preds'type: {}".format(type(labels),type(preds)))
# labels'type: <class 'numpy.ndarray'>,   preds'type: <class 'numpy.ndarray'>
result = f1(preds, labels)
print(result)
# Error output:
# plum.function.NotFoundLookupError: For function "_compute_tp_fp_fn" of 
# mmeval.metrics.f1_score.F1Score, signature Signature(mmeval.metrics.f1_score.F1Score, 
# Tuple[numpy.int64, numpy.int64, numpy.int64], 
# Tuple[numpy.int64, numpy.int64, numpy.int64]) could not be resolved.

labels = np.asarray([[0, 1, 4]])
preds = np.asarray([[0, 1, 2]])
print("labels'type: {},   preds'type: {}".format(type(labels),type(preds)))
# labels'type: <class 'numpy.ndarray'>,   preds'type: <class 'numpy.ndarray'>
result = f1(preds, labels)
print(result)
# {'macro_f1': 0.4, 'micro_f1': 0.6666666666666666}
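A small NumPy-only check may make the difference between the two inputs easier to see. It relies only on how iteration over arrays works: the `add` method shown later loops over its inputs with `zip`, so what matters is what each iteration step yields. The variable names here are made up for the illustration.

```python
import numpy as np

flat = np.asarray([0, 1, 2])       # shape (3,): iterating yields scalars
batched = np.asarray([[0, 1, 2]])  # shape (1, 3): iterating yields 1-D arrays

flat_is_scalar = all(isinstance(item, np.integer) for item in flat)
batched_is_array = all(isinstance(item, np.ndarray) for item in batched)

print(flat.shape, flat_is_scalar)        # (3,) True
print(batched.shape, batched_is_array)   # (1, 3) True
```

With the flat input, each stored prediction is a NumPy integer scalar, which matches the `numpy.int64` entries in the error message above; with the extra outer dimension, each stored prediction is a whole array.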

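Parameters

num_classes: the number of labels (classes);
mode: a single str or a list of str;

macro: compute the metric for each label and take their unweighted mean

micro: compute the metric globally by counting the total true positives, false negatives and false positives

If mode is a list, each mode is computed separately. The default is mode='micro'.

cared_classes: the label indices that take part in the metric computation
ignored_classes: the set of label indices that are ignored when computing the metric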

class F1Score(BaseMetric):
    """Compute F1 scores.

    Args:
        num_classes (int): Number of labels.
        mode (str or list[str]): There are 2 options:

            - 'micro': Calculate metrics globally by counting the total true
              positives, false negatives and false positives.
            - 'macro': Calculate metrics for each label, and find their
              unweighted mean.

            If mode is a list, then metrics in mode will be calculated
            separately. Defaults to 'micro'.
        cared_classes (list[int]): The indices of the labels participated in
            the metric computing. If both ``cared_classes`` and
            ``ignored_classes`` are empty, all classes will be taken into
            account. Defaults to []. Note: ``cared_classes`` and
            ``ignored_classes`` cannot be specified together.
        ignored_classes (list[int]): The index set of labels that are ignored
            when computing metrics. If both ``cared_classes`` and
            ``ignored_classes`` are empty, all classes will be taken into
            account. Defaults to []. Note: ``cared_classes`` and
            ``ignored_classes`` cannot be specified together.
        **kwargs: Keyword arguments passed to :class:`BaseMetric`.

    Warning:
        Only non-negative integer labels are involved in computing. All
        negative ground truth labels will be ignored.

Here is an example that illustrates what the F1Score docstring says about mode, ignored_classes and so on. The finer details are debugged later in this post.

preds = np.asarray([[0, 1, 2]])
labels = np.asarray([[0, 1, 4]])
f1 = F1Score(num_classes=5, mode=['macro', 'micro'])
f1_ignore = F1Score(num_classes=5, mode=['macro', 'micro'], ignored_classes=[2])
f1_cared = F1Score(num_classes=5, mode=['macro', 'micro'], cared_classes=[2])
print(f1(preds, labels))
# {'macro_f1': 0.4, 'micro_f1': 0.6666666666666666}
print(f1_ignore(preds, labels))
# {'macro_f1': 0.5, 'micro_f1': 0.8}
print(f1_cared(preds, labels))
# {'macro_f1': 0.0, 'micro_f1': 0.0}
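To see where 0.5 / 0.8 and 0.0 / 0.0 come from, the numbers can be reproduced with plain NumPy. The sketch below re-implements the formulas quoted later in this post and restricts them to a chosen set of labels; `f1_scores` is a name invented for this illustration, not part of mmeval.

```python
import numpy as np


def f1_scores(preds, gts, cared_labels):
    """Hypothetical helper: (macro_f1, micro_f1) restricted to cared_labels."""
    preds = np.asarray(preds).flatten()
    gts = np.asarray(gts).flatten()
    cared = np.asarray(cared_labels, dtype=np.int64)

    hits = np.equal(preds, gts)[None, :]
    preds_per_label = np.equal(cared[:, None], preds[None, :])
    gts_per_label = np.equal(cared[:, None], gts[None, :])
    tp = (hits * preds_per_label).astype(float)
    fp = ((1 - hits) * preds_per_label).astype(float)
    fn = ((1 - hits) * gts_per_label).astype(float)

    def f1(tp, fp, fn):
        precision = tp / (tp + fp).clip(min=1e-8)
        recall = tp / (tp + fn).clip(min=1e-8)
        f1 = 2 * precision * recall / (precision + recall).clip(min=1e-8)
        return float(f1.mean())

    macro = f1(tp.sum(-1), fp.sum(-1), fn.sum(-1))
    micro = f1(tp.sum(), fp.sum(), fn.sum())
    return macro, micro


preds, labels = [[0, 1, 2]], [[0, 1, 4]]
all_classes = f1_scores(preds, labels, [0, 1, 2, 3, 4])
ignore_2 = f1_scores(preds, labels, [0, 1, 3, 4])  # ignored_classes=[2]
only_2 = f1_scores(preds, labels, [2])             # cared_classes=[2]
print([round(v, 4) for v in all_classes])  # [0.4, 0.6667]
print([round(v, 4) for v in ignore_2])     # [0.5, 0.8]
print([round(v, 4) for v in only_2])       # [0.0, 0.0]
```

Ignoring class 2 removes the only false positive (the wrong prediction of class 2) but keeps the false negative for class 4, so micro precision becomes 1 while recall stays at 2/3, which gives 0.8.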

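Initialization function

It first checks the types of num_classes, cared_classes, ignored_classes and mode. In addition, cared_classes and ignored_classes cannot be passed together, so an assertion fails when both have length greater than 0.

isinstance(mode, str): checks whether mode is a single string and, if so, wraps it in a list (if it is not a string it is already a list, so nothing needs to be done); then self.mode = mode (a list).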

mode = 'macro'
if isinstance(mode, str):
    print([mode])
    # ['macro']
print("mode'type : {}  ---->  [mode]'type : {}".format(type(mode),type([mode])))
# mode'type : <class 'str'>  ---->  [mode]'type : <class 'list'>

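If ignored_classes or cared_classes is passed, every value in it must lie in [0, num_classes).

If cared_classes is passed, self.cared_labels = sorted(cared_classes) directly; if ignored_classes is passed, the set ignored_classes is subtracted from the set range(num_classes) and the result is sorted in ascending order.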
num_classes = 5
ignored_classes = [2]
cared_labels = sorted(set(range(num_classes)) - set(ignored_classes))
print(set(range(num_classes)))
# {0, 1, 2, 3, 4}
print(set(ignored_classes))
# {2}
print(cared_labels)
# [0, 1, 3, 4] type : <class 'list'>

If neither ignored_classes nor cared_classes is passed, all classes are included in the computation.

cared_labels = list(range(num_classes))
cared_labels = np.array(cared_labels, dtype=np.int64)
print(cared_labels)
# array([0, 1, 2, 3, 4])
print(type(cared_labels))
# <class 'numpy.ndarray'>

Tips

One thing to note: the sorted(...) calls in the cared_classes and ignored_classes branches produce a list, but self.cared_labels has to end up as a numpy.ndarray in every case. The np.array(...) conversion therefore belongs after the if/elif/else, not inside the else branch. Otherwise _compute_tp_fp_fn, which indexes self.cared_labels[:, None], would fail on a plain list, and the ignored_classes example above would not run. For the same reason, the mode check and self.num_classes must run unconditionally.

def __init__(self,
             num_classes: int,
             mode: Union[str, Sequence[str]] = 'micro',
             cared_classes: Sequence[int] = [],
             ignored_classes: Sequence[int] = [],
             **kwargs) -> None:
    super().__init__(**kwargs)

    assert isinstance(num_classes, int)
    assert isinstance(cared_classes, (list, tuple))
    assert isinstance(ignored_classes, (list, tuple))
    assert isinstance(mode, (list, str))
    assert not (len(cared_classes) > 0 and len(ignored_classes) > 0), \
        'cared_classes and ignored_classes cannot be both non-empty'

    if isinstance(mode, str):
        mode = [mode]
    assert set(mode).issubset({'micro', 'macro'})
    self.mode = mode

    if len(cared_classes) > 0:
        assert min(cared_classes) >= 0 and \
            max(cared_classes) < num_classes, \
            'cared_classes must be a subset of [0, num_classes)'
        self.cared_labels = sorted(cared_classes)
    elif len(ignored_classes) > 0:
        assert min(ignored_classes) >= 0 and \
            max(ignored_classes) < num_classes, \
            'ignored_classes must be a subset of [0, num_classes)'
        self.cared_labels = sorted(
            set(range(num_classes)) - set(ignored_classes))
    else:
        self.cared_labels = list(range(num_classes))
    # Applies to all three branches: a list is converted to an ndarray here.
    self.cared_labels = np.array(self.cared_labels, dtype=np.int64)
    self.num_classes = num_classes

The add function

def add(self, predictions: Sequence[Union[Sequence[int], np.ndarray]], labels: Sequence[Union[Sequence[int], np.ndarray]]) -> None:  # type: ignore # yapf: disable # noqa: E501
    """Process one batch of data and predictions.

        Calculate the following 2 stuff from the inputs and store them in
        ``self._results``:

        - prediction: prediction labels.
        - label: ground truth labels.

        Args:
            predictions (Sequence[Sequence[int] or np.ndarray]): A batch
                of sequences of non-negative integer labels.
            labels (Sequence[Sequence[int] or np.ndarray]): A batch of
                sequences of non-negative integer labels.
        """
    for prediction, label in zip(predictions, labels):
        self._results.append((prediction, label))

Example input values

# num_classes=5, mode=['macro', 'micro']
preds = np.asarray([[0, 1, 2]])
labels = np.asarray([[0, 1, 4]])
_results = []  # stands in for self._results

for prediction, label in zip(preds, labels):
    print(prediction,label)
    # [0 1 2] [0 1 4]
    _results.append((prediction, label))

print(_results)
# [(array([0, 1, 2]), array([0, 1, 4]))]

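This function matters when the images are split into several groups (batches): each call appends that group's (prediction, label) pairs to self._results, and nothing is computed yet. In the example above there is a single group of 3 predicted images, so the F1-Score computed afterwards is for that one group.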

compute_metric

def compute_metric(
        self, results: Sequence[Tuple[np.ndarray, np.ndarray]]) -> Dict:
    """Compute the metrics from processed results.

    Args:
        results (list[(ndarray, ndarray)]): The processed results of each
            batch.

    Returns:
        dict[str, float]: The f1 scores. The keys are the names of the
        metrics, and the values are corresponding results. Possible
        keys are 'micro_f1' and 'macro_f1'.
    """

    preds, gts = zip(*results)

    tp, fp, fn = self._compute_tp_fp_fn(preds, gts)

    result = {}
    if 'macro' in self.mode:
        result['macro_f1'] = self._compute_f1(
            tp.sum(-1), fp.sum(-1), fn.sum(-1))
    if 'micro' in self.mode:
        result['micro_f1'] = self._compute_f1(tp.sum(), fp.sum(), fn.sum())

    return result

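The results stored by add are passed to compute_metric. (The add function itself contains no call to compute_metric, yet the debugger does land here. The call is presumably made by the base class BaseMetric when the metric object is invoked as f1(preds, labels), which is outside the code shown in this post.)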
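One way such a jump can happen is sketched below. `TinyMetric` is a hypothetical stand-in written for this post, not mmeval's BaseMetric, and its placeholder metric is simple accuracy; the only point is that a `__call__` method can chain `add` and `compute_metric` so that neither needs to call the other.

```python
class TinyMetric:
    """Hypothetical stand-in for a metric base class (not mmeval's code)."""

    def __init__(self):
        self._results = []

    def add(self, predictions, labels):
        # Store one (prediction, label) pair per batch item.
        for prediction, label in zip(predictions, labels):
            self._results.append((prediction, label))

    def compute_metric(self, results):
        # Placeholder metric: fraction of exactly matching elements.
        correct = total = 0
        for prediction, label in results:
            for p, g in zip(prediction, label):
                correct += int(p == g)
                total += 1
        return {'accuracy': correct / total}

    def __call__(self, predictions, labels):
        # Calling the instance stores the batch, then computes right away.
        self._results = []
        self.add(predictions, labels)
        return self.compute_metric(self._results)


metric = TinyMetric()
result = metric([[0, 1, 2]], [[0, 1, 4]])
print(result)  # {'accuracy': 0.6666666666666666}
```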

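The input is a list. As I understand it, adding * strips one layer off the input results (the technical term is argument unpacking): each (prediction, label) tuple in the list becomes a separate argument to zip, which regroups them into one tuple of all predictions and one tuple of all labels. So list ----> tuple.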

# results = [(array([0, 1, 2]), array([0, 1, 4]))]
# zip(*results) is the same as zip((array([0, 1, 2]), array([0, 1, 4])))

# results' type : <class 'list'>  ---->  type of each unpacked item : <class 'tuple'>
preds, gts = zip(*results)
# preds : (array([0, 1, 2]),)  gts : (array([0, 1, 4]),)
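With a single batch the effect of `zip(*results)` is hard to see, so here is the same line on two batches. The second batch uses made-up values purely for illustration.

```python
import numpy as np

results = [
    (np.array([0, 1, 2]), np.array([0, 1, 4])),  # batch 1: (prediction, label)
    (np.array([3, 3]), np.array([3, 0])),        # batch 2: made-up values
]

# Unpack the list into separate arguments, then regroup by position.
preds, gts = zip(*results)
print(type(preds).__name__, len(preds))  # tuple 2
print(preds[1], gts[1])                  # [3 3] [3 0]
```

`preds` collects the first element of every pair and `gts` the second, which is the shape `_compute_tp_fp_fn` expects before it concatenates them.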

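Next, preds and gts are passed to _compute_tp_fp_fn to obtain tp, fp and fn.

Depending on the mode that was passed in, the result dict gets the keys 'macro_f1' and 'micro_f1' (both if both modes were requested, otherwise only the entry for the requested one).

Note: the arguments passed to _compute_f1 differ by mode.

mode='macro': tp, fp and fn are per-class totals;

mode='micro': tp, fp and fn are totals over all classes.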
# mode = 'macro'
tp.sum(-1)  # array([1., 1., 0., 0., 0.])
fp.sum(-1)  # array([0., 0., 1., 0., 0.])
fn.sum(-1)  # array([0., 0., 0., 0., 1.])
# mode = 'micro'
tp.sum()  # 2.0
fp.sum()  # 1.0
fn.sum()  # 1.0

Finally, _compute_f1 computes the F1 score and the result is returned.

_compute_tp_fp_fn

@dispatch
def _compute_tp_fp_fn(self, predictions: Sequence[Union[np.ndarray, int]],
                          labels: Sequence[Union[np.ndarray, int]]) -> tuple:
    """Compute tp, fp and fn from predictions and labels."""
    preds = np.concatenate(predictions, axis=0).astype(np.int64).flatten()
    gts = np.concatenate(labels, axis=0).astype(np.int64).flatten()

    assert preds.max() < self.num_classes  # type: ignore
    assert gts.max() < self.num_classes  # type: ignore

    hits = np.equal(preds, gts)[None, :]
    preds_per_label = np.equal(self.cared_labels[:, None], preds[None, :])  # type: ignore # yapf: disable # noqa: E501
    gts_per_label = np.equal(self.cared_labels[:, None], gts[None, :])  # type: ignore # yapf: disable # noqa: E501

    tp = (hits * preds_per_label).astype(float)
    fp = ((1 - hits) * preds_per_label).astype(float)
    fn = ((1 - hits) * gts_per_label).astype(float)
    return tp, fp, fn

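The preds and gts from preds, gts = zip(*results) are passed in, so the input arguments are as follows (type: tuple):

predictions = (array([0, 1, 2]),)
labels = (array([0, 1, 4]),)

np.concatenate joins the arrays in predictions (only one in this example) along axis=0; labels is handled the same way.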

preds = np.concatenate(predictions, axis=0).astype(np.int64).flatten()
print(np.concatenate(predictions, axis=0))
# [0 1 2]
print(type(np.concatenate(predictions, axis=0)))
# <class 'numpy.ndarray'>
print(preds)
# [0 1 2]

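Next, it checks whether any value in preds or gts is outside the valid class range, and the assertion fails if so (there are only these classes, so a prediction outside them means something is wrong).

Then preds and gts are compared for equality (that is, whether each prediction matches the ground truth).

[None, :]: adds a dimension

What happens here is that the predictions and their ground-truth labels are each marked against the num_classes=5 classes, producing two 5x3 matrices (calling them matrices for ease of understanding).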

hits = np.equal(preds, gts)[None, :]
# preds: [0, 1, 2]
# np.equal(preds, gts): [ True  True False]
# hits: [[ True  True False]]
preds_per_label = np.equal(self.cared_labels[:, None], preds[None, :])
# self.cared_labels: array([0, 1, 2, 3, 4])
# self.cared_labels[:, None]: [[0]
#                              [1]
#                              [2]
#                              [3]
#                              [4]]
# preds[None, :]: [[0 1 2]]
# preds_per_label: [[ True False False]
#                   [False  True False]
#                   [False False  True]
#                   [False False False]
#                   [False False False]]
gts_per_label = np.equal(self.cared_labels[:, None], gts[None, :])
# gts: [0, 1, 4]
# gts[None, :]: [[0 1 4]]
# gts_per_label: [[ True False False]
#                 [False  True False]
#                 [False False False]
#                 [False False False]
#                 [False False  True]]

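Now tp, fp and fn are computed (the descriptions below are how I make sense of it; your reading may differ).

hits: [[1,1,0]]. Of the three images, the first two are classified correctly and the third is wrong;
preds_per_label: the result of comparing the predicted values against the column of num_classes class indices (an element-wise equality test, not a product). The entries equal to 1 in the resulting 5x3 matrix mark the predicted class of each image;
gts_per_label: the result of comparing the ground-truth values of the same images against the column of class indices. The entries equal to 1 in the resulting 5x3 matrix mark the true class of each image.

hits * preds_per_label: the product keeps only the predictions that are correct. hits already says which images are classified correctly, so multiplying it by the 5x3 prediction matrix gives a 5x3 tp matrix in which each row is one of the 5 classes (0 to 4) and each column says whether that image's prediction is correct (1 for correct, 0 for wrong).

(1 - hits) * preds_per_label: in hits, a correct classification is 1 and a wrong one is 0. (1 - hits) flips this, so correct becomes 0 and wrong becomes 1, giving [0,0,1]. Multiplying by preds_per_label picks out, for each wrong prediction, the class that was predicted, which is fp (a false positive means the class was chosen as positive, but the classification is wrong). For instance, if you predict 100 samples as some class and only 80 of them really belong to it, there are 20 fp.

(1 - hits) * gts_per_label: multiplying (1 - hits) by gts_per_label follows the same logic as fp. A false negative means the sample was treated as negative for its class, but the classification is wrong. For instance, if a class has 100 samples and you classify 80 of them correctly, there are 20 fn.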

tp = (hits * preds_per_label).astype(float)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 0.]
#  [0. 0. 0.]
#  [0. 0. 0.]]
fp = ((1 - hits) * preds_per_label).astype(float)
# [[0. 0. 0.]
#  [0. 0. 0.]
#  [0. 0. 1.]
#  [0. 0. 0.]
#  [0. 0. 0.]]
fn = ((1 - hits) * gts_per_label).astype(float)
# [[0. 0. 0.]
#  [0. 0. 0.]
#  [0. 0. 0.]
#  [0. 0. 0.]
#  [0. 0. 1.]]
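As a cross-check, the same per-class counts can be read off a confusion matrix. This is a different technique from the broadcasting code above, swapped in only to confirm the numbers: rows are true classes, columns are predicted classes, the diagonal holds tp, and the column and row sums minus the diagonal give fp and fn.

```python
import numpy as np

num_classes = 5
preds = np.array([0, 1, 2])
gts = np.array([0, 1, 4])

# cm[g, p] counts samples whose true class is g and predicted class is p.
cm = np.zeros((num_classes, num_classes), dtype=np.int64)
np.add.at(cm, (gts, preds), 1)

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp  # predicted as the class, minus the correct ones
fn = cm.sum(axis=1) - tp  # truly the class, minus the correct ones
print(tp)  # [1 1 0 0 0]
print(fp)  # [0 0 1 0 0]
print(fn)  # [0 0 0 0 1]
```

These match the row sums of the tp, fp and fn matrices listed above.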

The three matrices are then returned to compute_metric.

_compute_f1

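In compute_metric, the arguments passed to _compute_f1 are prepared as follows: sum(-1) sums over the last dimension, which in this example means adding up the entries across the columns of each row, giving one total per class.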

tp.sum(-1), fp.sum(-1), fn.sum(-1)
# tp.sum(-1)： array([1., 1., 0., 0., 0.])
# fp.sum(-1)： array([0., 0., 1., 0., 0.])
# fn.sum(-1)： array([0., 0., 0., 0., 1.])

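np.clip(a, min, max) shows up again: values of a below min are set to min and values above max are set to max. It is used here to prevent a zero denominator. Precision, recall and the F1-Score are then computed from their formulas.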

$$\text{precision} = \frac{TP}{TP+FP}$$

$$\text{recall} = \frac{TP}{TP+FN}$$

$$\text{F1-Score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision}+\text{recall}}$$
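The effect of the clipping step can be seen on a class that never appears, where both tp and fp are 0. The two-element arrays below are made up for the illustration.

```python
import numpy as np

tp = np.array([1., 0.])
fp = np.array([0., 0.])

# Without clipping, the second entry is 0/0, which NumPy evaluates to nan.
with np.errstate(invalid='ignore'):
    naive = tp / (tp + fp)

# With clipping, the denominator becomes 1e-8 and 0 / 1e-8 is simply 0.
safe = tp / (tp + fp).clip(min=1e-8)

print(np.isnan(naive[1]))  # True
print(safe)                # [1. 0.]
```

Because the numerator is 0 whenever the denominator had to be clipped, the tiny constant never distorts a non-zero score; it only turns nan into 0.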

# PS: all of these are numpy.ndarray
precision = tp / (tp + fp).clip(min=1e-8)
# tp : array([1., 1., 0., 0., 0.])
# tp + fp : array([1., 1., 1., 0., 0.])
# precision = array([1., 1., 0., 0., 0.])
recall = tp / (tp + fn).clip(min=1e-8)
# tp + fn : array([1., 1., 0., 0., 1.])
# recall = array([1., 1., 0., 0., 0.])
f1 = 2 * precision * recall / (precision + recall).clip(min=1e-8)
# precision * recall : [1. 1. 0. 0. 0.]
# precision + recall : [2. 2. 0. 0. 0.]
# f1 = [1. 1. 0. 0. 0.]

Finally, take the mean of f1: (1+1+0+0+0) / (num_classes=5) = 2/5 = 0.4 (mode = macro)

# tp：array([1., 1., 0., 0., 0.])
# fp：array([0., 0., 1., 0., 0.])
# fn：array([0., 0., 0., 0., 1.])
def _compute_f1(self, tp: np.ndarray, fp: np.ndarray,
                    fn: np.ndarray) -> float:
    """Compute the F1-score based on the true positives, false positives
        and false negatives.

        Args:
            tp (np.ndarray): The true positives.
            fp (np.ndarray): The false positives.
            fn (np.ndarray): The false negatives.

        Returns:
            float: The F1-score.
        """
    precision = tp / (tp + fp).clip(min=1e-8)
    recall = tp / (tp + fn).clip(min=1e-8)
    f1 = 2 * precision * recall / (precision + recall).clip(min=1e-8)
    return float(f1.mean())

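When mode='micro', the computation is the same but the arguments differ (repeating the explanation from earlier):

Note: the arguments passed to _compute_f1 differ by mode.

mode='macro': tp, fp and fn are per-class totals;

mode='micro': tp, fp and fn are totals over all classes.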
# mode = 'macro'
tp.sum(-1)  # array([1., 1., 0., 0., 0.])
fp.sum(-1)  # array([0., 0., 1., 0., 0.])
fn.sum(-1)  # array([0., 0., 0., 0., 1.])
# mode = 'micro'
tp.sum()  # 2.0
fp.sum()  # 1.0
fn.sum()  # 1.0
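For completeness, the micro case can be written out by hand with the totals tp = 2, fp = 1 and fn = 1 from the example. No clipping is involved because none of the denominators is zero.

$$\text{precision} = \frac{2}{2+1} = \frac{2}{3}, \qquad \text{recall} = \frac{2}{2+1} = \frac{2}{3}$$

$$\text{F1-Score} = 2 \cdot \frac{\frac{2}{3} \cdot \frac{2}{3}}{\frac{2}{3}+\frac{2}{3}} = 2 \cdot \frac{4/9}{4/3} = \frac{2}{3} \approx 0.667$$

This agrees with the 'micro_f1' value of 0.6666666666666666 printed at the start of the post.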

Supplementary notes

np.concatenate((x1,x2,…), axis=n)

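Joins several arrays along the dimension axis=n. The arrays x1, x2, … go in together as one sequence such as a tuple or list; when I tried passing an ndarray directly instead, I got an error.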

>>> a = np.array([1,2,3])
>>> b = np.array([10,11,12])
>>> np.concatenate((a,b),axis=0)
# array([ 1,  2,  3, 10, 11, 12])
>>> a = np.array([[1,2,3],[2,4,6]])
>>> b = np.array([[5,6,7],[7,8,9]])
>>> np.concatenate((a,b),axis=1)
# array([[1, 2, 3, 5, 6, 7],
#        [2, 4, 6, 7, 8, 9]])
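A short sketch of both situations, using made-up batches: a tuple of 1-D arrays of different lengths concatenates fine, while a bare 1-D array is treated as a sequence of zero-dimensional items and rejected. The second half is my guess at the kind of error mentioned above, not a reproduction of it.

```python
import numpy as np

# A tuple of 1-D arrays with different lengths joins into one 1-D array.
batches = (np.array([0, 1, 2]), np.array([3, 3]))
merged = np.concatenate(batches, axis=0)
print(merged)  # [0 1 2 3 3]

# A bare 1-D array is iterated into scalars, which cannot be concatenated.
try:
    np.concatenate(np.array([1, 2, 3]), axis=0)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```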

[None, :]和[:,None]

Both are used to add a dimension

>>> a = np.array([[1,2,3],[2,3,4]])
>>> b = np.array([[5,6,7],[8,9,10]])
>>> np.equal(a,b)
array([[False, False, False],
       [False, False, False]])
>>> np.equal(a,b)[None,:]
array([[[False, False, False],
        [False, False, False]]])
>>> np.equal(a,b)[:,None]
array([[[False, False, False]],

       [[False, False, False]]])
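On 1-D arrays the two forms are easier to tell apart by shape, and this is also how they are used in `_compute_tp_fp_fn`: a column of class indices compared with a row of predictions broadcasts to one row per class and one column per sample.

```python
import numpy as np

labels = np.arange(5)        # [0 1 2 3 4], shape (5,)
preds = np.array([0, 1, 2])  # shape (3,)

col = labels[:, None]  # new axis at the end   -> shape (5, 1)
row = preds[None, :]   # new axis at the front -> shape (1, 3)
mask = np.equal(col, row)  # broadcasts to shape (5, 3)

print(col.shape, row.shape, mask.shape)  # (5, 1) (1, 3) (5, 3)
print(mask.sum(-1))                      # [1 1 1 0 0]
```

The last line counts, per class, how many samples were predicted as that class.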

sum()

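The sum function. With no argument it sums everything and returns a single value. The axis passed in must be smaller than the number of dimensions, because the sum runs along that dimension. In the example below, passing -1 selects the last dimension.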

>>> tp = np.array([[[0,1,2],
                 [3,4,5]],
                [[6,7,8],
                 [9,10,11]]])
>>> tp.sum()
66
>>> tp.sum(0)
array([[ 6,  8, 10],
    [12, 14, 16]])
>>> tp.sum(1)
array([[ 3,  5,  7],
    [15, 17, 19]])
>>> tp.sum(2)
array([[ 3, 12],
    [21, 30]])
>>> tp.sum(-1)
array([[ 3, 12],
    [21, 30]])

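Terminology in numpy (tensor or scalar?)

numpy.ndarray (created with numpy.array) is the most common data structure in numpy. It represents a multi-dimensional array, which mathematically is a tensor. Depending on the number of dimensions, it goes by different names:

dimension > 2: general tensor

dimension == 2: matrix

dimension == 1: vector

dimension == 0: scalar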
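The number of dimensions of any array can be checked with its `ndim` attribute. The arrays below are arbitrary examples.

```python
import numpy as np

scalar = np.array(5)                 # 0 dimensions
vector = np.array([1, 2, 3])         # 1 dimension
matrix = np.array([[1, 2], [3, 4]])  # 2 dimensions
tensor = np.zeros((2, 2, 3))         # 3 dimensions

ndims = [a.ndim for a in (scalar, vector, matrix, tensor)]
print(ndims)  # [0, 1, 2, 3]
```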