Evaluating Classifier Performance in Machine Learning

Confusion Matrix

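In machine learning and statistical classification, a confusion matrix is a visualization tool used mainly in supervised learning; in unsupervised learning it is usually called a matching matrix. Each column of the matrix represents the instances predicted as a class, while each row represents the instances that actually belong to a class. The name comes from the fact that the matrix makes it easy to see whether the model is confusing two classes (that is, mistaking one for the other).¹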

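For a binary classification problem, the confusion matrix is a 2×2 table whose rows are the true values and whose columns are the predicted values:

True \ Predicted	0	1
0	Correctly predicted negative (TN)	Incorrectly predicted positive (FP)
1	Incorrectly predicted negative (FN)	Correctly predicted positive (TP)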

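Take cancer diagnosis as an example. In the table above, 0 means the patient does not have cancer and 1 means the patient has cancer; rows are the true condition and columns are the predicted condition. TN then counts patients who do not have cancer and were correctly predicted not to have it, while FN counts patients who do have cancer but were predicted not to have it. FP and TP are read the same way.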
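To make the four cells concrete, here is a minimal pure-Python sketch that counts them for a handful of samples. The labels and the helper name `confusion_counts` are made up for illustration and are not part of the dataset used later in this post.

```python
# Toy labels, invented for illustration: 1 = has cancer, 0 = does not.
y_true    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_predict = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

def confusion_counts(y_true, y_predict):
    """Return (TN, FP, FN, TP) for binary labels encoded as 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_predict):
        if truth == 0 and pred == 0:
            tn += 1  # healthy, predicted healthy
        elif truth == 0 and pred == 1:
            fp += 1  # healthy, predicted sick
        elif truth == 1 and pred == 0:
            fn += 1  # sick, predicted healthy
        else:
            tp += 1  # sick, predicted sick
    return tn, fp, fn, tp

tn, fp, fn, tp = confusion_counts(y_true, y_predict)
# Rows are the true class, columns are the predicted class.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # [[5, 1], [1, 3]]
```

The layout matches the table above: the first row holds the truly negative samples and the second row the truly positive ones.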

Precision and Recall

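Precision: among all samples predicted as 1, the proportion that are actually 1. In the cancer example, it is the proportion of predicted cancer cases that really are cancer. The formula is: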

$precision = \frac{TP}{TP+FP}$

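Recall: among all samples whose true value is 1, the proportion that are predicted correctly. In the cancer example, it is the proportion of actual cancer patients that the model identifies. The formula is: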

$recall = \frac{TP}{TP+FN}$
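As a quick worked example of the two formulas, the sketch below computes precision and recall from raw counts. The counts and the helper name `precision_recall` are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def precision_recall(tp, fp, fn):
    """Compute (precision, recall) from counts; use 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(precision)  # 8 / 10
print(recall)     # 8 / 12, about 0.667
```

Of the 10 samples predicted positive, 8 are correct (precision), and of the 12 samples that are really positive, 8 are found (recall).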

The code below implements these metrics from scratch:

import numpy as np
from sklearn import datasets

# Load the handwritten digits dataset
digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()

# Turn the task into binary classification: digit 9 is class 1, everything else is class 0
y[digits.target == 9] = 1
y[digits.target != 9] = 0

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg.score(X_test, y_test)
# 0.97555555555555551

y_log_predict = log_reg.predict(X_test)

# TN
def TN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 0) & (y_predict == 0))
TN(y_test, y_log_predict)
# 397
# FP
def FP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 0) & (y_predict == 1))
FP(y_test, y_log_predict)
# 5
# FN
def FN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 1) & (y_predict == 0))
FN(y_test, y_log_predict)
# 6
# TP
def TP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 1) & (y_predict == 1))
TP(y_test, y_log_predict)
# 42

# Confusion matrix
def confusion_matrix(y_true, y_predict):
    return np.array([
        [TN(y_true, y_predict), FP(y_true, y_predict)],
        [FN(y_true, y_predict), TP(y_true, y_predict)]
    ])
confusion_matrix(y_test, y_log_predict)
#array([[397,   5],
#       [  6,  42]])

# Precision
def precision_score(y_true, y_predict):
    tp = TP(y_true, y_predict)
    fp = FP(y_true, y_predict)
    # NumPy integers do not raise on division by zero, so check explicitly
    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)
precision_score(y_test, y_log_predict)
# 0.8936170212765957

# Recall
def recall_score(y_true, y_predict):
    tp = TP(y_true, y_predict)
    fn = FN(y_true, y_predict)
    # NumPy integers do not raise on division by zero, so check explicitly
    if tp + fn == 0:
        return 0.0
    return tp / (tp + fn)
recall_score(y_test, y_log_predict)
# 0.875

Confusion matrix, precision, and recall in Scikit-learn

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_log_predict)

from sklearn.metrics import precision_score
precision_score(y_test, y_log_predict)

from sklearn.metrics import recall_score
recall_score(y_test, y_log_predict)

F1 Score

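Which of precision and recall matters more depends on the scenario.

In stock prediction, let 1 mean that a stock goes up. Here we care more about precision, the proportion of stocks we predicted to rise that actually rise. Recall matters much less: it is the proportion of stocks that actually rise that we predicted would rise, and since many stocks rise, missing some of them costs us nothing.

In medical diagnosis we care more about recall, the proportion of patients who have the disease that we manage to detect. The higher the better, ideally missing no sick patient. Lower precision is acceptable: some healthy people will be flagged as sick, but follow-up tests can rule them out without serious harm.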

To take both metrics into account at once, we need a new metric: the F1 Score.

The F1 Score is the harmonic mean of precision and recall:

$F_1 = \frac2{\frac1{precision}+\frac1{recall}} = \frac{2\times precision\times recall}{precision + recall}$

Let's see how the F1 Score changes for different values of precision and recall:

def f1_score(precision, recall):
    try:
        return 2 * precision * recall / (precision + recall)
    except ZeroDivisionError:
        # Both precision and recall are 0
        return 0.0

precision = 0.5
recall = 0.5
f1_score(precision, recall)
# 0.5

precision = 0.1
recall = 0.9
f1_score(precision, recall)
# 0.18000000000000002

precision = 0.0
recall = 1.0
f1_score(precision, recall)
# 0.0

We measured precision and recall in the earlier example; now let's compute f1_score with Scikit-learn:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_log_predict)
#array([[397,   5],
#       [  6,  42]])

from sklearn.metrics import precision_score
precision_score(y_test, y_log_predict)
# 0.8936170212765957

from sklearn.metrics import recall_score
recall_score(y_test, y_log_predict)
# 0.875

from sklearn.metrics import f1_score
f1_score(y_test, y_log_predict)
# 0.8842105263157894

The F1 Score favors classifiers whose precision and recall are close to each other.
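One way to see why is to compare the harmonic mean with the plain arithmetic mean. The sketch below uses two hypothetical precision/recall pairs that have the same arithmetic mean; the helper names `f1` and `arithmetic_mean` are illustrative only.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def arithmetic_mean(precision, recall):
    return (precision + recall) / 2

balanced = (0.5, 0.5)  # precision and recall are equal
lopsided = (0.1, 0.9)  # same arithmetic mean, very different values

for p, r in (balanced, lopsided):
    print(f"precision={p}, recall={r}: "
          f"mean={arithmetic_mean(p, r):.2f}, F1={f1(p, r):.2f}")
```

Both pairs average to 0.5, but the harmonic mean is pulled toward the smaller value, so the lopsided pair gets a much lower F1 (about 0.18) than the balanced one (0.5).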

The Precision-Recall Trade-off

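In the figure below, the vertical line marks the threshold: samples to its left are predicted as 0 and samples to its right are predicted as 1. Stars are samples whose true value is 1, and circles are samples whose true value is 0.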

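With the threshold at 0, below 0, and above 0, precision and recall take different values. Raising the threshold tends to increase precision and reduce recall, and lowering it does the opposite, so you cannot have the best of both at once.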
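The trade-off can be reproduced on a few made-up numbers. The scores, labels, and the helper name `precision_recall_at` below are hypothetical; the sketch simply predicts 1 whenever a score reaches the threshold and recomputes both metrics.

```python
# Hypothetical decision scores and true labels, invented for illustration.
scores = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

def precision_recall_at(threshold, scores, labels):
    """Predict 1 when score >= threshold, then return (precision, recall)."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

results = {t: precision_recall_at(t, scores, labels) for t in (-1.5, 0.0, 1.5)}
for t, (p, r) in results.items():
    print(f"threshold={t:+.1f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, moving the threshold from -1.5 to 0 to 1.5 takes precision from about 0.667 to 0.75 to 1.0, while recall drops from 1.0 to 0.75 to 0.5.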

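Scikit-learn does not let you set the threshold directly, but it does expose the decision scores it uses to make predictions. Instead of calling the classifier's predict() method, call decision_function(), which returns a score for each instance; you can then make predictions with any threshold you like: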

decision_scores = log_reg.decision_function(X_test)

np.min(decision_scores)
# -61.02813630853092
np.max(decision_scores)
# 17.504275181503946

y_predict_2 = np.array(decision_scores >= 5, dtype='int')
confusion_matrix(y_test, y_predict_2)
#array([[402,   0],
#       [ 20,  28]], dtype=int64)
precision_score(y_test, y_predict_2)
# 1.0
recall_score(y_test, y_predict_2)
# 0.5833333333333334

y_predict_3 = np.array(decision_scores >= -5, dtype='int')
confusion_matrix(y_test, y_predict_3)
#array([[379,  23],
#       [  2,  46]], dtype=int64)
precision_score(y_test, y_predict_3)
# 0.6666666666666666
recall_score(y_test, y_predict_3)
# 0.9583333333333334

Next, loop over a range of thresholds and plot precision and recall against them:

import matplotlib.pyplot as plt

precisions = []
recalls = []
thresholds = np.arange(np.min(decision_scores), np.max(decision_scores), 0.1)
for threshold in thresholds:
    y_predict = np.array(decision_scores >= threshold, dtype='int')
    precisions.append(precision_score(y_test, y_predict))
    recalls.append(recall_score(y_test, y_predict))

plt.plot(thresholds, precisions)
plt.plot(thresholds, recalls)
plt.show()

Precision-Recall Curve

Putting precision on the x-axis and recall on the y-axis makes the relationship between the two easy to see:

plt.plot(precisions, recalls)
plt.show()

In Scikit-learn you can call precision_recall_curve directly to get these values:

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_test, decision_scores)

It returns three arrays: precisions, recalls, and thresholds. Let's check how many elements each one has:

precisions.shape
# (93,)
recalls.shape
# (93,)
thresholds.shape
# (92,)

According to the official documentation², the last precision and recall values are 1 and 0 respectively, and they have no corresponding threshold.
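The length mismatch is easier to see in a simplified pure-Python sketch. This is not Scikit-learn's implementation: the function name `pr_curve_sketch` and the toy data are assumptions, and the code only mimics the documented convention of appending a final point with precision 1 and recall 0.

```python
def pr_curve_sketch(y_true, scores):
    """Evaluate precision/recall at each distinct score, then append (1, 0)."""
    thresholds = sorted(set(scores))
    total_positives = sum(y_true)
    precisions, recalls = [], []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        # tp + fp >= 1 here, because the sample whose score equals t is predicted 1.
        precisions.append(tp / (tp + fp))
        recalls.append(tp / total_positives)
    # Final point with no threshold of its own.
    precisions.append(1.0)
    recalls.append(0.0)
    return precisions, recalls, thresholds

# Toy data, invented for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
precisions, recalls, thresholds = pr_curve_sketch(y_true, scores)
print(len(precisions), len(recalls), len(thresholds))  # 5 5 4
```

Because the extra point has no threshold, any plot against thresholds has to drop the last element of the precision and recall arrays.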

Finally, plot them with matplotlib:

plt.plot(thresholds, precisions[:-1])
plt.plot(thresholds, recalls[:-1])
plt.show()

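To plot precision against recall as before:

plt.plot(precisions, recalls)
plt.show()

Compared with the earlier plot, this one covers only the middle portion, because precision_recall_curve chooses its own thresholds from the decision scores and keeps only those it considers informative, rather than sweeping the whole score range.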

ROC Curve

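ROC curves are often used with binary classifiers. They are very similar to precision-recall curves, but instead of plotting precision against recall, they plot the true positive rate (TPR) against the false positive rate (FPR).

Going back to the binary confusion matrix, TPR is the same thing as recall, while FPR is the proportion of samples whose true value is 0 that are predicted as 1: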

$FPR = \frac{FP}{TN+FP}$

The figure below shows how TPR and FPR relate: they rise and fall together.
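The same behaviour shows up on a few made-up numbers. In the sketch below, the scores, labels, and helper name `tpr_fpr_at` are hypothetical; lowering the threshold predicts more samples as 1, which raises both rates.

```python
# Hypothetical decision scores and true labels, invented for illustration.
scores = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

def tpr_fpr_at(threshold, scores, labels):
    """Predict 1 when score >= threshold, then return (TPR, FPR).

    Assumes both classes are present, so neither denominator is zero.
    """
    tp = fn = fp = tn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        elif label == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)

# Sweep from a high threshold to a low one.
rates = [tpr_fpr_at(t, scores, labels) for t in (1.5, 0.0, -1.5)]
print(rates)  # [(0.5, 0.0), (0.75, 0.25), (1.0, 0.5)]
```

Each step down in the threshold catches more true positives but also lets in more false positives, which is why the ROC curve runs from the bottom left to the top right.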

Scikit-learn's roc_curve function returns the values we need:

from sklearn.metrics import roc_curve
fprs, tprs, thresholds = roc_curve(y_test, decision_scores)

plt.plot(fprs, tprs)
plt.show()

The resulting plot:

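ROC curves are typically used to compare two models. The area under the curve (AUC) is the key summary statistic: a perfect classifier has a ROC AUC of 1, while a purely random classifier has a ROC AUC of 0.5.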
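These two reference values can be checked with a small pure-Python sketch. It relies on the rank interpretation of ROC AUC: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function name `auc_by_ranking` and the toy data are assumptions made for illustration; this is not how Scikit-learn computes the value internally.

```python
def auc_by_ranking(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
perfect = auc_by_ranking(y_true, [0.1, 0.2, 0.8, 0.9])        # positives always higher
uninformative = auc_by_ranking(y_true, [0.5, 0.5, 0.5, 0.5])  # same score for everyone
print(perfect, uninformative)  # 1.0 0.5
```

A scorer that cannot tell the classes apart lands at 0.5, and one that separates them completely reaches 1.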

Scikit-learn provides a function for this:

from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, decision_scores)
# 0.98304526748971188

Full code: Classification-Performance-Measures.ipynb
