本案例目的是实现机器学习模型的训练、调优、评估以及结果的可视化，主要包括以下步骤：

数据加载与预处理：首先从 Excel 文件中加载了包含多特征的医学数据，并将其存储在数据框中，使用LightGBM模型进行特征的重要性排序。通过训练LGBM模型并获取特征重要性，对特征进行排序，并选取前30个最重要的特征，最后根据LGBM模型的特征重要性，选择了前8个特征进行模型训练。这些特征是：X_30, X_39, X_46, X_32, X_34, X_33, X_9, X_28，目标变量是 y，即分类标签，并将特征和目标变量分开存储，进行后续的模型训练；
模型训练与调优：使用多种机器学习模型进行训练，包括人工神经网络、决策树、极限随机树、梯度提升机、K 近邻、LightGBM、随机森林、支持向量机和 XGBoost 。为每个模型定义了不同的超参数，并通过GridSearchCV进行网格搜索，找到最优的参数配置，每个模型使用 5 折交叉验证来评估其性能，确保模型在不同数据子集上的表现稳定；
模型评估与性能对比：对每个模型进行评估，计算常见的分类指标，包括准确率、敏感性（召回率）、特异性、精确度（阳性预测值）、负性预测值、F1 分数以及 Kappa 系数等，绘制多个图表进行模型性能对比，包括 ROC 曲线、Precision-Recall 曲线等，以展示不同模型在训练集和测试集上的表现；
模型结果的可视化：对最优模型进行了深入的可视化分析。通过 SHAP 值解释模型的决策过程，展示特征的重要性和对每个样本预测的影响，使用 PCA 进行降维，将数据投影到一维空间，并绘制散点图展示每个数据点在降维空间中的分布，同时根据预测的概率值对数据点进行颜色编码，区分不同类别,绘制了多个 SHAP 图（如条形图、散点图、瀑布图等），这些图表帮助解释模型的决策依据和各特征对模型预测的影响。

模拟案例

加载模块

import shap
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
import seaborn as sns
import scipy.stats as stats
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import PrecisionRecallDisplay
from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import roc_curve, auc
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, cohen_kappa_score, confusion_matrix

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False
import warnings
warnings.filterwarnings("ignore")

加载数据

import os
wkdir = 'C:/Users/Administrator/Desktop'
os.chdir(wkdir)

path = 'Z:/TData/big-data/sad41d8cd/251027_Medical_Classification_Model_Comparison.xlsx'
df = pd.read_excel(path)

df.head()
Out[91]: 
   X_1  X_2  X_3  X_4  X_5  X_6  X_7  ...  X_41  X_42  X_43  X_44  X_45  X_46  y
0    1    0    0    0    1    0    0  ...   152  1.35  16.9  6.75   6.0    47  1
1    1    0    0    0    1    0    0  ...   132  1.49  12.8  4.71   5.8    63  0
2    1    0    0    0    1    0    0  ...   141  4.44  15.3  4.26   6.4    48  1
3    0    0    0    0    0    0    0  ...   105  1.64  14.4  4.12   5.5    67  0
4    1    0    0    0    1    1    0  ...   142  1.21  13.7  3.91   7.0    55  1

[5 rows x 47 columns]

df.columns
Out[92]: 
Index(['X_1', 'X_2', 'X_3', 'X_4', 'X_5', 'X_6', 'X_7', 'X_8', 'X_9', 'X_10',
       'X_11', 'X_12', 'X_13', 'X_14', 'X_15', 'X_16', 'X_17', 'X_18', 'X_19',
       'X_20', 'X_21', 'X_22', 'X_23', 'X_24', 'X_25', 'X_26', 'X_27', 'X_28',
       'X_29', 'X_30', 'X_31', 'X_32', 'X_33', 'X_34', 'X_35', 'X_36', 'X_37',
       'X_38', 'X_39', 'X_40', 'X_41', 'X_42', 'X_43', 'X_44', 'X_45', 'X_46',
       'y'],
      dtype='object')

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256 entries, 0 to 255
Data columns (total 47 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X_1     256 non-null    int64  
 1   X_2     256 non-null    int64  
 2   X_3     256 non-null    int64  
 3   X_4     256 non-null    int64  
 4   X_5     256 non-null    int64  
 5   X_6     256 non-null    int64  
 6   X_7     256 non-null    int64  
 7   X_8     256 non-null    int64  
 8   X_9     256 non-null    int64  
 9   X_10    256 non-null    int64  
 10  X_11    256 non-null    int64  
 11  X_12    256 non-null    int64  
 12  X_13    256 non-null    int64  
 13  X_14    256 non-null    int64  
 14  X_15    256 non-null    int64  
 15  X_16    256 non-null    int64  
 16  X_17    256 non-null    int64  
 17  X_18    256 non-null    int64  
 18  X_19    256 non-null    int64  
 19  X_20    256 non-null    int64  
 20  X_21    256 non-null    int64  
 21  X_22    256 non-null    int64  
 22  X_23    256 non-null    int64  
 23  X_24    256 non-null    int64  
 24  X_25    256 non-null    int64  
 25  X_26    256 non-null    int64  
 26  X_27    256 non-null    int64  
 27  X_28    256 non-null    float64
 28  X_29    256 non-null    int64  
 29  X_30    256 non-null    int64  
 30  X_31    256 non-null    float64
 31  X_32    256 non-null    float64
 32  X_33    256 non-null    float64
 33  X_34    256 non-null    int64  
 34  X_35    256 non-null    int64  
 35  X_36    256 non-null    int64  
 36  X_37    256 non-null    int64  
 37  X_38    256 non-null    int64  
 38  X_39    256 non-null    int64  
 39  X_40    256 non-null    int64  
 40  X_41    256 non-null    int64  
 41  X_42    256 non-null    float64
 42  X_43    256 non-null    float64
 43  X_44    256 non-null    float64
 44  X_45    256 non-null    float64
 45  X_46    256 non-null    int64  
 46  y       256 non-null    int64  
dtypes: float64(8), int64(39)
memory usage: 94.1 KB

划分数据

# 划分特征和目标变量
X = df.drop(['y'], axis = 1)         # 从数据集中去掉目标变量 'y'，得到特征变量 X
y = df['y']                          # 提取目标变量 y

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(
    X,                               # 特征变量
    y,                               # 目标变量
    test_size = 0.3,                 # 测试集所占比例，这里为 30%
    random_state = 42,               # 随机种子，确保结果可重复
    stratify = df['y']               # 按目标变量 'y' 的分布进行分层采样，确保训练集和测试集中目标变量的分布一致
)

特征筛选

特征重要性排序，使用初步训练的轻梯度提升机 (LGBM) 分类器对变量的重要性进行排序，并选择排名前 30 的变量；
前向特征选择，使用多个 LGBM 分类器，通过逐步添加变量的方式，依次评估每个变量对模型性能的贡献；
模型优化与最终选择，根据受试者特征曲线 AUC 评分分类器的性能。例如，当变量添加到第 7 个时，模型性能不再显著提升，因此最终选择前 7 个变量用于模型开发；
同时注意多重共线性的问题。

# 创建 LGBM 分类器
lgbm_clf = lgb.LGBMClassifier(random_state = 42, verbose = -1)

# 训练模型
lgbm_clf.fit(X_train, y_train)

# 获取特征重要性
feature_importances = lgbm_clf.feature_importances_

# 构建特征重要性排名
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by = 'Importance', ascending = False)


# 只取前30个重要特征
top_n = 30
top_features = feature_importance_df.head(top_n)

# 调整字体大小
plt.figure(figsize = (12, 8), dpi = 1200)  
plt.barh(top_features['Feature'], top_features['Importance'], color = 'skyblue')
plt.xlabel('Importance', fontsize = 14)  
plt.ylabel('Feature', fontsize = 14)  
plt.title(f'Top {top_n} Feature Importance', fontsize = 16)  
plt.xticks(fontsize = 12)  
plt.yticks(fontsize = 12)  
plt.gca().invert_yaxis()  
plt.savefig(f'{wkdir}/特征筛选.png', bbox_inches = 'tight')
plt.close()

# 初始化存储结果的 DataFrame
selection_results = pd.DataFrame(columns = ['Feature', 'Importance', 'Mean_ROC'])

# 初始化用于训练的特征列表
selected_features = []

# K 折交叉验证
kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

# 获取折数
n_splits = kf.get_n_splits()

# 动态创建列名
fold_columns = [f'Fold_{i+1}_ROC' for i in range(n_splits)]

# 依次添加特征
for i in range(len(top_features)):
    # 当前特征
    current_feature = top_features.iloc[i]['Feature']
    selected_features.append(current_feature)

    fold_roc_scores = []  # 存储每折的 ROC AUC 分数

    # K 折交叉验证
    for train_idx, val_idx in kf.split(X_train):
        # 划分训练集和验证集
        X_train_fold, X_val_fold = X_train.iloc[train_idx][selected_features], X_train.iloc[val_idx][selected_features]
        y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]

        # 创建并训练 LGBM 模型
        lgbm_clf = lgb.LGBMClassifier(random_state = 42, verbose = -1)
        lgbm_clf.fit(X_train_fold, y_train_fold)

        # 预测并计算 ROC AUC 分数
        y_val_proba = lgbm_clf.predict_proba(X_val_fold)[:, 1]  # 概率分数
        fold_roc_score = roc_auc_score(y_val_fold, y_val_proba)
        fold_roc_scores.append(fold_roc_score)

    # 计算平均 ROC AUC 分数
    mean_roc_score = np.mean(fold_roc_scores)

    # 保存结果
    row_data = {
        'Feature': current_feature,
        'Importance': top_features.iloc[i]['Importance'],
        'Mean_ROC': mean_roc_score,
    }

    # 添加每折的ROC AUC分数
    for j, score in enumerate(fold_roc_scores):
        row_data[fold_columns[j]] = score
    row_df = pd.DataFrame([row_data])  
    selection_results = pd.concat([selection_results, row_df], ignore_index = True)

selection_results
Out[108]: 
   Feature Importance  Mean_ROC  ...  Fold_3_ROC  Fold_4_ROC  Fold_5_ROC
0     X_30         62  0.632506  ...    0.696429    0.667488       0.364
1     X_39         55  0.721269  ...    0.754870    0.650246       0.656
2     X_46         46  0.747930  ...    0.795455    0.726601       0.678
3     X_32         36  0.749943  ...    0.792208    0.768473       0.648
4     X_34         35  0.759666  ...    0.844156    0.778325       0.600
5      X_9         32  0.822709  ...    0.853896    0.852217       0.740
6     X_33         32  0.834548  ...    0.840909    0.871921       0.744
7     X_28         31  0.837696  ...    0.847403    0.842365       0.764
8     X_27         27  0.842103  ...    0.860390    0.802956       0.784
9     X_44         22  0.825152  ...    0.889610    0.788177       0.808
10    X_45         22  0.847453  ...    0.899351    0.827586       0.824
11    X_35         19  0.827285  ...    0.896104    0.773399       0.804
12    X_40         19  0.812183  ...    0.879870    0.729064       0.796
13    X_36         18  0.820189  ...    0.873377    0.733990       0.776
14    X_43         15  0.814614  ...    0.879870    0.714286       0.832
15    X_37         13  0.822236  ...    0.912338    0.763547       0.768
16    X_38         13  0.790346  ...    0.857143    0.724138       0.756
17    X_24         11  0.796770  ...    0.873377    0.733990       0.772
18     X_3         11  0.792737  ...    0.860390    0.733990       0.788
19    X_31         10  0.784170  ...    0.844156    0.748768       0.796
20    X_42         10  0.789202  ...    0.827922    0.773399       0.800
21    X_26         10  0.789443  ...    0.834416    0.773399       0.804
22    X_21          9  0.787502  ...    0.837662    0.743842       0.800
23     X_5          7  0.788152  ...    0.837662    0.743842       0.800
24    X_16          6  0.794447  ...    0.837662    0.743842       0.800
25    X_17          4  0.785223  ...    0.844156    0.743842       0.796
26    X_41          4  0.797268  ...    0.860390    0.778325       0.812
27    X_29          3  0.796142  ...    0.853896    0.768473       0.816
28     X_2          3  0.796142  ...    0.853896    0.768473       0.816
29    X_20          2  0.796070  ...    0.834416    0.773399       0.820

[30 rows x 8 columns]

# 将 Importance 列百分比化并归一化
selection_results['Importance'] = (
    selection_results['Importance'] / selection_results['Importance'].sum()
)
selection_results

Out[109]: 
   Feature Importance  Mean_ROC  ...  Fold_3_ROC  Fold_4_ROC  Fold_5_ROC
0     X_30   0.105622  0.632506  ...    0.696429    0.667488       0.364
1     X_39   0.093697  0.721269  ...    0.754870    0.650246       0.656
2     X_46   0.078365  0.747930  ...    0.795455    0.726601       0.678
3     X_32   0.061329  0.749943  ...    0.792208    0.768473       0.648
4     X_34   0.059625  0.759666  ...    0.844156    0.778325       0.600
5      X_9   0.054514  0.822709  ...    0.853896    0.852217       0.740
6     X_33   0.054514  0.834548  ...    0.840909    0.871921       0.744
7     X_28   0.052811  0.837696  ...    0.847403    0.842365       0.764
8     X_27   0.045997  0.842103  ...    0.860390    0.802956       0.784
9     X_44   0.037479  0.825152  ...    0.889610    0.788177       0.808
10    X_45   0.037479  0.847453  ...    0.899351    0.827586       0.824
11    X_35   0.032368  0.827285  ...    0.896104    0.773399       0.804
12    X_40   0.032368  0.812183  ...    0.879870    0.729064       0.796
13    X_36   0.030664  0.820189  ...    0.873377    0.733990       0.776
14    X_43   0.025554  0.814614  ...    0.879870    0.714286       0.832
15    X_37   0.022147  0.822236  ...    0.912338    0.763547       0.768
16    X_38   0.022147  0.790346  ...    0.857143    0.724138       0.756
17    X_24   0.018739  0.796770  ...    0.873377    0.733990       0.772
18     X_3   0.018739  0.792737  ...    0.860390    0.733990       0.788
19    X_31   0.017036  0.784170  ...    0.844156    0.748768       0.796
20    X_42   0.017036  0.789202  ...    0.827922    0.773399       0.800
21    X_26   0.017036  0.789443  ...    0.834416    0.773399       0.804
22    X_21   0.015332  0.787502  ...    0.837662    0.743842       0.800
23     X_5   0.011925  0.788152  ...    0.837662    0.743842       0.800
24    X_16   0.010221  0.794447  ...    0.837662    0.743842       0.800
25    X_17   0.006814  0.785223  ...    0.844156    0.743842       0.796
26    X_41   0.006814  0.797268  ...    0.860390    0.778325       0.812
27    X_29   0.005111  0.796142  ...    0.853896    0.768473       0.816
28     X_2   0.005111  0.796142  ...    0.853896    0.768473       0.816
29    X_20   0.003407  0.796070  ...    0.834416    0.773399       0.820

[30 rows x 8 columns]

# 动态获取折列名称
fold_columns = [col for col in selection_results.columns if 'Fold_' in col]

# 添加置信区间列到 selection_results
selection_results['CI_Lower'] = None
selection_results['CI_Upper'] = None

# 遍历每行计算置信区间
for index, row in selection_results.iterrows():
    
    # 提取每折的ROC成绩
    fold_roc_scores = [row[fold] for fold in fold_columns]

    # 样本大小
    n_folds = len(fold_roc_scores)

    # 平均值和标准误差
    mean_roc = row['Mean_ROC']
    std_err = stats.sem(fold_roc_scores)  # 标准误差

    # t值（95%置信区间，n_folds - 1 自由度）
    t_value = stats.t.ppf(0.975, df=n_folds - 1)

    # 置信区间
    ci_lower = mean_roc - t_value * std_err
    ci_upper = mean_roc + t_value * std_err

    # 确保置信区间上限不超过1
    ci_upper = min(ci_upper, 1)

    # 保存到 DataFrame
    selection_results.at[index, 'CI_Lower'] = ci_lower
    selection_results.at[index, 'CI_Upper'] = ci_upper

# 输出结果
selection_results

Out[111]: 
   Feature Importance  Mean_ROC  ...  Fold_5_ROC  CI_Lower  CI_Upper
0     X_30   0.105622  0.632506  ...       0.364  0.402515  0.862497
1     X_39   0.093697  0.721269  ...       0.656  0.604669  0.837869
2     X_46   0.078365  0.747930  ...       0.678  0.612311  0.883548
3     X_32   0.061329  0.749943  ...       0.648  0.628602  0.871285
4     X_34   0.059625  0.759666  ...       0.600  0.624474  0.894858
5      X_9   0.054514  0.822709  ...       0.740  0.752107  0.893311
6     X_33   0.054514  0.834548  ...       0.744  0.768515  0.900581
7     X_28   0.052811  0.837696  ...       0.764  0.784572   0.89082
8     X_27   0.045997  0.842103  ...       0.784  0.782365   0.90184
9     X_44   0.037479  0.825152  ...       0.808  0.775929  0.874374
10    X_45   0.037479  0.847453  ...       0.824   0.80444  0.890466
11    X_35   0.032368  0.827285  ...       0.804  0.767662  0.886909
12    X_40   0.032368  0.812183  ...       0.796  0.739797  0.884568
13    X_36   0.030664  0.820189  ...       0.776  0.743339  0.897039
14    X_43   0.025554  0.814614  ...       0.832   0.73852  0.890709
15    X_37   0.022147  0.822236  ...       0.768   0.74653  0.897943
16    X_38   0.022147  0.790346  ...       0.756  0.726275  0.854416
17    X_24   0.018739  0.796770  ...       0.772  0.732925  0.860614
18     X_3   0.018739  0.792737  ...       0.788  0.736106  0.849367
19    X_31   0.017036  0.784170  ...       0.796  0.732317  0.836023
20    X_42   0.017036  0.789202  ...       0.800   0.75556  0.822844
21    X_26   0.017036  0.789443  ...       0.804  0.745793  0.833092
22    X_21   0.015332  0.787502  ...       0.800  0.735047  0.839958
23     X_5   0.011925  0.788152  ...       0.800  0.736725  0.839578
24    X_16   0.010221  0.794447  ...       0.800  0.732864  0.856029
25    X_17   0.006814  0.785223  ...       0.796  0.732467   0.83798
26    X_41   0.006814  0.797268  ...       0.812  0.741074  0.853461
27    X_29   0.005111  0.796142  ...       0.816  0.741829  0.850456
28     X_2   0.005111  0.796142  ...       0.816  0.741829  0.850456
29    X_20   0.003407  0.796070  ...       0.820  0.750079  0.842061

[30 rows x 10 columns]

# 确保 Importance 列为数值类型
selection_results['Importance'] = pd.to_numeric(selection_results['Importance'], errors = 'coerce')

# 确保置信区间的列为数值类型
selection_results['CI_Lower'] = pd.to_numeric(selection_results['CI_Lower'], errors = 'coerce')
selection_results['CI_Upper'] = pd.to_numeric(selection_results['CI_Upper'], errors = 'coerce')

# 参数：选择前 n 个特征
n_features = 7

fig, ax1 = plt.subplots(figsize = (16, 6), dpi = 1200)

# 渐变柱状图：特征贡献度
norm = plt.Normalize(selection_results['Importance'].min(), selection_results['Importance'].max())
colors = plt.cm.Blues(norm(selection_results['Importance']))

ax1.bar(selection_results['Feature'], selection_results['Importance'], color = colors, label = 'Feature Importance')
ax1.set_xlabel("Features", fontsize = 18, fontweight = 'bold')
ax1.set_ylabel("Feature Importance", fontsize = 18, fontweight = 'bold')
ax1.tick_params(axis = 'y', labelsize = 12, width = 1.5)

# 修改 x 轴特征颜色，前 n_features 用红色，其他用黑色
x_labels = selection_results['Feature']
x_colors = ['red' if i < n_features else 'black' for i in range(len(x_labels))]
for tick_label, color in zip(ax1.get_xticklabels(), x_colors):
    tick_label.set_color(color)

ax1.tick_params(axis = 'x', rotation = 45, labelsize = 12, width = 1.5)

# 折线图：AUC成绩
ax2 = ax1.twinx()

# 红点和红线：前 n_features 个特征
ax2.plot(
    selection_results['Feature'][:n_features + 1],  # 连接红点到黑点的过渡
    selection_results['Mean_ROC'][:n_features + 1],
    color = "red", marker = 'o', linestyle = '-', label = "Mean AUC (Top Features)"
)

# 黑点和黑线：其余特征
ax2.plot(
    selection_results['Feature'][n_features:],
    selection_results['Mean_ROC'][n_features:],
    color = "black", marker = 'o', linestyle = '-', label = "Mean AUC (Other Features)"
)

# 添加置信区间阴影
ax2.fill_between(
    selection_results['Feature'],
    selection_results['CI_Lower'],  # 置信区间下限
    selection_results['CI_Upper'],  # 置信区间上限
    color = 'red',
    alpha = 0.2,                    # 设置透明度
)

ax2.set_ylabel("Mean AUC", fontsize = 18, fontweight = 'bold')
ax2.tick_params(axis = 'y', labelsize = 12, width = 1.5)
ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.3f}'))      # 保留三位小数

# 添加标题
plt.title(f"Feature Contribution and AUC Performance (Top {n_features} Features Highlighted)", fontsize = 18, fontweight = 'bold')
fig.tight_layout()
plt.savefig("f{wkdir}/特征曲线.png", bbox_inches = 'tight')
plt.close()

1 2	list(selection_results['Feature'][0:8]) # Out[116]: ['X_30', 'X_39', 'X_46', 'X_32', 'X_34', 'X_9', 'X_33', 'X_28']

构建模型

基于特征筛选确定的特征 [‘X_30’, ‘X_39’, ‘X_46’, ‘X_32’, ‘X_34’, ‘X_33’, ‘X_9’, ‘X_28’] 进行多模型比较
模型包括 ANN、DT、ET、GBM、KNN、LightGBM、RF、SVM、XGBoost
利用 k 折交叉验证+网格搜索+手动微调确定各模型最优参数

混淆矩阵

# 混淆矩阵
def plot_confusion_matrix_plus(y_true, y_pred, title = "Model", labels = None, save_path = None):
    unique_labels = set(np.unique(np.concatenate([np.array(y_true), np.array(y_pred)])))
    labels_used = list(unique_labels)
    # 若未提供 labels，或者提供的 labels 与 y_true 不一致，则自动推断
    if labels is None:
        labels_with_totals = labels_used + ["Total"]
    else:
        labels_with_totals = [labels[lab] for lab in labels if lab in labels_used] + ["Total"]

    # 若最终只有单一标签（极端情况），也能正常绘制
    labels_used = list(labels_used)
    cm = confusion_matrix(y_true, y_pred, labels = labels_used)

    # 行标准化（避免除零）
    row_sums = cm.sum(axis = 1, keepdims = True)
    with np.errstate(divide = 'ignore', invalid = 'ignore'):
        cm_normalized = np.divide(cm, row_sums, where = (row_sums != 0))
        cm_normalized[np.isnan(cm_normalized)] = 0.0

    # 添加 Total 行和列（计数）
    cm_with_totals = np.vstack([cm, cm.sum(axis = 0, keepdims = True)])
    cm_with_totals = np.column_stack([cm_with_totals, cm_with_totals.sum(axis = 1, keepdims = True)])

    # 用于着色的矩阵（最后一行/列置 0，使其灰色覆盖）
    heatmap_data = cm_normalized.copy()
    heatmap_data = np.vstack([heatmap_data, np.zeros((1, heatmap_data.shape[1]))])
    heatmap_data = np.column_stack([heatmap_data, np.zeros((heatmap_data.shape[0], 1))])

    fig, ax = plt.subplots(figsize = (10, 8))
    colors = sns.color_palette("Greens", as_cmap = True)
    grey_color = "#f0f0f0"

    sns.heatmap(
        heatmap_data,
        annot = False,
        fmt = "",
        cmap = colors,
        xticklabels = labels_with_totals,
        yticklabels = labels_with_totals,
        cbar = False,
        square = True,
        linewidths = 1.5,
        linecolor = "white",
        ax = ax,
        vmin = 0.0, vmax = 1.0
    )

    # 覆盖 Total 行/列为灰色
    n_rows, n_cols = cm.shape
    for i in range(n_rows):
        ax.add_patch(plt.Rectangle((n_cols, i), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))
    for j in range(n_cols):
        ax.add_patch(plt.Rectangle((j, n_rows), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))
    ax.add_patch(plt.Rectangle((n_cols, n_rows), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))

    # 填充数值与百分比
    total_all = cm_with_totals[-1, -1]
    for i in range(cm_with_totals.shape[0]):
        for j in range(cm_with_totals.shape[1]):
            if i < n_rows and j < n_cols:
                value = cm[i, j]
                percentage = cm_normalized[i, j] * 100 if row_sums[i, 0] > 0 else 0.0
                ax.text(j + 0.5, i + 0.45, f"{percentage:.1f}%", ha = "center", va = "center", fontsize = 14, color = "black")
                ax.text(j + 0.5, i + 0.72, f"{int(value)}", ha = "center", va = "center", fontsize = 13, color = "black")
            elif i == n_rows and j == n_cols:
                ax.text(j + 0.5, i + 0.5, f"{int(total_all)}", ha = "center", va = "center", fontsize = 15, color = "black")
            else:
                total_value = cm_with_totals[i, j]
                total_percentage = (total_value / total_all * 100) if total_all > 0 else 0.0
                ax.text(j + 0.5, i + 0.45, f"{total_percentage:.1f}%", ha = "center", va = "center", fontsize = 14, color = "black")
                ax.text(j + 0.5, i + 0.72, f"{int(total_value)}", ha = "center", va = "center", fontsize = 13, color = "black")

    plt.title(title, fontsize = 20)
    plt.xlabel("Prediction (model output)", fontsize = 16)
    plt.ylabel("Truth (observation)", fontsize = 16)
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, bbox_inches = 'tight')
    plt.close()

重新划分数据集

# 划分特征和目标变量
X = df[['X_30', 'X_39', 'X_46', 'X_32', 'X_34', 'X_33', 'X_9', 'X_28']] # 根据特征筛选变量重新划分特征
y = df['y']  # 提取目标变量 y

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(
    X,                   # 特征变量
    y,                   # 目标变量
    test_size = 0.3,     # 测试集所占比例，这里为 30%
    random_state = 42,   # 随机种子，确保结果可重复
    stratify = df['y']   # 按目标变量 'y' 的分布进行分层采样，确保训练集和测试集中目标变量的分布一致
)

Artificial Neural Network

# 定义 ANN 的基础参数
params_ann = {
    'random_state': 42,               # 随机种子
    'max_iter': 200                   # 最大迭代次数
}

# 初始化 ANN 分类模型
model_ann = MLPClassifier(**params_ann)

# 定义参数网格
param_grid_ann = {
'hidden_layer_sizes': [(50,), (100,), (100, 50)],  # 隐藏层大小
'learning_rate_init': [0.001, 0.01, 0.1],          # 学习率
'alpha': [0.0001, 0.001, 0.01]                     # 正则化参数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_ann = GridSearchCV(
estimator = model_ann,
param_grid = param_grid_ann,
scoring = 'neg_log_loss',  # 评价指标为负对数损失
cv = 5,                    # 5 折交叉验证
n_jobs = -1,               # 并行计算
verbose = 1                # 输出详细进度信息
)

# 训练模型
grid_search_ann.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_ann = grid_search_ann.best_estimator_
        
# 预测测试集
y_pred = best_model_ann.predict(X_test)

# 输出模型报告，查看评价指标
print(classification_report(y_test, y_pred))   

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
			  precision    recall  f1-score   support

           0       0.88      0.83      0.85        52
           1       0.68      0.76      0.72        25

    accuracy                           0.81        77
   macro avg       0.78      0.79      0.78        77
weighted avg       0.81      0.81      0.81        77

Decision Tree, DT

# 定义 DT 的基础参数
params_dt = {'random_state': 42}

# 初始化决策树分类模型
model_dt = DecisionTreeClassifier(**params_dt)

# 定义参数网格
param_grid_dt = {
    'criterion': ['gini', 'entropy'],    # 划分标准
    'max_depth': [None, 10, 20, 30],     # 最大深度
    'min_samples_split': [2, 5, 10],     # 节点最小分裂样本数
    'min_samples_leaf': [1, 2, 5]        # 叶节点最小样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_dt = GridSearchCV(
    estimator = model_dt,
    param_grid = param_grid_dt,
    scoring = 'neg_log_loss',  # 评价指标为负对数损失
    cv = 5,                    # 5 折交叉验证
    n_jobs = -1,               # 并行计算
    verbose = 1                # 输出详细进度信息
)

# 训练模型
grid_search_dt.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_dt = grid_search_dt.best_estimator_ 
        
# 预测测试集
y_pred = best_model_dt.predict(X_test)

# 输出模型报告，查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
              precision    recall  f1-score   support

           0       0.78      0.69      0.73        52
           1       0.48      0.60      0.54        25

    accuracy                           0.66        77
   macro avg       0.63      0.65      0.64        77
weighted avg       0.69      0.66      0.67        77

Extra Trees, ET

# 定义 ET 的基础参数
params_et = {
    'random_state': 42  # 随机种子
}

# 初始化 Extra Trees 分类模型
model_et = ExtraTreesClassifier(**params_et)

# 定义参数网格
param_grid_et = {
    'n_estimators': [50, 100, 200],      # 树的数量
    'criterion': ['gini', 'entropy'],    # 划分标准
    'max_depth': [None, 10, 20, 30],     # 最大深度
    'min_samples_split': [2, 5, 10],     # 节点最小分裂样本数
    'min_samples_leaf': [1, 2, 5]        # 叶节点最小样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_et = GridSearchCV(
    estimator = model_et,
    param_grid = param_grid_et,
    scoring = 'neg_log_loss',            # 评价指标为负对数损失
    cv = 5,                              # 5 折交叉验证
    n_jobs = -1,                         # 并行计算
    verbose = 1                          # 输出详细进度信息
)

# 训练模型
grid_search_et.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_et = grid_search_et.best_estimator_
        
# 预测测试集
y_pred = best_model_et.predict(X_test)

# 输出模型报告，查看评价指标
print(classification_report(y_test, y_pred))   
        
# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
              precision    recall  f1-score   support

           0       0.94      0.85      0.89        52
           1       0.73      0.88      0.80        25

    accuracy                           0.86        77
   macro avg       0.83      0.86      0.84        77
weighted avg       0.87      0.86      0.86        77

GBM

# 定义 GBM 的基础参数
params_gbm = {'random_state': 42}

# 初始化梯度增强机分类模型
model_gbm = GradientBoostingClassifier(**params_gbm)

# 定义参数网格
param_grid_gbm = {
    'n_estimators': [50, 100, 200],      # 树的数量
    'learning_rate': [0.01, 0.1, 0.2],   # 学习率
    'max_depth': [3, 5, 10],             # 最大深度
    'min_samples_split': [2, 5, 10],     # 节点最小分裂样本数
    'min_samples_leaf': [1, 2, 5]        # 叶节点最小样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_gbm = GridSearchCV(
    estimator = model_gbm,
    param_grid = param_grid_gbm,
    scoring = 'neg_log_loss',            # 评价指标为负对数损失
    cv = 5,                              # 5 折交叉验证
    n_jobs = -1,                         # 并行计算
    verbose = 1                          # 输出详细进度信息
)

# 训练模型
grid_search_gbm.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_gbm = grid_search_gbm.best_estimator_
        
# 预测测试集
y_pred = best_model_gbm.predict(X_test)

# 输出模型报告，查看评价指标
print(classification_report(y_test, y_pred))
                
# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

Fitting 5 folds for each of 243 candidates, totalling 1215 fits
              precision    recall  f1-score   support

           0       0.82      0.87      0.84        52
           1       0.68      0.60      0.64        25

    accuracy                           0.78        77
   macro avg       0.75      0.73      0.74        77
weighted avg       0.77      0.78      0.78        77

KNN

# 初始化 KNN 分类模型
model_knn = KNeighborsClassifier()

# 定义参数网格
param_grid_knn = {
    'n_neighbors': [3, 5, 10],                          # 邻居数量
    'weights': ['uniform', 'distance'],                 # 权重方式
    'metric': ['euclidean', 'manhattan', 'minkowski']   # 距离度量方式
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_knn = GridSearchCV(
    estimator = model_knn,
    param_grid = param_grid_knn,
    scoring = 'neg_log_loss',  # 评价指标为负对数损失
    cv = 5,                    # 5 折交叉验证
    n_jobs = -1,               # 并行计算
    verbose = 1                # 输出详细进度信息
)

# 训练模型
grid_search_knn.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_knn = grid_search_knn.best_estimator_
        
# 预测测试集
y_pred = best_model_knn.predict(X_test)

# 输出模型报告，查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
              precision    recall  f1-score   support

           0       0.81      0.85      0.83        52
           1       0.65      0.60      0.62        25

    accuracy                           0.77        77
   macro avg       0.73      0.72      0.73        77
weighted avg       0.76      0.77      0.76        77

LightGBM

# 初始化 LightGBM 分类模型
model_lgbm = LGBMClassifier(random_state = 42, verbose = -1)

# 定义参数网格
param_grid_lgbm = {
    'n_estimators': [50, 100, 200],       # 树的数量
    'learning_rate': [0.01, 0.1, 0.2],    # 学习率
    'max_depth': [-1, 10, 20],            # 最大深度
    'num_leaves': [31, 50, 100],          # 叶节点数
    'min_child_samples': [10, 20, 30]    # 最小叶节点样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_lgbm = GridSearchCV(
    estimator = model_lgbm,
    param_grid = param_grid_lgbm,
    scoring = 'neg_log_loss',  # 评价指标为负对数损失
    cv = 5,                    # 5 折交叉验证
    n_jobs = -1,               # 并行计算
    verbose = 1                # 输出详细进度信息
)

# 训练模型
grid_search_lgbm.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_lgbm = grid_search_lgbm.best_estimator_
        
# 预测测试集
y_pred = best_model_lgbm.predict(X_test)

# 输出模型报告，查看评价指标
print(classification_report(y_test, y_pred))    
        
# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

Fitting 5 folds for each of 243 candidates, totalling 1215 fits
              precision    recall  f1-score   support

           0       0.83      0.85      0.84        52
           1       0.67      0.64      0.65        25

    accuracy                           0.78        77
   macro avg       0.75      0.74      0.75        77
weighted avg       0.78      0.78      0.78        77

Random Forest, RF

# 初始化随机森林分类模型
model_rf = RandomForestClassifier(random_state = 42)

# 定义参数网格
param_grid_rf = {
    'n_estimators': [50, 100, 200],       # 树的数量
    'max_depth': [None, 10, 20, 30],      # 最大深度
    'min_samples_split': [2, 5, 10],      # 节点最小分裂样本数
    'min_samples_leaf': [1, 2, 5],        # 叶节点最小样本数
    'criterion': ['gini', 'entropy']      # 划分标准
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_rf = GridSearchCV(
    estimator = model_rf,
    param_grid = param_grid_rf,
    scoring = 'neg_log_loss',  # 评价指标为负对数损失
    cv = 5,                    # 5 折交叉验证
    n_jobs = -1,               # 并行计算
    verbose = 1                # 输出详细进度信息
)

# 训练模型
grid_search_rf.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_rf = grid_search_rf.best_estimator_
    
# 预测测试集
y_pred = best_model_rf.predict(X_test)

# 输出模型报告，查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
              precision    recall  f1-score   support

           0       0.87      0.87      0.87        52
           1       0.72      0.72      0.72        25

    accuracy                           0.82        77
   macro avg       0.79      0.79      0.79        77
weighted avg       0.82      0.82      0.82        77

Support Vector Machine, SVM

# 初始化支持向量机分类模型
model_svm = SVC(probability = True, random_state = 42)

# 定义参数网格
param_grid_svm = {
    'C': [0.1, 1, 10, 100],             # 惩罚参数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_svm = GridSearchCV(
    estimator = model_svm,
    param_grid = param_grid_svm,
    scoring = 'neg_log_loss',  # 评价指标为负对数损失
    cv = 5,                    # 5 折交叉验证
    n_jobs = -1,               # 并行计算
    verbose = 1                # 输出详细进度信息
)

# 训练模型
grid_search_svm.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_svm = grid_search_svm.best_estimator_
    
# 预测测试集
y_pred = best_model_svm.predict(X_test)

# 输出模型报告，查看评价指标
print(classification_report(y_test, y_pred))      
        
# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
              precision    recall  f1-score   support

           0       0.84      0.90      0.87        52
           1       0.76      0.64      0.70        25

    accuracy                           0.82        77
   macro avg       0.80      0.77      0.78        77
weighted avg       0.81      0.82      0.81        77

eXtreme Gradient Boosting, XGBoost

# 初始化 XGBoost 分类模型
model_xgb = XGBClassifier(use_label_encoder = False, eval_metric = 'logloss', random_state = 42)

# 定义参数网格
param_grid_xgb = {
    'n_estimators': [50, 100, 200],       # 树的数量
    'learning_rate': [0.01, 0.1, 0.2],    # 学习率
    'max_depth': [3, 5, 10],              # 最大深度
    'subsample': [0.8, 1.0],              # 子采样比率
    'colsample_bytree': [0.8, 1.0]        # 树的列采样比率
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_xgb = GridSearchCV(
    estimator = model_xgb,
    param_grid = param_grid_xgb,
    scoring = 'neg_log_loss',  # 评价指标为负对数损失
    cv = 5,                    # 5 折交叉验证
    n_jobs = -1,               # 并行计算
    verbose = 1                # 输出详细进度信息
)

# 训练模型
grid_search_xgb.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_xgb = grid_search_xgb.best_estimator_

# 预测测试集
y_pred = best_model_xgb.predict(X_test)

# 输出模型报告，查看评价指标
print(classification_report(y_test, y_pred))
    
# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
              precision    recall  f1-score   support

           0       0.86      0.83      0.84        52
           1       0.67      0.72      0.69        25

    accuracy                           0.79        77
   macro avg       0.76      0.77      0.77        77
weighted avg       0.80      0.79      0.79        77

Test AUC

# 获取所有模型及其颜色的映射
models = {
    "Random Forest": {"model": best_model_rf, "color": "#FDD379"},
    "LightGBM": {"model": best_model_lgbm, "color": "#F7C2CD"},
    "Gradient Boosting": {"model": best_model_gbm, "color": "#A6DAEF"},
    "XGBoost": {"model": best_model_xgb, "color": "#B0D9A5"},
    "SVM": {"model": best_model_svm, "color": "#F06292"},
    "KNN": {"model": best_model_knn, "color": "#64B5F6"},
    "Decision Tree": {"model": best_model_dt, "color": "#FFB74D"},
    "Extra Trees": {"model": best_model_et, "color": "#9575CD"},
    "ANN": {"model": best_model_ann, "color": "#4DB6AC"}
}

# 绘制 ROC 曲线
plt.figure(figsize = (12, 10))

# 遍历每个模型计算 ROC 和 AUC
for name, info in models.items():
    model = info["model"]
    color = info["color"]
    try:
        y_pred = model.predict_proba(X_test)[:, 1]  # 获取概率分布
    except AttributeError:
        y_pred = model.decision_function(X_test)  # 对于 SVM，使用 decision_function
        y_pred = (y_pred - y_pred.min()) / (y_pred.max() - y_pred.min())  # 归一化

    # 计算 ROC 曲线和 AUC 值
    fpr, tpr, _ = roc_curve(y_test, y_pred)
    roc_auc = auc(fpr, tpr)

    # 绘制曲线
    plt.plot(fpr, tpr, label = f"{name}: AUC = {roc_auc:.3f}", color = color, linewidth = 2)

# 绘制随机分类的参考线
plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

# 图形细节设置
plt.title("ROC Curve - Test Set Model Comparison", fontsize = 20, fontweight = "bold")
plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend(loc = "lower right", fontsize = 12)

# 去除顶部和右侧边框
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_linewidth(1.5)
plt.gca().spines['bottom'].set_linewidth(1.5)

# 关闭网格线
plt.grid(False)
plt.tight_layout()
plt.savefig(f"{wkdir}/Test-AUC.png", bbox_inches = 'tight')
plt.close()

Train AUC

# 绘制训练集 ROC 曲线
plt.figure(figsize = (12, 10))

# 遍历每个模型计算 ROC 和 AUC（训练集）
for name, info in models.items():
    model = info["model"]
    color = info["color"]
    try:
        y_pred = model.predict_proba(X_train)[:, 1]  # 获取概率分布
    except AttributeError:
        y_pred = model.decision_function(X_train)  # 对于 SVM，使用 decision_function
        y_pred = (y_pred - y_pred.min()) / (y_pred.max() - y_pred.min())  # 归一化

    # 计算 ROC 曲线和 AUC 值
    fpr, tpr, _ = roc_curve(y_train, y_pred)
    roc_auc = auc(fpr, tpr)

    # 绘制曲线
    plt.plot(fpr, tpr, label=f"{name} (Train AUC = {roc_auc:.3f})", color = color, linewidth = 2)

# 绘制随机分类的参考线
plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

# 图形细节设置
plt.title("ROC Curve - Training Set Model Comparison", fontsize = 20, fontweight = "bold")
plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend(loc="lower right", fontsize = 12)

# 去除顶部和右侧边框
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_linewidth(1.5)
plt.gca().spines['bottom'].set_linewidth(1.5)

# 关闭网格线
plt.grid(False)
plt.tight_layout()
plt.savefig(f"{wkdir}/Train-AUC.png", bbox_inches = 'tight')
plt.close()

训练集测试集评价指标系统输出比较

>为什么多指标评价模型

多指标评价模型能够全面反映模型在各个方面的综合性能，满足不同任务对模型表现的多样需求，尤其在处理数据不平衡时，能够更好地揭示模型在不同类别上的表现差异，从而为模型优化提供更加精准的指导，如在数据不平衡的情况下，F1 分数、敏感性 Sensitivity 和特异性 Specificity 比准确率 Accuracy 更具参考性，因为它们能更好地反映模型在不同类别上的表现，尤其是在处理少数类时。

为什么要对训练集、测试集评价指标计算

计算训练集上的评价指标可以评估模型的学习效果，而训练集和测试集之间的表现差异帮助我们检测是否存在过拟合或欠拟合，从而确保模型能够在未知数据上保持良好的预测能力，一个好的模型应该在各数据集上趋势保持一致。

最终模型选择

最终确定的模型为 ET 其在测试集上的 ROC-AUC 曲线下面积为 0.873，在训练集上为 0.987，且其准确率、敏感性、特异性、阳性预测值、阴性预测值、F1 分数和 Kappa 得分等指标在训练集和测试集上相对较好，且在几个模型里面识别少数类样本最多的模型。

需要强调的是，读者在运用模型时应根据自身的数据特点和评价指标偏好来做选择，因为在这里，多个模型的评价指标差异并不大。此外，模型的性能也与选择的参数范围密切相关，在当前网格搜索的范围内，得到的是最优结果，但如果更换参数，模型可能不再是最优或最差的，因此在实际应用中仍具有很大的可操作性和调整空间。

测试集

# 定义函数计算评价指标
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    sensitivity = recall_score(y_true, y_pred)  # 敏感性（召回率）
    specificity = confusion_matrix(y_true, y_pred)[0, 0] / (confusion_matrix(y_true, y_pred)[0, 0] + confusion_matrix(y_true, y_pred)[0, 1])
    positive_predictive_value = precision_score(y_true, y_pred)
    negative_predictive_value = confusion_matrix(y_true, y_pred)[1, 1] / (confusion_matrix(y_true, y_pred)[1, 1] + confusion_matrix(y_true, y_pred)[1, 0])
    f1 = f1_score(y_true, y_pred)
    kappa = cohen_kappa_score(y_true, y_pred)
    return [accuracy, sensitivity, specificity, positive_predictive_value, negative_predictive_value, f1, kappa]

# 模型字典
models = {
    'ANN': best_model_ann,
    'Decision Tree': best_model_dt,
    'Extra Trees': best_model_et,
    'Gradient Boosting': best_model_gbm,
    'KNN': best_model_knn,
    'LightGBM': best_model_lgbm,
    'Random Forest': best_model_rf,
    'SVM': best_model_svm,
    'XGBoost': best_model_xgb
}

# 初始化存储模型评价指标的列表
metrics_dict = {'Metrics': ['accuracy', 'sensitivity', 'specificity', 'Positive predictive value', 'Negative predictive value', 'F1 score', 'Kappa score']}

# 计算每个模型的指标并存储
for model_name, model in models.items():
    # 预测测试集
    y_pred = model.predict(X_test)
    # 计算评价指标
    metrics_dict[model_name] = calculate_metrics(y_test, y_pred)

# 创建DataFrame存储所有模型的评价指标
metrics_df_test = pd.DataFrame(metrics_dict)
metrics_df_test

Out[156]: 
                     Metrics       ANN  ...       SVM   XGBoost
0                   accuracy  0.805195  ...  0.818182  0.792208
1                sensitivity  0.760000  ...  0.640000  0.720000
2                specificity  0.826923  ...  0.903846  0.826923
3  Positive predictive value  0.678571  ...  0.761905  0.666667
4  Negative predictive value  0.760000  ...  0.640000  0.720000
5                   F1 score  0.716981  ...  0.695652  0.692308
6                Kappa score  0.569191  ...  0.567416  0.535795

[7 rows x 10 columns]

# 可视化展示
def plot_metrics(metrics_df, title = "Metrics Plot", save_path = None):
    
    # 获取一个颜色映射（确保颜色的多样性）
    colors = plt.cm.get_cmap('tab20', len(metrics_df.columns) - 1)  # 根据模型的数量选择足够的颜色
    plt.figure(figsize = (10, 8), dpi = 1200)
    
    # 遍历每一列模型，忽略 'Metrics' 列
    for idx, model in enumerate(metrics_df.columns[1:]):
        plt.plot(metrics_df['Metrics'], metrics_df[model], marker = 'o', label = model, color = colors(idx))
    
    ax = plt.gca()
    
    # 优化坐标轴边框设置
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.yaxis.set_major_locator(ticker.MultipleLocator(0.25)) 
    ax.yaxis.set_minor_locator(ticker.MultipleLocator(0.125))  
    ax.yaxis.set_minor_formatter(ticker.NullFormatter())  
    
    # 设置图表标题和标签
    plt.title(title, fontsize = 18, fontweight = "bold")
    plt.ylim(0.1, 1.0)
    plt.yticks([0.25, 0.50, 0.75, 1.00])  # 设置主要刻度
    plt.xticks(rotation = 45, ha = 'right', fontsize = 14)
    plt.yticks(fontsize = 14)
    
    # 显示图例
    plt.legend(title = "Models", loc = 'center left', bbox_to_anchor = (1, 0.5), frameon = False, fontsize = 12)
    
    # 显示网格线，增强可视化效果
    plt.grid(which = 'both', linestyle = '-', linewidth = 0.5, color = 'gray')
    
    # 保存图形到文件，如果提供了保存路径
    if save_path:
        plt.savefig(save_path, bbox_inches = 'tight')  # 确保内容不会被裁剪
    
    # 布局优化，避免标签重叠
    plt.tight_layout()
    plt.close()

# 调用该函数来绘制测试集的指标
plot_metrics(metrics_df_test, title = "Test Set Metrics", save_path = f"{wkdir}/综合评价.png")

训练集

# 计算每个模型在训练集上的评价指标
metrics_dict_train = {'Metrics': ['accuracy', 'sensitivity', 'specificity', 'Positive predictive value', 'Negative predictive value', 'F1 score', 'Kappa score']}

# 遍历模型，计算训练集上的评价指标
for model_name, model in models.items():
    # 预测训练集
    y_pred_train = model.predict(X_train)
    # 计算评价指标
    metrics_dict_train[model_name] = calculate_metrics(y_train, y_pred_train)

# 创建DataFrame存储所有模型的训练集评价指标
metrics_df_train = pd.DataFrame(metrics_dict_train)

metrics_df_train
Out[162]: 
                     Metrics       ANN  ...       SVM   XGBoost
0                   accuracy  0.849162  ...  0.720670  0.983240
1                sensitivity  0.771930  ...  0.315789  0.964912
2                specificity  0.885246  ...  0.909836  0.991803
3  Positive predictive value  0.758621  ...  0.620690  0.982143
4  Negative predictive value  0.771930  ...  0.315789  0.964912
5                   F1 score  0.765217  ...  0.418605  0.973451
6                Kappa score  0.654119  ...  0.259596  0.961208

[7 rows x 10 columns]

1 2	# 绘制训练集的评价指标 plot_metrics(metrics_df_train, title = "Training Set Metrics", save_path = f"{wkdir}/综合评价-Train.png")

针对最优模型进行可视化

# 预测类别的概率
probabilities = best_model_et.predict_proba(X_test)

# 创建一个 DataFrame，列名为类别
probability_df = pd.DataFrame(probabilities, columns = [f'Prob_Class_{i}' for i in range(probabilities.shape[1])])

# 如果需要，可以添加 X_test 的索引或其他标识列
probability_df.index = X_test.index

# 重置索引，并选择是否保留原索引作为一列
probability_df.reset_index(drop = True, inplace = True)

probability_df.head()
Out[168]: 
   Prob_Class_0  Prob_Class_1
0      0.870952      0.129048
1      0.448952      0.551048
2      0.857429      0.142571
3      0.318198      0.681802
4      0.381310      0.618690

df_selected = X_test

# 创建PCA对象，设置提取1个主成分
pca = PCA(n_components = 1)
# 进行PCA降维
pca_result = pca.fit_transform(df_selected)
# 获取解释方差比率（贡献度）
explained_variance = pca.explained_variance_ratio_
# 将PCA结果保存为DataFrame，并用贡献度标记列名
pca_df = pd.DataFrame(pca_result, columns = [f"PC1 ({explained_variance[0]*100:.2f}%)"])

pca_df.head()
Out[171]: 
   PC1 (97.93%)
0   -121.580604
1    -96.202631
2    898.441250
3   -437.343118
4   -519.324984

1
2
3

# 将 PCA 结果 (pca_df) 和其他数据合并
combined_df = pd.concat([probability_df, pca_df, y_test.reset_index(drop = True)], axis = 1)
combined_df.rename(columns = {y_test.name: 'True'}, inplace = True)

combined_df.head()
Out[174]: 
   Prob_Class_0  Prob_Class_1  PC1 (97.93%)  True
0      0.870952      0.129048   -121.580604     0
1      0.448952      0.551048    -96.202631     1
2      0.857429      0.142571    898.441250     0
3      0.318198      0.681802   -437.343118     0
4      0.381310      0.618690   -519.324984     0

colors = combined_df['True'].map({0: '#A0D6B4', 1: '#C3A6D8'})

# 绘制散点图，设置更大的点和黑色边框
plt.figure(figsize = (10, 6), dpi = 1200)
plt.scatter(combined_df['Prob_Class_1'], combined_df['PC1 (97.93%)'], 
            c=colors, edgecolor = 'black', s = 100)  # s = 100 增大点的大小，edgecolor = 'black' 添加黑色边框

# 添加分界线和轴标签
plt.axvline(x = 0.5, color = 'gray', linestyle = '--')
plt.xlabel('Predicted value for benign')
plt.ylabel('PC1 (97.93%)')
plt.title('Test set')
# 添加类别图例并移到图外
for true_value, color, label in [(0, '#A0D6B4', 'malignant'), (1, '#C3A6D8', 'benign')]:
    plt.scatter([], [], color = color, edgecolor = 'black', s = 100, label = label)
plt.legend(title = "Class", loc = "center left", bbox_to_anchor = (1, 0.5))
plt.tight_layout(rect = [0, 0, 0.85, 1])
plt.savefig(f'{wkdir}/分类效果-Test.png', bbox_inches = 'tight')
plt.close()

# 预测类别的概率
probabilities = best_model_et.predict_proba(X_train)
# 创建一个 DataFrame，列名为类别
probability_df = pd.DataFrame(probabilities, columns = [f'Prob_Class_{i}' for i in range(probabilities.shape[1])])
# 如果需要，可以添加 X_train 的索引或其他标识列
probability_df.index = X_train.index
# 重置索引，并选择是否保留原索引作为一列
probability_df.reset_index(drop = True, inplace = True)
df_selected = X_train  # 使用训练集数据
# 创建PCA对象，设置提取1个主成分
pca = PCA(n_components = 1)
# 进行PCA降维
pca_result = pca.fit_transform(df_selected)
# 获取解释方差比率（贡献度）
explained_variance = pca.explained_variance_ratio_
# 将PCA结果保存为DataFrame，并用贡献度标记列名
pca_df = pd.DataFrame(pca_result, columns = [f"PC1 ({explained_variance[0]*100:.2f}%)"])
# 将 PCA 结果 (pca_df) 和其他数据合并
combined_df = pd.concat([probability_df, pca_df, y_train.reset_index(drop = True)], axis = 1)
combined_df.rename(columns = {y_train.name: 'True'}, inplace = True)

colors = combined_df['True'].map({0: '#A0D6B4', 1: '#C3A6D8'})
# 绘制散点图，设置更大的点和黑色边框
plt.figure(figsize = (10, 6),dpi = 1200)
plt.scatter(combined_df['Prob_Class_1'], combined_df['PC1 (99.74%)'], 
            c = colors, edgecolor = 'black', s = 100)  # s = 100 增大点的大小，edgecolor = 'black' 添加黑色边框

# 添加分界线和轴标签
plt.axvline(x = 0.5, color = 'gray', linestyle = '--')
plt.xlabel('Predicted value for benign')
plt.ylabel('PC1 (99.74%)')
plt.title('Train set')
# 添加类别图例并移到图外
for true_value, color, label in [(0, '#A0D6B4', 'malignant'), (1, '#C3A6D8', 'benign')]:
    plt.scatter([], [], color = color, edgecolor = 'black', s = 100, label = label)
plt.legend(title = "Class", loc = "center left", bbox_to_anchor = (1, 0.5))
plt.tight_layout(rect = [0, 0, 0.85, 1])
plt.savefig(f'{wkdir}/分类效果-Train.png', bbox_inches = 'tight')
plt.close()

绘制 Precision-Recall 曲线

# 绘制 Precision-Recall 曲线，添加随机分类器基线
pr_display = PrecisionRecallDisplay.from_estimator(
    best_model_et, 
    X_test, 
    y_test, 
    plot_chance_level = True,  # 添加基线
    name = "AdaBoost"
)

_ = pr_display.ax_.set_title("Test set Precision-Recall curve")
plt.savefig(f'{wkdir}/Precision-Recall-Test.png', bbox_inches = 'tight', dpi = 1200)
plt.close()

pr_display = PrecisionRecallDisplay.from_estimator(
    best_model_et, 
    X_train, 
    y_train, 
    plot_chance_level = True,  # 添加基线
    name = "AdaBoost"
)
_ = pr_display.ax_.set_title("Train set Precision-Recall curve")
plt.savefig(f'{wkdir}/Precision-Recall-Train.png', bbox_inches = 'tight', dpi = 1200)
plt.close()

AdaBoost

# 计算训练集和测试集的预测概率
y_train_pred = best_model_et.predict_proba(X_train)[:, 1]
y_test_pred = best_model_et.predict_proba(X_test)[:, 1]

# 计算 ROC 曲线和 AUC 值
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred)
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred)

roc_auc_train = auc(fpr_train, tpr_train)
roc_auc_test = auc(fpr_test, tpr_test)

# 绘制 ROC 曲线
plt.figure(figsize=(10, 8))

# 绘制训练集 ROC 曲线
plt.plot(fpr_train, tpr_train, label = f"Train: AUC={roc_auc_train:.3f} (n={len(y_train)})", color = "#CECCE5", linewidth = 2)

# 绘制测试集 ROC 曲线
plt.plot(fpr_test, tpr_test, label = f"Test: AUC={roc_auc_test:.3f} (n={len(y_test)})", color = "#B0D9A5", linewidth = 2)

# 绘制随机分类的参考线
plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

# 图形细节设置
plt.title("ROC Curve - AdaBoost (Train & Test)", fontsize = 20, fontweight = "bold")
plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend(loc="lower right", fontsize = 12)

# 去除顶部和右侧边框
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_linewidth(1.5)
plt.gca().spines['bottom'].set_linewidth(1.5)

# 关闭网格线
plt.grid(False)
plt.tight_layout()

# 保存图像为 PDF
plt.savefig(f"{wkdir}/AdaBoost.png", bbox_inches = 'tight')
plt.close()

针对最优模型进行解释

explainer = shap.TreeExplainer(best_model_et)
shap_values = explainer.shap_values(X_test)
    
# 提取类别 0 的 SHAP 值
shap_values_class_0 = shap_values[:, :, 0]

# 提取类别 1 的 SHAP 值
shap_values_class_1 = shap_values[:, :, 1]

plt.figure(figsize = (10, 5), dpi = 1200)
shap.summary_plot(shap_values_class_1, X_test, plot_type = "bar", show = False)
plt.title('SHAP_numpy Sorted Feature Importance')
plt.tight_layout()
plt.savefig(f"{wkdir}/SHAP_numpy Sorted Feature Importance.png", bbox_inches = 'tight')
plt.close()

plt.figure()
shap.summary_plot(shap_values_class_0, X_test, feature_names = X_test.columns, plot_type = "dot", show = False)
plt.savefig(f"{wkdir}/SHAP_numpy Sorted Feature Importance.dot.png", bbox_inches = 'tight')
plt.close()

1	shap_values_df_1 = pd.DataFrame(shap_values_class_0, columns = X_test.columns)

shap_values_df_1.head()
Out[199]: 
       X_30      X_39      X_46  ...      X_33       X_9      X_28
0  0.028375 -0.005348  0.019653  ...  0.008910  0.163576 -0.054551
1  0.015743 -0.017699 -0.024495  ... -0.010931 -0.193108  0.022756
2  0.014259 -0.070168 -0.012669  ... -0.022715  0.143295  0.068182
3  0.004616  0.021329 -0.015542  ...  0.004962 -0.317984 -0.008150
4  0.033777  0.071995 -0.030324  ...  0.002056 -0.328638  0.004778

[5 rows x 8 columns]

features = shap_values_df_1.columns.tolist()  # 获取所有的特征名称

# 设置画布和子图结构（3行3列）
fig, axes = plt.subplots(3, 3, figsize = (10, 10), dpi = 1200)
axes = axes.flatten()

# 循环绘制每个特征的散点图
for i in range(len(axes)):
    if i < len(features):  # 如果还有特征未绘制
        feature = features[i]
        if feature in X_test.columns and feature in shap_values_df_1.columns:
            ax = axes[i]
            ax.scatter(X_test[feature], shap_values_df_1[feature], s = 10, color = "#6A9ACE")
            ax.axhline(y = 0, color = 'red', linestyle = '-.', linewidth = 1)  # 添加横线
            ax.set_xlabel(feature, fontsize = 10)
            ax.set_ylabel(f'SHAP value for\n{feature}', fontsize = 10)
            ax.spines['top'].set_visible(False)
            ax.spines['right'].set_visible(False)
        else:
            # 如果特征不存在，隐藏对应的子图
            axes[i].axis('off')
    else:
        # 如果超过了特征数量，关闭剩余的子图
        axes[i].axis('off')

# 调整布局
plt.tight_layout()
plt.savefig(f"{wkdir}/SHAP value for features.png", bbox_inches = 'tight')
plt.close()

以下针对预测正确的类别 0 和类别 1 分别绘制瀑布图[这里就绘制测试集第一个样本 (0) 和第二个样本都预测正确 (1)：

y_pred_et = best_model_et.predict(X_test)
explainer = shap.TreeExplainer(best_model_et)

# 计算shap值为Explanation格式
shap_values_Explanation = explainer(X_test)

# 提取类别 0 的 SHAP 值
shap_values_class_0 = shap_values_Explanation[:, :, 0]

# 提取类别 1 的 SHAP 值
shap_values_class_1 = shap_values_Explanation[:, :, 1]

y_pred_et
Out[203]: 
array([0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1])

预测类别 0 样本绘制

plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 1 个样本 0 类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_0[0], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 1 个样本 0 类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.close()

plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 1 个样本 1 类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_1[0], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 1 个样本 1 类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.show()

预测类别 1 样本绘制

plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 2 个样本0类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_0[1], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 2 个样本0类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.show()

plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 2 个样本1类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_1[1], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 2 个样本1类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.show()

完整代码

# -*- coding: utf-8 -*-

import shap
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
import seaborn as sns
import scipy.stats as stats
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import PrecisionRecallDisplay
from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import roc_curve, auc
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, cohen_kappa_score, confusion_matrix

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False
import warnings
warnings.filterwarnings("ignore")


# 混淆矩阵
def plot_confusion_matrix_plus(y_true, y_pred, title = "Model", labels = None, save_path = None):
    unique_labels = set(np.unique(np.concatenate([np.array(y_true), np.array(y_pred)])))
    labels_used = list(unique_labels)
    # 若未提供 labels，或者提供的 labels 与 y_true 不一致，则自动推断
    if labels is None:
        labels_with_totals = labels_used + ["Total"]
    else:
        labels_with_totals = [labels[lab] for lab in labels if lab in labels_used] + ["Total"]

    # 若最终只有单一标签（极端情况），也能正常绘制
    labels_used = list(labels_used)
    cm = confusion_matrix(y_true, y_pred, labels = labels_used)

    # 行标准化（避免除零）
    row_sums = cm.sum(axis = 1, keepdims = True)
    with np.errstate(divide = 'ignore', invalid = 'ignore'):
        cm_normalized = np.divide(cm, row_sums, where = (row_sums != 0))
        cm_normalized[np.isnan(cm_normalized)] = 0.0

    # 添加 Total 行和列（计数）
    cm_with_totals = np.vstack([cm, cm.sum(axis = 0, keepdims = True)])
    cm_with_totals = np.column_stack([cm_with_totals, cm_with_totals.sum(axis = 1, keepdims = True)])

    # 用于着色的矩阵（最后一行/列置 0，使其灰色覆盖）
    heatmap_data = cm_normalized.copy()
    heatmap_data = np.vstack([heatmap_data, np.zeros((1, heatmap_data.shape[1]))])
    heatmap_data = np.column_stack([heatmap_data, np.zeros((heatmap_data.shape[0], 1))])

    fig, ax = plt.subplots(figsize = (10, 8))
    colors = sns.color_palette("Greens", as_cmap = True)
    grey_color = "#f0f0f0"

    sns.heatmap(
        heatmap_data,
        annot = False,
        fmt = "",
        cmap = colors,
        xticklabels = labels_with_totals,
        yticklabels = labels_with_totals,
        cbar = False,
        square = True,
        linewidths = 1.5,
        linecolor = "white",
        ax = ax,
        vmin = 0.0, vmax = 1.0
    )

    # 覆盖 Total 行/列为灰色
    n_rows, n_cols = cm.shape
    for i in range(n_rows):
        ax.add_patch(plt.Rectangle((n_cols, i), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))
    for j in range(n_cols):
        ax.add_patch(plt.Rectangle((j, n_rows), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))
    ax.add_patch(plt.Rectangle((n_cols, n_rows), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))

    # 填充数值与百分比
    total_all = cm_with_totals[-1, -1]
    for i in range(cm_with_totals.shape[0]):
        for j in range(cm_with_totals.shape[1]):
            if i < n_rows and j < n_cols:
                value = cm[i, j]
                percentage = cm_normalized[i, j] * 100 if row_sums[i, 0] > 0 else 0.0
                ax.text(j + 0.5, i + 0.45, f"{percentage:.1f}%", ha = "center", va = "center", fontsize = 14, color = "black")
                ax.text(j + 0.5, i + 0.72, f"{int(value)}", ha = "center", va = "center", fontsize = 13, color = "black")
            elif i == n_rows and j == n_cols:
                ax.text(j + 0.5, i + 0.5, f"{int(total_all)}", ha = "center", va = "center", fontsize = 15, color = "black")
            else:
                total_value = cm_with_totals[i, j]
                total_percentage = (total_value / total_all * 100) if total_all > 0 else 0.0
                ax.text(j + 0.5, i + 0.45, f"{total_percentage:.1f}%", ha = "center", va = "center", fontsize = 14, color = "black")
                ax.text(j + 0.5, i + 0.72, f"{int(total_value)}", ha = "center", va = "center", fontsize = 13, color = "black")

    plt.title(title, fontsize = 20)
    plt.xlabel("Prediction (model output)", fontsize = 16)
    plt.ylabel("Truth (observation)", fontsize = 16)
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, bbox_inches = 'tight')
    plt.close()


if __name__ == '__main__':
    
    import os
    wkdir = 'C:/Users/Administrator/Desktop'
    os.chdir(wkdir)
    
    path = 'Z:/TData/big-data/sad41d8cd/251027_Medical_Classification_Model_Comparison.xlsx'
    df = pd.read_excel(path)
    
    if True:
        df.head()
        df.columns
        df.info()

    if True:
        
        # 划分特征和目标变量
        X = df.drop(['y'], axis = 1)         # 从数据集中去掉目标变量 'y'，得到特征变量 X
        y = df['y']                          # 提取目标变量 y
        
        # 划分训练集和测试集
        X_train, X_test, y_train, y_test = train_test_split(
            X,                               # 特征变量
            y,                               # 目标变量
            test_size = 0.3,                 # 测试集所占比例，这里为 30%
            random_state = 42,               # 随机种子，确保结果可重复
            stratify = df['y']               # 按目标变量 'y' 的分布进行分层采样，确保训练集和测试集中目标变量的分布一致
        )
        
    # 特征筛选
    if True:
        
        # 创建 LGBM 分类器
        lgbm_clf = lgb.LGBMClassifier(random_state = 42, verbose = -1)
        
        # 训练模型
        lgbm_clf.fit(X_train, y_train)
        
        # 获取特征重要性
        feature_importances = lgbm_clf.feature_importances_
        
        # 构建特征重要性排名
        feature_importance_df = pd.DataFrame({
            'Feature': X.columns,
            'Importance': feature_importances
        }).sort_values(by = 'Importance', ascending = False)
        
        
        # 只取前30个重要特征
        top_n = 30
        top_features = feature_importance_df.head(top_n)
        
        # 调整字体大小
        plt.figure(figsize = (12, 8), dpi = 1200)  
        plt.barh(top_features['Feature'], top_features['Importance'], color = 'skyblue')
        plt.xlabel('Importance', fontsize = 14)  
        plt.ylabel('Feature', fontsize = 14)  
        plt.title(f'Top {top_n} Feature Importance', fontsize = 16)  
        plt.xticks(fontsize = 12)  
        plt.yticks(fontsize = 12)  
        plt.gca().invert_yaxis()  
        plt.savefig(f'{wkdir}/特征筛选.png', bbox_inches = 'tight')
        plt.close()
        
    if True:
        
        # 初始化存储结果的 DataFrame
        selection_results = pd.DataFrame(columns = ['Feature', 'Importance', 'Mean_ROC'])
        
        # 初始化用于训练的特征列表
        selected_features = []
        
        # K 折交叉验证
        kf = KFold(n_splits = 5, shuffle = True, random_state = 42)
        
        # 获取折数
        n_splits = kf.get_n_splits()
        
        # 动态创建列名
        fold_columns = [f'Fold_{i+1}_ROC' for i in range(n_splits)]
        
        # 依次添加特征
        for i in range(len(top_features)):
            # 当前特征
            current_feature = top_features.iloc[i]['Feature']
            selected_features.append(current_feature)
        
            fold_roc_scores = []  # 存储每折的 ROC AUC 分数
        
            # K 折交叉验证
            for train_idx, val_idx in kf.split(X_train):
                # 划分训练集和验证集
                X_train_fold, X_val_fold = X_train.iloc[train_idx][selected_features], X_train.iloc[val_idx][selected_features]
                y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]
        
                # 创建并训练 LGBM 模型
                lgbm_clf = lgb.LGBMClassifier(random_state = 42, verbose = -1)
                lgbm_clf.fit(X_train_fold, y_train_fold)
        
                # 预测并计算 ROC AUC 分数
                y_val_proba = lgbm_clf.predict_proba(X_val_fold)[:, 1]  # 概率分数
                fold_roc_score = roc_auc_score(y_val_fold, y_val_proba)
                fold_roc_scores.append(fold_roc_score)
        
            # 计算平均 ROC AUC 分数
            mean_roc_score = np.mean(fold_roc_scores)
        
            # 保存结果
            row_data = {
                'Feature': current_feature,
                'Importance': top_features.iloc[i]['Importance'],
                'Mean_ROC': mean_roc_score,
            }
        
            # 添加每折的ROC AUC分数
            for j, score in enumerate(fold_roc_scores):
                row_data[fold_columns[j]] = score
            row_df = pd.DataFrame([row_data])  
            selection_results = pd.concat([selection_results, row_df], ignore_index = True)
        
        selection_results
        
    if True:
        
        # 将 Importance 列百分比化并归一化
        selection_results['Importance'] = (
            selection_results['Importance'] / selection_results['Importance'].sum()
        )
        selection_results
        
    if True:
        
        # 动态获取折列名称
        fold_columns = [col for col in selection_results.columns if 'Fold_' in col]
        
        # 添加置信区间列到 selection_results
        selection_results['CI_Lower'] = None
        selection_results['CI_Upper'] = None
        
        # 遍历每行计算置信区间
        for index, row in selection_results.iterrows():
            
            # 提取每折的ROC成绩
            fold_roc_scores = [row[fold] for fold in fold_columns]
        
            # 样本大小
            n_folds = len(fold_roc_scores)
        
            # 平均值和标准误差
            mean_roc = row['Mean_ROC']
            std_err = stats.sem(fold_roc_scores)  # 标准误差
        
            # t值（95%置信区间，n_folds - 1 自由度）
            t_value = stats.t.ppf(0.975, df=n_folds - 1)
        
            # 置信区间
            ci_lower = mean_roc - t_value * std_err
            ci_upper = mean_roc + t_value * std_err
        
            # 确保置信区间上限不超过1
            ci_upper = min(ci_upper, 1)
        
            # 保存到 DataFrame
            selection_results.at[index, 'CI_Lower'] = ci_lower
            selection_results.at[index, 'CI_Upper'] = ci_upper
        
        # 输出结果
        selection_results
        
    if True:
        
        # 确保 Importance 列为数值类型
        selection_results['Importance'] = pd.to_numeric(selection_results['Importance'], errors = 'coerce')
        
        # 确保置信区间的列为数值类型
        selection_results['CI_Lower'] = pd.to_numeric(selection_results['CI_Lower'], errors = 'coerce')
        selection_results['CI_Upper'] = pd.to_numeric(selection_results['CI_Upper'], errors = 'coerce')
        
        # 参数：选择前 n 个特征
        n_features = 7
        
        fig, ax1 = plt.subplots(figsize = (16, 6), dpi = 1200)
        
        # 渐变柱状图：特征贡献度
        norm = plt.Normalize(selection_results['Importance'].min(), selection_results['Importance'].max())
        colors = plt.cm.Blues(norm(selection_results['Importance']))
        
        ax1.bar(selection_results['Feature'], selection_results['Importance'], color = colors, label = 'Feature Importance')
        ax1.set_xlabel("Features", fontsize = 18, fontweight = 'bold')
        ax1.set_ylabel("Feature Importance", fontsize = 18, fontweight = 'bold')
        ax1.tick_params(axis = 'y', labelsize = 12, width = 1.5)
        
        # 修改 x 轴特征颜色，前 n_features 用红色，其他用黑色
        x_labels = selection_results['Feature']
        x_colors = ['red' if i < n_features else 'black' for i in range(len(x_labels))]
        for tick_label, color in zip(ax1.get_xticklabels(), x_colors):
            tick_label.set_color(color)
        
        ax1.tick_params(axis = 'x', rotation = 45, labelsize = 12, width = 1.5)
        
        # 折线图：AUC成绩
        ax2 = ax1.twinx()
        
        # 红点和红线：前 n_features 个特征
        ax2.plot(
            selection_results['Feature'][:n_features + 1],  # 连接红点到黑点的过渡
            selection_results['Mean_ROC'][:n_features + 1],
            color = "red", marker = 'o', linestyle = '-', label = "Mean AUC (Top Features)"
        )
        
        # 黑点和黑线：其余特征
        ax2.plot(
            selection_results['Feature'][n_features:],
            selection_results['Mean_ROC'][n_features:],
            color = "black", marker = 'o', linestyle = '-', label = "Mean AUC (Other Features)"
        )
        
        # 添加置信区间阴影
        ax2.fill_between(
            selection_results['Feature'],
            selection_results['CI_Lower'],  # 置信区间下限
            selection_results['CI_Upper'],  # 置信区间上限
            color = 'red',
            alpha = 0.2,                    # 设置透明度
        )
        
        ax2.set_ylabel("Mean AUC", fontsize = 18, fontweight = 'bold')
        ax2.tick_params(axis = 'y', labelsize = 12, width = 1.5)
        ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.3f}'))      # 保留三位小数
        
        # 添加标题
        plt.title(f"Feature Contribution and AUC Performance (Top {n_features} Features Highlighted)", fontsize = 18, fontweight = 'bold')
        fig.tight_layout()
        plt.savefig(f"{wkdir}/特征曲线.png", bbox_inches = 'tight')
        plt.close()
                
        list(selection_results['Feature'][0:8])
        
    # 重新划分数据集
    if True:
        
        # 划分特征和目标变量
        X = df[['X_30', 'X_39', 'X_46', 'X_32', 'X_34', 'X_33', 'X_9', 'X_28']] # 根据特征筛选变量重新划分特征
        y = df['y']  # 提取目标变量 y
        
        # 划分训练集和测试集
        X_train, X_test, y_train, y_test = train_test_split(
            X,                   # 特征变量
            y,                   # 目标变量
            test_size = 0.3,     # 测试集所占比例，这里为 30%
            random_state = 42,   # 随机种子，确保结果可重复
            stratify = df['y']   # 按目标变量 'y' 的分布进行分层采样，确保训练集和测试集中目标变量的分布一致
        )
        
    # Artificial Neural Network
    if True:

        # 定义 ANN 的基础参数
        params_ann = {
            'random_state': 42,               # 随机种子
            'max_iter': 200                   # 最大迭代次数
        }
        
        # 初始化 ANN 分类模型
        model_ann = MLPClassifier(**params_ann)
        
        # 定义参数网格
        param_grid_ann = {
        'hidden_layer_sizes': [(50,), (100,), (100, 50)],  # 隐藏层大小
        'learning_rate_init': [0.001, 0.01, 0.1],          # 学习率
        'alpha': [0.0001, 0.001, 0.01]                     # 正则化参数
        }
        
        # 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
        grid_search_ann = GridSearchCV(
        estimator = model_ann,
        param_grid = param_grid_ann,
        scoring = 'neg_log_loss',  # 评价指标为负对数损失
        cv = 5,                    # 5 折交叉验证
        n_jobs = -1,               # 并行计算
        verbose = 1                # 输出详细进度信息
        )
        
        # 训练模型
        grid_search_ann.fit(X_train, y_train)
        
        # 使用最优参数训练模型
        best_model_ann = grid_search_ann.best_estimator_
                
        # 预测测试集
        y_pred = best_model_ann.predict(X_test)
        
        # 输出模型报告，查看评价指标
        print(classification_report(y_test, y_pred))   
        
        # 绘制 AUC 曲线
        plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)      
        
    # DT
    if True:

        # 定义 DT 的基础参数
        params_dt = {'random_state': 42}
        
        # 初始化决策树分类模型
        model_dt = DecisionTreeClassifier(**params_dt)
        
        # 定义参数网格
        param_grid_dt = {
            'criterion': ['gini', 'entropy'],    # 划分标准
            'max_depth': [None, 10, 20, 30],     # 最大深度
            'min_samples_split': [2, 5, 10],     # 节点最小分裂样本数
            'min_samples_leaf': [1, 2, 5]        # 叶节点最小样本数
        }
        
        # 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
        grid_search_dt = GridSearchCV(
            estimator = model_dt,
            param_grid = param_grid_dt,
            scoring = 'neg_log_loss',  # 评价指标为负对数损失
            cv = 5,                    # 5 折交叉验证
            n_jobs = -1,               # 并行计算
            verbose = 1                # 输出详细进度信息
        )
        
        # 训练模型
        grid_search_dt.fit(X_train, y_train)
        
        # 使用最优参数训练模型
        best_model_dt = grid_search_dt.best_estimator_ 
                
        # 预测测试集
        y_pred = best_model_dt.predict(X_test)
        
        # 输出模型报告，查看评价指标
        print(classification_report(y_test, y_pred))
        
        # 绘制 AUC 曲线
        plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
        
    # ET
    if True:

        # 定义 ET 的基础参数
        params_et = {
            'random_state': 42  # 随机种子
        }
        
        # 初始化 Extra Trees 分类模型
        model_et = ExtraTreesClassifier(**params_et)
        
        # 定义参数网格
        param_grid_et = {
            'n_estimators': [50, 100, 200],      # 树的数量
            'criterion': ['gini', 'entropy'],    # 划分标准
            'max_depth': [None, 10, 20, 30],     # 最大深度
            'min_samples_split': [2, 5, 10],     # 节点最小分裂样本数
            'min_samples_leaf': [1, 2, 5]        # 叶节点最小样本数
        }
        
        # 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
        grid_search_et = GridSearchCV(
            estimator = model_et,
            param_grid = param_grid_et,
            scoring = 'neg_log_loss',            # 评价指标为负对数损失
            cv = 5,                              # 5 折交叉验证
            n_jobs = -1,                         # 并行计算
            verbose = 1                          # 输出详细进度信息
        )
        
        # 训练模型
        grid_search_et.fit(X_train, y_train)
        
        # 使用最优参数训练模型
        best_model_et = grid_search_et.best_estimator_
                
        # 预测测试集
        y_pred = best_model_et.predict(X_test)
        
        # 输出模型报告，查看评价指标
        print(classification_report(y_test, y_pred))   
                
        # 绘制 AUC 曲线
        plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
        
    # GBM
    if True:

        # 定义 GBM 的基础参数
        params_gbm = {'random_state': 42}
        
        # 初始化梯度增强机分类模型
        model_gbm = GradientBoostingClassifier(**params_gbm)
        
        # 定义参数网格
        param_grid_gbm = {
            'n_estimators': [50, 100, 200],      # 树的数量
            'learning_rate': [0.01, 0.1, 0.2],   # 学习率
            'max_depth': [3, 5, 10],             # 最大深度
            'min_samples_split': [2, 5, 10],     # 节点最小分裂样本数
            'min_samples_leaf': [1, 2, 5]        # 叶节点最小样本数
        }
        
        # 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
        grid_search_gbm = GridSearchCV(
            estimator = model_gbm,
            param_grid = param_grid_gbm,
            scoring = 'neg_log_loss',            # 评价指标为负对数损失
            cv = 5,                              # 5 折交叉验证
            n_jobs = -1,                         # 并行计算
            verbose = 1                          # 输出详细进度信息
        )
        
        # 训练模型
        grid_search_gbm.fit(X_train, y_train)
        
        # 使用最优参数训练模型
        best_model_gbm = grid_search_gbm.best_estimator_
                
        # 预测测试集
        y_pred = best_model_gbm.predict(X_test)
        
        # 输出模型报告，查看评价指标
        print(classification_report(y_test, y_pred))
                        
        # 绘制 AUC 曲线
        plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

    # KNN
    if True:
       
        # 初始化 KNN 分类模型
        model_knn = KNeighborsClassifier()
        
        # 定义参数网格
        param_grid_knn = {
            'n_neighbors': [3, 5, 10],                          # 邻居数量
            'weights': ['uniform', 'distance'],                 # 权重方式
            'metric': ['euclidean', 'manhattan', 'minkowski']   # 距离度量方式
        }
        
        # 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
        grid_search_knn = GridSearchCV(
            estimator = model_knn,
            param_grid = param_grid_knn,
            scoring = 'neg_log_loss',  # 评价指标为负对数损失
            cv = 5,                    # 5 折交叉验证
            n_jobs = -1,               # 并行计算
            verbose = 1                # 输出详细进度信息
        )
        
        # 训练模型
        grid_search_knn.fit(X_train, y_train)
        
        # 使用最优参数训练模型
        best_model_knn = grid_search_knn.best_estimator_
                
        # 预测测试集
        y_pred = best_model_knn.predict(X_test)
        
        # 输出模型报告，查看评价指标
        print(classification_report(y_test, y_pred))
        
        # 绘制 AUC 曲线
        plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
        
    # LightGBM
    if True:

        # 初始化 LightGBM 分类模型
        model_lgbm = LGBMClassifier(random_state = 42, verbose = -1)
        
        # 定义参数网格
        param_grid_lgbm = {
            'n_estimators': [50, 100, 200],       # 树的数量
            'learning_rate': [0.01, 0.1, 0.2],    # 学习率
            'max_depth': [-1, 10, 20],            # 最大深度
            'num_leaves': [31, 50, 100],          # 叶节点数
            'min_child_samples': [10, 20, 30]    # 最小叶节点样本数
        }
        
        # 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
        grid_search_lgbm = GridSearchCV(
            estimator = model_lgbm,
            param_grid = param_grid_lgbm,
            scoring = 'neg_log_loss',  # 评价指标为负对数损失
            cv = 5,                    # 5 折交叉验证
            n_jobs = -1,               # 并行计算
            verbose = 1                # 输出详细进度信息
        )
        
        # 训练模型
        grid_search_lgbm.fit(X_train, y_train)
        
        # 使用最优参数训练模型
        best_model_lgbm = grid_search_lgbm.best_estimator_
                
        # 预测测试集
        y_pred = best_model_lgbm.predict(X_test)
        
        # 输出模型报告，查看评价指标
        print(classification_report(y_test, y_pred))    
                
        # 绘制 AUC 曲线
        plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)       
        
    # Random Forest
    if True:
        
        # 初始化随机森林分类模型
        model_rf = RandomForestClassifier(random_state = 42)
        
        # 定义参数网格
        param_grid_rf = {
            'n_estimators': [50, 100, 200],       # 树的数量
            'max_depth': [None, 10, 20, 30],      # 最大深度
            'min_samples_split': [2, 5, 10],      # 节点最小分裂样本数
            'min_samples_leaf': [1, 2, 5],        # 叶节点最小样本数
            'criterion': ['gini', 'entropy']      # 划分标准
        }
        
        # 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
        grid_search_rf = GridSearchCV(
            estimator = model_rf,
            param_grid = param_grid_rf,
            scoring = 'neg_log_loss',  # 评价指标为负对数损失
            cv = 5,                    # 5 折交叉验证
            n_jobs = -1,               # 并行计算
            verbose = 1                # 输出详细进度信息
        )
        
        # 训练模型
        grid_search_rf.fit(X_train, y_train)
        
        # 使用最优参数训练模型
        best_model_rf = grid_search_rf.best_estimator_
            
        # 预测测试集
        y_pred = best_model_rf.predict(X_test)
        
        # 输出模型报告，查看评价指标
        print(classification_report(y_test, y_pred))
        
        # 绘制 AUC 曲线
        plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
 
    # Support Vector Machine
    if True:

        # 初始化支持向量机分类模型
        model_svm = SVC(probability = True, random_state = 42)
        
        # 定义参数网格
        param_grid_svm = {
            'C': [0.1, 1, 10, 100],             # 惩罚参数
        }
        
        # 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
        grid_search_svm = GridSearchCV(
            estimator = model_svm,
            param_grid = param_grid_svm,
            scoring = 'neg_log_loss',  # 评价指标为负对数损失
            cv = 5,                    # 5 折交叉验证
            n_jobs = -1,               # 并行计算
            verbose = 1                # 输出详细进度信息
        )
        
        # 训练模型
        grid_search_svm.fit(X_train, y_train)
        
        # 使用最优参数训练模型
        best_model_svm = grid_search_svm.best_estimator_
            
        # 预测测试集
        y_pred = best_model_svm.predict(X_test)
        
        # 输出模型报告，查看评价指标
        print(classification_report(y_test, y_pred))      
                
        # 绘制 AUC 曲线
        plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)      
       
    # eXtreme Gradient Boosting, XGBoost
    if True:

        # 初始化 XGBoost 分类模型
        model_xgb = XGBClassifier(use_label_encoder = False, eval_metric = 'logloss', random_state = 42)
        
        # 定义参数网格
        param_grid_xgb = {
            'n_estimators': [50, 100, 200],       # 树的数量
            'learning_rate': [0.01, 0.1, 0.2],    # 学习率
            'max_depth': [3, 5, 10],              # 最大深度
            'subsample': [0.8, 1.0],              # 子采样比率
            'colsample_bytree': [0.8, 1.0]        # 树的列采样比率
        }
        
        # 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
        grid_search_xgb = GridSearchCV(
            estimator = model_xgb,
            param_grid = param_grid_xgb,
            scoring = 'neg_log_loss',  # 评价指标为负对数损失
            cv = 5,                    # 5 折交叉验证
            n_jobs = -1,               # 并行计算
            verbose = 1                # 输出详细进度信息
        )
        
        # 训练模型
        grid_search_xgb.fit(X_train, y_train)
        
        # 使用最优参数训练模型
        best_model_xgb = grid_search_xgb.best_estimator_
        
        # 预测测试集
        y_pred = best_model_xgb.predict(X_test)
        
        # 输出模型报告，查看评价指标
        print(classification_report(y_test, y_pred))
            
        # 绘制 AUC 曲线
        plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
    
    # Test AUC
    if True:
        
        # 获取所有模型及其颜色的映射
        models = {
            "Random Forest": {"model": best_model_rf, "color": "#FDD379"},
            "LightGBM": {"model": best_model_lgbm, "color": "#F7C2CD"},
            "Gradient Boosting": {"model": best_model_gbm, "color": "#A6DAEF"},
            "XGBoost": {"model": best_model_xgb, "color": "#B0D9A5"},
            "SVM": {"model": best_model_svm, "color": "#F06292"},
            "KNN": {"model": best_model_knn, "color": "#64B5F6"},
            "Decision Tree": {"model": best_model_dt, "color": "#FFB74D"},
            "Extra Trees": {"model": best_model_et, "color": "#9575CD"},
            "ANN": {"model": best_model_ann, "color": "#4DB6AC"}
        }
        
        # 绘制 ROC 曲线
        plt.figure(figsize = (12, 10))
        
        # 遍历每个模型计算 ROC 和 AUC
        for name, info in models.items():
            model = info["model"]
            color = info["color"]
            try:
                y_pred = model.predict_proba(X_test)[:, 1]  # 获取概率分布
            except AttributeError:
                y_pred = model.decision_function(X_test)  # 对于 SVM，使用 decision_function
                y_pred = (y_pred - y_pred.min()) / (y_pred.max() - y_pred.min())  # 归一化
        
            # 计算 ROC 曲线和 AUC 值
            fpr, tpr, _ = roc_curve(y_test, y_pred)
            roc_auc = auc(fpr, tpr)
        
            # 绘制曲线
            plt.plot(fpr, tpr, label = f"{name}: AUC = {roc_auc:.3f}", color = color, linewidth = 2)
        
        # 绘制随机分类的参考线
        plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)
        
        # 图形细节设置
        plt.title("ROC Curve - Test Set Model Comparison", fontsize = 20, fontweight = "bold")
        plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
        plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
        plt.xticks(fontsize = 16)
        plt.yticks(fontsize = 16)
        plt.legend(loc = "lower right", fontsize = 12)
        
        # 去除顶部和右侧边框
        plt.gca().spines['top'].set_visible(False)
        plt.gca().spines['right'].set_visible(False)
        plt.gca().spines['left'].set_linewidth(1.5)
        plt.gca().spines['bottom'].set_linewidth(1.5)
        
        # 关闭网格线
        plt.grid(False)
        plt.tight_layout()
        plt.savefig(f"{wkdir}/Test-AUC.png", bbox_inches = 'tight')
        plt.close()
    
    # Train AUC
    if True:
        
        # 绘制训练集 ROC 曲线
        plt.figure(figsize = (12, 10))
        
        # 遍历每个模型计算 ROC 和 AUC（训练集）
        for name, info in models.items():
            model = info["model"]
            color = info["color"]
            try:
                y_pred = model.predict_proba(X_train)[:, 1]  # 获取概率分布
            except AttributeError:
                y_pred = model.decision_function(X_train)  # 对于 SVM，使用 decision_function
                y_pred = (y_pred - y_pred.min()) / (y_pred.max() - y_pred.min())  # 归一化
        
            # 计算 ROC 曲线和 AUC 值
            fpr, tpr, _ = roc_curve(y_train, y_pred)
            roc_auc = auc(fpr, tpr)
        
            # 绘制曲线
            plt.plot(fpr, tpr, label=f"{name} (Train AUC = {roc_auc:.3f})", color = color, linewidth = 2)
        
        # 绘制随机分类的参考线
        plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)
        
        # 图形细节设置
        plt.title("ROC Curve - Training Set Model Comparison", fontsize = 20, fontweight = "bold")
        plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
        plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
        plt.xticks(fontsize = 16)
        plt.yticks(fontsize = 16)
        plt.legend(loc="lower right", fontsize = 12)
        
        # 去除顶部和右侧边框
        plt.gca().spines['top'].set_visible(False)
        plt.gca().spines['right'].set_visible(False)
        plt.gca().spines['left'].set_linewidth(1.5)
        plt.gca().spines['bottom'].set_linewidth(1.5)
        
        # 关闭网格线
        plt.grid(False)
        plt.tight_layout()
        plt.savefig(f"{wkdir}/Train-AUC.png", bbox_inches = 'tight')
        plt.close()
    
    # 训练集测试集评价指标系统输出比较
    if True:

        # 定义函数计算评价指标
        def calculate_metrics(y_true, y_pred):
            accuracy = accuracy_score(y_true, y_pred)
            sensitivity = recall_score(y_true, y_pred)  # 敏感性（召回率）
            specificity = confusion_matrix(y_true, y_pred)[0, 0] / (confusion_matrix(y_true, y_pred)[0, 0] + confusion_matrix(y_true, y_pred)[0, 1])
            positive_predictive_value = precision_score(y_true, y_pred)
            negative_predictive_value = confusion_matrix(y_true, y_pred)[1, 1] / (confusion_matrix(y_true, y_pred)[1, 1] + confusion_matrix(y_true, y_pred)[1, 0])
            f1 = f1_score(y_true, y_pred)
            kappa = cohen_kappa_score(y_true, y_pred)
            return [accuracy, sensitivity, specificity, positive_predictive_value, negative_predictive_value, f1, kappa]
        
        # 模型字典
        models = {
            'ANN': best_model_ann,
            'Decision Tree': best_model_dt,
            'Extra Trees': best_model_et,
            'Gradient Boosting': best_model_gbm,
            'KNN': best_model_knn,
            'LightGBM': best_model_lgbm,
            'Random Forest': best_model_rf,
            'SVM': best_model_svm,
            'XGBoost': best_model_xgb
        }
        
        # 初始化存储模型评价指标的列表
        metrics_dict = {'Metrics': ['accuracy', 'sensitivity', 'specificity', 'Positive predictive value', 'Negative predictive value', 'F1 score', 'Kappa score']}
        
        # 计算每个模型的指标并存储
        for model_name, model in models.items():
            # 预测测试集
            y_pred = model.predict(X_test)
            # 计算评价指标
            metrics_dict[model_name] = calculate_metrics(y_test, y_pred)
        
        # 创建DataFrame存储所有模型的评价指标
        metrics_df_test = pd.DataFrame(metrics_dict)
        metrics_df_test
    
    # 可视化展示
    def plot_metrics(metrics_df, title = "Metrics Plot", save_path = None):
        
        # 获取一个颜色映射（确保颜色的多样性）
        colors = plt.cm.get_cmap('tab20', len(metrics_df.columns) - 1)  # 根据模型的数量选择足够的颜色
        plt.figure(figsize = (10, 8), dpi = 1200)
        
        # 遍历每一列模型，忽略 'Metrics' 列
        for idx, model in enumerate(metrics_df.columns[1:]):
            plt.plot(metrics_df['Metrics'], metrics_df[model], marker = 'o', label = model, color = colors(idx))
        
        ax = plt.gca()
        
        # 优化坐标轴边框设置
        ax.spines['right'].set_visible(False)
        ax.spines['top'].set_visible(False)
        ax.yaxis.set_major_locator(ticker.MultipleLocator(0.25)) 
        ax.yaxis.set_minor_locator(ticker.MultipleLocator(0.125))  
        ax.yaxis.set_minor_formatter(ticker.NullFormatter())  
        
        # 设置图表标题和标签
        plt.title(title, fontsize = 18, fontweight = "bold")
        plt.ylim(0.1, 1.0)
        plt.yticks([0.25, 0.50, 0.75, 1.00])  # 设置主要刻度
        plt.xticks(rotation = 45, ha = 'right', fontsize = 14)
        plt.yticks(fontsize = 14)
        
        # 显示图例
        plt.legend(title = "Models", loc = 'center left', bbox_to_anchor = (1, 0.5), frameon = False, fontsize = 12)
        
        # 显示网格线，增强可视化效果
        plt.grid(which = 'both', linestyle = '-', linewidth = 0.5, color = 'gray')
        
        # 保存图形到文件，如果提供了保存路径
        if save_path:
            plt.savefig(save_path, bbox_inches = 'tight')  # 确保内容不会被裁剪
        
        # 布局优化，避免标签重叠
        plt.tight_layout()
        plt.close()
    
    # 调用该函数来绘制测试集的指标
    plot_metrics(metrics_df_test, title = "Test Set Metrics", save_path = f"{wkdir}/综合评价-Test.png")
    
    # 训练集测试集评价指标系统输出比较
    if True:
    
        # 计算每个模型在训练集上的评价指标
        metrics_dict_train = {'Metrics': ['accuracy', 'sensitivity', 'specificity', 'Positive predictive value', 'Negative predictive value', 'F1 score', 'Kappa score']}
        
        # 遍历模型，计算训练集上的评价指标
        for model_name, model in models.items():
            # 预测训练集
            y_pred_train = model.predict(X_train)
            # 计算评价指标
            metrics_dict_train[model_name] = calculate_metrics(y_train, y_pred_train)
        
        # 创建DataFrame存储所有模型的训练集评价指标
        metrics_df_train = pd.DataFrame(metrics_dict_train)
        
        # 绘制训练集的评价指标
        plot_metrics(metrics_df_train, title = "Training Set Metrics", save_path = f"{wkdir}/综合评价-Train.png")  
    
    # 针对最优模型进行可视化
    if True:
        
        # 预测类别的概率
        probabilities = best_model_et.predict_proba(X_test)
        
        # 创建一个 DataFrame，列名为类别
        probability_df = pd.DataFrame(probabilities, columns = [f'Prob_Class_{i}' for i in range(probabilities.shape[1])])
        
        # 如果需要，可以添加 X_test 的索引或其他标识列
        probability_df.index = X_test.index
        
        # 重置索引，并选择是否保留原索引作为一列
        probability_df.reset_index(drop = True, inplace = True)

    if True:
        
        df_selected = X_test
        
        # 创建PCA对象，设置提取1个主成分
        pca = PCA(n_components = 1)
        # 进行PCA降维
        pca_result = pca.fit_transform(df_selected)
        # 获取解释方差比率（贡献度）
        explained_variance = pca.explained_variance_ratio_
        # 将PCA结果保存为DataFrame，并用贡献度标记列名
        pca_df = pd.DataFrame(pca_result, columns = [f"PC1 ({explained_variance[0]*100:.2f}%)"])
        
        pca_df.head()
        
    if True:
        
        # 将 PCA 结果 (pca_df) 和其他数据合并
        combined_df = pd.concat([probability_df, pca_df, y_test.reset_index(drop = True)], axis = 1)
        combined_df.rename(columns = {y_test.name: 'True'}, inplace = True)
      
    if True: 
        
        colors = combined_df['True'].map({0: '#A0D6B4', 1: '#C3A6D8'})
        
        # 绘制散点图，设置更大的点和黑色边框
        plt.figure(figsize = (10, 6), dpi = 1200)
        plt.scatter(combined_df['Prob_Class_1'], combined_df['PC1 (97.93%)'], 
                    c=colors, edgecolor = 'black', s = 100)  # s = 100 增大点的大小，edgecolor = 'black' 添加黑色边框
        
        # 添加分界线和轴标签
        plt.axvline(x = 0.5, color = 'gray', linestyle = '--')
        plt.xlabel('Predicted value for benign')
        plt.ylabel('PC1 (97.93%)')
        plt.title('Test set')
        # 添加类别图例并移到图外
        for true_value, color, label in [(0, '#A0D6B4', 'malignant'), (1, '#C3A6D8', 'benign')]:
            plt.scatter([], [], color = color, edgecolor = 'black', s = 100, label = label)
        plt.legend(title = "Class", loc = "center left", bbox_to_anchor = (1, 0.5))
        plt.tight_layout(rect = [0, 0, 0.85, 1])
        plt.savefig(f'{wkdir}/分类效果-Test.png', bbox_inches = 'tight')
        plt.close()
                
    if True: 
        
        # 预测类别的概率
        probabilities = best_model_et.predict_proba(X_train)
        # 创建一个 DataFrame，列名为类别
        probability_df = pd.DataFrame(probabilities, columns = [f'Prob_Class_{i}' for i in range(probabilities.shape[1])])
        # 如果需要，可以添加 X_train 的索引或其他标识列
        probability_df.index = X_train.index
        # 重置索引，并选择是否保留原索引作为一列
        probability_df.reset_index(drop = True, inplace = True)
        df_selected = X_train  # 使用训练集数据
        # 创建PCA对象，设置提取1个主成分
        pca = PCA(n_components = 1)
        # 进行PCA降维
        pca_result = pca.fit_transform(df_selected)
        # 获取解释方差比率（贡献度）
        explained_variance = pca.explained_variance_ratio_
        # 将PCA结果保存为DataFrame，并用贡献度标记列名
        pca_df = pd.DataFrame(pca_result, columns = [f"PC1 ({explained_variance[0]*100:.2f}%)"])
        # 将 PCA 结果 (pca_df) 和其他数据合并
        combined_df = pd.concat([probability_df, pca_df, y_train.reset_index(drop = True)], axis = 1)
        combined_df.rename(columns = {y_train.name: 'True'}, inplace = True)
        
        colors = combined_df['True'].map({0: '#A0D6B4', 1: '#C3A6D8'})
        # 绘制散点图，设置更大的点和黑色边框
        plt.figure(figsize = (10, 6),dpi = 1200)
        plt.scatter(combined_df['Prob_Class_1'], combined_df['PC1 (99.74%)'], 
                    c = colors, edgecolor = 'black', s = 100)  # s = 100 增大点的大小，edgecolor = 'black' 添加黑色边框
        
        # 添加分界线和轴标签
        plt.axvline(x = 0.5, color = 'gray', linestyle = '--')
        plt.xlabel('Predicted value for benign')
        plt.ylabel('PC1 (99.74%)')
        plt.title('Train set')
        # 添加类别图例并移到图外
        for true_value, color, label in [(0, '#A0D6B4', 'malignant'), (1, '#C3A6D8', 'benign')]:
            plt.scatter([], [], color = color, edgecolor = 'black', s = 100, label = label)
        plt.legend(title = "Class", loc = "center left", bbox_to_anchor = (1, 0.5))
        plt.tight_layout(rect = [0, 0, 0.85, 1])
        plt.savefig(f'{wkdir}/分类效果-Train.png', bbox_inches = 'tight')
        plt.close()
                
        
    # 绘制 Precision-Recall 曲线，添加随机分类器基线
    if True:
        
        pr_display = PrecisionRecallDisplay.from_estimator(
            best_model_et, 
            X_test, 
            y_test, 
            plot_chance_level = True,  # 添加基线
            name = "AdaBoost"
        )
        
        _ = pr_display.ax_.set_title("Test set Precision-Recall curve")
        plt.savefig(f'{wkdir}/Precision-Recall-Test.png', bbox_inches = 'tight', dpi = 1200)
        plt.close()

        pr_display = PrecisionRecallDisplay.from_estimator(
            best_model_et, 
            X_train, 
            y_train, 
            plot_chance_level = True,  # 添加基线
            name = "AdaBoost"
        )
        _ = pr_display.ax_.set_title("Train set Precision-Recall curve")
        plt.savefig(f'{wkdir}/Precision-Recall-Train.png', bbox_inches = 'tight', dpi = 1200)
        plt.close()
    
    # AdaBoost
    if True:
        
        # 计算训练集和测试集的预测概率
        y_train_pred = best_model_et.predict_proba(X_train)[:, 1]
        y_test_pred = best_model_et.predict_proba(X_test)[:, 1]
        
        # 计算 ROC 曲线和 AUC 值
        fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred)
        fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred)
        
        roc_auc_train = auc(fpr_train, tpr_train)
        roc_auc_test = auc(fpr_test, tpr_test)
        
        # 绘制 ROC 曲线
        plt.figure(figsize=(10, 8))
        
        # 绘制训练集 ROC 曲线
        plt.plot(fpr_train, tpr_train, label = f"Train: AUC={roc_auc_train:.3f} (n={len(y_train)})", color = "#CECCE5", linewidth = 2)
        
        # 绘制测试集 ROC 曲线
        plt.plot(fpr_test, tpr_test, label = f"Test: AUC={roc_auc_test:.3f} (n={len(y_test)})", color = "#B0D9A5", linewidth = 2)
        
        # 绘制随机分类的参考线
        plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)
        
        # 图形细节设置
        plt.title("ROC Curve - AdaBoost (Train & Test)", fontsize = 20, fontweight = "bold")
        plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
        plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
        plt.xticks(fontsize = 16)
        plt.yticks(fontsize = 16)
        plt.legend(loc="lower right", fontsize = 12)
        
        # 去除顶部和右侧边框
        plt.gca().spines['top'].set_visible(False)
        plt.gca().spines['right'].set_visible(False)
        plt.gca().spines['left'].set_linewidth(1.5)
        plt.gca().spines['bottom'].set_linewidth(1.5)
        
        # 关闭网格线
        plt.grid(False)
        plt.tight_layout()
        
        # 保存图像为 PDF
        plt.savefig(f"{wkdir}/AdaBoost.png", bbox_inches = 'tight')
        plt.close()
        
    # 针对最优模型进行解释
    if True:
        
        explainer = shap.TreeExplainer(best_model_et)
        shap_values = explainer.shap_values(X_test)
            
        # 提取类别 0 的 SHAP 值
        shap_values_class_0 = shap_values[:, :, 0]
        
        # 提取类别 1 的 SHAP 值
        shap_values_class_1 = shap_values[:, :, 1]
            
            
        plt.figure(figsize = (10, 5), dpi = 1200)
        shap.summary_plot(shap_values_class_1, X_test, plot_type = "bar", show = False)
        plt.title('SHAP_numpy Sorted Feature Importance')
        plt.tight_layout()
        plt.savefig(f"{wkdir}/SHAP_numpy Sorted Feature Importance.png", bbox_inches = 'tight')
        plt.close()
            
        plt.figure()
        shap.summary_plot(shap_values_class_0, X_test, feature_names = X_test.columns, plot_type = "dot", show = False)
        plt.savefig(f"{wkdir}/SHAP_numpy Sorted Feature Importance.dot.png", bbox_inches = 'tight')
        plt.close()
        
        shap_values_df_1 = pd.DataFrame(shap_values_class_0, columns = X_test.columns)
        shap_values_df_1.head()
        
    if True:
    
        features = shap_values_df_1.columns.tolist()  # 获取所有的特征名称
        
        # 设置画布和子图结构（3行3列）
        fig, axes = plt.subplots(3, 3, figsize = (10, 10), dpi = 1200)
        axes = axes.flatten()
        
        # 循环绘制每个特征的散点图
        for i in range(len(axes)):
            if i < len(features):  # 如果还有特征未绘制
                feature = features[i]
                if feature in X_test.columns and feature in shap_values_df_1.columns:
                    ax = axes[i]
                    ax.scatter(X_test[feature], shap_values_df_1[feature], s = 10, color = "#6A9ACE")
                    ax.axhline(y = 0, color = 'red', linestyle = '-.', linewidth = 1)  # 添加横线
                    ax.set_xlabel(feature, fontsize = 10)
                    ax.set_ylabel(f'SHAP value for\n{feature}', fontsize = 10)
                    ax.spines['top'].set_visible(False)
                    ax.spines['right'].set_visible(False)
                else:
                    # 如果特征不存在，隐藏对应的子图
                    axes[i].axis('off')
            else:
                # 如果超过了特征数量，关闭剩余的子图
                axes[i].axis('off')
        
        # 调整布局
        plt.tight_layout()
        plt.savefig(f"{wkdir}/SHAP value for features.png", bbox_inches = 'tight')
        plt.close()
    
    # 以下针对预测正确的类别 0 和类别 1 分别绘制瀑布图[这里就绘制测试集第一个样本 (0) 和第二个样本都预测正确 (1)：
    if True:
        
        y_pred_et = best_model_et.predict(X_test)
        explainer = shap.TreeExplainer(best_model_et)
        
        # 计算shap值为Explanation格式
        shap_values_Explanation = explainer(X_test)
        
        # 提取类别 0 的 SHAP 值
        shap_values_class_0 = shap_values_Explanation[:, :, 0]
        
        # 提取类别 1 的 SHAP 值
        shap_values_class_1 = shap_values_Explanation[:, :, 1]
    
    # 预测类别 0 样本绘制
    if True:
        
        plt.figure(figsize = (10, 5), dpi = 1200)
        
        # 绘制第 1 个样本 0 类别的 SHAP 瀑布图
        shap.plots.waterfall(shap_values_class_0[0], show = False, max_display = 8)
        plt.savefig(f"{wkdir}/绘制第 1 个样本 0 类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
        plt.tight_layout()
        plt.close()
        
        plt.figure(figsize = (10, 5), dpi = 1200)
        
        # 绘制第 1 个样本 1 类别的 SHAP 瀑布图
        shap.plots.waterfall(shap_values_class_1[0], show = False, max_display = 8)
        plt.savefig(f"{wkdir}/绘制第 1 个样本 1 类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
        plt.tight_layout()
        plt.show()
    
    # 预测类别1样本绘制
    if True:
        plt.figure(figsize = (10, 5), dpi = 1200)
        
        # 绘制第 2 个样本0类别的 SHAP 瀑布图
        shap.plots.waterfall(shap_values_class_0[1], show = False, max_display = 8)
        plt.savefig(f"{wkdir}/绘制第 2 个样本0类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
        plt.tight_layout()
        plt.show()
            
        plt.figure(figsize = (10, 5), dpi = 1200)
        
        # 绘制第 2 个样本1类别的 SHAP 瀑布图
        shap.plots.waterfall(shap_values_class_1[1], show = False, max_display = 8)
        plt.savefig(f"{wkdir}/绘制第 2 个样本1类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
        plt.tight_layout()
        plt.show()

从特征筛选到多模型机器学习对比与优化 -- 基于医学数据的分类分析完整案例/多模型机器学习对比与优化