从特征筛选到多模型机器学习对比与优化 -- 基于医学数据的分类分析完整案例/多模型机器学习对比与优化

Xing Abao Lv3

本案例目的是实现机器学习模型的训练、调优、评估以及结果的可视化,主要包括以下步骤:

  1. 数据加载与预处理:首先从 Excel 文件中加载了包含多特征的医学数据,并将其存储在数据框中,使用LightGBM模型进行特征的重要性排序。通过训练LGBM模型并获取特征重要性,对特征进行排序,并选取前30个最重要的特征,最后根据LGBM模型的特征重要性,选择了前8个特征进行模型训练。这些特征是:X_30, X_39, X_46, X_32, X_34, X_33, X_9, X_28,目标变量是 y,即分类标签,并将特征和目标变量分开存储,进行后续的模型训练;
  2. 模型训练与调优:使用多种机器学习模型进行训练,包括人工神经网络、决策树、极限随机树、梯度提升机、K 近邻、LightGBM、随机森林、支持向量机和 XGBoost 。为每个模型定义了不同的超参数,并通过GridSearchCV进行网格搜索,找到最优的参数配置,每个模型使用 5 折交叉验证来评估其性能,确保模型在不同数据子集上的表现稳定;
  3. 模型评估与性能对比:对每个模型进行评估,计算常见的分类指标,包括准确率、敏感性(召回率)、特异性、精确度(阳性预测值)、负性预测值、F1 分数以及 Kappa 系数等,绘制多个图表进行模型性能对比,包括 ROC 曲线、Precision-Recall 曲线等,以展示不同模型在训练集和测试集上的表现;
  4. 模型结果的可视化:对最优模型进行了深入的可视化分析。通过 SHAP 值解释模型的决策过程,展示特征的重要性和对每个样本预测的影响,使用 PCA 进行降维,将数据投影到一维空间,并绘制散点图展示每个数据点在降维空间中的分布,同时根据预测的概率值对数据点进行颜色编码,区分不同类别,绘制了多个 SHAP 图(如条形图、散点图、瀑布图等),这些图表帮助解释模型的决策依据和各特征对模型预测的影响。

模拟案例

加载模块

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import shap
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
import seaborn as sns
import scipy.stats as stats
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import PrecisionRecallDisplay
from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import roc_curve, auc
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, cohen_kappa_score, confusion_matrix

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False
import warnings
warnings.filterwarnings("ignore")

加载数据

1
2
3
4
5
6
import os
wkdir = 'C:/Users/Administrator/Desktop'
os.chdir(wkdir)

path = 'Z:/TData/big-data/sad41d8cd/251027_Medical_Classification_Model_Comparison.xlsx'
df = pd.read_excel(path)
1
2
3
4
5
6
7
8
9
10
df.head()
Out[91]:
X_1 X_2 X_3 X_4 X_5 X_6 X_7 ... X_41 X_42 X_43 X_44 X_45 X_46 y
0 1 0 0 0 1 0 0 ... 152 1.35 16.9 6.75 6.0 47 1
1 1 0 0 0 1 0 0 ... 132 1.49 12.8 4.71 5.8 63 0
2 1 0 0 0 1 0 0 ... 141 4.44 15.3 4.26 6.4 48 1
3 0 0 0 0 0 0 0 ... 105 1.64 14.4 4.12 5.5 67 0
4 1 0 0 0 1 1 0 ... 142 1.21 13.7 3.91 7.0 55 1

[5 rows x 47 columns]
1
2
3
4
5
6
7
8
9
df.columns
Out[92]:
Index(['X_1', 'X_2', 'X_3', 'X_4', 'X_5', 'X_6', 'X_7', 'X_8', 'X_9', 'X_10',
'X_11', 'X_12', 'X_13', 'X_14', 'X_15', 'X_16', 'X_17', 'X_18', 'X_19',
'X_20', 'X_21', 'X_22', 'X_23', 'X_24', 'X_25', 'X_26', 'X_27', 'X_28',
'X_29', 'X_30', 'X_31', 'X_32', 'X_33', 'X_34', 'X_35', 'X_36', 'X_37',
'X_38', 'X_39', 'X_40', 'X_41', 'X_42', 'X_43', 'X_44', 'X_45', 'X_46',
'y'],
dtype='object')
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256 entries, 0 to 255
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 X_1 256 non-null int64
1 X_2 256 non-null int64
2 X_3 256 non-null int64
3 X_4 256 non-null int64
4 X_5 256 non-null int64
5 X_6 256 non-null int64
6 X_7 256 non-null int64
7 X_8 256 non-null int64
8 X_9 256 non-null int64
9 X_10 256 non-null int64
10 X_11 256 non-null int64
11 X_12 256 non-null int64
12 X_13 256 non-null int64
13 X_14 256 non-null int64
14 X_15 256 non-null int64
15 X_16 256 non-null int64
16 X_17 256 non-null int64
17 X_18 256 non-null int64
18 X_19 256 non-null int64
19 X_20 256 non-null int64
20 X_21 256 non-null int64
21 X_22 256 non-null int64
22 X_23 256 non-null int64
23 X_24 256 non-null int64
24 X_25 256 non-null int64
25 X_26 256 non-null int64
26 X_27 256 non-null int64
27 X_28 256 non-null float64
28 X_29 256 non-null int64
29 X_30 256 non-null int64
30 X_31 256 non-null float64
31 X_32 256 non-null float64
32 X_33 256 non-null float64
33 X_34 256 non-null int64
34 X_35 256 non-null int64
35 X_36 256 non-null int64
36 X_37 256 non-null int64
37 X_38 256 non-null int64
38 X_39 256 non-null int64
39 X_40 256 non-null int64
40 X_41 256 non-null int64
41 X_42 256 non-null float64
42 X_43 256 non-null float64
43 X_44 256 non-null float64
44 X_45 256 non-null float64
45 X_46 256 non-null int64
46 y 256 non-null int64
dtypes: float64(8), int64(39)
memory usage: 94.1 KB

划分数据

1
2
3
4
5
6
7
8
9
10
11
12
# 划分特征和目标变量
X = df.drop(['y'], axis = 1) # 从数据集中去掉目标变量 'y',得到特征变量 X
y = df['y'] # 提取目标变量 y

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(
X, # 特征变量
y, # 目标变量
test_size = 0.3, # 测试集所占比例,这里为 30%
random_state = 42, # 随机种子,确保结果可重复
stratify = df['y'] # 按目标变量 'y' 的分布进行分层采样,确保训练集和测试集中目标变量的分布一致
)

特征筛选

  1. 特征重要性排序,使用初步训练的轻梯度提升机 (LGBM) 分类器对变量的重要性进行排序,并选择排名前 30 的变量;
  2. 前向特征选择,使用多个 LGBM 分类器,通过逐步添加变量的方式,依次评估每个变量对模型性能的贡献;
  3. 模型优化与最终选择,根据受试者特征曲线 AUC 评分分类器的性能。例如,当变量添加到第 7 个时,模型性能不再显著提升,因此最终选择前 7 个变量用于模型开发;
  4. 同时注意多重共线性的问题。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# 创建 LGBM 分类器
lgbm_clf = lgb.LGBMClassifier(random_state = 42, verbose = -1)

# 训练模型
lgbm_clf.fit(X_train, y_train)

# 获取特征重要性
feature_importances = lgbm_clf.feature_importances_

# 构建特征重要性排名
feature_importance_df = pd.DataFrame({
'Feature': X.columns,
'Importance': feature_importances
}).sort_values(by = 'Importance', ascending = False)


# 只取前30个重要特征
top_n = 30
top_features = feature_importance_df.head(top_n)

# 调整字体大小
plt.figure(figsize = (12, 8), dpi = 1200)
plt.barh(top_features['Feature'], top_features['Importance'], color = 'skyblue')
plt.xlabel('Importance', fontsize = 14)
plt.ylabel('Feature', fontsize = 14)
plt.title(f'Top {top_n} Feature Importance', fontsize = 16)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.gca().invert_yaxis()
plt.savefig(f'{wkdir}/特征筛选.png', bbox_inches = 'tight')
plt.close()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
# 初始化存储结果的 DataFrame
selection_results = pd.DataFrame(columns = ['Feature', 'Importance', 'Mean_ROC'])

# 初始化用于训练的特征列表
selected_features = []

# K 折交叉验证
kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

# 获取折数
n_splits = kf.get_n_splits()

# 动态创建列名
fold_columns = [f'Fold_{i+1}_ROC' for i in range(n_splits)]

# 依次添加特征
for i in range(len(top_features)):
# 当前特征
current_feature = top_features.iloc[i]['Feature']
selected_features.append(current_feature)

fold_roc_scores = [] # 存储每折的 ROC AUC 分数

# K 折交叉验证
for train_idx, val_idx in kf.split(X_train):
# 划分训练集和验证集
X_train_fold, X_val_fold = X_train.iloc[train_idx][selected_features], X_train.iloc[val_idx][selected_features]
y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]

# 创建并训练 LGBM 模型
lgbm_clf = lgb.LGBMClassifier(random_state = 42, verbose = -1)
lgbm_clf.fit(X_train_fold, y_train_fold)

# 预测并计算 ROC AUC 分数
y_val_proba = lgbm_clf.predict_proba(X_val_fold)[:, 1] # 概率分数
fold_roc_score = roc_auc_score(y_val_fold, y_val_proba)
fold_roc_scores.append(fold_roc_score)

# 计算平均 ROC AUC 分数
mean_roc_score = np.mean(fold_roc_scores)

# 保存结果
row_data = {
'Feature': current_feature,
'Importance': top_features.iloc[i]['Importance'],
'Mean_ROC': mean_roc_score,
}

# 添加每折的ROC AUC分数
for j, score in enumerate(fold_roc_scores):
row_data[fold_columns[j]] = score
row_df = pd.DataFrame([row_data])
selection_results = pd.concat([selection_results, row_df], ignore_index = True)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
selection_results
Out[108]:
Feature Importance Mean_ROC ... Fold_3_ROC Fold_4_ROC Fold_5_ROC
0 X_30 62 0.632506 ... 0.696429 0.667488 0.364
1 X_39 55 0.721269 ... 0.754870 0.650246 0.656
2 X_46 46 0.747930 ... 0.795455 0.726601 0.678
3 X_32 36 0.749943 ... 0.792208 0.768473 0.648
4 X_34 35 0.759666 ... 0.844156 0.778325 0.600
5 X_9 32 0.822709 ... 0.853896 0.852217 0.740
6 X_33 32 0.834548 ... 0.840909 0.871921 0.744
7 X_28 31 0.837696 ... 0.847403 0.842365 0.764
8 X_27 27 0.842103 ... 0.860390 0.802956 0.784
9 X_44 22 0.825152 ... 0.889610 0.788177 0.808
10 X_45 22 0.847453 ... 0.899351 0.827586 0.824
11 X_35 19 0.827285 ... 0.896104 0.773399 0.804
12 X_40 19 0.812183 ... 0.879870 0.729064 0.796
13 X_36 18 0.820189 ... 0.873377 0.733990 0.776
14 X_43 15 0.814614 ... 0.879870 0.714286 0.832
15 X_37 13 0.822236 ... 0.912338 0.763547 0.768
16 X_38 13 0.790346 ... 0.857143 0.724138 0.756
17 X_24 11 0.796770 ... 0.873377 0.733990 0.772
18 X_3 11 0.792737 ... 0.860390 0.733990 0.788
19 X_31 10 0.784170 ... 0.844156 0.748768 0.796
20 X_42 10 0.789202 ... 0.827922 0.773399 0.800
21 X_26 10 0.789443 ... 0.834416 0.773399 0.804
22 X_21 9 0.787502 ... 0.837662 0.743842 0.800
23 X_5 7 0.788152 ... 0.837662 0.743842 0.800
24 X_16 6 0.794447 ... 0.837662 0.743842 0.800
25 X_17 4 0.785223 ... 0.844156 0.743842 0.796
26 X_41 4 0.797268 ... 0.860390 0.778325 0.812
27 X_29 3 0.796142 ... 0.853896 0.768473 0.816
28 X_2 3 0.796142 ... 0.853896 0.768473 0.816
29 X_20 2 0.796070 ... 0.834416 0.773399 0.820

[30 rows x 8 columns]
1
2
3
4
5
# 将 Importance 列百分比化并归一化
selection_results['Importance'] = (
selection_results['Importance'] / selection_results['Importance'].sum()
)
selection_results
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
Out[109]: 
Feature Importance Mean_ROC ... Fold_3_ROC Fold_4_ROC Fold_5_ROC
0 X_30 0.105622 0.632506 ... 0.696429 0.667488 0.364
1 X_39 0.093697 0.721269 ... 0.754870 0.650246 0.656
2 X_46 0.078365 0.747930 ... 0.795455 0.726601 0.678
3 X_32 0.061329 0.749943 ... 0.792208 0.768473 0.648
4 X_34 0.059625 0.759666 ... 0.844156 0.778325 0.600
5 X_9 0.054514 0.822709 ... 0.853896 0.852217 0.740
6 X_33 0.054514 0.834548 ... 0.840909 0.871921 0.744
7 X_28 0.052811 0.837696 ... 0.847403 0.842365 0.764
8 X_27 0.045997 0.842103 ... 0.860390 0.802956 0.784
9 X_44 0.037479 0.825152 ... 0.889610 0.788177 0.808
10 X_45 0.037479 0.847453 ... 0.899351 0.827586 0.824
11 X_35 0.032368 0.827285 ... 0.896104 0.773399 0.804
12 X_40 0.032368 0.812183 ... 0.879870 0.729064 0.796
13 X_36 0.030664 0.820189 ... 0.873377 0.733990 0.776
14 X_43 0.025554 0.814614 ... 0.879870 0.714286 0.832
15 X_37 0.022147 0.822236 ... 0.912338 0.763547 0.768
16 X_38 0.022147 0.790346 ... 0.857143 0.724138 0.756
17 X_24 0.018739 0.796770 ... 0.873377 0.733990 0.772
18 X_3 0.018739 0.792737 ... 0.860390 0.733990 0.788
19 X_31 0.017036 0.784170 ... 0.844156 0.748768 0.796
20 X_42 0.017036 0.789202 ... 0.827922 0.773399 0.800
21 X_26 0.017036 0.789443 ... 0.834416 0.773399 0.804
22 X_21 0.015332 0.787502 ... 0.837662 0.743842 0.800
23 X_5 0.011925 0.788152 ... 0.837662 0.743842 0.800
24 X_16 0.010221 0.794447 ... 0.837662 0.743842 0.800
25 X_17 0.006814 0.785223 ... 0.844156 0.743842 0.796
26 X_41 0.006814 0.797268 ... 0.860390 0.778325 0.812
27 X_29 0.005111 0.796142 ... 0.853896 0.768473 0.816
28 X_2 0.005111 0.796142 ... 0.853896 0.768473 0.816
29 X_20 0.003407 0.796070 ... 0.834416 0.773399 0.820

[30 rows x 8 columns]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# 动态获取折列名称
fold_columns = [col for col in selection_results.columns if 'Fold_' in col]

# 添加置信区间列到 selection_results
selection_results['CI_Lower'] = None
selection_results['CI_Upper'] = None

# 遍历每行计算置信区间
for index, row in selection_results.iterrows():

# 提取每折的ROC成绩
fold_roc_scores = [row[fold] for fold in fold_columns]

# 样本大小
n_folds = len(fold_roc_scores)

# 平均值和标准误差
mean_roc = row['Mean_ROC']
std_err = stats.sem(fold_roc_scores) # 标准误差

# t值(95%置信区间,n_folds - 1 自由度)
t_value = stats.t.ppf(0.975, df=n_folds - 1)

# 置信区间
ci_lower = mean_roc - t_value * std_err
ci_upper = mean_roc + t_value * std_err

# 确保置信区间上限不超过1
ci_upper = min(ci_upper, 1)

# 保存到 DataFrame
selection_results.at[index, 'CI_Lower'] = ci_lower
selection_results.at[index, 'CI_Upper'] = ci_upper

# 输出结果
selection_results
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
Out[111]: 
Feature Importance Mean_ROC ... Fold_5_ROC CI_Lower CI_Upper
0 X_30 0.105622 0.632506 ... 0.364 0.402515 0.862497
1 X_39 0.093697 0.721269 ... 0.656 0.604669 0.837869
2 X_46 0.078365 0.747930 ... 0.678 0.612311 0.883548
3 X_32 0.061329 0.749943 ... 0.648 0.628602 0.871285
4 X_34 0.059625 0.759666 ... 0.600 0.624474 0.894858
5 X_9 0.054514 0.822709 ... 0.740 0.752107 0.893311
6 X_33 0.054514 0.834548 ... 0.744 0.768515 0.900581
7 X_28 0.052811 0.837696 ... 0.764 0.784572 0.89082
8 X_27 0.045997 0.842103 ... 0.784 0.782365 0.90184
9 X_44 0.037479 0.825152 ... 0.808 0.775929 0.874374
10 X_45 0.037479 0.847453 ... 0.824 0.80444 0.890466
11 X_35 0.032368 0.827285 ... 0.804 0.767662 0.886909
12 X_40 0.032368 0.812183 ... 0.796 0.739797 0.884568
13 X_36 0.030664 0.820189 ... 0.776 0.743339 0.897039
14 X_43 0.025554 0.814614 ... 0.832 0.73852 0.890709
15 X_37 0.022147 0.822236 ... 0.768 0.74653 0.897943
16 X_38 0.022147 0.790346 ... 0.756 0.726275 0.854416
17 X_24 0.018739 0.796770 ... 0.772 0.732925 0.860614
18 X_3 0.018739 0.792737 ... 0.788 0.736106 0.849367
19 X_31 0.017036 0.784170 ... 0.796 0.732317 0.836023
20 X_42 0.017036 0.789202 ... 0.800 0.75556 0.822844
21 X_26 0.017036 0.789443 ... 0.804 0.745793 0.833092
22 X_21 0.015332 0.787502 ... 0.800 0.735047 0.839958
23 X_5 0.011925 0.788152 ... 0.800 0.736725 0.839578
24 X_16 0.010221 0.794447 ... 0.800 0.732864 0.856029
25 X_17 0.006814 0.785223 ... 0.796 0.732467 0.83798
26 X_41 0.006814 0.797268 ... 0.812 0.741074 0.853461
27 X_29 0.005111 0.796142 ... 0.816 0.741829 0.850456
28 X_2 0.005111 0.796142 ... 0.816 0.741829 0.850456
29 X_20 0.003407 0.796070 ... 0.820 0.750079 0.842061

[30 rows x 10 columns]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
# 确保 Importance 列为数值类型
selection_results['Importance'] = pd.to_numeric(selection_results['Importance'], errors = 'coerce')

# 确保置信区间的列为数值类型
selection_results['CI_Lower'] = pd.to_numeric(selection_results['CI_Lower'], errors = 'coerce')
selection_results['CI_Upper'] = pd.to_numeric(selection_results['CI_Upper'], errors = 'coerce')

# 参数:选择前 n 个特征
n_features = 7

fig, ax1 = plt.subplots(figsize = (16, 6), dpi = 1200)

# 渐变柱状图:特征贡献度
norm = plt.Normalize(selection_results['Importance'].min(), selection_results['Importance'].max())
colors = plt.cm.Blues(norm(selection_results['Importance']))

ax1.bar(selection_results['Feature'], selection_results['Importance'], color = colors, label = 'Feature Importance')
ax1.set_xlabel("Features", fontsize = 18, fontweight = 'bold')
ax1.set_ylabel("Feature Importance", fontsize = 18, fontweight = 'bold')
ax1.tick_params(axis = 'y', labelsize = 12, width = 1.5)

# 修改 x 轴特征颜色,前 n_features 用红色,其他用黑色
x_labels = selection_results['Feature']
x_colors = ['red' if i < n_features else 'black' for i in range(len(x_labels))]
for tick_label, color in zip(ax1.get_xticklabels(), x_colors):
tick_label.set_color(color)

ax1.tick_params(axis = 'x', rotation = 45, labelsize = 12, width = 1.5)

# 折线图:AUC成绩
ax2 = ax1.twinx()

# 红点和红线:前 n_features 个特征
ax2.plot(
selection_results['Feature'][:n_features + 1], # 连接红点到黑点的过渡
selection_results['Mean_ROC'][:n_features + 1],
color = "red", marker = 'o', linestyle = '-', label = "Mean AUC (Top Features)"
)

# 黑点和黑线:其余特征
ax2.plot(
selection_results['Feature'][n_features:],
selection_results['Mean_ROC'][n_features:],
color = "black", marker = 'o', linestyle = '-', label = "Mean AUC (Other Features)"
)

# 添加置信区间阴影
ax2.fill_between(
selection_results['Feature'],
selection_results['CI_Lower'], # 置信区间下限
selection_results['CI_Upper'], # 置信区间上限
color = 'red',
alpha = 0.2, # 设置透明度
)

ax2.set_ylabel("Mean AUC", fontsize = 18, fontweight = 'bold')
ax2.tick_params(axis = 'y', labelsize = 12, width = 1.5)
ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.3f}')) # 保留三位小数

# 添加标题
plt.title(f"Feature Contribution and AUC Performance (Top {n_features} Features Highlighted)", fontsize = 18, fontweight = 'bold')
fig.tight_layout()
plt.savefig("f{wkdir}/特征曲线.png", bbox_inches = 'tight')
plt.close()
1
2
list(selection_results['Feature'][0:8])
# Out[116]: ['X_30', 'X_39', 'X_46', 'X_32', 'X_34', 'X_9', 'X_33', 'X_28']

构建模型

  • 基于特征筛选确定的特征 [‘X_30’, ‘X_39’, ‘X_46’, ‘X_32’, ‘X_34’, ‘X_33’, ‘X_9’, ‘X_28’] 进行多模型比较
  • 模型包括 ANN、DT、ET、GBM、KNN、LightGBM、RF、SVM、XGBoost
  • 利用 k 折交叉验证+网格搜索+手动微调确定各模型最优参数

混淆矩阵

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
# 混淆矩阵
def plot_confusion_matrix_plus(y_true, y_pred, title = "Model", labels = None, save_path = None):
unique_labels = set(np.unique(np.concatenate([np.array(y_true), np.array(y_pred)])))
labels_used = list(unique_labels)
# 若未提供 labels,或者提供的 labels 与 y_true 不一致,则自动推断
if labels is None:
labels_with_totals = labels_used + ["Total"]
else:
labels_with_totals = [labels[lab] for lab in labels if lab in labels_used] + ["Total"]

# 若最终只有单一标签(极端情况),也能正常绘制
labels_used = list(labels_used)
cm = confusion_matrix(y_true, y_pred, labels = labels_used)

# 行标准化(避免除零)
row_sums = cm.sum(axis = 1, keepdims = True)
with np.errstate(divide = 'ignore', invalid = 'ignore'):
cm_normalized = np.divide(cm, row_sums, where = (row_sums != 0))
cm_normalized[np.isnan(cm_normalized)] = 0.0

# 添加 Total 行和列(计数)
cm_with_totals = np.vstack([cm, cm.sum(axis = 0, keepdims = True)])
cm_with_totals = np.column_stack([cm_with_totals, cm_with_totals.sum(axis = 1, keepdims = True)])

# 用于着色的矩阵(最后一行/列置 0,使其灰色覆盖)
heatmap_data = cm_normalized.copy()
heatmap_data = np.vstack([heatmap_data, np.zeros((1, heatmap_data.shape[1]))])
heatmap_data = np.column_stack([heatmap_data, np.zeros((heatmap_data.shape[0], 1))])

fig, ax = plt.subplots(figsize = (10, 8))
colors = sns.color_palette("Greens", as_cmap = True)
grey_color = "#f0f0f0"

sns.heatmap(
heatmap_data,
annot = False,
fmt = "",
cmap = colors,
xticklabels = labels_with_totals,
yticklabels = labels_with_totals,
cbar = False,
square = True,
linewidths = 1.5,
linecolor = "white",
ax = ax,
vmin = 0.0, vmax = 1.0
)

# 覆盖 Total 行/列为灰色
n_rows, n_cols = cm.shape
for i in range(n_rows):
ax.add_patch(plt.Rectangle((n_cols, i), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))
for j in range(n_cols):
ax.add_patch(plt.Rectangle((j, n_rows), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))
ax.add_patch(plt.Rectangle((n_cols, n_rows), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))

# 填充数值与百分比
total_all = cm_with_totals[-1, -1]
for i in range(cm_with_totals.shape[0]):
for j in range(cm_with_totals.shape[1]):
if i < n_rows and j < n_cols:
value = cm[i, j]
percentage = cm_normalized[i, j] * 100 if row_sums[i, 0] > 0 else 0.0
ax.text(j + 0.5, i + 0.45, f"{percentage:.1f}%", ha = "center", va = "center", fontsize = 14, color = "black")
ax.text(j + 0.5, i + 0.72, f"{int(value)}", ha = "center", va = "center", fontsize = 13, color = "black")
elif i == n_rows and j == n_cols:
ax.text(j + 0.5, i + 0.5, f"{int(total_all)}", ha = "center", va = "center", fontsize = 15, color = "black")
else:
total_value = cm_with_totals[i, j]
total_percentage = (total_value / total_all * 100) if total_all > 0 else 0.0
ax.text(j + 0.5, i + 0.45, f"{total_percentage:.1f}%", ha = "center", va = "center", fontsize = 14, color = "black")
ax.text(j + 0.5, i + 0.72, f"{int(total_value)}", ha = "center", va = "center", fontsize = 13, color = "black")

plt.title(title, fontsize = 20)
plt.xlabel("Prediction (model output)", fontsize = 16)
plt.ylabel("Truth (observation)", fontsize = 16)
plt.tight_layout()
if save_path:
plt.savefig(save_path, bbox_inches = 'tight')
plt.close()

重新划分数据集

1
2
3
4
5
6
7
8
9
10
11
12
# 划分特征和目标变量
X = df[['X_30', 'X_39', 'X_46', 'X_32', 'X_34', 'X_33', 'X_9', 'X_28']] # 根据特征筛选变量重新划分特征
y = df['y'] # 提取目标变量 y

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(
X, # 特征变量
y, # 目标变量
test_size = 0.3, # 测试集所占比例,这里为 30%
random_state = 42, # 随机种子,确保结果可重复
stratify = df['y'] # 按目标变量 'y' 的分布进行分层采样,确保训练集和测试集中目标变量的分布一致
)

Artificial Neural Network

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# 定义 ANN 的基础参数
params_ann = {
'random_state': 42, # 随机种子
'max_iter': 200 # 最大迭代次数
}

# 初始化 ANN 分类模型
model_ann = MLPClassifier(**params_ann)

# 定义参数网格
param_grid_ann = {
'hidden_layer_sizes': [(50,), (100,), (100, 50)], # 隐藏层大小
'learning_rate_init': [0.001, 0.01, 0.1], # 学习率
'alpha': [0.0001, 0.001, 0.01] # 正则化参数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_ann = GridSearchCV(
estimator = model_ann,
param_grid = param_grid_ann,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_ann.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_ann = grid_search_ann.best_estimator_

# 预测测试集
y_pred = best_model_ann.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
1
2
3
4
5
6
7
8
9
Fitting 5 folds for each of 27 candidates, totalling 135 fits
precision recall f1-score support

0 0.88 0.83 0.85 52
1 0.68 0.76 0.72 25

accuracy 0.81 77
macro avg 0.78 0.79 0.78 77
weighted avg 0.81 0.81 0.81 77

Decision Tree, DT

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
# 定义 DT 的基础参数
params_dt = {'random_state': 42}

# 初始化决策树分类模型
model_dt = DecisionTreeClassifier(**params_dt)

# 定义参数网格
param_grid_dt = {
'criterion': ['gini', 'entropy'], # 划分标准
'max_depth': [None, 10, 20, 30], # 最大深度
'min_samples_split': [2, 5, 10], # 节点最小分裂样本数
'min_samples_leaf': [1, 2, 5] # 叶节点最小样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_dt = GridSearchCV(
estimator = model_dt,
param_grid = param_grid_dt,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_dt.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_dt = grid_search_dt.best_estimator_

# 预测测试集
y_pred = best_model_dt.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
1
2
3
4
5
6
7
8
9
Fitting 5 folds for each of 72 candidates, totalling 360 fits
precision recall f1-score support

0 0.78 0.69 0.73 52
1 0.48 0.60 0.54 25

accuracy 0.66 77
macro avg 0.63 0.65 0.64 77
weighted avg 0.69 0.66 0.67 77

Extra Trees, ET

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
# 定义 ET 的基础参数
params_et = {
'random_state': 42 # 随机种子
}

# 初始化 Extra Trees 分类模型
model_et = ExtraTreesClassifier(**params_et)

# 定义参数网格
param_grid_et = {
'n_estimators': [50, 100, 200], # 树的数量
'criterion': ['gini', 'entropy'], # 划分标准
'max_depth': [None, 10, 20, 30], # 最大深度
'min_samples_split': [2, 5, 10], # 节点最小分裂样本数
'min_samples_leaf': [1, 2, 5] # 叶节点最小样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_et = GridSearchCV(
estimator = model_et,
param_grid = param_grid_et,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_et.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_et = grid_search_et.best_estimator_

# 预测测试集
y_pred = best_model_et.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
1
2
3
4
5
6
7
8
9
Fitting 5 folds for each of 216 candidates, totalling 1080 fits
precision recall f1-score support

0 0.94 0.85 0.89 52
1 0.73 0.88 0.80 25

accuracy 0.86 77
macro avg 0.83 0.86 0.84 77
weighted avg 0.87 0.86 0.86 77

GBM

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# 定义 GBM 的基础参数
params_gbm = {'random_state': 42}

# 初始化梯度增强机分类模型
model_gbm = GradientBoostingClassifier(**params_gbm)

# 定义参数网格
param_grid_gbm = {
'n_estimators': [50, 100, 200], # 树的数量
'learning_rate': [0.01, 0.1, 0.2], # 学习率
'max_depth': [3, 5, 10], # 最大深度
'min_samples_split': [2, 5, 10], # 节点最小分裂样本数
'min_samples_leaf': [1, 2, 5] # 叶节点最小样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_gbm = GridSearchCV(
estimator = model_gbm,
param_grid = param_grid_gbm,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_gbm.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_gbm = grid_search_gbm.best_estimator_

# 预测测试集
y_pred = best_model_gbm.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
1
2
3
4
5
6
7
8
9
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
precision recall f1-score support

0 0.82 0.87 0.84 52
1 0.68 0.60 0.64 25

accuracy 0.78 77
macro avg 0.75 0.73 0.74 77
weighted avg 0.77 0.78 0.78 77

KNN

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
# 初始化 KNN 分类模型
model_knn = KNeighborsClassifier()

# 定义参数网格
param_grid_knn = {
'n_neighbors': [3, 5, 10], # 邻居数量
'weights': ['uniform', 'distance'], # 权重方式
'metric': ['euclidean', 'manhattan', 'minkowski'] # 距离度量方式
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_knn = GridSearchCV(
estimator = model_knn,
param_grid = param_grid_knn,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_knn.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_knn = grid_search_knn.best_estimator_

# 预测测试集
y_pred = best_model_knn.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
1
2
3
4
5
6
7
8
9
Fitting 5 folds for each of 18 candidates, totalling 90 fits
precision recall f1-score support

0 0.81 0.85 0.83 52
1 0.65 0.60 0.62 25

accuracy 0.77 77
macro avg 0.73 0.72 0.73 77
weighted avg 0.76 0.77 0.76 77

LightGBM

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# 初始化 LightGBM 分类模型
model_lgbm = LGBMClassifier(random_state = 42, verbose = -1)

# 定义参数网格
param_grid_lgbm = {
'n_estimators': [50, 100, 200], # 树的数量
'learning_rate': [0.01, 0.1, 0.2], # 学习率
'max_depth': [-1, 10, 20], # 最大深度
'num_leaves': [31, 50, 100], # 叶节点数
'min_child_samples': [10, 20, 30] # 最小叶节点样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_lgbm = GridSearchCV(
estimator = model_lgbm,
param_grid = param_grid_lgbm,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_lgbm.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_lgbm = grid_search_lgbm.best_estimator_

# 预测测试集
y_pred = best_model_lgbm.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
1
2
3
4
5
6
7
8
9
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
precision recall f1-score support

0 0.83 0.85 0.84 52
1 0.67 0.64 0.65 25

accuracy 0.78 77
macro avg 0.75 0.74 0.75 77
weighted avg 0.78 0.78 0.78 77

Random Forest, RF

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# 初始化随机森林分类模型
model_rf = RandomForestClassifier(random_state = 42)

# 定义参数网格
param_grid_rf = {
'n_estimators': [50, 100, 200], # 树的数量
'max_depth': [None, 10, 20, 30], # 最大深度
'min_samples_split': [2, 5, 10], # 节点最小分裂样本数
'min_samples_leaf': [1, 2, 5], # 叶节点最小样本数
'criterion': ['gini', 'entropy'] # 划分标准
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_rf = GridSearchCV(
estimator = model_rf,
param_grid = param_grid_rf,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_rf.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_rf = grid_search_rf.best_estimator_

# 预测测试集
y_pred = best_model_rf.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
1
2
3
4
5
6
7
8
9
Fitting 5 folds for each of 216 candidates, totalling 1080 fits
precision recall f1-score support

0 0.87 0.87 0.87 52
1 0.72 0.72 0.72 25

accuracy 0.82 77
macro avg 0.79 0.79 0.79 77
weighted avg 0.82 0.82 0.82 77

Support Vector Machine, SVM

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# 初始化支持向量机分类模型
model_svm = SVC(probability = True, random_state = 42)

# 定义参数网格
param_grid_svm = {
'C': [0.1, 1, 10, 100], # 惩罚参数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_svm = GridSearchCV(
estimator = model_svm,
param_grid = param_grid_svm,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_svm.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_svm = grid_search_svm.best_estimator_

# 预测测试集
y_pred = best_model_svm.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
1
2
3
4
5
6
7
8
9
Fitting 5 folds for each of 4 candidates, totalling 20 fits
precision recall f1-score support

0 0.84 0.90 0.87 52
1 0.76 0.64 0.70 25

accuracy 0.82 77
macro avg 0.80 0.77 0.78 77
weighted avg 0.81 0.82 0.81 77

eXtreme Gradient Boosting, XGBoost

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# 初始化 XGBoost 分类模型
model_xgb = XGBClassifier(use_label_encoder = False, eval_metric = 'logloss', random_state = 42)

# 定义参数网格
param_grid_xgb = {
'n_estimators': [50, 100, 200], # 树的数量
'learning_rate': [0.01, 0.1, 0.2], # 学习率
'max_depth': [3, 5, 10], # 最大深度
'subsample': [0.8, 1.0], # 子采样比率
'colsample_bytree': [0.8, 1.0] # 树的列采样比率
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_xgb = GridSearchCV(
estimator = model_xgb,
param_grid = param_grid_xgb,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_xgb.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_xgb = grid_search_xgb.best_estimator_

# 预测测试集
y_pred = best_model_xgb.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)
1
2
3
4
5
6
7
8
9
Fitting 5 folds for each of 108 candidates, totalling 540 fits
precision recall f1-score support

0 0.86 0.83 0.84 52
1 0.67 0.72 0.69 25

accuracy 0.79 77
macro avg 0.76 0.77 0.77 77
weighted avg 0.80 0.79 0.79 77

Test AUC

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
# 获取所有模型及其颜色的映射
models = {
"Random Forest": {"model": best_model_rf, "color": "#FDD379"},
"LightGBM": {"model": best_model_lgbm, "color": "#F7C2CD"},
"Gradient Boosting": {"model": best_model_gbm, "color": "#A6DAEF"},
"XGBoost": {"model": best_model_xgb, "color": "#B0D9A5"},
"SVM": {"model": best_model_svm, "color": "#F06292"},
"KNN": {"model": best_model_knn, "color": "#64B5F6"},
"Decision Tree": {"model": best_model_dt, "color": "#FFB74D"},
"Extra Trees": {"model": best_model_et, "color": "#9575CD"},
"ANN": {"model": best_model_ann, "color": "#4DB6AC"}
}

# 绘制 ROC 曲线
plt.figure(figsize = (12, 10))

# 遍历每个模型计算 ROC 和 AUC
for name, info in models.items():
model = info["model"]
color = info["color"]
try:
y_pred = model.predict_proba(X_test)[:, 1] # 获取概率分布
except AttributeError:
y_pred = model.decision_function(X_test) # 对于 SVM,使用 decision_function
y_pred = (y_pred - y_pred.min()) / (y_pred.max() - y_pred.min()) # 归一化

# 计算 ROC 曲线和 AUC 值
fpr, tpr, _ = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)

# 绘制曲线
plt.plot(fpr, tpr, label = f"{name}: AUC = {roc_auc:.3f}", color = color, linewidth = 2)

# 绘制随机分类的参考线
plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

# 图形细节设置
plt.title("ROC Curve - Test Set Model Comparison", fontsize = 20, fontweight = "bold")
plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend(loc = "lower right", fontsize = 12)

# 去除顶部和右侧边框
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_linewidth(1.5)
plt.gca().spines['bottom'].set_linewidth(1.5)

# 关闭网格线
plt.grid(False)
plt.tight_layout()
plt.savefig(f"{wkdir}/Test-AUC.png", bbox_inches = 'tight')
plt.close()

Train AUC

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
# 绘制训练集 ROC 曲线
plt.figure(figsize = (12, 10))

# 遍历每个模型计算 ROC 和 AUC(训练集)
for name, info in models.items():
model = info["model"]
color = info["color"]
try:
y_pred = model.predict_proba(X_train)[:, 1] # 获取概率分布
except AttributeError:
y_pred = model.decision_function(X_train) # 对于 SVM,使用 decision_function
y_pred = (y_pred - y_pred.min()) / (y_pred.max() - y_pred.min()) # 归一化

# 计算 ROC 曲线和 AUC 值
fpr, tpr, _ = roc_curve(y_train, y_pred)
roc_auc = auc(fpr, tpr)

# 绘制曲线
plt.plot(fpr, tpr, label=f"{name} (Train AUC = {roc_auc:.3f})", color = color, linewidth = 2)

# 绘制随机分类的参考线
plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

# 图形细节设置
plt.title("ROC Curve - Training Set Model Comparison", fontsize = 20, fontweight = "bold")
plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend(loc="lower right", fontsize = 12)

# 去除顶部和右侧边框
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_linewidth(1.5)
plt.gca().spines['bottom'].set_linewidth(1.5)

# 关闭网格线
plt.grid(False)
plt.tight_layout()
plt.savefig(f"{wkdir}/Train-AUC.png", bbox_inches = 'tight')
plt.close()

训练集测试集评价指标系统输出比较

>为什么多指标评价模型

多指标评价模型能够全面反映模型在各个方面的综合性能,满足不同任务对模型表现的多样需求,尤其在处理数据不平衡时,能够更好地揭示模型在不同类别上的表现差异,从而为模型优化提供更加精准的指导,如在数据不平衡的情况下,F1 分数、敏感性 Sensitivity 和特异性 Specificity 比准确率 Accuracy 更具参考性,因为它们能更好地反映模型在不同类别上的表现,尤其是在处理少数类时。

为什么要对训练集、测试集评价指标计算

计算训练集上的评价指标可以评估模型的学习效果,而训练集和测试集之间的表现差异帮助我们检测是否存在过拟合或欠拟合,从而确保模型能够在未知数据上保持良好的预测能力,一个好的模型应该在各数据集上趋势保持一致。

最终模型选择

最终确定的模型为 ET 其在测试集上的 ROC-AUC 曲线下面积为 0.873,在训练集上为 0.987,且其准确率、敏感性、特异性、阳性预测值、阴性预测值、F1 分数和 Kappa 得分等指标在训练集和测试集上相对较好,且在几个模型里面识别少数类样本最多的模型。

需要强调的是,读者在运用模型时应根据自身的数据特点和评价指标偏好来做选择,因为在这里,多个模型的评价指标差异并不大。此外,模型的性能也与选择的参数范围密切相关,在当前网格搜索的范围内,得到的是最优结果,但如果更换参数,模型可能不再是最优或最差的,因此在实际应用中仍具有很大的可操作性和调整空间。

测试集

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
# 定义函数计算评价指标
def calculate_metrics(y_true, y_pred):
accuracy = accuracy_score(y_true, y_pred)
sensitivity = recall_score(y_true, y_pred) # 敏感性(召回率)
specificity = confusion_matrix(y_true, y_pred)[0, 0] / (confusion_matrix(y_true, y_pred)[0, 0] + confusion_matrix(y_true, y_pred)[0, 1])
positive_predictive_value = precision_score(y_true, y_pred)
negative_predictive_value = confusion_matrix(y_true, y_pred)[1, 1] / (confusion_matrix(y_true, y_pred)[1, 1] + confusion_matrix(y_true, y_pred)[1, 0])
f1 = f1_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
return [accuracy, sensitivity, specificity, positive_predictive_value, negative_predictive_value, f1, kappa]

# 模型字典
models = {
'ANN': best_model_ann,
'Decision Tree': best_model_dt,
'Extra Trees': best_model_et,
'Gradient Boosting': best_model_gbm,
'KNN': best_model_knn,
'LightGBM': best_model_lgbm,
'Random Forest': best_model_rf,
'SVM': best_model_svm,
'XGBoost': best_model_xgb
}

# 初始化存储模型评价指标的列表
metrics_dict = {'Metrics': ['accuracy', 'sensitivity', 'specificity', 'Positive predictive value', 'Negative predictive value', 'F1 score', 'Kappa score']}

# 计算每个模型的指标并存储
for model_name, model in models.items():
# 预测测试集
y_pred = model.predict(X_test)
# 计算评价指标
metrics_dict[model_name] = calculate_metrics(y_test, y_pred)

# 创建DataFrame存储所有模型的评价指标
metrics_df_test = pd.DataFrame(metrics_dict)
metrics_df_test
1
2
3
4
5
6
7
8
9
10
11
Out[156]: 
Metrics ANN ... SVM XGBoost
0 accuracy 0.805195 ... 0.818182 0.792208
1 sensitivity 0.760000 ... 0.640000 0.720000
2 specificity 0.826923 ... 0.903846 0.826923
3 Positive predictive value 0.678571 ... 0.761905 0.666667
4 Negative predictive value 0.760000 ... 0.640000 0.720000
5 F1 score 0.716981 ... 0.695652 0.692308
6 Kappa score 0.569191 ... 0.567416 0.535795

[7 rows x 10 columns]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
# 可视化展示
def plot_metrics(metrics_df, title = "Metrics Plot", save_path = None):

# 获取一个颜色映射(确保颜色的多样性)
colors = plt.cm.get_cmap('tab20', len(metrics_df.columns) - 1) # 根据模型的数量选择足够的颜色
plt.figure(figsize = (10, 8), dpi = 1200)

# 遍历每一列模型,忽略 'Metrics' 列
for idx, model in enumerate(metrics_df.columns[1:]):
plt.plot(metrics_df['Metrics'], metrics_df[model], marker = 'o', label = model, color = colors(idx))

ax = plt.gca()

# 优化坐标轴边框设置
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_major_locator(ticker.MultipleLocator(0.25))
ax.yaxis.set_minor_locator(ticker.MultipleLocator(0.125))
ax.yaxis.set_minor_formatter(ticker.NullFormatter())

# 设置图表标题和标签
plt.title(title, fontsize = 18, fontweight = "bold")
plt.ylim(0.1, 1.0)
plt.yticks([0.25, 0.50, 0.75, 1.00]) # 设置主要刻度
plt.xticks(rotation = 45, ha = 'right', fontsize = 14)
plt.yticks(fontsize = 14)

# 显示图例
plt.legend(title = "Models", loc = 'center left', bbox_to_anchor = (1, 0.5), frameon = False, fontsize = 12)

# 显示网格线,增强可视化效果
plt.grid(which = 'both', linestyle = '-', linewidth = 0.5, color = 'gray')

# 保存图形到文件,如果提供了保存路径
if save_path:
plt.savefig(save_path, bbox_inches = 'tight') # 确保内容不会被裁剪

# 布局优化,避免标签重叠
plt.tight_layout()
plt.close()

# 调用该函数来绘制测试集的指标
plot_metrics(metrics_df_test, title = "Test Set Metrics", save_path = f"{wkdir}/综合评价.png")

训练集

1
2
3
4
5
6
7
8
9
10
11
12
# 计算每个模型在训练集上的评价指标
metrics_dict_train = {'Metrics': ['accuracy', 'sensitivity', 'specificity', 'Positive predictive value', 'Negative predictive value', 'F1 score', 'Kappa score']}

# 遍历模型,计算训练集上的评价指标
for model_name, model in models.items():
# 预测训练集
y_pred_train = model.predict(X_train)
# 计算评价指标
metrics_dict_train[model_name] = calculate_metrics(y_train, y_pred_train)

# 创建DataFrame存储所有模型的训练集评价指标
metrics_df_train = pd.DataFrame(metrics_dict_train)
1
2
3
4
5
6
7
8
9
10
11
12
metrics_df_train
Out[162]:
Metrics ANN ... SVM XGBoost
0 accuracy 0.849162 ... 0.720670 0.983240
1 sensitivity 0.771930 ... 0.315789 0.964912
2 specificity 0.885246 ... 0.909836 0.991803
3 Positive predictive value 0.758621 ... 0.620690 0.982143
4 Negative predictive value 0.771930 ... 0.315789 0.964912
5 F1 score 0.765217 ... 0.418605 0.973451
6 Kappa score 0.654119 ... 0.259596 0.961208

[7 rows x 10 columns]
1
2
# 绘制训练集的评价指标
plot_metrics(metrics_df_train, title = "Training Set Metrics", save_path = f"{wkdir}/综合评价-Train.png")

针对最优模型进行可视化

1
2
3
4
5
6
7
8
9
10
11
# 预测类别的概率
probabilities = best_model_et.predict_proba(X_test)

# 创建一个 DataFrame,列名为类别
probability_df = pd.DataFrame(probabilities, columns = [f'Prob_Class_{i}' for i in range(probabilities.shape[1])])

# 如果需要,可以添加 X_test 的索引或其他标识列
probability_df.index = X_test.index

# 重置索引,并选择是否保留原索引作为一列
probability_df.reset_index(drop = True, inplace = True)
1
2
3
4
5
6
7
8
probability_df.head()
Out[168]:
Prob_Class_0 Prob_Class_1
0 0.870952 0.129048
1 0.448952 0.551048
2 0.857429 0.142571
3 0.318198 0.681802
4 0.381310 0.618690
1
2
3
4
5
6
7
8
9
10
df_selected = X_test

# 创建PCA对象,设置提取1个主成分
pca = PCA(n_components = 1)
# 进行PCA降维
pca_result = pca.fit_transform(df_selected)
# 获取解释方差比率(贡献度)
explained_variance = pca.explained_variance_ratio_
# 将PCA结果保存为DataFrame,并用贡献度标记列名
pca_df = pd.DataFrame(pca_result, columns = [f"PC1 ({explained_variance[0]*100:.2f}%)"])
1
2
3
4
5
6
7
8
pca_df.head()
Out[171]:
PC1 (97.93%)
0 -121.580604
1 -96.202631
2 898.441250
3 -437.343118
4 -519.324984
1
2
3
# 将 PCA 结果 (pca_df) 和其他数据合并
combined_df = pd.concat([probability_df, pca_df, y_test.reset_index(drop = True)], axis = 1)
combined_df.rename(columns = {y_test.name: 'True'}, inplace = True)
1
2
3
4
5
6
7
8
combined_df.head()
Out[174]:
Prob_Class_0 Prob_Class_1 PC1 (97.93%) True
0 0.870952 0.129048 -121.580604 0
1 0.448952 0.551048 -96.202631 1
2 0.857429 0.142571 898.441250 0
3 0.318198 0.681802 -437.343118 0
4 0.381310 0.618690 -519.324984 0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
colors = combined_df['True'].map({0: '#A0D6B4', 1: '#C3A6D8'})

# 绘制散点图,设置更大的点和黑色边框
plt.figure(figsize = (10, 6), dpi = 1200)
plt.scatter(combined_df['Prob_Class_1'], combined_df['PC1 (97.93%)'],
c=colors, edgecolor = 'black', s = 100) # s = 100 增大点的大小,edgecolor = 'black' 添加黑色边框

# 添加分界线和轴标签
plt.axvline(x = 0.5, color = 'gray', linestyle = '--')
plt.xlabel('Predicted value for benign')
plt.ylabel('PC1 (97.93%)')
plt.title('Test set')
# 添加类别图例并移到图外
for true_value, color, label in [(0, '#A0D6B4', 'malignant'), (1, '#C3A6D8', 'benign')]:
plt.scatter([], [], color = color, edgecolor = 'black', s = 100, label = label)
plt.legend(title = "Class", loc = "center left", bbox_to_anchor = (1, 0.5))
plt.tight_layout(rect = [0, 0, 0.85, 1])
plt.savefig(f'{wkdir}/分类效果-Test.png', bbox_inches = 'tight')
plt.close()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# 预测类别的概率
probabilities = best_model_et.predict_proba(X_train)
# 创建一个 DataFrame,列名为类别
probability_df = pd.DataFrame(probabilities, columns = [f'Prob_Class_{i}' for i in range(probabilities.shape[1])])
# 如果需要,可以添加 X_train 的索引或其他标识列
probability_df.index = X_train.index
# 重置索引,并选择是否保留原索引作为一列
probability_df.reset_index(drop = True, inplace = True)
df_selected = X_train # 使用训练集数据
# 创建PCA对象,设置提取1个主成分
pca = PCA(n_components = 1)
# 进行PCA降维
pca_result = pca.fit_transform(df_selected)
# 获取解释方差比率(贡献度)
explained_variance = pca.explained_variance_ratio_
# 将PCA结果保存为DataFrame,并用贡献度标记列名
pca_df = pd.DataFrame(pca_result, columns = [f"PC1 ({explained_variance[0]*100:.2f}%)"])
# 将 PCA 结果 (pca_df) 和其他数据合并
combined_df = pd.concat([probability_df, pca_df, y_train.reset_index(drop = True)], axis = 1)
combined_df.rename(columns = {y_train.name: 'True'}, inplace = True)

colors = combined_df['True'].map({0: '#A0D6B4', 1: '#C3A6D8'})
# 绘制散点图,设置更大的点和黑色边框
plt.figure(figsize = (10, 6),dpi = 1200)
plt.scatter(combined_df['Prob_Class_1'], combined_df['PC1 (99.74%)'],
c = colors, edgecolor = 'black', s = 100) # s = 100 增大点的大小,edgecolor = 'black' 添加黑色边框

# 添加分界线和轴标签
plt.axvline(x = 0.5, color = 'gray', linestyle = '--')
plt.xlabel('Predicted value for benign')
plt.ylabel('PC1 (99.74%)')
plt.title('Train set')
# 添加类别图例并移到图外
for true_value, color, label in [(0, '#A0D6B4', 'malignant'), (1, '#C3A6D8', 'benign')]:
plt.scatter([], [], color = color, edgecolor = 'black', s = 100, label = label)
plt.legend(title = "Class", loc = "center left", bbox_to_anchor = (1, 0.5))
plt.tight_layout(rect = [0, 0, 0.85, 1])
plt.savefig(f'{wkdir}/分类效果-Train.png', bbox_inches = 'tight')
plt.close()

绘制 Precision-Recall 曲线

1
2
3
4
5
6
7
8
9
10
11
12
# 绘制 Precision-Recall 曲线,添加随机分类器基线
pr_display = PrecisionRecallDisplay.from_estimator(
best_model_et,
X_test,
y_test,
plot_chance_level = True, # 添加基线
name = "AdaBoost"
)

_ = pr_display.ax_.set_title("Test set Precision-Recall curve")
plt.savefig(f'{wkdir}/Precision-Recall-Test.png', bbox_inches = 'tight', dpi = 1200)
plt.close()
1
2
3
4
5
6
7
8
9
10
pr_display = PrecisionRecallDisplay.from_estimator(
best_model_et,
X_train,
y_train,
plot_chance_level = True, # 添加基线
name = "AdaBoost"
)
_ = pr_display.ax_.set_title("Train set Precision-Recall curve")
plt.savefig(f'{wkdir}/Precision-Recall-Train.png', bbox_inches = 'tight', dpi = 1200)
plt.close()

AdaBoost

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
# 计算训练集和测试集的预测概率
y_train_pred = best_model_et.predict_proba(X_train)[:, 1]
y_test_pred = best_model_et.predict_proba(X_test)[:, 1]

# 计算 ROC 曲线和 AUC 值
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred)
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred)

roc_auc_train = auc(fpr_train, tpr_train)
roc_auc_test = auc(fpr_test, tpr_test)

# 绘制 ROC 曲线
plt.figure(figsize=(10, 8))

# 绘制训练集 ROC 曲线
plt.plot(fpr_train, tpr_train, label = f"Train: AUC={roc_auc_train:.3f} (n={len(y_train)})", color = "#CECCE5", linewidth = 2)

# 绘制测试集 ROC 曲线
plt.plot(fpr_test, tpr_test, label = f"Test: AUC={roc_auc_test:.3f} (n={len(y_test)})", color = "#B0D9A5", linewidth = 2)

# 绘制随机分类的参考线
plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

# 图形细节设置
plt.title("ROC Curve - AdaBoost (Train & Test)", fontsize = 20, fontweight = "bold")
plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend(loc="lower right", fontsize = 12)

# 去除顶部和右侧边框
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_linewidth(1.5)
plt.gca().spines['bottom'].set_linewidth(1.5)

# 关闭网格线
plt.grid(False)
plt.tight_layout()

# 保存图像为 PDF
plt.savefig(f"{wkdir}/AdaBoost.png", bbox_inches = 'tight')
plt.close()

针对最优模型进行解释

1
2
3
4
5
6
7
8
explainer = shap.TreeExplainer(best_model_et)
shap_values = explainer.shap_values(X_test)

# 提取类别 0 的 SHAP 值
shap_values_class_0 = shap_values[:, :, 0]

# 提取类别 1 的 SHAP 值
shap_values_class_1 = shap_values[:, :, 1]
1
2
3
4
5
6
plt.figure(figsize = (10, 5), dpi = 1200)
shap.summary_plot(shap_values_class_1, X_test, plot_type = "bar", show = False)
plt.title('SHAP_numpy Sorted Feature Importance')
plt.tight_layout()
plt.savefig(f"{wkdir}/SHAP_numpy Sorted Feature Importance.png", bbox_inches = 'tight')
plt.close()
1
2
3
4
plt.figure()
shap.summary_plot(shap_values_class_0, X_test, feature_names = X_test.columns, plot_type = "dot", show = False)
plt.savefig(f"{wkdir}/SHAP_numpy Sorted Feature Importance.dot.png", bbox_inches = 'tight')
plt.close()
1
shap_values_df_1 = pd.DataFrame(shap_values_class_0, columns = X_test.columns)
1
2
3
4
5
6
7
8
9
10
shap_values_df_1.head()
Out[199]:
X_30 X_39 X_46 ... X_33 X_9 X_28
0 0.028375 -0.005348 0.019653 ... 0.008910 0.163576 -0.054551
1 0.015743 -0.017699 -0.024495 ... -0.010931 -0.193108 0.022756
2 0.014259 -0.070168 -0.012669 ... -0.022715 0.143295 0.068182
3 0.004616 0.021329 -0.015542 ... 0.004962 -0.317984 -0.008150
4 0.033777 0.071995 -0.030324 ... 0.002056 -0.328638 0.004778

[5 rows x 8 columns]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
features = shap_values_df_1.columns.tolist()  # 获取所有的特征名称

# 设置画布和子图结构(3行3列)
fig, axes = plt.subplots(3, 3, figsize = (10, 10), dpi = 1200)
axes = axes.flatten()

# 循环绘制每个特征的散点图
for i in range(len(axes)):
if i < len(features): # 如果还有特征未绘制
feature = features[i]
if feature in X_test.columns and feature in shap_values_df_1.columns:
ax = axes[i]
ax.scatter(X_test[feature], shap_values_df_1[feature], s = 10, color = "#6A9ACE")
ax.axhline(y = 0, color = 'red', linestyle = '-.', linewidth = 1) # 添加横线
ax.set_xlabel(feature, fontsize = 10)
ax.set_ylabel(f'SHAP value for\n{feature}', fontsize = 10)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
else:
# 如果特征不存在,隐藏对应的子图
axes[i].axis('off')
else:
# 如果超过了特征数量,关闭剩余的子图
axes[i].axis('off')

# 调整布局
plt.tight_layout()
plt.savefig(f"{wkdir}/SHAP value for features.png", bbox_inches = 'tight')
plt.close()

以下针对预测正确的类别 0 和类别 1 分别绘制瀑布图[这里就绘制测试集第一个样本 (0) 和第二个样本都预测正确 (1):

1
2
3
4
5
6
7
8
9
10
11
y_pred_et = best_model_et.predict(X_test)
explainer = shap.TreeExplainer(best_model_et)

# 计算shap值为Explanation格式
shap_values_Explanation = explainer(X_test)

# 提取类别 0 的 SHAP 值
shap_values_class_0 = shap_values_Explanation[:, :, 0]

# 提取类别 1 的 SHAP 值
shap_values_class_1 = shap_values_Explanation[:, :, 1]
1
2
3
4
5
6
y_pred_et
Out[203]:
array([0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0,
0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1])

预测类别 0 样本绘制

1
2
3
4
5
6
7
plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 1 个样本 0 类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_0[0], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 1 个样本 0 类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.close()
1
2
3
4
5
6
7
plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 1 个样本 1 类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_1[0], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 1 个样本 1 类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.show()

预测类别 1 样本绘制

1
2
3
4
5
6
7
plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 2 个样本0类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_0[1], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 2 个样本0类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.show()
1
2
3
4
5
6
7
plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 2 个样本1类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_1[1], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 2 个样本1类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.show()

完整代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
# -*- coding: utf-8 -*-

import shap
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
import seaborn as sns
import scipy.stats as stats
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import PrecisionRecallDisplay
from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import roc_curve, auc
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, cohen_kappa_score, confusion_matrix

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False
import warnings
warnings.filterwarnings("ignore")


# 混淆矩阵
def plot_confusion_matrix_plus(y_true, y_pred, title = "Model", labels = None, save_path = None):
unique_labels = set(np.unique(np.concatenate([np.array(y_true), np.array(y_pred)])))
labels_used = list(unique_labels)
# 若未提供 labels,或者提供的 labels 与 y_true 不一致,则自动推断
if labels is None:
labels_with_totals = labels_used + ["Total"]
else:
labels_with_totals = [labels[lab] for lab in labels if lab in labels_used] + ["Total"]

# 若最终只有单一标签(极端情况),也能正常绘制
labels_used = list(labels_used)
cm = confusion_matrix(y_true, y_pred, labels = labels_used)

# 行标准化(避免除零)
row_sums = cm.sum(axis = 1, keepdims = True)
with np.errstate(divide = 'ignore', invalid = 'ignore'):
cm_normalized = np.divide(cm, row_sums, where = (row_sums != 0))
cm_normalized[np.isnan(cm_normalized)] = 0.0

# 添加 Total 行和列(计数)
cm_with_totals = np.vstack([cm, cm.sum(axis = 0, keepdims = True)])
cm_with_totals = np.column_stack([cm_with_totals, cm_with_totals.sum(axis = 1, keepdims = True)])

# 用于着色的矩阵(最后一行/列置 0,使其灰色覆盖)
heatmap_data = cm_normalized.copy()
heatmap_data = np.vstack([heatmap_data, np.zeros((1, heatmap_data.shape[1]))])
heatmap_data = np.column_stack([heatmap_data, np.zeros((heatmap_data.shape[0], 1))])

fig, ax = plt.subplots(figsize = (10, 8))
colors = sns.color_palette("Greens", as_cmap = True)
grey_color = "#f0f0f0"

sns.heatmap(
heatmap_data,
annot = False,
fmt = "",
cmap = colors,
xticklabels = labels_with_totals,
yticklabels = labels_with_totals,
cbar = False,
square = True,
linewidths = 1.5,
linecolor = "white",
ax = ax,
vmin = 0.0, vmax = 1.0
)

# 覆盖 Total 行/列为灰色
n_rows, n_cols = cm.shape
for i in range(n_rows):
ax.add_patch(plt.Rectangle((n_cols, i), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))
for j in range(n_cols):
ax.add_patch(plt.Rectangle((j, n_rows), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))
ax.add_patch(plt.Rectangle((n_cols, n_rows), 1, 1, fill = True, color = grey_color, edgecolor = "white", lw = 1.5))

# 填充数值与百分比
total_all = cm_with_totals[-1, -1]
for i in range(cm_with_totals.shape[0]):
for j in range(cm_with_totals.shape[1]):
if i < n_rows and j < n_cols:
value = cm[i, j]
percentage = cm_normalized[i, j] * 100 if row_sums[i, 0] > 0 else 0.0
ax.text(j + 0.5, i + 0.45, f"{percentage:.1f}%", ha = "center", va = "center", fontsize = 14, color = "black")
ax.text(j + 0.5, i + 0.72, f"{int(value)}", ha = "center", va = "center", fontsize = 13, color = "black")
elif i == n_rows and j == n_cols:
ax.text(j + 0.5, i + 0.5, f"{int(total_all)}", ha = "center", va = "center", fontsize = 15, color = "black")
else:
total_value = cm_with_totals[i, j]
total_percentage = (total_value / total_all * 100) if total_all > 0 else 0.0
ax.text(j + 0.5, i + 0.45, f"{total_percentage:.1f}%", ha = "center", va = "center", fontsize = 14, color = "black")
ax.text(j + 0.5, i + 0.72, f"{int(total_value)}", ha = "center", va = "center", fontsize = 13, color = "black")

plt.title(title, fontsize = 20)
plt.xlabel("Prediction (model output)", fontsize = 16)
plt.ylabel("Truth (observation)", fontsize = 16)
plt.tight_layout()
if save_path:
plt.savefig(save_path, bbox_inches = 'tight')
plt.close()


if __name__ == '__main__':

import os
wkdir = 'C:/Users/Administrator/Desktop'
os.chdir(wkdir)

path = 'Z:/TData/big-data/sad41d8cd/251027_Medical_Classification_Model_Comparison.xlsx'
df = pd.read_excel(path)

if True:
df.head()
df.columns
df.info()

if True:

# 划分特征和目标变量
X = df.drop(['y'], axis = 1) # 从数据集中去掉目标变量 'y',得到特征变量 X
y = df['y'] # 提取目标变量 y

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(
X, # 特征变量
y, # 目标变量
test_size = 0.3, # 测试集所占比例,这里为 30%
random_state = 42, # 随机种子,确保结果可重复
stratify = df['y'] # 按目标变量 'y' 的分布进行分层采样,确保训练集和测试集中目标变量的分布一致
)

# 特征筛选
if True:

# 创建 LGBM 分类器
lgbm_clf = lgb.LGBMClassifier(random_state = 42, verbose = -1)

# 训练模型
lgbm_clf.fit(X_train, y_train)

# 获取特征重要性
feature_importances = lgbm_clf.feature_importances_

# 构建特征重要性排名
feature_importance_df = pd.DataFrame({
'Feature': X.columns,
'Importance': feature_importances
}).sort_values(by = 'Importance', ascending = False)


# 只取前30个重要特征
top_n = 30
top_features = feature_importance_df.head(top_n)

# 调整字体大小
plt.figure(figsize = (12, 8), dpi = 1200)
plt.barh(top_features['Feature'], top_features['Importance'], color = 'skyblue')
plt.xlabel('Importance', fontsize = 14)
plt.ylabel('Feature', fontsize = 14)
plt.title(f'Top {top_n} Feature Importance', fontsize = 16)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.gca().invert_yaxis()
plt.savefig(f'{wkdir}/特征筛选.png', bbox_inches = 'tight')
plt.close()

if True:

# 初始化存储结果的 DataFrame
selection_results = pd.DataFrame(columns = ['Feature', 'Importance', 'Mean_ROC'])

# 初始化用于训练的特征列表
selected_features = []

# K 折交叉验证
kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

# 获取折数
n_splits = kf.get_n_splits()

# 动态创建列名
fold_columns = [f'Fold_{i+1}_ROC' for i in range(n_splits)]

# 依次添加特征
for i in range(len(top_features)):
# 当前特征
current_feature = top_features.iloc[i]['Feature']
selected_features.append(current_feature)

fold_roc_scores = [] # 存储每折的 ROC AUC 分数

# K 折交叉验证
for train_idx, val_idx in kf.split(X_train):
# 划分训练集和验证集
X_train_fold, X_val_fold = X_train.iloc[train_idx][selected_features], X_train.iloc[val_idx][selected_features]
y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]

# 创建并训练 LGBM 模型
lgbm_clf = lgb.LGBMClassifier(random_state = 42, verbose = -1)
lgbm_clf.fit(X_train_fold, y_train_fold)

# 预测并计算 ROC AUC 分数
y_val_proba = lgbm_clf.predict_proba(X_val_fold)[:, 1] # 概率分数
fold_roc_score = roc_auc_score(y_val_fold, y_val_proba)
fold_roc_scores.append(fold_roc_score)

# 计算平均 ROC AUC 分数
mean_roc_score = np.mean(fold_roc_scores)

# 保存结果
row_data = {
'Feature': current_feature,
'Importance': top_features.iloc[i]['Importance'],
'Mean_ROC': mean_roc_score,
}

# 添加每折的ROC AUC分数
for j, score in enumerate(fold_roc_scores):
row_data[fold_columns[j]] = score
row_df = pd.DataFrame([row_data])
selection_results = pd.concat([selection_results, row_df], ignore_index = True)

selection_results

if True:

# 将 Importance 列百分比化并归一化
selection_results['Importance'] = (
selection_results['Importance'] / selection_results['Importance'].sum()
)
selection_results

if True:

# 动态获取折列名称
fold_columns = [col for col in selection_results.columns if 'Fold_' in col]

# 添加置信区间列到 selection_results
selection_results['CI_Lower'] = None
selection_results['CI_Upper'] = None

# 遍历每行计算置信区间
for index, row in selection_results.iterrows():

# 提取每折的ROC成绩
fold_roc_scores = [row[fold] for fold in fold_columns]

# 样本大小
n_folds = len(fold_roc_scores)

# 平均值和标准误差
mean_roc = row['Mean_ROC']
std_err = stats.sem(fold_roc_scores) # 标准误差

# t值(95%置信区间,n_folds - 1 自由度)
t_value = stats.t.ppf(0.975, df=n_folds - 1)

# 置信区间
ci_lower = mean_roc - t_value * std_err
ci_upper = mean_roc + t_value * std_err

# 确保置信区间上限不超过1
ci_upper = min(ci_upper, 1)

# 保存到 DataFrame
selection_results.at[index, 'CI_Lower'] = ci_lower
selection_results.at[index, 'CI_Upper'] = ci_upper

# 输出结果
selection_results

if True:

# 确保 Importance 列为数值类型
selection_results['Importance'] = pd.to_numeric(selection_results['Importance'], errors = 'coerce')

# 确保置信区间的列为数值类型
selection_results['CI_Lower'] = pd.to_numeric(selection_results['CI_Lower'], errors = 'coerce')
selection_results['CI_Upper'] = pd.to_numeric(selection_results['CI_Upper'], errors = 'coerce')

# 参数:选择前 n 个特征
n_features = 7

fig, ax1 = plt.subplots(figsize = (16, 6), dpi = 1200)

# 渐变柱状图:特征贡献度
norm = plt.Normalize(selection_results['Importance'].min(), selection_results['Importance'].max())
colors = plt.cm.Blues(norm(selection_results['Importance']))

ax1.bar(selection_results['Feature'], selection_results['Importance'], color = colors, label = 'Feature Importance')
ax1.set_xlabel("Features", fontsize = 18, fontweight = 'bold')
ax1.set_ylabel("Feature Importance", fontsize = 18, fontweight = 'bold')
ax1.tick_params(axis = 'y', labelsize = 12, width = 1.5)

# 修改 x 轴特征颜色,前 n_features 用红色,其他用黑色
x_labels = selection_results['Feature']
x_colors = ['red' if i < n_features else 'black' for i in range(len(x_labels))]
for tick_label, color in zip(ax1.get_xticklabels(), x_colors):
tick_label.set_color(color)

ax1.tick_params(axis = 'x', rotation = 45, labelsize = 12, width = 1.5)

# 折线图:AUC成绩
ax2 = ax1.twinx()

# 红点和红线:前 n_features 个特征
ax2.plot(
selection_results['Feature'][:n_features + 1], # 连接红点到黑点的过渡
selection_results['Mean_ROC'][:n_features + 1],
color = "red", marker = 'o', linestyle = '-', label = "Mean AUC (Top Features)"
)

# 黑点和黑线:其余特征
ax2.plot(
selection_results['Feature'][n_features:],
selection_results['Mean_ROC'][n_features:],
color = "black", marker = 'o', linestyle = '-', label = "Mean AUC (Other Features)"
)

# 添加置信区间阴影
ax2.fill_between(
selection_results['Feature'],
selection_results['CI_Lower'], # 置信区间下限
selection_results['CI_Upper'], # 置信区间上限
color = 'red',
alpha = 0.2, # 设置透明度
)

ax2.set_ylabel("Mean AUC", fontsize = 18, fontweight = 'bold')
ax2.tick_params(axis = 'y', labelsize = 12, width = 1.5)
ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.3f}')) # 保留三位小数

# 添加标题
plt.title(f"Feature Contribution and AUC Performance (Top {n_features} Features Highlighted)", fontsize = 18, fontweight = 'bold')
fig.tight_layout()
plt.savefig(f"{wkdir}/特征曲线.png", bbox_inches = 'tight')
plt.close()

list(selection_results['Feature'][0:8])

# 重新划分数据集
if True:

# 划分特征和目标变量
X = df[['X_30', 'X_39', 'X_46', 'X_32', 'X_34', 'X_33', 'X_9', 'X_28']] # 根据特征筛选变量重新划分特征
y = df['y'] # 提取目标变量 y

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(
X, # 特征变量
y, # 目标变量
test_size = 0.3, # 测试集所占比例,这里为 30%
random_state = 42, # 随机种子,确保结果可重复
stratify = df['y'] # 按目标变量 'y' 的分布进行分层采样,确保训练集和测试集中目标变量的分布一致
)

# Artificial Neural Network
if True:

# 定义 ANN 的基础参数
params_ann = {
'random_state': 42, # 随机种子
'max_iter': 200 # 最大迭代次数
}

# 初始化 ANN 分类模型
model_ann = MLPClassifier(**params_ann)

# 定义参数网格
param_grid_ann = {
'hidden_layer_sizes': [(50,), (100,), (100, 50)], # 隐藏层大小
'learning_rate_init': [0.001, 0.01, 0.1], # 学习率
'alpha': [0.0001, 0.001, 0.01] # 正则化参数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_ann = GridSearchCV(
estimator = model_ann,
param_grid = param_grid_ann,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_ann.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_ann = grid_search_ann.best_estimator_

# 预测测试集
y_pred = best_model_ann.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

# DT
if True:

# 定义 DT 的基础参数
params_dt = {'random_state': 42}

# 初始化决策树分类模型
model_dt = DecisionTreeClassifier(**params_dt)

# 定义参数网格
param_grid_dt = {
'criterion': ['gini', 'entropy'], # 划分标准
'max_depth': [None, 10, 20, 30], # 最大深度
'min_samples_split': [2, 5, 10], # 节点最小分裂样本数
'min_samples_leaf': [1, 2, 5] # 叶节点最小样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_dt = GridSearchCV(
estimator = model_dt,
param_grid = param_grid_dt,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_dt.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_dt = grid_search_dt.best_estimator_

# 预测测试集
y_pred = best_model_dt.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

# ET
if True:

# 定义 ET 的基础参数
params_et = {
'random_state': 42 # 随机种子
}

# 初始化 Extra Trees 分类模型
model_et = ExtraTreesClassifier(**params_et)

# 定义参数网格
param_grid_et = {
'n_estimators': [50, 100, 200], # 树的数量
'criterion': ['gini', 'entropy'], # 划分标准
'max_depth': [None, 10, 20, 30], # 最大深度
'min_samples_split': [2, 5, 10], # 节点最小分裂样本数
'min_samples_leaf': [1, 2, 5] # 叶节点最小样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_et = GridSearchCV(
estimator = model_et,
param_grid = param_grid_et,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_et.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_et = grid_search_et.best_estimator_

# 预测测试集
y_pred = best_model_et.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

# GBM
if True:

# 定义 GBM 的基础参数
params_gbm = {'random_state': 42}

# 初始化梯度增强机分类模型
model_gbm = GradientBoostingClassifier(**params_gbm)

# 定义参数网格
param_grid_gbm = {
'n_estimators': [50, 100, 200], # 树的数量
'learning_rate': [0.01, 0.1, 0.2], # 学习率
'max_depth': [3, 5, 10], # 最大深度
'min_samples_split': [2, 5, 10], # 节点最小分裂样本数
'min_samples_leaf': [1, 2, 5] # 叶节点最小样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_gbm = GridSearchCV(
estimator = model_gbm,
param_grid = param_grid_gbm,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_gbm.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_gbm = grid_search_gbm.best_estimator_

# 预测测试集
y_pred = best_model_gbm.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

# KNN
if True:

# 初始化 KNN 分类模型
model_knn = KNeighborsClassifier()

# 定义参数网格
param_grid_knn = {
'n_neighbors': [3, 5, 10], # 邻居数量
'weights': ['uniform', 'distance'], # 权重方式
'metric': ['euclidean', 'manhattan', 'minkowski'] # 距离度量方式
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_knn = GridSearchCV(
estimator = model_knn,
param_grid = param_grid_knn,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_knn.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_knn = grid_search_knn.best_estimator_

# 预测测试集
y_pred = best_model_knn.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

# LightGBM
if True:

# 初始化 LightGBM 分类模型
model_lgbm = LGBMClassifier(random_state = 42, verbose = -1)

# 定义参数网格
param_grid_lgbm = {
'n_estimators': [50, 100, 200], # 树的数量
'learning_rate': [0.01, 0.1, 0.2], # 学习率
'max_depth': [-1, 10, 20], # 最大深度
'num_leaves': [31, 50, 100], # 叶节点数
'min_child_samples': [10, 20, 30] # 最小叶节点样本数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_lgbm = GridSearchCV(
estimator = model_lgbm,
param_grid = param_grid_lgbm,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_lgbm.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_lgbm = grid_search_lgbm.best_estimator_

# 预测测试集
y_pred = best_model_lgbm.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

# Random Forest
if True:

# 初始化随机森林分类模型
model_rf = RandomForestClassifier(random_state = 42)

# 定义参数网格
param_grid_rf = {
'n_estimators': [50, 100, 200], # 树的数量
'max_depth': [None, 10, 20, 30], # 最大深度
'min_samples_split': [2, 5, 10], # 节点最小分裂样本数
'min_samples_leaf': [1, 2, 5], # 叶节点最小样本数
'criterion': ['gini', 'entropy'] # 划分标准
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_rf = GridSearchCV(
estimator = model_rf,
param_grid = param_grid_rf,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_rf.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_rf = grid_search_rf.best_estimator_

# 预测测试集
y_pred = best_model_rf.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

# Support Vector Machine
if True:

# 初始化支持向量机分类模型
model_svm = SVC(probability = True, random_state = 42)

# 定义参数网格
param_grid_svm = {
'C': [0.1, 1, 10, 100], # 惩罚参数
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_svm = GridSearchCV(
estimator = model_svm,
param_grid = param_grid_svm,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_svm.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_svm = grid_search_svm.best_estimator_

# 预测测试集
y_pred = best_model_svm.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

# eXtreme Gradient Boosting, XGBoost
if True:

# 初始化 XGBoost 分类模型
model_xgb = XGBClassifier(use_label_encoder = False, eval_metric = 'logloss', random_state = 42)

# 定义参数网格
param_grid_xgb = {
'n_estimators': [50, 100, 200], # 树的数量
'learning_rate': [0.01, 0.1, 0.2], # 学习率
'max_depth': [3, 5, 10], # 最大深度
'subsample': [0.8, 1.0], # 子采样比率
'colsample_bytree': [0.8, 1.0] # 树的列采样比率
}

# 使用 GridSearchCV 进行网格搜索和 k 折交叉验证
grid_search_xgb = GridSearchCV(
estimator = model_xgb,
param_grid = param_grid_xgb,
scoring = 'neg_log_loss', # 评价指标为负对数损失
cv = 5, # 5 折交叉验证
n_jobs = -1, # 并行计算
verbose = 1 # 输出详细进度信息
)

# 训练模型
grid_search_xgb.fit(X_train, y_train)

# 使用最优参数训练模型
best_model_xgb = grid_search_xgb.best_estimator_

# 预测测试集
y_pred = best_model_xgb.predict(X_test)

# 输出模型报告,查看评价指标
print(classification_report(y_test, y_pred))

# 绘制 AUC 曲线
plot_confusion_matrix_plus(y_test, y_pred, title = '', labels = {1:"Hyperplasia", 0:"Normal"}, save_path = None)

# Test AUC
if True:

# 获取所有模型及其颜色的映射
models = {
"Random Forest": {"model": best_model_rf, "color": "#FDD379"},
"LightGBM": {"model": best_model_lgbm, "color": "#F7C2CD"},
"Gradient Boosting": {"model": best_model_gbm, "color": "#A6DAEF"},
"XGBoost": {"model": best_model_xgb, "color": "#B0D9A5"},
"SVM": {"model": best_model_svm, "color": "#F06292"},
"KNN": {"model": best_model_knn, "color": "#64B5F6"},
"Decision Tree": {"model": best_model_dt, "color": "#FFB74D"},
"Extra Trees": {"model": best_model_et, "color": "#9575CD"},
"ANN": {"model": best_model_ann, "color": "#4DB6AC"}
}

# 绘制 ROC 曲线
plt.figure(figsize = (12, 10))

# 遍历每个模型计算 ROC 和 AUC
for name, info in models.items():
model = info["model"]
color = info["color"]
try:
y_pred = model.predict_proba(X_test)[:, 1] # 获取概率分布
except AttributeError:
y_pred = model.decision_function(X_test) # 对于 SVM,使用 decision_function
y_pred = (y_pred - y_pred.min()) / (y_pred.max() - y_pred.min()) # 归一化

# 计算 ROC 曲线和 AUC 值
fpr, tpr, _ = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)

# 绘制曲线
plt.plot(fpr, tpr, label = f"{name}: AUC = {roc_auc:.3f}", color = color, linewidth = 2)

# 绘制随机分类的参考线
plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

# 图形细节设置
plt.title("ROC Curve - Test Set Model Comparison", fontsize = 20, fontweight = "bold")
plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend(loc = "lower right", fontsize = 12)

# 去除顶部和右侧边框
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_linewidth(1.5)
plt.gca().spines['bottom'].set_linewidth(1.5)

# 关闭网格线
plt.grid(False)
plt.tight_layout()
plt.savefig(f"{wkdir}/Test-AUC.png", bbox_inches = 'tight')
plt.close()

# Train AUC
if True:

# 绘制训练集 ROC 曲线
plt.figure(figsize = (12, 10))

# 遍历每个模型计算 ROC 和 AUC(训练集)
for name, info in models.items():
model = info["model"]
color = info["color"]
try:
y_pred = model.predict_proba(X_train)[:, 1] # 获取概率分布
except AttributeError:
y_pred = model.decision_function(X_train) # 对于 SVM,使用 decision_function
y_pred = (y_pred - y_pred.min()) / (y_pred.max() - y_pred.min()) # 归一化

# 计算 ROC 曲线和 AUC 值
fpr, tpr, _ = roc_curve(y_train, y_pred)
roc_auc = auc(fpr, tpr)

# 绘制曲线
plt.plot(fpr, tpr, label=f"{name} (Train AUC = {roc_auc:.3f})", color = color, linewidth = 2)

# 绘制随机分类的参考线
plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

# 图形细节设置
plt.title("ROC Curve - Training Set Model Comparison", fontsize = 20, fontweight = "bold")
plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend(loc="lower right", fontsize = 12)

# 去除顶部和右侧边框
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_linewidth(1.5)
plt.gca().spines['bottom'].set_linewidth(1.5)

# 关闭网格线
plt.grid(False)
plt.tight_layout()
plt.savefig(f"{wkdir}/Train-AUC.png", bbox_inches = 'tight')
plt.close()

# 训练集测试集评价指标系统输出比较
if True:

# 定义函数计算评价指标
def calculate_metrics(y_true, y_pred):
accuracy = accuracy_score(y_true, y_pred)
sensitivity = recall_score(y_true, y_pred) # 敏感性(召回率)
specificity = confusion_matrix(y_true, y_pred)[0, 0] / (confusion_matrix(y_true, y_pred)[0, 0] + confusion_matrix(y_true, y_pred)[0, 1])
positive_predictive_value = precision_score(y_true, y_pred)
negative_predictive_value = confusion_matrix(y_true, y_pred)[1, 1] / (confusion_matrix(y_true, y_pred)[1, 1] + confusion_matrix(y_true, y_pred)[1, 0])
f1 = f1_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
return [accuracy, sensitivity, specificity, positive_predictive_value, negative_predictive_value, f1, kappa]

# 模型字典
models = {
'ANN': best_model_ann,
'Decision Tree': best_model_dt,
'Extra Trees': best_model_et,
'Gradient Boosting': best_model_gbm,
'KNN': best_model_knn,
'LightGBM': best_model_lgbm,
'Random Forest': best_model_rf,
'SVM': best_model_svm,
'XGBoost': best_model_xgb
}

# 初始化存储模型评价指标的列表
metrics_dict = {'Metrics': ['accuracy', 'sensitivity', 'specificity', 'Positive predictive value', 'Negative predictive value', 'F1 score', 'Kappa score']}

# 计算每个模型的指标并存储
for model_name, model in models.items():
# 预测测试集
y_pred = model.predict(X_test)
# 计算评价指标
metrics_dict[model_name] = calculate_metrics(y_test, y_pred)

# 创建DataFrame存储所有模型的评价指标
metrics_df_test = pd.DataFrame(metrics_dict)
metrics_df_test

# 可视化展示
def plot_metrics(metrics_df, title = "Metrics Plot", save_path = None):

# 获取一个颜色映射(确保颜色的多样性)
colors = plt.cm.get_cmap('tab20', len(metrics_df.columns) - 1) # 根据模型的数量选择足够的颜色
plt.figure(figsize = (10, 8), dpi = 1200)

# 遍历每一列模型,忽略 'Metrics' 列
for idx, model in enumerate(metrics_df.columns[1:]):
plt.plot(metrics_df['Metrics'], metrics_df[model], marker = 'o', label = model, color = colors(idx))

ax = plt.gca()

# 优化坐标轴边框设置
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_major_locator(ticker.MultipleLocator(0.25))
ax.yaxis.set_minor_locator(ticker.MultipleLocator(0.125))
ax.yaxis.set_minor_formatter(ticker.NullFormatter())

# 设置图表标题和标签
plt.title(title, fontsize = 18, fontweight = "bold")
plt.ylim(0.1, 1.0)
plt.yticks([0.25, 0.50, 0.75, 1.00]) # 设置主要刻度
plt.xticks(rotation = 45, ha = 'right', fontsize = 14)
plt.yticks(fontsize = 14)

# 显示图例
plt.legend(title = "Models", loc = 'center left', bbox_to_anchor = (1, 0.5), frameon = False, fontsize = 12)

# 显示网格线,增强可视化效果
plt.grid(which = 'both', linestyle = '-', linewidth = 0.5, color = 'gray')

# 保存图形到文件,如果提供了保存路径
if save_path:
plt.savefig(save_path, bbox_inches = 'tight') # 确保内容不会被裁剪

# 布局优化,避免标签重叠
plt.tight_layout()
plt.close()

# 调用该函数来绘制测试集的指标
plot_metrics(metrics_df_test, title = "Test Set Metrics", save_path = f"{wkdir}/综合评价-Test.png")

# 训练集测试集评价指标系统输出比较
if True:

# 计算每个模型在训练集上的评价指标
metrics_dict_train = {'Metrics': ['accuracy', 'sensitivity', 'specificity', 'Positive predictive value', 'Negative predictive value', 'F1 score', 'Kappa score']}

# 遍历模型,计算训练集上的评价指标
for model_name, model in models.items():
# 预测训练集
y_pred_train = model.predict(X_train)
# 计算评价指标
metrics_dict_train[model_name] = calculate_metrics(y_train, y_pred_train)

# 创建DataFrame存储所有模型的训练集评价指标
metrics_df_train = pd.DataFrame(metrics_dict_train)

# 绘制训练集的评价指标
plot_metrics(metrics_df_train, title = "Training Set Metrics", save_path = f"{wkdir}/综合评价-Train.png")

# 针对最优模型进行可视化
if True:

# 预测类别的概率
probabilities = best_model_et.predict_proba(X_test)

# 创建一个 DataFrame,列名为类别
probability_df = pd.DataFrame(probabilities, columns = [f'Prob_Class_{i}' for i in range(probabilities.shape[1])])

# 如果需要,可以添加 X_test 的索引或其他标识列
probability_df.index = X_test.index

# 重置索引,并选择是否保留原索引作为一列
probability_df.reset_index(drop = True, inplace = True)

if True:

df_selected = X_test

# 创建PCA对象,设置提取1个主成分
pca = PCA(n_components = 1)
# 进行PCA降维
pca_result = pca.fit_transform(df_selected)
# 获取解释方差比率(贡献度)
explained_variance = pca.explained_variance_ratio_
# 将PCA结果保存为DataFrame,并用贡献度标记列名
pca_df = pd.DataFrame(pca_result, columns = [f"PC1 ({explained_variance[0]*100:.2f}%)"])

pca_df.head()

if True:

# 将 PCA 结果 (pca_df) 和其他数据合并
combined_df = pd.concat([probability_df, pca_df, y_test.reset_index(drop = True)], axis = 1)
combined_df.rename(columns = {y_test.name: 'True'}, inplace = True)

if True:

colors = combined_df['True'].map({0: '#A0D6B4', 1: '#C3A6D8'})

# 绘制散点图,设置更大的点和黑色边框
plt.figure(figsize = (10, 6), dpi = 1200)
plt.scatter(combined_df['Prob_Class_1'], combined_df['PC1 (97.93%)'],
c=colors, edgecolor = 'black', s = 100) # s = 100 增大点的大小,edgecolor = 'black' 添加黑色边框

# 添加分界线和轴标签
plt.axvline(x = 0.5, color = 'gray', linestyle = '--')
plt.xlabel('Predicted value for benign')
plt.ylabel('PC1 (97.93%)')
plt.title('Test set')
# 添加类别图例并移到图外
for true_value, color, label in [(0, '#A0D6B4', 'malignant'), (1, '#C3A6D8', 'benign')]:
plt.scatter([], [], color = color, edgecolor = 'black', s = 100, label = label)
plt.legend(title = "Class", loc = "center left", bbox_to_anchor = (1, 0.5))
plt.tight_layout(rect = [0, 0, 0.85, 1])
plt.savefig(f'{wkdir}/分类效果-Test.png', bbox_inches = 'tight')
plt.close()

if True:

# 预测类别的概率
probabilities = best_model_et.predict_proba(X_train)
# 创建一个 DataFrame,列名为类别
probability_df = pd.DataFrame(probabilities, columns = [f'Prob_Class_{i}' for i in range(probabilities.shape[1])])
# 如果需要,可以添加 X_train 的索引或其他标识列
probability_df.index = X_train.index
# 重置索引,并选择是否保留原索引作为一列
probability_df.reset_index(drop = True, inplace = True)
df_selected = X_train # 使用训练集数据
# 创建PCA对象,设置提取1个主成分
pca = PCA(n_components = 1)
# 进行PCA降维
pca_result = pca.fit_transform(df_selected)
# 获取解释方差比率(贡献度)
explained_variance = pca.explained_variance_ratio_
# 将PCA结果保存为DataFrame,并用贡献度标记列名
pca_df = pd.DataFrame(pca_result, columns = [f"PC1 ({explained_variance[0]*100:.2f}%)"])
# 将 PCA 结果 (pca_df) 和其他数据合并
combined_df = pd.concat([probability_df, pca_df, y_train.reset_index(drop = True)], axis = 1)
combined_df.rename(columns = {y_train.name: 'True'}, inplace = True)

colors = combined_df['True'].map({0: '#A0D6B4', 1: '#C3A6D8'})
# 绘制散点图,设置更大的点和黑色边框
plt.figure(figsize = (10, 6),dpi = 1200)
plt.scatter(combined_df['Prob_Class_1'], combined_df['PC1 (99.74%)'],
c = colors, edgecolor = 'black', s = 100) # s = 100 增大点的大小,edgecolor = 'black' 添加黑色边框

# 添加分界线和轴标签
plt.axvline(x = 0.5, color = 'gray', linestyle = '--')
plt.xlabel('Predicted value for benign')
plt.ylabel('PC1 (99.74%)')
plt.title('Train set')
# 添加类别图例并移到图外
for true_value, color, label in [(0, '#A0D6B4', 'malignant'), (1, '#C3A6D8', 'benign')]:
plt.scatter([], [], color = color, edgecolor = 'black', s = 100, label = label)
plt.legend(title = "Class", loc = "center left", bbox_to_anchor = (1, 0.5))
plt.tight_layout(rect = [0, 0, 0.85, 1])
plt.savefig(f'{wkdir}/分类效果-Train.png', bbox_inches = 'tight')
plt.close()


# 绘制 Precision-Recall 曲线,添加随机分类器基线
if True:

pr_display = PrecisionRecallDisplay.from_estimator(
best_model_et,
X_test,
y_test,
plot_chance_level = True, # 添加基线
name = "AdaBoost"
)

_ = pr_display.ax_.set_title("Test set Precision-Recall curve")
plt.savefig(f'{wkdir}/Precision-Recall-Test.png', bbox_inches = 'tight', dpi = 1200)
plt.close()

pr_display = PrecisionRecallDisplay.from_estimator(
best_model_et,
X_train,
y_train,
plot_chance_level = True, # 添加基线
name = "AdaBoost"
)
_ = pr_display.ax_.set_title("Train set Precision-Recall curve")
plt.savefig(f'{wkdir}/Precision-Recall-Train.png', bbox_inches = 'tight', dpi = 1200)
plt.close()

# AdaBoost
if True:

# 计算训练集和测试集的预测概率
y_train_pred = best_model_et.predict_proba(X_train)[:, 1]
y_test_pred = best_model_et.predict_proba(X_test)[:, 1]

# 计算 ROC 曲线和 AUC 值
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred)
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred)

roc_auc_train = auc(fpr_train, tpr_train)
roc_auc_test = auc(fpr_test, tpr_test)

# 绘制 ROC 曲线
plt.figure(figsize=(10, 8))

# 绘制训练集 ROC 曲线
plt.plot(fpr_train, tpr_train, label = f"Train: AUC={roc_auc_train:.3f} (n={len(y_train)})", color = "#CECCE5", linewidth = 2)

# 绘制测试集 ROC 曲线
plt.plot(fpr_test, tpr_test, label = f"Test: AUC={roc_auc_test:.3f} (n={len(y_test)})", color = "#B0D9A5", linewidth = 2)

# 绘制随机分类的参考线
plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

# 图形细节设置
plt.title("ROC Curve - AdaBoost (Train & Test)", fontsize = 20, fontweight = "bold")
plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend(loc="lower right", fontsize = 12)

# 去除顶部和右侧边框
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_linewidth(1.5)
plt.gca().spines['bottom'].set_linewidth(1.5)

# 关闭网格线
plt.grid(False)
plt.tight_layout()

# 保存图像为 PDF
plt.savefig(f"{wkdir}/AdaBoost.png", bbox_inches = 'tight')
plt.close()

# 针对最优模型进行解释
if True:

explainer = shap.TreeExplainer(best_model_et)
shap_values = explainer.shap_values(X_test)

# 提取类别 0 的 SHAP 值
shap_values_class_0 = shap_values[:, :, 0]

# 提取类别 1 的 SHAP 值
shap_values_class_1 = shap_values[:, :, 1]


plt.figure(figsize = (10, 5), dpi = 1200)
shap.summary_plot(shap_values_class_1, X_test, plot_type = "bar", show = False)
plt.title('SHAP_numpy Sorted Feature Importance')
plt.tight_layout()
plt.savefig(f"{wkdir}/SHAP_numpy Sorted Feature Importance.png", bbox_inches = 'tight')
plt.close()

plt.figure()
shap.summary_plot(shap_values_class_0, X_test, feature_names = X_test.columns, plot_type = "dot", show = False)
plt.savefig(f"{wkdir}/SHAP_numpy Sorted Feature Importance.dot.png", bbox_inches = 'tight')
plt.close()

shap_values_df_1 = pd.DataFrame(shap_values_class_0, columns = X_test.columns)
shap_values_df_1.head()

if True:

features = shap_values_df_1.columns.tolist() # 获取所有的特征名称

# 设置画布和子图结构(3行3列)
fig, axes = plt.subplots(3, 3, figsize = (10, 10), dpi = 1200)
axes = axes.flatten()

# 循环绘制每个特征的散点图
for i in range(len(axes)):
if i < len(features): # 如果还有特征未绘制
feature = features[i]
if feature in X_test.columns and feature in shap_values_df_1.columns:
ax = axes[i]
ax.scatter(X_test[feature], shap_values_df_1[feature], s = 10, color = "#6A9ACE")
ax.axhline(y = 0, color = 'red', linestyle = '-.', linewidth = 1) # 添加横线
ax.set_xlabel(feature, fontsize = 10)
ax.set_ylabel(f'SHAP value for\n{feature}', fontsize = 10)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
else:
# 如果特征不存在,隐藏对应的子图
axes[i].axis('off')
else:
# 如果超过了特征数量,关闭剩余的子图
axes[i].axis('off')

# 调整布局
plt.tight_layout()
plt.savefig(f"{wkdir}/SHAP value for features.png", bbox_inches = 'tight')
plt.close()

# 以下针对预测正确的类别 0 和类别 1 分别绘制瀑布图[这里就绘制测试集第一个样本 (0) 和第二个样本都预测正确 (1):
if True:

y_pred_et = best_model_et.predict(X_test)
explainer = shap.TreeExplainer(best_model_et)

# 计算shap值为Explanation格式
shap_values_Explanation = explainer(X_test)

# 提取类别 0 的 SHAP 值
shap_values_class_0 = shap_values_Explanation[:, :, 0]

# 提取类别 1 的 SHAP 值
shap_values_class_1 = shap_values_Explanation[:, :, 1]

# 预测类别 0 样本绘制
if True:

plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 1 个样本 0 类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_0[0], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 1 个样本 0 类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.close()

plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 1 个样本 1 类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_1[0], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 1 个样本 1 类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.show()

# 预测类别1样本绘制
if True:
plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 2 个样本0类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_0[1], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 2 个样本0类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.show()

plt.figure(figsize = (10, 5), dpi = 1200)

# 绘制第 2 个样本1类别的 SHAP 瀑布图
shap.plots.waterfall(shap_values_class_1[1], show = False, max_display = 8)
plt.savefig(f"{wkdir}/绘制第 2 个样本1类别的 SHAP 瀑布图.png", bbox_inches = 'tight')
plt.tight_layout()
plt.show()
  • Titre: 从特征筛选到多模型机器学习对比与优化 -- 基于医学数据的分类分析完整案例/多模型机器学习对比与优化
  • Auteur: Xing Abao
  • Créé à : 2025-10-27 12:30:38
  • Mis à jour à : 2025-10-28 22:40:55
  • Lien: https://bioinformatics.vip/2025/10/27/sad41d8cd/251027_Medical_Classification_Model_Comparison/
  • Licence: Cette œuvre est sous licence CC BY-NC-SA 4.0.
Commentaires