A TOPSIS-Weighted Multi-Model Fusion Method and Its Application to Improving Predictive Performance

Xing Abao

The core goal of this example is to build a high-performance classification model. The workflow starts from data preparation, proceeds through the training and tuning of several independent models, and finally fuses those models into a stronger ensemble using an approach based on multi-criteria decision analysis.

Finally, a custom EnsembleModel class performs the weighted model fusion. It receives all base models together with the weights computed by TOPSIS. Its core predict_proba method obtains each base model's predicted probabilities on new data and takes a weighted average of those probabilities using the TOPSIS weights, yielding a combined predicted probability. The final class label is decided from this weighted probability (typically with a 0.5 threshold). The ensemble is evaluated with the same rigorous procedure and compared directly against all base models. The script plots every model's ROC curve on the test set, highlighting the ensemble's curve in a distinct color, which makes the performance gain of the TOPSIS-weighted ensemble over each individual base model easy to see.
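As a minimal illustration of the fusion rule itself (toy numbers only, not taken from any model in this article), the ensemble probability is just a convex combination of the base models' probabilities:

import numpy as np

# Hypothetical class-1 probabilities from three base models for one sample
p = np.array([0.70, 0.55, 0.80])
# Hypothetical TOPSIS-derived weights (they must sum to 1)
w = np.array([0.5, 0.2, 0.3])

p_ensemble = np.dot(w, p)           # weighted average probability, 0.70 here
label = int(p_ensemble >= 0.5)      # threshold at 0.5 -> class 1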

The final artifacts of the whole workflow, including every tuned base model and the final ensemble, can be persisted with the joblib library for direct reuse later.

Simulated Example

Loading Modules

import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, auc as auc_func, f1_score, accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.base import BaseEstimator, ClassifierMixin

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False
import warnings
warnings.filterwarnings("ignore")

Loading the Data

import os
wkdir = 'C:/Users/Administrator/Desktop'
os.chdir(wkdir)

path = 'Z:/TData/big-data/sad41d8cd/251026_TOPSIS_Weighted_Model_Fusion.xlsx'
df = pd.read_excel(path)
df.head()
Out[40]:
SystolicBP Hb ... AtrialFibrillationType Electrical_cardioversion
0 132.0 152.0 ... 1 1
1 97.0 132.0 ... 1 0
2 126.0 141.0 ... 1 1
3 112.0 105.0 ... 0 0
4 113.0 142.0 ... 1 1

[5 rows x 23 columns]
df.columns
Out[41]:
Index(['SystolicBP', 'Hb', 'LeftAtrialDiam', 'DiastolicBP', 'Age', 'BMI',
'SurgeryTime', 'Cr', 'CHA2DS2VASC', 'FT4', 'AFCourse', 'TSH',
'NtproBNP', 'ACEI_ARB_ARNI', 'Statin', 'SmokingHistory', 'HTN', 'B',
'Rivaroxaban', 'Sex', 'Dabigatran', 'AtrialFibrillationType',
'Electrical_cardioversion'],
dtype='object')

Splitting the Data

# Separate the features and the target variable
X = df.drop(['Electrical_cardioversion'], axis = 1)
y = df['Electrical_cardioversion']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size = 0.3,
    random_state = 42,
    stratify = df['Electrical_cardioversion']
)
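A quick sanity check (optional, not part of the original script) confirms that `stratify` preserved the class ratio in both splits:

print(y_train.value_counts(normalize = True))  # class proportions in the training set
print(y_test.value_counts(normalize = True))   # should be nearly identical in the test set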

Oversampling the Data

# Use the `SMOTE` algorithm to oversample the training data
# `sampling_strategy = 1` upsamples the minority class to the same size as the majority class, i.e., a balanced sample
# `k_neighbors = 20` means 20 nearest neighbors are used when generating each synthetic sample
smote = SMOTE(sampling_strategy = 1, k_neighbors = 20, random_state = 42)

# Apply `SMOTE` to the training set to produce a new, balanced dataset
X_SMOTE_train, y_SMOTE_train = smote.fit_resample(X_train, y_train)
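To verify the effect of `sampling_strategy = 1`, one can compare the class counts before and after resampling (an optional check, not in the original workflow):

print(y_train.value_counts())        # imbalanced counts before SMOTE
print(y_SMOTE_train.value_counts())  # equal counts per class after SMOTE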
X_SMOTE_train
Out[46]:
SystolicBP Hb ... Dabigatran AtrialFibrillationType
0 116.000000 149.000000 ... 1 0
1 124.000000 140.000000 ... 1 0
2 117.000000 137.000000 ... 1 1
3 140.000000 100.000000 ... 0 0
4 110.000000 129.000000 ... 0 0
.. ... ... ... ... ...
239 117.717327 161.630396 ... 1 1
240 97.276607 123.184404 ... 0 0
241 121.418391 151.831139 ... 0 1
242 134.716083 117.602548 ... 1 1
243 137.162555 125.486900 ... 0 1

[244 rows x 22 columns]

Defining the Evaluation Function

def evaluate_model_performance(model, X_resampled, y_resampled, n_iterations = 1000, random_state = 42):
    """
    Evaluate a classifier with bootstrap sampling, reporting the median and 95% confidence interval of several metrics.

    Parameters:
    model : a fitted classifier (e.g., best_dt_model)
    X_resampled : input feature data (already resampled)
    y_resampled : target labels (already resampled)
    n_iterations : number of bootstrap iterations, default 1000
    random_state : random seed so the sampling is reproducible

    Returns:
    A dict with the median and 95% confidence interval of each metric
    """
    # Predict with the model
    y_pred = model.predict(X_resampled)
    y_probas = model.predict_proba(X_resampled)

    # Lists to store the metrics
    auc_scores = []
    f1_scores = []
    acc_scores = []
    pre_scores = []
    sen_scores = []
    spe_scores = []
    fprs = []
    tprs = []

    # Bootstrap sampling
    np.random.seed(random_state)  # fix the seed for reproducibility
    for i in range(n_iterations):
        sample_indices = np.random.choice(len(y_resampled), size = len(y_resampled), replace = True)
        y_true_sample = y_resampled[sample_indices]
        y_pred_sample = y_pred[sample_indices]
        y_probas_sample = y_probas[sample_indices]

        # Compute the AUC
        fpr, tpr, _ = roc_curve(y_true_sample, y_probas_sample[:, 1])
        auc_score = auc_func(fpr, tpr)
        auc_scores.append(auc_score)
        fprs.append(fpr)
        tprs.append(tpr)

        # Compute the remaining metrics
        f1_scores.append(f1_score(y_true_sample, y_pred_sample))
        acc_scores.append(accuracy_score(y_true_sample, y_pred_sample))
        pre_scores.append(precision_score(y_true_sample, y_pred_sample))
        sen_scores.append(recall_score(y_true_sample, y_pred_sample))
        tn, fp, fn, tp = confusion_matrix(y_true_sample, y_pred_sample).ravel()
        spe_scores.append(tn / (tn + fp))

    # Median and 95% confidence interval of a list of scores
    def compute_median_and_ci(scores):
        median = np.median(scores)
        lower = np.percentile(scores, 2.5)
        upper = np.percentile(scores, 97.5)
        return median, lower, upper

    # Aggregate each metric
    auc_median, auc_lower, auc_upper = compute_median_and_ci(auc_scores)
    f1_median, f1_lower, f1_upper = compute_median_and_ci(f1_scores)
    acc_median, acc_lower, acc_upper = compute_median_and_ci(acc_scores)
    pre_median, pre_lower, pre_upper = compute_median_and_ci(pre_scores)
    sen_median, sen_lower, sen_upper = compute_median_and_ci(sen_scores)
    spe_median, spe_lower, spe_upper = compute_median_and_ci(spe_scores)

    # Collect and return the results
    result = {
        'AUC': (auc_median, auc_lower, auc_upper),
        'F1': (f1_median, f1_lower, f1_upper),
        'ACC': (acc_median, acc_lower, acc_upper),
        'PRE': (pre_median, pre_lower, pre_upper),
        'SEN': (sen_median, sen_lower, sen_upper),
        'SPE': (spe_median, spe_lower, spe_upper)
    }

    # Print the results
    print(f"AUC: Median = {auc_median}, 95% CI = [{auc_lower}, {auc_upper}]")
    print(f"F1: Median = {f1_median}, 95% CI = [{f1_lower}, {f1_upper}]")
    print(f"ACC: Median = {acc_median}, 95% CI = [{acc_lower}, {acc_upper}]")
    print(f"PRE: Median = {pre_median}, 95% CI = [{pre_lower}, {pre_upper}]")
    print(f"SEN: Median = {sen_median}, 95% CI = [{sen_lower}, {sen_upper}]")
    print(f"SPE: Median = {spe_median}, 95% CI = [{spe_lower}, {spe_upper}]")
    print()

    return result

# Positive-ization of the raw indicators
# All classification metrics used here are benefit-type (larger is better), so this is not needed,
# but any indicator that is not benefit-type would first have to be positive-ized,
# and indicators on different scales would also need to be standardized to remove the effect of units
def func_1(x):
    # cost-type indicator (smaller is better)
    return(x.max()-x)

def func_2(x,x_best):
    M = (abs(x-x_best)).max()
    # middle-type indicator (closer to x_best is better)
    return(1-abs(x-x_best)/M)

def func_3(x,a,b):
    M = max(a-min(x), max(x)-b)
    y = []
    for i in x:
        if i < a:
            y.append(1-(a-i)/M)
        elif i > b:
            y.append(1-(i-b)/M)
        else:
            y.append(1)
    # interval-type indicator (values inside [a, b] are best)
    return(y)
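Although these helpers are not needed for the benefit-type metrics used below, a small sketch shows what each transformation does on toy data (the values are illustrative only):

x = pd.Series([2.0, 4.0, 8.0])
print(func_1(x))             # cost-type: larger raw values map to smaller scores -> [6.0, 4.0, 0.0]
print(func_2(x, 4.0))        # middle-type: values closest to 4.0 score highest -> [0.5, 1.0, 0.0]
print(func_3(x, 3.0, 6.0))   # interval-type: values inside [3.0, 6.0] score 1 -> [0.5, 1, 0.0]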

def entropyWeight(data):
    """
    Determine indicator weights with the entropy weight method

    :param data: a DataFrame whose rows are the alternatives being evaluated and whose columns are the indicators
    """
    data = np.array(data)
    # Normalize
    P = data / data.sum(axis = 0)

    # Compute the entropy values
    E = np.nansum(-P * np.log(P) / np.log(len(data)), axis = 0)

    # Compute the weight coefficients
    return (1 - E) / (1 - E).sum()

def topsis(data, weight = None):

    # Compute the positive and negative ideal solutions
    # (column names are kept in Chinese for downstream compatibility: 正理想解 = distance to the positive ideal,
    # 负理想解 = distance to the negative ideal, 综合得分指数 = composite score, 百分比占比 = percentage share, 排序 = rank)
    Z = pd.DataFrame([data.max(), data.min()], index = ['正理想解', '负理想解'])

    # Determine the weights: fall back to the entropy weight method if none are provided
    weight = entropyWeight(data) if weight is None else np.array(weight)
    Result = data.copy()

    Result['正理想解'] = np.sqrt(((data - Z.loc['正理想解']) ** 2 * weight).sum(axis = 1))
    Result['负理想解'] = np.sqrt(((data - Z.loc['负理想解']) ** 2 * weight).sum(axis = 1))

    Result['综合得分指数'] = Result['负理想解'] / (Result['负理想解'] + Result['正理想解'])

    # Add a percentage-share column (normalize the composite scores so they sum to 1)
    Result['百分比占比'] = (Result['综合得分指数'] / Result['综合得分指数'].sum())
    Result['排序'] = Result.rank(ascending = False)['综合得分指数']

    return Result, Z, weight
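A minimal usage sketch on a toy decision matrix (two alternatives, two benefit-type criteria; the numbers are made up) shows the pieces the function returns:

toy = pd.DataFrame({'AUC': [0.80, 0.70], 'ACC': [0.75, 0.85]}, index = ['A', 'B'])
res, ideal, w = topsis(toy)                # weights fall back to the entropy method
# res, ideal, w = topsis(toy, [0.5, 0.5])  # or pass explicit weights
print(res[['综合得分指数', '百分比占比', '排序']])  # composite score, share, rank per alternative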

Decision Tree, DT

# Initialize the decision tree model
dt_model = DecisionTreeClassifier(random_state = 42)

# Parameter grid for the grid search
param_grid = {
    'criterion': ['gini', 'entropy'],        # split criterion
    'max_depth': [None, 10, 20, 30, 40],     # maximum tree depth
    'min_samples_split': [2, 5, 10],         # minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],           # minimum samples required at a leaf node
    'max_features': [None, 'sqrt', 'log2']   # maximum number of features considered per split
}

# Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
grid_search = GridSearchCV(estimator = dt_model, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1)

# Run the grid search
grid_search.fit(X_SMOTE_train, y_SMOTE_train)

# Print the best parameters
print(f"Best parameters: {grid_search.best_params_}\n")

# Build the final decision tree model with the best parameters
best_dt_model = grid_search.best_estimator_

# best_dt_model is the decision tree tuned by the grid search
dt_train = evaluate_model_performance(best_dt_model, X_SMOTE_train, y_SMOTE_train)
dt_test = evaluate_model_performance(best_dt_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))
Fitting 5 folds for each of 270 candidates, totalling 1350 fits
Best parameters: {'criterion': 'gini', 'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 5}

AUC: Median = 0.999327007201023, 95% CI = [0.9976797394028131, 1.0]
F1: Median = 0.9834022038567494, 95% CI = [0.9657729120230759, 1.0]
ACC: Median = 0.9836065573770492, 95% CI = [0.9631147540983607, 1.0]
PRE: Median = 0.9836731973877115, 95% CI = [0.956140350877193, 1.0]
SEN: Median = 0.9838709677419355, 95% CI = [0.9558591679915209, 1.0]
SPE: Median = 0.9838709677419355, 95% CI = [0.9536983204134367, 1.0]

AUC: Median = 0.7476869678689086, 95% CI = [0.6476586740611862, 0.8457761299435028]
F1: Median = 0.5777777777777777, 95% CI = [0.4186046511627907, 0.7301587301587301]
ACC: Median = 0.7012987012987013, 95% CI = [0.5974025974025974, 0.7922077922077922]
PRE: Median = 0.5333333333333333, 95% CI = [0.35290143084260733, 0.7200555555555554]
SEN: Median = 0.6428571428571429, 95% CI = [0.4482758620689655, 0.8182629870129869]
SPE: Median = 0.7317073170731707, 95% CI = [0.6086956521739131, 0.8478260869565217]

Random Forest, RF

# Initialize the random forest model
rf_model = RandomForestClassifier(random_state = 42)

# Parameter grid for the grid search
param_grid = {
    'n_estimators': [50, 100, 200],          # number of trees
    'criterion': ['gini', 'entropy'],        # split criterion
    'max_depth': [None, 10, 20, 30, 40],     # maximum tree depth
    'min_samples_split': [2, 5, 10],         # minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],           # minimum samples required at a leaf node
    'max_features': [None, 'sqrt', 'log2']   # maximum number of features considered per split
}

# Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv = 5, n_jobs = -1, verbose = 1)

# Run the grid search
grid_search.fit(X_SMOTE_train, y_SMOTE_train)

# Print the best parameters
print(f"Best parameters: {grid_search.best_params_}\n")

# Build the final random forest model with the best parameters
best_rf_model = grid_search.best_estimator_

# Evaluate the model with `evaluate_model_performance`
rf_train = evaluate_model_performance(best_rf_model, X_SMOTE_train, y_SMOTE_train)
rf_test = evaluate_model_performance(best_rf_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))
Fitting 5 folds for each of 810 candidates, totalling 4050 fits
Best parameters: {'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}

AUC: Median = 1.0, 95% CI = [0.9999999999999999, 1.0]
F1: Median = 1.0, 95% CI = [1.0, 1.0]
ACC: Median = 1.0, 95% CI = [1.0, 1.0]
PRE: Median = 1.0, 95% CI = [1.0, 1.0]
SEN: Median = 1.0, 95% CI = [1.0, 1.0]
SPE: Median = 1.0, 95% CI = [1.0, 1.0]

AUC: Median = 0.856711915535445, 95% CI = [0.749617812170461, 0.9343750444662625]
F1: Median = 0.7058823529411765, 95% CI = [0.5365635888501743, 0.8275862068965517]
ACC: Median = 0.7922077922077922, 95% CI = [0.7012987012987013, 0.8701298701298701]
PRE: Median = 0.65625, 95% CI = [0.4642857142857143, 0.8214285714285714]
SEN: Median = 0.7647058823529411, 95% CI = [0.5789473684210527, 0.9166666666666666]
SPE: Median = 0.8113207547169812, 95% CI = [0.6956078083407276, 0.9090909090909091]

XGBoost

# Initialize the XGBoost model
xgb_model = XGBClassifier(random_state = 42, use_label_encoder = False, eval_metric = 'mlogloss')

# Parameter grid for the grid search
param_grid = {
    'n_estimators': [50, 100],        # number of trees
    'learning_rate': [0.01, 0.2],     # learning rate
    'max_depth': [3, 5],              # maximum tree depth
    'min_child_weight': [1, 5],       # minimum child weight
    'subsample': [0.8, 1.0],          # fraction of samples used to train each tree
    'colsample_bytree': [0.8, 1.0],   # fraction of features used per tree
    'gamma': [0, 0.2]                 # minimum loss reduction required to make a split
}

# Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
grid_search = GridSearchCV(estimator = xgb_model, param_grid=param_grid, cv = 5, n_jobs = -1, verbose = 1)

# Run the grid search
grid_search.fit(X_SMOTE_train, y_SMOTE_train)

# Print the best parameters
print(f"Best parameters: {grid_search.best_params_}\n")

# Build the final XGBoost model with the best parameters
best_xgb_model = grid_search.best_estimator_

# Evaluate the model with evaluate_model_performance
xgb_train = evaluate_model_performance(best_xgb_model, X_SMOTE_train, y_SMOTE_train)
xgb_test = evaluate_model_performance(best_xgb_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))
Fitting 5 folds for each of 128 candidates, totalling 640 fits
Best parameters: {'colsample_bytree': 0.8, 'gamma': 0, 'learning_rate': 0.2, 'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 100, 'subsample': 1.0}

AUC: Median = 1.0, 95% CI = [0.9999999999999999, 1.0]
F1: Median = 1.0, 95% CI = [1.0, 1.0]
ACC: Median = 1.0, 95% CI = [1.0, 1.0]
PRE: Median = 1.0, 95% CI = [1.0, 1.0]
SEN: Median = 1.0, 95% CI = [1.0, 1.0]
SPE: Median = 1.0, 95% CI = [1.0, 1.0]

AUC: Median = 0.7941685652134739, 95% CI = [0.6747996480643539, 0.8836640417901137]
F1: Median = 0.5777777777777777, 95% CI = [0.4, 0.7272727272727273]
ACC: Median = 0.7142857142857143, 95% CI = [0.6233766233766234, 0.8181818181818182]
PRE: Median = 0.56, 95% CI = [0.36, 0.75]
SEN: Median = 0.6, 95% CI = [0.4230769230769231, 0.8]
SPE: Median = 0.7755102040816326, 95% CI = [0.654527972027972, 0.8868448637316562]

LGBM

# Initialize the `LightGBM` model
lgbm_model = LGBMClassifier(random_state = 42, verbose = -1)

# Parameter grid for the grid search
param_grid = {
    'n_estimators': [50, 200],      # number of trees
    'learning_rate': [0.01, 0.2],   # learning rate
    'max_depth': [3, 7],            # maximum tree depth
    'num_leaves': [31, 100]         # maximum number of leaves per tree
}

# Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
grid_search = GridSearchCV(estimator = lgbm_model, param_grid=param_grid, cv = 5, n_jobs = -1, verbose = 1)

# Run the grid search
grid_search.fit(X_SMOTE_train, y_SMOTE_train)

# Print the best parameters
print(f"Best parameters: {grid_search.best_params_}\n")

# Build the final LightGBM model with the best parameters
best_lgbm_model = grid_search.best_estimator_

# Evaluate the model with `evaluate_model_performance`
lgbm_train = evaluate_model_performance(best_lgbm_model, X_SMOTE_train, y_SMOTE_train)
lgbm_test = evaluate_model_performance(best_lgbm_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))
Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best parameters: {'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 200, 'num_leaves': 31}

AUC: Median = 1.0, 95% CI = [0.9999999999999999, 1.0]
F1: Median = 1.0, 95% CI = [1.0, 1.0]
ACC: Median = 1.0, 95% CI = [1.0, 1.0]
PRE: Median = 1.0, 95% CI = [1.0, 1.0]
SEN: Median = 1.0, 95% CI = [1.0, 1.0]
SPE: Median = 1.0, 95% CI = [1.0, 1.0]

AUC: Median = 0.7959476894205958, 95% CI = [0.6753731778425656, 0.8892340425531914]
F1: Median = 0.5581395348837209, 95% CI = [0.35294117647058826, 0.7307926829268292]
ACC: Median = 0.7337662337662338, 95% CI = [0.6233766233766234, 0.8311688311688312]
PRE: Median = 0.5909090909090909, 95% CI = [0.3684210526315789, 0.8]
SEN: Median = 0.525062656641604, 95% CI = [0.3157894736842105, 0.7368993135011441]
SPE: Median = 0.8301886792452831, 95% CI = [0.7221666666666666, 0.925959595959596]

SVM

# Initialize the `SVM` model
svm_model = SVC(random_state = 42, probability = True)

# Parameter grid for the grid search
param_grid = {
    'C': [0.1, 1, 10],           # regularization parameter controlling model complexity
    'kernel': ['poly', 'rbf'],   # kernel type
    'gamma': ['scale', 0.1, 1]   # kernel coefficient
}

# Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
grid_search = GridSearchCV(estimator = svm_model, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1)

# Run the grid search
grid_search.fit(X_SMOTE_train, y_SMOTE_train)

# Print the best parameters
print(f"Best parameters: {grid_search.best_params_}\n")

# Build the final SVM model with the best parameters
best_svm_model = grid_search.best_estimator_

# Evaluate the `SVM` model with `evaluate_model_performance`
svm_train = evaluate_model_performance(best_svm_model, X_SMOTE_train, y_SMOTE_train)
svm_test = evaluate_model_performance(best_svm_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Best parameters: {'C': 0.1, 'gamma': 0.1, 'kernel': 'poly'}

AUC: Median = 0.5554619578021035, 95% CI = [0.5296992258804412, 0.5864952611204405]
F1: Median = 0.9727626459143969, 95% CI = [0.949573055028463, 0.988768191052697]
ACC: Median = 0.9713114754098361, 95% CI = [0.9508196721311475, 0.9877049180327869]
PRE: Median = 0.9609375, 95% CI = [0.9256136672011842, 0.9916683884297521]
SEN: Median = 0.984, 95% CI = [0.9586690771349863, 1.0]
SPE: Median = 0.9603174603174603, 95% CI = [0.9217391304347826, 0.9913811523725318]

AUC: Median = 0.5022222222222222, 95% CI = [0.4488346994535519, 0.54]
F1: Median = 0.5909090909090909, 95% CI = [0.4186046511627907, 0.7391706924315619]
ACC: Median = 0.7142857142857143, 95% CI = [0.6233766233766234, 0.8181818181818182]
PRE: Median = 0.5555555555555556, 95% CI = [0.3748842592592593, 0.7335185185185183]
SEN: Median = 0.64, 95% CI = [0.4443333333333333, 0.8333333333333334]
SPE: Median = 0.7586206896551724, 95% CI = [0.630414653784219, 0.8644312255541069]

NB

# Initialize the `BernoulliNB` model
bnb_model = BernoulliNB()

# Parameter grid for the grid search
param_grid = {
    'alpha': [0.1, 1, 10],            # additive (Laplace/Lidstone) smoothing parameter; guards against zero counts
    'binarize': [0.0, 0.1, 0.5, 1.0]  # threshold for binarizing the features
}

# Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
grid_search = GridSearchCV(estimator=bnb_model, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1)

# Run the grid search
grid_search.fit(X_SMOTE_train, y_SMOTE_train)

# Print the best parameters
print(f"Best parameters: {grid_search.best_params_}\n")

# Build the final `BernoulliNB` model with the best parameters
best_bnb_model = grid_search.best_estimator_

# Evaluate the model with `evaluate_model_performance`
bnb_train = evaluate_model_performance(best_bnb_model, X_SMOTE_train, y_SMOTE_train)
bnb_test = evaluate_model_performance(best_bnb_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters: {'alpha': 0.1, 'binarize': 0.0}

AUC: Median = 0.8213263268070969, 95% CI = [0.7695113278420205, 0.8745045026664511]
F1: Median = 0.7467811158798283, 95% CI = [0.6841906902935732, 0.8016877637130801]
ACC: Median = 0.7540983606557377, 95% CI = [0.7008196721311475, 0.8073770491803278]
PRE: Median = 0.7727272727272727, 95% CI = [0.6964198179271709, 0.8545735707591376]
SEN: Median = 0.7215253029223094, 95% CI = [0.6440375302663438, 0.7966167902542373]
SPE: Median = 0.788135593220339, 95% CI = [0.7153918579626973, 0.8661696049055815]

AUC: Median = 0.8365052708225549, 95% CI = [0.730728536360636, 0.9281056005398111]
F1: Median = 0.7368421052631579, 95% CI = [0.5881127450980392, 0.8533333333333334]
ACC: Median = 0.8051948051948052, 95% CI = [0.7142857142857143, 0.8961038961038961]
PRE: Median = 0.6551724137931034, 95% CI = [0.4996621621621622, 0.8181818181818182]
SEN: Median = 0.8509259259259259, 95% CI = [0.6799642857142858, 0.9629629629629629]
SPE: Median = 0.7884615384615384, 95% CI = [0.673469387755102, 0.8965955581531267]

GBDT

# Initialize the `GradientBoostingClassifier` model
gbdt_model = GradientBoostingClassifier(random_state = 42)

# Parameter grid for the grid search
param_grid = {
    'n_estimators': [50, 100, 200],     # number of trees
    'learning_rate': [0.01, 0.1, 0.2],  # learning rate
    'max_depth': [3, 5, 7],             # maximum depth of each tree
    'subsample': [0.8, 1.0],            # fraction of samples used to train each tree
    'min_samples_split': [2, 5, 10],    # minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]       # minimum samples required at a leaf node
}

# Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
grid_search = GridSearchCV(estimator=gbdt_model, param_grid=param_grid, cv = 5, n_jobs = -1, verbose = 1)

# Run the grid search
grid_search.fit(X_SMOTE_train, y_SMOTE_train)

# Print the best parameters
print(f"Best parameters: {grid_search.best_params_}\n")

# Build the final `GradientBoostingClassifier` model with the best parameters
best_gbdt_model = grid_search.best_estimator_

# Evaluate the model with `evaluate_model_performance`
gbdt_train = evaluate_model_performance(best_gbdt_model, X_SMOTE_train, y_SMOTE_train)
gbdt_test = evaluate_model_performance(best_gbdt_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))
Fitting 5 folds for each of 486 candidates, totalling 2430 fits
Best parameters: {'learning_rate': 0.2, 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200, 'subsample': 0.8}

AUC: Median = 1.0, 95% CI = [0.9999999999999999, 1.0]
F1: Median = 1.0, 95% CI = [1.0, 1.0]
ACC: Median = 1.0, 95% CI = [1.0, 1.0]
PRE: Median = 1.0, 95% CI = [1.0, 1.0]
SEN: Median = 1.0, 95% CI = [1.0, 1.0]
SPE: Median = 1.0, 95% CI = [1.0, 1.0]

AUC: Median = 0.8091896736398014, 95% CI = [0.6930041621911922, 0.9001683170481125]
F1: Median = 0.5614035087719298, 95% CI = [0.37492732558139535, 0.7170566037735848]
ACC: Median = 0.7142857142857143, 95% CI = [0.6233766233766234, 0.8181818181818182]
PRE: Median = 0.5625, 95% CI = [0.35714285714285715, 0.76]
SEN: Median = 0.569047619047619, 95% CI = [0.3684210526315789, 0.76]
SPE: Median = 0.7924528301886793, 95% CI = [0.673469387755102, 0.8958333333333334]

TOPSIS

# Collection of models
models = {
    'DecisionTree': best_dt_model,
    'RandomForest': best_rf_model,
    'XGBoost': best_xgb_model,
    'LightGBM': best_lgbm_model,
    'SVM': best_svm_model,
    'BernoulliNB': best_bnb_model,
    'GradientBoosting': best_gbdt_model
}

# Containers for the results
metrics = ['AUC', 'F1', 'ACC', 'PRE', 'SEN', 'SPE']
results = {metric: [] for metric in metrics}

# Compute the evaluation metrics for each model
for model_name, model in models.items():
    y_pred = model.predict(X_test)
    # Predicted probabilities, used for the AUC
    y_prob = model.predict_proba(X_test)[:, 1]

    # Compute the individual metrics
    auc = roc_auc_score(y_test, y_prob)
    f1 = f1_score(y_test, y_pred)
    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    # Sensitivity (recall) and specificity
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    sensitivity = recall  # identical to recall
    specificity = tn / (tn + fp)

    # Append the results to each metric's list
    results['AUC'].append(auc)
    results['F1'].append(f1)
    results['ACC'].append(acc)
    results['PRE'].append(precision)
    results['SEN'].append(sensitivity)
    results['SPE'].append(specificity)

results_df = pd.DataFrame(results, index = models.keys())

# Fixed equal weights; other weighting schemes could of course be used instead
weight = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
Result, Z, weight = topsis(results_df, weight)
results_df
Out[70]:
AUC F1 ACC PRE SEN SPE
DecisionTree 0.745385 0.581818 0.701299 0.533333 0.64 0.730769
RandomForest 0.849615 0.703704 0.792208 0.655172 0.76 0.807692
XGBoost 0.786154 0.576923 0.714286 0.555556 0.60 0.769231
LightGBM 0.790769 0.553191 0.727273 0.590909 0.52 0.826923
SVM 0.500000 0.592593 0.714286 0.551724 0.64 0.750000
BernoulliNB 0.833846 0.736842 0.805195 0.656250 0.84 0.788462
GradientBoosting 0.802308 0.560000 0.714286 0.560000 0.56 0.788462

Result
Out[77]:
AUC F1 ACC ... 综合得分指数 百分比占比 排序
DecisionTree 0.745385 0.581818 0.701299 ... 0.452934 0.118222 5.0
RandomForest 0.849615 0.703704 0.792208 ... 0.842936 0.220017 2.0
XGBoost 0.786154 0.576923 0.714286 ... 0.477515 0.124638 3.0
LightGBM 0.790769 0.553191 0.727273 ... 0.446772 0.116613 6.0
SVM 0.500000 0.592593 0.714286 ... 0.221394 0.057787 7.0
BernoulliNB 0.833846 0.736842 0.805195 ... 0.926777 0.241901 1.0
GradientBoosting 0.802308 0.560000 0.714286 ... 0.462900 0.120823 4.0

[7 rows x 11 columns]

Defining the Ensemble Model

class EnsembleModel(BaseEstimator, ClassifierMixin):

    def __init__(self, models, weights):
        """
        Initialize the ensemble model.

        :param models: dict of base learners
        :param weights: array of model weights (the percentage shares computed by TOPSIS)
        """
        self.models = models
        self.weights = np.array(weights)  # weights (percentage shares)

    def fit(self, X, y):
        """
        Fit method of the ensemble: every base learner is trained.
        """
        for model in self.models.values():
            model.fit(X, y)  # call fit on each base model
        return self

    def predict_proba(self, X):
        """
        Return the weighted predicted probabilities. For a binary problem, both class probabilities are returned.

        :param X: test data
        :return: weighted predicted probabilities for both classes, shape (n_samples, 2)
        """
        probas = np.zeros((X.shape[0], 2))  # one row per sample holding the class-0 and class-1 probabilities

        # Accumulate each model's predicted probabilities
        for i, model in enumerate(self.models.values()):
            model_proba = model.predict_proba(X)  # this model's predicted probabilities
            probas[:, 0] += model_proba[:, 0] * self.weights[i]  # weighted class-0 probability
            probas[:, 1] += model_proba[:, 1] * self.weights[i]  # weighted class-1 probability

        return probas  # weighted probabilities of both classes

    def predict(self, X):
        """
        Return the predicted labels, decided from the weighted probabilities.

        :param X: test data
        :return: final predicted class (0 or 1)
        """
        weighted_proba = self.predict_proba(X)
        return (weighted_proba[:, 1] >= 0.5).astype(int)  # class 1 if the weighted probability >= 0.5, else class 0

# Initialize the ensemble model
ensemble_model = EnsembleModel(models = models, weights = Result['百分比占比'])  # TOPSIS weights

# Train the ensemble model
ensemble_model.fit(X_SMOTE_train, y_SMOTE_train)

# Evaluate the ensemble
ensemble_train = evaluate_model_performance(ensemble_model, X_SMOTE_train, y_SMOTE_train)
ensemble_test = evaluate_model_performance(ensemble_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))

# Predicted probabilities on the test set (weighted probability of class 1)
probs = ensemble_model.predict_proba(X_test)

# Predicted labels on the test set (class 0 or class 1)
predictions = ensemble_model.predict(X_test)
Out[79]: 
EnsembleModel(models={'BernoulliNB': BernoulliNB(alpha=0.1),
'DecisionTree': DecisionTreeClassifier(min_samples_split=5,
random_state=42),
'GradientBoosting': GradientBoostingClassifier(learning_rate=0.2,
max_depth=5,
min_samples_split=5,
n_estimators=200,
random_state=42,
subsample=0.8),
'LightGBM': LGBMClassifier(learning_rate=0.2, max_depth=3,
n_estimators=200...
learning_rate=0.2, max_bin=None,
max_cat_threshold=None,
max_cat_to_onehot=None,
max_delta_step=None, max_depth=5,
max_leaves=None,
min_child_weight=1, missing=nan,
monotone_constraints=None,
multi_strategy=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None,
random_state=42, ...)},
weights=array([0.11822175, 0.22001717, 0.12463761, 0.11661339, 0.05778658,
0.24190072, 0.12082277]))

AUC: Median = 1.0, 95% CI = [0.9999999999999999, 1.0]
F1: Median = 1.0, 95% CI = [1.0, 1.0]
ACC: Median = 1.0, 95% CI = [1.0, 1.0]
PRE: Median = 1.0, 95% CI = [1.0, 1.0]
SEN: Median = 1.0, 95% CI = [1.0, 1.0]
SPE: Median = 1.0, 95% CI = [1.0, 1.0]

AUC: Median = 0.845703487744304, 95% CI = [0.7403440715129027, 0.9246229260935144]
F1: Median = 0.6262254901960784, 95% CI = [0.42857142857142855, 0.7693548387096774]
ACC: Median = 0.7662337662337663, 95% CI = [0.6753246753246753, 0.8571428571428571]
PRE: Median = 0.6538461538461539, 95% CI = [0.45, 0.85]
SEN: Median = 0.6, 95% CI = [0.3997826086956522, 0.7916666666666666]
SPE: Median = 0.8495283018867925, 95% CI = [0.7407407407407407, 0.9411764705882353]
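Because `Result['百分比占比']` sums to 1, the fused probabilities should still form a valid distribution. A quick sanity check (not part of the original script) confirms that each row of the weighted probabilities sums to 1:

probs = ensemble_model.predict_proba(X_test)
print(np.allclose(probs.sum(axis = 1), 1.0))  # True: each row is a valid probability pair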

Plotting the ROC Curves

# DecisionTree model
fpr_dt, tpr_dt, _ = roc_curve(y_test, best_dt_model.predict_proba(X_test)[:, 1])
auc_dt = roc_auc_score(y_test, best_dt_model.predict_proba(X_test)[:, 1])

# RandomForest model
fpr_rf, tpr_rf, _ = roc_curve(y_test, best_rf_model.predict_proba(X_test)[:, 1])
auc_rf = roc_auc_score(y_test, best_rf_model.predict_proba(X_test)[:, 1])

# XGBoost model
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, best_xgb_model.predict_proba(X_test)[:, 1])
auc_xgb = roc_auc_score(y_test, best_xgb_model.predict_proba(X_test)[:, 1])

# LightGBM model
fpr_lgbm, tpr_lgbm, _ = roc_curve(y_test, best_lgbm_model.predict_proba(X_test)[:, 1])
auc_lgbm = roc_auc_score(y_test, best_lgbm_model.predict_proba(X_test)[:, 1])

# SVM model
fpr_svm, tpr_svm, _ = roc_curve(y_test, best_svm_model.predict_proba(X_test)[:, 1])
auc_svm = roc_auc_score(y_test, best_svm_model.predict_proba(X_test)[:, 1])

# BernoulliNB model
fpr_bnb, tpr_bnb, _ = roc_curve(y_test, best_bnb_model.predict_proba(X_test)[:, 1])
auc_bnb = roc_auc_score(y_test, best_bnb_model.predict_proba(X_test)[:, 1])

# GradientBoosting model
fpr_gbdt, tpr_gbdt, _ = roc_curve(y_test, best_gbdt_model.predict_proba(X_test)[:, 1])
auc_gbdt = roc_auc_score(y_test, best_gbdt_model.predict_proba(X_test)[:, 1])

# Ensemble model
fpr_ensemble, tpr_ensemble, _ = roc_curve(y_test, ensemble_model.predict_proba(X_test)[:, 1])
auc_ensemble = roc_auc_score(y_test, ensemble_model.predict_proba(X_test)[:, 1])

# Plot the ROC curves
plt.figure(figsize = (8, 7))

# ROC curve of each base model (gray, dashed)
plt.plot(fpr_dt, tpr_dt, color = 'gray', linestyle = '--', alpha = 0.6, label = f'DT (AUC = {auc_dt:.4f})')
plt.plot(fpr_rf, tpr_rf, color = 'gray', linestyle = '--', alpha = 0.6, label = f'RF (AUC = {auc_rf:.4f})')
plt.plot(fpr_xgb, tpr_xgb, color = 'gray', linestyle = '--', alpha = 0.6, label = f'XGB (AUC = {auc_xgb:.4f})')
plt.plot(fpr_lgbm, tpr_lgbm, color = 'gray', linestyle = '--', alpha = 0.6, label = f'LGBM (AUC = {auc_lgbm:.4f})')
plt.plot(fpr_svm, tpr_svm, color = 'gray', linestyle = '--', alpha = 0.6, label = f'SVM (AUC = {auc_svm:.4f})')
plt.plot(fpr_bnb, tpr_bnb, color = 'gray', linestyle = '--', alpha = 0.6, label = f'NB (AUC = {auc_bnb:.4f})')
plt.plot(fpr_gbdt, tpr_gbdt, color = 'gray', linestyle = '--', alpha = 0.6, label = f'GBDT (AUC = {auc_gbdt:.4f})')

# ROC curve of the ensemble model (orange)
plt.plot(fpr_ensemble, tpr_ensemble, color = 'orange', label = f'Ensemble (AUC = {auc_ensemble:.4f})')

# Diagonal (a random-guessing model)
plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

# Axis labels and styling
plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16)
plt.legend(loc = "lower right", fontsize = 12, frameon = False)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_linewidth(1.5)
plt.gca().spines['bottom'].set_linewidth(1.5)
plt.grid(False)

# Tighten the layout and save the figure
plt.tight_layout()
plt.savefig("ROC-Ensemble.pdf", format = 'pdf', bbox_inches = 'tight', dpi = 1200)
plt.show()
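The per-model blocks above are intentionally explicit; an equivalent, more compact sketch reuses the existing `models` dict and computes the same curves in a loop:

plt.figure(figsize = (8, 7))
for name, model in models.items():
    prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, prob)
    plt.plot(fpr, tpr, color = 'gray', linestyle = '--', alpha = 0.6,
             label = f'{name} (AUC = {roc_auc_score(y_test, prob):.4f})')
prob_e = ensemble_model.predict_proba(X_test)[:, 1]
fpr_e, tpr_e, _ = roc_curve(y_test, prob_e)
plt.plot(fpr_e, tpr_e, color = 'orange',
         label = f'Ensemble (AUC = {roc_auc_score(y_test, prob_e):.4f})')
plt.legend(loc = 'lower right')
plt.show()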

Saving Each Model

joblib.dump(best_dt_model, 'DecisionTree.pkl')
joblib.dump(best_rf_model, 'RandomForest.pkl')
joblib.dump(best_xgb_model, 'XGBoost.pkl')
joblib.dump(best_lgbm_model, 'LightGBM.pkl')
joblib.dump(best_svm_model, 'SVM.pkl')
joblib.dump(best_bnb_model, 'BernoulliNB.pkl')
joblib.dump(best_gbdt_model, 'GradientBoosting.pkl')
joblib.dump(ensemble_model, 'ensemble_model.pkl')
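Reloading a persisted model later is symmetric (a minimal sketch; the file name matches the one saved above, and `new_X` stands for any hypothetical DataFrame with the same 22 feature columns):

loaded = joblib.load('ensemble_model.pkl')
# new_X = pd.DataFrame(...)                 # hypothetical new samples with the same columns
# print(loaded.predict_proba(new_X)[:, 1])  # fused class-1 probabilities
# print(loaded.predict(new_X))              # fused class labels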

Complete Code

# -*- coding: utf-8 -*-

import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, auc as auc_func, f1_score, accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.base import BaseEstimator, ClassifierMixin

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False
import warnings
warnings.filterwarnings("ignore")

def evaluate_model_performance(model, X_resampled, y_resampled, n_iterations = 1000, random_state = 42):
    """
    Evaluate a classifier with bootstrap sampling, reporting the median and 95% confidence interval of several metrics.

    Parameters:
    model : a fitted classifier (e.g., best_dt_model)
    X_resampled : input feature data (already resampled)
    y_resampled : target labels (already resampled)
    n_iterations : number of bootstrap iterations, default 1000
    random_state : random seed so the sampling is reproducible

    Returns:
    A dict with the median and 95% confidence interval of each metric
    """
    # Predict with the model
    y_pred = model.predict(X_resampled)
    y_probas = model.predict_proba(X_resampled)

    # Lists to store the metrics
    auc_scores = []
    f1_scores = []
    acc_scores = []
    pre_scores = []
    sen_scores = []
    spe_scores = []
    fprs = []
    tprs = []

    # Bootstrap sampling
    np.random.seed(random_state)  # fix the seed for reproducibility
    for i in range(n_iterations):
        sample_indices = np.random.choice(len(y_resampled), size = len(y_resampled), replace = True)
        y_true_sample = y_resampled[sample_indices]
        y_pred_sample = y_pred[sample_indices]
        y_probas_sample = y_probas[sample_indices]

        # Compute the AUC
        fpr, tpr, _ = roc_curve(y_true_sample, y_probas_sample[:, 1])
        auc_score = auc_func(fpr, tpr)
        auc_scores.append(auc_score)
        fprs.append(fpr)
        tprs.append(tpr)

        # Compute the remaining metrics
        f1_scores.append(f1_score(y_true_sample, y_pred_sample))
        acc_scores.append(accuracy_score(y_true_sample, y_pred_sample))
        pre_scores.append(precision_score(y_true_sample, y_pred_sample))
        sen_scores.append(recall_score(y_true_sample, y_pred_sample))
        tn, fp, fn, tp = confusion_matrix(y_true_sample, y_pred_sample).ravel()
        spe_scores.append(tn / (tn + fp))

    # Median and 95% confidence interval of a list of scores
    def compute_median_and_ci(scores):
        median = np.median(scores)
        lower = np.percentile(scores, 2.5)
        upper = np.percentile(scores, 97.5)
        return median, lower, upper

    # Aggregate each metric
    auc_median, auc_lower, auc_upper = compute_median_and_ci(auc_scores)
    f1_median, f1_lower, f1_upper = compute_median_and_ci(f1_scores)
    acc_median, acc_lower, acc_upper = compute_median_and_ci(acc_scores)
    pre_median, pre_lower, pre_upper = compute_median_and_ci(pre_scores)
    sen_median, sen_lower, sen_upper = compute_median_and_ci(sen_scores)
    spe_median, spe_lower, spe_upper = compute_median_and_ci(spe_scores)

    # Collect and return the results
    result = {
        'AUC': (auc_median, auc_lower, auc_upper),
        'F1': (f1_median, f1_lower, f1_upper),
        'ACC': (acc_median, acc_lower, acc_upper),
        'PRE': (pre_median, pre_lower, pre_upper),
        'SEN': (sen_median, sen_lower, sen_upper),
        'SPE': (spe_median, spe_lower, spe_upper)
    }

    # Print the results
    print(f"AUC: Median = {auc_median}, 95% CI = [{auc_lower}, {auc_upper}]")
    print(f"F1: Median = {f1_median}, 95% CI = [{f1_lower}, {f1_upper}]")
    print(f"ACC: Median = {acc_median}, 95% CI = [{acc_lower}, {acc_upper}]")
    print(f"PRE: Median = {pre_median}, 95% CI = [{pre_lower}, {pre_upper}]")
    print(f"SEN: Median = {sen_median}, 95% CI = [{sen_lower}, {sen_upper}]")
    print(f"SPE: Median = {spe_median}, 95% CI = [{spe_lower}, {spe_upper}]")
    print()

    return result

# Positive-ization of the raw indicators
# All classification metrics used here are benefit-type (larger is better), so this is not needed,
# but any indicator that is not benefit-type would first have to be positive-ized,
# and indicators on different scales would also need to be standardized to remove the effect of units
def func_1(x):
    # cost-type indicator (smaller is better)
    return(x.max()-x)

def func_2(x,x_best):
    M = (abs(x-x_best)).max()
    # middle-type indicator (closer to x_best is better)
    return(1-abs(x-x_best)/M)

def func_3(x,a,b):
    M = max(a-min(x), max(x)-b)
    y = []
    for i in x:
        if i < a:
            y.append(1-(a-i)/M)
        elif i > b:
            y.append(1-(i-b)/M)
        else:
            y.append(1)
    # interval-type indicator (values inside [a, b] are best)
    return(y)

def entropyWeight(data):
    """
    Determine indicator weights with the entropy weight method

    :param data: a DataFrame whose rows are the alternatives being evaluated and whose columns are the indicators
    """
    data = np.array(data)
    # Normalize
    P = data / data.sum(axis = 0)

    # Compute the entropy values
    E = np.nansum(-P * np.log(P) / np.log(len(data)), axis = 0)

    # Compute the weight coefficients
    return (1 - E) / (1 - E).sum()

def topsis(data, weight = None):

    # Compute the positive and negative ideal solutions
    # (column names are kept in Chinese for downstream compatibility: 正理想解 = distance to the positive ideal,
    # 负理想解 = distance to the negative ideal, 综合得分指数 = composite score, 百分比占比 = percentage share, 排序 = rank)
    Z = pd.DataFrame([data.max(), data.min()], index = ['正理想解', '负理想解'])

    # Determine the weights: fall back to the entropy weight method if none are provided
    weight = entropyWeight(data) if weight is None else np.array(weight)
    Result = data.copy()

    Result['正理想解'] = np.sqrt(((data - Z.loc['正理想解']) ** 2 * weight).sum(axis = 1))
    Result['负理想解'] = np.sqrt(((data - Z.loc['负理想解']) ** 2 * weight).sum(axis = 1))

    Result['综合得分指数'] = Result['负理想解'] / (Result['负理想解'] + Result['正理想解'])

    # Add a percentage-share column (normalize the composite scores so they sum to 1)
    Result['百分比占比'] = (Result['综合得分指数'] / Result['综合得分指数'].sum())
    Result['排序'] = Result.rank(ascending = False)['综合得分指数']

    return Result, Z, weight


if __name__ == '__main__':

    import os
    wkdir = 'C:/Users/Administrator/Desktop'
    os.chdir(wkdir)

    path = 'Z:/TData/big-data/sad41d8cd/251026_TOPSIS_Weighted_Model_Fusion.xlsx'
    df = pd.read_excel(path)

    if False:

        df.head()
        df.columns

    if True:

        # Separate the features and the target variable
        X = df.drop(['Electrical_cardioversion'], axis = 1)
        y = df['Electrical_cardioversion']

        # Split into training and test sets
        X_train, X_test, y_train, y_test = train_test_split(
            X,
            y,
            test_size = 0.3,
            random_state = 42,
            stratify = df['Electrical_cardioversion']
        )

    # The original article resampled the data, so the same step is applied here.
    # On this simulated dataset the models actually perform better without resampling;
    # to skip it, simply train the models below on X_train, y_train directly.
    # It is kept here for consistency with the article.
    if True:

        # Use the `SMOTE` algorithm to oversample the training data
        # `sampling_strategy = 1` upsamples the minority class to the same size as the majority class, i.e., a balanced sample
        # `k_neighbors = 20` means 20 nearest neighbors are used when generating each synthetic sample
        smote = SMOTE(sampling_strategy = 1, k_neighbors = 20, random_state = 42)

        # Apply `SMOTE` to the training set to produce a new, balanced dataset
        X_SMOTE_train, y_SMOTE_train = smote.fit_resample(X_train, y_train)

    # DecisionTree DT
    if True:

        # Initialize the decision tree model
        dt_model = DecisionTreeClassifier(random_state = 42)

        # Parameter grid for the grid search
        param_grid = {
            'criterion': ['gini', 'entropy'],        # split criterion
            'max_depth': [None, 10, 20, 30, 40],     # maximum tree depth
            'min_samples_split': [2, 5, 10],         # minimum samples required to split an internal node
            'min_samples_leaf': [1, 2, 4],           # minimum samples required at a leaf node
            'max_features': [None, 'sqrt', 'log2']   # maximum number of features considered per split
        }

        # Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
        grid_search = GridSearchCV(estimator = dt_model, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1)

        # Run the grid search
        grid_search.fit(X_SMOTE_train, y_SMOTE_train)

        # Print the best parameters
        print(f"Best parameters: {grid_search.best_params_}\n")

        # Build the final decision tree model with the best parameters
        best_dt_model = grid_search.best_estimator_

        # best_dt_model is the decision tree tuned by the grid search
        dt_train = evaluate_model_performance(best_dt_model, X_SMOTE_train, y_SMOTE_train)
        dt_test = evaluate_model_performance(best_dt_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))

    # Random Forest RF
    if True:

        # Initialize the random forest model
        rf_model = RandomForestClassifier(random_state = 42)

        # Parameter grid for the grid search
        param_grid = {
            'n_estimators': [50, 100, 200],          # number of trees
            'criterion': ['gini', 'entropy'],        # split criterion
            'max_depth': [None, 10, 20, 30, 40],     # maximum tree depth
            'min_samples_split': [2, 5, 10],         # minimum samples required to split an internal node
            'min_samples_leaf': [1, 2, 4],           # minimum samples required at a leaf node
            'max_features': [None, 'sqrt', 'log2']   # maximum number of features considered per split
        }

        # Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
        grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv = 5, n_jobs = -1, verbose = 1)

        # Run the grid search
        grid_search.fit(X_SMOTE_train, y_SMOTE_train)

        # Print the best parameters
        print(f"Best parameters: {grid_search.best_params_}\n")

        # Build the final random forest model with the best parameters
        best_rf_model = grid_search.best_estimator_

        # Evaluate the model with `evaluate_model_performance`
        rf_train = evaluate_model_performance(best_rf_model, X_SMOTE_train, y_SMOTE_train)
        rf_test = evaluate_model_performance(best_rf_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))

    # XGBoost
    if True:

        # Initialize the XGBoost model
        xgb_model = XGBClassifier(random_state = 42, use_label_encoder = False, eval_metric = 'mlogloss')

        # Parameter grid for the grid search
        param_grid = {
            'n_estimators': [50, 100],        # number of trees
            'learning_rate': [0.01, 0.2],     # learning rate
            'max_depth': [3, 5],              # maximum tree depth
            'min_child_weight': [1, 5],       # minimum child weight
            'subsample': [0.8, 1.0],          # fraction of samples used to train each tree
            'colsample_bytree': [0.8, 1.0],   # fraction of features used per tree
            'gamma': [0, 0.2]                 # minimum loss reduction required to make a split
        }

        # Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
        grid_search = GridSearchCV(estimator = xgb_model, param_grid=param_grid, cv = 5, n_jobs = -1, verbose = 1)

        # Run the grid search
        grid_search.fit(X_SMOTE_train, y_SMOTE_train)

        # Print the best parameters
        print(f"Best parameters: {grid_search.best_params_}\n")

        # Build the final XGBoost model with the best parameters
        best_xgb_model = grid_search.best_estimator_

        # Evaluate the model with evaluate_model_performance
        xgb_train = evaluate_model_performance(best_xgb_model, X_SMOTE_train, y_SMOTE_train)
        xgb_test = evaluate_model_performance(best_xgb_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))

    # LGBM
    if True:

        # Initialize the `LightGBM` model
        lgbm_model = LGBMClassifier(random_state = 42, verbose = -1)

        # Parameter grid for the grid search
        param_grid = {
            'n_estimators': [50, 200],      # number of trees
            'learning_rate': [0.01, 0.2],   # learning rate
            'max_depth': [3, 7],            # maximum tree depth
            'num_leaves': [31, 100]         # maximum number of leaves per tree
        }

        # Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
        grid_search = GridSearchCV(estimator = lgbm_model, param_grid=param_grid, cv = 5, n_jobs = -1, verbose = 1)

        # Run the grid search
        grid_search.fit(X_SMOTE_train, y_SMOTE_train)

        # Print the best parameters
        print(f"Best parameters: {grid_search.best_params_}\n")

        # Build the final LightGBM model with the best parameters
        best_lgbm_model = grid_search.best_estimator_

        # Evaluate the model with `evaluate_model_performance`
        lgbm_train = evaluate_model_performance(best_lgbm_model, X_SMOTE_train, y_SMOTE_train)
        lgbm_test = evaluate_model_performance(best_lgbm_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))

    # SVM
    if True:

        # Initialize the `SVM` model
        svm_model = SVC(random_state = 42, probability = True)

        # Parameter grid for the grid search
        param_grid = {
            'C': [0.1, 1, 10],           # regularization parameter controlling model complexity
            'kernel': ['poly', 'rbf'],   # kernel type
            'gamma': ['scale', 0.1, 1]   # kernel coefficient
        }

        # Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
        grid_search = GridSearchCV(estimator = svm_model, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1)

        # Run the grid search
        grid_search.fit(X_SMOTE_train, y_SMOTE_train)

        # Print the best parameters
        print(f"Best parameters: {grid_search.best_params_}\n")

        # Build the final SVM model with the best parameters
        best_svm_model = grid_search.best_estimator_

        # Evaluate the `SVM` model with `evaluate_model_performance`
        svm_train = evaluate_model_performance(best_svm_model, X_SMOTE_train, y_SMOTE_train)
        svm_test = evaluate_model_performance(best_svm_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))

    # NB
    if True:

        # Initialize the `BernoulliNB` model
        bnb_model = BernoulliNB()

        # Parameter grid for the grid search
        param_grid = {
            'alpha': [0.1, 1, 10],            # additive (Laplace/Lidstone) smoothing parameter; guards against zero counts
            'binarize': [0.0, 0.1, 0.5, 1.0]  # threshold for binarizing the features
        }

        # Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
        grid_search = GridSearchCV(estimator=bnb_model, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1)

        # Run the grid search
        grid_search.fit(X_SMOTE_train, y_SMOTE_train)

        # Print the best parameters
        print(f"Best parameters: {grid_search.best_params_}\n")

        # Build the final `BernoulliNB` model with the best parameters
        best_bnb_model = grid_search.best_estimator_

        # Evaluate the model with `evaluate_model_performance`
        bnb_train = evaluate_model_performance(best_bnb_model, X_SMOTE_train, y_SMOTE_train)
        bnb_test = evaluate_model_performance(best_bnb_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))

    # GBDT
    if True:

        # Initialize the `GradientBoostingClassifier` model
        gbdt_model = GradientBoostingClassifier(random_state = 42)

        # Parameter grid for the grid search
        param_grid = {
            'n_estimators': [50, 100, 200],     # number of trees
            'learning_rate': [0.01, 0.1, 0.2],  # learning rate
            'max_depth': [3, 5, 7],             # maximum depth of each tree
            'subsample': [0.8, 1.0],            # fraction of samples used to train each tree
            'min_samples_split': [2, 5, 10],    # minimum samples required to split an internal node
            'min_samples_leaf': [1, 2, 4]       # minimum samples required at a leaf node
        }

        # Create the `GridSearchCV` object with K-fold cross-validation to select the best parameters
        grid_search = GridSearchCV(estimator=gbdt_model, param_grid=param_grid, cv = 5, n_jobs = -1, verbose = 1)

        # Run the grid search
        grid_search.fit(X_SMOTE_train, y_SMOTE_train)

        # Print the best parameters
        print(f"Best parameters: {grid_search.best_params_}\n")

        # Build the final `GradientBoostingClassifier` model with the best parameters
        best_gbdt_model = grid_search.best_estimator_

        # Evaluate the model with `evaluate_model_performance`
        gbdt_train = evaluate_model_performance(best_gbdt_model, X_SMOTE_train, y_SMOTE_train)
        gbdt_test = evaluate_model_performance(best_gbdt_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))

    # TOPSIS
    if True:

        # Collection of models
        models = {
            'DecisionTree': best_dt_model,
            'RandomForest': best_rf_model,
            'XGBoost': best_xgb_model,
            'LightGBM': best_lgbm_model,
            'SVM': best_svm_model,
            'BernoulliNB': best_bnb_model,
            'GradientBoosting': best_gbdt_model
        }

        # Containers for the results
        metrics = ['AUC', 'F1', 'ACC', 'PRE', 'SEN', 'SPE']
        results = {metric: [] for metric in metrics}

        # Compute the evaluation metrics for each model
        for model_name, model in models.items():
            y_pred = model.predict(X_test)
            # Predicted probabilities, used for the AUC
            y_prob = model.predict_proba(X_test)[:, 1]

            # Compute the individual metrics
            auc = roc_auc_score(y_test, y_prob)
            f1 = f1_score(y_test, y_pred)
            acc = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred)
            recall = recall_score(y_test, y_pred)

            # Sensitivity (recall) and specificity
            tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
            sensitivity = recall  # identical to recall
            specificity = tn / (tn + fp)

            # Append the results to each metric's list
            results['AUC'].append(auc)
            results['F1'].append(f1)
            results['ACC'].append(acc)
            results['PRE'].append(precision)
            results['SEN'].append(sensitivity)
            results['SPE'].append(specificity)

        results_df = pd.DataFrame(results, index = models.keys())

        # Fixed equal weights; other weighting schemes could of course be used instead
        weight = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
        Result, Z, weight = topsis(results_df, weight)

    # Define the ensemble model
    if True:

        class EnsembleModel(BaseEstimator, ClassifierMixin):

            def __init__(self, models, weights):
                """
                Initialize the ensemble model.

                :param models: dict of base learners
                :param weights: array of model weights (the percentage shares computed by TOPSIS)
                """
                self.models = models
                self.weights = np.array(weights)  # weights (percentage shares)

            def fit(self, X, y):
                """
                Fit method of the ensemble: every base learner is trained.
                """
                for model in self.models.values():
                    model.fit(X, y)  # call fit on each base model
                return self

            def predict_proba(self, X):
                """
                Return the weighted predicted probabilities. For a binary problem, both class probabilities are returned.

                :param X: test data
                :return: weighted predicted probabilities for both classes, shape (n_samples, 2)
                """
                probas = np.zeros((X.shape[0], 2))  # one row per sample holding the class-0 and class-1 probabilities

                # Accumulate each model's predicted probabilities
                for i, model in enumerate(self.models.values()):
                    model_proba = model.predict_proba(X)  # this model's predicted probabilities
                    probas[:, 0] += model_proba[:, 0] * self.weights[i]  # weighted class-0 probability
                    probas[:, 1] += model_proba[:, 1] * self.weights[i]  # weighted class-1 probability

                return probas  # weighted probabilities of both classes

            def predict(self, X):
                """
                Return the predicted labels, decided from the weighted probabilities.

                :param X: test data
                :return: final predicted class (0 or 1)
                """
                weighted_proba = self.predict_proba(X)
                return (weighted_proba[:, 1] >= 0.5).astype(int)  # class 1 if the weighted probability >= 0.5, else class 0

        # Initialize the ensemble model
        ensemble_model = EnsembleModel(models = models, weights = Result['百分比占比'])  # TOPSIS weights

        # Train the ensemble model
        ensemble_model.fit(X_SMOTE_train, y_SMOTE_train)

        # Evaluate the ensemble
        ensemble_train = evaluate_model_performance(ensemble_model, X_SMOTE_train, y_SMOTE_train)
        ensemble_test = evaluate_model_performance(ensemble_model, X_test.reset_index(drop = True), y_test.reset_index(drop = True))

        # Predicted probabilities on the test set (weighted probability of class 1)
        probs = ensemble_model.predict_proba(X_test)

        # Predicted labels on the test set (class 0 or class 1)
        predictions = ensemble_model.predict(X_test)

    # Plot the final ROC curves
    if True:

        # DecisionTree model
        fpr_dt, tpr_dt, _ = roc_curve(y_test, best_dt_model.predict_proba(X_test)[:, 1])
        auc_dt = roc_auc_score(y_test, best_dt_model.predict_proba(X_test)[:, 1])

        # RandomForest model
        fpr_rf, tpr_rf, _ = roc_curve(y_test, best_rf_model.predict_proba(X_test)[:, 1])
        auc_rf = roc_auc_score(y_test, best_rf_model.predict_proba(X_test)[:, 1])

        # XGBoost model
        fpr_xgb, tpr_xgb, _ = roc_curve(y_test, best_xgb_model.predict_proba(X_test)[:, 1])
        auc_xgb = roc_auc_score(y_test, best_xgb_model.predict_proba(X_test)[:, 1])

        # LightGBM model
        fpr_lgbm, tpr_lgbm, _ = roc_curve(y_test, best_lgbm_model.predict_proba(X_test)[:, 1])
        auc_lgbm = roc_auc_score(y_test, best_lgbm_model.predict_proba(X_test)[:, 1])

        # SVM model
        fpr_svm, tpr_svm, _ = roc_curve(y_test, best_svm_model.predict_proba(X_test)[:, 1])
        auc_svm = roc_auc_score(y_test, best_svm_model.predict_proba(X_test)[:, 1])

        # BernoulliNB model
        fpr_bnb, tpr_bnb, _ = roc_curve(y_test, best_bnb_model.predict_proba(X_test)[:, 1])
        auc_bnb = roc_auc_score(y_test, best_bnb_model.predict_proba(X_test)[:, 1])

        # GradientBoosting model
        fpr_gbdt, tpr_gbdt, _ = roc_curve(y_test, best_gbdt_model.predict_proba(X_test)[:, 1])
        auc_gbdt = roc_auc_score(y_test, best_gbdt_model.predict_proba(X_test)[:, 1])

        # Ensemble model
        fpr_ensemble, tpr_ensemble, _ = roc_curve(y_test, ensemble_model.predict_proba(X_test)[:, 1])
        auc_ensemble = roc_auc_score(y_test, ensemble_model.predict_proba(X_test)[:, 1])

        # Plot the ROC curves
        plt.figure(figsize = (8, 7))

        # ROC curve of each base model (gray, dashed)
        plt.plot(fpr_dt, tpr_dt, color = 'gray', linestyle = '--', alpha = 0.6, label = f'DT (AUC = {auc_dt:.4f})')
        plt.plot(fpr_rf, tpr_rf, color = 'gray', linestyle = '--', alpha = 0.6, label = f'RF (AUC = {auc_rf:.4f})')
        plt.plot(fpr_xgb, tpr_xgb, color = 'gray', linestyle = '--', alpha = 0.6, label = f'XGB (AUC = {auc_xgb:.4f})')
        plt.plot(fpr_lgbm, tpr_lgbm, color = 'gray', linestyle = '--', alpha = 0.6, label = f'LGBM (AUC = {auc_lgbm:.4f})')
        plt.plot(fpr_svm, tpr_svm, color = 'gray', linestyle = '--', alpha = 0.6, label = f'SVM (AUC = {auc_svm:.4f})')
        plt.plot(fpr_bnb, tpr_bnb, color = 'gray', linestyle = '--', alpha = 0.6, label = f'NB (AUC = {auc_bnb:.4f})')
        plt.plot(fpr_gbdt, tpr_gbdt, color = 'gray', linestyle = '--', alpha = 0.6, label = f'GBDT (AUC = {auc_gbdt:.4f})')

        # ROC curve of the ensemble model (orange)
        plt.plot(fpr_ensemble, tpr_ensemble, color = 'orange', label = f'Ensemble (AUC = {auc_ensemble:.4f})')

        # Diagonal (a random-guessing model)
        plt.plot([0, 1], [0, 1], 'r--', linewidth = 1.5, alpha = 0.8)

        # Axis labels and styling
        plt.xlabel("False Positive Rate (1-Specificity)", fontsize = 18)
        plt.ylabel("True Positive Rate (Sensitivity)", fontsize = 18)
        plt.xticks(fontsize = 16)
        plt.yticks(fontsize = 16)
        plt.legend(loc = "lower right", fontsize = 12, frameon = False)
        plt.gca().spines['top'].set_visible(False)
        plt.gca().spines['right'].set_visible(False)
        plt.gca().spines['left'].set_linewidth(1.5)
        plt.gca().spines['bottom'].set_linewidth(1.5)
        plt.grid(False)

        # Tighten the layout and save the figure
        plt.tight_layout()
        plt.savefig("ROC-Ensemble.pdf", format = 'pdf', bbox_inches = 'tight', dpi = 1200)
        plt.show()

    # Save every model
    if True:

        joblib.dump(best_dt_model, 'DecisionTree.pkl')
        joblib.dump(best_rf_model, 'RandomForest.pkl')
        joblib.dump(best_xgb_model, 'XGBoost.pkl')
        joblib.dump(best_lgbm_model, 'LightGBM.pkl')
        joblib.dump(best_svm_model, 'SVM.pkl')
        joblib.dump(best_bnb_model, 'BernoulliNB.pkl')
        joblib.dump(best_gbdt_model, 'GradientBoosting.pkl')
        joblib.dump(ensemble_model, 'ensemble_model.pkl')
  • Title: A TOPSIS-Weighted Multi-Model Fusion Method and Its Application to Improving Predictive Performance
  • Author: Xing Abao
  • Created: 2025-10-26 22:51:21
  • Updated: 2025-10-28 22:40:55
  • Link: https://bioinformatics.vip/2025/10/26/sad41d8cd/251026_TOPSIS_Weighted_Model_Fusion/
  • License: This work is licensed under CC BY-NC-SA 4.0.