Combining machine learning algorithms - voting ensembles of multiple classification models with VotingClassifier

Xing Abao

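A voting classifier built on ensemble learning: from individual wisdom to collective decisions.

In data science and machine learning we usually aim to build a predictive model that is as accurate and robust as possible. But just as relying on a single expert's opinion in the real world can introduce bias or blind spots, a single machine learning model may perform poorly on parts of the data. Ensemble learning addresses this problem. Following the proverb "three cobblers with their wits combined equal Zhuge Liang" (roughly, many heads are better than one), it combines the predictions of several models, which often gives better performance than any single model on its own. This example applies one classic method from that family: voting. Beyond building the models, it walks through the whole process from data preparation to model validation, using visualizations and evaluation metrics along the way.

Model selection and ensembling: assembling an "expert committee".

We first initialize six popular, strong classification models: random forest, XGBoost, LightGBM, gradient boosting, AdaBoost, and CatBoost. Think of these six models as six top experts from different fields, each good at analyzing a problem from a different angle. In bioinformatics, for example, some models may be better suited to high-dimensional gene expression data, while others are better at recognizing patterns in images. Relying on any single expert leaves blind spots, so pooling their "opinions" into a collective decision by an "expert committee" often leads to more reliable conclusions. That is what VotingClassifier does.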

Simulated example

Load the modules

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore", message = ".*does not have valid feature names.*")
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False

Load the data

path = 'Z:/TData/big-data/sad41d8cd/251030_voting_classifier_ensemble_learning.xlsx'
df = pd.read_excel(path)

df.head()
Out[10]:
   X_1  X_2  X_3  X_4  X_5  X_6  X_7  X_8  X_9  X_10  X_11  X_12  X_13  class
0   63    1    1  145  233    1    2  150    0   2.3     3     0     1      0
1   67    1    4  160  286    0    2  108    1   1.5     2     3     0      1
2   67    1    4  120  229    0    2  129    1   2.6     2     2     2      1
3   37    1    3  130  250    0    0  187    0   3.5     3     0     0      0
4   41    0    2  130  204    0    2  172    0   1.4     1     0     0      0

df.columns
Out[11]:
Index(['X_1', 'X_2', 'X_3', 'X_4', 'X_5', 'X_6', 'X_7', 'X_8', 'X_9', 'X_10',
       'X_11', 'X_12', 'X_13', 'class'],
      dtype='object')

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   X_1     297 non-null    int64
 1   X_2     297 non-null    int64
 2   X_3     297 non-null    int64
 3   X_4     297 non-null    int64
 4   X_5     297 non-null    int64
 5   X_6     297 non-null    int64
 6   X_7     297 non-null    int64
 7   X_8     297 non-null    int64
 8   X_9     297 non-null    int64
 9   X_10    297 non-null    float64
 10  X_11    297 non-null    int64
 11  X_12    297 non-null    int64
 12  X_13    297 non-null    int64
 13  class   297 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 32.6 KB

Training and test sets

# Separate the features from the target variable
X = df.drop(['class'], axis = 1)
y = df['class']

# Split into training and test sets; stratify keeps the class ratio the same in both
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = df['class'])

Define the models

rf_clf = RandomForestClassifier(random_state = 42)
xgb_clf = XGBClassifier(use_label_encoder = False, eval_metric = 'logloss', random_state = 42)
lgbm_clf = LGBMClassifier(random_state = 42, verbose = -1)
gbm_clf = GradientBoostingClassifier(random_state = 42)
adaboost_clf = AdaBoostClassifier(random_state = 42, algorithm = 'SAMME')
catboost_clf = CatBoostClassifier(verbose = 0, random_state = 42)

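Hard voting

Hard voting is intuitive, much like a democratic election. For a new sample, each of the six models makes its own prediction. The hard voting classifier counts the predicted classes and, by majority rule, takes the class with the most votes as the final prediction. With six voters a 3-3 tie is possible; scikit-learn then picks the class that comes first in ascending sort order. The approach is simple and direct, but it ignores how confident each "expert" is in its judgment.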
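
The majority rule can be sketched in a few lines of plain Python. This is only an illustration of the counting step, not scikit-learn's implementation: the helper `majority_vote`, the vote lists, and the reading of 1 as "diseased" and 0 as "healthy" are all hypothetical. Ties are resolved in favor of the smallest label, which matches an ascending sort order for integer classes.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; on a tie, return the smallest label."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


# Hypothetical predictions from six models for three samples
# (assumed coding: 1 = diseased, 0 = healthy).
votes_per_sample = [
    [1, 1, 0, 1, 0, 1],  # four votes for class 1
    [0, 0, 1, 0, 0, 1],  # four votes for class 0
    [1, 0, 1, 0, 1, 0],  # 3-3 tie, resolved to the smaller label
]

final = [majority_vote(votes) for votes in votes_per_sample]
print(final)  # [1, 0, 0]
```

The third sample shows why an even number of voters needs a tie rule: the outcome depends on the label order, not on the models.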

# Create the hard voting classifier
voting_hard = VotingClassifier(
    estimators = [
        ('RandomForest', rf_clf),
        ('XGBoost', xgb_clf),
        ('LightGBM', lgbm_clf),
        ('GradientBoosting', gbm_clf),
        ('AdaBoost', adaboost_clf),
        ('CatBoost', catboost_clf)
    ],
    voting = 'hard'
)

# Train the hard voting classifier
voting_hard.fit(X_train, y_train)

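Soft voting

Soft voting, by contrast, is a finer-grained and often stronger strategy. Besides the class each model predicts, it takes into account the probability, or confidence, that each model assigns to each class. For example, for one patient, model A may predict "diseased" with 95% confidence while models B and C each predict "healthy" with only 60% confidence. Soft voting takes the weighted average of the predicted probabilities across all models and chooses the class with the highest average probability. Because it uses the richer information each model provides, soft voting often achieves higher accuracy than hard voting, provided the models output reasonably well-calibrated probabilities. The weights parameter lets you give better-performing models a larger weight, so that the opinion of an "authoritative expert" counts for more in the collective decision.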
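
The patient example can be worked through numerically. The sketch below assumes NumPy and uses the three probabilities from the paragraph; the alternative weights `[1, 3, 3]` are hypothetical and chosen only to show that weighting can change the outcome.

```python
import numpy as np

# Rows are models A, B, C; columns are P(healthy), P(diseased).
proba = np.array([
    [0.05, 0.95],  # model A: 95% "diseased"
    [0.60, 0.40],  # model B: 60% "healthy"
    [0.60, 0.40],  # model C: 60% "healthy"
])

# Hard voting: each model casts one vote for its most likely class.
hard_votes = proba.argmax(axis=1)               # [1, 0, 0]
hard_winner = np.bincount(hard_votes).argmax()  # class 0, "healthy"

# Soft voting with equal weights: average the probabilities, then pick the largest.
equal_avg = np.average(proba, axis=0, weights=[1, 1, 1])
soft_winner = equal_avg.argmax()                # class 1, "diseased"

# Hypothetical weights that trust models B and C three times as much as A.
weighted_avg = np.average(proba, axis=0, weights=[1, 3, 3])
weighted_winner = weighted_avg.argmax()         # class 0, "healthy"

print(hard_winner, equal_avg.round(3), soft_winner, weighted_avg.round(3), weighted_winner)
```

With equal weights the averaged probability of "diseased" is about 0.583, so soft voting overrules the two lukewarm "healthy" votes that win under hard voting. Raising the weights of B and C moves the average back to "healthy".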

# Create the soft voting classifier
voting_soft = VotingClassifier(
    estimators = [
        ('RandomForest', rf_clf),
        ('XGBoost', xgb_clf),
        ('LightGBM', lgbm_clf),
        ('GradientBoosting', gbm_clf),
        ('AdaBoost', adaboost_clf),
        ('CatBoost', catboost_clf)
    ],
    voting = 'soft',
    # Equal weights; raise a value to give that model more influence
    weights = [1, 1, 1, 1, 1, 1]
)

# Train the soft voting classifier
voting_soft.fit(X_train, y_train)

Predictions with hard and soft voting

# Predict the test set with hard voting
y_pred_hard = voting_hard.predict(X_test)

# Print the evaluation metrics for the hard voting model
print("Classification Report for Hard Voting:")
print(classification_report(y_test, y_pred_hard))

Classification Report for Hard Voting:
              precision    recall  f1-score   support

           0       0.78      0.91      0.84        32
           1       0.87      0.71      0.78        28

    accuracy                           0.82        60
   macro avg       0.83      0.81      0.81        60
weighted avg       0.82      0.82      0.81        60

# Predict the test set with soft voting
y_pred_soft = voting_soft.predict(X_test)

# Print the evaluation metrics for the soft voting model
print("Classification Report for Soft Voting:")
print(classification_report(y_test, y_pred_soft))

Classification Report for Soft Voting:
              precision    recall  f1-score   support

           0       0.81      0.91      0.85        32
           1       0.88      0.75      0.81        28

    accuracy                           0.83        60
   macro avg       0.84      0.83      0.83        60
weighted avg       0.84      0.83      0.83        60

Confusion matrices

# Confusion matrix for hard voting
conf_matrix_hard = confusion_matrix(y_test, y_pred_hard)

# Confusion matrix for soft voting
conf_matrix_soft = confusion_matrix(y_test, y_pred_soft)
fig, axes = plt.subplots(1, 2, figsize = (16, 6), dpi=1200)

# Heatmap of the hard voting confusion matrix
sns.heatmap(conf_matrix_hard, annot = True, annot_kws = {'size': 15}, fmt = 'd', cmap = 'YlGnBu', cbar_kws = {'shrink': 0.75}, ax = axes[0])
axes[0].set_title('Confusion Matrix (Hard Voting)', fontsize = 15)
axes[0].set_xlabel('Predicted Label', fontsize = 15)
axes[0].set_ylabel('True Label', fontsize = 15)

# Heatmap of the soft voting confusion matrix
sns.heatmap(conf_matrix_soft, annot = True, annot_kws = {'size': 15}, fmt = 'd', cmap = 'YlGnBu', cbar_kws = {'shrink': 0.75}, ax = axes[1])
axes[1].set_title('Confusion Matrix (Soft Voting)', fontsize = 15)
axes[1].set_xlabel('Predicted Label', fontsize = 15)
axes[1].set_ylabel('True Label', fontsize = 15)
plt.tight_layout()
plt.savefig("混淆矩阵_硬投票_软投票.png", bbox_inches = 'tight')
plt.close()
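
To make the link between a confusion matrix and the classification report concrete, the sketch below recomputes precision, recall, F1, and accuracy from raw counts. The counts are an assumption: they are not printed in the post, but were inferred as the only 2x2 layout consistent with the hard voting report (supports of 32 and 28, recalls of 0.91 and 0.71). The helper `per_class_metrics` is hypothetical.

```python
# Rows are true classes, columns are predicted classes.
# Assumption: counts inferred from the hard voting report, not taken from the post.
cm = [
    [29, 3],   # true class 0: 29 predicted as 0, 3 predicted as 1
    [8, 20],   # true class 1: 8 predicted as 0, 20 predicted as 1
]


def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp   # predicted k, but true class differs
    fn = sum(cm[k]) - tp                  # true k, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

for k in range(len(cm)):
    p, r, f = per_class_metrics(cm, k)
    print(k, round(p, 2), round(r, 2), round(f, 2))
print("accuracy", round(accuracy, 2))
```

Under these assumed counts the rounded values reproduce the hard voting report: 0.78 / 0.91 / 0.84 for class 0, 0.87 / 0.71 / 0.78 for class 1, and an accuracy of 0.82.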

ROC curves and AUC for all models

# Base models whose ROC curves will be drawn
models = {
    'RandomForest': rf_clf,
    'XGBoost': xgb_clf,
    'LightGBM': lgbm_clf,
    'GradientBoosting': gbm_clf,
    'AdaBoost': adaboost_clf,
    'CatBoost': catboost_clf
}

# Plot the ROC curves
plt.figure(figsize = (10, 8))
for name, model in models.items():
    # VotingClassifier fits clones, so these original estimators are still unfitted;
    # fit each one and take the predicted probability of the positive class
    y_proba = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    # Compute the ROC curve and AUC
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc_score = roc_auc_score(y_test, y_proba)
    # Draw the ROC curve
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc_score:.2f})")

# Add a ROC curve for the hard voting classifier (already fitted above).
# Hard voting has no predict_proba, and transform() returns each model's predicted
# label with shape (n_samples, n_models), so column 1 alone would be XGBoost's labels.
# Use the share of models voting for class 1 as the score instead (labels are 0/1).
y_score_hard = voting_hard.transform(X_test).mean(axis = 1)
fpr_hard, tpr_hard, _ = roc_curve(y_test, y_score_hard)
auc_score_hard = roc_auc_score(y_test, y_score_hard)

plt.plot(fpr_hard, tpr_hard, label = f"Hard Voting (AUC = {auc_score_hard:.2f})", linestyle = '--')
plt.plot([0, 1], [0, 1], 'k--', label = "Random Guessing")
plt.xlabel('False Positive Rate (FPR)', fontsize = 18)
plt.ylabel('True Positive Rate (TPR)', fontsize = 18)
plt.title('ROC Curve of Base Models and Voting Classifier', fontsize = 18)
plt.legend(loc = 'lower right')
plt.grid()
plt.savefig("ROC Curve of Base Models and Voting Classifier.png", bbox_inches = 'tight', dpi = 1200)
plt.close()
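
What `transform` returns depends on the voting mode, which matters when a score is needed for a ROC curve. The sketch below is a small check on synthetic data, not the post's pipeline: the dataset from `make_classification` and the three stand-in estimators are assumptions chosen so the example runs with scikit-learn alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the real table (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('tree', DecisionTreeClassifier(random_state=0)),
    ('nb', GaussianNB()),
]

# Hard voting: transform() gives one predicted label per model.
hard = VotingClassifier(estimators=estimators, voting='hard').fit(X_demo, y_demo)
votes = hard.transform(X_demo)       # shape (n_samples, n_models)
vote_share = votes.mean(axis=1)      # fraction of models voting for class 1

# Soft voting: predict_proba() gives the averaged class probabilities.
soft = VotingClassifier(estimators=estimators, voting='soft').fit(X_demo, y_demo)
avg_proba = soft.predict_proba(X_demo)   # shape (n_samples, n_classes)

print(votes.shape, avg_proba.shape)
```

With three voters the vote share can only take the values 0, 1/3, 2/3, and 1, so a ROC curve built from it has at most a handful of steps. The averaged probabilities from soft voting give a much finer ranking.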

ROC curve and AUC for soft voting

# Predicted probabilities from the soft voting classifier;
# keep the column for the positive class
y_proba_soft = voting_soft.predict_proba(X_test)[:, 1]

# Compute the ROC curve and AUC for the soft voting classifier
fpr_soft, tpr_soft, _ = roc_curve(y_test, y_proba_soft)
auc_score_soft = roc_auc_score(y_test, y_proba_soft)

# Draw the ROC curve
plt.figure(figsize = (8, 6))
plt.plot(fpr_soft, tpr_soft, label = f"Soft Voting (AUC = {auc_score_soft:.2f})")

# Add the random-guessing baseline
plt.plot([0, 1], [0, 1], 'k--', label = "Random Guessing")

# Labels, title, legend and grid
plt.xlabel('False Positive Rate (FPR)', fontsize = 18)
plt.ylabel('True Positive Rate (TPR)', fontsize = 18)
plt.title('ROC Curve of Soft Voting Classifier', fontsize = 18)
plt.legend(loc='lower right')
plt.grid()
plt.savefig("ROC Curve of Soft Voting Classifier.png", bbox_inches = 'tight', dpi = 1200)
plt.show()
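
One way to read an AUC value is as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, counting ties as one half. The sketch below computes it by brute force over all positive/negative pairs. The function `auc_by_pairs` and the six scores are hypothetical and only illustrate the definition; for real data `roc_auc_score` is the practical choice.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and positive-class scores for six samples.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.50, 0.90]

auc = auc_by_pairs(y_true, scores)
print(round(auc, 3))  # 0.778
```

Of the nine positive/negative pairs, seven are ordered correctly: the positive sample scored 0.35 ranks below the negatives at 0.40 and 0.50. That gives 7/9, about 0.778.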

Full code

# -*- coding: utf-8 -*-

import os
import warnings

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score

warnings.filterwarnings("ignore", message = ".*does not have valid feature names.*")

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False

if __name__ == '__main__':

    # Working directory; the figures are saved here
    wkdir = 'C:/Users/Administrator/Desktop'
    os.chdir(wkdir)

    # Read the data
    path = 'Z:/TData/big-data/sad41d8cd/251030_voting_classifier_ensemble_learning.xlsx'
    df = pd.read_excel(path)

    # Quick look at the data
    print(df.head())
    print(df.columns)
    df.info()

    # Separate the features from the target variable
    X = df.drop(['class'], axis = 1)
    y = df['class']

    # Split into training and test sets; stratify keeps the class ratio the same in both
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = df['class'])

    # Define the models
    rf_clf = RandomForestClassifier(random_state = 42)
    xgb_clf = XGBClassifier(use_label_encoder = False, eval_metric = 'logloss', random_state = 42)
    lgbm_clf = LGBMClassifier(random_state = 42, verbose = -1)
    gbm_clf = GradientBoostingClassifier(random_state = 42)
    adaboost_clf = AdaBoostClassifier(random_state = 42, algorithm = 'SAMME')
    catboost_clf = CatBoostClassifier(verbose = 0, random_state = 42)

    # Hard voting: create and train the classifier
    voting_hard = VotingClassifier(
        estimators = [
            ('RandomForest', rf_clf),
            ('XGBoost', xgb_clf),
            ('LightGBM', lgbm_clf),
            ('GradientBoosting', gbm_clf),
            ('AdaBoost', adaboost_clf),
            ('CatBoost', catboost_clf)
        ],
        voting = 'hard'
    )
    voting_hard.fit(X_train, y_train)

    # Soft voting: create and train the classifier
    voting_soft = VotingClassifier(
        estimators = [
            ('RandomForest', rf_clf),
            ('XGBoost', xgb_clf),
            ('LightGBM', lgbm_clf),
            ('GradientBoosting', gbm_clf),
            ('AdaBoost', adaboost_clf),
            ('CatBoost', catboost_clf)
        ],
        voting = 'soft',
        # Equal weights; raise a value to give that model more influence
        weights = [1, 1, 1, 1, 1, 1]
    )
    voting_soft.fit(X_train, y_train)

    # Predict the test set with hard voting and print the evaluation metrics
    y_pred_hard = voting_hard.predict(X_test)
    print("Classification Report for Hard Voting:")
    print(classification_report(y_test, y_pred_hard))

    # Predict the test set with soft voting and print the evaluation metrics
    y_pred_soft = voting_soft.predict(X_test)
    print("Classification Report for Soft Voting:")
    print(classification_report(y_test, y_pred_soft))

    # Confusion matrices for hard and soft voting
    conf_matrix_hard = confusion_matrix(y_test, y_pred_hard)
    conf_matrix_soft = confusion_matrix(y_test, y_pred_soft)
    fig, axes = plt.subplots(1, 2, figsize = (16, 6), dpi=1200)

    # Heatmap of the hard voting confusion matrix
    sns.heatmap(conf_matrix_hard, annot = True, annot_kws = {'size': 15}, fmt = 'd', cmap = 'YlGnBu', cbar_kws = {'shrink': 0.75}, ax = axes[0])
    axes[0].set_title('Confusion Matrix (Hard Voting)', fontsize = 15)
    axes[0].set_xlabel('Predicted Label', fontsize = 15)
    axes[0].set_ylabel('True Label', fontsize = 15)

    # Heatmap of the soft voting confusion matrix
    sns.heatmap(conf_matrix_soft, annot = True, annot_kws = {'size': 15}, fmt = 'd', cmap = 'YlGnBu', cbar_kws = {'shrink': 0.75}, ax = axes[1])
    axes[1].set_title('Confusion Matrix (Soft Voting)', fontsize = 15)
    axes[1].set_xlabel('Predicted Label', fontsize = 15)
    axes[1].set_ylabel('True Label', fontsize = 15)
    plt.tight_layout()
    plt.savefig("混淆矩阵_硬投票_软投票.png", bbox_inches = 'tight')
    plt.close()

    # ROC curves and AUC for all base models
    models = {
        'RandomForest': rf_clf,
        'XGBoost': xgb_clf,
        'LightGBM': lgbm_clf,
        'GradientBoosting': gbm_clf,
        'AdaBoost': adaboost_clf,
        'CatBoost': catboost_clf
    }

    plt.figure(figsize = (10, 8))
    for name, model in models.items():
        # VotingClassifier fits clones, so these original estimators are still unfitted;
        # fit each one and take the predicted probability of the positive class
        y_proba = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
        # Compute the ROC curve and AUC
        fpr, tpr, _ = roc_curve(y_test, y_proba)
        auc_score = roc_auc_score(y_test, y_proba)
        # Draw the ROC curve
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc_score:.2f})")

    # Add a ROC curve for the hard voting classifier (already fitted above).
    # Hard voting has no predict_proba, and transform() returns each model's predicted
    # label with shape (n_samples, n_models), so column 1 alone would be XGBoost's labels.
    # Use the share of models voting for class 1 as the score instead (labels are 0/1).
    y_score_hard = voting_hard.transform(X_test).mean(axis = 1)
    fpr_hard, tpr_hard, _ = roc_curve(y_test, y_score_hard)
    auc_score_hard = roc_auc_score(y_test, y_score_hard)

    plt.plot(fpr_hard, tpr_hard, label = f"Hard Voting (AUC = {auc_score_hard:.2f})", linestyle = '--')
    plt.plot([0, 1], [0, 1], 'k--', label = "Random Guessing")
    plt.xlabel('False Positive Rate (FPR)', fontsize = 18)
    plt.ylabel('True Positive Rate (TPR)', fontsize = 18)
    plt.title('ROC Curve of Base Models and Voting Classifier', fontsize = 18)
    plt.legend(loc = 'lower right')
    plt.grid()
    plt.savefig("ROC Curve of Base Models and Voting Classifier.png", bbox_inches = 'tight', dpi = 1200)
    plt.close()

    # ROC curve and AUC for soft voting, using the positive-class probability
    y_proba_soft = voting_soft.predict_proba(X_test)[:, 1]
    fpr_soft, tpr_soft, _ = roc_curve(y_test, y_proba_soft)
    auc_score_soft = roc_auc_score(y_test, y_proba_soft)

    plt.figure(figsize = (8, 6))
    plt.plot(fpr_soft, tpr_soft, label = f"Soft Voting (AUC = {auc_score_soft:.2f})")

    # Add the random-guessing baseline
    plt.plot([0, 1], [0, 1], 'k--', label = "Random Guessing")

    # Labels, title, legend and grid
    plt.xlabel('False Positive Rate (FPR)', fontsize = 18)
    plt.ylabel('True Positive Rate (TPR)', fontsize = 18)
    plt.title('ROC Curve of Soft Voting Classifier', fontsize = 18)
    plt.legend(loc='lower right')
    plt.grid()
    plt.savefig("ROC Curve of Soft Voting Classifier.png", bbox_inches = 'tight', dpi = 1200)
    plt.show()
  • Title: Combining machine learning algorithms - voting ensembles of multiple classification models with VotingClassifier
  • Author: Xing Abao
  • Created at : 2025-10-30 08:33:27
  • Updated at : 2025-10-30 09:20:49
  • Link: https://bioinformatics.vip/2025/10/30/sad41d8cd/251030_voting_classifier_ensemble_learning/
  • License: This work is licensed under CC BY-NC-SA 4.0.