Combining Machine Learning Algorithms - Building a Multi-Model Voting Ensemble for Classification with VotingClassifier

Xing Abao

Created 2025-10-30 08:33:27, updated 2025-10-30 09:20:49

Machine Learning

A voting classifier based on ensemble learning: from individual wisdom to collective decisions.

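In data science and machine learning, we usually want a predictive model that is as accurate and robust as possible. But just as relying on a single expert's opinion in the real world can introduce bias or blind spots, a single machine learning model can perform poorly on some data. Ensemble learning addresses this problem. It borrows the collective-wisdom idea behind the Chinese proverb "three cobblers with their wits combined equal one Zhuge Liang" (roughly, "two heads are better than one"): it combines the predictions of several models and often performs better than any individual model, particularly when the models make different kinds of errors. This example demonstrates one classic ensemble method, voting. Beyond building the models, it walks through the whole process from data preparation to model validation, using visualizations and evaluation metrics along the way.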
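To make the collective-wisdom intuition concrete, here is a minimal sketch under a strong simplifying assumption that the article does not make: every model is correct with the same probability `p`, and the models err independently of one another. The function name `majority_vote_accuracy` is hypothetical and used only for illustration.

```python
from math import comb

def majority_vote_accuracy(n_models, p):
    """Probability that a strict majority of n_models is correct.

    Assumes (for illustration only) that each model is right with
    probability p and that the models' errors are independent.
    Use an odd n_models so that ties cannot occur.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# A single 70%-accurate model versus committees of three and five such models
for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
```

Under these assumptions the committee of three is right about 78% of the time and the committee of five about 84% of the time. Real models trained on the same data make correlated errors, so the gain in practice is usually smaller than this idealized calculation suggests.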

Model selection and ensembling: assembling an "expert committee".

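We first initialize six popular, high-performing classifiers: random forest, XGBoost, LightGBM, gradient boosting machine, AdaBoost, and CatBoost. Think of these six models as six top experts from different fields, each analyzing the problem from a different angle. In bioinformatics, for example, some models may be better at handling high-dimensional gene expression data, while others have an edge in recognizing patterns in images. Relying on any single expert risks blind spots, so pooling their "opinions" into a collective decision by an "expert committee" often yields a more reliable conclusion. This is exactly what VotingClassifier does.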

Worked example

Loading the modules

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore", message = ".*does not have valid feature names.*")
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
             
plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False

Loading the data

path = 'Z:/TData/big-data/sad41d8cd/251030_voting_classifier_ensemble_learning.xlsx'
df = pd.read_excel(path)

df.head()
Out[10]: 
   X_1  X_2  X_3  X_4  X_5  X_6  X_7  X_8  X_9  X_10  X_11  X_12  X_13  class
0   63    1    1  145  233    1    2  150    0   2.3     3     0     1      0
1   67    1    4  160  286    0    2  108    1   1.5     2     3     0      1
2   67    1    4  120  229    0    2  129    1   2.6     2     2     2      1
3   37    1    3  130  250    0    0  187    0   3.5     3     0     0      0
4   41    0    2  130  204    0    2  172    0   1.4     1     0     0      0

df.columns
Out[11]: 
Index(['X_1', 'X_2', 'X_3', 'X_4', 'X_5', 'X_6', 'X_7', 'X_8', 'X_9', 'X_10',
       'X_11', 'X_12', 'X_13', 'class'],
      dtype='object')

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X_1     297 non-null    int64  
 1   X_2     297 non-null    int64  
 2   X_3     297 non-null    int64  
 3   X_4     297 non-null    int64  
 4   X_5     297 non-null    int64  
 5   X_6     297 non-null    int64  
 6   X_7     297 non-null    int64  
 7   X_8     297 non-null    int64  
 8   X_9     297 non-null    int64  
 9   X_10    297 non-null    float64
 10  X_11    297 non-null    int64  
 11  X_12    297 non-null    int64  
 12  X_13    297 non-null    int64  
 13  class   297 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 32.6 KB

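Training and test sets

# Separate the features from the target variable
X = df.drop(['class'], axis = 1)
y = df['class']

# Split into training and test sets (stratified on the class label)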
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = df['class'])

Defining the models

rf_clf = RandomForestClassifier(random_state = 42)
xgb_clf = XGBClassifier(use_label_encoder = False, eval_metric = 'logloss', random_state = 42)
lgbm_clf = LGBMClassifier(random_state = 42, verbose = -1)
gbm_clf = GradientBoostingClassifier(random_state = 42)
adaboost_clf = AdaBoostClassifier(random_state = 42, algorithm = 'SAMME')
catboost_clf = CatBoostClassifier(verbose = 0, random_state = 42)

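Hard voting

Hard voting is intuitive and works like a democratic election. For a new sample, each of the six models makes its own prediction. The hard-voting classifier counts these predictions and, by majority rule, takes the class with the most votes as the final prediction. The method is simple and direct, but it ignores how confident each "expert" is in its judgment.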
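The majority rule described above can be sketched in a few lines of plain Python. This is an illustration and not scikit-learn's implementation; the helper name `hard_vote` is hypothetical. Ties are resolved here in favor of the smallest label, which to my understanding matches how scikit-learn breaks ties (ascending label order), but treat that detail as an assumption.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label chosen by the most models.

    predictions: one predicted label per model, for a single sample.
    Ties go to the smallest label (an assumption made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

# Six models vote on one sample: four say class 1, two say class 0
print(hard_vote([1, 0, 1, 1, 0, 1]))

# With an even number of models a 3-3 tie is possible
print(hard_vote([1, 0, 1, 0, 1, 0]))
```

Because the article uses six base models, ties like the second case can occur, which is one practical reason to prefer an odd number of voters or soft voting.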

# Create the hard-voting classifier
voting_hard = VotingClassifier(
    estimators = [
        ('RandomForest', rf_clf),
        ('XGBoost', xgb_clf),
        ('LightGBM', lgbm_clf),
        ('GradientBoosting', gbm_clf),
        ('AdaBoost', adaboost_clf),
        ('CatBoost', catboost_clf)
    ],
    voting = 'hard'
)

# Train the hard-voting classifier
voting_hard.fit(X_train, y_train)

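Soft voting

Soft voting, by contrast, is a finer-grained and often stronger strategy. It considers not only each model's predicted class but also the probability, or confidence, that each model assigns to every class. Suppose that for one patient, model A predicts "diseased" with 95% confidence, while models B and C each predict "healthy" with only 60% confidence. Soft voting takes a (weighted) average of the predicted probabilities across all models and picks the class with the highest average probability. Here the average probability of "diseased" is (0.95 + 0.40 + 0.40) / 3 ≈ 0.58, so soft voting predicts "diseased" even though hard voting would predict "healthy" by two votes to one. Because it uses the richer information each model provides, soft voting often achieves higher accuracy than hard voting, provided the base models produce reasonably calibrated probabilities. The weights parameter lets you assign higher weights to better-performing models, so that the "authoritative experts" carry more weight in the collective decision.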
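The patient example from the paragraph above can be worked through in plain Python. This is a minimal sketch of probability averaging, not scikit-learn's code; the helper name `soft_vote` and the weights `[1, 3, 3]` are made up for illustration.

```python
def soft_vote(probas, weights=None):
    """Average per-model class probabilities and pick the best class.

    probas: one list per model, e.g. [P(healthy), P(diseased)].
    weights: optional per-model weights (equal weights if omitted).
    Returns (averaged probabilities, index of the winning class).
    """
    if weights is None:
        weights = [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: avg[c])
    return avg, winner

# Index 0 = healthy, index 1 = diseased
probas = [
    [0.05, 0.95],  # model A: 95% diseased
    [0.60, 0.40],  # model B: 60% healthy
    [0.60, 0.40],  # model C: 60% healthy
]

avg_equal, label_equal = soft_vote(probas)
print(avg_equal, label_equal)        # equal weights: class 1 (diseased) wins

avg_weighted, label_weighted = soft_vote(probas, weights=[1, 3, 3])
print(avg_weighted, label_weighted)  # trusting B and C more: class 0 wins
```

With equal weights the single confident model outweighs two lukewarm ones. Once models B and C are weighted three times as heavily, the decision flips, which shows how strongly the choice of weights can shape the ensemble's output.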

# Create the soft-voting classifier
voting_soft = VotingClassifier(
    estimators = [
        ('RandomForest', rf_clf),
        ('XGBoost', xgb_clf),
        ('LightGBM', lgbm_clf),
        ('GradientBoosting', gbm_clf),
        ('AdaBoost', adaboost_clf),
        ('CatBoost', catboost_clf)
    ],
    voting = 'soft',
    weights = [1, 1, 1, 1, 1, 1]
)

# Train the soft-voting classifier
voting_soft.fit(X_train, y_train)

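Predictions with hard and soft voting

# Predict the test set with hard voting
y_pred_hard = voting_hard.predict(X_test)

# Print the evaluation metrics for the hard-voting model
print("Classification Report for Hard Voting:")
print(classification_report(y_test, y_pred_hard))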

Classification Report for Hard Voting:
              precision    recall  f1-score   support

           0       0.78      0.91      0.84        32
           1       0.87      0.71      0.78        28

    accuracy                           0.82        60
   macro avg       0.83      0.81      0.81        60
weighted avg       0.82      0.82      0.81        60

# Predict the test set with soft voting
y_pred_soft = voting_soft.predict(X_test)

# Print the evaluation metrics for the soft-voting model
print("Classification Report for Soft Voting:")
print(classification_report(y_test, y_pred_soft))

Classification Report for Soft Voting:
              precision    recall  f1-score   support

           0       0.81      0.91      0.85        32
           1       0.88      0.75      0.81        28

    accuracy                           0.83        60
   macro avg       0.84      0.83      0.83        60
weighted avg       0.84      0.83      0.83        60

Confusion matrices

# Confusion matrix for hard voting
conf_matrix_hard = confusion_matrix(y_test, y_pred_hard)

# Confusion matrix for soft voting
conf_matrix_soft = confusion_matrix(y_test, y_pred_soft)
fig, axes = plt.subplots(1, 2, figsize = (16, 6), dpi=1200)

# Heatmap of the hard-voting confusion matrix
sns.heatmap(conf_matrix_hard, annot = True, annot_kws = {'size': 15}, fmt = 'd', cmap = 'YlGnBu', cbar_kws = {'shrink': 0.75}, ax = axes[0])
axes[0].set_title('Confusion Matrix (Hard Voting)', fontsize = 15)
axes[0].set_xlabel('Predicted Label', fontsize = 15)
axes[0].set_ylabel('True Label', fontsize = 15)

# Heatmap of the soft-voting confusion matrix
sns.heatmap(conf_matrix_soft, annot = True, annot_kws = {'size': 15}, fmt = 'd', cmap = 'YlGnBu', cbar_kws = {'shrink': 0.75}, ax = axes[1])
axes[1].set_title('Confusion Matrix (Soft Voting)', fontsize = 15)
axes[1].set_xlabel('Predicted Label', fontsize = 15)
axes[1].set_ylabel('True Label', fontsize = 15)
plt.tight_layout()
plt.savefig("混淆矩阵_硬投票_软投票.png", bbox_inches = 'tight')
plt.close()

ROC curves and AUC for all models

# Map each display name to its base model for the ROC comparison
models = {
    'RandomForest': rf_clf,
    'XGBoost': xgb_clf,
    'LightGBM': lgbm_clf,
    'GradientBoosting': gbm_clf,
    'AdaBoost': adaboost_clf,
    'CatBoost': catboost_clf
}

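# Plot the ROC curves
plt.figure(figsize = (10, 8))
for name, model in models.items():
    # Fit the base model and get the predicted probability of the positive class
    y_proba = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    # Compute the ROC curve and AUC
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc_score = roc_auc_score(y_test, y_proba)
    # Draw the ROC curve
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc_score:.2f})")

# Add the ROC curve for the hard-voting classifier
voting_hard.fit(X_train, y_train)
y_pred_hard = voting_hard.predict(X_test)

# With voting = 'hard', transform() returns one column of predicted labels per
# base model, not probabilities, so column 1 alone would be XGBoost's labels.
# Use the share of models voting for class 1 as the ranking score instead.
y_score_hard = voting_hard.transform(X_test).mean(axis = 1)
fpr_hard, tpr_hard, _ = roc_curve(y_test, y_score_hard)
auc_score_hard = roc_auc_score(y_test, y_score_hard)

plt.plot(fpr_hard, tpr_hard, label = f"Hard Voting (AUC = {auc_score_hard:.2f})", linestyle = '--')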
plt.plot([0, 1], [0, 1], 'k--', label = "Random Guessing")
plt.xlabel('False Positive Rate (FPR)', fontsize = 18)
plt.ylabel('True Positive Rate (TPR)', fontsize = 18)
plt.title('ROC Curve of Base Models and Voting Classifier', fontsize = 18)
plt.legend(loc = 'lower right')
plt.grid()
plt.savefig("ROC Curve of Base Models and Voting Classifier.png", bbox_inches = 'tight', dpi = 1200)
plt.close()

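ROC curve and AUC for soft voting

# Get the soft-voting classifier's predicted probabilities
# and keep the probability of the positive class
y_proba_soft = voting_soft.predict_proba(X_test)[:, 1]

# Compute the ROC curve and AUC for the soft-voting classifier
fpr_soft, tpr_soft, _ = roc_curve(y_test, y_proba_soft)
auc_score_soft = roc_auc_score(y_test, y_proba_soft)

# Plot the ROC curve
plt.figure(figsize = (8, 6))
plt.plot(fpr_soft, tpr_soft, label = f"Soft Voting (AUC = {auc_score_soft:.2f})")

# Add the random-guessing baseline
plt.plot([0, 1], [0, 1], 'k--', label = "Random Guessing")

# Axis labels, title, legend, and grid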
plt.xlabel('False Positive Rate (FPR)', fontsize = 18)
plt.ylabel('True Positive Rate (TPR)', fontsize = 18)
plt.title('ROC Curve of Soft Voting Classifier', fontsize = 18)
plt.legend(loc='lower right')
plt.grid()
plt.savefig("ROC Curve of Soft Voting Classifier.png", bbox_inches = 'tight', dpi = 1200)
plt.show()

Complete code

# -*- coding: utf-8 -*-

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore", message = ".*does not have valid feature names.*")
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
             
plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False

if __name__ == '__main__':
    
    import os
    wkdir = 'C:/Users/Administrator/Desktop'
    os.chdir(wkdir)
    
    # Load the data
    path = 'Z:/TData/big-data/sad41d8cd/251030_voting_classifier_ensemble_learning.xlsx'
    df = pd.read_excel(path)
    
    if True:
        df.head()
        df.columns
        df.info()
        
        
    # Separate the features from the target variable
    X = df.drop(['class'], axis = 1)
    y = df['class']

    # Split into training and test sets (stratified on the class label)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = df['class'])
    
    # Define the models
    if True:
        
        rf_clf = RandomForestClassifier(random_state = 42)
        xgb_clf = XGBClassifier(use_label_encoder = False, eval_metric = 'logloss', random_state = 42)
        lgbm_clf = LGBMClassifier(random_state = 42, verbose = -1)
        gbm_clf = GradientBoostingClassifier(random_state = 42)
        adaboost_clf = AdaBoostClassifier(random_state = 42, algorithm = 'SAMME')
        catboost_clf = CatBoostClassifier(verbose = 0, random_state = 42)
    
    
    # Hard voting
    if True:

        # Create the hard-voting classifier
        voting_hard = VotingClassifier(
            estimators = [
                ('RandomForest', rf_clf),
                ('XGBoost', xgb_clf),
                ('LightGBM', lgbm_clf),
                ('GradientBoosting', gbm_clf),
                ('AdaBoost', adaboost_clf),
                ('CatBoost', catboost_clf)
            ],
            voting = 'hard'
        )
        
        # Train the hard-voting classifier
        voting_hard.fit(X_train, y_train)

    # Soft voting
    if True:

        # Create the soft-voting classifier
        voting_soft = VotingClassifier(
            estimators = [
                ('RandomForest', rf_clf),
                ('XGBoost', xgb_clf),
                ('LightGBM', lgbm_clf),
                ('GradientBoosting', gbm_clf),
                ('AdaBoost', adaboost_clf),
                ('CatBoost', catboost_clf)
            ],
            voting = 'soft',
            weights = [1, 1, 1, 1, 1, 1]
        )
        
        # Train the soft-voting classifier
        voting_soft.fit(X_train, y_train)
        
    # Predictions with hard and soft voting
    if True:

        # Predict the test set with hard voting
        y_pred_hard = voting_hard.predict(X_test)

        # Print the evaluation metrics for the hard-voting model
        print("Classification Report for Hard Voting:")
        print(classification_report(y_test, y_pred_hard))

        # Predict the test set with soft voting
        y_pred_soft = voting_soft.predict(X_test)

        # Print the evaluation metrics for the soft-voting model
        print("Classification Report for Soft Voting:")
        print(classification_report(y_test, y_pred_soft))
        
    # Confusion matrices
    if True:

        # Confusion matrix for hard voting
        conf_matrix_hard = confusion_matrix(y_test, y_pred_hard)

        # Confusion matrix for soft voting
        conf_matrix_soft = confusion_matrix(y_test, y_pred_soft)
        fig, axes = plt.subplots(1, 2, figsize = (16, 6), dpi=1200)

        # Heatmap of the hard-voting confusion matrix
        sns.heatmap(conf_matrix_hard, annot = True, annot_kws = {'size': 15}, fmt = 'd', cmap = 'YlGnBu', cbar_kws = {'shrink': 0.75}, ax = axes[0])
        axes[0].set_title('Confusion Matrix (Hard Voting)', fontsize = 15)
        axes[0].set_xlabel('Predicted Label', fontsize = 15)
        axes[0].set_ylabel('True Label', fontsize = 15)
           
        # Heatmap of the soft-voting confusion matrix
        sns.heatmap(conf_matrix_soft, annot = True, annot_kws = {'size': 15}, fmt = 'd', cmap = 'YlGnBu', cbar_kws = {'shrink': 0.75}, ax = axes[1])
        axes[1].set_title('Confusion Matrix (Soft Voting)', fontsize = 15)
        axes[1].set_xlabel('Predicted Label', fontsize = 15)
        axes[1].set_ylabel('True Label', fontsize = 15)
        plt.tight_layout()
        plt.savefig("混淆矩阵_硬投票_软投票.png", bbox_inches = 'tight')
        plt.close()
        
    # ROC curves and AUC for all models
    if True:

        # Map each display name to its base model for the ROC comparison
        models = {
            'RandomForest': rf_clf,
            'XGBoost': xgb_clf,
            'LightGBM': lgbm_clf,
            'GradientBoosting': gbm_clf,
            'AdaBoost': adaboost_clf,
            'CatBoost': catboost_clf
        }
        
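        # Plot the ROC curves
        plt.figure(figsize = (10, 8))
        for name, model in models.items():
            # Fit the base model and get the predicted probability of the positive class
            y_proba = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
            # Compute the ROC curve and AUC
            fpr, tpr, _ = roc_curve(y_test, y_proba)
            auc_score = roc_auc_score(y_test, y_proba)
            # Draw the ROC curve
            plt.plot(fpr, tpr, label=f"{name} (AUC = {auc_score:.2f})")

        # Add the ROC curve for the hard-voting classifier
        voting_hard.fit(X_train, y_train)
        y_pred_hard = voting_hard.predict(X_test)

        # With voting = 'hard', transform() returns one column of predicted labels per
        # base model, not probabilities, so column 1 alone would be XGBoost's labels.
        # Use the share of models voting for class 1 as the ranking score instead.
        y_score_hard = voting_hard.transform(X_test).mean(axis = 1)
        fpr_hard, tpr_hard, _ = roc_curve(y_test, y_score_hard)
        auc_score_hard = roc_auc_score(y_test, y_score_hard)

        plt.plot(fpr_hard, tpr_hard, label = f"Hard Voting (AUC = {auc_score_hard:.2f})", linestyle = '--')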
        plt.plot([0, 1], [0, 1], 'k--', label = "Random Guessing")
        plt.xlabel('False Positive Rate (FPR)', fontsize = 18)
        plt.ylabel('True Positive Rate (TPR)', fontsize = 18)
        plt.title('ROC Curve of Base Models and Voting Classifier', fontsize = 18)
        plt.legend(loc = 'lower right')
        plt.grid()
        plt.savefig("ROC Curve of Base Models and Voting Classifier.png", bbox_inches = 'tight', dpi = 1200)
        plt.close()
          
    # ROC curve and AUC for soft voting
    if True:

        # Get the soft-voting classifier's predicted probabilities
        # and keep the probability of the positive class
        y_proba_soft = voting_soft.predict_proba(X_test)[:, 1]

        # Compute the ROC curve and AUC for the soft-voting classifier
        fpr_soft, tpr_soft, _ = roc_curve(y_test, y_proba_soft)
        auc_score_soft = roc_auc_score(y_test, y_proba_soft)

        # Plot the ROC curve
        plt.figure(figsize = (8, 6))
        plt.plot(fpr_soft, tpr_soft, label = f"Soft Voting (AUC = {auc_score_soft:.2f})")

        # Add the random-guessing baseline
        plt.plot([0, 1], [0, 1], 'k--', label = "Random Guessing")

        # Axis labels, title, legend, and grid
        plt.xlabel('False Positive Rate (FPR)', fontsize = 18)
        plt.ylabel('True Positive Rate (TPR)', fontsize = 18)
        plt.title('ROC Curve of Soft Voting Classifier', fontsize = 18)
        plt.legend(loc='lower right')
        plt.grid()
        plt.savefig("ROC Curve of Soft Voting Classifier.png", bbox_inches = 'tight', dpi = 1200)
        plt.show()

Title: Combining Machine Learning Algorithms - Building a Multi-Model Voting Ensemble for Classification with VotingClassifier
Author: Xing Abao
Created at : 2025-10-30 08:33:27
Updated at : 2025-10-30 09:20:49
Link: https://bioinformatics.vip/2025/10/30/sad41d8cd/251030_voting_classifier_ensemble_learning/
License: This work is licensed under CC BY-NC-SA 4.0.
