Machine Learning Frameworks: Scikit-learn

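Scikit-learn is the core classical machine learning library in the Python ecosystem, built on NumPy, SciPy, and matplotlib. Its central design idea is a unified API: every estimator follows the fit/predict/transform pattern, which keeps the learning curve gentle. Starting from that unified interface, this article walks through the full workflow: preprocessing, feature engineering, model selection, Pipelines, and evaluation metrics.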

1. Unified API Philosophy

Scikit-learn's core design principle: all objects share one consistent interface.

1.1 Three Object Types

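# 1. Estimator: any object that implements fit()
estimator.fit(X, y=None)                   # learn parameters from the data

# 2. Transformer: an estimator that also implements transform()
transformer.fit(X, y=None)                 # learn the transformation parameters
X_transformed = transformer.transform(X)   # apply the transformation
# fit_transform() combines both steps and can be more efficient
X_transformed = transformer.fit_transform(X)

# 3. Predictor: an estimator that also implements predict()
predictor.fit(X, y)
y_pred = predictor.predict(X_new)
# Depending on the model, also predict_proba, predict_log_proba, decision_function

# Hyperparameters are all accepted in __init__() and stored as public attributes
# - managed uniformly through set_params() / get_params()
# - GridSearchCV and Pipeline rely on this mechanism

# Checking what kind of estimator an object is:
from sklearn.base import is_classifier, is_regressor
is_classifier(predictor)   # True for classifiers
is_regressor(predictor)    # True for regressors
# Older releases expose the kind as the _estimator_type attribute
# ('classifier', 'regressor', 'clusterer'); recent releases use estimator tags.
# check_estimator is a different tool: it tests whether a custom estimator
# follows scikit-learn's API conventions.
from sklearn.utils.estimator_checks import check_estimator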
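The hyperparameter mechanism described above can be tried directly. The following is a minimal sketch, assuming a recent scikit-learn release; the choice of LogisticRegression and Ridge and the variable names are illustrative only.

```python
from sklearn.base import clone, is_classifier, is_regressor
from sklearn.linear_model import LogisticRegression, Ridge

clf = LogisticRegression(C=1.0, max_iter=200)

# Hyperparameters passed to __init__ are readable through get_params()
c_before = clf.get_params()["C"]

# set_params() updates them in place and returns the estimator itself
clf.set_params(C=0.1)
c_after = clf.get_params()["C"]

# clone() builds a new, unfitted estimator with the same hyperparameters;
# this is what GridSearchCV does for every candidate and fold
clf_copy = clone(clf)

# Public helpers report the estimator kind without touching private attributes
kinds = (is_classifier(clf), is_regressor(clf), is_regressor(Ridge()))
print(c_before, c_after, clf_copy.get_params()["C"], kinds)
```

Because every estimator exposes its settings this way, tools such as GridSearchCV can configure arbitrary models without knowing anything about their internals.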

1.2 A Highly Consistent API

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Three very different algorithms, used in exactly the same way
# (X_train, X_test, y_train, y_test are assumed to exist already)
for Model in [LogisticRegression, RandomForestClassifier, SVC]:
    model = Model()                        # instantiate (hyperparameters go in __init__)
    model.fit(X_train, y_train)            # train
    y_pred = model.predict(X_test)         # predict
    score = model.score(X_test, y_test)    # evaluate (accuracy for classifiers)

2. Data Preprocessing

2.1 Standardization and Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import RobustScaler, QuantileTransformer
from sklearn.preprocessing import PowerTransformer, MaxAbsScaler

# StandardScaler: z-score standardization, (x - μ) / σ
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Learned statistics: scaler.mean_, scaler.scale_ (i.e. σ)
# Suits roughly Gaussian features; commonly used before PCA, SVM,
# linear regression, and logistic regression

# MinMaxScaler: rescale to the [0, 1] interval
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
# Formula: X_scaled = (X - X.min) / (X.max - X.min) * (max - min) + min,
# where (min, max) = feature_range
# Suits neural networks and distance-based algorithms; sensitive to outliers
# feature_range can be set to other intervals such as (-1, 1)

# RobustScaler: uses quantiles, so it is robust to outliers
scaler = RobustScaler(quantile_range=(25.0, 75.0))
X_scaled = scaler.fit_transform(X)
# Formula: X_scaled = (X - median) / IQR
# Suits data that contains outliers

# MaxAbsScaler: rescale to [-1, 1] while keeping zeros at zero (good for sparse data)
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

# QuantileTransformer: map to a uniform or normal distribution
qt = QuantileTransformer(n_quantiles=1000, output_distribution='normal')
# output_distribution='uniform' maps to the uniform distribution on [0, 1]
# output_distribution='normal' maps to the standard normal distribution N(0, 1)
# n_quantiles: number of quantiles used to discretize the CDF;
# it is capped at the number of samples
X_trans = qt.fit_transform(X)

# PowerTransformer: make the data more Gaussian-like
# method='yeo-johnson' (default, supports negative values)
pt_yeo = PowerTransformer(method='yeo-johnson')
# method='box-cox' (strictly positive values only)
pt_box = PowerTransformer(method='box-cox')
X_trans = pt_yeo.fit_transform(X)
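To make the "sensitive to outliers" versus "robust to outliers" remarks concrete, here is a small sketch on made-up data with one extreme value. The numbers are invented for illustration and are not from the original article.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with a single extreme outlier (made-up values)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_std = StandardScaler().fit_transform(X)
X_mm = MinMaxScaler().fit_transform(X)
X_rob = RobustScaler().fit_transform(X)

# MinMaxScaler: the outlier defines the maximum, so the four inliers
# are squeezed into a narrow band close to 0
inlier_spread_minmax = X_mm[:4].max() - X_mm[:4].min()

# RobustScaler: centers on the median (3.0) and divides by the IQR (4 - 2 = 2),
# so the inliers keep a usable spread
inlier_spread_robust = X_rob[:4].max() - X_rob[:4].min()

print(f"min-max inlier spread: {inlier_spread_minmax:.4f}")
print(f"robust  inlier spread: {inlier_spread_robust:.4f}")
```

The outlier is still present after RobustScaler; it simply no longer dictates the scale applied to the rest of the data.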

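2.2 Encoding Categorical Features

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

# OneHotEncoder: one-hot encoding
ohe = OneHotEncoder(
    sparse_output=True,       # default: return a sparse matrix (named `sparse` before 1.2)
    handle_unknown='error',   # 'ignore' encodes unseen categories as all zeros
    drop=None,                # 'first': drop the first category (avoids multicollinearity)
    min_frequency=0.01,       # float = fraction, int = count; rarer categories become "infrequent"
    max_categories=None,      # cap on the number of output categories per feature
)
X_encoded = ohe.fit_transform(X)
# Result: one column per category → (n_samples, sum(n_categories))
# Inspect the categories: ohe.categories_

# OrdinalEncoder: integer encoding (category → integer)
oe = OrdinalEncoder(
    handle_unknown='use_encoded_value', unknown_value=-1,
    encoded_missing_value=np.nan,   # value used for missing categories (default: np.nan)
)
X_ord = oe.fit_transform(X)
# Suits tree-based models, which can split on the integer codes

# LabelEncoder: encodes a single column of labels (for the target y only)
le = LabelEncoder()
y_encoded = le.fit_transform(y)
# Note: do not use LabelEncoder on features X (use OrdinalEncoder instead)
# Inspect the mapping: le.classes_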

2.3 Discretization and Feature Generation

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures
from sklearn.preprocessing import FunctionTransformer, Binarizer

# KBinsDiscretizer: split continuous features into discrete intervals (binning)
kbd = KBinsDiscretizer(
    n_bins=5,
    encode='onehot',        # 'onehot', 'onehot-dense', 'ordinal'
    strategy='quantile',    # 'uniform' (equal width), 'quantile' (equal frequency), 'kmeans'
    subsample=200000,       # maximum number of samples used to compute the bin edges
    random_state=42,
)
X_binned = kbd.fit_transform(X)
# The bin_edges_ attribute stores the bin boundaries

# PolynomialFeatures: generate polynomial and interaction features
poly = PolynomialFeatures(
    degree=2,
    interaction_only=False,   # True: interaction terms only (no pure powers such as x1^2, x2^2)
    include_bias=True,        # whether to include the bias column (all ones)
)
X_poly = poly.fit_transform(X)
# [a, b] → [1, a, b, a^2, ab, b^2]

# FunctionTransformer: wrap a custom function
def log_transform(X):
    return np.log1p(X)   # log(1 + X), which handles zeros

ft = FunctionTransformer(
    func=log_transform,
    inverse_func=np.expm1,   # enables inverse_transform
    validate=False,          # True: validate the input array first
)
X_trans = ft.fit_transform(X)

# Binarizer: binarize by threshold
binarizer = Binarizer(threshold=0.5)
X_binary = binarizer.fit_transform(X)

3. Missing Value Imputation

3.1 Basic Imputation

from sklearn.impute import SimpleImputer

# Mean / median / mode / constant imputation
imp_mean = SimpleImputer(strategy='mean')
imp_median = SimpleImputer(strategy='median')
imp_mode = SimpleImputer(strategy='most_frequent')   # mode
imp_constant = SimpleImputer(strategy='constant', fill_value=-999)

X_imputed = imp_mean.fit_transform(X)
# Attribute: imp_mean.statistics_ (the fill value for each column)

# Adding a missing-value indicator
from sklearn.impute import MissingIndicator
indicator = MissingIndicator(features='missing-only', sparse=False)
missing_mask = indicator.fit_transform(X)
# Can be placed alongside the imputer in a Pipeline so the model sees the missingness pattern
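One simple way to combine imputation with a missingness indicator is the `add_indicator=True` option of `SimpleImputer`, which appends the indicator columns to the imputed output. The sketch below uses a made-up 3×2 matrix; it is one possible arrangement, not the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])

# add_indicator=True stacks a MissingIndicator onto the imputed columns
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp.fit_transform(X)

# Output columns: imputed col 0, imputed col 1,
# then one 0/1 indicator per column that had missing values during fit
print(imp.statistics_)   # per-column fill values (the column means)
print(X_out)
```

The downstream model then receives both the filled values and a flag telling it which entries were originally missing.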

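3.2 Advanced Imputation

import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# KNNImputer: fill with the mean of the k nearest neighbors
knn_imp = KNNImputer(
    n_neighbors=5,
    weights='uniform',        # 'uniform' or 'distance' (distance-weighted mean)
    metric='nan_euclidean',   # Euclidean distance that ignores NaN coordinates
    add_indicator=False,      # whether to append a missing-value indicator
)
# With weights='uniform': x_missing = (1/k) * sum(x_neighbors)
X_knn = knn_imp.fit_transform(X)

# IterativeImputer: multivariate iterative imputation (MICE-style)
# Models each feature with missing values as a function of the other features
iter_imp = IterativeImputer(
    estimator=None,                 # default: BayesianRidge
    missing_values=np.nan,
    max_iter=10,                    # maximum number of imputation rounds
    n_nearest_features=None,        # number of other features used per feature (None = all)
    initial_strategy='mean',        # strategy for the initial fill
    imputation_order='ascending',   # or 'descending', 'roman', 'arabic', 'random'
    random_state=42,
    add_indicator=False,
)
# Procedure:
# 1. Fill all missing values with the initial strategy
# 2. For each feature that has missing values:
#    a. temporarily discard its filled-in values
#    b. predict the feature from the other features
#    c. update the filled-in values
# 3. Repeat for up to max_iter rounds
X_iter = iter_imp.fit_transform(X)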
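The uniform-weight formula for KNNImputer can be checked on a small matrix. The sketch below uses a 4×3 toy array with `n_neighbors=2`; the data is illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

knn_imp = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = knn_imp.fit_transform(X)

# Row 0, column 2: the two nearest rows (by nan_euclidean distance) that have
# a value in column 2 are rows 1 and 2, so the fill is (3 + 5) / 2 = 4
# Row 2, column 0: the two nearest donors are rows 1 and 3, so (3 + 8) / 2 = 5.5
print(X_filled)
```

Distances are computed only over coordinates that are present in both rows, which is why rows with different missing columns can still act as neighbors.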

4. Feature Selection

4.1 Filter Methods

from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.feature_selection import chi2, f_regression, mutual_info_regression

# VarianceThreshold: remove low-variance features
vt = VarianceThreshold(threshold=0.01)   # features with variance below 0.01 are removed
X_selected = vt.fit_transform(X)
# For a Bernoulli (boolean) feature, Var = p * (1 - p); e.g. threshold = 0.8 * 0.2 = 0.16
# removes features that take the same value in more than 80% of the samples

# SelectKBest: keep the k highest-scoring features
selector = SelectKBest(
    # Classification: f_classif (ANOVA F-value), mutual_info_classif, chi2
    score_func=f_classif,
    k=10
)
X_selected = selector.fit_transform(X, y)
# Retrieve scores and p-values
scores = selector.scores_     # score of each feature
pvalues = selector.pvalues_   # p-value of each feature (None if score_func returns scores only)

# Scoring functions for regression tasks: f_regression, mutual_info_regression

# SelectPercentile: keep a percentage of the features
from sklearn.feature_selection import SelectPercentile
sp = SelectPercentile(score_func=f_classif, percentile=80)   # keep the top 80% by score
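The Bernoulli rule of thumb for VarianceThreshold can be verified on a small boolean matrix. In this sketch the data is made up; the first column is 0 in five of six rows, so its variance falls below 0.8 × 0.2.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])

p = 0.8
vt = VarianceThreshold(threshold=p * (1 - p))
X_sel = vt.fit_transform(X)

kept = vt.get_support()     # boolean mask of retained columns
variances = vt.variances_   # per-column variance, p_hat * (1 - p_hat) for 0/1 data
print(variances, kept, X_sel.shape)
```

Column 0 has variance (1/6)(5/6) ≈ 0.139 and is dropped; the other two columns have variances of about 0.222 and 0.25 and are kept.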

4.2 Wrapper Methods

from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# RFE (Recursive Feature Elimination)
model = LogisticRegression(max_iter=1000)
rfe = RFE(
    estimator=model,
    n_features_to_select=10,   # number of features to keep
    step=1,                    # number of features removed per iteration
    verbose=0,
)
X_selected = rfe.fit_transform(X, y)
# Attributes:
rfe.support_       # boolean mask of the retained features
rfe.ranking_       # ranking (1 = selected)
rfe.n_features_    # number of features actually retained

# RFECV: RFE with cross-validation (chooses the number of features automatically)
rfecv = RFECV(
    estimator=RandomForestClassifier(),
    step=1,
    min_features_to_select=5,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
)
X_selected = rfecv.fit_transform(X, y)
print(f'Optimal number of features: {rfecv.n_features_}')

4.3 Embedded Methods

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# SelectFromModel: use the feature importances a model provides
# L1 regularization
sfm_l1 = SelectFromModel(
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
)
X_selected = sfm_l1.fit_transform(X, y)
# The threshold that was applied: sfm_l1.threshold_
# The coefficients: sfm_l1.estimator_.coef_

# Based on feature importances (tree models)
sfm_tree = SelectFromModel(
    GradientBoostingClassifier(n_estimators=100),
    threshold='median',   # 'mean', 'median', a float, or a scaled string such as '1.25*mean'
    max_features=20,      # upper limit on the number of selected features
)
X_selected = sfm_tree.fit_transform(X, y)

# LassoCV: feature selection with a cross-validated regularization strength
# (for regression targets; X is assumed to be a NumPy array)
lasso = LassoCV(cv=5, random_state=42).fit(X, y)
selected_features = np.where(lasso.coef_ != 0)[0]
X_selected = X[:, selected_features]

5. Model Selection and Hyperparameter Tuning

5.1 Data Splitting and Cross-Validation

from sklearn.model_selection import train_test_split
from sklearn.model_selection import (StratifiedKFold, GroupKFold,
                                     TimeSeriesSplit, RepeatedStratifiedKFold)
from sklearn.model_selection import cross_validate, cross_val_score

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,        # preserve class proportions (important for classification)
    shuffle=True,
    random_state=42
)

# StratifiedKFold: K-fold cross-validation that keeps class proportions in every fold
# (X and y are assumed to be NumPy arrays; use .iloc for pandas objects)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_fold_train, X_fold_val = X[train_idx], X[val_idx]
    y_fold_train, y_fold_val = y[train_idx], y[val_idx]
    # train and evaluate here

# GroupKFold: a group never appears in both the training and the validation split
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=patient_id):
    # groups could be patient IDs: all rows of one patient stay on the same side
    ...

# TimeSeriesSplit: cross-validation for time series
tss = TimeSeriesSplit(n_splits=5, max_train_size=None)
# Fold 1: train=[0], test=[1]
# Fold 2: train=[0,1], test=[2]
# ...
# The test indices always come after the training indices (no look-ahead leakage)

# cross_validate: cross-validation with several metrics
scores = cross_validate(
    model, X, y,
    cv=5,
    scoring=['accuracy', 'f1_macro', 'roc_auc_ovr'],
    return_train_score=True,
    n_jobs=-1,
    verbose=0,
)
# Returns a dict with fit_time, score_time, test_accuracy, train_accuracy, and so on

# RepeatedStratifiedKFold: repeated stratified K-fold
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
# Runs 5-fold cross-validation 3 times (15 evaluations in total)
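The expanding-window behavior of TimeSeriesSplit is easiest to see by printing the indices. The sketch below uses six dummy samples so that each test fold holds exactly one point; the data itself is a placeholder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)   # six time-ordered samples (placeholder data)

tss = TimeSeriesSplit(n_splits=5)
folds = [(train.tolist(), test.tolist()) for train, test in tss.split(X)]

for i, (train, test) in enumerate(folds, start=1):
    print(f"fold {i}: train={train}, test={test}")

# In every fold, all training indices precede all test indices
ordered = all(max(train) < min(test) for train, test in folds)
print(ordered)
```

Unlike shuffled K-fold, no fold ever trains on observations that come after the ones it is evaluated on.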

5.2 Hyperparameter Search

import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# The halving searches are experimental and must be enabled explicitly first
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier

# GridSearchCV: exhaustive search
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
gs = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',         # 'roc_auc', 'f1_macro', or a list
    n_jobs=-1,                  # run in parallel
    verbose=1,
    refit=True,                 # refit on the whole training set with the best parameters
    return_train_score=True,
)
gs.fit(X_train, y_train)
print(gs.best_params_)
print(gs.best_score_)
print(gs.best_estimator_)       # available when refit=True
print(gs.cv_results_.keys())    # all detailed results

# RandomizedSearchCV: random search (usually cheaper for large search spaces)
param_distributions = {
    'n_estimators': [100, 200, 300, 500, 1000],
    'max_depth': [None, 5, 10, 15, 20, 30],
    'min_samples_split': np.arange(2, 21),
    'max_features': ['sqrt', 'log2', None] + list(np.arange(0.1, 1.1, 0.1)),
}
rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,                 # sample 100 parameter settings (vs. every combination in a grid)
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1,
)
rs.fit(X_train, y_train)

# HalvingGridSearchCV: successive halving (faster)
# Idea: score all candidates on a small budget first, then give more resources
# only to the candidates that did well
hgs = HalvingGridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    factor=3,                   # each round keeps roughly the best 1/3 of the candidates
    min_resources='exhaust',    # or 'smallest', or a number
    scoring='accuracy',
    n_jobs=-1,
)
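A runnable miniature of the successive-halving search is sketched below on synthetic data. The dataset, the small grid, and the forest size are assumptions chosen to keep the run short; `n_candidates_` and `n_resources_` show how many candidates survived each round and how many samples each round used.

```python
# The halving searches are experimental and must be enabled before import
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingGridSearchCV

# Synthetic data, for illustration only
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

param_grid = {"max_depth": [2, 4, None], "min_samples_split": [2, 10]}  # 6 candidates

search = HalvingGridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
    factor=3,          # keep about one third of the candidates per round
    random_state=0,
)
search.fit(X, y)

print("candidates per round:", search.n_candidates_)
print("samples per round:   ", search.n_resources_)
print("best params:         ", search.best_params_)
```

By default the resource being increased is the number of training samples, so early rounds are cheap and only the surviving candidates are evaluated on the full data.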

5.3 Learning Curves and Validation Curves

import numpy as np
from sklearn.model_selection import learning_curve, validation_curve

# Learning curve: how the training-set size affects performance
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    shuffle=True,
    random_state=42,
)

# Validation curve: how a single hyperparameter affects performance
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name='max_depth',
    param_range=[1, 2, 5, 10, 20, 50],
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
)
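The arrays returned by `learning_curve` have one row per training size and one column per fold, so they are usually averaged across folds before being read or plotted. A minimal sketch on synthetic data (the dataset and model are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Shape is (n_train_sizes, n_folds): average over the fold axis
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

# A large, persistent gap suggests overfitting; two low, close curves suggest underfitting
gap = train_mean - val_mean
for n, tr, va, sd in zip(train_sizes, train_mean, val_mean, val_std):
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f} ± {sd:.3f}")
```

The same aggregation applies to `validation_curve`, with the rows indexed by hyperparameter value instead of training size.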

6. Pipeline Workflows

6.1 Basic Pipeline

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Pipeline: chain several steps together
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# make_pipeline: names the steps automatically (lowercased class names),
# here 'standardscaler' and 'logisticregression'
auto_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Usage
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)
score = pipe.score(X_test, y_test)

# Accessing the steps of a Pipeline
pipe.named_steps['scaler']            # StandardScaler
pipe.named_steps['classifier']        # LogisticRegression
pipe.set_params(classifier__C=0.1)    # a double underscore reaches nested parameters

# Advantages of a Pipeline:
# 1. Prevents data leakage (transformers are fit on the training data only)
# 2. Simpler code
# 3. Convenient hyperparameter search
# 4. Easier model persistence (one object to save and load)

# Combining GridSearchCV with a Pipeline
param_grid = {
    'scaler__with_mean': [True, False],
    'classifier__C': [0.01, 0.1, 1.0, 10.0],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear'],   # the default lbfgs solver does not support 'l1'
}
gs = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
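The leakage claim can be checked directly: after fitting, the scaler inside the pipeline holds statistics computed from the training split only. The sketch below uses synthetic data and step names chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# The scaler's mean was learned from X_train only, never from X_test
scaler_mean = pipe.named_steps["scaler"].mean_
uses_train_only = np.allclose(scaler_mean, X_train.mean(axis=0))

# predict() applies transform() (not fit_transform) to new data, then predicts
manual_pred = pipe.named_steps["classifier"].predict(
    pipe.named_steps["scaler"].transform(X_test)
)
same_as_pipeline = np.array_equal(manual_pred, pipe.predict(X_test))
print(uses_train_only, same_as_pipeline)
```

When the pipeline is passed to `cross_val_score` or `GridSearchCV`, this fit-on-train, transform-on-validation split is repeated inside every fold.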

6.2 ColumnTransformer: Handling Heterogeneous Data

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Apply different preprocessing to different columns
numeric_features = ['age', 'fare', 'sibsp']
categorical_features = ['sex', 'embarked', 'pclass']
passthrough_features = ['survived_flag']   # columns passed through unchanged

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False),
     categorical_features),
    ('pass', 'passthrough', passthrough_features),
    # 'drop' can be used in the same position to discard specific columns
], remainder='drop')   # drop columns not listed above (or use 'passthrough')

# Combine with a Pipeline
full_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])
full_pipe.fit(X_train, y_train)

# make_column_selector: select columns by dtype
from sklearn.compose import make_column_selector

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), make_column_selector(dtype_include=np.number)),
    ('cat', OneHotEncoder(), make_column_selector(dtype_include=object)),
])

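6.3 FeatureUnion: Parallel Feature Extraction

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Run several feature pipelines side by side and concatenate their outputs
union = FeatureUnion([
    ('pca_features', Pipeline([
        ('select', SelectKBest(k=50)),
        ('pca', PCA(n_components=10)),
    ])),
    ('poly_features', PolynomialFeatures(degree=2, include_bias=False)),
])

# Plug it into the main Pipeline
full_pipe = Pipeline([
    ('features', union),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier()),
])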

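7. Evaluation Metrics

7.1 Classification Metrics

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, precision_recall_fscore_support,
                             classification_report, confusion_matrix,
                             roc_auc_score, roc_curve, log_loss)

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1]

# Basic metrics
accuracy_score(y_true, y_pred)   # accuracy
# Multiclass: precision_score(y_true, y_pred, average='macro')

# precision_recall_fscore_support computes several metrics in one call
# (support is returned as None whenever average is not None)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred,
    average='macro',   # None returns one value per class
    # 'micro': pool TP/FP/FN over all classes, then compute
    # 'macro': unweighted mean of per-class scores (every class counts equally)
    # 'weighted': mean of per-class scores weighted by support (reflects class sizes)
    # 'binary': binary classification only
    zero_division=0    # value returned on division by zero
)

# Classification report (text format)
print(classification_report(y_true, y_pred, target_names=['negative', 'positive']))
# Layout of the report (illustrative numbers; they do not correspond
# to the five samples above):
#               precision    recall  f1-score   support
#     negative       0.43      0.75      0.55        12
#     positive       0.79      0.52      0.63        21
#     accuracy                           0.61        33
#    macro avg       0.61      0.64      0.59        33
# weighted avg       0.66      0.61      0.60        33

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
# array([[TN, FP],
#        [FN, TP]])

# ROC-AUC (needs probabilities or scores, not hard labels)
# model, X_test and y_test come from a fitted classifier and its held-out split
y_proba = model.predict_proba(X_test)
# Binary: pass the probability of the positive class
auc = roc_auc_score(y_test, y_proba[:, 1])
# Multiclass: pass the full probability matrix
auc = roc_auc_score(y_test, y_proba, multi_class='ovr')
# multi_class='ovo': one-vs-one; 'ovr': one-vs-rest

# ROC curve (binary)
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])

# Log loss (negative log-likelihood)
ll = log_loss(y_test, y_proba)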
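For the five-sample `y_true` / `y_pred` pair used above, the confusion-matrix cells and the metrics derived from them can be worked out by hand and compared with scikit-learn's functions. A short sketch:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1]

# For binary labels {0, 1}, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)        # 2 0 1 2
print(round(precision, 3), round(recall, 3), round(f1, 3))   # 1.0 0.667 0.8
```

The hand-computed values agree with `precision_score`, `recall_score`, and `f1_score` under their default `average='binary'` setting.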

7.2 Regression Metrics

import numpy as np
from sklearn.metrics import (r2_score, mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, explained_variance_score,
                             max_error)

# R² (coefficient of determination)
r2 = r2_score(y_true, y_pred)
# R² = 1 - SS_res / SS_tot
# Can be negative (the model is worse than predicting the mean)

# MSE / RMSE
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
# scikit-learn >= 1.4 also provides root_mean_squared_error(y_true, y_pred);
# the older mean_squared_error(..., squared=False) form is deprecated

# MAE
mae = mean_absolute_error(y_true, y_pred)

# MAPE (Mean Absolute Percentage Error)
mape = mean_absolute_percentage_error(y_true, y_pred) * 100   # convert to percent
# Note: MAPE blows up when y_true is close to 0

# Max Error
max_err = max_error(y_true, y_pred)

# Explained variance
evs = explained_variance_score(y_true, y_pred)
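The R² formula and the "can be negative" remark can be checked by computing the sums of squares by hand. The four target values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

# RMSE via the square root of the MSE (works on every scikit-learn version)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# A deliberately bad prediction (the targets in reverse order) has a larger
# residual sum of squares than the mean predictor, so R² drops below zero
r2_bad = r2_score(y_true, y_true[::-1])

print(f"R² manual={r2_manual:.4f}  sklearn={r2_score(y_true, y_pred):.4f}")
print(f"RMSE={rmse:.4f}  R² of bad prediction={r2_bad:.4f}")
```

A negative R² is therefore not a bug: it simply means that always predicting the mean of `y_true` would have done better.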

7.3 Clustering Metrics

from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             homogeneity_score, completeness_score, v_measure_score,
                             silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# When ground-truth labels are available
ari = adjusted_rand_score(y_true, y_pred)          # in [-0.5, 1]; about 0 for random labelings
ami = adjusted_mutual_info_score(y_true, y_pred)   # at most 1; about 0 for random labelings, can be slightly negative

# When no ground-truth labels are available
silhouette = silhouette_score(X, labels)      # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)          # lower is better
ch = calinski_harabasz_score(X, labels)       # higher is better

8. Complete Workflow Example

An end-to-end flow from a raw CSV file to a persisted model that is ready for deployment:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PowerTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import joblib

# 1. Load the data
df = pd.read_csv('data.csv')

# 2. Identify feature types
numeric_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Remove the target and ID columns from both lists
# (a string-valued target would otherwise end up in categorical_features)
for col in ('target', 'id'):
    if col in numeric_features:
        numeric_features.remove(col)
    if col in categorical_features:
        categorical_features.remove(col)

# 3. Define the preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('power', PowerTransformer(method='yeo-johnson')),
    ('scaler', StandardScaler()),
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False,
                             min_frequency=10)),
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
], remainder='drop')

# 4. Define the model pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold='median'
    )),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced')),
])

# 5. Prepare the data
X = df.drop(columns=['target'])
y = LabelEncoder().fit_transform(df['target'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 6. Hyperparameter search
param_grid = {
    'classifier__C': np.logspace(-3, 2, 10),
    'classifier__penalty': ['l2'],
    'feature_selection__threshold': ['median', 'mean'],
}

gs = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring='roc_auc_ovr',
    n_jobs=-1,
    verbose=1,
    refit=True,
)

gs.fit(X_train, y_train)

# 7. Evaluate
print(f'Best params: {gs.best_params_}')
print(f'Best CV score: {gs.best_score_:.4f}')

y_pred = gs.predict(X_test)
y_proba = gs.predict_proba(X_test)

print(classification_report(y_test, y_pred))
# Multiclass form; for a binary target pass y_proba[:, 1] and omit multi_class
print(f'ROC-AUC: {roc_auc_score(y_test, y_proba, multi_class="ovr"):.4f}')

# 8. Save the model
joblib.dump(gs.best_estimator_, 'model_pipeline.pkl')

# 9. Load and use
loaded_pipe = joblib.load('model_pipeline.pkl')
# new_data: a DataFrame with the same feature columns as X
predictions = loaded_pipe.predict(new_data)

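Scikit-learn's value lies in its highly consistent API design and its well-tested implementations. By combining Pipeline and ColumnTransformer, a data scientist can keep preprocessing consistent inside a single object (exactly the same transforms at training time and at prediction time), which removes the most common source of data leakage. For classical machine learning on tabular data, Scikit-learn is the default first choice. Together with compatible third-party libraries such as XGBoost, LightGBM, and CatBoost, its ecosystem covers the path from exploratory modeling to production deployment.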

Author: Leo·Cheung
Article link: http://tufusi.com/2022/04/15/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E6%A1%86%E6%9E%B6%E7%AF%87-Scikit-learn/
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit ONE·PIECE when reposting.