[Competition Series] Competition Techniques.md

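1. Essential Competition Techniques: The Big Picture

A successful machine learning competition follows a clear pipeline:

Exploratory data analysis (EDA) → data cleaning → feature engineering → model selection → cross-validation → hyperparameter optimization → ensembling → post-processing → submission

This document follows that order and covers the core techniques, common pitfalls, and established best practices for each stage.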

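2. Data Processing

2.1 EDA (Exploratory Data Analysis) checklist

EDA is the most important first step: a thorough EDA typically saves several times its cost in later stages. A standard checklist:

Data overview: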

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Assumes df is the training DataFrame, already loaded (e.g. with pd.read_csv)

# 1. Basic shape and dtypes
print(df.shape)
df.info()  # prints its report directly and returns None
print(df.dtypes.value_counts())

# 2. Missing-value analysis
missing = df.isna().sum()
missing_pct = (missing / len(df)) * 100
print(missing_pct[missing_pct > 0].sort_values(ascending=False))

# 3. Target distribution (use whichever plot matches the task)
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
df['target'].hist(bins=50)  # regression: check skewness and outliers
plt.subplot(1, 2, 2)
df['target'].value_counts().plot.bar()  # classification: check class balance
plt.show()

# 4. Duplicate rows
print(f"Duplicate rows: {df.duplicated().sum()}")

# 5. Summary statistics for numeric columns
print(df.describe().T)

# 6. High-cardinality columns
for col in df.select_dtypes('object'):
    print(f"{col}: {df[col].nunique()} unique values")

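Per-feature deep dive:

Distribution shape (histplot + boxplot)
Relationship with the target (regression: scatter plot / correlation matrix; classification: box plot / violin plot)
Outliers (IQR / Z-score / domain knowledge)
Temporal and periodic patterns (if the data has a time dimension)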
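
The outlier item in the list above names two rules, IQR and Z-score. The following is a minimal sketch of both, assuming a numeric pandas Series; the helper name `outlier_flags`, the default cutoffs (1.5 and 3.0), and the toy data are illustrative choices rather than part of the original checklist.

```python
import pandas as pd

def outlier_flags(s: pd.Series, iqr_k: float = 1.5, z_thresh: float = 3.0) -> pd.DataFrame:
    """Flag outliers in a numeric Series with the IQR rule and the Z-score rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # IQR rule: outside [Q1 - k*IQR, Q3 + k*IQR]
    iqr_flag = (s < q1 - iqr_k * iqr) | (s > q3 + iqr_k * iqr)
    # Z-score rule: more than z_thresh standard deviations from the mean
    z = (s - s.mean()) / s.std(ddof=0)
    z_flag = z.abs() > z_thresh
    return pd.DataFrame({"iqr_outlier": iqr_flag, "z_outlier": z_flag})

# Toy data (illustrative only): 20 ordinary values plus one extreme value
values = pd.Series(list(range(10, 30)) + [500])
flags = outlier_flags(values)
print(flags.sum())
```

The two rules can disagree on heavy-tailed data, because the mean and standard deviation used by the Z-score are themselves pulled by the outliers; domain knowledge remains the final check.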

2.2 Key Python libraries

Library	Purpose	Core API
Pandas	Reading, cleaning, and transforming data	`read_csv`, `merge`, `groupby`, `pivot_table`
NumPy	Numerical computing, matrix operations	`np.where`, `np.percentile`, `np.log`, `np.clip`
Matplotlib	Basic plotting	`plt.plot`, `plt.hist`, `plt.subplots`
Seaborn	High-level statistical plots	`sns.heatmap`, `sns.histplot`, `sns.boxplot`
Plotly	Interactive charts	`px.scatter`, `px.line`

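3. Plotting Techniques

3.1 Quick reference for common visualizations

Goal	Recommended chart	Code example
Numeric distribution	Histogram + density curve	`sns.histplot(df['col'], kde=True)`
Category distribution	Count bar chart	`df['cat'].value_counts().plot.bar()`
Bivariate relationship	Scatter plot / hexbin plot	`sns.scatterplot(x='a', y='b', hue='cat', data=df)`
Correlation matrix	Heatmap	`sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')`
Comparison across categories	Grouped box plot	`sns.boxplot(x='cat', y='num', data=df)`
Time series	Line chart	`df.plot(x='date', y='value')`
Missing-value pattern	missingno matrix / bar chart	`import missingno as msno; msno.matrix(df)`
Geographic data	Scatter map	`df.plot.scatter(x='lon', y='lat')`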

3.2 Global style settings

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('muted')
plt.rcParams['figure.dpi'] = 120
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['font.size'] = 12

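4. Feature Engineering

Feature engineering is often what separates gold-medal solutions from bronze-medal ones.

4.1 A systematic feature engineering workflow

Step 1: Understand the business/physical meaning of every feature
Step 2: Fix data problems (missing values, outliers, inconsistent encodings)
Step 3: Create basic transforms (log, sqrt, binning, one-hot)
Step 4: Create interaction features (add/subtract/multiply/divide, A/B ratio, A*B, concat)
Step 5: Create aggregation features (groupby → agg: mean, std, min, max, count, nunique)
Step 6: Create time-series features (lag, diff, rolling stats, expanding stats)
Step 7: Select features (drop low-variance, highly correlated, and low-importance features)
Step 8: Verify that each new feature improves the cross-validation score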
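
Step 8 can be sketched as a paired comparison: score the model with and without the candidate feature on identical folds. This is a minimal illustration on synthetic data, assuming a Ridge model and R² as the metric; the helper name `cv_gain` and the data are hypothetical, and a real competition would use its own model and metric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

def cv_gain(X_base, X_new, y, n_splits=5, seed=42):
    """Return (baseline, with-feature) mean CV R^2, using the same folds for both."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    base = cross_val_score(Ridge(alpha=1.0), X_base, y, cv=cv, scoring="r2").mean()
    new = cross_val_score(Ridge(alpha=1.0), X_new, y, cv=cv, scoring="r2").mean()
    return base, new

# Synthetic data (illustrative): the target depends on the product a * b
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=500), "b": rng.normal(size=500)})
y = X["a"] * X["b"] + 0.1 * rng.normal(size=500)

# Candidate interaction feature (Step 4)
X_plus = X.assign(a_times_b=X["a"] * X["b"])
base_score, new_score = cv_gain(X, X_plus, y)
print(f"baseline R^2={base_score:.3f}, with feature R^2={new_score:.3f}")
```

Reusing the same fold assignment for both runs removes split-to-split noise from the comparison, which matters when the expected gain is small.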

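4.2 Numeric feature processing

# Log transform (for right-skewed distributions; log1p requires values > -1)
df['log_col'] = np.log1p(df['col'])

# Binning
df['bin'] = pd.cut(df['col'], bins=10, labels=False)

# Standardization (inside CV, fit the scaler on the training fold only)
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['col']])

# Polynomial features (interaction_only=True keeps a*b but drops a^2 and b^2)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True)
poly_features = poly.fit_transform(df[['a', 'b']])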

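4.3 Categorical feature encoding

Method	When to use	Example
One-Hot Encoding	Low cardinality (< 10 unique values)	`pd.get_dummies(df['color'])`
Label Encoding	Ordinal categories, preprocessing for high cardinality	`df['cat'].astype('category').cat.codes`
Target Encoding	High cardinality, related to the target	See the code below
Frequency Encoding	High cardinality where value frequency is informative	`df['cat'].map(df['cat'].value_counts(normalize=True))`
Count Encoding	Like frequency encoding but with raw counts	`df['cat'].map(df['cat'].value_counts())`
CatBoost native handling	Very high cardinality, no manual encoding wanted	Pass the categorical column names to `cat_features`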
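
Frequency and count encoding differ only in normalization. A small sketch, assuming the statistics are computed on the training set and reused on the test set; the toy frames and the choice to encode unseen categories as 0 are assumptions for illustration.

```python
import pandas as pd

train = pd.DataFrame({"cat": ["a", "a", "a", "b", "b", "c"]})
test = pd.DataFrame({"cat": ["a", "c", "d"]})  # "d" never appears in train

counts = train["cat"].value_counts()               # count encoding: raw occurrences
freqs = train["cat"].value_counts(normalize=True)  # frequency encoding: share of rows

train["cat_count"] = train["cat"].map(counts)
train["cat_freq"] = train["cat"].map(freqs)

# Categories unseen in train map to NaN; encoding them as 0 is one possible choice
test["cat_count"] = test["cat"].map(counts).fillna(0).astype(int)
test["cat_freq"] = test["cat"].map(freqs).fillna(0.0)
print(test)
```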

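Target Encoding (mean encoding):

# Important: compute the encoding out-of-fold to avoid target leakage
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(train, test, col, target, folds=5, smoothing=10):
    """
    col: categorical column name
    target: target column name
    smoothing: smoothing strength (pseudo-count that pulls rare categories
               toward the global mean)
    Returns one-column DataFrames ('encoded') aligned with train and test.
    """
    def smoothed_means(frame):
        global_mean = frame[target].mean()
        agg = frame.groupby(col)[target].agg(['count', 'mean'])
        smoothed = (agg['count'] * agg['mean'] + smoothing * global_mean) \
                   / (agg['count'] + smoothing)
        return smoothed, global_mean

    # Fill a positional array so the result does not depend on train's index labels
    encoded = np.full(len(train), np.nan)
    kf = KFold(n_splits=folds, shuffle=True, random_state=42)

    for trn_idx, val_idx in kf.split(train):
        trn, val = train.iloc[trn_idx], train.iloc[val_idx]
        smoothed, global_mean = smoothed_means(trn)
        # Categories unseen in the training folds fall back to the global mean
        encoded[val_idx] = val[col].map(smoothed).astype(float).fillna(global_mean).to_numpy()

    # Encode the test set with statistics from the full training set
    full_smoothed, global_mean = smoothed_means(train)
    test_encoded = test[col].map(full_smoothed).astype(float).fillna(global_mean).to_numpy()

    return (pd.DataFrame({'encoded': encoded}, index=train.index),
            pd.DataFrame({'encoded': test_encoded}, index=test.index))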

4.4 Aggregation features (especially powerful for tabular data)

# Group by 'user_id' and summarize each user's transaction behavior
agg_features = df.groupby('user_id').agg(
    transaction_count=('amount', 'count'),
    amount_mean=('amount', 'mean'),
    amount_std=('amount', 'std'),
    amount_sum=('amount', 'sum'),
    unique_merchants=('merchant_id', 'nunique'),
    last_amount=('amount', 'last')  # assumes rows are sorted chronologically
).reset_index()

# Merge back into the main table
df = df.merge(agg_features, on='user_id', how='left')

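4.5 Time-series features

# Assumes df is sorted by ['store', 'date'] so shift/rolling follow time order

# Lag features
for lag in [1, 3, 7, 14]:
    df[f'sales_lag_{lag}'] = df.groupby('store')['sales'].shift(lag)

# Rolling statistics: shift(1) first so each window only sees past values,
# and use transform so the result stays aligned with df's index
grouped_sales = df.groupby('store')['sales']
for window in [7, 14, 30]:
    df[f'sales_rolling_mean_{window}'] = grouped_sales.transform(
        lambda s: s.shift(1).rolling(window).mean())
    df[f'sales_rolling_std_{window}'] = grouped_sales.transform(
        lambda s: s.shift(1).rolling(window).std())

# Date decomposition (requires df['date'] to have a datetime dtype)
df['dayofweek'] = df['date'].dt.dayofweek
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['month'] = df['date'].dt.month
df['day_of_month'] = df['date'].dt.day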

4.6 Feature selection methods

# 1. Remove low-variance features (the threshold depends on feature scale)
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.01)
X_selected = sel.fit_transform(X)

# 2. Remove highly correlated features (keep one of each pair with |corr| > 0.95)
corr_matrix = df.corr(numeric_only=True).abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]

# 3. Model-based feature importance
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# 4. Null importance (more reliable; popularized on Kaggle)
# Compare importances from training on the real labels with importances from training on shuffled labels
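
The null-importance comparison in item 4 can be sketched as follows. This is one simple variant, assuming a random forest and a score defined as the share of shuffled-label runs whose importance falls below the real one; the function name `null_importance`, the scoring rule, and the synthetic data are illustrative assumptions, not a fixed recipe.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def null_importance(X, y, n_runs=10, seed=42):
    """Compare real feature importances with importances under shuffled labels."""
    rng = np.random.default_rng(seed)

    def fit_importance(target, run_seed):
        rf = RandomForestClassifier(n_estimators=50, random_state=run_seed)
        rf.fit(X, target)
        return rf.feature_importances_

    real = fit_importance(y, seed)
    # Shuffling the labels breaks any real feature-target relationship
    null = np.array([fit_importance(rng.permutation(y), seed + i + 1)
                     for i in range(n_runs)])
    # Share of null runs in which the shuffled-label importance is below the real one
    score = (null < real).mean(axis=0)
    return pd.DataFrame(
        {"real": real, "null_mean": null.mean(axis=0), "score": score},
        index=X.columns,
    ).sort_values("score", ascending=False)

# Synthetic data (illustrative): one informative column and two noise columns
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_1": rng.normal(size=300),
    "noise_2": rng.normal(size=300),
})
y = (X["signal"] > 0).astype(int)
result = null_importance(X, y)
print(result)
```

Features whose real importance does not clearly exceed their null distribution are candidates for removal, subject to the CV check in Step 8.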

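5. Common Models and Model Selection

5.1 Model selection guide for tabular data

Role	Model	When to use	Strengths	Weaknesses
Baseline	Logistic Regression / Ridge	Any tabular problem	Very fast, interpretable	Low accuracy ceiling
Workhorse 1	XGBoost	General tabular data	Strong, large community, GPU acceleration	Many hyperparameters
Workhorse 2	LightGBM	High-dimensional features, large data	Fast training, low memory use, native categorical support	Sensitive to noise
Workhorse 3	CatBoost	High-cardinality categorical features	Strongest categorical handling, good defaults	Slow training
Advanced	TabNet / FT-Transformer	Large-scale tabular data	Learns feature representations	Hard to tune, inconsistent results
Linear reference	ElasticNet, RidgeCV	Sub-tasks, calibration layer	Stable, interpretable	Needs more feature engineering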

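5.2 Model selection for computer vision

Role	Model	Notes
Classification / feature extraction	EfficientNet V2	Currently one of the best accuracy/speed trade-offs
Classification	ConvNeXt	Modernized CNN with accuracy close to ViT
Classification / detection	ViT / Swin Transformer	Transformer architectures; need more data
Detection	YOLOv8 / YOLOv10	One-stage; a common first choice for industrial deployment
Detection (high accuracy)	DETR / Co-DETR	Transformer-based detectors
Segmentation	Mask2Former / SAM	Panoptic / instance segmentation
Pretrained weights	timm library	`import timm; model = timm.create_model('convnext_base', pretrained=True)`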

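5.3 Model selection for NLP / text

Role	Model	Notes
Lightweight	DeBERTa-V3-base/xsmall	Typically more accurate than BERT-base; the xsmall variant is also faster
Medium	DeBERTa-V3-large	A popular choice in Kaggle NLP competitions
Large models	Llama 3 / Mistral / Qwen	Open-source LLMs, suited to complex reasoning
Embeddings	BGE / E5 / Instructor	Text similarity / retrieval
Classic baseline	TF-IDF + Logistic Regression	Quick benchmark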

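5.4 A decision tree for model selection

Problem type?
  ├─ Tabular → any high-cardinality categorical features?
  │   ├─ Yes → CatBoost
  │   ├─ No → feature dimension > 1000?
  │   │   ├─ Yes → LightGBM
  │   │   └─ No → XGBoost (or try both XGBoost and LightGBM)
  │   └─ Final solution = ensemble of XGBoost + LightGBM + CatBoost
  │
  ├─ Images → more than 100k samples?
  │   ├─ Yes → ViT / Swin
  │   └─ No → EfficientNet / ConvNeXt + transfer learning
  │
  ├─ Text → generation or reasoning required?
  │   ├─ Yes → LLM (Llama-3 / GPT-4)
  │   └─ No → DeBERTa-V3 + fine-tuning
  │
  └─ Multimodal → CLIP / BLIP-2 → per-modality encoders + fusion layer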

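6. Cross-Validation Strategy

6.1 Common CV strategies at a glance

Strategy	When to use	Notes
K-Fold (k=5)	General purpose	Most common; enable shuffle
Stratified K-Fold	Classification (especially with imbalanced classes)	Keeps the class distribution consistent across folds
Group K-Fold	Rows of the same group must not span train/val	User-level prediction, medical data
TimeSeriesSplit	Time series	Training data always precedes validation data
StratifiedGroupKFold	Grouped data that also needs stratification	A relatively new scikit-learn feature
Repeated K-Fold	Small datasets (< 1000 rows)	Average over several random fold assignments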
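
The guarantees in the Group K-Fold and TimeSeriesSplit rows can be checked directly on the fold indices. A minimal sketch on toy data (12 rows, 4 hypothetical users):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
groups = np.repeat([0, 1, 2, 3], 3)  # toy data: 4 users with 3 rows each

# GroupKFold: no group may appear on both sides of a split
group_splits = list(GroupKFold(n_splits=4).split(X, groups=groups))
for trn_idx, val_idx in group_splits:
    assert set(groups[trn_idx]).isdisjoint(groups[val_idx])

# TimeSeriesSplit: every training row precedes every validation row
ts_splits = list(TimeSeriesSplit(n_splits=3).split(X))
for trn_idx, val_idx in ts_splits:
    assert trn_idx.max() < val_idx.min()
    print(f"train rows 0..{trn_idx.max()}, validation rows {val_idx.min()}..{val_idx.max()}")
```

TimeSeriesSplit relies on row order, so the data must be sorted by time before splitting.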

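6.2 CV implementation template

from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit

# Stratified K-Fold template (swap in GroupKFold / TimeSeriesSplit where appropriate)
def run_cv(X, y, model_factory, metric, n_splits=5):
    # model_factory: callable returning a fresh, unfitted model
    # metric: callable metric(y_true, y_pred) -> float
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof_preds = np.zeros(len(y))
    scores = []

    for fold, (trn_idx, val_idx) in enumerate(cv.split(X, y)):
        X_trn, X_val = X.iloc[trn_idx], X.iloc[val_idx]
        y_trn, y_val = y.iloc[trn_idx], y.iloc[val_idx]

        model = model_factory()
        # XGBoost-style fit. Set early_stopping_rounds in the model constructor:
        # XGBoost >= 2.0 no longer accepts it as a fit() argument.
        model.fit(X_trn, y_trn, eval_set=[(X_val, y_val)], verbose=False)

        # For probability metrics (AUC, log loss) use model.predict_proba(X_val)[:, 1]
        oof_preds[val_idx] = model.predict(X_val)
        scores.append(metric(y_val, oof_preds[val_idx]))
        print(f"Fold {fold+1}: {scores[-1]:.4f}")

    print(f"CV Mean: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
    return oof_preds, scores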

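6.3 CV vs Public LB vs Private LB

Common pitfall: shake-up, where a high Public LB rank drops sharply on the Private LB. Causes:

The Public LB uses a small share of the test data (e.g. 30%), and its distribution may differ from that of the Private LB
Overfitting the Public LB (through many trial-and-error submissions)
CV and LB disagree (CV is stable but the LB fluctuates → the CV setup may be flawed)

Golden rule: trust CV over the Public LB. Trust LB trends only when CV and LB move together.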

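7. Hyperparameter Optimization

7.1 Comparison of optimization methods

Method	Description	Library	When to use
Grid Search	Exhaustively tries every combination	sklearn	Verification with few parameters
Random Search	Randomly samples the search space	sklearn	Medium-sized parameter searches
Bayesian Optimization (GP)	Models the parameter-performance relationship with a Gaussian process	scikit-optimize, Optuna (`GPSampler`)	Sample-efficient search when each evaluation is expensive
TPE (Tree-structured Parzen Estimator)	Bayesian method based on density estimation	Optuna (default sampler), Hyperopt	Large search spaces; the usual default choice
CMA-ES	Evolution strategy	Optuna sampler	Continuous parameter spaces
Successive Halving / Hyperband	Quickly discards poor configurations	Optuna pruner	Saves time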

7.2 Optuna in practice

import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Assumes X_train and y_train are already defined

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.3, 1.0),
        'min_child_weight': trial.suggest_float('min_child_weight', 0.5, 10.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 500, 5000),
    }

    model = XGBRegressor(**params, tree_method='hist', random_state=42)
    score = cross_val_score(model, X_train, y_train,
                            cv=5, scoring='neg_mean_squared_error').mean()
    return -score  # flip sklearn's negated MSE back to a positive MSE

study = optuna.create_study(direction='minimize')  # minimize MSE (minimize is also Optuna's default)
study.optimize(objective, n_trials=100, show_progress_bar=True)

print(f"Best params: {study.best_params}")
print(f"Best score: {study.best_value:.4f}")

# Inspect parameter importances (returns a Plotly figure)
optuna.visualization.plot_param_importances(study).show()

7.3 Advanced Optuna features

# Pruning: stop clearly unpromising trials partway through training
study = optuna.create_study(direction='minimize',
                            pruner=optuna.pruners.MedianPruner())

# Inside objective (schematic: fit_step and evaluate stand in for your own
# incremental training and validation code):
for step in range(n_estimators):
    model.fit_step(...)
    intermediate_value = evaluate(...)
    trial.report(intermediate_value, step)
    if trial.should_prune():
        raise optuna.TrialPruned()

# Resume from a previous search (the study must already exist in this storage)
study = optuna.load_study(study_name='xgboost_opt', storage='sqlite:///study.db')
study.optimize(objective, n_trials=50)

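8. Ensemble Methods

8.1 Core ensemble strategies

Method	Description	Models needed	Effect
Averaging	Arithmetic mean of the predictions	2-5	Good
Weighted Averaging	Average weighted by CV score	2-5	Better
Rank Averaging	Average the ranks, then map back to values	2-5	Robust to outlying predictions
Stacking	A meta-model learns how to weight the base models	5+	Best (but carries a leakage risk)
Blending	Simplified stacking on a single hold-out split	3-5	Fast but may overfit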
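
The three averaging rows can be sketched in a few lines. This assumes `preds` is an array of shape (n_models, n_samples); the function names and toy numbers are illustrative. The rank-averaging sketch returns scaled ranks in (0, 1] and skips the final mapping back to values, which is enough for rank-based metrics such as AUC.

```python
import numpy as np
from scipy.stats import rankdata

def simple_average(preds):
    """Arithmetic mean across models."""
    return np.mean(preds, axis=0)

def weighted_average(preds, weights):
    """Weighted mean across models; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    return np.average(preds, axis=0, weights=w / w.sum())

def rank_average(preds):
    """Mean of per-model ranks, each scaled to (0, 1]."""
    ranks = np.array([rankdata(p) / len(p) for p in preds])
    return ranks.mean(axis=0)

# Toy predictions from three models on three samples (illustrative only)
preds = np.array([
    [0.10, 0.40, 0.90],
    [0.20, 0.30, 0.70],
    [0.01, 0.02, 0.99],  # a model with a very different output scale
])
print(simple_average(preds))
print(weighted_average(preds, weights=[1, 1, 2]))
print(rank_average(preds))
```

Because rank averaging discards each model's output scale, the third model's extreme values carry no more weight than the others, which is the robustness the table refers to.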

8.2 Stacking implementation (OOF approach)

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def get_stacking_features(models, X, y, X_test, n_folds=5):
    """
    Use OOF (out-of-fold) predictions as stacking features to avoid leakage.
    models: list of callables, each returning a fresh, unfitted model.
    """
    oof = np.zeros((len(X), len(models)))
    test_preds = np.zeros((len(X_test), len(models)))

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    for i, model_factory in enumerate(models):
        test_fold_preds = np.zeros(len(X_test))

        for fold, (trn_idx, val_idx) in enumerate(kf.split(X)):
            X_trn, X_val = X.iloc[trn_idx], X.iloc[val_idx]
            y_trn, y_val = y.iloc[trn_idx], y.iloc[val_idx]

            model = model_factory()  # fresh model for every fold
            model.fit(X_trn, y_trn)
            oof[val_idx, i] = model.predict(X_val)
            # Average the test predictions of the fold models
            test_fold_preds += model.predict(X_test) / n_folds

        test_preds[:, i] = test_fold_preds

    return oof, test_preds

# Train the second-level meta-model on the OOF features
# (base_models, X, y, and X_test are assumed to be defined)
oof_train, test_feat = get_stacking_features(base_models, X, y, X_test)
meta_model = Ridge(alpha=1.0)
meta_model.fit(oof_train, y)
final_preds = meta_model.predict(test_feat)

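8.3 Ensemble diversity

An effective ensemble needs enough diversity among its base models. Ways to increase diversity:

Different algorithms: XGBoost + LightGBM + CatBoost + RandomForest
Different parameters: the same algorithm with different hyperparameters (different depth, different sampling)
Different feature subsets: train each model on a different subset of features
Different seeds: the same parameters with different random seeds
Different data versions: different data cleaning / feature engineering versions
Different target handling: regression vs binning + classification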
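
One rough way to gauge diversity is the correlation between the models' out-of-fold predictions: highly correlated models add little to each other. The sketch below uses synthetic stand-ins for OOF predictions; the column names and noise levels are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for OOF predictions (illustrative): two near-identical
# models and one that makes noticeably different errors
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
oof = pd.DataFrame({
    "xgb": signal + 0.10 * rng.normal(size=1000),
    "lgbm": signal + 0.10 * rng.normal(size=1000),
    "ridge": signal + 0.80 * rng.normal(size=1000),
})

corr = oof.corr()
print(corr.round(3))
```

A weaker but less correlated model (here the hypothetical `ridge` column) can still help an ensemble, which is why diversity is weighed alongside individual CV scores.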

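8.4 Ensemble ROI

A common rule of thumb:

Strategy	Typical extra gain	Extra time cost
Weighted average of 2 different models	+0.1% ~ 0.3%	2x
Weighted average of 5 models	+0.2% ~ 0.5%	5x
3 models + Stacking	+0.2% ~ 0.4%	4x ~ 10x
Tuning a single model for longer	+0.1% ~ 0.2%	1.2x
Deeper feature engineering	+0.5% ~ 2%	Depends on complexity

Conclusion: feature engineering > model ensembling > hyperparameter fine-tuning.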

9. Post-processing

9.1 Common post-processing techniques

# 1. Probability calibration (classification)
from sklearn.calibration import CalibratedClassifierCV
calibrated = CalibratedClassifierCV(model, method='isotonic', cv=5)
calibrated.fit(X, y)

# 2. Probability clipping (guards log loss against extreme predictions)
preds = np.clip(preds, 1e-15, 1 - 1e-15)

# 3. Threshold tuning (maximize a specific metric); tune on OOF predictions, not on test data
from sklearn.metrics import f1_score
thresholds = np.arange(0.3, 0.7, 0.01)
best_threshold = max(thresholds, key=lambda t: f1_score(y_true, preds > t))

# 4. Out-of-range fallback (regression)
# If a prediction falls outside the physically plausible range, clip it to the boundary
preds = np.clip(preds, lower_bound, upper_bound)

# 5. Distribution matching (quantile mapping)
# Make the prediction distribution match the training-label distribution:
# replace each prediction with the training-label quantile at the same rank
ranks = preds.argsort().argsort() / max(len(preds) - 1, 1)
preds = np.quantile(y_train, ranks)

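10. Pipeline Management and Experiment Tracking

10.1 Experiment management

# Track experiments with MLflow (params, cv_score, and model are assumed to be defined)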
import mlflow

mlflow.set_experiment("kaggle-competition-name")

with mlflow.start_run(run_name="xgb_baseline_v1"):
    mlflow.log_params(params)
    mlflow.log_metric("cv_score", cv_score)
    mlflow.xgboost.log_model(model, "model")
    mlflow.log_artifact("feature_importance.png")

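10.2 Pipeline management

# Use an sklearn Pipeline to keep train/test preprocessing consistent
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# numerical_cols and categorical_cols are lists of column names defined elsewhere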

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numerical_cols),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_cols)
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', XGBClassifier())
])

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)  # train and test go through exactly the same transforms

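11. Common Pitfalls

11.1 Data leakage

Leakage type	Example	Consequence
Temporal leakage	Using future information to predict the past	The model fails (unusable in live deployment)
Target leakage	A feature contains some transformation of the target	Very high CV and LB scores, but no value in real use
Group leakage	Rows from the same entity end up in both train and val	Inflated CV
Preprocessing leakage	Scaling/encoding/imputing on the full data before splitting	Inflated CV (fit on the training fold only, then transform the validation fold)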
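
Preprocessing leakage is easiest to see with a supervised step. The sketch below uses feature selection, a close relative of the scaling/encoding/imputing cases in the table, on pure-noise data where no honest model should beat chance; the data sizes and `k=20` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Leaky: select features using ALL rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Correct: selection sits inside the pipeline, so it is refit on each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_score:.2f}")
print(f"honest CV accuracy: {honest_score:.2f}")
```

The leaky variant reports accuracy well above chance on data that contains no signal, which is exactly the inflated CV the table warns about.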

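11.2 Other common mistakes

Using a single train_test_split instead of a proper CV strategy
Watching only the Public LB and ignoring CV-LB consistency
Inconsistent random seeds that make results impossible to reproduce
Using random_state=42 throughout CV but never checking CV stability across different seeds
Submission file format errors (row order, column names, data types)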
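
The last item, submission format errors, can be caught with a check against the sample submission before uploading. A minimal sketch, assuming the competition provides a sample submission with an id column; the function name `check_submission` and the column names are hypothetical.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, sample: pd.DataFrame, id_col: str) -> list:
    """Return a list of format problems; an empty list means the checks passed."""
    problems = []
    if list(sub.columns) != list(sample.columns):
        problems.append(f"columns differ: {list(sub.columns)} vs {list(sample.columns)}")
        return problems  # the remaining checks assume matching columns
    if len(sub) != len(sample):
        problems.append(f"row count {len(sub)} != {len(sample)}")
    elif not sub[id_col].reset_index(drop=True).equals(sample[id_col].reset_index(drop=True)):
        problems.append("ids missing or in a different order")
    if sub.isna().any().any():
        problems.append("NaN values present")
    for col in sample.columns:
        if sub[col].dtype.kind != sample[col].dtype.kind:
            problems.append(f"dtype kind of '{col}' differs from the sample")
    return problems

# Toy example (illustrative column names)
sample = pd.DataFrame({"id": [1, 2, 3], "target": [0.0, 0.0, 0.0]})
good = pd.DataFrame({"id": [1, 2, 3], "target": [0.2, 0.7, 0.1]})
bad = good.iloc[::-1]  # same rows, wrong order

print(check_submission(good, sample, "id"))
print(check_submission(bad, sample, "id"))
```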

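12. Frequently Asked Interview Questions

Q1: In a Kaggle competition your CV is stable (5-fold 0.85±0.005), but the Public LB is only 0.80. Which do you trust first?

Trust CV first, provided the CV scheme itself is sound. Possible reasons for the gap (CV >> LB): (1) the Public LB sample is small (e.g. only 30% of the test set), so sampling bias pulls the score down; (2) the Public LB distribution differs from the training set (e.g. the LB data comes from a later period); (3) the CV split itself leaks, for example across groups or time, which inflates CV. While trusting CV, analyze the pattern of the CV-LB gap: if CV gains carry over to the LB (CV +0.01 → LB +0.005), the trend is reliable even though the absolute values differ, and you can keep optimizing.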

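Q2: How do you handle extreme class imbalance (positive-to-negative ratio of 1:1000)?

Attack it on several fronts: (1) use evaluation metrics suited to imbalanced data (Precision-Recall AUC rather than accuracy); (2) at the model level: scale_pos_weight (XGBoost), class_weight (sklearn), Focal Loss; (3) resampling: undersample the negative class (often more effective than oversampling because it introduces no synthetic data); (4) threshold tuning: search for the best classification threshold on CV; (5) anomaly-detection view: treat the positive class as anomalies and use One-Class SVM / Isolation Forest.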
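
Points (1) and (2) can be made concrete with a toy example: a model that always predicts the negative class looks excellent on accuracy and poor on average precision (a common PR-AUC estimate). The 990:10 split is illustrative, and the `scale_pos_weight` value follows the usual negatives-over-positives heuristic.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Toy labels (illustrative): 990 negatives and 10 positives
y = np.array([0] * 990 + [1] * 10)

# Common heuristic for XGBoost's scale_pos_weight: negatives / positives
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

# A useless model that scores every row as negative
always_negative = np.zeros(len(y))
acc = accuracy_score(y, always_negative)
ap = average_precision_score(y, always_negative)

print(f"scale_pos_weight = {scale_pos_weight:.0f}")
print(f"accuracy = {acc:.2f}, average precision = {ap:.2f}")
```

With constant scores, average precision falls to the positive rate, so it exposes a model that accuracy would rate at 99%.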

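Q3: Which has the better return on investment, feature engineering or model ensembling?

Feature engineering. In most competitions, the climb into the top 20% comes from good feature engineering, while ensembling is usually decisive only in the fight for medals within the top 1%. A well-designed feature is "free" (it does not add inference time), whereas an ensemble usually increases compute cost linearly. An efficient approach: push feature engineering until it saturates, then use ensembling to harvest the last 0.1%-0.3%.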

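Q4: What is the difference between Stacking and Blending, and what are the pros and cons of each?

Stacking: uses K-Fold OOF (out-of-fold) predictions as meta-features, so every training sample has an OOF prediction from each base model. Pros: the meta-model trains on the full dataset, which is more robust. Cons: every base model must be trained K times, so training takes long.
Blending: trains the base models on a simple hold-out split (e.g. 80% train / 20% blend) and uses the predictions on the hold-out part as meta-features. Pros: simple and fast. Cons: the meta-model sees only 20% of the data and may be unstable; the base models never train on that 20%, which wastes data.

With enough data (> 100k rows) blending is usually sufficient; with less data stacking is more reliable.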
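
Blending as described above fits in a few lines. This is a sketch under the 80/20 split from the answer, assuming regression with a Ridge meta-model; the function name `blend`, the base models, and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def blend(base_factories, X, y, X_test, holdout=0.2, seed=42):
    """Fit base models on the train part, then fit a meta-model on hold-out predictions."""
    X_trn, X_hold, y_trn, y_hold = train_test_split(
        X, y, test_size=holdout, random_state=seed)
    hold_feats = np.zeros((len(X_hold), len(base_factories)))
    test_feats = np.zeros((len(X_test), len(base_factories)))
    for i, factory in enumerate(base_factories):
        model = factory().fit(X_trn, y_trn)       # never sees the hold-out rows
        hold_feats[:, i] = model.predict(X_hold)  # meta-features for the meta-model
        test_feats[:, i] = model.predict(X_test)
    meta = Ridge(alpha=1.0).fit(hold_feats, y_hold)
    return meta.predict(test_feats)

# Synthetic regression data (illustrative)
rng = np.random.default_rng(0)
coef = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(300, 3))
y = X @ coef + 0.1 * rng.normal(size=300)
X_test = rng.normal(size=(50, 3))

factories = [
    lambda: Ridge(alpha=1.0),
    lambda: RandomForestRegressor(n_estimators=50, random_state=0),
]
blend_preds = blend(factories, X, y, X_test)
print(blend_preds[:5])
```

Compared with the OOF approach, each base model is trained once instead of K times, at the cost of a meta-model fitted on only the hold-out rows.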

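Q5: How do you detect and handle data leakage?

Ways to detect leakage: (1) check each feature's predictive power: if a single feature on its own reaches a near-perfect AUC (> 0.99), it is very likely leaked; (2) check the timing of each feature relative to the target: a feature's value must be available at prediction time, before the target event occurs; (3) run a shuffle analysis on suspect features: if randomly shuffling one feature causes a large drop in model performance, the model depends heavily on it (and if that feature should be unrelated to the target, that indicates leakage).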
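
Check (1) can be sketched as a single-feature AUC scan for a binary target. The helper name `single_feature_auc`, the median imputation, and the synthetic "leaky" column are assumptions for illustration; the 0.99 cutoff is the one quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def single_feature_auc(X: pd.DataFrame, y) -> pd.Series:
    """AUC of each numeric feature used alone as a score, in either direction."""
    aucs = {}
    for col in X.select_dtypes("number"):
        values = X[col].fillna(X[col].median())
        auc = roc_auc_score(y, values)
        aucs[col] = max(auc, 1 - auc)  # a strongly negative relation is just as suspicious
    return pd.Series(aucs).sort_values(ascending=False)

# Synthetic data (illustrative): one ordinary feature and one leaked copy of the target
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = pd.DataFrame({
    "honest": y + rng.normal(scale=2.0, size=500),
    "leaky": y * 10 + rng.normal(scale=0.1, size=500),
})

aucs = single_feature_auc(X, y)
suspects = aucs[aucs > 0.99]
print(aucs.round(3))
print("suspect features:", list(suspects.index))
```

A high single-feature AUC is a prompt for investigation rather than proof of leakage; a legitimately strong feature can also score well.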

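How to handle it: (1) if you are in a competition and the rules allow, raise it in the discussion forum; (2) if you are unsure whether something is leakage, prepare two versions of the solution, one with and one without the suspect feature; (3) add a time-based backtest to CV to simulate real use: split train/validation by time and see whether a model that depends on the suspect feature collapses when extrapolating forward in time.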