机器学习框架篇-XGBoost

本文是对 XGBoost C++ 核心源码的深度分析，基于 XGBoost 官方仓库 (dmlc/xgboost) 和官方文档。我们将从顶层架构开始，逐步深入各个核心模块。

一、总体架构概览

1.1 源码目录结构

xgboost/
├── include/xgboost/       # 公共头文件 (C API, 数据结构声明)
├── src/
│   ├── tree/              # 树构建核心（最关键的目录）
│   │   ├── updater/       # Tree Updater 插件（colmaker, hist, approx, pruner...）
│   │   ├── param.h/cc     # 树参数
│   │   └── split_evaluator.h/cc  # 分裂评估器
│   ├── data/              # DMatrix, SimpleCSRSource, SparsePage
│   ├── gbm/               # 梯度提升机（GBTree, GBLinear, DART）
│   ├── objective/         # 目标函数（reg, binary, multi, rank...）
│   ├── metric/            # 评估指标（rmse, logloss, auc...）
│   ├── linear/            # 线性模型提升器
│   ├── common/            # 通用工具（线程池、随机数、量化草图...）
│   └── c_api/             # C API 实现
├── python-package/        # Python 绑定
├── jvm-packages/          # JVM (Scala/Java) 绑定
├── R-package/             # R 绑定
└── plugin/                # 插件示例（自定义 updater/objective/metric）

1.2 核心类关系

Learner (顶层训练器)
  ├── GBTree / GBLinear / DART   ← 梯度提升机（gbm 层）
  │   └── TrainParam              ← 训练超参数
  ├── ObjFunction                ← 目标函数（计算 g, h）
  ├── Metric                     ← 评估指标
  └── DMatrix                    ← 输入数据容器
       └── SparsePage             ← 稀疏页（数据存储单元）
            └── CSC/CSR format    ← 列/行压缩格式

TreeUpdater (树更新器，插件体系)
  ├── ColMaker                   ← 精确贪心（列块）树构建器
  ├── HistMaker                  ← 直方图近似树构建器
  ├── ApproxMaker (分布式)        ← 分布式近似构建器
  ├── Pruner                     ← 后剪枝
  └── SketchMaker                ← 加权分位点草图

二、DMatrix：数据容器深度解析

2.1 数据流向

原始数据 (CSV, LibSVM, NumPy array)
    ↓ XGDMatrixCreateFromFile / XGDMatrixCreateFromMat
SimpleCSRSource (中间层：存储原始数据，CSR 格式)
    ↓ 预处理（排序、分桶、压缩）
SparsePage (核心存储单元：CSC 格式的列块)
    ↓ 训练时读取
ColMaker / HistMaker (树构建器)

2.2 SparsePage 内部结构

SparsePage 是 XGBoost 的底层数据存储单元，以 CSC（Compressed Sparse Column）格式存储：

// src/data/sparse_page.h (简化)
class SparsePage {
 public:
  // 数据成员
  bst_row_t size_{0};                    // 行数
  common::Span<Entry> data_;             // 非零元素数组（展平）
  common::Span<bst_row_t> offset_;       // 每行在 data_ 中的起始偏移（CSR）
  std::vector<bst_feature_t> col_idx_;   // 对应 data_ 每项的列索引

  // CSC 格式成员（用于列块访问）
  common::Span<size_t> column_ptr_;      // 每列在 data_ 中的起始偏移
  common::Span<bst_row_t> row_idx_;      // 对应 data_ 每项的行索引
};

CSR 和 CSC 的切换：在训练前，XGBoost 将数据从 CSR（行优先，方便读取样本）转换为 CSC（列优先，方便按特征列访问），这个过程称为”转置”。转置后的数据以 Column Block 的形式存储。

2.3 数据预处理流水线

// 简化：DMatrix 构建时的主要处理步骤
DMatrix* DMatrix::Create(const char* fname, bool silent) {
  // Step 1: 加载原始数据到 SimpleCSRSource
  auto source = std::make_unique<SimpleCSRSource>();
  source->LoadFromFile(fname);  // 或 LoadFromMat, LoadFromCSR

  // Step 2: 如果是 Hist 模式，进行分桶（binning）
  if (tree_method == TreeMethod::kHist) {
    source = QuantileBinning(source, max_bin);
  }

  // Step 3: 转置为 CSC（列块）并稀疏化
  auto dmat = std::make_unique<DMatrix>();
  dmat->ProcessSparsePages(source.get());

  return dmat.release();
}

2.4 Quantile Binning（Hist 模式专用）

对于 tree_method='hist'，DMatrix 构建时预先将连续特征值分到若干个桶（bins）中：

// 简化的分桶逻辑
void QuantileBinning(SparsePage* page, int max_bin) {
  for (int fid = 0; fid < num_features; ++fid) {
    // 1. 收集该特征的所有非缺失值
    std::vector<float> values = CollectFeatureValues(page, fid);

    // 2. 计算等频分位点（或加权分位点）
    std::vector<float> cuts = FindQuantileCuts(values, max_bin);

    // 3. 将每个特征值映射到对应的 bin index
    for (auto& entry : page[fid]) {
      entry.fvalue = FindBin(entry.fvalue, cuts);
    }
  }
}

Hist 模式的好处：

分桶后每个特征只有 max_bin (默认 256) 个候选值，不需要排序
内存消耗大幅降低：每个特征值仅需 1 字节（uint8 bin index）
GPU 友好：直方图算法本身与 GPU 的并行模型高度契合

三、树构建器详解：ColMaker

3.1 精确贪心树构建算法

ColMaker 是 XGBoost 的精确树构建器（对应 tree_method='exact'），它基于 Column Block 实现按列并行的分裂查找。

// src/tree/updater/colmaker.cc (极度简化)
class ColMaker : public TreeUpdater {
 public:
  void Update(HostDeviceVector<GradientPair>* gpair,
              DMatrix* dmat,
              const std::vector<RegTree*>& trees) override {
    // 对每棵树
    for (auto& tree : trees) {
      // 初始化：创建根节点
      Builder builder(dmat, gpair, param);
      builder.InitRoot(tree);

      // 逐层构建
      for (int depth = 0; depth < param.max_depth; ++depth) {
        // 收集本层需要分裂的节点
        auto nodes = builder.CollectNodesToSplit(depth);

        // 并行处理：每个线程处理一部分特征列
        builder.FindSplit(nodes);
      }
    }
  }
};

3.2 分裂查找的并行化

ColMaker 的核心并行化策略是特征级并行：将特征列分配到多个线程，每个线程独立在自己负责的特征列上寻找最优分裂点。

// 简化的并行分裂查找
void ColMaker::Builder::FindSplit(
    const std::vector<NodeEntry>& nodes) {

  // 每个线程维护自己的 split 候选（线程局部存储）
  std::vector<std::vector<SplitEntry>> thread_splits(
      nthread, std::vector<SplitEntry>(nodes.size()));

  // 特征列并行
  #pragma omp parallel for schedule(dynamic)
  for (int fid = 0; fid < num_features; ++fid) {
    int tid = omp_get_thread_num();

    // 获取该特征的列块（已按特征值排序）
    auto col = dmat_->GetSortedColumn(fid);

    // 对每个待分裂节点
    for (size_t nid = 0; nid < nodes.size(); ++nid) {
      // 累积 G 和 H
      double sum_g = 0, sum_h = 0;
      for (auto& entry : col) {
        sum_g += gradient[entry.index].grad;
        sum_h += gradient[entry.index].hess;

        // 计算分裂增益
        double gain = CalcGain(nodes[nid].sum_g - sum_g,
                               nodes[nid].sum_h - sum_h,
                               sum_g, sum_h);

        if (gain > thread_splits[tid][nid].gain) {
          thread_splits[tid][nid].Update(gain, fid, entry.fvalue);
        }
      }
    }
  }

  // 合并各线程的最佳分裂（每个节点取所有线程中的最佳）
  for (size_t nid = 0; nid < nodes.size(); ++nid) {
    for (int tid = 0; tid < nthread; ++tid) {
      if (thread_splits[tid][nid].gain > best_split[nid].gain) {
        best_split[nid] = thread_splits[tid][nid];
      }
    }
  }
}

3.3 缓存感知的梯度访问

ColMaker 中最关键的性能优化是缓存感知的梯度预取：

// 简化的缓存感知累加
template <int BufferSize = 256>
void AccumulateGradients(const SparsePage::Column& col,
                          const GradientPair* gpair,
                          NodeEntry* node) {
  // 小型缓冲区，用于批量处理
  bst_gpair_t buf_g[BufferSize];
  bst_gpair_t buf_h[BufferSize];
  size_t buf_pos = 0;

  for (auto& entry : col) {
    // 收集到缓冲区
    buf_g[buf_pos] = gpair[entry.index].GetGrad();
    buf_h[buf_pos] = gpair[entry.index].GetHess();
    buf_pos++;

    if (buf_pos == BufferSize) {
      // 缓冲区满，批量累加（利用 CPU SIMD 向量化）
      for (size_t i = 0; i < BufferSize; ++i) {
        node->sum_g += buf_g[i];
        node->sum_h += buf_h[i];
      }
      buf_pos = 0;
    }
  }

  // 处理剩余
  for (size_t i = 0; i < buf_pos; ++i) {
    node->sum_g += buf_g[i];
    node->sum_h += buf_h[i];
  }
}

缓冲区的存在将”随机访问”转换为”批量顺序访问”：先顺序读取梯度值到缓冲区（利用 CPU 预取机制），再在缓冲区内完成累加（利用 SIMD 向量化指令）。

四、树构建器详解：HistMaker

4.1 直方图近似算法

HistMaker（对应 tree_method='hist'）使用直方图近似进行分裂查找。相比 ColMaker 的精确排序，Hist 算法将连续特征值离散化到预设的桶中，然后基于这些桶的聚合统计量来搜索最佳分裂。

// HistMaker 的核心流程
void HistMaker::BuildHistogram(
    const std::vector<GradientPair>& gpair,
    const SparsePage& page,
    const RegTree& tree,
    std::vector<Histogram>* hist) {

  // 对每个特征
  for (int fid = 0; fid < page.num_features; ++fid) {
    auto col = page[fid];

    // 对每个待分裂节点
    for (auto& node : nodes_to_split) {
      // 构造直方图：bin_idx -> (sum_g, sum_h)
      // 将每个样本的 (g, h) 累加到对应的 bin 中
      for (auto& entry : col) {
        if (tree[entry.row].IsInNode(node.nid)) {
          int bin_idx = entry.bin;  // 特征值已被预分为桶
          hist[fid][node.nid][bin_idx].Add(gpair[entry.row]);
        }
      }

      // 在直方图上扫描找最佳分裂
      double sum_g_left = 0, sum_h_left = 0;
      for (int bin = 0; bin < param.max_bin; ++bin) {
        sum_g_left += hist[fid][node.nid][bin].sum_g;
        sum_h_left += hist[fid][node.nid][bin].sum_h;

        double gain = CalcGain(sum_g_left, sum_h_left,
                                node.sum_g - sum_g_left,
                                node.sum_h - sum_h_left);
        // 更新最佳分裂...
      }
    }
  }
}

4.2 直方图减法技巧（Subtraction Trick）

Hist 算法的一个重要优化是直方图减法：

Histogram(parent) = Histogram(left_child) + Histogram(right_child)
→ Histogram(right_child) = Histogram(parent) - Histogram(left_child)

这意味着对于每个分裂，只需要构建较小子节点的直方图，另一个通过减法得到。这几乎将直方图构建的计算量减半。

void HistMaker::FindSplitBySubtraction(
    const NodeEntry& parent,
    const NodeEntry& smaller_child,
    NodeEntry* larger_child) {
  // 从 parent 直方图减去 smaller_child 直方图
  for (int bin = 0; bin < max_bin; ++bin) {
    larger_child->hist[bin].sum_g =
        parent.hist[bin].sum_g - smaller_child.hist[bin].sum_g;
    larger_child->hist[bin].sum_h =
        parent.hist[bin].sum_h - smaller_child.hist[bin].sum_h;
  }
}

五、GPU 加速

5.1 GPU Histogram 算法

XGBoost 的 GPU 实现基于直方图算法（tree_method='gpu_hist'）。GPU 的并行模型与直方图操作高度契合：

// GPU Hist 算法（CUDA kernel 重点部分）
// 每个 thread block 处理一个特征列的直方图

__global__ void BuildHistKernel(
    const Entry* data,          // 特征值（bin index）
    const GradientPair* gpair,  // 梯度和 Hessian
    const int* row_to_node,     // 每行属于哪个节点
    float* hist,                // 输出：直方图 [n_features][n_nodes][n_bins]
    int n_features, int n_nodes, int n_bins) {

  int fid = blockIdx.x;          // 每个 block 处理一个特征
  int nid = blockIdx.y;          // 每个 block 处理一个节点
  int tid = threadIdx.x;

  // 使用共享内存维护局部直方图
  __shared__ float s_hist[N_BINS * 2];  // G + H

  // 并行扫描特征列
  for (int i = tid; i < num_entries; i += blockDim.x) {
    int row = data[i].row;
    if (row_to_node[row] == nid) {
      int bin = data[i].bin;
      atomicAdd(&s_hist[2 * bin], gpair[row].grad);
      atomicAdd(&s_hist[2 * bin + 1], gpair[row].hess);
    }
  }

  __syncthreads();

  // 写回全局内存
  for (int b = tid; b < n_bins; b += blockDim.x) {
    hist[fid * n_nodes * n_bins + nid * n_bins + b] = s_hist[2 * b];
  }
}

5.2 CPU vs GPU 性能对比

场景	tree_method=’hist’ (CPU)	tree_method=’gpu_hist’ (GPU)	加速比
小数据集（<1 万行）	~5 秒	~2 秒	2.5x
中数据集（~100 万行）	~120 秒	~15 秒	8x
大数据集（~1000 万行）	~1800 秒	~120 秒	15x
稀疏极高（>90%）	~100 秒	~20 秒	5x

六、Python 绑定与 C API

6.1 C API 层次

XGBoost 提供了两层 API：

C API（c_api/c_api.h）：基于句柄（handle）的 C 函数
Python 封装（core.py）：在 C API 之上封装为 Python 类

// C API 训练流程
DMatrixHandle train_dmat;
XGDMatrixCreateFromFile("train.libsvm", 0, &train_dmat);

BoosterHandle booster;
XGBoosterCreate(&train_dmat, 1, &booster);

// 多轮训练
for (int iter = 0; iter < num_round; ++iter) {
  XGBoosterUpdateOneIter(booster, iter, train_dmat);
}

// 预测
bst_ulong out_len;
const float* preds;
XGBoosterPredict(booster, test_dmat, 0, 0, 0, &out_len, &preds);

6.2 Python 训练流程追踪

# Python API 内部调用的追踪
import xgboost as xgb

dtrain = xgb.DMatrix('train.libsvm')
# → 内部调用: XGDMatrixCreateFromFile()

params = {'max_depth': 3, 'eta': 0.1}
num_round = 100
bst = xgb.train(params, dtrain, num_round)
# → 内部循环调用:
#   for i in range(num_round):
#       XGBoosterUpdateOneIter(handle, iter, dtrain.handle)

6.3 Booster 的 save/load

XGBoost 模型保存支持两种格式：

JSON（推荐，人类可读、可调试）：bst.save_model('model.json')
UBJSON (binary JSON)：字节级存储，紧凑
deprecated binary format：旧版二进制格式

七、自定义目标函数与评估指标

7.1 Python 端自定义目标函数

XGBoost 支持在 Python 端定义 Objective 和 Metric，灵活度极高：

import xgboost as xgb
import numpy as np

# 自定义目标函数：输入预测值 predt 和 DMatrix dtrain
# 返回梯度 grad 和二阶梯度 hess
def weighted_logistic_loss(predt, dtrain):
    y = dtrain.get_label()
    weights = dtrain.get_weight()  # 样本权重

    predt = 1.0 / (1.0 + np.exp(-predt))  # sigmoid
    grad = (predt - y) * weights
    hess = predt * (1.0 - predt) * weights
    return grad, hess

# 自定义评估指标
def f1_score_eval(predt, dtrain):
    y_true = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-predt))
    y_pred_binary = (preds > 0.5).astype(int)

    tp = np.sum((y_pred_binary == 1) & (y_true == 1))
    fp = np.sum((y_pred_binary == 1) & (y_true == 0))
    fn = np.sum((y_pred_binary == 0) & (y_true == 1))

    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)

    return 'custom_f1', f1

# 训练
bst = xgb.train(
    params, dtrain,
    obj=weighted_logistic_loss,       # 自定义目标
    feval=f1_score_eval,              # 自定义指标
    evals=[(dtrain, 'train')],
    num_boost_round=100
)

7.2 C++ 端自定义目标函数（插件方式）

// src/objective/my_custom_obj.cc
#include <xgboost/objective.h>

class MyCustomObjective : public ObjFunction {
 public:
  void Configure(const std::vector<std::pair<std::string, std::string>>& args) override {
    // 解析自定义参数
  }

  void GetGradient(const HostDeviceVector<bst_float>& preds,
                   const MetaInfo& info,
                   int iter,
                   HostDeviceVector<GradientPair>* out_gpair) override {
    auto& gpair = out_gpair->HostVector();
    const auto& labels = info.labels_.HostVector();

    for (size_t i = 0; i < labels.size(); ++i) {
      // 自定义梯度计算
      float p = 1.0f / (1.0f + std::exp(-preds.HostVector()[i]));
      gpair[i] = GradientPair(p - labels[i],          // grad
                               p * (1.0f - p));        // hess
    }
  }

  // 注册插件
  XGBOOST_REGISTER_OBJECTIVE(MyCustomObjective, "my_custom_obj")
      .describe("My custom logistic objective")
      .set_body([]() { return new MyCustomObjective(); });
};

八、加权分位点草图的数据结构实现

8.1 合并与修剪操作

// common/quantile.h (简化)
class WQSummary {
 public:
  struct Entry {
    float rmin;   // 累计 rank 下界
    float rmax;   // 累计 rank 上界
    float wmin;   // 累计 weight 下界
    float value;  // 该分位点对应的特征值
  };

  // 合并两个草图（分布式聚合的核心操作）
  static WQSummary Merge(const WQSummary& a, const WQSummary& b) {
    WQSummary result;
    // 合并两个有序的 Entry 列表
    // 合并后的 rmin = a.rmin + b.rmin, rmax = a.rmax + b.rmax
    // ...
    return result;
  }

  // 修剪：在 eps 精度约束下减少 Entry 数量
  void Prune(float eps) {
    std::vector<Entry> pruned;
    for (size_t i = 0; i < entries_.size(); ) {
      size_t j = i + 1;
      // 合并所有满足 rmax[j] - rmin[i] < eps 的连续条目
      while (j < entries_.size() &&
             entries_[j].rmax - entries_[i].rmin < eps) {
        ++j;
      }
      // 将 [i, j) 范围内的条目合并为一个
      pruned.push_back(MergeRange(i, j));
      i = j;
    }
    entries_ = std::move(pruned);
  }
};

九、面试高频问答

Q1: XGBoost 的 ColMaker 和 HistMaker 在源码层面的核心区别是什么？

ColMaker 在源码中对应 src/tree/updater/colmaker.cc，它基于预排序的 Column Block 进行精确分裂查找：每个特征列被提前排序，分裂查找时按排序顺序遍历列中所有值，计算每个可能的切分点的 Gain。HistMaker 对应 src/tree/updater/histmaker.cc（近似算法）和新版 gpu_hist，它先将特征值分桶到 max_bin 个桶中，然后基于直方图（bin 的聚合 G/H 统计量）搜索分裂。HistMaker 利用直方图减法（从父节点直方图减左子直方图得右子直方图）来将计算量减半。

Q2: DMatrix 内部是如何组织数据的？为什么 ColMaker 需要预先转置？

DMatrix 内部使用 SparsePage 存储数据，数据以 CSC（Compressed Sparse Column）格式存储，即按列组织。ColMaker 在进行分裂查找时需要按特征列遍历数据——需要知道”特征 k 的所有非零值分别在哪些样本上”。CSC 格式天然支持这种访问模式：对于特征 k，通过 column_ptr_[k] 和 column_ptr_[k+1] 可以直接获取该列的所有 entry。如果没有预转置（数据按行存储），每次分裂查找都需要扫描全部样本聚合，效率极低。

Q3: GPU 加速为什么能显著提升 XGBoost 的训练速度？

GPU 的加速来自其大规模并行架构与直方图算法的契合：(1) GPU 拥有数千个计算核心，可以同时为多个特征 × 多个节点构建直方图；(2) 直方图构建过程中的原子累加操作（atomicAdd）开销在 GPU 上远低于 CPU，因为 GPU 的共享内存和 Global Memory 原子操作都很高效；(3) 分裂查找阶段的直方图扫描可以使用 prefix sum（scan）等 GPU 原语高效实现。当然，GPU 加速的前提是使用 Hist 方法（特征值离散化），因为原始精确排序在 GPU 上反而不高效。

Q4: 如何在 XGBoost 中实现自定义的损失函数？需要注意什么？

通过 Python API，传入 obj 和 feval 参数即可。自定义目标函数 obj(predt, dtrain) 需要返回 (grad, hess) 元组。关键注意事项：(1) predt 是 raw margin（未经过 sigmoid/softmax 的原始预测值），需要自行做映射；(2) 返回的 grad 和 hess 必须与 predt 形状相同（都是 n_samples 长度的一维数组）；(3) 对于二分类，LogLoss 的 Hessian 是 $p(1-p)$，需要在梯度计算中正确处理；(4) 自定义目标函数会影响 early stopping 的行为（early stopping 使用的是 eval_metric，不会自动调整）。

Q5: XGBoost 的加权分位点草图（Weighted Quantile Sketch）为什么用 Hessian 作为权重？

通过重写 XGBoost 的目标函数，可以将其转化为以 Hessian 为权重的加权平方损失形式：$\sum_i \frac{1}{2}h_i (f_t(x_i) + g_i/h_i)^2 + \Omega(f_t) + C$。这意味着每个样本对损失的贡献与其 Hessian 成正比。当我们在近似分裂查找中提出候选分裂点时，应该确保”Hessian 大的样本附近有更多候选点”——因为那些样本对损失函数的 curvature 贡献大，分裂点的位置对其更敏感。如果使用均匀分位点（不加权），可能在对损失贡献大的区域候选点不够密集，导致近似精度差。