【Python系列】Numpy使用详解

NumPy（Numerical Python）是 Python 生态中科学计算与数据分析的基石库。它提供了高性能的多维数组对象 ndarray，以及丰富的数学函数、线性代数、随机数生成等工具。NumPy 底层以 C 和 Fortran 实现，对数组的操作在编译层面展开循环，因此比纯 Python 循环快 1-2 个数量级。本文从数组创建、数据类型、广播机制、索引切片、通用函数、线性代数、随机数、形状变换、einsum 和张量收缩、内存布局等角度全面讲解 NumPy，并给出 PCA 从零实现、图像卷积、蒙特卡洛求圆周率等完整示例。

一、ndarray —— NumPy 的核心数据结构

1.1 什么是 ndarray

ndarray（N-dimensional array）是一个固定类型的多维容器，所有元素必须是同构的（homogeneous），即具有相同的 dtype。这一约束是 NumPy 高性能的根本原因：同构元素在内存中连续排布，无需指针间接寻址，CPU 缓存命中率极高。

与 Python list 的对比：

特性	Python list	ndarray
元素类型	异构（存指针）	同构（存值）
内存布局	指针数组 + 分散的对象	连续内存块
运算方式	解释循环	编译层向量化
内存占用	大（每个元素多一个 PyObject 头部）	小（纯数据，无对象开销）

创建一个 ndarray 最简单的方式是从 list 或 tuple 转换：

import numpy as np

# 一维数组（向量）
a = np.array([1, 2, 3, 4, 5])
print(a.shape)   # (5,)
print(a.ndim)    # 1
print(a.dtype)   # int64 (取决于平台)

# 二维数组（矩阵）
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b.shape)   # (2, 3)
print(b.ndim)    # 2

# 三维数组
c = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(c.shape)   # (2, 2, 2)

1.2 创建数组的常用函数

NumPy 提供了丰富的工厂函数来创建数组，覆盖从零初始化到特定分布的各类场景：

# ---- 全零/全一/全常量 ----
np.zeros(10)                    # 长度为 10 的全零向量，默认 float64
np.zeros((3, 4), dtype=np.int32)  # 3x4 的 int32 零矩阵
np.ones((2, 5))                 # 2x5 全一矩阵
np.full((3, 5), 520)            # 3x5 矩阵，填充常数 520
np.full_like(a, 7)              # 形状同 a，填充 7

# ---- 序列生成 ----
np.arange(0, 20, 3)             # [0, 3, 6, 9, 12, 15, 18]，类似 range 但返回数组
np.arange(0.0, 1.0, 0.1)        # 浮点步长：0.0, 0.1, ..., 0.9
# 注意：浮点 arange 有精度问题，推荐使用 linspace
np.linspace(0, 10, 5)           # [0., 2.5, 5., 7.5, 10.]，等间距 5 个点
np.linspace(0, 1, 11)           # [0., 0.1, 0.2, ..., 1.0]
np.logspace(0, 4, 5)            # [1, 10, 100, 1000, 10000]，对数等间距
np.geomspace(1, 16, 5)          # [1, 2, 4, 8, 16]，等比数列

# ---- 单位矩阵与对角矩阵 ----
np.eye(4)                       # 4x4 单位矩阵（float64）
np.eye(4, dtype=np.int32)       # int32 单位矩阵
np.identity(4)                  # 同 eye，但必须是方阵
np.diag([1, 2, 3])              # 对角线为 [1,2,3] 的 3x3 矩阵
np.diag(np.arange(1, 5))        # 从已有数组提取对角线或创建对角矩阵

# ---- 随机数组（旧 API） ----
np.random.seed(42)              # 设置全局随机种子
np.random.random((2, 4))        # [0, 1) 均匀分布，形状 (2,4)
np.random.randint(4, 9, size=(3, 5))  # [4,9) 整数，形状 (3,5)
np.random.randn(3, 4)           # 标准正态 N(0,1)，形状 (3,4)
np.random.normal(loc=5, scale=2, size=(100,))  # N(5, 2²) 正态分布

# ---- 从已有数据创建 ----
np.empty((3, 4))                # 分配内存但不初始化（内容随机）
np.empty_like(a)                # 形状同 a，未初始化
np.zeros_like(a)                # 形状同 a，填充 0
np.ones_like(a)                 # 形状同 a，填充 1

二、dtype —— 数据类型系统

2.1 dtype 的本质

dtype（data type）不仅描述数组中存的「是什么类型」，还精确规定了元素在内存中的字节数、字节序（endianness）、以及对于结构化数组而言的字段结构。NumPy 支持的类型远多于 Python 内置类型：

类型	描述	字节数	简写
`np.bool_`	布尔值（True/False）	1	`'?'`
`np.int8`	有符号 8 位整数	1	`'i1'`
`np.int16`	有符号 16 位整数	2	`'i2'`
`np.int32`	有符号 32 位整数	4	`'i4'`
`np.int64`	有符号 64 位整数	8	`'i8'`
`np.uint8`	无符号 8 位（常用于图像）	1	`'u1'`
`np.uint16`	无符号 16 位	2	`'u2'`
`np.float16`	半精度浮点	2	`'f2'`
`np.float32`	单精度浮点	4	`'f4'`
`np.float64`	双精度浮点（默认）	8	`'f8'`
`np.complex64`	复数（32 位实部 + 32 位虚部）	8	`'c8'`
`np.complex128`	复数（64 位实部 + 64 位虚部）	16	`'c16'`
`np.str_`	定长 Unicode 字符串	可变	`'U<n>'`
`np.bytes_`	定长字节串	可变	`'S<n>'`
`np.object_`	Python 对象指针	8/平台	`'O'`

2.2 dtype 的指定与转换

# 在创建时指定 dtype
a = np.array([1, 2, 3], dtype=np.float32)      # 显式 float32
a = np.array([1.5, 2.3, 3.1], dtype=np.int32)  # 截断为 [1, 2, 3]
a = np.zeros((3, 4), dtype='i4')                # 字符串缩写
a = np.array([1, 2, 3], dtype='complex128')     # 复数

# 类型转换 —— astype 返回新数组（拷贝）
x = np.array([1, 2, 3], dtype=np.float64)
y = x.astype(np.int32)         # 创建 int32 副本
z = x.astype(np.float32, copy=False)  # copy=False 表示如果类型已匹配则不复制

# 查看 dtype 属性
print(a.dtype)         # dtype('float64')
print(a.itemsize)      # 8（每个元素 8 字节）
print(a.nbytes)        # 总字节数 = itemsize * size
print(a.dtype.name)    # 'float64'
print(a.dtype.kind)    # 'f'（类型族：f=浮点, i=整数, u=无符号, b=布尔, c=复数, O=对象, U=字符串）

2.3 结构化数组

当需要存储异构数据（类似 C 的 struct 或数据库的行）时，可以使用结构化数组：

# 定义结构化 dtype
dtype = np.dtype([
    ('name', 'U20'),       # 最多 20 个 Unicode 字符的 name 字段
    ('age', 'i4'),         # int32 的 age 字段
    ('score', 'f8'),       # float64 的 score 字段
])

# 创建结构化数组
students = np.array([
    ('Alice', 22, 92.5),
    ('Bob', 23, 85.0),
    ('Charlie', 21, 88.5),
], dtype=dtype)

# 访问字段
print(students['name'])         # ['Alice' 'Bob' 'Charlie']
print(students['score'].mean()) # 88.666...
print(students[students['age'] > 21])  # 按条件筛选

# 多重索引
print(students[0]['name'])      # 'Alice'
print(students[0][0])           # 'Alice'（按位置）

结构化数组虽然灵活，但在性能敏感场景下更推荐使用 Pandas。它的主要优势在于可与 C/Fortran 代码无缝交互，以及零拷贝内存映射。

2.4 类型提升（Type Promotion）规则

当计算涉及不同 dtype 时，NumPy 会自动将结果提升为能安全容纳两者的类型：

a = np.array([1, 2, 3], dtype=np.int32)
b = np.array([1.5, 2.5, 3.5], dtype=np.float64)
c = a + b                       # dtype 提升为 float64

x = np.array([1], dtype=np.int8)
y = np.array([1000], dtype=np.int16)
z = x + y                       # dtype 提升为 int16

# 显式指定输出类型（ufunc 的 dtype 参数）
result = np.add(a, b, dtype=np.float32)

三、强大的索引系统

NumPy 的索引远比 Python list 丰富，支持基本索引（返回视图）、高级索引（返回副本）、布尔索引以及 np.where / np.take 等实用函数。

3.1 基本索引与切片

基本索引使用整数、切片或 ...（Ellipsis），返回原数组的视图（view），即修改视图也会影响原数组：

a = np.arange(24).reshape(2, 3, 4)   # shape (2, 3, 4)
print(a)

# 单元素索引
a[0, 1, 2]          # 第一个 batch、第二行、第三列 → 6

# 切片
a[:, 0, :]          # shape (2, 4)：所有 batch 的第 0 行
a[:, :, ::2]        # shape (2, 3, 2)：所有 batch，所有行，每隔一列
a[::-1, ...]        # shape (2, 3, 4)：batch 轴反转

# Ellipsis 表示「尽可能多的冒号」
a[..., 2]           # shape (2, 3)：最后一个轴取索引 2
# 等价于 a[:, :, 2]

# 基本索引产生视图
view = a[0:2, 1:3, :]  # shape (2, 2, 4)
view[0, 0, 0] = 999    
print(a[0, 1, 0])       # 999 —— 原数组被修改！

3.2 花式索引（Fancy Indexing）

使用整数数组或列表进行索引，返回副本而不是视图：

a = np.arange(10) ** 2   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# 整数数组索引
indices = [2, 5, 8]
print(a[indices])         # [4, 25, 64]

# 重复选取
print(a[[1, 1, 3, 3, 7, 7]])  # [1, 1, 9, 9, 49, 49]

# 二维数组的整数索引
b = np.arange(24).reshape(6, 4)
row_idx = np.array([0, 2, 4])
col_idx = np.array([1, 2, 3])
print(b[row_idx, col_idx])     # [b[0,1], b[2,2], b[4,3]] = [1, 10, 19]

# 多维整数索引：取出矩形的（十字交叉）元素
rows = np.array([[0, 0], [3, 3]])
cols = np.array([[0, 3], [0, 3]])
print(b[rows, cols])
# [[ 0  3]
#  [12 15]]

# ix_ 函数：生成开网格索引
rows, cols = np.ix_([0, 2, 4], [1, 3])
print(b[rows, cols])
# [[ 1  3]
#  [ 9 11]
#  [17 19]]

3.3 布尔索引（Boolean Masking）

布尔索引是数据筛选的利器，返回副本：

a = np.random.randn(100)
mask = a > 0
print(a[mask])               # 所有正数的值
print(a[a > 0])              # 等价写法

# 复合条件
mask = (a > -0.5) & (a < 0.5)    # 必须用 & 而不是 and
print(a[mask])

# 注意：不能省略括号，因为 & 优先级高于比较符
# mask = a > -0.5 & a < 0.5   # 错误！等价于 a > (-0.5 & a) < 0.5

# 条件赋值
a[a < 0] = 0                  # 把所有负数置零
a[a > 2] = 2                  # 裁剪到 2

# where 的三参数形式（if-else 向量化）
result = np.where(a > 0, a, -a)   # 如果 a>0 则取 a，否则取 -a（即取绝对值）

3.4 np.where / np.nonzero / np.take / np.choose

# np.where 单参数形式：返回满足条件的索引元组
a = np.arange(12).reshape(3, 4)
indices = np.where(a > 5)        # (array([1, 1, 2, 2, 2, 2]), array([2, 3, 0, 1, 2, 3]))
# 第一个 array 是行索引，第二个是列索引

# 用 zip 遍历
for row, col in zip(*indices):
    print(f"a[{row}, {col}] = {a[row, col]}")

# np.take：沿指定轴取元素
a = np.arange(10)
print(np.take(a, [0, 2, 4, 6, 8]))   # [0, 2, 4, 6, 8]
print(a.take([0, 2, 4, 6, 8]))       # 等价方法

# np.put：按索引放置值
np.put(a, [0, 2], [-44, -55])
print(a)  # [-44, 1, -55, 3, ...]

四、广播机制（Broadcasting）深度剖析

广播是 NumPy 最强大的特性之一，它允许不同形状的数组在算术运算中自动对齐。理解广播规则是写出高效向量化代码的关键。

4.1 广播的三条规则

NumPy 从右向左比较两个数组的 shape，当满足以下条件时触发广播：

规则一：如果两个数组的维数不同，在维数较少的数组的形状左侧补 1。
规则二：如果两个数组的某个维度大小不一致，但其中一个为 1，则将该维度拉伸为与对方相同。
规则三：如果两个数组在某个维度上大小均不为 1 且不相等，则抛出 ValueError。

# 示例：a.shape = (3, 1)，b.shape = (1, 4)
a = np.array([[1], [2], [3]])        # shape (3, 1)
b = np.array([[10, 20, 30, 40]])     # shape (1, 4)

c = a + b
# 广播过程：
# a: (3, 1) → (3, 4)  # 列维度拉伸为 4
# b: (1, 4) → (3, 4)  # 行维度拉伸为 3
# 结果: (3, 4)
print(c)
# [[11 21 31 41]
#  [12 22 32 42]
#  [13 23 33 43]]

4.2 广播的经典应用场景

# 1. 数组与标量（最简情况）
a = np.arange(10)
b = a * 3          # 标量 3 广播为同形状数组

# 2. 均值中心化（每列减去该列均值）
data = np.random.randn(100, 5)        # shape (100, 5)
centered = data - data.mean(axis=0)   # mean(axis=0).shape = (5,) → 广播为 (100, 5)

# 3. 标准化（z-score）
zscored = (data - data.mean(axis=0)) / data.std(axis=0)

# 4. 外积（outer product）
x = np.array([1, 2, 3])      # shape (3,)
y = np.array([10, 20])       # shape (2,)
# 通过 reshape 利用广播计算外积
outer = x[:, np.newaxis] * y[np.newaxis, :]
# x[:, np.newaxis] → (3, 1)
# y[np.newaxis, :] → (1, 2)
# 广播后 → (3, 2)
print(outer)
# [[10 20]
#  [20 40]
#  [30 60]]

# 5. 将一个矩阵的每一行加一个不同的行向量
A = np.random.randn(4, 5)
row_bias = np.random.randn(4)
result = A + row_bias[:, np.newaxis]  # (4, 1) + (4, 5) → (4, 5)

4.3 np.newaxis 与维度扩展

np.newaxis（等价于 None）用于在指定位置插入一个长度为 1 的新维度，是手工控制广播的得力工具：

a = np.array([1, 2, 3])       # shape (3,)

# 插入新维度
print(a[:, np.newaxis].shape)  # (3, 1) —— 列向量
print(a[np.newaxis, :].shape)  # (1, 3) —— 行向量
print(a[:, np.newaxis, np.newaxis].shape)  # (3, 1, 1)

# 用 np.expand_dims 明确语义
print(np.expand_dims(a, axis=0).shape)   # (1, 3)
print(np.expand_dims(a, axis=1).shape)   # (3, 1)
print(np.expand_dims(a, axis=-1).shape)  # (3, 1)

五、通用函数（ufuncs）

5.1 什么是 ufunc

ufunc（universal function）是对 ndarray 逐元素操作的向量化包装器。它们由 C 实现，性能远超 Python 循环。NumPy 内置了 60+ 个 ufunc，覆盖算术、三角函数、比较、逻辑等。

# 一元 ufunc
a = np.arange(10)
np.sqrt(a)              # 逐元素开方
np.exp(a)               # e^a
np.log(a + 1)           # ln(a+1)，避免 log(0)
np.sin(a)               # sin(a)
np.abs(a)               # |a|
np.sign(a)              # -1, 0, 1 符号

# 二元 ufunc
b = np.arange(10, 20)
np.add(a, b)            # a + b
np.subtract(a, b)       # a - b
np.multiply(a, b)       # a * b
np.divide(a, b)         # a / b
np.power(a, 2)          # a²
np.maximum(a, b)        # max(a, b)  逐元素，不同于 np.max
np.minimum(a, b)        # min(a, b)
np.mod(a, b)            # a % b
np.greater(a, b)        # a > b（返回 bool 数组）

5.2 ufunc 的 out 参数

out 参数允许将结果写入预先分配的内存，避免额外分配：

a = np.random.randn(1000000)
b = np.random.randn(1000000)
result = np.empty_like(a)

# 不分配新内存，结果写入 result
np.add(a, b, out=result)

# 链式操作中复用缓冲区
temp = np.empty_like(a)
final = np.sqrt(np.add(a, b, out=temp), out=result)
# 如果不用 out，np.add 会分配一个新数组，然后 sqrt 又分配一个
# 使用 out 后只需要 temp 和 result 两个预分配的缓冲区

5.3 ufunc 的 reduce / accumulate / reduceat / outer 方法

每个 ufunc 都附带了四种额外方法，提供了灵活的操作维度：

a = np.arange(1, 6)  # [1, 2, 3, 4, 5]

# reduce：沿某个轴逐步归约
np.add.reduce(a)             # 1+2+3+4+5 = 15
np.multiply.reduce(a)        # 1*2*3*4*5 = 120

# accumulate：保留每一步的中间结果
np.add.accumulate(a)         # [1, 3, 6, 10, 15]
np.multiply.accumulate(a)    # [1, 2, 6, 24, 120]

# reduceat：在指定位置分片归约
idx = [0, 2, 4]
np.add.reduceat(a, idx)      # [a[0:2].sum(), a[2:4].sum(), a[4:].sum()] = [3, 7, 5]

# outer：计算两个数组的外积
np.multiply.outer(a, a)      # 5x5 乘法表
np.add.outer(a, a)           # 5x5 加法表

5.4 自定义 ufunc（frompyfunc / vectorize）

# 将普通 Python 函数「向量化」
def step_function(x, threshold):
    return 1 if x >= threshold else 0

vstep = np.vectorize(step_function, excluded=['threshold'])
a = np.random.randn(10)
print(vstep(a, threshold=0.5))

# 更底层：frompyfunc 返回一个 ufunc 对象
ufunc_step = np.frompyfunc(lambda x: 1 if x >= 0 else 0, 1, 1)
print(ufunc_step(np.arange(-3, 4)))

# 注意：vectorize 本质上仍是 Python 循环，只是语法上方便
# 真正的性能提升需要 numba 或 Cython

六、线性代数（numpy.linalg）

numpy.linalg 模块提供了标准的线性代数运算，底层一般链接到 BLAS/LAPACK 实现。

6.1 矩阵乘法

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

# 三种等价方式
C1 = np.dot(A, B)            # 传统写法
C2 = A.dot(B)                # 方法形式
C3 = A @ B                   # Python 3.5+ 的 @ 运算符（推荐）

# 注意一维数组的特殊行为
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
np.dot(x, y)     # 32（内积，结果是一个标量）
A.dot(x)         # (3,4) @ (4,) = (3,) 矩阵-向量乘

6.2 矩阵分解与求逆

A = np.array([[1, 2], [3, 4]])

# 矩阵求逆
A_inv = np.linalg.inv(A)
print(A_inv)
# [[-2.   1. ]
#  [ 1.5 -0.5]]
print(A @ A_inv)   # 接近单位矩阵（受限于浮点精度）

# 行列式
det = np.linalg.det(A)      # -2.0

# 矩阵的秩
rank = np.linalg.matrix_rank(A)

# 迹
trace = np.trace(A)          # 5
trace2 = np.linalg.trace(A)  # 别名

# 求解线性方程组 Ax = b
A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])
x = np.linalg.solve(A, b)     # x = [2, 3]
print(A @ x)                   # [9. 8.] ✓

# 最小二乘解（超定方程组）
A_over = np.array([[1, 1], [1, 2], [1, 3]])
b_over = np.array([2, 2.5, 3.5])
x_lsq, residuals, rank, sv = np.linalg.lstsq(A_over, b_over, rcond=None)

6.3 特征值与特征向量

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("特征值:", eigenvalues)
print("特征向量（每列一个）:\n", eigenvectors)

# 验证：A @ v_i = λ_i * v_i
for i in range(3):
    v_i = eigenvectors[:, i]
    lambda_i = eigenvalues[i]
    left = A @ v_i
    right = lambda_i * v_i
    print(f"λ_{i}: 残差 = {np.linalg.norm(left - right):.2e}")

# 实对称矩阵使用 eigh（更快，保证实数特征值）
S = np.array([[4, 1], [1, 3]])
w, v = np.linalg.eigh(S)      # 对对称/Hermitian 矩阵
print("特征值:", w)            # [2.38..., 4.61...]
print("特征向量正交:", np.abs(v[:,0] @ v[:,1]) < 1e-10)  # True

6.4 SVD（奇异值分解）

SVD 是数据科学中最重要的矩阵分解，广泛应用于 PCA、推荐系统、图像压缩等：

A = np.random.randn(5, 3)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

print("U.shape:", U.shape)        # (5, 3)
print("S:", S)                    # 3 个奇异值（已排序，从大到小）
print("Vt.shape:", Vt.shape)      # (3, 3)

# 重构验证
Sigma = np.diag(S)
A_reconstructed = U @ Sigma @ Vt
print("重构误差:", np.linalg.norm(A - A_reconstructed))

# 低秩近似：只用前 k 个奇异值
k = 2
A_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(f"秩-{k} 近似，相对误差: {np.linalg.norm(A - A_approx) / np.linalg.norm(A):.4f}")

6.5 其他 linalg 函数

# 范数
np.linalg.norm(A, ord='fro')       # Frobenius 范数
np.linalg.norm(A, ord=2)           # 谱范数（最大奇异值）
np.linalg.norm(A, ord=np.inf)      # 无穷范数

# 条件数
np.linalg.cond(A)                  # 2-范数条件数

# Cholesky 分解（正定矩阵）
L = np.linalg.cholesky(S)          # S = L @ L.T

# QR 分解
Q, R = np.linalg.qr(A)
print("Q 正交:", np.allclose(Q.T @ Q, np.eye(Q.shape[1])))

# 矩阵幂
np.linalg.matrix_power(A, 3)       # A @ A @ A

七、随机数生成 —— Generator API

自 NumPy 1.17 起，推荐使用 np.random.default_rng() 获得的 Generator 对象，替代旧的全局 np.random.* 函数。Generator API 底层使用更先进的算法（如 PCG64，SFC64，Philox），提供更好的统计属性和性能。

7.1 创建 Generator

# 推荐方式：使用 default_rng
rng = np.random.default_rng(42)    # 设置种子

# 也可以显式指定 BitGenerator
from numpy.random import PCG64, Generator
rng = Generator(PCG64(42))

# 线程安全的独立 Generator
from numpy.random import SeedSequence
ss = SeedSequence(12345)
child_seeds = ss.spawn(4)          # 产生 4 个独立种子
generators = [np.random.default_rng(s) for s in child_seeds]

7.2 主要分布

rng = np.random.default_rng(42)

# 均匀分布
rng.random((3, 4))                  # [0, 1)
rng.uniform(low=-1, high=2, size=(10,))  # [-1, 2)

# 整数
rng.integers(low=0, high=10, size=5)    # [0, 10) 
rng.integers(low=5, high=20, size=(3, 4))

# 正态分布
rng.standard_normal((100,))              # N(0, 1)
rng.normal(loc=5, scale=2, size=(50, 2)) # N(5, 2²)

# 其他分布
rng.exponential(scale=1.0, size=100)     # 指数分布
rng.poisson(lam=3, size=100)             # 泊松分布
rng.binomial(n=10, p=0.5, size=100)      # 二项分布
rng.gamma(shape=2, scale=1, size=100)    # Gamma 分布
rng.beta(a=2, b=5, size=100)             # Beta 分布
rng.chisquare(df=3, size=100)            # 卡方分布

# 排列与采样
a = np.arange(10)
rng.shuffle(a)                           # 原地打乱
permuted = rng.permutation(a)            # 返回打乱后的副本
samples = rng.choice(a, size=5, replace=False)  # 不放回采样
samples = rng.choice(a, size=5, replace=True, p=weights)  # 带权采样

八、形状变换与数组拼接

8.1 reshape, flatten, ravel

a = np.arange(24)

# reshape
b = a.reshape(2, 3, 4)      # 返回视图（如果可能）
c = a.reshape(3, 8)         # 返回视图
# 使用 -1 自动推断维度
d = a.reshape(-1, 6)        # (4, 6)
e = a.reshape(2, -1, 2)     # (2, 6, 2)

# reshape 返回视图还是副本？
# 规则：如果 reshape 的结果在内存中连续，则返回视图，否则返回副本
orig = np.arange(6).reshape(2, 3)
print(orig.flags['C_CONTIGUOUS'])  # True
view = orig.reshape(3, 2)
print(view.flags['C_CONTIGUOUS'])  # True → 是视图
view[0, 0] = 999
print(orig[0, 0])                   # 999 → 确实是视图！

# flatten：永远返回副本
flat_copy = a.reshape(2, 3, 4).flatten()
flat_copy[0] = 999                  # 不影响 a

# ravel：尽可能返回视图
ravel_view = a.reshape(2, 3, 4).ravel()
ravel_view[0] = 999                  # 可能影响原数组（取决于内存布局）

# -1 作为新维度的自动大小
auto = a.reshape(2, -1)              # (2, 12)，自动推断第二维

8.2 transpose 与 swapaxes

a = np.arange(24).reshape(2, 3, 4)

# 转置：反转所有轴
b = a.transpose()          # shape (4, 3, 2)
b = a.T                    # 等价写法

# 指定轴顺序
c = a.transpose(1, 0, 2)   # shape (3, 2, 4)：将 axis0 和 axis1 交换
d = a.transpose(2, 0, 1)   # shape (4, 2, 3)

# swapaxes：交换两个轴
e = a.swapaxes(0, 2)        # shape (4, 3, 2)

# moveaxis：将指定轴移到目标位置
f = np.moveaxis(a, source=0, destination=-1)  # (3, 4, 2)

8.3 拼接与堆叠

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# concatenate：沿已有轴拼接
np.concatenate([a, b], axis=0)    # (4, 2)
np.concatenate([a, b], axis=1)    # (2, 4)

# vstack / hstack：在垂直/水平方向堆叠
np.vstack([a, b])                  # (4, 2)，等价 concatenate axis=0
np.hstack([a, b])                  # (2, 4)，等价 concatenate axis=1

# stack：沿新轴堆叠
np.stack([a, b], axis=0)           # (2, 2, 2)
np.stack([a, b], axis=-1)          # (2, 2, 2)，新轴在最后

# dstack：沿深度（axis=2）堆叠
np.dstack([a, b])                  # (2, 2, 2)

# split / hsplit / vsplit：拆分
c = np.arange(12).reshape(3, 4)
np.split(c, 3, axis=0)             # 拆成 3 个 (1, 4) 数组
np.split(c, 2, axis=1)             # 拆成 2 个 (3, 2) 数组
first, second = np.split(c, [2], axis=1)  # 第0到1列、第2到3列
np.hsplit(c, 2)                    # 水平拆
np.vsplit(c, 3)                    # 垂直拆

# tile / repeat
np.tile([0, 1], 3)                 # [0, 1, 0, 1, 0, 1]
np.tile([0, 1], (2, 2))            # [[0,1,0,1], [0,1,0,1]]
np.repeat([0, 1], 3)               # [0, 0, 0, 1, 1, 1]
np.repeat([[1,2], [3,4]], 2, axis=0)  # [[1,2],[1,2],[3,4],[3,4]]
np.repeat([[1,2], [3,4]], 2, axis=1)  # [[1,1,2,2],[3,3,4,4]]

九、einsum —— 爱因斯坦求和约定

np.einsum 是 NumPy 中最强大也最优雅的函数之一。它使用爱因斯坦求和约定的记号，可以在一次调用中完成各种张量操作。虽然初看语法晦涩，但掌握后能大幅简化代码。

9.1 einsum 基本语法

基本形式：np.einsum(subscripts, *operands)

其中 subscripts 是一个字符串，用逗号分隔输入的下标，用 -> 分隔输出下标：

# 向量内积
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.einsum('i,i->', a, b)          # 32，标量输出

# 矩阵-向量乘法
A = np.random.randn(3, 4)
x = np.random.randn(4)
np.einsum('ij,j->i', A, x)        # 等价 A @ x

# 矩阵乘法
B = np.random.randn(4, 5)
np.einsum('ij,jk->ik', A, B)      # 等价 A @ B

# 外积
np.einsum('i,j->ij', a, b)        # 等价 a[:, np.newaxis] * b

9.2 常用 einsum 模式总结

# 转置
np.einsum('ij->ji', A)            # A.T

# 对角线
np.einsum('ii->i', A)             # np.diag(A)，只适用于方阵

# 迹（trace）
np.einsum('ii->', square_matrix)  # np.trace(square_matrix)

# 逐元素乘
np.einsum('ij,ij->ij', A, B)      # A * B

# 按行求和 / 按列求和
np.einsum('ij->i', A)             # A.sum(axis=1)
np.einsum('ij->j', A)             # A.sum(axis=0)

# 所有元素求和
np.einsum('ij->', A)              # A.sum()

# 批量矩阵乘法（batch matrix multiply）
# A: (b, i, k), B: (b, k, j)
np.einsum('bik,bkj->bij', A_batch, B_batch)

# 双线性形式 x^T A y
np.einsum('i,ij,j->', x, A, y)

9.3 einsum 高级：张量收缩

# 张量收缩：将两个张量的某些维度「收缩」掉
T1 = np.random.randn(3, 4, 5)
T2 = np.random.randn(5, 6, 4)

# 收缩维度 4 和维度 5（即 T1 的 (4,5) 与 T2 的 (5,6,4) 中的 5 和 4）
# 指定方式：共享下标表示收缩的维度
result = np.einsum('ijk,klj->il', T1, T2)
print(result.shape)  # (3, 6)

# 更直观的示例：将一个 rank-3 张量与矩阵收缩
tensor = np.random.randn(2, 3, 4)
matrix = np.random.randn(4, 5)
result = np.einsum('ijk,kl->ijl', tensor, matrix)
print(result.shape)  # (2, 3, 5)

9.4 einsum 的 optimize 参数

# 对于复杂表达式，设置 optimize=True 可以大幅加速
# NumPy 会自动寻找最优的收缩顺序
result = np.einsum('abcd,cdef,efgh->abgh', 
                    T1, T2, T3, 
                    optimize=True)

# 等价于 optimize='greedy'（追求更好性能时可以用 'optimal'）
result = np.einsum('abcd,cdef,efgh->abgh', 
                    T1, T2, T3, 
                    optimize='optimal')

十、内存布局：C order vs Fortran order

10.1 行优先与列优先

NumPy 支持两种内存布局：

C order（行优先，row-major）：最后一维的变化最快。默认布局。
Fortran order（列优先，column-major）：第一维的变化最快。

# 创建一个 2x3 矩阵，分别用 C 和 Fortran 布局
a_c = np.array([[1, 2, 3], [4, 5, 6]], order='C')
a_f = np.array([[1, 2, 3], [4, 5, 6]], order='F')

# 它们在 Python 层面的形状和值都相同
print(a_c)  # [[1 2 3] [4 5 6]]
print(a_f)  # [[1 2 3] [4 5 6]]

# 但在内存中排布不同：
# C: 1, 2, 3, 4, 5, 6  (行优先)
# F: 1, 4, 2, 5, 3, 6  (列优先)
print(a_c.flags['C_CONTIGUOUS'])  # True
print(a_f.flags['F_CONTIGUOUS'])  # True

10.2 布局对性能的影响

沿连续维度遍历通常快得多：

import time

size = 5000
a_c = np.random.randn(size, size)                # C 连续
a_f = np.asfortranarray(np.random.randn(size, size))  # F 连续

# 按行遍历（axis=1 连续）
start = time.perf_counter()
s = 0
for i in range(size):
    for j in range(size):
        s += a_c[i, j]          # C 连续 → 快
t_c = time.perf_counter() - start

start = time.perf_counter()
s = 0
for i in range(size):
    for j in range(size):
        s += a_f[i, j]          # F 连续，按行遍历 → 慢（跳跃）
t_f = time.perf_counter() - start

print(f"C 连续按行遍历: {t_c:.4f}s")
print(f"F 连续按行遍历: {t_f:.4f}s")
# 典型结果：C 连续按行遍历约为 F 连续按行遍历的 2-3 倍快

10.3 控制布局

# 创建时指定
a = np.ones((3, 4), order='F')

# 转换为另一布局
a_c = np.ascontiguousarray(a)       # 转为 C 连续
a_f = np.asfortranarray(a)          # 转为 F 连续

# reshape 中指定
b = a.reshape(2, 6, order='C')      # C 顺序 reshape
b = a.reshape(2, 6, order='F')      # Fortran 顺序 reshape
b = a.reshape(2, 6, order='A')      # 保留原布局

# 检查连续性
print(a.flags.c_contiguous)         # True/False
print(a.flags.f_contiguous)         # True/False
print(a.flags.owndata)              # 是否拥有数据（还是 view）

10.4 跨库互操作时的布局

当 NumPy 与 OpenCV、PyTorch、TensorFlow 等库交互时，内存布局至关重要：

# OpenCV 图像默认是 HWC 格式，uint8
# PyTorch 默认是 NCHW 格式
# 在它们之间转换时需要注意连续性和布局

# 从 OpenCV 到 PyTorch
import cv2
img = cv2.imread('photo.jpg')           # shape (H, W, 3), dtype uint8
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# 转为 float32 并确保 C 连续
img_float = np.ascontiguousarray(img_rgb.astype(np.float32))
# 归一化后转 PyTorch tensor
# tensor = torch.from_numpy(img_float).permute(2, 0, 1).unsqueeze(0)

十一、向量化 vs 循环 —— 性能实证

11.1 基准测试框架

import time
import numpy as np

def benchmark(func, *args, n_runs=10, **kwargs):
    """运行 n_runs 次，取中位数时间"""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        func(*args, **kwargs)
        times.append(time.perf_counter() - start)
    return np.median(times)  # 用中位数避免极值干扰

def python_loop_sum_of_squares(n):
    s = 0
    for i in range(n):
        s += i ** 2
    return s

def numpy_vectorized_sum_of_squares(n):
    return np.sum(np.arange(n) ** 2)

11.2 不同规模下的性能对比

sizes = [1000, 10000, 100000, 1000000]
results = []

for n in sizes:
    t_loop = benchmark(lambda: sum(i**2 for i in range(n)), n_runs=3)
    t_numpy = benchmark(lambda: np.sum(np.arange(n)**2), n_runs=10)
    results.append((n, t_loop, t_numpy))

print(f"{'Size':>12} {'Loop (ms)':>12} {'NumPy (ms)':>12} {'Speedup':>10}")
for n, tl, tn in results:
    print(f"{n:>12} {tl*1000:>12.3f} {tn*1000:>12.3f} {tl/tn:>10.1f}x")

# 典型输出：
#         Size    Loop (ms)   NumPy (ms)    Speedup
#         1000        0.089        0.005       17.8x
#        10000        0.918        0.008      114.8x
#       100000        9.325        0.037      252.0x
#      1000000       97.321        0.345      282.1x

11.3 避免 Python 循环的常见模式

# 模式 1：条件过滤
# 慢
result = []
for x in data:
    if x > 0:
        result.append(x)
# 快
result = data[data > 0]

# 模式 2：查找替换
# 慢
for i in range(len(data)):
    if data[i] < 0:
        data[i] = 0
# 快
data[data < 0] = 0
# 或
data = np.where(data < 0, 0, data)

# 模式 3：分段函数
# 慢
result = np.empty_like(data)
for i in range(len(data)):
    if data[i] < 0:
        result[i] = 0
    elif data[i] > 1:
        result[i] = 1
    else:
        result[i] = data[i]
# 快：np.clip
result = np.clip(data, 0, 1)
# 或 np.where 嵌套
result = np.where(data < 0, 0, np.where(data > 1, 1, data))

# 模式 4：累积运算
# 慢
cumulative = [0]
for x in data:
    cumulative.append(cumulative[-1] + x)
# 快
cumulative = np.cumsum(data)

# 模式 5：含前后依赖的运算时使用 numba
# NumPy 无法向量化前后依赖的循环时，就该用 numba
from numba import jit

@jit(nopython=True)
def compute_with_dependency(data):
    result = np.empty_like(data)
    result[0] = data[0]
    for i in range(1, len(data)):
        result[i] = result[i-1] * 0.9 + data[i]
    return result

十二、完整示例一：PCA 从零实现

主成分分析（PCA）是数据降维的核心方法。以下用 NumPy 完整实现，展示 linalg.eig、矩阵运算、广播等核心 API 的实际运用：

import numpy as np

class PCA:
    """从零实现的 PCA，使用特征分解"""
    
    def __init__(self, n_components):
        self.n_components = n_components
        self.components_ = None           # 主成分方向 (n_components, n_features)
        self.mean_ = None                 # 均值向量
        self.explained_variance_ = None   # 各主成分的方差（特征值降序）
        self.explained_variance_ratio_ = None  # 方差解释率
    
    def fit(self, X):
        """
        拟合 PCA
        X: shape (n_samples, n_features)
        """
        n_samples, n_features = X.shape
        
        # 1. 去中心化（每一维减去该维的均值）
        self.mean_ = X.mean(axis=0)
        X_centered = X - self.mean_
        
        # 2. 计算协方差矩阵：C = (1/n) * X^T X
        covariance = (X_centered.T @ X_centered) / (n_samples - 1)  # 无偏估计
        
        # 3. 对协方差矩阵做特征分解
        eigenvalues, eigenvectors = np.linalg.eigh(covariance)
        # eigh 返回升序排列，我们需要降序
        eigenvalues = eigenvalues[::-1]
        eigenvectors = eigenvectors[:, ::-1]
        
        # 4. 选取前 n_components 个主成分
        self.components_ = eigenvectors[:, :self.n_components].T  # (k, d)
        self.explained_variance_ = eigenvalues[:self.n_components]
        self.explained_variance_ratio_ = (
            self.explained_variance_ / eigenvalues.sum()
        )
        
        return self
    
    def transform(self, X):
        """将数据投影到低维空间"""
        X_centered = X - self.mean_
        return X_centered @ self.components_.T  # (n, d) @ (d, k) = (n, k)
    
    def fit_transform(self, X):
        """一步完成拟合 + 转换"""
        self.fit(X)
        return self.transform(X)
    
    def inverse_transform(self, Z):
        """从低维表示重构原始数据"""
        # Z: (n, k), components_: (k, d)
        return Z @ self.components_ + self.mean_


# ---- 使用示例：对鸢尾花数据集做 PCA ----
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data           # (150, 4)
y = iris.target

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

print(f"原始形状: {X.shape}")                     # (150, 4)
print(f"降维后形状: {Z.shape}")                   # (150, 2)
print(f"方差解释率: {pca.explained_variance_ratio_}")  
# 典型结果: [0.9246, 0.0530]
print(f"累积方差解释率: {pca.explained_variance_ratio_.sum():.4f}")  
# ~0.9776，即前 2 个主成分保留了约 97.8% 的信息

# 与 sklearn 结果对比验证
from sklearn.decomposition import PCA as SklearnPCA
pca_sk = SklearnPCA(n_components=2)
Z_sk = pca_sk.fit_transform(X)
print(f"与 sklearn 投影结果差异（绝对值最大）: {np.abs(Z_sk - Z).max():.6e}")
# 注意：符号可能相反（特征向量的方向可以反转），但绝对值应很小

12.1 PCA 的 SVD 实现版本

上面的实现使用协方差矩阵的特征分解。更稳定的做法是直接对数据中心化后的矩阵做 SVD：

class PCA_SVD:
    """基于 SVD 的 PCA 实现"""
    
    def __init__(self, n_components):
        self.n_components = n_components
        self.components_ = None
        self.mean_ = None
        self.explained_variance_ = None
        self.explained_variance_ratio_ = None
    
    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        X_centered = X - self.mean_
        
        # 对去中心化的矩阵做 SVD
        # X_centered = U @ diag(S) @ Vt
        # Vt 的行就是主成分方向
        U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
        
        n_samples = X.shape[0]
        # 特征值 = S^2 / (n-1)
        eigenvalues = S**2 / (n_samples - 1)
        
        self.components_ = Vt[:self.n_components]  # (k, d)
        self.explained_variance_ = eigenvalues[:self.n_components]
        self.explained_variance_ratio_ = (
            self.explained_variance_ / eigenvalues.sum()
        )
        
        return self
    
    def transform(self, X):
        X_centered = X - self.mean_
        return X_centered @ self.components_.T
    
    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)
    
    def inverse_transform(self, Z):
        return Z @ self.components_ + self.mean_

十三、完整示例二：图像卷积

使用 NumPy 实现二维卷积，展示 strides（步幅）和窗口滑动操作。这个例子也揭示了 NumPy 对局部运算的支持方式：

import numpy as np

def conv2d_vectorized(image, kernel):
    """
    向量化实现的二维卷积（valid 模式，无 padding）
    
    参数:
        image:  shape (H, W)
        kernel: shape (kH, kW)
    返回:
        conv:   shape (H - kH + 1, W - kW + 1)
    """
    H, W = image.shape
    kH, kW = kernel.shape
    out_H = H - kH + 1
    out_W = W - kW + 1
    
    # 使用 as_strided 创建滑动窗口视图（避免复制数据）
    # strides 表示在每个维度上移动一个单位需要跳过的字节数
    from numpy.lib.stride_tricks import as_strided
    
    # image 的 strides（字节）
    item_size = image.itemsize
    # 创建一个形状为 (out_H, out_W, kH, kW) 的视图
    # 每个 (i, j, :, :) 就是 image 中以 (i, j) 为起点的一个 kH x kW 窗口
    shape = (out_H, out_W, kH, kW)
    strides = (image.strides[0],      # 移动到下一行
               image.strides[1],      # 移动到下一列
               image.strides[0],      # 窗口内下一行 → 同原始行步幅
               image.strides[1])      # 窗口内下一列 → 同原始列步幅
    
    windows = as_strided(image, shape=shape, strides=strides)
    
    # 卷积 = 逐窗口与 kernel 的逐元素乘 + 求和
    # windows: (out_H, out_W, kH, kW)
    # kernel:  (kH, kW) → 广播为 (1, 1, kH, kW)
    conv = np.sum(windows * kernel, axis=(2, 3))
    
    return conv


# ---- 使用示例 ----
# 创建一个测试图像（有清晰边缘）
image = np.array([
    [10, 10, 10, 10, 10],
    [10, 50, 50, 50, 10],
    [10, 50, 100, 50, 10],
    [10, 50, 50, 50, 10],
    [10, 10, 10, 10, 10],
], dtype=np.float64)

# Sobel 水平边缘检测核
sobel_x = np.array([[-1, 0, 1],
                     [-2, 0, 2],
                     [-1, 0, 1]], dtype=np.float64)

# 高斯模糊核
gaussian = np.array([[1, 2, 1],
                      [2, 4, 2],
                      [1, 2, 1]], dtype=np.float64) / 16.0

result_sobel = conv2d_vectorized(image, sobel_x)
result_gauss = conv2d_vectorized(image, gaussian)

print("原始图像:\n", image)
print("\nSobel 水平边缘:\n", result_sobel)
print("\n高斯模糊:\n", result_gauss)

as_strided 创建的窗口视图仅是对原始数据的重新解释（零拷贝），因此即使图像很大，内存开销也极低。但使用 as_strided 时必须小心：如果 strides 设置不当，可能访问到越界内存。

13.1 使用 scipy 的卷积进行验证

from scipy.signal import convolve2d

# 验证我们的实现（valid 模式）
result_scipy = convolve2d(image, sobel_x, mode='valid')
print("与 scipy 的差异:", np.abs(result_sobel - result_scipy).max())
# 输出：0.0 —— 完全一致

十四、完整示例三：蒙特卡洛求圆周率

蒙特卡洛方法通过随机采样估计确定性量。求圆周率是经典的入门示例，但我们将展示如何用 NumPy 的批量运算将其效率推向极致：

import numpy as np
import time

def monte_carlo_pi(n_points, rng=None):
    """
    蒙特卡洛法估计 π
    
    思路：在 [-1, 1] x [-1, 1] 正方形内均匀撒点，
    落在单位圆内的点的比例应趋近于 π/4
    """
    if rng is None:
        rng = np.random.default_rng()
    
    # 批量生成所有点（一次性，避免 Python 循环）
    x = rng.uniform(-1, 1, n_points)
    y = rng.uniform(-1, 1, n_points)
    
    # 向量化判断：点是否在单位圆内
    inside = x**2 + y**2 <= 1.0
    n_inside = np.sum(inside)
    
    pi_estimate = 4.0 * n_inside / n_points
    
    return pi_estimate, n_inside, inside


def monte_carlo_pi_batched(total_points, batch_size=10_000_000, rng=None):
    """
    分批计算，避免一次分配过多内存
    """
    if rng is None:
        rng = np.random.default_rng()
    
    n_inside_total = 0
    remaining = total_points
    
    while remaining > 0:
        n_this_batch = min(remaining, batch_size)
        _, n_inside, _ = monte_carlo_pi(n_this_batch, rng)
        n_inside_total += n_inside
        remaining -= n_this_batch
    
    return 4.0 * n_inside_total / total_points


# ---- 实验 ----
for n in [1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    start = time.perf_counter()
    pi_est, n_inside, _ = monte_carlo_pi(n)
    elapsed = time.perf_counter() - start
    error = abs(pi_est - np.pi)
    print(f"n={n:>12,}  估计值={pi_est:.8f}  误差={error:.8f}  耗时={elapsed:.4f}s")

# 典型输出：
# n=       1,000  估计值=3.14400000  误差=0.00240735  耗时=0.0001s
# n=      10,000  估计值=3.14160000  误差=0.00000735  耗时=0.0001s
# n=     100,000  估计值=3.14148000  误差=0.00011265  耗时=0.0010s
# n=   1,000,000  估计值=3.14155600  误差=0.00003665  耗时=0.0072s
# n=  10,000,000  估计值=3.14159040  误差=0.00000225  耗时=0.0715s

这个例子完美展示了 NumPy 的优势：10,000,000 个点的运算在约 0.07 秒内完成（含随机数生成）。纯 Python 循环实现同样的逻辑将需要数十秒。

十五、实用函数与技巧

15.1 统计函数

a = np.random.randn(1000)

np.mean(a)               # 均值
np.median(a)             # 中位数
np.std(a)                # 标准差（默认 ddof=0，即总体标准差）
np.std(a, ddof=1)        # 样本标准差
np.var(a)                # 方差
np.percentile(a, 50)     # 第 50 百分位（中位数）
np.percentile(a, [25, 50, 75])  # 四分位数
np.quantile(a, [0.25, 0.5, 0.75])

# 沿指定轴的统计
b = np.random.randn(5, 10)
b.mean(axis=0)           # (10,) 每列均值
b.mean(axis=1)           # (5,)  每行均值
b.mean(axis=0, keepdims=True)  # (1, 10) 保持维度

# 累积统计
np.cumsum(a)             # 累积和
np.cumprod(a)            # 累积积

# 极值
np.min(a), np.max(a)
np.argmin(a), np.argmax(a)   # 极值的索引
np.ptp(a)                    # peak-to-peak = max - min

15.2 排序与搜索

a = np.array([3, 1, 2, 5, 4])

np.sort(a)                        # [1, 2, 3, 4, 5] 返回副本
a.sort()                          # 原地排序
sorted_indices = np.argsort(a)    # 排序后各元素的原始索引

# 按某一列排序（间接排序）
data = np.array([[2, 7], [1, 3], [4, 5]])
idx = np.argsort(data[:, 0])      # 按第一列排序的索引
sorted_data = data[idx]           # [[1,3], [2,7], [4,5]]

# 部分排序
k = 3
top_k_values = np.partition(a, -k)[-k:]        # 最大的 k 个（不一定排序）
top_k_indices = np.argpartition(a, -k)[-k:]    # 最大的 k 个的索引

# 搜索
np.searchsorted([1,2,3,4,5], 3.5)     # 3（插入位置，保持有序）
np.searchsorted([1,2,3,4,5], [0, 2.5, 6])  # [0, 2, 5]

15.3 集合运算

a = np.array([1, 2, 3, 4, 5])
b = np.array([3, 4, 5, 6, 7])

np.intersect1d(a, b)         # [3, 4, 5]
np.union1d(a, b)             # [1, 2, 3, 4, 5, 6, 7]
np.setdiff1d(a, b)           # [1, 2]（在 a 但不在 b）
np.setxor1d(a, b)            # [1, 2, 6, 7]（对称差）
np.in1d(a, b)                # [False, False, True, True, True]（a 的元素是否在 b 中）
np.isin(a, b)                # 同 in1d，推荐用法

np.unique(a)                 # 去重并排序
np.unique(a, return_counts=True)   # 返回 (去重值, 计数)
np.unique(a, return_index=True)    # 返回 (去重值, 首次出现索引)
np.unique(a, return_inverse=True)  # 返回 (去重值, 原始元素对应的唯一值索引)

15.4 保存与加载

# 二进制格式（.npy 单数组，.npz 多数组）
np.save('array.npy', a)
a_loaded = np.load('array.npy')

# 多数组打包
np.savez('arrays.npz', x=x_data, y=y_data, params=params)
data = np.load('arrays.npz')
print(data['x'], data['y'], data['params'])

# 压缩版（适合大数组）
np.savez_compressed('arrays_comp.npz', x=x_data, y=y_data)

# 文本格式
np.savetxt('data.csv', data, delimiter=',', header='col1,col2,col3')
loaded = np.loadtxt('data.csv', delimiter=',', skiprows=1)

15.5 内存映射（memmap）

对于无法完整加载到内存的超大数组，可以使用 np.memmap 将磁盘上的二进制文件映射到虚拟内存：

# 创建内存映射数组
mmap = np.memmap('large_array.dat', dtype='float64', mode='w+', shape=(100000, 1000))
mmap[0, :] = np.arange(1000)          # 操作如同普通 ndarray
mmap.flush()                           # 确保写入磁盘

# 只读加载
mmap_ro = np.memmap('large_array.dat', dtype='float64', mode='r', shape=(100000, 1000))
print(mmap_ro[0, :5])                  # 操作系统按需加载页面

del mmap  # 确保关闭（或使用上下文管理器）

十六、常见陷阱与最佳实践

16.1 视图 vs 副本

a = np.array([1, 2, 3, 4])
b = a[1:3]           # b 是 a 的视图
b[0] = 99
print(a)              # [1, 99, 3, 4] —— a 被修改了！

# 需要副本时显式调用 .copy()
b = a[1:3].copy()
b[0] = 99
print(a)              # [1, 2, 3, 4] —— 安全

# 花式索引返回副本
c = a[[1, 2]]         # 副本
c[0] = 99
print(a)              # [1, 2, 3, 4] —— 未影响

16.2 布尔索引中的括号

a = np.array([1, 2, 3, 4, 5])

# 正确：每个条件用括号括起来
mask = (a > 2) & (a < 5)       # [False, False, True, True, False]

# 错误：运算符优先级问题
# mask = a > 2 & a < 5         # 等价 a > (2 & a) < 5，语法错误

16.3 浮点比较

a = np.array([0.1 + 0.2])           # 0.30000000000000004
print(a == 0.3)                      # [False]

# 使用 np.isclose
print(np.isclose(a, 0.3))            # [True]
print(np.isclose(a, 0.3, atol=1e-8, rtol=1e-5))

# 或 np.allclose
print(np.allclose(a, 0.3))           # True

16.4 axis 参数的理解

axis 指定了操作沿着哪个轴（维度）进行。关键记忆法：「axis=i 意味着沿着第 i 个轴做归约，该维度被消除」。

a = np.arange(24).reshape(2, 3, 4)  # shape (2, 3, 4)

a.sum(axis=0)    # shape (3, 4)，将 axis0（大小为 2）归约掉
a.sum(axis=1)    # shape (2, 4)，将 axis1（大小为 3）归约掉
a.sum(axis=2)    # shape (2, 3)，将 axis2（大小为 4）归约掉
a.sum(axis=(0, 2))  # shape (3,)，将 axis0 和 axis2 同时归约掉

16.5 不要滥用 np.append

np.append 每次调用都会分配新的数组并复制数据，在循环中使用会导致 O(n²) 的时间复杂度：

# 错误做法
result = np.array([])
for x in data:
    result = np.append(result, x)    # 每次复制全部数据

# 正确做法：先用 list 收集，最后一次性转换
result = []
for x in data:
    result.append(x)
result = np.array(result)

十七、总结

NumPy 是 Python 科学计算生态的基石。其核心价值在于：

ndarray：同构、多维、内存连续的数据容器。
广播（broadcasting）：无需显式循环即可对不同形状数组做逐元素运算。
通用函数（ufuncs）：编译层实现的向量化运算，性能比 Python 循环高 1-2 个数量级。
线性代数（linalg）：基于 BLAS/LAPACK 的标准矩阵运算，支持特征分解、SVD、求解线性方程组等。
einsum：用爱因斯坦求和约定优雅地表达各种张量运算。
灵活的索引：基本索引（视图）、花式索引（副本）、布尔索引（筛选），满足各种数据访问需求。

掌握 NumPy 意味着你能够在 Python 中写出既简洁又高性能的数值计算代码。当 NumPy 的向量化不足以表达某些复杂算法（如前后依赖的循环）时，需要考虑 Numba / Cython 作为补充方案，但绝大多数数据处理任务都可以在 NumPy 内高效完成。

参考资料：

NumPy 官方文档：https://numpy.org/doc/stable/
NumPy 用户指南（broadcasting）：https://numpy.org/doc/stable/user/basics.broadcasting.html
NumPy einsum 详解：https://numpy.org/doc/stable/reference/generated/numpy.einsum.html
Harris, C.R., et al. “Array programming with NumPy.” Nature 585 (2020): 357-362.