# Matrix Differentiation

## 1. Independence Assumption

The entries of a vector or matrix variable are assumed to be mutually independent, so that

$\begin{array}{c} \frac{\partial x_i}{\partial x_j} = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & \text{otherwise} \end{cases} \\ \frac{\partial x_{kl}}{\partial x_{ij}} = \delta_{ki} \delta_{lj} = \begin{cases} 1 & k = i \text{ and } l = j \\ 0 & \text{otherwise} \end{cases} \end{array}$
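As a small worked illustration of how this assumption is used (the constant vector $\boldsymbol{b}$ and constant matrix $\boldsymbol{A}$ are introduced here only for the example), the Kronecker deltas collapse the sums when a linear function is differentiated:

$\begin{array}{c} \frac{\partial (\boldsymbol{b}^T \boldsymbol{x})}{\partial x_j} = \sum_i b_i \frac{\partial x_i}{\partial x_j} = \sum_i b_i \delta_{ij} = b_j \\ \frac{\partial\, \mathrm{tr}(\boldsymbol{A}^T \boldsymbol{X})}{\partial x_{ij}} = \sum_k \sum_l a_{kl} \frac{\partial x_{kl}}{\partial x_{ij}} = \sum_k \sum_l a_{kl} \delta_{ki} \delta_{lj} = a_{ij} \end{array}$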

## 2. Definitions

### 2.1 Vectorization of a Matrix

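• Column vectorization: the vectorization $\mathrm{vec}(\boldsymbol{A})$ of a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ is a linear transformation that stacks the elements of $\boldsymbol{A} = [a_{ij}]$ column by column into an $mn \times 1$ vector, that is,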

$\begin{array}{c} \mathrm{vec}(\boldsymbol{A}) = [a_{11}, \cdots, a_{m1}, \cdots, a_{1n}, \cdots, a_{mn}]^T \end{array}$

• Row vectorization: similarly, the row vectorization of $\boldsymbol{A}$ concatenates its rows into a $1 \times mn$ row vector:

$\begin{array}{c} \mathrm{rvec}(\boldsymbol{A}) = [a_{11}, \cdots, a_{1n}, \cdots, a_{m1}, \cdots, a_{mn}] \end{array}$

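### 2.2 Matricization of a Vector

The inverse operation reshapes an $mn \times 1$ vector $\boldsymbol{a}$ column by column into an $m \times n$ matrix: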

$\begin{array}{c} \boldsymbol{A}_{m \times n} = \mathrm{unvec}_{m, n}(\boldsymbol{a}) = \left[ \begin{matrix} a_1 & a_{m+1} & \cdots & a_{m(n-1)+1} \\ a_2 & a_{m+2} & \cdots & a_{m(n-1)+2} \\ \vdots & \vdots & \ddots & \vdots \\ a_m & a_{2m} & \cdots & a_{mn} \end{matrix} \right] \end{array}$
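The three operations in 2.1 and 2.2 can be sketched in a few lines. This is a minimal illustration, assuming matrices are stored as plain Python lists of rows; the function names `vec`, `rvec`, and `unvec` simply mirror the notation above and are not taken from any library.

```python
def vec(A):
    """Stack the columns of A (a list of rows) into one flat list."""
    m, n = len(A), len(A[0])
    return [A[i][j] for j in range(n) for i in range(m)]


def rvec(A):
    """Concatenate the rows of A into one flat list."""
    return [a for row in A for a in row]


def unvec(a, m, n):
    """Rebuild the m x n matrix whose column-wise vectorization is a."""
    if len(a) != m * n:
        raise ValueError("len(a) must equal m * n")
    return [[a[j * m + i] for j in range(n)] for i in range(m)]


A = [[1, 2, 3],
     [4, 5, 6]]

print(vec(A))                      # [1, 4, 2, 5, 3, 6]
print(rvec(A))                     # [1, 2, 3, 4, 5, 6]
print(unvec(vec(A), 2, 3) == A)    # True
```

Note that `unvec` undoes `vec` only when it is given the same shape $(m, n)$ that the original matrix had.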

### 2.3 Partial Derivative Operators

• The $1 \times m$ row-vector partial derivative operator is denoted

$\begin{array}{c} D_{\boldsymbol{x}} \overset{\mathrm{def}}{=} \frac{\partial}{\partial \boldsymbol{x}^T} = [ \frac{\partial}{\partial x_1}, \cdots, \frac{\partial}{\partial x_m} ] \end{array}$

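• For a matrix variable $\boldsymbol{X} \in \mathbb{R}^{m \times n}$, the partial derivative operator can be defined in two ways, as an $n \times m$ matrix operator or as a $1 \times mn$ row-vector operator, denoted respectively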

$\begin{array}{c} D_{\boldsymbol{X}} = \left[ \begin{matrix} \frac{\partial}{\partial x_{11}} & \cdots & \frac{\partial}{\partial x_{m1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_{1n}} & \cdots & \frac{\partial}{\partial x_{mn}} \end{matrix} \right] \end{array}$

$\begin{array}{c} D_{\mathrm{rvec} \boldsymbol{X}} = [\frac{\partial}{\partial x_{11}}, \cdots, \frac{\partial}{\partial x_{m1}}, \cdots, \frac{\partial}{\partial x_{1n}}, \cdots, \frac{\partial}{\partial x_{mn}}] \end{array}$

$\begin{array}{c} D_{\mathrm{rvec}\boldsymbol{X}} = \mathrm{rvec}(D_{\boldsymbol{X}}) = (\mathrm{vec}(D_{\boldsymbol{X}}^T))^T \end{array}$

• The $m \times 1$ column-vector partial derivative operator (customarily called the gradient operator) is denoted

$\begin{array}{c} \nabla_{\boldsymbol{x}} \overset{\mathrm{def}}{=} \frac{\partial}{\partial \boldsymbol{x}} = [\frac{\partial}{\partial x_1}, \cdots, \frac{\partial}{\partial x_m}]^T \end{array}$

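• For a matrix variable $\boldsymbol{X} \in \mathbb{R}^{m \times n}$, the gradient operator can likewise be defined in two ways, as an $m \times n$ matrix operator or as an $mn \times 1$ column-vector operator, denoted respectively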

$\begin{array}{c} \nabla_{\boldsymbol{X}} = \left[ \begin{matrix} \frac{\partial}{\partial x_{11}} & \cdots & \frac{\partial}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_{m1}} & \cdots & \frac{\partial}{\partial x_{mn}} \end{matrix} \right] \end{array}$

$\begin{array}{c} \nabla_{\mathrm{vec}\boldsymbol{X}} = \frac{\partial}{\partial\, \mathrm{vec}\boldsymbol{X}} = [\frac{\partial}{\partial x_{11}}, \cdots, \frac{\partial}{\partial x_{m1}}, \cdots, \frac{\partial}{\partial x_{1n}}, \cdots, \frac{\partial}{\partial x_{mn}}]^T \end{array}$

$\begin{array}{c} \nabla_{\mathrm{vec} \boldsymbol{X}} = \mathrm{vec}(\nabla_{\boldsymbol{X}}) \end{array}$

Each gradient operator is the transpose of the corresponding partial derivative operator:

$\begin{array}{c} \nabla_{\boldsymbol{x}} = D_{\boldsymbol{x}}^T \\ \nabla_{\boldsymbol{X}} = D_{\boldsymbol{X}}^T \\ \nabla_{\mathrm{vec}\boldsymbol{X}} = D_{\mathrm{rvec}\boldsymbol{X}}^T \end{array}$

### 2.4 Partial Derivatives of Real-Valued Scalar Functions

• The row-vector partial derivative of a real-valued scalar function $f(\boldsymbol{x})$ with respect to the vector variable $\boldsymbol{x}$ is

$\begin{array}{c} D_{\boldsymbol{x}} f(\boldsymbol{x}) \overset{\mathrm{def}}{=} \frac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}^T} = [\frac{\partial f(\boldsymbol{x})}{\partial x_1}, \cdots, \frac{\partial f(\boldsymbol{x})}{\partial x_m}] \end{array}$

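• The column-vector partial derivative (the gradient vector) of a real-valued scalar function $f(\boldsymbol{x})$ with respect to the vector variable $\boldsymbol{x}$ is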

$\begin{array}{c} \nabla_{\boldsymbol{x}} f(\boldsymbol{x}) \overset{\mathrm{def}}{=} \frac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} = [\frac{\partial f(\boldsymbol{x})}{\partial x_1}, \cdots, \frac{\partial f(\boldsymbol{x})}{\partial x_m}]^T \end{array}$

• The Jacobian matrix of a real-valued scalar function $f(\boldsymbol{X})$ with respect to the matrix variable $\boldsymbol{X}$ is

$\begin{array}{c} D_{\boldsymbol{X}} f(\boldsymbol{X}) = \frac{\partial f(\boldsymbol{X})}{\partial \boldsymbol{X}^T} = \left[ \begin{matrix} \frac{\partial f(\boldsymbol{X})}{\partial x_{11}} & \cdots & \frac{\partial f(\boldsymbol{X})}{\partial x_{m1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\boldsymbol{X})}{\partial x_{1n}} & \cdots & \frac{\partial f(\boldsymbol{X})}{\partial x_{mn}} \end{matrix} \right] \end{array}$

$\begin{array}{c} D_{\mathrm{rvec} \boldsymbol{X}} f(\boldsymbol{X}) = [\frac{\partial f(\boldsymbol{X})}{\partial x_{11}}, \cdots, \frac{\partial f(\boldsymbol{X})}{\partial x_{m1}}, \cdots, \frac{\partial f(\boldsymbol{X})}{\partial x_{1n}}, \cdots, \frac{\partial f(\boldsymbol{X})}{\partial x_{mn}}] \end{array}$

• The gradient matrix of a real-valued scalar function $f(\boldsymbol{X})$ with respect to the matrix variable $\boldsymbol{X}$ is

$\begin{array}{c} \nabla_{\boldsymbol{X}} f(\boldsymbol{X}) = \frac{\partial f(\boldsymbol{X})}{\partial \boldsymbol{X}} = \left[ \begin{matrix} \frac{\partial f(\boldsymbol{X})}{\partial x_{11}} & \cdots & \frac{\partial f(\boldsymbol{X})}{\partial x_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\boldsymbol{X})}{\partial x_{m1}} & \cdots & \frac{\partial f(\boldsymbol{X})}{\partial x_{mn}} \end{matrix} \right] \end{array}$

$\begin{array}{c} \nabla_{\mathrm{vec}\boldsymbol{X}} f(\boldsymbol{X}) = \frac{\partial f(\boldsymbol{X})}{\partial\, \mathrm{vec}\boldsymbol{X}} = [\frac{\partial f(\boldsymbol{X})}{\partial x_{11}}, \cdots, \frac{\partial f(\boldsymbol{X})}{\partial x_{m1}}, \cdots, \frac{\partial f(\boldsymbol{X})}{\partial x_{1n}}, \cdots, \frac{\partial f(\boldsymbol{X})}{\partial x_{mn}}]^T \end{array}$
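The four layouts in this subsection can be compared numerically. The following is a minimal sketch, assuming plain Python lists as matrices and an illustrative function $f(\boldsymbol{X}) = \sum_{i,j} x_{ij}^2$ that does not come from the text; `num_grad`, `transpose`, and `vec` are hypothetical helper names. It estimates the gradient matrix by central differences and derives the Jacobian matrix and both vectorized forms from it.

```python
def num_grad(f, X, h=1e-6):
    """Central-difference estimate of the m x n gradient matrix [df/dx_ij]."""
    m, n = len(X), len(X[0])
    G = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            Xp = [row[:] for row in X]   # copy with x_ij + h
            Xm = [row[:] for row in X]   # copy with x_ij - h
            Xp[i][j] += h
            Xm[i][j] -= h
            G[i][j] = (f(Xp) - f(Xm)) / (2 * h)
    return G


def transpose(A):
    return [list(col) for col in zip(*A)]


def vec(A):
    """Column-wise vectorization of a list-of-rows matrix."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]


def f(X):
    """Illustrative scalar function: sum of squared entries (gradient 2X)."""
    return sum(x * x for row in X for x in row)


X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]                       # m = 2, n = 3

grad = num_grad(f, X)                       # gradient matrix, m x n, close to 2X
jac = transpose(grad)                       # Jacobian matrix, n x m
grad_vec = vec(grad)                        # vectorized gradient, length mn
jac_rvec = [a for row in jac for a in row]  # row vectorization of the Jacobian

print(grad)
print(jac_rvec == grad_vec)                 # True: same entries, same order
```

The last line reflects that the row vectorization of the Jacobian matrix and the column vectorization of the gradient matrix list the partial derivatives in the same order and differ only by a transpose.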

### 2.5 Partial Derivatives of Real-Valued Matrix Functions

• For a real-valued matrix function $\boldsymbol{F}(\boldsymbol{X}) \in \mathbb{R}^{p \times q}$ of the matrix variable $\boldsymbol{X} \in \mathbb{R}^{m \times n}$, the Jacobian matrix and the gradient matrix are defined as

$\begin{array}{c} D_{\boldsymbol{X}} \boldsymbol{F}(\boldsymbol{X}) = \frac{\partial\, \mathrm{vec}(\boldsymbol{F}(\boldsymbol{X}))}{\partial (\mathrm{vec}\boldsymbol{X})^T} \in \mathbb{R}^{pq \times mn} \end{array}$

$\begin{array}{c} \nabla_{\boldsymbol{X}} \boldsymbol{F}(\boldsymbol{X}) = \frac{\partial (\mathrm{vec}(\boldsymbol{F}(\boldsymbol{X})))^T}{\partial\, \mathrm{vec}\boldsymbol{X}} \in \mathbb{R}^{mn \times pq} \end{array}$

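[Note] The Jacobian matrix and gradient matrix of a real-valued matrix function do not match the original definitions of the operators $D_{\boldsymbol{X}}$ and $\nabla_{\boldsymbol{X}}$; the same operator symbols as for real-valued scalar functions are kept only for notational consistency.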
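To make the layout concrete (this is spelled out from the column-wise vectorization in 2.1 and is not an additional definition): $f_{kl}$ occupies position $(l-1)p + k$ of $\mathrm{vec}\,\boldsymbol{F}$ and $x_{ij}$ occupies position $(j-1)m + i$ of $\mathrm{vec}\,\boldsymbol{X}$, so the entries of the Jacobian matrix are

$\begin{array}{c} [D_{\boldsymbol{X}} \boldsymbol{F}(\boldsymbol{X})]_{(l-1)p+k,\ (j-1)m+i} = \frac{\partial f_{kl}(\boldsymbol{X})}{\partial x_{ij}}, \quad k = 1, \cdots, p, \ l = 1, \cdots, q, \ i = 1, \cdots, m, \ j = 1, \cdots, n \end{array}$

and the gradient matrix is the transpose of this array.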

## 3. Properties

### 3.1 Basic Rules

• If $f(\boldsymbol{X}) = c$ is a constant, where $\boldsymbol{X}$ is an $m \times n$ matrix, then the gradient is $\frac{\partial c}{\partial \boldsymbol{X}} = \boldsymbol{0}_{m \times n}$.

• Linearity: if $f(\boldsymbol{X})$ and $g(\boldsymbol{X})$ are real-valued functions of the matrix $\boldsymbol{X}$, and $c_1$ and $c_2$ are real constants, then

$\begin{array}{c} \frac{\partial [c_1 f(\boldsymbol{X}) + c_2 g(\boldsymbol{X})]}{\partial \boldsymbol{X}} = c_1 \frac{\partial f(\boldsymbol{X})}{\partial \boldsymbol{X}} + c_2 \frac{\partial g(\boldsymbol{X})}{\partial \boldsymbol{X}} \end{array}$

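• Product rule: if $f(\boldsymbol{X})$, $g(\boldsymbol{X})$, and $h(\boldsymbol{X})$ are all real-valued functions of the matrix $\boldsymbol{X}$, then

$\begin{array}{c} \frac{\partial [f(\boldsymbol{X}) g(\boldsymbol{X})]}{\partial \boldsymbol{X}} = g(\boldsymbol{X}) \frac{\partial f(\boldsymbol{X})}{\partial \boldsymbol{X}} + f(\boldsymbol{X}) \frac{\partial g(\boldsymbol{X})}{\partial \boldsymbol{X}} \\ \frac{\partial [f(\boldsymbol{X}) g(\boldsymbol{X}) h(\boldsymbol{X})]}{\partial \boldsymbol{X}} = g(\boldsymbol{X}) h(\boldsymbol{X}) \frac{\partial f(\boldsymbol{X})}{\partial \boldsymbol{X}} + f(\boldsymbol{X}) h(\boldsymbol{X}) \frac{\partial g(\boldsymbol{X})}{\partial \boldsymbol{X}} + f(\boldsymbol{X}) g(\boldsymbol{X}) \frac{\partial h(\boldsymbol{X})}{\partial \boldsymbol{X}} \end{array}$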

• Quotient rule: if $g(\boldsymbol{X}) \ne 0$, then

$\begin{array}{c} \frac{\partial [f(\boldsymbol{X}) / g(\boldsymbol{X})]}{\partial \boldsymbol{X}} = \frac{1}{g^2(\boldsymbol{X})} [ g(\boldsymbol{X}) \frac{\partial f(\boldsymbol{X})}{\partial \boldsymbol{X}} - f(\boldsymbol{X}) \frac{\partial g(\boldsymbol{X})}{\partial \boldsymbol{X}} ] \end{array}$

• Chain rule: let $\boldsymbol{X}$ be an $m \times n$ matrix, and let $y = f(\boldsymbol{X})$ and $g(y)$ be real-valued functions of the matrix $\boldsymbol{X}$ and of the scalar $y$ respectively; then

$\begin{array}{c} \frac{\partial g(f(\boldsymbol{X}))}{\partial \boldsymbol{X}} = \frac{\mathrm{d}g(y)}{\mathrm{d}y} \frac{\partial f(\boldsymbol{X})}{\partial \boldsymbol{X}} \end{array}$

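More generally, if $g(\boldsymbol{F})$ is a real-valued function of the matrix $\boldsymbol{F} = \boldsymbol{F}(\boldsymbol{X}) = [f_{kl}(\boldsymbol{X})] \in \mathbb{R}^{p \times q}$, the chain rule holds entry by entry:

$\begin{array}{c} [\frac{\partial g(\boldsymbol{F}(\boldsymbol{X}))}{\partial \boldsymbol{X}}]_{ij} = \frac{\partial g(\boldsymbol{F}(\boldsymbol{X}))}{\partial x_{ij}} = \sum_{k=1}^p \sum_{l=1}^q \frac{\partial g(\boldsymbol{F}(\boldsymbol{X}))}{\partial f_{kl}(\boldsymbol{X})} \frac{\partial f_{kl}(\boldsymbol{X})}{\partial x_{ij}} \end{array}$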
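As a worked instance of the scalar chain rule (an illustrative example, not part of the original list), take $y = f(\boldsymbol{X}) = \mathrm{tr}(\boldsymbol{X}^T \boldsymbol{X}) = \sum_{i,j} x_{ij}^2$ and $g(y) = \sqrt{y}$, so that $g(f(\boldsymbol{X})) = \|\boldsymbol{X}\|_F$ is the Frobenius norm. Since $\frac{\partial f(\boldsymbol{X})}{\partial \boldsymbol{X}} = 2 \boldsymbol{X}$, for $\boldsymbol{X} \ne \boldsymbol{0}$ one obtains

$\begin{array}{c} \frac{\partial \|\boldsymbol{X}\|_F}{\partial \boldsymbol{X}} = \frac{\mathrm{d}\sqrt{y}}{\mathrm{d}y} \frac{\partial\, \mathrm{tr}(\boldsymbol{X}^T \boldsymbol{X})}{\partial \boldsymbol{X}} = \frac{1}{2\sqrt{y}} \cdot 2\boldsymbol{X} = \frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F} \end{array}$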

### 3.2 Common Properties

• For the real-valued scalar function $f(\boldsymbol{x}) = \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}$, where $\boldsymbol{A} \in \mathbb{R}^{n \times n}, \boldsymbol{x} \in \mathbb{R}^{n \times 1}$, the row partial derivative vector and the gradient vector are

$\begin{array}{c} D_{\boldsymbol{x}} f(\boldsymbol{x}) = \boldsymbol{x}^T(\boldsymbol{A} + \boldsymbol{A}^T) \\ \nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = (\boldsymbol{A}^T + \boldsymbol{A})\boldsymbol{x} \end{array}$
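The gradient formula for the quadratic form can be sanity-checked against finite differences. This is a minimal sketch under stated assumptions: vectors and matrices are plain Python lists, and `quad_form`, `grad_quad_form`, and `num_grad` are hypothetical helper names chosen for the example.

```python
def quad_form(A, x):
    """Return x^T A x for a square matrix A (list of rows) and vector x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def grad_quad_form(A, x):
    """Closed-form gradient (A + A^T) x."""
    n = len(x)
    return [sum((A[k][j] + A[j][k]) * x[j] for j in range(n)) for k in range(n)]


def num_grad(f, x, h=1e-6):
    """Central-difference estimate of [df/dx_1, ..., df/dx_n]."""
    g = []
    for k in range(len(x)):
        xp, xm = x[:], x[:]
        xp[k] += h
        xm[k] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g


A = [[1.0, 2.0],
     [0.0, 3.0]]               # deliberately non-symmetric
x = [1.0, -1.0]

analytic = grad_quad_form(A, x)
numeric = num_grad(lambda v: quad_form(A, v), x)

print(analytic)                # [0.0, -4.0]
print(numeric)                 # approximately the same values
```

A non-symmetric $\boldsymbol{A}$ is used on purpose: with a symmetric matrix the check could not distinguish $(\boldsymbol{A} + \boldsymbol{A}^T)\boldsymbol{x}$ from the special case $2\boldsymbol{A}\boldsymbol{x}$.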

• For the real-valued scalar function $f(\boldsymbol{X}) = \boldsymbol{a}^T \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{b}$, where $\boldsymbol{X} \in \mathbb{R}^{m \times n}, \boldsymbol{a},\boldsymbol{b} \in \mathbb{R}^{n \times 1}$, the Jacobian matrix and the gradient matrix are

$\begin{array}{c} D_{\boldsymbol{X}} f(\boldsymbol{X}) = (\boldsymbol{b} \boldsymbol{a}^T + \boldsymbol{a} \boldsymbol{b}^T) \boldsymbol{X}^T \\ \nabla_{\boldsymbol{X}} f(\boldsymbol{X}) = \boldsymbol{X} (\boldsymbol{b} \boldsymbol{a}^T + \boldsymbol{a} \boldsymbol{b}^T) \end{array}$

Here the Jacobian matrix is $n \times m$ and the gradient matrix is $m \times n$, in line with the definitions in 2.4.

• For the real-valued matrix function $\boldsymbol{F}(\boldsymbol{X}) = \boldsymbol{X} \in \mathbb{R}^{m \times n}$, the Jacobian matrix and the gradient matrix are

$\begin{array}{c} D_{\boldsymbol{X}} \boldsymbol{F} (\boldsymbol{X}) = \nabla_{\boldsymbol{X}} \boldsymbol{F}(\boldsymbol{X}) = \boldsymbol{I}_{mn} \in \mathbb{R}^{mn \times mn} \end{array}$

• For the real-valued matrix function $\boldsymbol{F}(\boldsymbol{X}) = \boldsymbol{A} \boldsymbol{X} \boldsymbol{B}$, where $\boldsymbol{A} \in \mathbb{R}^{p \times m}, \boldsymbol{X} \in \mathbb{R}^{m \times n}, \boldsymbol{B} \in \mathbb{R}^{n \times q}$, the Jacobian matrix and the gradient matrix are

$\begin{array}{c} D_{\boldsymbol{X}}\boldsymbol{F}(\boldsymbol{X}) = \boldsymbol{B}^T \otimes \boldsymbol{A} \\ \nabla_{\boldsymbol{X}} \boldsymbol{F}(\boldsymbol{X}) = \boldsymbol{B} \otimes \boldsymbol{A}^T \end{array}$
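Because $\boldsymbol{F}(\boldsymbol{X}) = \boldsymbol{A}\boldsymbol{X}\boldsymbol{B}$ is linear in $\boldsymbol{X}$, its Jacobian matrix is exactly the matrix $\boldsymbol{J}$ satisfying $\mathrm{vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = \boldsymbol{J}\, \mathrm{vec}(\boldsymbol{X})$. The sketch below checks this identity on small integer matrices; it assumes plain Python lists as matrices, and the helpers `kron`, `matmul`, `transpose`, and `vec` are written out for the example rather than taken from a library.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def kron(A, B):
    """Kronecker product: block (i, j) of the result is A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def vec(A):
    """Column-wise vectorization of a list-of-rows matrix."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]


A = [[1, 2], [3, 4]]             # p x m = 2 x 2
X = [[1, 0, 2], [-1, 3, 1]]      # m x n = 2 x 3
B = [[1, 2], [0, 1], [4, 0]]     # n x q = 3 x 2

J = kron(transpose(B), A)        # Jacobian B^T (x) A, shape pq x mn = 4 x 6
G = kron(B, transpose(A))        # gradient B (x) A^T, shape mn x pq = 6 x 4

vec_F = vec(matmul(matmul(A, X), B))
J_vec_X = [sum(r * v for r, v in zip(row, vec(X))) for row in J]

print(vec_F == J_vec_X)          # True
print(G == transpose(J))         # True
```

Integer entries keep every comparison exact, so no tolerance is needed.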

• In machine learning, for the linear real-valued matrix function $\boldsymbol{Y} = \boldsymbol{X} \cdot \boldsymbol{W} + \boldsymbol{b}$, where $\boldsymbol{X} \in \mathbb{R}^{n \times m}, \boldsymbol{W} \in \mathbb{R}^{m \times 1}, \boldsymbol{b} \in \mathbb{R}^{n \times 1}$, and a loss function $\mathcal{L}$, the gradients are

$\begin{array}{c} [\nabla_{\boldsymbol{X}} \mathcal{L}]_{kl} = [\frac{\partial \mathcal{L}}{\partial \boldsymbol{X}}]_{kl} = \sum_{i=1}^n \frac{\partial \mathcal{L}}{\partial y_{i}} \frac{\partial y_{i}}{\partial x_{kl}} = w_{l} \frac{\partial \mathcal{L}}{\partial y_k} \\ \Downarrow \\ \nabla_{\boldsymbol{X}} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \boldsymbol{X}} = \frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}} \boldsymbol{W}^T \end{array}$

$\begin{array}{c} [\nabla_{\boldsymbol{W}} \mathcal{L}]_{k} = [\frac{\partial \mathcal{L}}{\partial \boldsymbol{W}}]_{k} = \sum_{i=1}^n \frac{\partial \mathcal{L}}{\partial y_{i}} \frac{\partial y_{i}}{\partial w_{k}} = \sum_{i=1}^n \frac{\partial \mathcal{L}}{\partial y_i} x_{ik} \\ \Downarrow \\ \nabla_{\boldsymbol{W}} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial \boldsymbol{W}} = \boldsymbol{X}^T \frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}} \end{array}$
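The two gradient formulas for the linear map can be checked against finite differences. This is a minimal sketch, not a reference implementation: the loss $\mathcal{L} = \frac{1}{2} \sum_i y_i^2$ is an assumption made only so that $\frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}} = \boldsymbol{Y}$ is easy to write down, matrices are plain Python lists, and all function names are hypothetical.

```python
def forward(X, W, b):
    """Y = X W + b with X an n x m list of rows, W of length m, b of length n."""
    return [sum(x * w for x, w in zip(row, W)) + bi for row, bi in zip(X, b)]


def loss(X, W, b):
    """Illustrative loss L = 0.5 * sum(y_i^2), chosen so that dL/dy_i = y_i."""
    return 0.5 * sum(y * y for y in forward(X, W, b))


def analytic_grads(X, W, b):
    dY = forward(X, W, b)                        # dL/dY for this particular loss
    grad_X = [[dy * w for w in W] for dy in dY]  # (dL/dY) W^T, shape n x m
    grad_W = [sum(dy * row[k] for dy, row in zip(dY, X))
              for k in range(len(W))]            # X^T (dL/dY), length m
    return grad_X, grad_W


def numeric_grads(X, W, b, h=1e-6):
    """Central-difference estimates of dL/dX and dL/dW."""
    grad_X = [[0.0] * len(W) for _ in X]
    for k in range(len(X)):
        for l in range(len(W)):
            Xp = [row[:] for row in X]
            Xm = [row[:] for row in X]
            Xp[k][l] += h
            Xm[k][l] -= h
            grad_X[k][l] = (loss(Xp, W, b) - loss(Xm, W, b)) / (2 * h)
    grad_W = []
    for k in range(len(W)):
        Wp, Wm = W[:], W[:]
        Wp[k] += h
        Wm[k] -= h
        grad_W.append((loss(X, Wp, b) - loss(X, Wm, b)) / (2 * h))
    return grad_X, grad_W


X = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]]   # n = 3, m = 2
W = [0.5, -2.0]
b = [0.1, 0.2, 0.3]

gX, gW = analytic_grads(X, W, b)
nX, nW = numeric_grads(X, W, b)

err_X = max(abs(gX[k][l] - nX[k][l]) for k in range(3) for l in range(2))
err_W = max(abs(gW[k] - nW[k]) for k in range(2))
print(err_X, err_W)                          # both should be tiny
```

Any other differentiable loss would work the same way once its own $\frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}}$ replaces `dY` in `analytic_grads`.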

## 4. Common Formulas

### 4.1 Vector Functions and Their Partial Derivatives

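| Function of $\boldsymbol{x}$ | Partial derivative with respect to $\boldsymbol{x}$ |
| --- | --- |
| $a \boldsymbol{x}$ | $a \boldsymbol{I}$ |
| $\boldsymbol{b}^T \boldsymbol{x}$ | $\boldsymbol{b}$ |
| $\boldsymbol{x}^T \boldsymbol{b}$ | $\boldsymbol{b}$ |
| $\boldsymbol{x}^T \boldsymbol{x}$ | $2 \boldsymbol{x}$ |
| $\boldsymbol{x} \boldsymbol{x}^T$ | $\boldsymbol{I} \otimes \boldsymbol{x} + \boldsymbol{x} \otimes \boldsymbol{I}$ |
| $\boldsymbol{b}^T \boldsymbol{A} \boldsymbol{x}$ | $\boldsymbol{A}^T \boldsymbol{b}$ |
| $\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{b}$ | $\boldsymbol{A} \boldsymbol{b}$ |
| $\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}$ | $(\boldsymbol{A} + \boldsymbol{A}^T) \boldsymbol{x}$ |
| $\exp{(-\frac{1}{2} \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x})}$ | $-\frac{1}{2} (\boldsymbol{A} + \boldsymbol{A}^T) \boldsymbol{x} \cdot \exp{(-\frac{1}{2} \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x})}$ |

For the scalar-valued functions the right-hand column is the gradient vector $\nabla_{\boldsymbol{x}}$. The entry for the matrix-valued function $\boldsymbol{x}\boldsymbol{x}^T$ is in Jacobian layout, $\frac{\partial\, \mathrm{vec}(\boldsymbol{x}\boldsymbol{x}^T)}{\partial \boldsymbol{x}^T}$, and its gradient matrix is the transpose.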
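The last entry above follows from the chain rule in 3.1 together with the quadratic-form gradient in 3.2. As a short worked derivation (the intermediate symbols $y$ and $g$ are introduced only for this example), set $y = \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}$ and $g(y) = \exp(-\frac{1}{2} y)$:

$\begin{array}{c} \frac{\partial g(y)}{\partial \boldsymbol{x}} = \frac{\mathrm{d} g(y)}{\mathrm{d} y} \frac{\partial (\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x})}{\partial \boldsymbol{x}} = -\frac{1}{2} \exp{(-\frac{1}{2} \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x})} \cdot (\boldsymbol{A} + \boldsymbol{A}^T) \boldsymbol{x} \end{array}$

When $\boldsymbol{A}$ is symmetric this reduces to $-\boldsymbol{A} \boldsymbol{x} \cdot \exp{(-\frac{1}{2} \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x})}$.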