【强化学习】9-策略梯度方法

Categories: RL 2026-03-22

一、策略梯度的基本思想

本章，我们主要会从基于价值的方法转向基于策略的方法，从价值函数近似转向策略函数近似。

到目前为止，我们介绍的策略是用表格表示的，所有状态的动作概率都存储在一个表格 $\pi ( a | s )$ 中。表格的每个条目都由一个状态和一个动作索引：

	$a_1$	$a_2$	$a_3$	$a_4$	$a_5$
$s_1$	$\pi(a_1 \mid s_1)$	$\pi(a_2 \mid s_1)$	$\pi(a_3 \mid s_1)$	$\pi(a_4 \mid s_1)$	$\pi(a_5 \mid s_1)$
$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\vdots$
$s_9$	$\pi(a_1 \mid s_9)$	$\pi(a_2 \mid s_9)$	$\pi(a_3 \mid s_9)$	$\pi(a_4 \mid s_9)$	$\pi(a_5 \mid s_9)$

我们可以直接访问或修改表格中的值。

现在，我们用参数化的函数来表示策略： $\begin{equation} \pi (a | s, \theta) \end{equation}$ 其中 $\theta \in \mathbb { R } ^ { m }$ 是一个参数向量。

这个函数可以是一个神经网络，其输入是 $s$，输出是采取每个动作的概率，参数是 $\theta$。
优点：当状态空间很大时，表格表示在存储和泛化方面的效率会很低。
函数表示有时也写作 $\pi ( a , s , \theta )$、$\pi _ {\theta} (a | s)$ 或 $\pi _ { \theta } ( a , s )$。

表格表示与函数表示的区别：

第一，如何定义最优策略？

当表示为表格时，最优策略 $\pi^*$ 有 $v_{\pi^*}(s) \geq v_{\pi}(s), \quad \forall \pi$。
当用函数表示时，如果策略 $\pi$ 能最大化某个标量度量指标（scalar metrics），那么它就是最优的。

第二，如何访问一个动作的概率？

在表格情况下，在状态 $s$ 采取动作 $a$ 的概率可以通过查找表格策略直接获得。
在函数表示的情况下，我们需要根据函数结构和参数来计算 $\pi ( a | s , \theta )$ 的值。

第三，如何更新策略？

当用表格表示时，可以通过直接更改表格中的条目来更新策略 $\pi$。
当用参数化函数表示时，就不能再用这种方式更新策略 $\pi$ 了。相反，它只能通过改变参数 $\theta$ 来更新。

策略梯度的基本思想很简单：

首先，定义最优策略的度量指标（或目标函数）：$J ( \theta )$，它可以用来定义最优策略。其次，使用基于梯度的优化算法来搜索最优策略： $\begin{equation} \theta_{t + 1} = \theta_{t} + \alpha \nabla_{\theta} J (\theta_{t}) \end{equation}$ 我们接下来要解答：

应该使用什么合适的度量指标？
如何计算这些度量指标的梯度？

二、定义最优策略的度量指标

有两种度量指标。

1. Average Value

第一种度量指标是平均状态价值，或简称为平均价值。具体地，该指标定义为： $\begin{equation} \bar{v}_{\pi} = \sum_{s \in \mathcal{S} } d (s) v_{\pi} (s) \end{equation}$

$\bar { v } _ { \pi }$ 是状态价值的加权平均。
$d ( s ) \geq 0$ 是状态 $s$ 的权重。
由于 $\begin{array} { r } { \sum _ { s \in \mathcal { S } } d ( s ) = 1 } \end{array}$，我们可以将 $d ( s )$ 解释为一个概率分布。那么，这个指标可以写成：

\[\begin{equation} \bar{v}_{\pi} = \mathbb{E} [ v_{\pi} (S) ] \end{equation}\]

其中 $S \sim d$。

向量-乘积形式为： $\begin{equation} \bar{v}_{\pi} = \sum_{s \in \mathcal{S} } d (s) v_{\pi} (s) = d ^ {T} v_{\pi} \end{equation}$ 其中 $\begin{equation} \begin{aligned} v_{\pi} = \left[ \dots , v_{\pi} (s), \dots \right] ^ {T} \in \mathbb{R} ^ {| S |} \\ d = [ \dots , d (s), \dots ] ^ {T} \in \mathbb{R} ^ {| \mathcal{S} |} \end{aligned} \end{equation}$ 这种表达形式在我们分析其梯度时特别有用。

如何选择分布 $d$ ？有两种情况：

第一种情况是 $d$ 独立于策略 $\pi$。这种情况相对简单，因为度量指标的梯度更容易计算。在这种情况下，我们特地将 $d$ 记为 $d _ { 0 }$，将 $\bar { v } _ { \pi }$ 记为 $\bar { v } _ { \pi } ^ { 0 }$。选择 $d _ { 0 }$ 也有以下典型的情况：

一种简单的方法是将所有状态视为同等重要，因此选择 $d _ { 0 } ( s ) = 1 / | S |$。
另一个重要的情况是，我们只对特定的状态 $s _ { 0 }$ 感兴趣。例如，某些任务中的片段总是从同一个状态 $s _ { 0 }$ 开始。那么，我们只关心从 $s _ { 0 }$ 开始的长期回报。在这种情况下：

\[\begin{equation} \begin{aligned} d _ {0} (s _ {0}) &= 1, \quad d _ {0} (s \neq s _ {0}) = 0 \\ \bar{v}_{\pi} &= v_{\pi} (s_0) \end{aligned} \end{equation}\]

第二种情况是 $d$ 依赖于策略 $\pi$。一种常见的选择是将 $d$ 取为 $d _ { \pi } ( s )$，即 $\pi$ 下的平稳分布。$d _ { \pi }$ 的一个基本性质是它满足：

\[\begin{equation} d_{\pi} ^ {T} P_{\pi} = d_{\pi} ^ {T} \end{equation}\]

其中 $P _ { \pi }$ 是状态转移概率矩阵。选择 $d _ { \pi }$ 的解释如下：

如果一个状态在长期内被频繁访问，那么它就更重要，应该获得更大的权重。
如果一个状态很少被访问，那么我们就给它较小的权重。

2. Average Reward

第二种度量指标是平均单步奖励，或简称为平均奖励。具体地，该指标定义为：

\[\begin{equation} \bar{r}_{\pi} \doteq \sum_{s \in \mathcal{S} } d_{\pi} (s) r_{\pi} (s) = \mathbb{E} [ r_{\pi} (S) ] \end{equation}\]

其中 $S \sim d _ { \pi }$。这里

\[\begin{equation} r_{\pi} (s) \doteq \sum_{a \in \mathcal{A} } \pi (a | s) r (s, a) \end{equation}\]

是从状态 $s$ 出发所能获得的平均单步即时奖励，而

\[\begin{equation} r (s, a) = \mathbb{E} [ R | s, a ] = \sum_{r} r p (r | s, a) \end{equation}\]

其中，权重 $d _ { \pi }$ 是平稳分布。顾名思义，$\bar { r } _ { \pi }$ 就是单步即时奖励的加权平均。

这里还有另一种常见的等价形式。假设一个 agent 遵循一个给定的策略，并生成一个轨迹及其对应的奖励序列 $( R _ { t + 1 } , R _ { t + 2 } , \dots )$。沿着这个轨迹的平均单步奖励为： $\begin{equation} \begin{aligned} \lim _ {n \rightarrow \infty} & \frac {1}{n} \mathbb{E} \left[ R _ {t + 1} + R _ {t + 2} + \dots + R _ {t + n} | S _ {t} = s _ {0} \right] \\ & = \lim _ {n \rightarrow \infty} \frac {1}{n} \mathbb{E} \left[ \sum_ {k = 1} ^ {n} R _ {t + k} | S _ {t} = s _ {0} \right] \end{aligned} \end{equation}$

其中 $s _ { 0 }$ 是轨迹的起始状态。一个重要的性质是：

\[\begin{equation} \begin{aligned} \lim _ {n \rightarrow \infty} \frac {1}{n} \mathbb{E} \left[ \sum_ {k = 1} ^ {n} R _ {t + k} | S _ {t} = s _ {0} \right] &= \lim _ {n \rightarrow \infty} \frac {1}{n} \mathbb{E} \left[ \sum_ {k = 1} ^ {n} R _ {t + k} \right] \\ &= \sum_ {s} d_{\pi} (s) r_{\pi} (s) \\ &= \bar{r}_{\pi} \end{aligned} \end{equation}\]

这是因为随着步数的增多，起始状态 $s _ { 0 }$ 已无关紧要，最后会趋向于平稳分布。$\bar { r } _ { \pi }$ 的两种定义是等价的。

此外，我们也经常会见到以下度量指标： $\begin{equation} J (\theta) = \mathbb{E} \left[ \sum_{t = 0}^{\infty} \gamma^{t} R_{t + 1} \right] \end{equation}$ 其中，轨迹从 $S _ { 0 } \sim d$ 开始，然后是 $A _ { 0 } , R _ { 1 } , S _ { 1 } , A _ { 1 } , R _ { 2 } , S _ { 2 } , . . .$。此外 $A _ { t } \sim \pi ( S _ { t } )$，并且 $R _ { t + 1 } , S _ { t + 1 } \sim p ( R _ { t + 1 } | S _ { t } , A _ { t } ) , p ( S _ { t + 1 } | S _ { t } , A _ { t } )$。那么，我们知道这个度量指标与平均价值相同，因为： $\begin{equation} \begin{aligned} J (\theta) &= \mathbb{E} \left[ \sum_{t = 0}^{\infty} \gamma^{t} R_{t + 1} \right] = \sum_{s \in \mathcal{S} } d (s) \mathbb{E} \left[ \sum_{t = 0}^{\infty} \gamma^{t} R_{t + 1} | S_{0} = s \right] \\ &= \sum_{s \in \mathcal{S} } d (s) v_{\pi} (s) \\ &= \bar{v}_{\pi} \end{aligned} \end{equation}$

3. 几点说明

关于度量指标的说明 1：

所有这些度量指标都是 $\pi$ 的函数。由于 $\pi$ 由 $\theta$ 参数化，这些度量指标也是 $\theta$ 的函数。换句话说，不同的 $\theta$ 值可以产生不同的度量指标值。因此，我们可以搜索 $\theta$ 的最优值以最大化这些度量指标。这就是策略梯度方法的基本思想。

关于度量指标的说明 2：

一个复杂之处在于，度量指标可以定义在折扣因子 $\gamma \in ( 0 , 1 )$ 的情况下，也可以定义在无折扣 $\gamma = 1$ 的情况下。到目前为止，我们只考虑折扣情况。

关于度量指标的说明 3：

直观来看，$\bar { r } _ { \pi }$ 似乎更短视，因为它只考虑即时奖励， $\bar { v } _ { \pi }$ 则考虑的是所有步骤的总奖励。然而，这两个度量指标是等价的。在 $\gamma < 1$ 的折扣情况下，有：

\[\begin{equation} \bar{r}_{\pi} = (1 - \gamma) \bar{v}_{\pi} \end{equation}\]

证明过程如下：

注意到 $\bar{v}_\pi(\theta) = d_\pi^T v_\pi$ 且 $\bar{r}_\pi(\theta) = d_\pi^T r_\pi$，其中 $v_\pi$ 和 $r_\pi$ 满足贝尔曼方程 $v_\pi = r_\pi + \gamma P_\pi v_\pi$。在贝尔曼方程两边同时左乘 $d_\pi^T$，可得：

\[\begin{equation} \bar{v}_\pi = \bar{r}_\pi + \gamma d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma d_\pi^T v_\pi = \bar{r}_\pi + \gamma \bar{v}_\pi \end{equation}\]

三、度量指标的梯度

给定一个度量指标，我们接下来推导它的梯度，然后，应用基于梯度的方法来优化这个度量指标。梯度计算是策略梯度方法中最复杂的部分之一，这是因为：

首先，我们需要区分不同的度量指标 $\bar { v } _ { \pi }$、$\bar { r } _ { \pi }$、$\bar { v } _ { \pi } ^ { 0 }$。
其次，我们需要区分折扣情况和无折扣情况。

但对上述所有情况，可以用一个统一的基本形式给出梯度计算结果： $\begin{equation} \nabla_{\theta} J (\theta) = \sum_{s \in \mathcal{S} } \eta (s) \sum_{a \in \mathcal{A} } \nabla_{\theta} \pi (a | s, \theta) q_{\pi} (s, a) \end{equation}$ 其中：

$J(\theta)$ 可以是 $\bar { v } _ { \pi }$、$\bar { r } _ { \pi }$ 或 $\bar { v } _ { \pi } ^ { 0 }$。
“=” 可能表示严格相等、近似或成比例。
$\eta$ 是状态的一个分布或权重。

上式是不同情况的一个近似的统一形式，具体来说，每种情况的结果都有一定区别，这里直接给出： $\begin{align} \nabla_{\theta} \bar{r}_{\pi} &\simeq \sum_{s} d_{\pi} (s) \sum_{a} \nabla_{\theta} \pi (a | s, \theta) q_{\pi} (s, a) \\ \nabla_{\theta} \bar{v}_{\pi} &= \frac {1}{1 - \gamma} \nabla_{\theta} \bar{r}_{\pi} \\ \nabla_{\theta} \bar{v}_{\pi} ^ {0} &= \sum_{s \in \mathcal{S} } \rho_{\pi} (s) \sum_{a \in \mathcal{A} } \nabla_{\theta} \pi (a | s, \theta) q_{\pi} (s, a) \end{align}$ 上面的梯度有一个紧凑而有用的形式： $\begin{equation} \begin{aligned} \nabla_{\theta} J (\theta) &= \sum_{s \in \mathcal{S} } \eta (s) \sum_{a \in \mathcal{A} } \nabla_{\theta} \pi (a | s, \theta) q_{\pi} (s, a) \\ &= \mathbb{E} \left[ \nabla_{\theta} \ln \pi (A | S, \theta) q_{\pi} (S, A) \right] \end{aligned} \end{equation}$ 其中 $S \sim \eta$，$A \sim \pi ( A | S , \theta )$。为什么这个表达式有用？因为我们可以使用样本来近似梯度：

\[\begin{equation} \nabla_{\theta} J \approx \nabla_{\theta} \ln \pi (a | s, \theta) q_{\pi} (s, a) \end{equation}\]

上述等式证明过程如下：

考虑函数 $\ln \pi$，其中 $\ln$ 是自然对数。很容易看出： $\begin{equation} \nabla_{\theta} \ln \pi (a | s, \theta) = \frac {\nabla_{\theta} \pi (a | s , \theta)}{\pi (a | s , \theta)} \end{equation}$ 因此： $\begin{equation} \nabla_{\theta} \pi (a | s, \theta) = \pi (a | s, \theta) \nabla_{\theta} \ln \pi (a | s, \theta) \end{equation}$ 代入后得到： $\begin{equation} \begin{aligned} \nabla_{\theta} J &= \sum_{s} d (s) \sum_{a} \nabla_{\theta} \pi (a | s, \theta) q_{\pi} (s, a) \\ &= \sum_{s} d (s) \sum_{a} \pi (a | s, \theta) \nabla_{\theta} \ln \pi (a | s, \theta) q_{\pi} (s, a) \\ &= \mathbb{E} _ {S \sim d} \left[ \sum_{a} \pi (a | S, \theta) \nabla_{\theta} \ln \pi (a | S, \theta) q_{\pi} (S, a) \right] \\ &= \mathbb{E} _ {S \sim d, A \sim \pi} \left[ \nabla_{\theta} \ln \pi (A | S, \theta) q_{\pi} (S, A) \right] \\ &= \mathbb{E} \left[ \nabla_{\theta} \ln \pi (A | S, \theta) q_{\pi} (S, A) \right] \end{aligned} \end{equation}$ 一些说明： 因为我们需要计算 $\ln \pi ( a | s , \theta )$，我们必须确保对于所有 $s , a , \theta$ 有： $\begin{equation} \pi (a | s, \theta) > 0 \end{equation}$

这可以通过使用 softmax 函数来实现，该函数可以将向量中的条目从 $( - \infty , + \infty )$ 归一化到 $(0, 1)$。例如，对于任意向量 $ x = [ x _ { 1 } , \ldots , x _ { n } ] ^ { T }$： $\begin{equation} z _ {i} = \frac {e ^ {x _ {i} }}{\sum_ {j = 1} ^ {n} e ^ {x _ {j} }} \end{equation}$

其中 $z _ { i } \in ( 0 , 1 )$ 且 $\textstyle \sum _ { i = 1 } ^ { n } z _ { i } = 1$。那么，策略函数的形式为：

\[\begin{equation} \pi (a | s, \theta) = \frac {e ^ {h (s , a , \theta)} }{\sum_ {a ^ {\prime} \in \mathcal {A} } e ^ {h (s , a ^ {\prime} , \theta)} } \end{equation}\]

其中 $h ( s , a , \theta )$ 是另一个函数。这种基于 softmax 函数的形式可以通过神经网络实现，其输入是 $s$，参数是 $\theta$。该网络有 $| { \cal A } |$ 个输出，每个输出对应于一个动作 $a$ 的 $\pi ( a | s , \theta )$，输出层的激活函数为 softmax。由于对于所有 $a$，都有 $\pi ( a | s , \theta ) > 0$，因此参数化的策略是随机的，因此具有探索性。当然，也存在确定性策略梯度（DPG）方法，后续介绍。

四、梯度上升算法（REINFORCE）

1. 算法

最大化 $J ( \theta )$ 的梯度上升算法是： $\begin{equation} \begin{aligned} \theta_{t + 1} &= \theta_{t} + \alpha \nabla_{\theta} J (\theta) \\ &= \theta_{t} + \alpha \mathbb{E} \left[ \nabla_{\theta} \ln \pi (A | S, \theta_{t}) q_{\pi} (S, A) \right] \end{aligned} \end{equation}$

真实的梯度可以用一个随机梯度来代替：

\[\begin{equation} \theta_{t + 1} = \theta_{t} + \alpha \nabla_{\theta} \ln \pi \left(a_{t} \mid s_{t}, \theta_{t}\right) q_{\pi} \left(s_{t}, a_{t}\right) \end{equation}\]

此外，由于 $q _ { \pi }$ 是未知的，可以用近似来代替：

\[\begin{equation} \theta_{t + 1} = \theta_{t} + \alpha \nabla_{\theta} \ln \pi (a_{t} | s_{t}, \theta_{t}) q_{t} (s_{t}, a_{t}) \end{equation}\]

有不同的方法来近似 $q _ { \pi } ( s _ { t } , a _ { t } )$，本章主要是基于蒙特卡洛的方法，即 REINFORCE。此外还包括 TD 方法等，这会引出下一章 Actor-Critic 的内容。

2. 几点说明

如何进行采样？ $\begin{equation} \mathbb{E} _ {S \sim d, A \sim \pi} \left[ \nabla_{\theta} \ln \pi (A | S, \theta_{t}) q_{\pi} (S, A) \right] \longrightarrow \nabla_{\theta} \ln \pi (a | s, \theta_{t}) q_{\pi} (s, a) \end{equation}$

如何采样 $S$：$S \sim d$，其中分布 $d$ 是在 $\pi$ 下的长期行为。
如何采样 $A$：$A \sim \pi ( A | S , \theta )$，因此，$a _ { t }$ 应该根据在 $s _ { t }$ 处的 $\pi ( \theta _ { t } )$ 进行采样。
因此，策略梯度方法是on-policy的。

如何解释这个算法？

由于：

\[\begin{equation} \nabla_{\theta} \ln \pi (a_{t} | s_{t}, \theta_{t}) = \frac {\nabla_{\theta} \pi (a_{t} | s_{t} , \theta_{t})}{\pi (a_{t} | s_{t} , \theta_{t})} \end{equation}\]

该算法可以重写为：

\[\begin{equation} \begin{aligned} \theta_{t + 1} &= \theta_{t} + \alpha \nabla_{\theta} \ln \pi (a_{t} | s_{t}, \theta_{t}) q_{t} (s_{t}, a_{t}) \\ &= \theta_{t} + \alpha \underbrace {\left(\frac {q_{t} (s_{t} , a_{t})}{\pi (a_{t} | s_{t} , \theta_{t})}\right)} _ {\beta_{t} } \nabla_{\theta} \pi (a_{t} | s_{t}, \theta_{t}) \end{aligned} \end{equation}\]

因此，我们得到了算法的重要表达式：

\[\begin{equation} \theta_{t + 1} = \theta_{t} + \alpha \beta_{t} \nabla_{\theta} \pi (a_{t} | s_{t}, \theta_{t}) \end{equation}\]

这是一个用于最大化 $\pi ( a _ { t } | s _ { t } , \theta )$ 的梯度上升算法。直观理解，当 $\alpha \beta _ { t }$ 足够小时：

如果 $\beta _ { t } > 0$，选择 $( s _ { t } , a _ { t } )$ 的概率会增加：

\[\begin{equation} \pi \left(a_{t} \mid s_{t}, \theta_{t + 1}\right) > \pi \left(a_{t} \mid s_{t}, \theta_{t}\right) \end{equation}\]

$\beta _ { t }$ 越大，增强的力度越强。

如果 $\beta _ { t } < 0$，那么：

\[\begin{equation} \pi ( a _ { t } | s _ { t } , \theta _ { t + 1 } ) < \pi ( a _ { t } | s _ { t } , \theta _ { t } ) \end{equation}\]

数学推导：当 $\theta _ { t + 1 } - \theta _ { t }$ 足够小时，我们有： $\begin{equation} \begin{aligned} \pi \left(a_{t} \mid s_{t}, \theta_{t + 1}\right) &\approx \pi \left(a_{t} \mid s_{t}, \theta_{t}\right) + \left(\nabla_{\theta} \pi \left(a_{t} \mid s_{t}, \theta_{t}\right)\right) ^ {T} \left(\theta_{t + 1} - \theta_{t}\right) \\ &= \pi \left(a_{t} \mid s_{t}, \theta_{t}\right) + \alpha \beta_{t} \left(\nabla_{\theta} \pi \left(a_{t} \mid s_{t}, \theta_{t}\right)\right) ^ {T} \left(\nabla_{\theta} \pi \left(a_{t} \mid s_{t}, \theta_{t}\right)\right) \\ &= \pi \left(a_{t} \mid s_{t}, \theta_{t}\right) + \alpha \beta_{t} \| \nabla_{\theta} \pi \left(a_{t} \mid s_{t}, \theta_{t}\right) \| ^ {2} \end{aligned} \end{equation}$

系数 $\beta _ { t }$ 可以很好地平衡探索（exploration）和利用（exploitation）

首先，$\beta _ { t }$ 与 $q _ { t } ( s _ { t } , a _ { t } )$ 成正比。
- 如果 $q _ { t } ( s _ { t } , a _ { t } )$ 很大，那么 $\beta _ { t }$ 也很大。
- 因此，该算法旨在增强具有更大价值的动作（exploitation）。
其次，$\beta _ { t }$ 与 $\pi ( a _ { t } | s _ { t } , \theta _ { t } )$ 成反比。
- 如果 $\pi ( a _ { t } | s _ { t } , \theta _ { t } )$ 很小，那么 $\beta _ { t }$ 就很大。
- 因此，该算法旨在增大那些具有低概率的动作（exploration）。

3. 伪代码

基于蒙特卡洛的策略梯度（REINFORCE）伪代码如下：

\[\begin{equation*} \begin{aligned} & \text{Initialization: A parameterized function } \pi(a|s,\theta), \gamma \in (0,1), \text{ and } \alpha>0. \\ & \text{Aim: Search for an optimal policy maximizing } J(\theta). \\ & \text{For the } k \text{ th iteration, do} \\ & \quad \quad \text{Select } s_0 \text{ and generate an episode following } \pi(\theta_k). \\ & \quad \quad \text{Suppose the episode is } \{ s_0,a_0,r_1,\dots,s_{T-1},a_{T-1},r_T \}. \\ & \quad \quad \text{For } t=0,1,\dots,T-1, \text{ do} \\ & \quad \quad \quad \quad \text{Value update: } q_t(s_t,a_t)=\sum^T_{k=t+1} \gamma^{k-t-1}r_k \\ & \quad \quad \quad \quad \text{Policy update: } \theta_{t + 1} = \theta_{t} + \alpha \nabla_{\theta} \ln \pi (a_{t} | s_{t}, \theta_{t}) q_{t} (s_{t}, a_{t}) \\ & \quad \quad \theta_k = \theta_T \end{aligned} \end{equation*}\]

参考

强化学习的数学原理