Ⅶ. Policy Gradient Methods-有解無憂

Dictum:

?Life is just a series of trying to make up your mind. -- T. Fuller

不同于近似價值函式并以此計算確定性的策略的基于價值的RL方法，基于策略的RL方法將策略的學習從概率集合$P(a|s)$變換成策略函式$\pi(a|s)$，并通過求解策略目標函式的極大值，得到最優策略$\pi^*$，主要用的是策略梯度方法(Policy Gradient Methods)，

策略梯度方法直接對隨機策略$\pi$進行引數化為$\pi(a|s,\theta)=\Pr\{A_t=a|S_t=s,\theta_t=\theta\}$，其中$s\in\mathcal{S}$，$a\in\mathcal{A}(s)$，權重引數向量$\theta\in\mathbb{R}^{d^\prime}(d^\prime<<|\mathcal{S}|)$，引數化后的策略不再是一個概率集合，而是一個近似函式，

相比于基于價值的RL方法，策略梯度方法的優勢如下：

基于價值的RL方法依賴于價值函式的估計；而對策略梯度方法，價值函式可以用于學習策略的引數，但沒有直接的依賴關系
基于價值的RL方法為了尋找最優策略，使用了$\epsilon$-貪心策略以$\epsilon$的概率選擇隨機動作；而策略梯度方法可以直接近似一個確定的策略，因此，具有更好的收斂性
基于價值的RL方法更適用于離散的動作、狀態空間，即使一些改進也需要將連續的動作空間離散化，但這樣的映射呈指數形式，增加求最優解的難度；而策略梯度方法在高維或連續動作空間中有效
基于價值的RL方法通常只能學習最優策略是確定的，而策略梯度方法可以學習本身就具有隨機性的最優策略

缺點如下：

策略梯度方法使用梯度上升，容易收斂到區域最優解
同時，對策略難以進行有效評估，因為它具有很高的方差

策略梯度定義

策略函式

策略函式是在確定在時序$t$的狀態$s$下，采取任何科恩那個動作的具體概率，可以看作概率密度函式，假設策略被引數化為$\pi(a|s,\theta)$，要求$\pi(a|s,\theta)$對引數$\theta$的偏導存在，則策略梯度

\[\begin{aligned} \triangledown\pi(a|s,\theta) & =\pi(a|s,\theta)\frac{\triangledown \pi(a|s,\theta)}{\pi(a|s,\theta)}\\ &=\pi(a|s,\theta)\triangledown\log \pi(a|s,\theta) \end{aligned} \tag{7.1} \]

$\triangledown \log\pi(a|s,\theta)$是得分函式(score function)，

Softmax Policy

針對離散且有限的動作空間，可以使用softmax策略，對每一個“狀態-動作”二元組估計一個引數化的數值偏好$h(s,a,\theta)\in\mathbb{R}$，然后利用softmax函式對動作加權，如

\[\pi(a|s,\theta)\doteq \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}} \]

此時，每個動作的概率正比于指數權重$\pi(a|s,\theta)\propto e^{h(s,a,\theta)}$，若引數化的偏好值為特征$x(s,a)$的線性組合，則$h(s,a,\theta)=\theta^\top x(s,a)$，此時得分函式為

\[\triangledown\log\pi(a|s,\theta)=x(s,a)-\mathbb{E}_\pi [x(s,\cdot)] \]

Gaussian Policy

針對龐大或連續的動作空間，不能直接計算每一個動作的概率，但可以學習概率分布的統計，通常可以使用Gaussian策略，即對應的動作服從高斯分布$a\sim\mathcal{N}(\mu,\sigma^2)$，則策略可以表示為

\[\pi(a|s,\theta)\doteq \frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\exp\bigg(-\frac{\big(a-\mu(s,\theta)\big)^2}{2\sigma(s,\theta)^2}\bigg) \]

其中，均值$\mu:\mathcal{S}\times\mathbb{R}^{d^\prime}\to\mathcal{R}$和方差$\sigma:\mathcal{S}\times\mathbb{R}^{d^\prime}\to\mathcal{R}^+$是兩個引數化的函式近似器，因此，引數向量可以分為兩部分$\theta=[\theta_\mu,\theta_\sigma]^\top$，同時特征向量也分為兩部分$x=[x_\mu,x_\sigma]$，均值部分可以直接使用線性函式近似，而標準差部分要求為正數，可以使用線性函式的指數形式近似

\[\mu(s,\theta)\doteq\theta_\mu^\top x_\mu(s)\text{ 和 }\sigma(s,\theta)\doteq\exp\Big(\theta_\sigma^\top x_\sigma(s)\Big) \]

此時得分函式為

\[\triangledown\log\pi(a|s,\theta)=\frac{(a-\mu(s))x(s)}{\sigma^2} \]

性能度量

有了策略函式，還要定義性能度量來衡量策略的好壞，一般將智能體累積折扣獎勵的期望作為目標函式，用$J(\theta)$表示，利用相關策略的目標函式$J(\theta)$的梯度上升更新$\theta$，從而最大化獎勵的期望：

\[\theta_{t+1}=\theta_t+\alpha\triangledown\widehat{J(\theta_t)} \tag{7.2} \]

在幕式任務和連續任務這兩種不同的情況下，性能度量的定義是不同的：

起始價值(start value)

針對幕式任務，將性能度量定義為該幕起始狀態的價值，即

\[J_1(\theta)\doteq v_{\pi_\theta}(s_0)=\mathbb{E}_{\pi_\theta}[v_0] \]

其中，$v_{\pi_\theta}$是執行策略$\pi_\theta$的真實的價值函式，適用于能夠產生完整經驗軌跡的環境，即從狀態$s_0$開始，以一定的概率分布達到終止狀態時序段內，智能體獲得的累積獎勵，

平均價值(average value)

在連續任務中，智能體沒有明確的起始狀態，可以基于智能體在時序$t$的狀態分布，針對每個可能的狀態計算從$t$開始持續與環境互動所能獲得的獎勵，并按其狀態的概率分布求和，即

\[J_{avgV}(\theta)\doteq\sum_{s\in\mathcal{S}}d_{\pi_\theta}(s)v_{\pi_\theta}(s) \]

其中，$s\sim d_{\pi_\theta}(s)$表示狀態$s$服從根據策略$\pi_\theta$生成的馬爾科夫鏈的狀態分布，

每時序平均獎勵(average reward per time-step)

平均價值使用時序$t$下狀態的平均價值，而每個時序的平均獎勵使用時序$t$的狀態下所有動作的期望，即在時序$t$先計算出智能體所有狀態的可能性，然后計算出在每一種狀態下采取所有可能動作獲得的即時獎勵，并按其對應的概率求和，即

\[\begin{aligned} J_{avgR}(\theta)\doteq\mathbb{E}_{\pi_\theta}[r] &=\sum_{s\in\mathcal{S}}d_{\pi_\theta}(s)\sum_{a\in\mathcal{A}}\pi_\theta(a|s)r_{s,a} \\ &=\sum_s d_{\pi_\theta}(s)\sum_a\pi(a|s)\sum_{s^\prime,r}p(s^\prime,r|s,a)r \end{aligned} \]

其中，$d_{\pi_\theta}(s)=\lim_{t\to\infty}\Pr\{s_t=s|s_0,\pi_\theta\}$是策略$\pi_\theta$下狀態的穩態分布，$s_0$獨立于所有策略，$r_{s,a}$是狀態$s$下執行動作$a$得到的即時獎勵，它實際上是單步MDPs，也就是從狀態$s\sim d(s)$開始，根據策略$\pi$執行動作$a$，此時得到了一個獎勵值$r$，完成了一個時序后結束，

策略梯度定理

已經給出了策略函式和性能度量，但通過調整策略引數$\theta$來保證性能得到改善仍存在疑點，主要在于策略的性能既依賴于動作的選擇，也依賴于動作選擇時所處的狀態分布，而策略$\pi(a|s)$的定義是已知狀態$s$下動作$a$的概率，因此策略對狀態分布的影響未知，這里就引出了策略梯度定理(policy gradient theorem)，

Policy Gradient Theroem
對任何可微策略$\pi(a|s,\theta)$，任意目標函式$J(\theta)$的梯度都可以表示為

\[\triangledown J(\theta)\propto\sum_s\mu(s)\sum_a\triangledown_\theta q_\pi(s,a)\pi(a|s,\theta) \tag{7.3}$$在幕式情況下，比例常數為幕的平均長度；在連續情況下，比例常數為1，$\mu(s)$是同步策略下的狀態分布，證明見Append， ## 策略梯度演算法 ### REINFORCE：Monte Carlo Policy Gradient REINFORCE是一種蒙特卡洛演算法，一般應用于幕式任務，也需要在每個幕達到終止狀態后才能進行更新，根據策略梯度定理$(7.3)$，REINFORCE的目標函式梯度可以寫成 \]

\begin{aligned}
\triangledown J(\theta)
&\propto\sum_s\mu(s)\sum_a\triangledown\pi(a|s,\theta)q_\pi(s,a) \
&=\mathbb{E}\pi \bigg[\sum_a q\pi (S_t,a) \triangledown \pi(a|S_t,\theta)\bigg] \
&=\mathbb{E}\pi \bigg[\sum_a \pi(a|S_t,\theta)q\pi(S_t,a)\frac{\triangledown \pi(a|S_t,\theta)}{\pi(a|S_t,\theta)}\bigg] \
&(A_t\sim\pi，因此\sum_a \pi(a|S_t,\theta)q_\pi(S_t,a)可以用q_\pi(S_t,A_t)代替)\
&=\mathbb{E}\pi\bigg[q\pi(S_t,A_t)\frac{\triangledown \pi(a|S_t,\theta)}{\pi(a|S_t,\theta)}\bigg]\
&(根據q_\pi的定義，q_\pi(S_t,A_t)=\mathbb{E}\pi[G_t|S_t,A_t]，G_t是回報值)\
&=\mathbb{E}\pi\bigg[G_t\frac{\triangledown \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\bigg]
\end{aligned}

\[ 根據式$(7.2)$的隨機梯度上升，可以得到引數的更新如下 \]

\begin{aligned}
\theta_{t+1}
&\leftarrow \theta_t+\alpha G_t\frac{\triangledown \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)} \
&\leftarrow\theta_t+\alpha G_t\triangledown \ln \pi(A_t|S_t,\theta_t) \
\end{aligned}

\[ 引數向量$\theta$更新的大小正比于回報$G_t$，使更新朝著更利于產生最大回報動作的方向；同時，它反比于策略$\pi(A_t|S_t,\theta)$，減少非最優動作因選擇頻率過大導致的性能影響， ### REINFORCE with Baseline 帶基線的REINFORCE演算法可以看作是一種推廣形式，它是在策略梯度定理$(7.3)$上添加一個不隨動作$a$變化的任意基線$b(s)$，形式如下 $$\triangledown J(\theta)\propto\sum_s\mu(s)\sum_a\big(q_\pi(s,a)-b(s)\big)\triangledown_\theta \pi(a|s,\theta)\]

基線$b(s)$不會影響策略的梯度，因為梯度方向的單位向量之和為0，即$\triangledown_\theta \sum_a\pi(a|s,\theta)=\triangledown_\theta 1=0$，通常可以將狀態價值函式近似$\hat{v}(S_t,w)$，$w\in\mathbb{R}^m$作為基線，改良后的REINFORCE演算法引數更新為

\[w_{t+1}\leftarrow w_t+\alpha^w \gamma^t \big(G_t-\hat{v}(S_t,w)\big)\triangledown_w \hat{v}(S_t,w)$$$$\theta_{t+1}\leftarrow \theta_t+\alpha^\theta \gamma^t \big(G_t-\hat{v}(S_t,w)\big)\triangledown_\theta \ln \pi(A_t|S_t,\theta) \]

Append

策略梯度定理(證明)

幕式情況

\[\begin{aligned} \triangledown J(\theta) &=\triangledown v_\pi(s) \\ &=\triangledown \bigg[\sum_a \pi(a|s)q_\pi(s,a)\bigg] \\ &=\sum_a\Big[\triangledown\pi(a|s)q_\pi(s,a)+\pi(a|s)\triangledown q_\pi(s,a)\Big] \\ &=\sum_a \Big[\triangledown\pi(a|s)q_\pi(s,a)+\pi(a|s)\triangledown \sum_{s^\prime,r}p(s^\prime|s,a)(r+v_\pi(s^\prime))\Big] \\ &=\sum_a\Big[\triangledown\pi(a|s)q_\pi(s,a)+\pi(a|s)\sum_{s^\prime}p(s^\prime|s,a)\triangledown v_\pi(s^\prime)\Big] \\ &=\sum_a\Big[\triangledown\pi(a|s)q_\pi(s,a)+\pi(a|s)\sum_{s^\prime}p(s^\prime|s,a)\sum_{a^\prime}[\triangledown\pi(a^\prime|s^\prime)q_\pi(s^\prime,a^\prime)+\pi(a^\prime|s^\prime)\sum_{s^{\prime\prime}}p(s^{\prime\prime}|s^\prime,a^\prime)\triangledown v_\pi(s^{\prime\prime})]\Big] \\ &=\sum_{x\in\mathcal{S}}\sum_{k=0}^\infty \Pr(s\to x,k,\pi)\sum_a\triangledown\pi(a|x)q_\pi(x,a) \\ &\propto\sum_s \mu(s)\sum_a\triangledown\pi(a|s)q_\pi(s,a) \\ \end{aligned} \]

其中，$\Pr(s\to x,k,\pi)$表示在策略$\pi$下的$k$步內，狀態$s$轉移到狀態$x$的概率，

連續情況

\[\begin{aligned} \triangledown v_\pi(s) &=\triangledown \Big[\sum_a \pi(a|s)q_\pi(s,a)\Big] \\ &=\sum_a \Big[\triangledown\pi(a|s)q_\pi(s,a)+\pi(a|s)\triangledown q_\pi(s,a)\Big] \\ &=\sum_a\Big[\triangledown\pi(a|s)q_\pi(s,a)+\pi(a|s)\triangledown\sum_{s^\prime,r}p(s^\prime,r|s,a)\big(r-r(\theta)+v_\pi(s^\prime)\big)\Big] \\ &=\sum_a\Big[\triangledown\pi(a|s)q_\pi(s,a)+\pi(a|s)[-\triangledown r(\theta)+\sum_{s^\prime}p(s^\prime|s,a)\triangledown v_\pi(s^\prime)]\Big] \\ \end{aligned} \]

由上可以得到

\[\triangledown r(\theta)=\sum_a\Big[\triangledown\pi(a|s)q_\pi(s,a)+\pi(a|s)\sum_{s^\prime}p(s^\prime|s,a)\triangledown v_\pi(s^\prime)\Big]-\triangledown v_\pi(s) \]

目標函式的梯度為

\[\begin{aligned} \triangledown J(\theta) &=\triangledown r(\theta) \\ &=\sum_a\Big[\triangledown\pi(a|s)q_\pi(s,a)+\pi(a|s)\sum_{s^\prime}p(s^\prime|s,a)\triangledown v_\pi(s^\prime)\Big]-\triangledown v_\pi(s) \\ &(對前一項添加狀態的權重\mu(s)，\sum_s\mu(s)=1，不影響梯度)\\ &=\sum_s\mu(s)\sum_a\Big[\triangledown\pi(a|s)q_\pi(s,a)+\pi(a|s)\sum_{s^\prime}p(s^\prime|s,a)\triangledown v_\pi(s^\prime)\Big]-\triangledown v_\pi(s) \\ &=\sum_s\mu(s)\sum_a \triangledown\pi(a|s)q_\pi(s,a)+\mu(s)\sum_a\pi(a|s)\sum_{s^\prime}p(s^\prime|s,a)\triangledown v_\pi(s^\prime)-\mu(s)\sum_a \triangledown v_\pi(s) \\ &=\sum_s\mu(s)\sum_a\triangledown\pi(a|s)q_\pi(s,a)+\sum_{s^\prime}\mu(s^\prime)\triangledown v_\pi(s^\prime)-\sum_s\mu(s)\triangledown v_\pi(s) \\ &=\sum_s\mu(s)\sum_a\triangledown \pi(a|s)q_\pi(s,a) \end{aligned} \]

References

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (Second Edition). 2018.
Csaba Szepesvári. Algorithms for Reinforcement Learning. 2009.
Richard S. Sutton et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation. 2000.
Course: UCL Reinforcement Learning Course (by David Silver)

轉載請註明出處，本文鏈接：https://www.uj5u.com/qita/227674.html

標籤：其他

上一篇：mysql必知必會--過濾數據

下一篇：云服務器是什么？ECS、BCC、CVM...