Reinforcement Learning Principles in Python, Part 02: Deriving and Solving the Bellman Equation


  • Concepts
    • bootstrapping
    • state value
  • The Bellman Equation
  • The Bellman equation and its Python implementation
    • Solution 1: closed-form solution
    • Solution 2: iterative method
  • action value
  • Ref

This chapter follows the chapter "State Values and Bellman Equation" of Prof. Shiyu Zhao's textbook Mathematical Foundations of Reinforcement Learning. Please read it alongside this post; this series focuses only on implementing the mathematical concepts in code.

Concepts

We use bootstrapping to introduce the idea of state values.

bootstrapping

[Figure: a four-state loop s1 → s2 → s3 → s4 → s1, each transition yielding a reward r_i]
Let $v_i$ denote the return obtained starting from state $s_i$, $i = 1,\dots,4$:

$$
\begin{aligned}
v_1 &= r_1 + \gamma r_2 + \gamma^2 r_3 + \dots = r_1 + \gamma v_2\\
v_2 &= r_2 + \gamma r_3 + \gamma^2 r_4 + \dots = r_2 + \gamma v_3\\
v_3 &= r_3 + \gamma r_4 + \gamma^2 r_1 + \dots = r_3 + \gamma v_4\\
v_4 &= r_4 + \gamma r_1 + \gamma^2 r_2 + \dots = r_4 + \gamma v_1
\end{aligned}
$$
In matrix form,

$$
\begin{bmatrix} v_1\\ v_2\\ v_3\\ v_4 \end{bmatrix}
=
\begin{bmatrix} r_1\\ r_2\\ r_3\\ r_4 \end{bmatrix}
+ \gamma
\begin{bmatrix} 0&1&0&0\\ 0&0&1&0\\ 0&0&0&1\\ 1&0&0&0 \end{bmatrix}
\begin{bmatrix} v_1\\ v_2\\ v_3\\ v_4 \end{bmatrix}
$$

which we write as

$$
\pmb v = \pmb r + \gamma \pmb P \pmb v,\qquad
\pmb v = (\pmb I - \gamma \pmb P)^{-1}\pmb r
$$
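
As a quick numerical sanity check of this closed form, here is a minimal numpy sketch for the four-state loop above; the rewards $r = [1, 2, 3, 4]$ and $\gamma = 0.9$ are made-up values used only for illustration.

import numpy as np

gamma = 0.9
P = np.array([               # the loop s1 -> s2 -> s3 -> s4 -> s1
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)
r = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)   # example rewards, chosen arbitrarily

v = np.linalg.solve(np.identity(4) - gamma * P, r)  # v = (I - gamma*P)^{-1} r
print(np.allclose(v, r + gamma * P.dot(v)))         # True: v satisfies v = r + gamma*P*v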

state value

$$
S_t \xrightarrow{A_t} S_{t+1},\ R_{t+1}
$$

denotes taking action $A_t$ in state $S_t$, moving to $S_{t+1}$ and receiving reward $R_{t+1}$. Starting from time $t$ we obtain a trajectory

$$
S_t \xrightarrow{A_t} S_{t+1}, R_{t+1} \xrightarrow{A_{t+1}} S_{t+2}, R_{t+2} \xrightarrow{A_{t+2}} S_{t+3}, R_{t+3}\ \dots
$$

The discounted return is

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots,\qquad \gamma \in (0, 1)
$$

The state value is defined as

$$
v_\pi(s) = E[G_t \mid S_t = s]
$$
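
To make the definition concrete, here is a small Monte Carlo sketch that estimates $v_\pi(s)$ by averaging sampled discounted returns. The two-state Markov reward process below (transition matrix, rewards, truncation horizon) is made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# A made-up 2-state Markov reward process (the policy is already folded in):
P = np.array([[0.8, 0.2],      # P[s, s'] = probability of moving from s to s'
              [0.3, 0.7]])
r = np.array([1.0, -1.0])      # reward collected when leaving each state

def sample_return(s, horizon=200):
    """Sample one (truncated) discounted return G_t starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * r[s]
        discount *= gamma
        s = rng.choice(2, p=P[s])
    return g

# Estimate v(s) = E[G_t | S_t = s] by averaging many sampled returns
v_mc = [np.mean([sample_return(s) for _ in range(2000)]) for s in (0, 1)]
print(v_mc)                                              # Monte Carlo estimate
print(np.linalg.solve(np.identity(2) - gamma * P, r))    # exact value for comparison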

The Bellman Equation

First, the discounted return of the trajectory starting at time $t$ satisfies

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots\\
&= R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots)\\
&= R_{t+1} + \gamma G_{t+1}
\end{aligned}
$$
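
This recursion is also how returns are usually computed in practice: one backward pass over a finite reward sequence. A minimal sketch, with an arbitrary reward list:

import numpy as np

gamma = 0.9
rewards = [1.0, 0.0, -1.0, 2.0]   # R_1, R_2, R_3, R_4 of a finite episode (arbitrary numbers)

# Backward pass: G_t = R_{t+1} + gamma * G_{t+1}, with the return after the last step set to 0
returns, g = [], 0.0
for rew in reversed(rewards):
    g = rew + gamma * g
    returns.insert(0, g)

# Cross-check against the direct definition G_t = sum_k gamma^k * R_{t+k+1}
direct = [sum(gamma**k * rewards[t + k] for k in range(len(rewards) - t))
          for t in range(len(rewards))]
print(np.allclose(returns, direct))   # True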

Then the state value is

$$
\begin{aligned}
v_\pi(s) &= E[G_t \mid S_t = s]\\
&= E[R_{t+1} + \gamma G_{t+1} \mid S_t = s]\\
&= E[R_{t+1} \mid S_t = s] + \gamma E[G_{t+1} \mid S_t = s]
\end{aligned}
$$
By the law of total expectation, $E(E(Y|X)) = E(Y)$, the first term can be expanded over actions:

$$
\begin{aligned}
E[R_{t+1}|S_t=s]
&= \int_{r\in R} r\, f_{R|S}(r|s)\,dr\\
&= \int_{r\in R} r\, \frac{f_{R,S}(r,s)}{f_S(s)}\,dr\\
&= \int_{r\in R} r\, \frac{\int_{a\in A} f_{R,S,A}(r,s,a)\,da}{f_S(s)}\,dr\\
&= \int_{r\in R} r \int_{a\in A} f_{R|S,A}(r|s,a)\,\frac{f_{S,A}(s,a)}{f_S(s)}\,da\,dr\\
&= \int_{r\in R} r \int_{a\in A} f_{R|S,A}(r|s,a)\, f_{A|S}(a|s)\,da\,dr\\
&= \int_{a\in A} \pi(a|s)\int_{r\in R} r\, f_{R|S,A}(r|s,a)\,dr\,da\\
&= \int_{a\in A} \pi(a|s)\,E[R_{t+1}|S_t=s, A_t=a]\,da
\end{aligned}
$$
Similarly, the second term expands over next states:

$$
\begin{aligned}
E[G_{t+1}|S_t = s]
&= \int_{s'\in S} p(s'|s)\,E[G_{t+1}|S_t = s, S_{t+1}=s']\,ds'\\
&= \int_{s'\in S} p(s'|s)\,E[G_{t+1}|S_{t+1}=s']\,ds' \quad (\text{Markov property})\\
&= \int_{s'\in S} p(s'|s)\,v_\pi(s')\,ds'\\
&= \int_{s'\in S} v_\pi(s')\int_{a\in A} p(s'|s,a)\,\pi(a|s)\,da\,ds'
\end{aligned}
$$
Therefore, the Bellman equation is

$$
\begin{aligned}
v_\pi(s) &= E[R_{t+1}|S_t=s] + \gamma E[G_{t+1}|S_t=s]\\
&= \int_{a\in A}\pi(a|s)\,E[R_{t+1}|S_t=s, A_t=a]\,da
 + \gamma \int_{s'\in S} v_\pi(s')\int_{a\in A} p(s'|s,a)\,\pi(a|s)\,da\,ds'\\
&= \int_{a\in A}\pi(a|s)\int_{r\in R} r\,f(r|s,a)\,dr\,da
 + \gamma \int_{a\in A}\pi(a|s)\int_{s'\in S} f(s'|s,a)\,v_\pi(s')\,ds'\,da\\
&= \int_{a\in A}\pi(a|s)\Big[\int_{r\in R} r\,f(r|s,a)\,dr
 + \gamma \int_{s'\in S} f(s'|s,a)\,v_\pi(s')\,ds'\Big]da\\
&= \int_{a\in A}\pi(a|s)\Big[\int_{r\in R}\int_{s'\in S} r\,f(r,s'|s,a)\,ds'\,dr
 + \gamma \int_{s'\in S}\int_{r\in R} f(s',r|s,a)\,v_\pi(s')\,dr\,ds'\Big]da\\
&= \int_{a\in A}\pi(a|s)\Big[\int_{r\in R}\int_{s'\in S} f(r,s'|s,a)\,\big[r + \gamma v_\pi(s')\big]\,ds'\,dr\Big]da
\end{aligned}
$$

The Bellman equation and its Python implementation

Define

$$
r_\pi(s) = \int_{a\in A}\pi(a|s)\int_{r\in R} r\,f(r|s,a)\,dr\,da,
$$

the expected immediate reward in state $s$ under policy $\pi$, and

$$
p_\pi(s'|s) = \int_{a\in A} f(s'|s,a)\,\pi(a|s)\,da,
$$

the probability of transitioning from $s$ to $s'$ under $\pi$. The Bellman equation can then be written as

$$
\begin{aligned}
v_\pi(s) &= \int_{a\in A}\pi(a|s)\int_{r\in R} r\,f(r|s,a)\,dr\,da
 + \gamma \int_{a\in A}\pi(a|s)\int_{s'\in S} f(s'|s,a)\,v_\pi(s')\,ds'\,da\\
&= r_\pi(s) + \gamma \int_{s'\in S} v_\pi(s')\,p_\pi(s'|s)\,ds'
\end{aligned}
$$
For a discrete state space, this becomes

$$
v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{s_j\in S} v_\pi(s_j)\,p_\pi(s_j|s_i)
$$
and in matrix form,

$$
\pmb v = \pmb r + \gamma \pmb P \pmb v,\qquad
\pmb v = (\pmb I - \gamma \pmb P)^{-1}\pmb r
$$
Consider the following example (shown as a small grid world in the original figure): starting from $s_1$, the policy takes one of two actions with probability 0.5 each, moving to $s_2$ or $s_3$ (one transition yields reward $0$, the other $-1$); from $s_2$, $s_3$, and $s_4$ the policy moves to (or stays at) the target $s_4$ with reward $1$.
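
Before solving it, here is a minimal sketch of how $r_\pi$ and $P_\pi$ can be assembled from tabular $\pi(a|s)$, $p(s'|s,a)$ and $r(s,a)$. The action layout below (which action leads to which state, and the duplicated dummy action in $s_2$–$s_4$) is an assumption, chosen so that the result reproduces the R and P used in the code that follows.

import numpy as np

def policy_averaged_model(pi, p, r):
    # r_pi[s]     = sum_a pi(a|s) * r(s,a)
    # P_pi[s, s'] = sum_a pi(a|s) * p(s'|s,a)
    r_pi = np.einsum("sa,sa->s", pi, r)
    P_pi = np.einsum("sa,sat->st", pi, p)
    return r_pi.reshape(-1, 1), P_pi

# Assumed tabular model for the example (2 actions; s2..s4 use a duplicated dummy action)
pi = np.array([[0.5, 0.5], [1, 0], [1, 0], [1, 0]], dtype=float)   # pi(a|s)
r  = np.array([[0.0, -1.0], [1, 1], [1, 1], [1, 1]], dtype=float)  # r(s,a)
p  = np.zeros((4, 2, 4))                                           # p(s'|s,a)
p[0, 0, 1] = 1.0   # s1, a1 -> s2 (reward 0, by assumption)
p[0, 1, 2] = 1.0   # s1, a2 -> s3 (reward -1, by assumption)
p[1, :, 3] = 1.0   # s2 -> s4
p[2, :, 3] = 1.0   # s3 -> s4
p[3, :, 3] = 1.0   # s4 -> s4

r_pi, P_pi = policy_averaged_model(pi, p, r)
print(r_pi.ravel())  # -> [-0.5, 1., 1., 1.], matching R below
print(P_pi)          # matches P below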

Solution 1: closed-form solution

Inverting the matrix gives the solution directly.

import numpy as np

# Closed-form solution of the Bellman equation for state values
def closed_form_solution(R, P, gamma):

    # number of states
    n = R.shape[0]
    # identity matrix
    I = np.identity(n)
    matrix_inverse = np.linalg.inv(I - gamma * P)

    # matrix product
    return matrix_inverse.dot(R)


R = np.array([(0.5 * 0 + 0.5 * (-1)), 1., 1., 1.]).reshape(-1, 1)
P = np.array([
    [0, 0.5, 0.5, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
])

closed_form_solution(R, P, 0.9)

Output:

array([[ 8.5],
       [10. ],
       [10. ],
       [10. ]])
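
As a quick check (reusing R, P and closed_form_solution defined above), the result can be substituted back into $\pmb v = \pmb r + \gamma \pmb P \pmb v$:

v = closed_form_solution(R, P, 0.9)
# v should satisfy v = R + gamma * P v
print(np.allclose(v, R + 0.9 * P.dot(v)))   # True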

Solution 2: iterative method

Proof sketch: write the iteration as $v_{k+1} = r + \gamma P v_k$ and let $v^\ast = (I - \gamma P)^{-1} r$ be the true state value, which satisfies $v^\ast = r + \gamma P v^\ast$. Subtracting the two equations gives $v_{k+1} - v^\ast = \gamma P (v_k - v^\ast)$. Because $P$ is row-stochastic, $\|Px\|_\infty \le \|x\|_\infty$, so $\|v_k - v^\ast\|_\infty \le \gamma^k \|v_0 - v^\ast\|_\infty \to 0$ as $k \to \infty$ since $\gamma < 1$. Hence the iteration converges to $v^\ast$ from any initial guess; the full contraction-mapping argument is given in Zhao's textbook.

def iterative_solution(n_iter, R, P, gamma):
    # n_iter: number of iterations
    # initialize v randomly
    n = R.shape[0]
    v = np.random.rand(n, 1)
    for _ in range(n_iter):
        v = R + (gamma * P).dot(v)
    return v


iterative_solution(100, R, P, 0.9)

Output:

array([[8.49974039],
       [9.99974039],
       [9.99974039],
       [9.99974039]])
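
To see the $\gamma^k$ contraction from the proof sketch in action, we can track the error against the closed-form solution (again reusing R, P and closed_form_solution from above); each iteration shrinks the maximum error by a factor of at most $\gamma$:

v_star = closed_form_solution(R, P, 0.9)
v = np.zeros_like(R)
for k in range(8):
    err = np.abs(v - v_star).max()
    print(k, err)                      # errors decay roughly like 0.9**k
    v = R + 0.9 * P.dot(v)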

action value

The action value is the expected return obtained by taking action $a$ in state $s$ and following the policy thereafter:

$$
q_\pi(s,a) = E[G_t \mid S_t = s, A_t = a]
$$

The link between action value and state value follows from the law of total expectation:

$$
\begin{aligned}
E[G_t \mid S_t=s] &= E_{A_t|S_t}\big(E[G_t \mid S_t=s, A_t=a]\big)\\
&= \int_{a\in A} E[G_t \mid S_t=s, A_t=a]\,\pi(a|s)\,da
\end{aligned}
$$
Therefore

$$
v_\pi(s) = \int_{a\in A}\pi(a|s)\,q_\pi(s,a)\,da,
$$

i.e. the state value is the expectation of the action value under the policy.

Substituting the Bellman equation then gives

$$
q_\pi(s,a) = r(s,a) + \gamma \int_{s'\in S} p(s'|s,a)\,v_\pi(s')\,ds'
$$

$$
v_\pi(s) = \int_{a\in A}\pi(a|s)\Big[r(s,a) + \gamma \int_{s'\in S} p(s'|s,a)\,v_\pi(s')\,ds'\Big]da
$$

where $r(s,a)$ is the expected immediate reward of taking action $a$ in state $s$.
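
For the four-state example above, the action values at $s_1$ can be computed directly from the state values already obtained. The pairing of rewards with successor states below (reward 0 toward $s_2$, reward $-1$ toward $s_3$) is an assumption; it is consistent with, but not uniquely determined by, the R and P used earlier.

gamma = 0.9
v = [8.5, 10.0, 10.0, 10.0]           # state values from the closed-form solution above

q_s1_a1 = 0.0 + gamma * v[1]          # assumed action a1: reward 0,  then to s2  -> 9.0
q_s1_a2 = -1.0 + gamma * v[2]         # assumed action a2: reward -1, then to s3  -> 8.0

# v(s1) is the policy-weighted average of the action values
print(0.5 * q_s1_a1 + 0.5 * q_s1_a2)  # 8.5, matching v[0]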

Ref

Mathematical Foundations of Reinforcement Learning, Shiyu Zhao
