Reinforcement Learning (3): Policy-Based Learning and Policy Gradient


1. Policy Learning

Policy Network

  • Approximate the policy function with a policy network (see the sketch after this list):

    $\pi(a \mid s_t) \approx \pi(a \mid s_t; \theta)$

  • State-value function and its approximation:

    $V_\pi(s_t) = \sum_a \pi(a \mid s_t)\, Q_\pi(s_t, a)$

    $V(s_t; \theta) = \sum_a \pi(a \mid s_t; \theta)\, Q_\pi(s_t, a)$

  • Objective function that policy learning maximizes:

    $J(\theta) = \mathbb{E}_S\!\left[ V(S; \theta) \right]$

  • Update the parameters by policy gradient ascent:

    $\theta \leftarrow \theta + \beta \cdot \dfrac{\partial V(s;\theta)}{\partial \theta}$
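
As a concrete illustration of the list above, the sketch below builds a small policy network in PyTorch; the discrete action space, the layer sizes, the `PolicyNetwork` name, and the placeholder Q values are assumptions for illustration, not from the original post. The softmax output is $\pi(\cdot \mid s;\theta)$, and the last lines form the approximation $V(s;\theta) = \sum_a \pi(a \mid s;\theta)\,Q_\pi(s,a)$.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Approximates pi(a|s; theta): a small MLP whose softmax output is a
    probability distribution over a discrete action space."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # pi(.|s; theta): softmax over the network's action scores
        return torch.softmax(self.net(state), dim=-1)

# V(s; theta) = sum_a pi(a|s; theta) * Q_pi(s, a), with placeholder Q values
policy = PolicyNetwork(state_dim=4, action_dim=2)
state = torch.randn(4)
q_values = torch.tensor([1.0, 0.5])     # hypothetical Q_pi(s, a) for illustration
v = (policy(state) * q_values).sum()    # approximation of the state value V(s; theta)
print(v.item())
```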

2. Policy Gradient

For a discrete action space, differentiating $V(s;\theta) = \sum_a \pi(a \mid s;\theta)\,Q_\pi(s,a)$ with respect to $\theta$ gives the policy gradient (for a continuous action space, the sum over actions is replaced by an integral):

$$
\begin{aligned}
\frac{\partial V(s;\theta)}{\partial \theta}
&= \sum_a Q_\pi(s,a)\,\frac{\partial \pi(a \mid s;\theta)}{\partial \theta} \\
&= \sum_a \pi(a \mid s;\theta)\,Q_\pi(s,a)\,\frac{\partial \ln \pi(a \mid s;\theta)}{\partial \theta} \\
&= \mathbb{E}_{A \sim \pi(\cdot \mid s;\theta)}\!\left[ Q_\pi(s,A)\,\frac{\partial \ln \pi(A \mid s;\theta)}{\partial \theta} \right] \\
&\approx Q_\pi(s_t, a_t)\,\frac{\partial \ln \pi(a_t \mid s_t;\theta)}{\partial \theta}
\end{aligned}
$$

The second line uses the identity $\frac{\partial \pi}{\partial \theta} = \pi \cdot \frac{\partial \ln \pi}{\partial \theta}$, and the last line is a Monte Carlo approximation of the expectation based on a single action $a_t$ sampled from $\pi(\cdot \mid s_t;\theta)$.
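
As a sanity check on the last two lines, the short sketch below (PyTorch, with a hypothetical 3-action softmax policy and placeholder Q values, none of which come from the original post) compares the exact gradient of $\sum_a \pi(a \mid s;\theta)\,Q_\pi(s,a)$ with the Monte Carlo estimate built from sampled actions and the log-derivative; the two agree up to sampling noise.

```python
import torch

# Numerical check of the log-derivative identity on a hypothetical 3-action example.
# theta parameterizes pi(.|s; theta) as a softmax over logits (an illustrative choice);
# q holds placeholder values standing in for Q_pi(s, a).
theta = torch.tensor([0.2, -0.1, 0.4], requires_grad=True)
q = torch.tensor([1.0, 2.0, 0.5])

# Exact gradient: d/dtheta of sum_a pi(a|s; theta) * Q_pi(s, a)
pi = torch.softmax(theta, dim=0)
exact = torch.autograd.grad((pi * q).sum(), theta)[0]

# Monte Carlo estimate: E_{A ~ pi}[ Q_pi(s, A) * d ln pi(A|s; theta) / d theta ],
# computed as the gradient of the surrogate mean(q[A] * ln pi(A|s; theta)).
samples = torch.multinomial(pi.detach(), num_samples=20000, replacement=True)
log_pi = torch.log_softmax(theta, dim=0)
surrogate = (q[samples] * log_pi[samples]).mean()
mc = torch.autograd.grad(surrogate, theta)[0]

print(exact)  # the two gradients should agree up to Monte Carlo noise
print(mc)
```

With the gradient in hand, one update of the policy network proceeds as follows: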

  • Observe the current state $s_t$.

  • Randomly sample an action from the policy: $a_t \sim \pi(\cdot \mid s_t; \theta)$.

  • Evaluate the action value $q_t = Q_\pi(s_t, a_t)$.

  • Compute the derivative of the log-policy with respect to the network parameters, evaluated at the current parameters $\theta_t$:

    $d_{\theta,t} = \left.\dfrac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}\right|_{\theta=\theta_t}$

  • Form the approximate (stochastic) policy gradient:

    $g(a_t, \theta_t) = q_t \cdot d_{\theta,t}$

  • Update the policy network by gradient ascent (a single update step is sketched after this list):

    $\theta_{t+1} = \theta_t + \beta \cdot g(a_t, \theta_t)$
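
Putting the steps above together, the sketch below performs one such update in PyTorch under assumed simplifications: a hypothetical linear-softmax policy, and a constant placeholder for $q_t$, since (as Section 3 notes) the post has not yet fixed a way to approximate $Q_\pi(s_t, a_t)$.

```python
import torch

# One gradient-ascent step following the list above (a sketch, not a full agent).
# Assumes a discrete action space and a hypothetical linear-softmax policy;
# q_t is a placeholder constant standing in for Q_pi(s_t, a_t).
state_dim, action_dim, beta = 4, 2, 1e-2
theta = torch.zeros(state_dim, action_dim, requires_grad=True)  # policy parameters

def policy(s: torch.Tensor) -> torch.Tensor:
    """pi(.|s; theta): softmax over a linear score of the state."""
    return torch.softmax(s @ theta, dim=-1)

s_t = torch.randn(state_dim)                              # observe state s_t
a_t = torch.multinomial(policy(s_t).detach(), 1).item()   # sample a_t ~ pi(.|s_t; theta)
q_t = 1.0                                                 # placeholder for Q_pi(s_t, a_t)

log_pi = torch.log(policy(s_t)[a_t])                      # ln pi(a_t | s_t; theta)
d_theta = torch.autograd.grad(log_pi, theta)[0]           # d_{theta,t}
g = q_t * d_theta                                         # approximate policy gradient
with torch.no_grad():
    theta += beta * g                                     # theta_{t+1} = theta_t + beta * g
```

In practice, $q_t$ would have to be supplied by some approximation of $Q_\pi$, which is exactly the gap Section 3 points out.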

3. Example

A good way to approximate the action-value function $Q_\pi$ has not been introduced yet, so no worked example is written here.

by CyrusMay 2022 03 29
