1. Policy Learning

Policy Network

• Approximate the policy function with a policy network with parameters $\theta$:

$$\pi(a \mid s_t) \approx \pi(a \mid s_t; \theta)$$

• The state-value function and its approximation:

$$V_\pi(s_t) = \sum_a \pi(a \mid s_t)\, Q_\pi(s_t, a)$$

$$V(s_t; \theta) = \sum_a \pi(a \mid s_t; \theta) \cdot Q_\pi(s_t, a)$$

• Policy learning maximizes the objective function:

$$J(\theta) = \mathbb{E}_S\left[V(S; \theta)\right]$$

• Update the parameters by policy gradient ascent with learning rate $\beta$:

$$\theta \gets \theta + \beta \cdot \frac{\partial V(s; \theta)}{\partial \theta}$$
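The definitions above can be sketched in a few lines of Python. This is only an illustration under strong assumptions that are not in the original notes: a single fixed state, a tabular softmax policy whose parameters `theta` are the logits, and hypothetical action values `q_values` that are known and held fixed. The function names are made up for this sketch.

```python
import math

def softmax(logits):
    """Turn a list of logits into action probabilities pi(a | s; theta)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def state_value(theta, q_values):
    """V(s; theta) = sum_a pi(a | s; theta) * Q_pi(s, a) for one fixed state s."""
    probs = softmax(theta)
    return sum(p * q for p, q in zip(probs, q_values))

def value_gradient(theta, q_values):
    """Exact dV/dtheta for a softmax policy with fixed Q: pi(a) * (Q(a) - V)."""
    probs = softmax(theta)
    v = sum(p * q for p, q in zip(probs, q_values))
    return [p * (q - v) for p, q in zip(probs, q_values)]

def gradient_ascent(theta, q_values, beta=0.5, steps=200):
    """Repeat theta <- theta + beta * dV/dtheta."""
    for _ in range(steps):
        grad = value_gradient(theta, q_values)
        theta = [t + beta * g for t, g in zip(theta, grad)]
    return theta

q_values = [1.0, 2.0, 0.5]   # hypothetical Q_pi(s, a), assumed known and fixed
theta0 = [0.0, 0.0, 0.0]     # uniform initial policy
theta = gradient_ascent(theta0, q_values)
print(state_value(theta0, q_values), state_value(theta, q_values))
```

Because the gradient here is exact, each step can only shift probability toward actions whose value exceeds the current $V(s; \theta)$. In practice $Q_\pi$ is unknown, so the gradient has to be estimated from samples.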

2. Policy Gradient

$$
\begin{aligned}
\frac{\partial V(s;\theta)}{\partial\theta}
&= \sum_a Q_\pi(s,a)\,\frac{\partial \pi(a \mid s;\theta)}{\partial\theta} \\
&= \int_a Q_\pi(s,a)\,\frac{\partial \pi(a \mid s;\theta)}{\partial\theta}\,\mathrm{d}a \quad \text{(continuous actions)} \\
&= \sum_a \pi(a \mid s;\theta) \cdot Q_\pi(s,a)\,\frac{\partial \ln[\pi(a \mid s;\theta)]}{\partial\theta} \\
&= \mathbb{E}_{A \sim \pi(\cdot \mid s;\theta)}\left[Q_\pi(s,A)\,\frac{\partial \ln[\pi(A \mid s;\theta)]}{\partial\theta}\right] \\
&\approx Q_\pi(s_t,a_t)\,\frac{\partial \ln[\pi(a_t \mid s_t;\theta)]}{\partial\theta}
\end{aligned}
$$

The third line uses the identity $\frac{\partial \pi}{\partial\theta} = \pi \cdot \frac{\partial \ln \pi}{\partial\theta}$. The last line replaces the expectation with a single sampled action $a_t$, which is a Monte Carlo approximation. The derivation treats $Q_\pi$ as if it did not depend on $\theta$, which keeps the argument short.
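The chain of equalities can be checked numerically. The sketch below assumes a single state, a tabular softmax policy with logits `theta`, and hypothetical fixed action values `q`. None of these names come from the original notes. It compares the sum form (with the derivative of $\pi$ taken by finite differences), the expectation form, and a sampled average.

```python
import math
import random

def softmax(logits):
    """Action probabilities pi(a | s; theta) from logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, a):
    """d ln pi(a | s; theta) / d theta for a softmax policy: onehot(a) - pi."""
    probs = softmax(theta)
    return [(1.0 if k == a else 0.0) - p for k, p in enumerate(probs)]

def sum_form(theta, q, eps=1e-6):
    """sum_a Q(a) * d pi(a)/d theta, derivative taken by central differences."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        p_up, p_down = softmax(up), softmax(down)
        grad.append(sum(qa * (pu - pd) / (2 * eps)
                        for qa, pu, pd in zip(q, p_up, p_down)))
    return grad

def expectation_form(theta, q):
    """sum_a pi(a) * Q(a) * d ln pi(a)/d theta, the exact expectation."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, p_a in enumerate(probs):
        for k, g in enumerate(grad_log_pi(theta, a)):
            grad[k] += p_a * q[a] * g
    return grad

def sampled_form(theta, q, n, rng):
    """Average of Q(A) * d ln pi(A)/d theta over n actions A drawn from pi."""
    probs = softmax(theta)
    per_action = [grad_log_pi(theta, a) for a in range(len(theta))]
    grad = [0.0] * len(theta)
    for a in rng.choices(range(len(theta)), weights=probs, k=n):
        for k, g in enumerate(per_action[a]):
            grad[k] += q[a] * g / n
    return grad

theta = [0.2, -0.1, 0.4]   # hypothetical logits
q = [1.0, 2.0, 0.5]        # hypothetical Q_pi(s, a), assumed fixed
print(sum_form(theta, q))
print(expectation_form(theta, q))
print(sampled_form(theta, q, 50_000, random.Random(0)))
```

The first two printed vectors should agree up to finite-difference error, and the third should be close to them but noisy. A single-sample estimate (`n = 1`) corresponds to the last line of the derivation: unbiased, but with high variance.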

• Observe the state $s_t$.

• Randomly sample an action from the policy function:

$$a_t \sim \pi(\cdot \mid s_t; \theta_t)$$

• Evaluate the action-value function:

$$q_t = Q_\pi(s_t, a_t)$$

• Differentiate the policy network with respect to its parameters:

$$d_{\theta,t} = \left.\frac{\partial \ln[\pi(a_t \mid s_t; \theta)]}{\partial\theta}\right|_{\theta=\theta_t}$$

• Compute the approximate policy gradient:

$$g(a_t, \theta_t) = q_t \cdot d_{\theta,t}$$

• Update the policy network:

$$\theta_{t+1} = \theta_t + \beta \cdot g(a_t, \theta_t)$$
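The update steps above can be sketched as a short loop. This is a minimal illustration, not a full algorithm. It assumes a single state (so observing $s_t$ is trivial), a tabular softmax policy with logits `theta`, and a hypothetical lookup table `q_table` standing in for $Q_\pi(s_t, a)$, which in practice has to be estimated. All names are invented for the sketch.

```python
import math
import random

def softmax(logits):
    """Action probabilities pi(a | s_t; theta) from logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(theta, q_table, beta, rng):
    """One pass through the steps: sample, evaluate, differentiate, update."""
    probs = softmax(theta)
    # Sample a_t ~ pi(. | s_t; theta_t).
    a_t = rng.choices(range(len(theta)), weights=probs)[0]
    # q_t = Q_pi(s_t, a_t); here read from an assumed, fixed table.
    q_t = q_table[a_t]
    # d_{theta,t} = d ln pi(a_t | s_t; theta) / d theta = onehot(a_t) - pi.
    d = [(1.0 if k == a_t else 0.0) - p for k, p in enumerate(probs)]
    # g(a_t, theta_t) = q_t * d_{theta,t}.
    g = [q_t * dk for dk in d]
    # theta_{t+1} = theta_t + beta * g(a_t, theta_t).
    return [t + beta * gk for t, gk in zip(theta, g)]

def train(q_table, beta=0.1, steps=2000, seed=0):
    """Run the update repeatedly from a uniform initial policy."""
    rng = random.Random(seed)
    theta = [0.0] * len(q_table)
    for _ in range(steps):
        theta = policy_gradient_step(theta, q_table, beta, rng)
    return theta

q_table = [1.0, 2.0, 0.5]   # hypothetical Q_pi(s_t, a) for three actions
theta = train(q_table)
print(softmax(theta))       # learned action probabilities
```

Each update is driven by one sampled action, so individual runs are noisy. With a neural policy network the hand-written `d` would be replaced by automatic differentiation of $\ln \pi(a_t \mid s_t; \theta)$.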

3. Example

