Reinforcement Learning (3): Policy-Based Learning and Policy Gradient
1. Policy Learning
Policy Network
- Approximate the policy function with a policy network:
$$\pi(a \mid s_t) \approx \pi(a \mid s_t;\, \theta)$$
- The state-value function and its network approximation:
$$V_\pi(s_t) = \sum_a \pi(a \mid s_t)\, Q_\pi(s_t, a)$$
$$V(s_t;\, \theta) = \sum_a \pi(a \mid s_t;\, \theta) \cdot Q_\pi(s_t, a)$$
- The objective function that policy learning maximizes:
$$J(\theta) = \mathbb{E}_S\left[V(S;\, \theta)\right]$$
- Training proceeds by gradient ascent on the policy:
$$\theta \leftarrow \theta + \beta \cdot \frac{\partial V(s;\, \theta)}{\partial \theta}$$
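As a concrete illustration of $\pi(a \mid s;\theta)$, here is a minimal PyTorch sketch of a policy network for a discrete action space. The architecture (a one-hidden-layer MLP with a softmax output) and the names `state_dim`, `action_dim`, and `hidden_dim` are illustrative assumptions, not something specified in these notes.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Approximates pi(a | s; theta) over a discrete action space."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1),  # output sums to 1: a probability over actions
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```

For example, `PolicyNetwork(state_dim=4, action_dim=2)(torch.randn(4))` returns a length-2 probability vector whose entries sum to 1.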
2. Policy Gradient
Differentiating the approximate state value $V(s;\theta) = \sum_a \pi(a \mid s;\theta)\, Q_\pi(s,a)$ with respect to $\theta$, treating $Q_\pi$ as if it did not depend on $\theta$ (a standard simplification in this derivation), gives:

$$
\begin{aligned}
\frac{\partial V(s;\theta)}{\partial \theta}
&= \sum_a Q_\pi(s,a)\, \frac{\partial \pi(a \mid s;\theta)}{\partial \theta}
 = \int_{\mathcal{A}} Q_\pi(s,a)\, \frac{\partial \pi(a \mid s;\theta)}{\partial \theta}\, \mathrm{d}a \\
&= \sum_a \pi(a \mid s;\theta) \cdot Q_\pi(s,a)\, \frac{\partial \ln \pi(a \mid s;\theta)}{\partial \theta} \\
&= \mathbb{E}_{A \sim \pi(\cdot \mid s;\theta)}\!\left[ Q_\pi(s,A)\, \frac{\partial \ln \pi(A \mid s;\theta)}{\partial \theta} \right] \\
&\approx Q_\pi(s_t, a_t)\, \frac{\partial \ln \pi(a_t \mid s_t;\theta)}{\partial \theta}
\end{aligned}
$$

The sum applies to a discrete action space and the integral to a continuous one. The second line uses the log-derivative trick $\frac{\partial \pi}{\partial \theta} = \pi \cdot \frac{\partial \ln \pi}{\partial \theta}$, which turns the sum into an expectation over $A \sim \pi(\cdot \mid s;\theta)$; the last line is an unbiased Monte Carlo estimate of that expectation from a single sampled action $a_t$.
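Since the whole derivation hinges on the log-derivative identity, a quick numerical check may help. The sketch below (all numbers are made-up illustrations) compares the direct gradient of $V(s;\theta) = \sum_a \pi(a \mid s;\theta)\, Q_\pi(s,a)$ with the expectation form $\sum_a \pi(a)\, Q(a)\, \partial \ln \pi(a) / \partial \theta$ for a 3-action softmax policy.

```python
import torch

# theta parameterizes a softmax policy over 3 actions; q holds made-up
# action values Q(s, a) for a fixed state s.
theta = torch.randn(3, requires_grad=True)
q = torch.tensor([1.0, 2.0, 0.5])

# Left-hand side: differentiate V(s; theta) = sum_a pi(a) * Q(a) directly.
pi = torch.softmax(theta, dim=0)
v = (pi * q).sum()
lhs = torch.autograd.grad(v, theta)[0]

# Right-hand side: sum_a pi(a) * Q(a) * d ln pi(a) / d theta.
rhs = torch.zeros_like(theta)
for a in range(3):
    log_pi_a = torch.log_softmax(theta, dim=0)[a]
    grad_log_pi_a = torch.autograd.grad(log_pi_a, theta)[0]
    rhs += pi[a].detach() * q[a] * grad_log_pi_a

print(torch.allclose(lhs, rhs, atol=1e-6))  # True
```

The two gradients agree, which is exactly why weighting $\partial \ln \pi(a_t \mid s_t;\theta) / \partial \theta$ by $Q_\pi(s_t, a_t)$ for a sampled $a_t$ yields an unbiased estimate of the true gradient.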
- Observe the current state $s_t$.
- Randomly sample an action from the policy: $a_t \sim \pi(\cdot \mid s_t;\, \theta)$.
- Evaluate the action value: $q_t = Q_\pi(s_t, a_t)$.
- Compute the gradient of the policy network: $d_{\theta,t} = \left.\dfrac{\partial \ln \pi(a_t \mid s_t;\, \theta)}{\partial \theta}\right|_{\theta = \theta_t}$.
- Form the approximate policy gradient: $g(a_t, \theta_t) = q_t \cdot d_{\theta,t}$.
- Update the policy network: $\theta_{t+1} = \theta_t + \beta \cdot g(a_t, \theta_t)$.
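Putting the six steps above together, here is a minimal sketch of one policy-gradient update in PyTorch, reusing the hypothetical `PolicyNetwork` from Section 1. Because these notes leave $Q_\pi$ unapproximated (see Section 3), the sketch simply takes $q_t$ as a given scalar; how to obtain it is outside the scope of this note.

```python
import torch
from torch.distributions import Categorical

def policy_gradient_step(policy: "PolicyNetwork",
                         optimizer: torch.optim.Optimizer,
                         state: torch.Tensor,
                         q_t: float) -> int:
    """One update: sample a_t ~ pi(.|s_t; theta), ascend along q_t * d ln pi."""
    probs = policy(state)              # pi(a | s_t; theta)
    dist = Categorical(probs=probs)
    action = dist.sample()             # a_t ~ pi(. | s_t; theta)
    # Gradient *ascent* on q_t * ln pi(a_t | s_t; theta) is implemented as
    # gradient descent on the negated objective.
    loss = -q_t * dist.log_prob(action)
    optimizer.zero_grad()
    loss.backward()                    # accumulates g(a_t, theta_t) = q_t * d_{theta,t}
    optimizer.step()                   # theta_{t+1} = theta_t + beta * g(a_t, theta_t)
    return action.item()
```

With plain `torch.optim.SGD(policy.parameters(), lr=beta)` this reproduces the update $\theta_{t+1} = \theta_t + \beta \cdot g(a_t, \theta_t)$ exactly; an adaptive optimizer such as Adam would rescale the gradient.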
3. Example

At present there is no good way to approximate the action-value function, so no worked example has been written.
by CyrusMay 2022 03 29