Reinforcement Learning: TD Algorithms (Sarsa + Q-learning)
1. The Sarsa Algorithm
1.1 TD Target
- The return is defined as follows (a small numeric check of this recursion appears at the end of this subsection):
$$
U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \\
U_t = R_t + \gamma \left( R_{t+1} + \gamma R_{t+2} + \cdots \right) \\
U_t = R_t + \gamma U_{t+1}
$$
- Assume the reward at time t depends on the state and action at time t and on the state at time t+1:
$$
R_t \gets (S_t, A_t, S_{t+1})
$$
- Then the action-value function can be written as:
$$
Q_\pi(s_t, a_t) = E[U_t \mid a_t, s_t] \\
Q_\pi(s_t, a_t) = E[R_t + \gamma U_{t+1} \mid a_t, s_t] \\
Q_\pi(s_t, a_t) = E[R_t \mid a_t, s_t] + \gamma E[U_{t+1} \mid a_t, s_t] \\
Q_\pi(s_t, a_t) = E[R_t \mid a_t, s_t] + \gamma E[Q_\pi(S_{t+1}, A_{t+1}) \mid a_t, s_t] \\
Q_\pi(s_t, a_t) = E[R_t + \gamma Q_\pi(S_{t+1}, A_{t+1})]
$$
- Using a Monte Carlo approximation (replacing the expectation with one observed sample):
$$
y_t = r_t + \gamma Q_\pi(s_{t+1}, a_{t+1})
$$
- The goal of TD learning is to make:
$$
y_t \approx Q_\pi(s_t, a_t)
$$
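The recursion $U_t = R_t + \gamma U_{t+1}$ can be checked numerically. Below is a minimal Python sketch with made-up rewards and discount factor; all values are assumptions chosen only for illustration:

```python
# Minimal numeric check of U_t = R_t + gamma * U_{t+1} (toy rewards, finite horizon)
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]  # R_t, R_{t+1}, R_{t+2}, R_{t+3}

def discounted_return(t):
    """U_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... (truncated at the last reward)."""
    return sum(gamma ** i * r for i, r in enumerate(rewards[t:]))

direct = discounted_return(0)                          # U_0 computed directly
recursive = rewards[0] + gamma * discounted_return(1)  # R_0 + gamma * U_1
print(direct, recursive)                               # both print the same value (~4.807)
```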
1.2 Tabular Sarsa
- Learn the action-value function $Q_\pi(s, a)$.
- Assume the numbers of states and actions are finite.
- Then the function can be stored and learned as a table, one entry per state-action pair:
| S \ A | $a_1$ | $a_2$ | $a_3$ | $a_4$ | … |
| --- | --- | --- | --- | --- | --- |
| $s_1$ | $Q_{11}$ | … | … | … | … |
| $s_2$ | … | … | … | … | … |
| $s_3$ | … | … | … | … | … |
| $s_4$ | … | … | … | … | … |
| … | … | … | … | … | … |
The update procedure is as follows (a minimal Python sketch follows this list):
- Observe a transition:
$$
(s_t, a_t, r_t, s_{t+1})
$$
- Sample the next action from the policy function:
$$
a_{t+1} \sim \pi(\cdot \mid s_{t+1})
$$
- Look up the table to compute the TD target:
$$
y_t = r_t + \gamma Q_\pi(s_{t+1}, a_{t+1})
$$
- The TD error is:
$$
\delta_t = Q_\pi(s_t, a_t) - y_t
$$
- Update the table entry:
$$
Q_\pi(s_t, a_t) \gets Q_\pi(s_t, a_t) - \alpha \cdot \delta_t
$$
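A minimal sketch of one tabular Sarsa update in Python, assuming a small environment with integer-indexed states and actions; the sizes, hyperparameters, and the `epsilon_greedy` helper are illustrative assumptions, not part of the original notes:

```python
import numpy as np

# Hypothetical sizes and hyperparameters (assumptions for illustration)
n_states, n_actions = 5, 4
gamma, alpha, epsilon = 0.9, 0.1, 0.1
Q = np.zeros((n_states, n_actions))        # the table Q_pi(s, a)

def epsilon_greedy(s):
    """Behavior policy pi(.|s): mostly greedy with respect to the current table."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_update(s, a, r, s_next):
    """One tabular Sarsa step for the transition (s_t, a_t, r_t, s_{t+1})."""
    a_next = epsilon_greedy(s_next)        # a_{t+1} ~ pi(.|s_{t+1})
    y = r + gamma * Q[s_next, a_next]      # TD target
    delta = Q[s, a] - y                    # TD error
    Q[s, a] -= alpha * delta               # table update
    return a_next
```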
1.3 Sarsa with a Neural Network
- Approximate the action-value function with a neural network (a minimal update sketch follows this list):
$$
q(s, a; W) \approx Q_\pi(s, a)
$$
- The network acts as a critic that evaluates actions.
- The parameters W are learned from data.
- The TD target is:
$$
y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; W)
$$
- The TD error is:
$$
\delta_t = q(s_t, a_t; W) - y_t
$$
- The loss is:
$$
\frac{1}{2} \cdot \delta_t^2
$$
- The gradient is:
$$
\delta_t \cdot \frac{\partial q(s_t, a_t; W)}{\partial W}
$$
- Perform gradient descent:
$$
W \gets W - \alpha \cdot \delta_t \cdot \frac{\partial q(s_t, a_t; W)}{\partial W}
$$
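A minimal PyTorch sketch of one such update, assuming a small fully connected network and that the next action $a_{t+1}$ has already been sampled from the policy; the network shape and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical network and hyperparameters (assumptions for illustration)
state_dim, n_actions, gamma, lr = 4, 2, 0.9, 1e-3
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=lr)

def sarsa_nn_update(s, a, r, s_next, a_next):
    """One Sarsa update of q(s, a; W) on a transition plus the sampled a_{t+1}."""
    q_sa = q_net(s)[a]                          # q(s_t, a_t; W)
    with torch.no_grad():                       # the TD target is treated as a constant
        y = r + gamma * q_net(s_next)[a_next]   # y_t = r_t + gamma * q(s_{t+1}, a_{t+1}; W)
    loss = 0.5 * (q_sa - y) ** 2                # 1/2 * delta_t^2
    optimizer.zero_grad()
    loss.backward()                             # gradient: delta_t * dq/dW
    optimizer.step()                            # W <- W - alpha * gradient

# Usage with dummy tensors
s, s_next = torch.randn(state_dim), torch.randn(state_dim)
sarsa_nn_update(s, a=0, r=1.0, s_next=s_next, a_next=1)
```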
2. The Q-learning Algorithm
Q-learning learns the optimal action-value function:
$$
Q^\star(s, a)
$$
2.1 TD Target
$$
Q_\pi(s_t, a_t) = E[R_t + \gamma \cdot Q_\pi(S_{t+1}, A_{t+1})]
$$
Denote the optimal policy by $\pi^\star$. Then:
$$
Q^\star(s_t, a_t) = Q_{\pi^\star}(s_t, a_t) = E[R_t + \gamma \cdot Q_{\pi^\star}(S_{t+1}, A_{t+1})]
$$
The action at time t+1 is the greedy action with respect to the optimal action-value function:
$$
A_{t+1} = \mathop{\arg\max}\limits_{a} Q^\star(s_{t+1}, a)
$$
Then the optimal action-value function satisfies the following recursion, which is approximated with a single observed transition:
$$
Q^\star(s_t, a_t) = E[R_t + \gamma \cdot \mathop{\max}\limits_{a} Q^\star(S_{t+1}, a)] \approx r_t + \gamma \cdot \mathop{\max}\limits_{a} Q^\star(s_{t+1}, a)
$$
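The sampled next action can be replaced by a max because evaluating $Q^\star$ at the greedy action yields exactly the row maximum. A tiny Python check with an assumed toy Q-row (values are illustrative):

```python
import numpy as np

# Assumed toy row Q*(s_{t+1}, .) over 4 actions
q_row = np.array([0.3, 1.2, 0.7, 0.5])

a_next = int(np.argmax(q_row))       # A_{t+1} = argmax_a Q*(s_{t+1}, a)
assert q_row[a_next] == q_row.max()  # Q*(s_{t+1}, A_{t+1}) == max_a Q*(s_{t+1}, a)
```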
2.2 Tabular Q-learning
| S \ A | $a_1$ | $a_2$ | $a_3$ | $a_4$ | … |
| --- | --- | --- | --- | --- | --- |
| $s_1$ (take the maximum Q in this row) | $Q_{11}$ | … | … | … | … |
| $s_2$ | … | … | … | … | … |
| $s_3$ | … | … | … | … | … |
| $s_4$ | … | … | … | … | … |
| … | … | … | … | … | … |
The update procedure is as follows (a minimal Python sketch follows this list):
- Observe a transition:
$$
(s_t, a_t, r_t, s_{t+1})
$$
- The TD target is:
$$
y_t = r_t + \gamma \cdot \mathop{\max}\limits_{a} Q^\star(s_{t+1}, a)
$$
- The TD error is:
$$
\delta_t = Q^\star(s_t, a_t) - y_t
$$
- Update the table entry:
$$
Q^\star(s_t, a_t) \gets Q^\star(s_t, a_t) - \alpha \cdot \delta_t
$$
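A minimal sketch of one tabular Q-learning update in Python; sizes and hyperparameters are illustrative assumptions. Compared with the Sarsa sketch above, the only change is that the bootstrap term uses the row maximum rather than a sampled next action:

```python
import numpy as np

# Hypothetical sizes and hyperparameters (assumptions for illustration)
n_states, n_actions = 5, 4
gamma, alpha = 0.9, 0.1
Q = np.zeros((n_states, n_actions))    # the table Q*(s, a)

def q_learning_update(s, a, r, s_next):
    """One tabular Q-learning step for the transition (s_t, a_t, r_t, s_{t+1})."""
    y = r + gamma * np.max(Q[s_next])  # TD target: bootstrap with the row maximum
    delta = Q[s, a] - y                # TD error
    Q[s, a] -= alpha * delta           # table update
```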
2.3 Q-learning with a Neural Network (DQN)
- Observe a transition (a minimal DQN update sketch follows this list):
$$
(s_t, a_t, r_t, s_{t+1})
$$
- The TD target is:
$$
y_t = r_t + \gamma \cdot \mathop{\max}\limits_{a} Q(s_{t+1}, a; W)
$$
- The TD error is:
$$
\delta_t = Q(s_t, a_t; W) - y_t
$$
- Parameter update (gradient descent):
$$
W \gets W - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; W)}{\partial W}
$$
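A minimal PyTorch sketch of one DQN-style update, assuming a small fully connected network; the network shape and hyperparameters are illustrative assumptions (a full DQN would also use experience replay and a target network, which are omitted here):

```python
import torch
import torch.nn as nn

# Hypothetical network and hyperparameters (assumptions for illustration)
state_dim, n_actions, gamma, lr = 4, 2, 0.9, 1e-3
dqn = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(dqn.parameters(), lr=lr)

def dqn_update(s, a, r, s_next):
    """One Q-learning update of Q(s, a; W) on the transition (s_t, a_t, r_t, s_{t+1})."""
    q_sa = dqn(s)[a]                          # Q(s_t, a_t; W)
    with torch.no_grad():                     # the TD target is treated as a constant
        y = r + gamma * dqn(s_next).max()     # y_t = r_t + gamma * max_a Q(s_{t+1}, a; W)
    loss = 0.5 * (q_sa - y) ** 2              # 1/2 * delta_t^2
    optimizer.zero_grad()
    loss.backward()                           # gradient: delta_t * dQ/dW
    optimizer.step()                          # W <- W - alpha * gradient

# Usage with dummy tensors
s, s_next = torch.randn(state_dim), torch.randn(state_dim)
dqn_update(s, a=0, r=1.0, s_next=s_next)
```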
3. Differences Between Sarsa and Q-learning
- Sarsa learns the action-value function of a given policy, $Q_\pi(s, a)$.
- The value network (critic) in Actor-Critic methods is trained with Sarsa.
- Q-learning learns the optimal action-value function, $Q^\star(s, a)$.
4. Multi-step TD Target
- A one-step TD target uses only a single reward, $r_t$.
- A multi-step TD target uses $m$ rewards: $r_t, r_{t+1}, \dots, r_{t+m-1}$.
4.1 Multi-step TD Target for Sarsa
$$
y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m Q_\pi(s_{t+m}, a_{t+m})
$$
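A minimal Python sketch of this target, assuming the $m$ rewards have already been collected and `q_next` denotes the bootstrap value $Q_\pi(s_{t+m}, a_{t+m})$ (the function name and values are illustrative):

```python
def multistep_sarsa_target(rewards, q_next, gamma=0.9):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * Q_pi(s_{t+m}, a_{t+m})."""
    m = len(rewards)
    return sum(gamma ** i * r for i, r in enumerate(rewards)) + gamma ** m * q_next

# Usage: m = 3 observed rewards plus an assumed bootstrap value of 0.5
print(multistep_sarsa_target([1.0, 0.0, 2.0], q_next=0.5))
```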
4.2 Multi-step TD Target for Q-learning
$$
y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m \mathop{\max}\limits_{a} Q^\star(s_{t+m}, a)
$$
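The Q-learning version differs only in the bootstrap term, which takes the maximum over the last row. A minimal sketch, assuming `q_row_next` holds the row $Q^\star(s_{t+m}, \cdot)$ from a table or network output (names and values are illustrative):

```python
import numpy as np

def multistep_q_learning_target(rewards, q_row_next, gamma=0.9):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * max_a Q*(s_{t+m}, a)."""
    m = len(rewards)
    bootstrap = float(np.max(q_row_next))
    return sum(gamma ** i * r for i, r in enumerate(rewards)) + gamma ** m * bootstrap

# Usage: 3 observed rewards plus an assumed Q-row at s_{t+m}
print(multistep_q_learning_target([1.0, 0.0, 2.0], q_row_next=np.array([0.2, 0.5, 0.1])))
```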
These notes were written while following a tutorial video series on Bilibili.
by CyrusMay 2022 04 08
At the corner between children and grown-ups
we build a castle
——————五月天(好好)——————