# Reinforcement Learning

## Related Concepts

### Model Representation

A policy is written $\pi(a \mid s)$. Policies are usually divided into stochastic and deterministic: a stochastic policy samples the action, $a \sim \pi(a \mid s)$, while a deterministic policy computes it directly, $a = \pi(s)$.

A trajectory is the sequence of states, actions, and rewards produced by one rollout:

$$\tau = s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots, s_T$$

The probability of a trajectory under the policy is

$$p(\tau) = p(s_0)\prod_{i=0}^{T-1}\pi(a_i \mid s_i)\, p(s_{i+1} \mid s_i, a_i)$$
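As a quick illustration, the trajectory probability above can be computed directly once the initial-state distribution, policy, and transition model are given as lookup tables. The dictionary layout below is just one possible encoding, not something from the original text:

```python
def trajectory_prob(p0, pi, p, states, actions):
    # p(tau) = p(s0) * prod_i pi(a_i|s_i) * p(s_{i+1}|s_i, a_i)
    prob = p0[states[0]]
    for i, a in enumerate(actions):
        prob *= pi[states[i]][a] * p[(states[i], a)][states[i + 1]]
    return prob

# Toy example: one transition, the policy picks action 0 half the time
p0 = {0: 1.0}
pi = {0: {0: 0.5}}
p = {(0, 0): {1: 1.0}}
print(trajectory_prob(p0, pi, p, states=[0, 1], actions=[0]))  # 1.0 * 0.5 * 1.0 = 0.5
```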

### Objective Function

The return $G_t$ can be defined in several ways. Looking only one step ahead:

$$G_t = r_{t+1}$$

Summing all future rewards:

$$G_t = r_{t+1}+r_{t+2}+\cdots+r_T$$

Or discounting future rewards by a factor $\gamma$:

$$G_t = r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+\cdots+\gamma^{T-t-1} r_T$$

When $\gamma = 0$ this reduces to the first case; when $\gamma = 1$ it reduces to the second; and with $0<\gamma<1$ we get a compromise between the two that can be tuned freely to fit the problem at hand.
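The discounted return can be computed with a single backward pass, using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$; a minimal sketch, not part of the original text:

```python
def discounted_return(rewards, gamma):
    # Backward recursion: start from G_T = 0, then G_t = r_{t+1} + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```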

We want to learn a parameterized policy $\pi_{\theta}(a \mid s)$ that maximizes the expected return, i.e. the agent should choose its sequence of actions so as to obtain as much average return as possible:

$$J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[G(\tau)] = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[G_0] = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t=0}^{T-1} \gamma^{t} r_{t+1}\right]$$

### Value Functions

The expected return can be rewritten by conditioning on the initial state:

$$\begin{aligned} \mathbb{E}_{\tau \sim p(\tau)}[G(\tau)] &=\mathbb{E}_{s \sim p(s_{0})}\left[\mathbb{E}_{\tau \sim p(\tau)}\left[\sum_{t=0}^{T-1} \gamma^{t} r_{t+1} \mid \tau_{s_{0}}=s\right]\right] \\ &=\mathbb{E}_{s \sim p(s_{0})}\left[V^{\pi}(s)\right] \end{aligned}$$

$V^{\pi}(s)$ is called the state-value function: the expected return given that the initial state is $s$,

$$V^{\pi}(s) = \mathbb{E}_{\tau \sim p(\tau)}\left[\sum_{t=0}^{T-1} \gamma^{t} r_{t+1} \mid \tau_{s_{0}}=s\right]$$

Expanding the first step of the trajectory gives

$$\begin{aligned} V^{\pi}(s)&=\mathbb{E}_{\tau_{0:T} \sim p}\left[r_{1}+\gamma \sum_{t=1}^{T-1} \gamma^{t-1} r_{t+1} \mid \tau_{s_{0}}=s\right] \\ &=\mathbb{E}_{a \sim \pi(a \mid s)} \mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)} \mathbb{E}_{\tau_{1:T} \sim p(\tau)}\left[r(s, a, s^{\prime})+\gamma \sum_{t=1}^{T-1} \gamma^{t-1} r_{t+1} \mid \tau_{s_{1}}=s^{\prime}\right] \\ &=\mathbb{E}_{a \sim \pi(a \mid s)} \mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)}\left[r(s, a, s^{\prime})+\gamma\, \mathbb{E}_{\tau_{1:T} \sim p(\tau)}\left[\sum_{t=1}^{T-1} \gamma^{t-1} r_{t+1} \mid \tau_{s_{1}}=s^{\prime}\right]\right] \\ &=\mathbb{E}_{a \sim \pi(a \mid s)} \mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)}\left[r(s, a, s^{\prime})+\gamma V^{\pi}(s^{\prime})\right]. \end{aligned}$$

This is a recursive relation for $V^{\pi}(s)$, known as the Bellman equation. Starting from arbitrary values of $V^{\pi}(s)$ and iterating repeatedly, the estimates eventually converge to the correct state-value function; we will set the proof aside for now.

Before we can compute $V^{\pi}(s)$ we need to know the policy $\pi(a \mid s)$ — but, inconveniently, the policy is exactly what reinforcement learning is trying to learn. The optimal policy $\pi$ satisfies

$$\pi = \arg\max_{\pi} V^{\pi}(s)$$

So we need to update $\pi$ — but how? One way is to parameterize $\pi$ as $\pi_{\theta}(a \mid s)$ and optimize $\theta$ with gradient methods on the expected return. The other way is what we discuss next.

This motivates the action-value function:

$$Q^{\pi}(s,a) = \mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)}\left[r(s, a, s^{\prime})+\gamma V^{\pi}(s^{\prime})\right]$$

$Q^{\pi}(s,a)$ is the expected return given both the initial state and the initial action. With this definition, the self-recursion of $V^{\pi}(s)$ turns into a mutual recursion between $V^{\pi}(s)$ and $Q^{\pi}(s,a)$, because

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(a \mid s)}\, Q^{\pi}(s,a)$$

Consider an action $a^*$ with $Q^{\pi}(s,a^*) = \max_a Q^{\pi}(s,a)$: executing $a^*$ in state $s$ yields the highest return, so we can naturally raise $\pi(a^* \mid s)$ to improve the policy.

### Optimization Algorithms


- Reinforce
- Reinforce-With-Baseline
- Sarsa
- Q-Learning

### Dynamic Programming

Recall the two halves of the mutual recursion:

$$Q^{\pi}(s,a) = \mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)}\left[r(s, a, s^{\prime})+\gamma V^{\pi}(s^{\prime})\right] \\ V^{\pi}(s) = \mathbb{E}_{a \sim \pi(a \mid s)}\, Q^{\pi}(s,a)$$

Previously the optimal $\pi$ could only be obtained through repeated iteration, but now that the value functions have been introduced we have more choices about what to iterate, because for a deterministic policy

$$\pi(s) = \arg\max_a Q(s,a)$$

#### Value Iteration

Substituting the greedy policy into the recursion gives the value-iteration form:

$$\begin{aligned} Q^{\pi}(s,a) &= \mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)}\left[r(s, a, s^{\prime})+\gamma V^{\pi}(s^{\prime})\right] \\ &=\mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)}\left[r(s, a, s^{\prime})+\gamma \max_{a^{\prime}} Q^{\pi}(s^{\prime},a^{\prime})\right] \end{aligned}$$

#### Code Walkthrough

```python
import gym
import numpy as np

# Create the environment. The functions below index policy[obs] and use
# env.observation_space.n, so they require a discrete-state environment
# such as FrozenLake rather than CartPole.
env = gym.make('FrozenLake-v0')

# policy is a 1-D array holding the action for each state
def run_episode(env, policy, gamma=1.0, render=False):
    obs = env.reset()
    total_reward = 0
    step_idx = 0
    while True:
        # Render the game screen
        if render:
            env.render()
        obs, reward, done, _ = env.step(int(policy[obs]))
        # Accumulate the discounted return
        total_reward += (gamma ** step_idx * reward)
        step_idx += 1
        if done:
            break
    return total_reward

# Overall policy-iteration loop
def policy_iteration(env, gamma=1.0):
    # Initialize a random policy: assign an action to every state
    # (deterministic policy, discrete states)
    policy = np.random.choice(env.action_space.n, size=env.observation_space.n)
    max_iterations = 20000
    for i in range(max_iterations):
        # Policy evaluation
        old_policy_v = compute_policy_v(env, policy, gamma)
        # Policy improvement
        new_policy = extract_policy(env, old_policy_v, gamma)
        if np.all(policy == new_policy):  # check whether the policy has converged
            print('Policy-Iteration converged at iteration %d.' % (i + 1))
            break
        # If not converged, keep alternating evaluation and improvement
        policy = new_policy
    return policy

# Compute the value function of the current policy
def compute_policy_v(env, policy, gamma=1.0):
    # Initialize the value function to zero
    v = np.zeros(env.observation_space.n)
    eps = 1e-10
    while True:
        prev_v = np.copy(v)
        # Sweep over every state
        for s in range(env.observation_space.n):
            a = policy[s]
            v[s] = sum(p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.env.P[s][a])
        # If the change in V falls below the threshold, it has converged: return v_pi
        if np.sum(np.fabs(prev_v - v)) <= eps:
            print("Value function converged.")
            break
    return v
```

Using

$$Q^{\pi}(s,a) = \mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)}\left[r(s, a, s^{\prime})+\gamma V^{\pi}(s^{\prime})\right]$$

we obtain $Q^{\pi}(s,a)$ from $V^{\pi}(s)$, and then derive the improved policy from $Q^{\pi}(s,a)$:

```python
# Main step 2: policy improvement
def extract_policy(env, v, gamma=1.0):
    # Initialize the policy
    policy = np.zeros(env.observation_space.n)
    for s in range(env.observation_space.n):
        q_sa = np.zeros(env.action_space.n)
        # Evaluate the return of every action in the current state
        for a in range(env.action_space.n):
            # Compute the action-value q
            q_sa[a] = sum(p * (r + gamma * v[s_]) for p, s_, r, _ in env.env.P[s][a])
        # The improved policy is greedy with respect to q, hence the max operation;
        # there may be several maximizers, so pick one uniformly at random
        policy[s] = np.random.choice(np.argwhere(q_sa == np.max(q_sa)).flatten())
    # By the policy improvement theorem, this policy is at least as good as the last one
    return policy
```
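The value-iteration equation from the previous subsection can be coded in the same style as the policy-iteration functions above. This is a sketch of ours, not part of the original code; it assumes the same discrete `env.env.P[s][a]` transition table:

```python
import numpy as np

# Value iteration: V(s) <- max_a sum_{s'} p(s'|s,a) * (r + gamma * V(s'))
def value_iteration(env, gamma=1.0, max_iterations=10000, eps=1e-10):
    v = np.zeros(env.observation_space.n)
    for i in range(max_iterations):
        prev_v = np.copy(v)
        for s in range(env.observation_space.n):
            # Back up every action and keep the best one
            q_sa = [sum(p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.env.P[s][a])
                    for a in range(env.action_space.n)]
            v[s] = max(q_sa)
        if np.sum(np.fabs(prev_v - v)) <= eps:
            break
    return v
```

The resulting `v` can then be passed to `extract_policy` to read off the greedy policy.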

### Monte Carlo

Monte Carlo estimates the Q function by sampling: starting from $(s,a)$, roll out $N$ trajectories under the policy and average their returns:

$$Q(s,a) = \hat{Q}(s,a) = \frac{1}{N}\sum_{i=1}^{N}G(\tau_i \mid s_0=s, a_0=a)$$

#### Incremental Monte Carlo

$$\begin{aligned} \hat{Q}_{N}^{\pi}(s, a) &=\frac{1}{N} \sum_{n=1}^{N} G\left(\tau_{s_{0}=s, a_{0}=a}^{(n)}\right) \\ &=\frac{1}{N}\left(G\left(\tau_{s_{0}=s, a_{0}=a}^{(N)}\right)+\sum_{n=1}^{N-1} G\left(\tau_{s_{0}=s, a_{0}=a}^{(n)}\right)\right) \\ &=\frac{1}{N}\left(G\left(\tau_{s_{0}=s, a_{0}=a}^{(N)}\right)+(N-1)\, \hat{Q}_{N-1}^{\pi}(s, a)\right) \\ &=\hat{Q}_{N-1}^{\pi}(s, a)+\frac{1}{N}\left(G\left(\tau_{s_{0}=s, a_{0}=a}^{(N)}\right)-\hat{Q}_{N-1}^{\pi}(s, a)\right). \end{aligned}$$

Replacing $\frac{1}{N}$ with a new hyperparameter $\alpha$ gives

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\left(G\left(\tau_{s_{0}=s, a_{0}=a}^{(N)}\right)-\hat{Q}_{N-1}^{\pi}(s, a)\right)$$
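The incremental update above is just a running average that moves the estimate a step toward each new sampled return; a minimal sketch, not from the original text:

```python
def incremental_update(q, g, alpha):
    # Q <- Q + alpha * (G - Q): move the estimate a step toward the new sample
    return q + alpha * (g - q)

q = 0.0
for g in [10.0, 10.0, 10.0]:
    q = incremental_update(q, g, alpha=0.5)
print(q)  # 8.75: the estimate approaches the sampled return 10
```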

#### Monte Carlo + Policy Iteration

- Randomly initialize the policy
- Loop:
  - Policy evaluation:
    - For every $s$ and every $a$:
      - Non-incremental form: take $s,a$ as the initial state and action, sample $N$ trajectories under the policy, and use the average return as $Q(s,a)$
      - Incremental form:
        - Randomly initialize the Q values
        - Loop until convergence:
          - Take $s,a$ as the initial state and action and sample one trajectory under the policy
          - Update $Q_{t+1}(s,a) = Q_t(s,a) + \alpha\left(G\left(\tau_{s_{0}=s, a_{0}=a}^{(N)}\right)-\hat{Q}_{N-1}^{\pi}(s, a)\right)$
  - Policy improvement:
    - Update $\pi$ according to $\pi = \arg\max_a Q(s,a)$
- Until convergence

#### Monte Carlo + Value Iteration

- Randomly initialize the Q values
- Loop:
  - For every $s$ and every $a$:
    - Take $s,a$ as the initial state and action and sample one trajectory greedily with respect to the value function $\left(a = \arg\max_a Q(s,a)\right)$
    - Only the incremental form applies: $Q_{t+1}(s,a) = Q_t(s,a) + \alpha\left(G\left(\tau_{s_{0}=s, a_{0}=a}\right)-\hat{Q}_{t}^{\pi}(s, a)\right)$
- Until $Q$ converges

### Temporal-Difference Methods

$$\begin{aligned} Q_{t+1}^{\pi}(s,a) &= Q_{t}^{\pi}(s,a) + \alpha\left(G\left(\tau_{s_{0}=s, a_{0}=a}\right)-Q_{t}^{\pi}(s, a)\right) \\ &=Q_{t}^{\pi}(s,a) + \alpha\left(r+\gamma\, G\left(\tau_{s_{0}=s, a_{0}=a, s_1=s^{\prime}, a_1=a^{\prime}}\right)-Q_{t}^{\pi}(s, a)\right) \\ &= Q_{t}^{\pi}(s,a)+\alpha\left(r+\gamma\, Q_{t}^{\pi}(s^{\prime},a^{\prime})-Q_{t}^{\pi}(s, a)\right) \end{aligned}$$

That is,

$$Q_{t+1}^{\pi}(s,a) = Q_{t}^{\pi}(s,a)+\alpha\left(r+\gamma\, Q_{t}^{\pi}(s^{\prime},a^{\prime})-Q_{t}^{\pi}(s, a)\right)$$

Here $Q_{t+1}^{\pi}(s,a)$ and $Q_{t}^{\pi}(s,a)$ are the value estimates after and before the update, and the data involved covers only a single state transition, $s,a,r,s^{\prime},a^{\prime}$. Note, however, that the right-hand side of the update involves not only $Q(s,a)$ itself but also $Q(s^{\prime},a^{\prime})$, so we can expect the iteration to be unstable; improvements targeting this will come later.

In Bellman-equation form this is

$$Q^{\pi}(s,a) = \mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)}\left[r(s, a, s^{\prime})+\gamma Q^{\pi}(s^{\prime},a^{\prime})\right],\quad a^{\prime} = \pi(s^{\prime})$$

Estimating this expectation only requires samples of $s,a,r,s^{\prime}$; we call such a group of data a transition. Then

$$\begin{aligned} Q^{\pi}(s,a) &= \frac{1}{N}\sum_{n=1}^{N} r(s, a, s^{\prime})+\gamma Q^{\pi}(s^{\prime},a^{\prime}) \\ &= Q^{\pi}(s,a)+\frac{1}{N}\left(r(s, a, s^{\prime})+\gamma Q^{\pi}(s^{\prime},a^{\prime})-Q^{\pi}(s,a)\right) \end{aligned}$$

After replacing $\frac{1}{N}$ with $\alpha$:

$$Q_{t+1}^{\pi}(s,a) = Q_{t}^{\pi}(s,a)+\alpha\left(r+\gamma\, Q_{t}^{\pi}(s^{\prime},a^{\prime})-Q_{t}^{\pi}(s, a)\right)$$

#### Sarsa

(Please ignore the $\epsilon$ that appears in the algorithm for now.)
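The Sarsa update is the TD rule above applied to the transition $(s,a,r,s^{\prime},a^{\prime})$ actually experienced; a minimal tabular sketch (the array layout and toy numbers are our own, not from the original):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy TD target uses the action a' actually sampled from the policy
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

Q = np.zeros((3, 2))
Q[1, 0] = 4.0
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
print(Q[0, 1])  # 0 + 0.5 * (1 + 0.9*4 - 0) = 2.3
```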

#### Q Learning

Q Learning = Monte Carlo + value iteration + temporal difference

$$Q_{t+1}^{\pi}(s,a) = Q_{t}^{\pi}(s,a)+\alpha\left(r+\gamma \max_{a^{\prime}} Q_{t}^{\pi}(s^{\prime},a^{\prime})-Q_{t}^{\pi}(s, a)\right)$$
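Relative to Sarsa, the only change is taking the max over next actions in the target; a minimal tabular sketch (layout and numbers are our own):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy TD target: max over actions in s', regardless of the behavior policy
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

Q = np.zeros((3, 2))
Q[1] = [4.0, 7.0]
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0, 1])  # 0 + 0.5 * (1 + 0.9*7 - 0) = 3.65
```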

This is the value-iteration form of the Bellman equation:

$$\begin{aligned} Q^{\pi}(s,a) &= \mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)}\left[r(s, a, s^{\prime})+\gamma V^{\pi}(s^{\prime})\right] \\ &=\mathbb{E}_{s^{\prime} \sim p(s^{\prime} \mid s, a)}\left[r(s, a, s^{\prime})+\gamma \max_{a^{\prime}} Q^{\pi}(s^{\prime},a^{\prime})\right] \end{aligned}$$

### Supplement

Let's now fill in the $\epsilon$ we glossed over earlier. The $\epsilon$-greedy method is as follows:
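A minimal sketch of $\epsilon$-greedy action selection (the function name is our own choice):

```python
import random

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon explore uniformly at random; otherwise act greedily
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 1: greedy action when epsilon = 0
```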

#### On-Policy vs. Off-Policy

SARSA samples actions with the $\pi^{\epsilon}(s)$ policy, and that same policy is what gets updated during training, so SARSA is an on-policy algorithm.

Q Learning also samples with $\pi^{\epsilon}(s)$, but it never updates $\pi^{\epsilon}(s)$ during training; instead it forms a temporary greedy policy for computing the Q targets, so Q Learning is an off-policy algorithm.

### Policy Gradient

Parameterize $\pi$ as $\pi_{\theta}(a \mid s)$; computing the gradient of the expected return with respect to $\theta$ and optimizing $\theta$ with gradient methods is called the policy-gradient approach.

The gradient of the objective can be turned into an expectation via the log-derivative trick:

$$\begin{aligned} \frac{\partial \mathcal{J}(\theta)}{\partial \theta} &=\frac{\partial}{\partial \theta} \int p_{\theta}(\tau)\, G(\tau)\, \mathrm{d} \tau \\ &=\int\left(\frac{\partial}{\partial \theta} p_{\theta}(\tau)\right) G(\tau)\, \mathrm{d} \tau \\ &=\int p_{\theta}(\tau)\left(\frac{1}{p_{\theta}(\tau)} \frac{\partial}{\partial \theta} p_{\theta}(\tau)\right) G(\tau)\, \mathrm{d} \tau \\ &=\int p_{\theta}(\tau)\left(\frac{\partial}{\partial \theta} \log p_{\theta}(\tau)\right) G(\tau)\, \mathrm{d} \tau \\ &=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\frac{\partial}{\partial \theta} \log p_{\theta}(\tau)\, G(\tau)\right], \end{aligned}$$

where the transition probabilities drop out of the log-gradient:

$$\begin{aligned} \frac{\partial}{\partial \theta} \log p_{\theta}(\tau)&=\frac{\partial}{\partial \theta} \log \left(p(s_{0}) \prod_{t=0}^{T-1} \pi_{\theta}(a_{t} \mid s_{t})\, p(s_{t+1} \mid s_{t}, a_{t})\right) \\ &=\frac{\partial}{\partial \theta}\left(\log p(s_{0})+\sum_{t=0}^{T-1} \log \pi_{\theta}(a_{t} \mid s_{t})+\sum_{t=0}^{T-1} \log p(s_{t+1} \mid s_{t}, a_{t})\right) \\ &=\sum_{t=0}^{T-1} \frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t}) \end{aligned}$$

so

$$\begin{aligned} \frac{\partial \mathcal{J}(\theta)}{\partial \theta} &=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\left(\sum_{t=0}^{T-1} \frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\right) G(\tau)\right] \\ &=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\left(\sum_{t=0}^{T-1} \frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\right)\left(G(\tau_{0:t})+\gamma^{t} G(\tau_{t:T})\right)\right] \\ &=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t=0}^{T-1}\left(\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t} G(\tau_{t:T})\right)\right] \end{aligned}$$

Sampling $N$ trajectories gives the Reinforce estimator:

$$\frac{\partial \mathcal{J}(\theta)}{\partial \theta} \approx \frac{1}{N} \sum_{n=1}^{N}\left(\sum_{t=0}^{T-1} \frac{\partial}{\partial \theta} \log \pi_{\theta}\left(a_{t}^{(n)} \mid s_{t}^{(n)}\right) \gamma^{t} G_{\tau_{t:T}^{(n)}}\right)$$
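For a tabular softmax policy the estimator above can be written out explicitly. This sketch (function names and the logit-table layout are our own assumptions) averages $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \gamma^t G(\tau_{t:T})$ over sampled episodes:

```python
import numpy as np

def grad_log_softmax(theta, s, a):
    # d/dtheta log pi_theta(a|s) for a tabular softmax policy with logits theta[s, :]
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    g = np.zeros_like(theta)
    g[s] = -probs
    g[s, a] += 1.0
    return g

def reinforce_gradient(theta, episodes, gamma):
    # Each episode is (states, actions, rewards); average the per-episode gradients
    total = np.zeros_like(theta)
    for states, actions, rewards in episodes:
        T = len(rewards)
        returns = [0.0] * T
        g_future = 0.0
        for t in reversed(range(T)):  # backward pass computes G(tau_{t:T})
            g_future = rewards[t] + gamma * g_future
            returns[t] = g_future
        for t in range(T):
            total += grad_log_softmax(theta, states[t], actions[t]) * (gamma ** t) * returns[t]
    return total / len(episodes)
```

With uniform logits and a single one-step episode rewarding action 1, the gradient pushes probability toward that action.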

#### Reinforce With Baseline

A drawback of Reinforce is that the returns of different trajectories have high variance, which makes training unstable; this is the usual failure mode of Monte Carlo methods in high-dimensional spaces. Note the emphasis on high-dimensional: the Sarsa and Q Learning methods introduced earlier also rely on Monte Carlo sampling, but they sample only a single transition at a time, so the problem is minor; Reinforce samples an entire trajectory, so the sampling space becomes very high-dimensional and stability suffers.

Subtracting a state-dependent baseline $V_{\phi}(s_t)$ reduces this variance:

$$\frac{\partial \mathcal{J}(\theta)}{\partial \theta} \approx \sum_{t=0}^{T-1} \frac{\partial}{\partial \theta} \log \pi_{\theta}\left(a_{t}^{(n)} \mid s_{t}^{(n)}\right) \gamma^{t}\left(G_{\tau_{t:T}^{(n)}}-V_{\phi}(s_t)\right)$$

To see why subtracting a baseline is legitimate, split the gradient by time step:

$$\begin{aligned} \frac{\partial \mathcal{J}(\theta)}{\partial \theta} &=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t=0}^{T-1}\left(\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t} G(\tau_{t:T})\right)\right] \\ &=\sum_{t=0}^{T-1}\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t} G(\tau_{t:T})\right] \\ &=\sum_{t=0}^{T-1}\mathbb{E}_{s_t}\mathbb{E}_{a_t}\left[\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t} G(\tau_{t:T})\right] \\ &=\sum_{t=0}^{T-1}\frac{\partial \mathcal{J}_{t}(\theta)}{\partial \theta} \end{aligned}$$

Introduce an arbitrary baseline $b(s_t)$ into each term:

$$\begin{aligned} \frac{\partial \mathcal{J}_{t}(\theta)}{\partial \theta} &=\mathbb{E}_{s_t}\mathbb{E}_{a_t}\left[\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t}\left(G(\tau_{t:T})-b(s_t)\right)\right] \\ &=\mathbb{E}_{s_t}\mathbb{E}_{a_t}\left[\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t} G(\tau_{t:T})\right]-\mathbb{E}_{s_t}\mathbb{E}_{a_t}\left[\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t} b(s_t)\right] \end{aligned}$$

The baseline term vanishes, because

$$\begin{aligned} \mathbb{E}_{a_t}\left[\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t} b(s_t)\right] &= \gamma^{t} b(s_t)\int \pi_{\theta}(a_t \mid s_t)\,\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \mathrm{d}a_t \\ &=\gamma^{t} b(s_t)\int \frac{\partial}{\partial \theta} \pi_{\theta}(a_{t} \mid s_{t})\, \mathrm{d}a_t \\ &=\gamma^{t} b(s_t)\,\frac{\partial}{\partial \theta}\int \pi_{\theta}(a_{t} \mid s_{t})\, \mathrm{d}a_t \\ &=\gamma^{t} b(s_t)\,\frac{\partial}{\partial \theta} 1 = 0 \end{aligned}$$

so subtracting the baseline leaves the gradient unbiased:

$$\begin{aligned} \frac{\partial \mathcal{J}_{t}(\theta)}{\partial \theta} &=\mathbb{E}_{s_t}\mathbb{E}_{a_t}\left[\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t} G(\tau_{t:T})\right]-0 \\ &=\mathbb{E}_{s_t}\mathbb{E}_{a_t}\left[\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t} G(\tau_{t:T})\right] \end{aligned}$$

Does subtracting the baseline actually reduce variance?

$$\begin{aligned} \operatorname{Var}\left(\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t}\left(G(\tau_{t:T})-b(s_t)\right)\right) = {}& \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t}\left(G(\tau_{t:T})-b(s_t)\right)\right)^2\right] \\ &-\left(\mathbb{E}\left[\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t}\left(G(\tau_{t:T})-b(s_t)\right)\right]\right)^2 \end{aligned}$$

Since the baseline does not change the expectation, the variance shrinks exactly when the second moment shrinks, i.e. when

$$\mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t}\right)^2\left(G(\tau_{t:T})-b(s_t)\right)^2\right] \overset{?}{<} \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \gamma^{t}\right)^2 G(\tau_{t:T})^2\right]$$

If $\left(G(\tau_{t:T})-b(s_t)\right)^2 < G(\tau_{t:T})^2$, the variance is reduced. So $b(s_t)$ cannot be chosen arbitrarily: we want $b(s_t)$ to be as close to $G$ as possible. In practice the choice is $V^{\pi}(s_t)$, which is precisely the expectation of $G_t$ and is therefore naturally close to it.

A network with parameters $\phi$ is used to estimate $V^{\pi}(s)$; during training we optimize not only the policy parameters $\theta$ but also $\phi$. Don't be put off by the different gradient updates for $\theta$ and $\phi$: the objective function is the same, it simply yields different results when differentiated with respect to different parameters.

THE END