Reinforcement Learning: TD Algorithms (Sarsa and Q-learning)


1. The Sarsa Algorithm

1.1 TD Target

  • The return (discounted cumulative reward) is defined as:

    $$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$$
    $$U_t = R_t + \gamma\,(R_{t+1} + \gamma R_{t+2} + \cdots)$$
    $$U_t = R_t + \gamma\, U_{t+1}$$

  • Assume the reward at time t depends on the state and action at time t and on the state at time t+1:

    $$R_t \leftarrow (S_t, A_t, S_{t+1})$$

  • The action-value function can then be written as:

    $$Q_\pi(s_t,a_t)=\mathbb{E}[\,U_t \mid s_t,a_t\,]$$
    $$Q_\pi(s_t,a_t)=\mathbb{E}[\,R_t+\gamma U_{t+1} \mid s_t,a_t\,]$$
    $$Q_\pi(s_t,a_t)=\mathbb{E}[\,R_t \mid s_t,a_t\,]+\gamma\,\mathbb{E}[\,U_{t+1} \mid s_t,a_t\,]$$
    $$Q_\pi(s_t,a_t)=\mathbb{E}[\,R_t \mid s_t,a_t\,]+\gamma\,\mathbb{E}[\,Q_\pi(S_{t+1},A_{t+1}) \mid s_t,a_t\,]$$
    $$Q_\pi(s_t,a_t)=\mathbb{E}[\,R_t+\gamma\, Q_\pi(S_{t+1},A_{t+1})\,]$$

    (The fourth line uses the tower property of expectation: $\mathbb{E}[U_{t+1}\mid s_t,a_t]=\mathbb{E}\big[\mathbb{E}[U_{t+1}\mid S_{t+1},A_{t+1}]\mid s_t,a_t\big]=\mathbb{E}[Q_\pi(S_{t+1},A_{t+1})\mid s_t,a_t]$.)

  • Apply a Monte Carlo approximation: replace the expectation with one observed transition and the current estimate, giving the TD target

    $$y_t = r_t + \gamma\, Q_\pi(s_{t+1},a_{t+1})$$

  • The goal of TD learning is to make the estimate match this target (a tiny numeric example follows):

    $$y_t \approx Q_\pi(s_t,a_t)$$
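
To make this concrete, here is a minimal Python sketch with made-up numbers (the discount factor, reward, and value estimates below are assumptions for illustration only):

```python
# Made-up numbers purely for illustration.
gamma = 0.9     # discount factor
r_t = 1.0       # observed reward r_t
q_next = 2.0    # current estimate of Q_pi(s_{t+1}, a_{t+1})
q_now = 3.5     # current estimate of Q_pi(s_t, a_t)

y_t = r_t + gamma * q_next   # TD target: 1.0 + 0.9 * 2.0 = 2.8
delta_t = q_now - y_t        # TD error: 3.5 - 2.8 = 0.7 (up to floating-point rounding)
```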

1.2 Tabular Sarsa

  • Learn the action-value function $Q_\pi(s,a)$.

  • Assume the numbers of states and actions are both finite.
  • Then we only need to learn the entries of the following table:

| S \ A | $a_1$ | $a_2$ | $a_3$ | $a_4$ |
| --- | --- | --- | --- | --- |
| $s_1$ | $Q_{11}$ | | | |
| $s_2$ | | | | |
| $s_3$ | | | | |
| $s_4$ | | | | |

The update proceeds in the following steps (a short code sketch of these steps is given after the list):

  1. Observe a transition:

    $$(s_t, a_t, r_t, s_{t+1})$$

  2. Sample the next action from the policy:

    $$a_{t+1} \sim \pi(\cdot \mid s_{t+1})$$

  3. Look up the table to form the TD target:

    $$y_t = r_t + \gamma\, Q_\pi(s_{t+1},a_{t+1})$$

  4. The TD error is:

    $$\delta_t = Q_\pi(s_t,a_t) - y_t$$

  5. Update the table entry:

    $$Q_\pi(s_t,a_t) \gets Q_\pi(s_t,a_t) - \alpha \cdot \delta_t$$
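
A minimal Python sketch of steps 1-5 (as an illustration only: the Q-table is assumed to be a NumPy array indexed by integer state and action IDs, and the transition values below are made up):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular Sarsa update; Q has shape (n_states, n_actions)."""
    y = r + gamma * Q[s_next, a_next]    # step 3: TD target from the table
    delta = Q[s, a] - y                  # step 4: TD error
    Q[s, a] = Q[s, a] - alpha * delta    # step 5: update the table entry
    return delta

# Example: a 4x4 table and one made-up transition (s, a, r, s', a').
Q = np.zeros((4, 4))
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=3)
```

Here the next action `a_next` is assumed to have been sampled from the current policy (step 2), which is what makes the update on-policy.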

1.3 Sarsa with a Neural Network

  • Approximate the action-value function with a neural network:

    $$q(s,a;W) \approx Q_\pi(s,a)$$

  • The network acts as a critic that scores how good each action is.
  • Its parameters W are what we learn.
  • The TD target is:

    $$y_t = r_t + \gamma \cdot q(s_{t+1},a_{t+1};W)$$

  • The TD error is:

    $$\delta_t = q(s_t,a_t;W) - y_t$$

  • The loss is:

    $$\tfrac{1}{2}\,\delta_t^2$$

  • Its gradient with respect to W is:

    $$\delta_t \cdot \frac{\partial\, q(s_t,a_t;W)}{\partial W}$$

  • Perform gradient descent (a small sketch of this update follows the list):

    $$W \gets W - \alpha \cdot \delta_t \cdot \frac{\partial\, q(s_t,a_t;W)}{\partial W}$$
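
As a hedged sketch of this update rule, the network can be stood in for by a linear model $q(s,a;W)=W^\top\phi(s,a)$, whose gradient with respect to W is exactly $\phi(s,a)$; the feature map `phi` and all sizes below are assumptions for illustration:

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 4

def phi(s, a):
    """Hypothetical one-hot feature map for a discrete state-action pair."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def sarsa_grad_step(w, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
    """One Sarsa gradient step for q(s, a; w) = w . phi(s, a)."""
    y = r + gamma * (w @ phi(s_next, a_next))   # TD target
    delta = (w @ phi(s, a)) - y                 # TD error
    w = w - alpha * delta * phi(s, a)           # gradient of q w.r.t. w is phi(s, a)
    return w, delta

w = np.zeros(N_STATES * N_ACTIONS)
w, delta = sarsa_grad_step(w, s=0, a=1, r=1.0, s_next=2, a_next=3)
```

With one-hot features this reduces to the tabular update above; a real implementation would replace `phi` and the dot product with a neural network and compute the gradient by backpropagation.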

2. The Q-learning Algorithm

Q-learning learns the optimal action-value function $Q^\star(s,a)$.

2.1 TD Target

For any policy $\pi$, the identity derived in Section 1.1 holds:

$$Q_\pi(s_t,a_t) = \mathbb{E}[\,R_t + \gamma \cdot Q_\pi(S_{t+1},A_{t+1})\,]$$

Denote the optimal policy by $\pi^\star$.

Then:

$$Q^\star(s_t,a_t) = Q_{\pi^\star}(s_t,a_t) = \mathbb{E}[\,R_t + \gamma \cdot Q_{\pi^\star}(S_{t+1},A_{t+1})\,]$$

The action at time t+1 is chosen greedily:

$$A_{t+1} = \mathop{\arg\max}\limits_{a} Q^\star(s_{t+1},a)$$

Since the greedy action satisfies $Q_{\pi^\star}(S_{t+1},A_{t+1}) = \max_{a} Q^\star(S_{t+1},a)$, the optimal action-value function satisfies the identity below, which is then approximated with one observed transition:

$$Q^\star(s_t,a_t) = \mathbb{E}\big[\,R_t + \gamma \cdot \max_{a} Q^\star(S_{t+1},a)\,\big] \approx r_t + \gamma \cdot \max_{a} Q^\star(s_{t+1},a)$$
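
For intuition, here is a tiny example of the greedy action and the bootstrap term; the Q-values, reward, and discount below are made-up numbers:

```python
import numpy as np

q_next_row = np.array([1.0, 3.0, 2.0, 0.5])   # hypothetical Q*(s_{t+1}, a) for 4 actions
a_next = int(np.argmax(q_next_row))           # greedy action A_{t+1} (index 1 here)
y_t = 0.0 + 0.9 * q_next_row.max()            # assumed r_t = 0.0, gamma = 0.9 -> y_t = 2.7
```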

2.2 Tabular Q-learning

| S \ A | $a_1$ | $a_2$ | $a_3$ | $a_4$ |
| --- | --- | --- | --- | --- |
| $s_1$ (take the maximum Q over this row) | $Q_{11}$ | | | |
| $s_2$ | | | | |
| $s_3$ | | | | |
| $s_4$ | | | | |

The update proceeds in the following steps (a code sketch of these steps is given after the list):

  1. Observe a transition:

    $$(s_t, a_t, r_t, s_{t+1})$$

  2. The TD target is:

    $$y_t = r_t + \gamma \cdot \max_{a} Q^\star(s_{t+1},a)$$

  3. The TD error is:

    $$\delta_t = Q^\star(s_t,a_t) - y_t$$

  4. Update the table entry:

    $$Q^\star(s_t,a_t) \gets Q^\star(s_t,a_t) - \alpha \cdot \delta_t$$
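
A minimal sketch of steps 1-4 under the same assumptions as the Sarsa table above (a NumPy Q-table indexed by integer state and action IDs; the transition values are made up):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update; Q has shape (n_states, n_actions)."""
    y = r + gamma * np.max(Q[s_next])    # step 2: TD target uses the row maximum
    delta = Q[s, a] - y                  # step 3: TD error
    Q[s, a] = Q[s, a] - alpha * delta    # step 4: update the table entry
    return delta

Q = np.zeros((4, 4))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Unlike Sarsa, no next action has to be sampled from the policy to form the target, which is what makes Q-learning off-policy.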

2.3 Q-learning with a Neural Network (DQN)

Approximate $Q^\star(s,a)$ with a network $Q(s,a;W)$; the steps mirror the tabular case (a code sketch follows the list):

  1. Observe a transition:

    $$(s_t, a_t, r_t, s_{t+1})$$

  2. The TD target is:

    $$y_t = r_t + \gamma \cdot \max_{a} Q(s_{t+1},a;W)$$

  3. The TD error is:

    $$\delta_t = Q(s_t,a_t;W) - y_t$$

  4. Update the parameters by gradient descent:

    $$W \gets W - \alpha \cdot \delta_t \cdot \frac{\partial\, Q(s_t,a_t;W)}{\partial W}$$
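
A minimal PyTorch-style sketch of these four steps on a single transition (the two-layer network, SGD optimizer, and tensor shapes are assumptions; the original text does not fix an architecture):

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_step(qnet, optimizer, s, a, r, s_next, gamma=0.9):
    """One DQN update on a single transition (s, a, r, s_next)."""
    with torch.no_grad():                       # the TD target is treated as a constant
        y = r + gamma * qnet(s_next).max(dim=-1).values
    q_sa = qnet(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)
    loss = 0.5 * (q_sa - y).pow(2).mean()       # loss = 1/2 * delta^2
    optimizer.zero_grad()
    loss.backward()                             # gradient of the loss w.r.t. W
    optimizer.step()                            # W <- W - alpha * gradient
    return loss.item()

qnet = QNet(state_dim=4, n_actions=2)
optimizer = torch.optim.SGD(qnet.parameters(), lr=1e-2)
s, s_next = torch.randn(1, 4), torch.randn(1, 4)
a, r = torch.tensor([1]), torch.tensor([1.0])
dqn_step(qnet, optimizer, s, a, r, s_next)
```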

3. Differences Between Sarsa and Q-learning

  1. Sarsa learns the action-value function of a given policy: $Q_\pi(s,a)$.

  2. The value network (critic) in Actor-Critic is trained with Sarsa.
  3. Q-learning learns the optimal action-value function: $Q^\star(s,a)$.

4. Multi-step TD Target

  • The one-step target uses a single reward: $r_t$.

  • The multi-step target uses m rewards: $r_t, r_{t+1}, \dots, r_{t+m-1}$.

4.1 Multi-step TD Target for Sarsa

$$y_t = \sum_{i=0}^{m-1}\gamma^i\, r_{t+i} + \gamma^m\, Q_\pi(s_{t+m},a_{t+m})$$

4.2 Multi-step TD Target for Q-learning

$$y_t = \sum_{i=0}^{m-1}\gamma^i\, r_{t+i} + \gamma^m\, \max_{a} Q^\star(s_{t+m},a)$$
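
A minimal sketch of both multi-step targets, assuming the m rewards are collected in a Python list and the bootstrap values come from the current estimates:

```python
import numpy as np

def multistep_target_sarsa(rewards, q_boot, gamma=0.9):
    """rewards = [r_t, ..., r_{t+m-1}]; q_boot = Q_pi(s_{t+m}, a_{t+m})."""
    m = len(rewards)
    discounts = gamma ** np.arange(m)                 # 1, gamma, ..., gamma^{m-1}
    return float(discounts @ np.asarray(rewards) + gamma**m * q_boot)

def multistep_target_qlearning(rewards, q_row, gamma=0.9):
    """q_row = the vector of Q*(s_{t+m}, a) over all actions."""
    m = len(rewards)
    discounts = gamma ** np.arange(m)
    return float(discounts @ np.asarray(rewards) + gamma**m * np.max(q_row))

multistep_target_sarsa([1.0, 0.0, 2.0], q_boot=1.5)                      # m = 3
multistep_target_qlearning([1.0, 0.0, 2.0], q_row=np.array([0.5, 1.5]))
```
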
These are notes I wrote while following a tutorial video series on Bilibili.
by CyrusMay 2022 04 08

At the corner between childhood and grown-ups
we build a castle
(Mayday, "Hao Hao")
