Common Neural Network Optimizers
Preface
These notes summarize the optimizer section of Professor Cao Jian's (Peking University) TensorFlow course on the China University MOOC platform.
Before this I had watched Andrew Ng's deep learning course and CS231n, both of which cover several optimizers and explain why each one works; even so, I find Professor Cao Jian's presentation the easiest to memorize.
1. Preliminaries and Notation
Parameter to be optimized: $w$
Loss function: $loss$
Learning rate: $lr$
Each iteration processes one batch, and $t$ denotes the current batch iteration count.
Parameter update steps:
- Compute the gradient of the loss with respect to the current parameters at step $t$: $g_t=\nabla loss=\dfrac{\partial\, loss}{\partial w_t}$
- Compute the first-order momentum $m_t$ and second-order momentum $V_t$ at step $t$
- Compute the descent step at step $t$: $\eta_t=lr \cdot m_t/\sqrt{V_t}$
- Compute the parameters at step $t+1$: $w_{t+1}=w_t-\eta_t=w_t-lr \cdot m_t/\sqrt{V_t}$
First-order momentum: a function of the gradient.
Second-order momentum: a function of the squared gradient.
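The four steps above can be sketched as a generic update template; every optimizer below only changes how $m_t$ and $V_t$ are defined. A minimal plain-Python sketch (the function name `generic_update` is my own, not from the course):

```python
import math

def generic_update(w, m, V, lr):
    """One step of the generic template:
    eta_t = lr * m_t / sqrt(V_t);  w_{t+1} = w_t - eta_t."""
    eta = lr * m / math.sqrt(V)
    return w - eta

# With m = g and V = 1 the template reduces to plain SGD:
w_next = generic_update(w=1.0, m=2.0, V=1.0, lr=0.1)
print(w_next)  # 0.8
```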
2. Stochastic Gradient Descent (SGD)
First-order momentum: $m_t=g_t$
Second-order momentum: $V_t=1$
$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot g_t$
$w_{t+1}=w_t-\eta_t=w_t-lr\cdot m_t/\sqrt{V_t}=w_t-lr\cdot g_t$
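As a sanity check, a tiny plain-Python sketch of the SGD rule. The toy loss $loss=w^2$ (gradient $g=2w$) and the name `sgd_step` are my own illustration, not from the course:

```python
def sgd_step(w, g, lr=0.1):
    # m_t = g_t, V_t = 1  ->  w_{t+1} = w_t - lr * g_t
    return w - lr * g

w = 1.0                      # minimize loss(w) = w^2, so g = 2w
for _ in range(100):
    w = sgd_step(w, 2 * w)
# w has decayed geometrically toward the minimum at 0
```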
3. SGDM
SGDM adds first-order momentum on top of SGD. In SGDM, $m_t$ is an exponential moving average of the gradient directions over time.
First-order momentum: $m_t=\beta \cdot m_{t-1}+(1-\beta)\cdot g_t$
Second-order momentum: $V_t=1$
$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot m_t=lr\cdot(\beta\cdot m_{t-1}+(1-\beta)\cdot g_t)$
$w_{t+1}=w_t-\eta_t=w_t-lr\cdot(\beta\cdot m_{t-1}+(1-\beta)\cdot g_t)$
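A plain-Python sketch of the SGDM update on the same toy loss $loss=w^2$ (my own example; `sgdm_step` is not a course function):

```python
def sgdm_step(w, m_prev, g, lr=0.1, beta=0.9):
    # m_t = beta * m_{t-1} + (1 - beta) * g_t ;  V_t = 1
    m = beta * m_prev + (1 - beta) * g
    return w - lr * m, m

w, m = 1.0, 0.0              # toy loss: loss(w) = w^2, g = 2w
for _ in range(300):
    w, m = sgdm_step(w, m, 2 * w)
# the exponentially averaged gradient still drives w toward 0
```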
4. Adagrad
Adagrad adds second-order momentum on top of SGD. The second-order momentum is the accumulated sum of squared gradients from the start of training.
First-order momentum: $m_t=g_t$
Second-order momentum: $V_t=\sum_{\tau=1}^{t} g_{\tau}^2$
$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot g_t\Big/\sqrt{\sum_{\tau=1}^{t} g_{\tau}^2}$
$w_{t+1}=w_t-\eta_t=w_t-lr\cdot g_t\Big/\sqrt{\sum_{\tau=1}^{t} g_{\tau}^2}$
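A plain-Python sketch of Adagrad on the same toy loss (my own example; note the effective step shrinks as the accumulator $V_t$ grows, which is Adagrad's signature behavior):

```python
import math

def adagrad_step(w, V_prev, g, lr=0.5):
    # V_t = V_{t-1} + g_t^2  (sum of squared gradients from the start)
    V = V_prev + g * g
    return w - lr * g / math.sqrt(V), V

w, V = 1.0, 0.0              # toy loss: loss(w) = w^2, g = 2w
for _ in range(100):
    w, V = adagrad_step(w, V, 2 * w)
# w converges to the minimum at 0 on this toy problem
```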
5. RMSProp
RMSProp adds second-order momentum on top of SGD. The second-order momentum is an exponential moving average of squared gradients, representing an average over the recent past.
First-order momentum: $m_t=g_t$
Second-order momentum: $V_t=\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2$
$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot g_t\Big/\sqrt{\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2}$
$w_{t+1}=w_t-\eta_t=w_t-lr\cdot g_t\Big/\sqrt{\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2}$
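A plain-Python sketch of RMSProp on the same toy loss (my own example). Note the formula above has no small epsilon in the denominator; practical implementations add one to avoid dividing by zero, but it is omitted here to match the notes:

```python
import math

def rmsprop_step(w, V_prev, g, lr=0.01, beta=0.9):
    # V_t = beta * V_{t-1} + (1 - beta) * g_t^2  (EMA of squared gradients)
    V = beta * V_prev + (1 - beta) * g * g
    return w - lr * g / math.sqrt(V), V

w, V = 1.0, 0.0              # toy loss: loss(w) = w^2, g = 2w
for _ in range(500):
    w, V = rmsprop_step(w, V, 2 * w)
# w ends up near the minimum, oscillating within roughly lr-sized steps
```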
6. Adam
Adam combines the first-order momentum of SGDM with the second-order momentum of RMSProp.
First-order momentum: $m_t=\beta_1\cdot m_{t-1}+(1-\beta_1)\cdot g_t$
Bias-corrected first-order momentum: $\hat{m}_t=\dfrac{m_t}{1-\beta_1^t}$
Second-order momentum: $V_t=\beta_2\cdot V_{t-1}+(1-\beta_2)\cdot g_t^2$
Bias-corrected second-order momentum: $\hat{V}_t=\dfrac{V_t}{1-\beta_2^t}$
$\eta_t=lr\cdot \hat{m}_t/\sqrt{\hat{V}_t}=lr\cdot\dfrac{m_t}{1-\beta_1^t}\Big/\sqrt{\dfrac{V_t}{1-\beta_2^t}}$
$w_{t+1}=w_t-\eta_t=w_t-lr\cdot\dfrac{m_t}{1-\beta_1^t}\Big/\sqrt{\dfrac{V_t}{1-\beta_2^t}}$
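A plain-Python sketch of the Adam update on the same toy loss (my own example; the epsilon term used in real implementations is again omitted to match the notes, and the default $\beta_1=0.9$, $\beta_2=0.999$ are the usual choices, not stated in the notes):

```python
import math

def adam_step(w, m, V, g, t, lr=0.01, beta1=0.9, beta2=0.999):
    m = beta1 * m + (1 - beta1) * g          # first-order momentum
    V = beta2 * V + (1 - beta2) * g * g      # second-order momentum
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    V_hat = V / (1 - beta2 ** t)
    return w - lr * m_hat / math.sqrt(V_hat), m, V

w, m, V = 1.0, 0.0, 0.0      # toy loss: loss(w) = w^2, g = 2w
for t in range(1, 1001):     # t starts at 1 so the corrections are defined
    w, m, V = adam_step(w, m, V, 2 * w, t)
# w has been driven close to the minimum at 0
```

The bias corrections matter early on: at $t=1$, $m_1=(1-\beta_1)g_1$ underestimates the gradient by a factor of ten, and dividing by $1-\beta_1^1$ restores its scale.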