Backpropagation in a General Neural Network (DNN)

The DNN Backpropagation Process

Differentiation of Multivariate Functions

A loss function is always a scalar function: it uses a norm to collapse a vector into a scalar. The derivative of the loss with respect to the input of layer $L$ is therefore a scalar-by-vector derivative. In fact, a vector of any dimension can simply be viewed as an array of independent variables of a multivariate function.
For example, an $m\times n$ matrix $\{W_{ij}\}$ can be rearranged into an array of multivariate-function variables:

$$\{W_{ij}\}\rightarrow(W_{11},W_{12},\dots,W_{mn})$$
A scalar function of $\{W_{ij}\}$ can then be regarded as a multivariate function of $(W_{11},W_{12},\dots,W_{mn})$, and the gradient of that multivariate function is exactly the derivative of the scalar function with respect to the matrix. Recall that the gradient of a multivariate function is written as:

$$\frac{\partial f}{\partial \vec{x}}=\left(\frac{\partial f}{\partial x_{1}},\ \frac{\partial f}{\partial x_{2}},\dots,\frac{\partial f}{\partial x_{n}}\right)$$
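To make the scalar-by-matrix view concrete, here is a minimal NumPy sketch (the test function `f`, the matrix shape, and the step size `h` are illustrative assumptions, not from the text): it treats an $m\times n$ matrix as a flat array of variables and estimates each partial derivative with a central difference.

```python
# Illustrative sketch: the gradient of a scalar function of a matrix,
# obtained by treating the entries W_11, ..., W_mn as independent variables.
import numpy as np

def numerical_gradient(f, W, h=1e-6):
    """Estimate df/dW_ij for every entry of W via central differences."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):          # iterate over the flattened variables
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += h
        W_minus[idx] -= h
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * h)
    return grad

if __name__ == "__main__":
    W = np.random.randn(2, 3)                 # an assumed 2 x 3 example matrix
    f = lambda W: 0.5 * np.sum(W ** 2)        # example scalar function: 1/2 ||W||_F^2
    print(numerical_gradient(f, W))           # should be close to W itself
    print(W)
```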

Vector-by-Vector Derivatives

A vector-valued function can be viewed as a vector of scalar multivariate functions. For example, consider a vector function $G$ that maps a vector $B$ to a vector $A$:

$$A=G(B)\quad \text{where } A\in\mathbb{R}^{N\times1},\ B\in\mathbb{R}^{M\times1}$$

If we treat $A$ as a vector of scalar multivariate functions, the differentiation becomes much easier:

$$\begin{aligned} A&=\big(a_{1}(b_{1},b_{2},\dots,b_{m}),\ a_{2}(b_{1},b_{2},\dots,b_{m}),\dots\big)\\ \frac{\partial A}{\partial B}&=\left(\frac{\partial a_{1}}{\partial B},\frac{\partial a_{2}}{\partial B},\dots\right)\\ &=\begin{pmatrix} \frac{\partial a_{1}}{\partial b_{1}} & \dots & \frac{\partial a_{1}}{\partial b_{m}}\\ \frac{\partial a_{2}}{\partial b_{1}} & \dots & \frac{\partial a_{2}}{\partial b_{m}}\\ \vdots & \ddots & \vdots\\ \frac{\partial a_{n}}{\partial b_{1}} & \dots & \frac{\partial a_{n}}{\partial b_{m}} \end{pmatrix} \end{aligned}$$
See, vector-by-vector differentiation is much clearer now. Of course, it does not matter whether you lay the result out as an $n\times m$ matrix or an $m\times n$ matrix, as long as you stay consistent throughout the derivation.
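As a quick sanity check of this layout, here is a minimal NumPy sketch (the vector function `G` and its weight matrix are assumed examples): it estimates the Jacobian $\partial A/\partial B$ numerically and arranges it as in the matrix above, rows indexed by outputs and columns by inputs.

```python
# Illustrative sketch: the Jacobian dA/dB of a vector function, laid out as
# n x m (row i holds the partials of a_i with respect to b_1 ... b_m).
import numpy as np

def numerical_jacobian(G, B, h=1e-6):
    A = G(B)
    J = np.zeros((A.size, B.size))             # n x m layout: outputs along rows
    for j in range(B.size):
        B_plus, B_minus = B.copy(), B.copy()
        B_plus[j] += h
        B_minus[j] -= h
        J[:, j] = (G(B_plus) - G(B_minus)) / (2 * h)
    return J

if __name__ == "__main__":
    M_dim, N_dim = 3, 2
    W = np.random.randn(N_dim, M_dim)
    G = lambda B: np.tanh(W @ B)                # assumed vector function R^3 -> R^2
    B = np.random.randn(M_dim)
    print(numerical_jacobian(G, B).shape)       # (2, 3), i.e. n x m
```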

Differentiating the DNN Loss Function

The loss function of a neural network is always a scalar function. Common choices include the L1 and L2 norm losses. Taking the L2 norm loss as an example, the loss of a generic fully-connected network is:

$$\epsilon=\frac{1}{2}\,\|\sigma(\mathbf{a}^{L})-\mathbf{y}\|^{2}\qquad @Eq.1$$
where $\mathbf{a}^{L}=\mathbf{W}^{L}\cdot\mathbf{a}^{L-1}+\mathbf{b}^{L}$, with $\mathbf{a}^{L},\mathbf{b}^{L}\in\mathbb{R}^{N_{L}}$ and $\mathbf{W}^{L}\in\mathbb{R}^{N_{L}\times N_{L-1}}$, is the pre-activation output of layer $L$ (so $\sigma(\mathbf{a}^{L})$ is the layer's activation), and $\mathbf{y}$ is the ground truth. Now, how do we compute the gradient of the loss with respect to $\mathbf{W}^{L}$ and $\mathbf{b}^{L}$? We only have to expand Eq.1 into the following expression:

$$\begin{aligned} \epsilon &= \frac{1}{2}\sum_{i}^{N}\Big[\sigma\Big(\sum_{j}^{M}W_{ij}^{L}\, a^{L-1}_{j}+b_{i}^{L}\Big)-y_{i}\Big]^{2}\\ \frac{\partial\epsilon}{\partial W_{xy}^{L}} &= \Big[\sigma\Big(\sum_{j}^{M}W_{xj}^{L}\, a^{L-1}_{j}+b_{x}^{L}\Big)-y_{x}\Big]\times\sigma'\Big(\sum_{j}^{M}W_{xj}^{L}\, a^{L-1}_{j}+b_{x}^{L}\Big)\times a_{y}^{L-1} \end{aligned}$$

so

$$\frac{\partial\epsilon}{\partial \mathbf{W}^{L}}=\Big\{\frac{\partial\epsilon}{\partial W_{xy}^{L}}\Big\}_{x:1\rightarrow N,\ y:1\rightarrow M}$$

Then, surprisingly, this collection of entries collapses into a compact matrix form:

$$\frac{\partial\epsilon}{\partial \mathbf{W}^{L}}=\big[\big(\sigma(\mathbf{W}^{L}\mathbf{a}^{L-1}+\mathbf{b}^{L})-\mathbf{y}\big)\odot\sigma'(\mathbf{W}^{L}\mathbf{a}^{L-1}+\mathbf{b}^{L})\big]\cdot(\mathbf{a}^{L-1})^{T}$$
Similarly, differentiating the loss with respect to the bias gives:

$$\frac{\partial\epsilon}{\partial \mathbf{b}^{L}}=\big(\sigma(\mathbf{W}^{L}\mathbf{a}^{L-1}+\mathbf{b}^{L})-\mathbf{y}\big)\odot\sigma'(\mathbf{W}^{L}\mathbf{a}^{L-1}+\mathbf{b}^{L})$$
We usually write $\mathbf{z}^{L}=\mathbf{W}^{L}\cdot\mathbf{a}^{L-1}+\mathbf{b}^{L}$ for the pre-activation (non-activated) output, and $\boldsymbol{\delta}^{L}=\big(\sigma(\mathbf{z}^{L})-\mathbf{y}\big)\odot\sigma'(\mathbf{z}^{L})$ for the Hadamard-product term above. With this notation, the gradient of the loss with respect to the parameters of the last layer is simply:

$$\begin{aligned} \frac{\partial\epsilon}{\partial \mathbf{W}^{L}}&=\boldsymbol{\delta}^{L}\cdot(\mathbf{a}^{L-1})^{T}\\ \frac{\partial\epsilon}{\partial \mathbf{b}^{L}}&=\boldsymbol{\delta}^{L} \end{aligned}$$
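These last-layer formulas translate almost line by line into code. Below is a minimal NumPy sketch; the helper name and the choice of a sigmoid activation are assumptions for illustration (the text leaves $\sigma$ generic).

```python
# Illustrative sketch of the last-layer gradients derived above,
# assuming sigma is the sigmoid function.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def last_layer_grads(W_L, b_L, a_prev, y):
    """Gradients of eps = 1/2 ||sigma(z^L) - y||^2 with respect to W^L and b^L."""
    z_L = W_L @ a_prev + b_L                            # pre-activation z^L
    delta_L = (sigmoid(z_L) - y) * sigmoid_prime(z_L)   # delta^L (Hadamard product)
    dW_L = np.outer(delta_L, a_prev)                    # delta^L (a^{L-1})^T
    db_L = delta_L
    return dW_L, db_L

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_L, b_L = rng.normal(size=(4, 5)), rng.normal(size=4)
    a_prev, y = rng.normal(size=5), rng.normal(size=4)
    dW_L, db_L = last_layer_grads(W_L, b_L, a_prev, y)
    print(dW_L.shape, db_L.shape)                       # (4, 5) (4,)
```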
Hold on, it looks like we have just derived something remarkable. If we instead differentiate with respect to the parameters of some layer $H$, we get:

$$\begin{aligned} \frac{\partial\epsilon}{\partial \mathbf{W}^{H}}&=\boldsymbol{\delta}^{H}\cdot(\mathbf{a}^{H-1})^{T} \qquad @Eq.2\\ \frac{\partial\epsilon}{\partial \mathbf{b}^{H}}&=\boldsymbol{\delta}^{H} \qquad @Eq.3\\ \text{where }\ \boldsymbol{\delta}^{H}&=\frac{\partial\epsilon}{\partial \mathbf{z}^{L}}\cdot\frac{\partial\mathbf{z}^{L}}{\partial \mathbf{z}^{L-1}}\cdots\frac{\partial\mathbf{z}^{H+1}}{\partial \mathbf{z}^{H}} \end{aligned}$$
Clearly, the key to the derivation is the derivative of one layer's pre-activation output with respect to the previous layer's pre-activation output. Since $z_{i}^{L}=\sum_{j}W_{ij}^{L}\,\sigma(z_{j}^{L-1})+b_{i}^{L}$, we have:

$$\begin{aligned} \frac{\partial\mathbf{z}^{L}}{\partial \mathbf{z}^{L-1}}&=\Big\{\frac{\partial z^{L}_{i}}{\partial z^{L-1}_{j}}\Big\}\\ \frac{\partial z^{L}_{i}}{\partial z^{L-1}_{j}}&=W^{L}_{ij}\cdot \sigma'(z^{L-1}_{j})\\ \text{which indicates }\ \frac{\partial\mathbf{z}^{L}}{\partial \mathbf{z}^{L-1}}&=\mathbf{W}^{L}\cdot \operatorname{diag}\big(\sigma'(\mathbf{z}^{L-1})\big)\\ \text{where }\ \operatorname{diag}\big(\sigma'(\mathbf{z}^{L-1})\big)&=\begin{pmatrix} \sigma'(z_{1}^{L-1}) & 0 & \dots & 0\\ 0 & \sigma'(z_{2}^{L-1}) & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & \sigma'(z_{N_{L-1}}^{L-1}) \end{pmatrix} \end{aligned}$$

Substituting this into $\boldsymbol{\delta}^{H}$, we obtain:

$$\begin{aligned} \boldsymbol{\delta}^{H} &= \Big(\frac{\partial\mathbf{z}^{L}}{\partial \mathbf{z}^{L-1}}\cdots\frac{\partial\mathbf{z}^{H+1}}{\partial \mathbf{z}^{H}}\Big)^{T}\cdot\boldsymbol{\delta}^{L}\\ &= \Big[\mathbf{W}^{L}\operatorname{diag}\big(\sigma'(\mathbf{z}^{L-1})\big)\cdots\mathbf{W}^{H+1}\operatorname{diag}\big(\sigma'(\mathbf{z}^{H})\big)\Big]^{T}\cdot\boldsymbol{\delta}^{L} \qquad @Eq.4 \end{aligned}$$
To analyze Eq.4 from the dimension point of view:

$$\big[(N^{L}*N^{L-1})\times(N^{L-1}*N^{L-2})\times\cdots\times(N^{H+1}*N^{H})\big]^{T}\times(N^{L}*1)=(N^{H}*1)$$
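Eq.4 can also be transcribed into code exactly as this dimension chain suggests: build each per-layer Jacobian $\mathbf{W}^{l}\operatorname{diag}(\sigma'(\mathbf{z}^{l-1}))$, multiply them from layer $L$ down to $H+1$, transpose, and apply the result to $\boldsymbol{\delta}^{L}$. A minimal NumPy sketch follows; the sigmoid activation, the layer widths, and the helper name are assumptions for illustration.

```python
# Illustrative sketch of Eq.4: delta^H via the explicit product of Jacobians.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def delta_H(Ws, zs, delta_L, H):
    """Ws[l], zs[l] hold W^l and z^l for l = 1..L (index 0 unused); returns
    delta^H = [W^L diag(sigma'(z^{L-1})) ... W^{H+1} diag(sigma'(z^H))]^T delta^L."""
    L = len(Ws) - 1
    J = np.eye(len(delta_L))
    for l in range(L, H, -1):                    # l = L, L-1, ..., H+1
        J = J @ (Ws[l] @ np.diag(sigmoid_prime(zs[l - 1])))
    return J.T @ delta_L

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sizes = [6, 5, 4, 3]                          # assumed widths N^0..N^3
    Ws = [None] + [rng.normal(size=(sizes[l], sizes[l - 1])) for l in range(1, 4)]
    zs = [None] + [rng.normal(size=sizes[l]) for l in range(1, 4)]
    delta_L = rng.normal(size=sizes[3])
    print(delta_H(Ws, zs, delta_L, H=1).shape)    # (5,) = N^1
```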
It is then not hard to write down the gradient of the parameters of an arbitrary layer:

$$\begin{aligned} \frac{\partial\epsilon}{\partial \mathbf{W}^{H}}&=\Big[\mathbf{W}^{L}\operatorname{diag}\big(\sigma'(\mathbf{z}^{L-1})\big)\cdots\mathbf{W}^{H+1}\operatorname{diag}\big(\sigma'(\mathbf{z}^{H})\big)\Big]^{T}\cdot\boldsymbol{\delta}^{L}\cdot(\mathbf{a}^{H-1})^{T}\\ \frac{\partial\epsilon}{\partial \mathbf{b}^{H}}&=\Big[\mathbf{W}^{L}\operatorname{diag}\big(\sigma'(\mathbf{z}^{L-1})\big)\cdots\mathbf{W}^{H+1}\operatorname{diag}\big(\sigma'(\mathbf{z}^{H})\big)\Big]^{T}\cdot\boldsymbol{\delta}^{L} \end{aligned}$$
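In practice, this transposed-Jacobian product is applied one layer at a time, which gives the familiar recursion $\boldsymbol{\delta}^{l-1}=\big((\mathbf{W}^{l})^{T}\boldsymbol{\delta}^{l}\big)\odot\sigma'(\mathbf{z}^{l-1})$. Here is a minimal end-to-end sketch of Eq.2 through Eq.4 under that recursion; the sigmoid activation, the toy layer widths, and the helper name are assumptions for illustration.

```python
# Illustrative sketch of full backpropagation for an MLP with loss
# eps = 1/2 ||sigma(z^L) - y||^2, using the delta recursion derived above.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def backprop(Ws, bs, x, y):
    """Return dW[l], db[l] for every layer l = 1..L."""
    L = len(Ws) - 1
    a, zs, activations = x, [None], [x]                  # a^0 = x
    for l in range(1, L + 1):                            # forward pass, cache z^l and a^l
        z = Ws[l] @ a + bs[l]
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    dW, db = [None] * (L + 1), [None] * (L + 1)
    delta = (sigmoid(zs[L]) - y) * sigmoid_prime(zs[L])  # delta^L
    for l in range(L, 0, -1):                            # backward pass, Eq.2 / Eq.3
        dW[l] = np.outer(delta, activations[l - 1])
        db[l] = delta
        if l > 1:                                        # one transposed factor of Eq.4
            delta = (Ws[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return dW, db

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    sizes = [6, 5, 4, 3]                                 # assumed widths N^0..N^3
    Ws = [None] + [rng.normal(size=(sizes[l], sizes[l - 1])) for l in range(1, 4)]
    bs = [None] + [rng.normal(size=sizes[l]) for l in range(1, 4)]
    x, y = rng.normal(size=sizes[0]), rng.normal(size=sizes[3])
    dW, db = backprop(Ws, bs, x, y)
    print([g.shape for g in dW[1:]])                     # [(5, 6), (4, 5), (3, 4)]
```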
