# DNN Backpropagation

## Multivariable Differentiation

An $m\times n$ matrix $\{W_{ij}\}$ can be flattened into the argument list of a multivariable function:

$$\{W_{ij}\}\rightarrow(W_{11},W_{12},\dots,W_{mn})$$

A scalar function of $\{W_{ij}\}$ can therefore be viewed as a multivariable function of $(W_{11},W_{12},\dots,W_{mn})$, and the gradient of that multivariable function is exactly the derivative of the scalar function with respect to the matrix. Recall that the gradient of a multivariable function looks like this:

$$\frac{\partial f}{\partial \overrightarrow{x}}=\left(\frac{\partial f}{\partial x_{1}},\frac{\partial f}{\partial x_{2}},\dots,\frac{\partial f}{\partial x_{n}}\right)$$
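The idea that a matrix derivative is just a gradient over the flattened entries can be checked numerically. The sketch below is only an illustration: the helper `numerical_gradient` and the example function `sum_of_squares` are hypothetical names, and the central-difference step `h` is an arbitrary choice. It perturbs each entry $W_{ij}$ in turn and stores the partial derivative at the same position, so the result has the same shape as $W$.

```python
import numpy as np

def numerical_gradient(f, W, h=1e-6):
    """Central-difference gradient of a scalar function f with respect to every entry of W."""
    W = np.asarray(W, dtype=float)
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):      # visit W_11, W_12, ..., W_mn one by one
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += h
        W_minus[idx] -= h
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * h)
    return grad

def sum_of_squares(W):
    # Hypothetical example: f(W) = sum_ij W_ij^2, whose analytic gradient is 2 * W.
    return float(np.sum(W ** 2))

W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # m = 2, n = 3
grad = numerical_gradient(sum_of_squares, W)
print(grad.shape == W.shape)             # the gradient keeps the matrix layout
print(np.allclose(grad, 2 * W, atol=1e-4))
```

Storing the gradient in the shape of $W$ rather than as a flat vector is a layout choice; both carry the same $m\cdot n$ partial derivatives.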

## Differentiating a Vector with Respect to a Vector

$$A=G(B)\quad \text{where } A\in R^{N\times1},\ B\in R^{M\times1}$$

$$\begin{aligned}
A&=(a_{1}(b_{1},b_{2},\dots,b_{M}),\ a_{2}(b_{1},b_{2},\dots,b_{M}),\ \dots)\\
\frac{\partial A}{\partial B}&=\left(\frac{\partial a_{1}}{\partial B},\frac{\partial a_{2}}{\partial B},\dots\right)\\
&=\left(\begin{array}{ccc}
\frac{\partial a_{1}}{\partial b_{1}} & \dots & \frac{\partial a_{1}}{\partial b_{M}}\\
\frac{\partial a_{2}}{\partial b_{1}} & \dots & \frac{\partial a_{2}}{\partial b_{M}}\\
\dots & \dots & \dots\\
\frac{\partial a_{N}}{\partial b_{1}} & \dots & \frac{\partial a_{N}}{\partial b_{M}}
\end{array}\right)
\end{aligned}$$

See, the vector derivative is much clearer now. Of course, whether you lay the derivative out as an $N\times M$ matrix or as an $M\times N$ matrix does not matter, as long as you use the same convention throughout the derivation.
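To make the layout concrete, here is a small numerical sketch that builds the matrix of partial derivatives column by column, using the $N\times M$ convention (row $i$ for $a_i$, column $j$ for $b_j$). The helper `numerical_jacobian` and the example map `G` are hypothetical and chosen only so that the exact answer is easy to write down by hand.

```python
import numpy as np

def numerical_jacobian(G, B, h=1e-6):
    """Return the N x M matrix J with J[i, j] = d a_i / d b_j (central differences)."""
    B = np.asarray(B, dtype=float)
    N = np.asarray(G(B)).size
    J = np.zeros((N, B.size))
    for j in range(B.size):
        step = np.zeros_like(B)
        step[j] = h                       # perturb only b_j
        J[:, j] = (G(B + step) - G(B - step)) / (2 * h)
    return J

def G(B):
    # Hypothetical example map from R^2 to R^3.
    b1, b2 = B
    return np.array([b1 * b2, b1 + b2, b1 ** 2])

B = np.array([2.0, 3.0])
J = numerical_jacobian(G, B)

# Exact partial derivatives at (b1, b2) = (2, 3), written out by hand.
J_exact = np.array([[3.0, 2.0],
                    [1.0, 1.0],
                    [4.0, 0.0]])
print(J.shape)
print(np.allclose(J, J_exact, atol=1e-5))
```

Taking `J.T` gives the $M\times N$ layout instead; the entries are the same, only the convention changes.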

## Differentiating the DNN Loss Function

$$\epsilon=\frac{1}{2}\left\|\sigma(\mathbf{z}^{L})-\mathbf{y}\right\|^{2}\qquad\text{(Eq.1)}$$

Here $\mathbf{z}^{L}=\mathbf{W}^{L}\cdot\mathbf{a}^{L-1}+\mathbf{b}^{L}$ with $\mathbf{z}^{L},\mathbf{b}^{L}\in R^{N_{L}}$ and $\mathbf{W}^{L}\in R^{N_{L}\times N_{L-1}}$, so that $\mathbf{a}^{L}=\sigma(\mathbf{z}^{L})$ is the result of layer $L$'s activation function, and $\mathbf{y}$ is the ground truth. Now, how do we compute the gradient of the loss with respect to $\mathbf{W}^{L}$ and $\mathbf{b}^{L}$? We only have to expand Eq.1 into the following expression:

$$\begin{aligned}
\epsilon &= \frac{1}{2}\sum_{i=1}^{N}\left[\sigma\left(\sum_{j=1}^{M}W_{ij}^{L}\cdot a^{L-1}_{j}+b_{i}^{L}\right)-y_{i}\right]^{2}\\
\frac{\partial\epsilon}{\partial W_{xy}^{L}} &= \left[\sigma\left(\sum_{j=1}^{M}W_{xj}^{L}\cdot a^{L-1}_{j}+b_{x}^{L}\right)-y_{x}\right]\times\sigma'\left(\sum_{j=1}^{M}W_{xj}^{L}\cdot a^{L-1}_{j}+b_{x}^{L}\right)\times a_{y}^{L-1}\\
\text{so, }\frac{\partial\epsilon}{\partial \mathbf{W}^{L}}&=\left\{\frac{\partial\epsilon}{\partial W_{xy}^{L}}\right\}_{x:1\rightarrow N,\ y:1\rightarrow M}\\
&\text{Then, surprisingly,}\\
&=\left[\left(\sigma(\mathbf{W}^{L}\cdot\mathbf{a}^{L-1}+\mathbf{b}^{L})-\mathbf{y}\right)\odot\sigma'(\mathbf{W}^{L}\cdot\mathbf{a}^{L-1}+\mathbf{b}^{L})\right]\cdot(\mathbf{a}^{L-1})^{T}
\end{aligned}$$

where $N=N_{L}$ and $M=N_{L-1}$.

Similarly:

$$\frac{\partial\epsilon}{\partial \mathbf{b}^{L}}=\left(\sigma(\mathbf{W}^{L}\cdot\mathbf{a}^{L-1}+\mathbf{b}^{L})-\mathbf{y}\right)\odot\sigma'(\mathbf{W}^{L}\cdot\mathbf{a}^{L-1}+\mathbf{b}^{L})$$

Let $\mathbf{z}^{L}=\mathbf{W}^{L}\cdot\mathbf{a}^{L-1}+\mathbf{b}^{L}$ denote the pre-activation output, and define $\boldsymbol{\delta}^{L}=\left(\sigma(\mathbf{z}^{L})-\mathbf{y}\right)\odot\sigma'(\mathbf{z}^{L})$. Then:

$$\begin{aligned}
\frac{\partial\epsilon}{\partial \mathbf{W}^{L}}&=\boldsymbol{\delta}^{L}\cdot(\mathbf{a}^{L-1})^{T}\\
\frac{\partial\epsilon}{\partial \mathbf{b}^{L}}&=\boldsymbol{\delta}^{L}
\end{aligned}$$
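The output-layer formulas can be sanity-checked against a finite difference. The sketch below assumes a sigmoid for $\sigma$ (the text does not fix a particular activation) and uses hypothetical helper names and random shapes. It computes $\boldsymbol{\delta}^{L}=(\sigma(\mathbf{z}^{L})-\mathbf{y})\odot\sigma'(\mathbf{z}^{L})$, forms the outer product with $\mathbf{a}^{L-1}$, and compares one entry with a numerical derivative of the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def loss(W, b, a_prev, y):
    # Eq.1 with z^L = W a^{L-1} + b
    return 0.5 * np.sum((sigmoid(W @ a_prev + b) - y) ** 2)

def output_layer_grads(W, b, a_prev, y):
    z = W @ a_prev + b
    delta = (sigmoid(z) - y) * sigmoid_prime(z)   # delta^L, shape (N_L,)
    dW = np.outer(delta, a_prev)                  # delta^L (a^{L-1})^T, shape (N_L, N_{L-1})
    db = delta
    return dW, db

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))       # N_L = 3, N_{L-1} = 4 (arbitrary sizes)
b = rng.normal(size=3)
a_prev = rng.normal(size=4)
y = rng.normal(size=3)

dW, db = output_layer_grads(W, b, a_prev, y)

# Central-difference check of a single entry, dE/dW[1, 2].
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += h
W_minus[1, 2] -= h
numeric = (loss(W_plus, b, a_prev, y) - loss(W_minus, b, a_prev, y)) / (2 * h)
print(abs(numeric - dW[1, 2]) < 1e-6)
```

Note that `np.outer(delta, a_prev)` is exactly the column-times-row product $\boldsymbol{\delta}^{L}(\mathbf{a}^{L-1})^{T}$, which is why the gradient has the same shape as $\mathbf{W}^{L}$.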

More generally, for the parameters of layer $H$ we have:

$$\begin{aligned}
\frac{\partial\epsilon}{\partial \mathbf{W}^{H}}&=\boldsymbol{\delta}^{H}\cdot(\mathbf{a}^{H-1})^{T}\qquad\text{(Eq.2)}\\
\frac{\partial\epsilon}{\partial \mathbf{b}^{H}}&=\boldsymbol{\delta}^{H}\qquad\text{(Eq.3)}\\
\text{where }\boldsymbol{\delta}^{H}&=\frac{\partial\epsilon}{\partial \mathbf{z}^{L}}\cdot\frac{\partial\mathbf{z}^{L}}{\partial \mathbf{z}^{L-1}}\cdots\frac{\partial\mathbf{z}^{H+1}}{\partial \mathbf{z}^{H}}
\end{aligned}$$
Clearly, the key to the derivative is the Jacobian of one layer's pre-activation output with respect to the previous layer's pre-activation output. Since $z^{L}_{i}=\sum_{j}W^{L}_{ij}\cdot\sigma(z^{L-1}_{j})+b^{L}_{i}$, we get:

$$\begin{aligned}
\frac{\partial\mathbf{z}^{L}}{\partial \mathbf{z}^{L-1}}&=\left\{\frac{\partial z^{L}_{i}}{\partial z^{L-1}_{j}}\right\}\\
\frac{\partial z^{L}_{i}}{\partial z^{L-1}_{j}}&=W^{L}_{ij}\cdot \sigma'(z^{L-1}_{j})\\
\text{which indicates }\frac{\partial\mathbf{z}^{L}}{\partial \mathbf{z}^{L-1}}&=\mathbf{W}^{L}\cdot \operatorname{diag}(\sigma'(\mathbf{z}^{L-1}))\\
\text{where }\operatorname{diag}(\sigma'(\mathbf{z}^{L-1}))&=\left(\begin{array}{ccc}
\sigma'(z_{1}^{L-1}) & 0 & \dots\\
0 & \sigma'(z_{2}^{L-1}) & \dots\\
\dots & \dots & \dots\\
\dots & \dots & \sigma'(z_{N_{L-1}}^{L-1})
\end{array}\right)
\end{aligned}$$
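The layer-to-layer Jacobian can be verified numerically. The sketch below assumes $\mathbf{a}^{L-1}=\sigma(\mathbf{z}^{L-1})$ with a sigmoid activation, so that $\mathbf{z}^{L}=\mathbf{W}^{L}\sigma(\mathbf{z}^{L-1})+\mathbf{b}^{L}$; under that assumption the Jacobian works out to $\mathbf{W}^{L}\cdot\operatorname{diag}(\sigma'(\mathbf{z}^{L-1}))$. The helper names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))        # W^L, shape N_L x N_{L-1}
b = rng.normal(size=3)             # b^L
z_prev = rng.normal(size=4)        # z^{L-1}

def next_z(z_prev):
    # z^L as a function of z^{L-1}
    return W @ sigmoid(z_prev) + b

# Closed form: every column j of W is scaled by sigma'(z^{L-1}_j).
J_formula = W @ np.diag(sigmoid_prime(z_prev))

# Central-difference Jacobian, one column per perturbed input component.
h = 1e-6
J_numeric = np.zeros((3, 4))
for j in range(4):
    step = np.zeros(4)
    step[j] = h
    J_numeric[:, j] = (next_z(z_prev + step) - next_z(z_prev - step)) / (2 * h)

print(np.allclose(J_formula, J_numeric, atol=1e-6))
```

In practice the diagonal matrix is rarely built explicitly: `W * sigmoid_prime(z_prev)` gives the same result through broadcasting.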

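Substituting this into $\boldsymbol{\delta}^{H}$ gives:

$$\begin{aligned}
\boldsymbol{\delta}^{H}&=\left(\frac{\partial\mathbf{z}^{L}}{\partial \mathbf{z}^{L-1}}\cdots\frac{\partial\mathbf{z}^{H+1}}{\partial \mathbf{z}^{H}}\right)^{T}\cdot\boldsymbol{\delta}^{L}\\
&=\left[\prod_{k=L}^{H+1}\mathbf{W}^{k}\cdot \operatorname{diag}(\sigma'(\mathbf{z}^{k-1}))\right]^{T}\cdot\boldsymbol{\delta}^{L}\qquad\text{(Eq.4)}
\end{aligned}$$

Here the product runs over $k=L,L-1,\dots,H+1$ from left to right, and $\sigma'(\mathbf{z}^{k-1})$ is the derivative of the activation function evaluated at layer $k-1$'s pre-activation output.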
Analyzing Eq.4 in terms of dimensions, we have:

$$\left[(N_{L}\times N_{L-1})\cdot(N_{L-1}\times N_{L-2})\cdots(N_{H+1}\times N_{H})\right]^{T}\cdot(N_{L}\times 1)=(N_{H}\times 1)$$

Substituting Eq.4 into Eq.2 and Eq.3 gives the final result:

$$\begin{aligned}
\frac{\partial\epsilon}{\partial \mathbf{W}^{H}}&=\left[\prod_{k=L}^{H+1}\mathbf{W}^{k}\cdot \operatorname{diag}(\sigma'(\mathbf{z}^{k-1}))\right]^{T}\cdot\boldsymbol{\delta}^{L}\cdot(\mathbf{a}^{H-1})^{T}\\
\frac{\partial\epsilon}{\partial \mathbf{b}^{H}}&=\left[\prod_{k=L}^{H+1}\mathbf{W}^{k}\cdot \operatorname{diag}(\sigma'(\mathbf{z}^{k-1}))\right]^{T}\cdot\boldsymbol{\delta}^{L}
\end{aligned}$$
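Putting the pieces together, the transposed product of Jacobians can be applied one factor at a time, which gives the recurrence $\boldsymbol{\delta}^{k-1}=\sigma'(\mathbf{z}^{k-1})\odot\left((\mathbf{W}^{k})^{T}\boldsymbol{\delta}^{k}\right)$. The following is a minimal sketch of that procedure for a small fully connected network, assuming sigmoid activations in every layer and the squared-error loss of Eq.1. The layer sizes, function names, and random data are all hypothetical, and this is not meant as a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(Ws, bs, x):
    """Return the pre-activations z^k and the activations a^k (with a^0 = x)."""
    zs, acts = [], [x]
    for W, b in zip(Ws, bs):
        zs.append(W @ acts[-1] + b)
        acts.append(sigmoid(zs[-1]))
    return zs, acts

def loss(Ws, bs, x, y):
    _, acts = forward(Ws, bs, x)
    return 0.5 * np.sum((acts[-1] - y) ** 2)

def backward(Ws, bs, x, y):
    """Gradients of the loss with respect to every W^k and b^k."""
    zs, acts = forward(Ws, bs, x)
    delta = (acts[-1] - y) * sigmoid_prime(zs[-1])       # delta^L
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for k in reversed(range(len(Ws))):
        dWs[k] = np.outer(delta, acts[k])                # delta^k (a^{k-1})^T
        dbs[k] = delta
        if k > 0:
            # One transposed Jacobian factor: diag(sigma'(z^{k-1})) (W^k)^T delta^k
            delta = sigmoid_prime(zs[k - 1]) * (Ws[k].T @ delta)
    return dWs, dbs

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                                     # input, two hidden layers, output
Ws = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=n_out) for n_out in sizes[1:]]
x, y = rng.normal(size=sizes[0]), rng.normal(size=sizes[-1])

dWs, dbs = backward(Ws, bs, x, y)

# Central-difference check on one weight of the first layer.
h = 1e-6
Ws_plus = [W.copy() for W in Ws]
Ws_minus = [W.copy() for W in Ws]
Ws_plus[0][0, 0] += h
Ws_minus[0][0, 0] -= h
numeric = (loss(Ws_plus, bs, x, y) - loss(Ws_minus, bs, x, y)) / (2 * h)
print(abs(numeric - dWs[0][0, 0]) < 1e-6)
```

Propagating $\boldsymbol{\delta}$ layer by layer avoids ever forming the full product of Jacobians: each step is a matrix-vector product followed by an element-wise multiplication, and every intermediate $\boldsymbol{\delta}^{k}$ has the $N_{k}\times 1$ shape predicted by the dimension analysis.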

THE END