# 1、网络结构

mtcnn 算法人脸检测过程分为三个独立的stage，每一个stage对应一个卷积网络，分别为pnet，rnet，onet。网络结构如图所示。

mtcnn三个网络的结构都相对简单，整个网络只包含3*3和2*2的卷积层、2*2的MaxPooling层、Prelu层和全连接层，网络结构比较简单。mtcnn采用级联网络的思想，pnet->rnet->onet网络结构更加复杂，每个网络采用多任务学习分别进行训练。

# 2、构造网络的三个stages (pnet, rnet, onet)

``````

import os
import numpy as np
import mtcnn.network.mtcnn_pytorch as mtcnn

pnet = mtcnn.PNet()
rnet = mtcnn.RNet()
onet = mtcnn.ONet()

weight_folder = '../output/converted'

``````

# 3、三个子网络的实际作用是什么？

## 3.1 Pnet

``````import torch
print("Given ten images with shape (500 * 500).")
pnet_data = torch.randn(10, 3, 500, 500)
ret1, ret2, _ = pnet(pnet_data)
print("Pnet output: ntTensor shape of ret1 is %s.ntTensor shape of ret2 is %s.n" % (ret1.shape, ret2.shape))
print("Where does '245' come from?")
print("245 = (500-2)/2-2-2 .nApply 'minus 2' every Conv layer with kernel 3*3. Apply 'divide 2' every max-pooling layer with kenel 2.")``````
```给定图片大小是 (500 * 500).
Pnet 输出:
张量ret1的维度是torch.Size([10, 2, 245, 245]).
张量ret2的维度是torch.Size([10, 4, 245, 245]).

245 = (500-2)/2-2-2 .
Apply 'minus 2' every Conv layer with kernel 3*3. Apply 'divide 2' every max-pooling layer with kenel 2.
```

Pnet只有卷积层和relu层没有最后的全连接层，输入一张500*500的图片，输出分别是2*245*245，4*245*245 的feature map。245是怎么来的呢，我们知道原始图片没经过一个n*n的卷积层，输出维度都会减少n-1每经过一个n*n的pooling 层输出维度都会减少n倍。Pnet的网络结构为 conv33->pooling22->conv33->conv33->conv11，所以最终输出维度为(500-2)/2-2-2 = 245。

rnet、onet与pnet不同，由于最后存在全连接层，只能输入固定大小的图像。同样输出分类结果和人脸框坐标回归值onet完成最后的人脸5个关键点的坐标预测任务

## 3.2 Rnet

``````rnet_data = torch.randn(10, 3, 24, 24)
print("The input tensor of Rnet must be with shape (24 * 24)")
ret1, ret2, _ = rnet(rnet_data)
print("Rnet output: ntTensor shape of ret1 is %s.ntTensor shape of ret2 is %s.n" % (ret1.shape, ret2.shape))``````

Rnet输入张量维度必须是(24 * 24)
Rnet 输出:
Tensor shape of ret1 is torch.Size([10, 2]). 分类结果
Tensor shape of ret2 is torch.Size([10, 4]). 坐标回归值

## 3.3 Onet

``````onet_data = torch.randn(10, 3, 48, 48)
print("The input tensor of Rnet must be with shape (48 * 48)")
ret1, ret2, ret3 = onet(onet_data)
print("Rnet output: ntTensor shape of ret1 is %s.ntTensor shape of ret2 is %s.ntTensor shape of ret3 is %s." % (ret1.shape, ret2.shape, ret3.shape))

``````
```Onet的输入维度是 (48 * 48)
Onet输出:
Tensor shape of ret1 is torch.Size([10, 2]).  分类结果
Tensor shape of ret2 is torch.Size([10, 4]).  坐标回归值
Tensor shape of ret3 is torch.Size([10, 10]). 五个关键点坐标
```

# 4、真实照片测试

``````import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

img = Image.open('../tests/asset/images/office5.jpg')
img_array = np.asarray(img)
plt.imshow(img_array)
plt.show()``````

## 4.1 图像金字塔

``````import math

minsize = 12
factor = 0.707

width = img.size[0]
height = img.size[1]

# Compute valid scales
scales = []
cur_width = width
cur_height = height
cur_factor = 1
while cur_width >= minsize and cur_height >= minsize:
# ensure width and height are even
w = cur_width
h = cur_height
scales.append((w, h, cur_factor))

cur_factor *= factor
cur_width = math.ceil(cur_width * factor)
cur_height = math.ceil(cur_height * factor)

# Resize the image
img_pyramid = img_array.copy()
pyramid_list = []
for w, h, f in scales:
im = img.resize((w, h), Image.BILINEAR)
im = np.asarray(im)
img_pyramid[0:h, 0:w] = im
pyramid_list.append(im)

plt.imshow(img_pyramid)
plt.show() ``````

## 4.2 For each resized image in pyramid, pass it into pnet and get classification feature map and bounding box regression feature map.

``````def preprocess(img):
"""
Convert image from NDArray to torch.FloatTensor.
"""
img = img.transpose(2, 0, 1)
img = torch.FloatTensor(img)
# The input of pnet must be normalized.
img = (img - 127.5) * 0.0078125
img = torch.unsqueeze(img, 0)
return img

for (w, h, f), im in zip(scales, pyramid_list):
im = preprocess(im)
p_distribution, box_regs, _ = pnet(im)
score = p_distribution[:, 1]

print("Input shape: %s.nt-->Score shape %s.nt-->Box regression shpae %s.n" % (im.shape, score.shape, box_regs.shape))
``````

```Input shape: torch.Size([1, 3, 375, 500]).
-->Score shape torch.Size([1, 183, 245]).
-->Box regression shpae torch.Size([1, 4, 183, 245]).

Input shape: torch.Size([1, 3, 266, 354]).
-->Score shape torch.Size([1, 128, 172]).
-->Box regression shpae torch.Size([1, 4, 128, 172]).

Input shape: torch.Size([1, 3, 189, 251]).
-->Score shape torch.Size([1, 90, 121]).
-->Box regression shpae torch.Size([1, 4, 90, 121]).

Input shape: torch.Size([1, 3, 134, 178]).
-->Score shape torch.Size([1, 62, 84]).
-->Box regression shpae torch.Size([1, 4, 62, 84]).

Input shape: torch.Size([1, 3, 95, 126]).
-->Score shape torch.Size([1, 43, 58]).
-->Box regression shpae torch.Size([1, 4, 43, 58]).

Input shape: torch.Size([1, 3, 68, 90]).
-->Score shape torch.Size([1, 29, 40]).
-->Box regression shpae torch.Size([1, 4, 29, 40]).

Input shape: torch.Size([1, 3, 49, 64]).
-->Score shape torch.Size([1, 20, 27]).
-->Box regression shpae torch.Size([1, 4, 20, 27]).

Input shape: torch.Size([1, 3, 35, 46]).
-->Score shape torch.Size([1, 13, 18]).
-->Box regression shpae torch.Size([1, 4, 13, 18]).

Input shape: torch.Size([1, 3, 25, 33]).
-->Score shape torch.Size([1, 8, 12]).
-->Box regression shpae torch.Size([1, 4, 8, 12]).

Input shape: torch.Size([1, 3, 18, 24]).
-->Score shape torch.Size([1, 4, 7]).
-->Box regression shpae torch.Size([1, 4, 4, 7]).

Input shape: torch.Size([1, 3, 13, 17]).
-->Score shape torch.Size([1, 2, 4]).
-->Box regression shpae torch.Size([1, 4, 2, 4]).```

## 4.3 特征如何映射到原始图片中的位置？以金字塔第五个图片为例

``````import matplotlib.patches as patches
w, h, f = scales[5]
im = pyramid_list[5].copy()

# Create figure and axes
fig, ax = plt.subplots(1)

# Display the image
ax.imshow(im)

# Create a Rectangle patch
rect = patches.Rectangle((0,0),12,12,linewidth=1,edgecolor='r',facecolor='none')
# Add the patch to the Axes

# Create a Rectangle patch
rect = patches.Rectangle((51,51),12,12,linewidth=1,edgecolor='b',facecolor='none')
# Add the patch to the Axes

plt.imshow(im)
plt.show()

print("We caculate Correspondence by this fomula: ntx1 = x1_map * 2 + 1, nty1 = y1_map * 2 + 1, ntx2 = x1_map * 2 + 1 + 12, nty2 = y2_map * 2 + 1 + 12.")
print("So the axis of Red box in original image is (1, 1, 13, 13), which Correspond to (0, 0) in feature map.")
print("The axis of blue box in original image is (51, 51, 63, 63), which correspond to (25, 25) in feature map.mro")``````

```We caculate Correspondence by this fomula:
x1 = x1_map * 2 + 1,
y1 = y1_map * 2 + 1,
x2 = x1_map * 2 + 1 + 12,
y2 = y2_map * 2 + 1 + 12.
Red box在原图中位置为(1, 1, 13, 13),特征图上位置为 (0, 0).
blue box在原图中位置为(51, 51, 63, 63), 特征图上位置为 (25, 25)```

## 4.4 通过pnet计算feature map，并建立feature map中每一个正样本点到原始图片区域的映射关系

``````def generate_bboxes(probs, offsets, scale, threshold):
"""Generate bounding boxes at places
where there is probably a face.

Arguments:
probs: a FloatTensor of shape [1, 2, n, m].
offsets: a FloatTensor array of shape [1, 4, n, m].
scale: a float number,
width and height of the image were scaled by this number.
threshold: a float number.

Returns:
boxes: LongTensor with shape [x, 4].
score: FloatTensor with shape [x].
"""

# applying P-Net is equivalent, in some sense, to
# moving 12x12 window with stride 2
stride = 2
cell_size = 12

# extract positive probability and resize it as [n, m] dim tensor.
probs = probs[0, 1, :, :]

# indices of boxes where there is probably a face
inds = (probs > threshold).nonzero()

if inds.shape[0] == 0:

# transformations of bounding boxes
tx1, ty1, tx2, ty2 = [offsets[0, i, inds[:, 0], inds[:, 1]]
for i in range(4)]
# they are defined as:
# w = x2 - x1 + 1
# h = y2 - y1 + 1
# x1_true = x1 + tx1*w
# x2_true = x2 + tx2*w
# y1_true = y1 + ty1*h
# y2_true = y2 + ty2*h

offsets = torch.stack([tx1, ty1, tx2, ty2], 1)
score = probs[inds[:, 0], inds[:, 1]]

# P-Net is applied to scaled images
# so we need to rescale bounding boxes back
bounding_boxes = torch.stack([
stride*inds[:, 1] + 1.0,
stride*inds[:, 0] + 1.0,
stride*inds[:, 1] + 1.0 + cell_size,
(stride*inds[:, 0] + 1.0 + cell_size),
], 0).transpose(0, 1).float()

bounding_boxes = torch.round(bounding_boxes / scale).int()
return bounding_boxes, score, offsets

candidate_boxes = torch.empty((0, 4), dtype=torch.int32)
candidate_scores = torch.empty((0))
candidate_offsets = torch.empty((0, 4), dtype=torch.float32)

for (w, h, f), im in zip(scales, pyramid_list):
im = preprocess(im)
p_distribution, box_regs, _ = pnet(im)

# we set filter threshold 0.6 here
candidate, scores, offsets = generate_bboxes(p_distribution, box_regs, f, 0.6)
candidate_boxes = torch.cat([candidate_boxes, candidate])
candidate_scores = torch.cat([candidate_scores, scores])
candidate_offsets = torch.cat([candidate_offsets, offsets])``````

``````img_origin = img_array.copy()

def show_boxes(img, boxes):

# Create figure and axes
fig, ax = plt.subplots(1)

# Display the image
ax.imshow(img)
for box in boxes:
# Create a Rectangle patch
rect = patches.Rectangle((box[0],box[1]),box[2]-box[0],box[3]-box[1],linewidth=1,edgecolor='r',facecolor='none')
# Add the patch to the Axes

plt.show()

show_boxes(img_origin, candidate_boxes)
print("It seems that there are many boxes around regions with face. Pretty good!")
print("There are olso many redundant box here and the axis is olso inaccurate. Don't worry about it.")``````

候选框很多而且不是很准对不对，没关系，上面说了第一阶段网络只保证召回率就可以了。接下来我们需要使用box regression的回归值对坐标形状位置进行调整。并进行nms(非极大抑制)操作过滤重合度过高的候选框。最后将候选框重新refine为正方形（防止进入下一层网络进行resize操作时图像变形）

## 4.4 Accurately adjust coordinates，nms and re-convert boxes to square.

``````def calibrate_box(bboxes, offsets):
"""Transform bounding boxes to be more like true bounding boxes.
'offsets' is one of the outputs of the nets.

Arguments:
bboxes: a IntTensor of shape [n, 4].
offsets: a IntTensor of shape [n, 4].

Returns:
a IntTensor of shape [n, 4].
"""
x1, y1, x2, y2 = [bboxes[:, i] for i in range(4)]
w = x2 - x1 + 1.0
h = y2 - y1 + 1.0
w = torch.unsqueeze(w, 1)
h = torch.unsqueeze(h, 1)

translation = torch.cat([w, h, w, h], 1).float() * offsets
bboxes += torch.round(translation).int()
return bboxes

def convert_to_square(bboxes):
"""Convert bounding boxes to a square form.

Arguments:
bboxes: a IntTensor of shape [n, 4].

Returns:
a IntTensor of shape [n, 4],
squared bounding boxes.
"""

square_bboxes = torch.zeros_like(bboxes, dtype=torch.float32)
x1, y1, x2, y2 = [bboxes[:, i].float() for i in range(4)]
h = y2 - y1 + 1.0
w = x2 - x1 + 1.0
max_side = torch.max(h, w)
square_bboxes[:, 0] = x1 + w*0.5 - max_side*0.5
square_bboxes[:, 1] = y1 + h*0.5 - max_side*0.5
square_bboxes[:, 2] = square_bboxes[:, 0] + max_side - 1.0
square_bboxes[:, 3] = square_bboxes[:, 1] + max_side - 1.0

square_bboxes = torch.ceil(square_bboxes + 1).int()
return square_bboxes

def refine_boxes(bboxes, w, h):
"""
Avoid coordinates beyond image size
"""

bboxes = torch.max(torch.zeros_like(bboxes), bboxes)
sizes = torch.IntTensor([[w, h, w, h]] * bboxes.shape[0])
bboxes = torch.min(bboxes, sizes)
return bboxes

def nms(dets, scores, thresh, mode="Union"):
"""
greedily select boxes with high confidence
keep boxes overlap <= thresh
rule out overlap > thresh
:param dets: [[x1, y1, x2, y2 score]]
:param thresh: retain overlap <= thresh
:return: indexes to keep
"""
x1 = dets[:, 0]
y1 = dets[:, 1]
x2 = dets[:, 2]
y2 = dets[:, 3]
scores = scores

areas = (x2 - x1 + 1) * (y2 - y1 + 1)
order = scores.argsort()[::-1]

keep = []
while order.size > 0:
i = order[0]
keep.append(i)
xx1 = np.maximum(x1[i], x1[order[1:]])
yy1 = np.maximum(y1[i], y1[order[1:]])
xx2 = np.minimum(x2[i], x2[order[1:]])
yy2 = np.minimum(y2[i], y2[order[1:]])

w = np.maximum(0.0, xx2 - xx1 + 1)
h = np.maximum(0.0, yy2 - yy1 + 1)
inter = w * h
if mode == "Union":
ovr = inter / (areas[i] + areas[order[1:]] - inter)
elif mode == "Minimum":
ovr = inter / np.minimum(areas[i], areas[order[1:]])

inds = np.where(ovr <= thresh)[0]
order = order[inds + 1]

return np.array(keep)
``````
``````

candidate_boxes = calibrate_box(candidate_boxes, candidate_offsets)
candidate_boxes = convert_to_square(candidate_boxes)
candidate_boxes = refine_boxes(candidate_boxes, width, height)
keep = nms(candidate_boxes.cpu().detach().numpy(), candidate_scores.cpu().detach().numpy(), 0.7)
candidate_boxes = candidate_boxes[keep]
show_boxes(img_array.copy(), candidate_boxes)

``````

# 5、Stage 2 Rnet

``````boxes = candidate_boxes
# Step one: crop and resize the images and pre-process them.
stage_two_imgs = []
for box in boxes:
im = img_array[box[1]: box[3], box[0]: box[2]]
im = Image.fromarray(im)
im = im.resize((24, 24), Image.BILINEAR)
im = np.asarray(im)
im = preprocess(im)
stage_two_imgs.append(im)

stage_two_imgs = torch.cat(stage_two_imgs)

# Step two: filter the boxes by scores given by rnet
p_distribution, box_regs, _ = rnet(stage_two_imgs)  # rnet forward pass
scores = p_distribution[:, 1]

boxes = calibrate_box(boxes, box_regs)
boxes = convert_to_square(boxes)
boxes = refine_boxes(boxes, width, height)

# nms
keep = nms(boxes.cpu().detach().numpy(), scores.cpu().detach().numpy(), 0.7)
boxes = boxes[keep]``````
`show_boxes(img_array.copy(), boxes)`

# 6、Stage 3 Onet

``````

# Step one: crop and resize the images and pre-process them.
stage_three_imgs = []
for box in boxes:
im = img_array[box[1]: box[3], box[0]: box[2]]
im = Image.fromarray(im)
im = im.resize((48, 48), Image.BILINEAR)
im = np.asarray(im)
im = preprocess(im)
stage_three_imgs.append(im)

stage_two_imgs = torch.cat(stage_three_imgs)

# Step two: filter the boxes by scores given by rnet
p_distribution, box_regs, _ = onet(stage_two_imgs)  # rnet forward pass
scores = p_distribution[:, 1]

boxes = calibrate_box(boxes, box_regs)
boxes = refine_boxes(boxes, width, height)

# nms
keep = nms(boxes.cpu().detach().numpy(), scores.cpu().detach().numpy(), 0.3)
boxes = boxes[keep]

``````

`show_boxes(img_array.copy(), boxes)`

THE END