# I. Objectives

1. Master basic MapReduce programming methods through hands-on practice;
2. Master MapReduce solutions to common data-processing problems, including data merging, deduplication, sorting, and data mining.

# II. Platform

• Operating system: Ubuntu 18.04 (or Ubuntu 16.04)

# III. Contents and Requirements

## 1. File Merging and Deduplication

### Problem

Merge the two input files below into a single output and eliminate duplicate records in the combined result.

Input file A:

```
20150101 x
20150102 y
20150103 x
20150104 y
20150105 z
20150106 x
```

Input file B:

```
20150101 y
20150102 y
20150103 x
20150104 z
20150105 y
```

Expected merged output:

```
20150101 x
20150101 y
20150102 y
20150103 x
20150104 y
20150104 z
20150105 y
20150105 z
20150106 x
```
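Deduplication falls out of the MapReduce model naturally: the shuffle phase sorts mapper output and groups identical records, and the reducer emits one line per group. A minimal local simulation of that grouping (illustration only, not part of the lab code):

```python
from itertools import groupby

# Simulate the shuffle: sort the combined mapper output, then walk
# the groups of identical records. Emitting one record per group
# is exactly the deduplication step.
mapper_output = ['20150103 x', '20150101 x', '20150103 x', '20150102 y']
deduped = [record for record, _ in groupby(sorted(mapper_output))]
# deduped is now ['20150101 x', '20150102 y', '20150103 x']
```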

### Code

mapper.py:

```python
#!/usr/bin/env python3
# encoding=utf-8
# mapper.py: the identity mapper -- pass each record through unchanged;
# the shuffle phase then brings identical records together.

import sys

for line in sys.stdin:
    line = line.strip()
    if line:
        print(line)
```

reducer.py:

```python
#!/usr/bin/env python3
# encoding=utf-8
# reducer.py: keep only the first occurrence of each (key, value)
# pair, then print the unique pairs in sorted order.

import sys

pairs = []

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    key, value = line.split(None, 1)   # split on any whitespace (tab or space)
    if [key, value] not in pairs:
        pairs.append([key, value])

pairs.sort(key=lambda x: (x[0], x[1]))

for key, value in pairs:
    print('%s\t%s' % (key, value))
```

### Quick test

```
cat A B | python3 mapper.py | python3 reducer.py
```
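The reducer above tests membership in a list, which makes the whole pass O(n²). A set-based variant (a sketch; the helper name `dedup` is mine, not from the lab handout) keeps each test O(1):

```python
#!/usr/bin/env python3
# encoding=utf-8
import sys

def dedup(lines):
    """Split whitespace-separated lines into (key, value) pairs,
    drop duplicates via a set, and return the pairs sorted."""
    seen = set()
    for line in lines:
        line = line.strip()
        if line:
            key, value = line.split(None, 1)
            seen.add((key, value))
    return sorted(seen)

if __name__ == '__main__':
    for key, value in dedup(sys.stdin):
        print('%s\t%s' % (key, value))
```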

## 2. Sorting the Input Files

### Problem

Each input file contains one integer per line. Read all three files, sort every number in ascending order, and output each number on its own line preceded by its 1-based rank.

Input file 1:

```
33
37
12
40
```

Input file 2:

```
4
16
39
5
```

Input file 3:

```
1
45
25
```

Expected output:

```
1 1
2 4
3 5
4 12
5 16
6 25
7 33
8 37
9 39
10 40
11 45
```

### Code

mapper.py:

```python
#!/usr/bin/env python3
# encoding=utf-8
# mapper.py: the identity mapper -- emit each number unchanged;
# the sorting happens in the reducer.

import sys

for line in sys.stdin:
    line = line.strip()
    if line:
        print(line)
```

reducer.py:

```python
#!/usr/bin/env python3
# encoding=utf-8
# reducer.py: collect every integer, sort ascending, and print
# each number preceded by its 1-based rank.

import sys

nums = []

for line in sys.stdin:
    line = line.strip()
    try:
        nums.append(int(line))
    except ValueError:      # skip blank or malformed lines
        continue

for rank, n in enumerate(sorted(nums), start=1):
    print('%d %d' % (rank, n))
```

### Quick test

```
cat 1 2 3 | python3 mapper.py | python3 reducer.py
```
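The pipe above sorts locally inside reducer.py, but under Hadoop Streaming the framework's shuffle sorts keys as *text*, so "12" would land before "4". One workaround (a sketch; the `WIDTH` bound is an assumption, and the streaming jar's KeyFieldBasedComparator with numeric options is an alternative) is to zero-pad keys in the mapper and strip the padding afterwards:

```python
# Zero-pad numbers so that lexicographic (string) order matches
# numeric order; WIDTH is an assumed upper bound on the digit count.
WIDTH = 10

def pad(n):
    return '%0*d' % (WIDTH, n)

nums = [33, 37, 12, 40, 4, 16, 39, 5, 1, 45, 25]

plain = sorted(str(n) for n in nums)       # text sort: '12' before '4'
padded = sorted(pad(n) for n in nums)      # text sort on padded keys
recovered = [int(s) for s in padded]       # numeric order restored
```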

## 3. Mining a Given Table

### Problem

Given the child-parent table below, mine the grandchild-grandparent relationships: whenever X is a child of Y and Y is a child of Z, output the pair (X, Z).

Input file child-parent:

```
child parent
Steven Lucy
Steven Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Frank
Jack Alice
Jack Jesse
David Alice
David Jesse
Philip David
Philip Alma
Mark David
Mark Alma
```

Expected output:

```
grandchild	grandparent
Steven		Alice
Steven		Jesse
Jone		Alice
Jone		Jesse
Steven		Mary
Steven		Frank
Jone		Mary
Jone		Frank
Philip		Alice
Philip		Jesse
Mark		Alice
Mark		Jesse
```

### Code

mapper.py:

```python
#!/usr/bin/env python3
# encoding=utf-8
# mapper.py: drop the header line, then emit each child/parent pair.

import sys

lines = list(sys.stdin)
for line in lines[1:]:          # lines[0] is the "child parent" header
    line = line.strip()
    child, parent = line.split()
    print('%s %s' % (child, parent))
```

reducer.py:

```python
#!/usr/bin/env python3
# encoding=utf-8
# reducer.py: self-join the table -- if (c, p) and (p, g) are both
# child/parent pairs, then c is a grandchild of g.

import sys

pairs = []

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    child, parent = line.split()
    pairs.append((child, parent))

print('grandchild\tgrandparent')
for c, p in pairs:
    for child, grandparent in pairs:
        if child == p:
            print('%s\t\t%s' % (c, grandparent))
```

### Quick test

```
cat child-parent | python3 mapper.py | python3 reducer.py
```
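The reducer above joins the table against itself with two nested scans, O(n²) in the number of pairs. Indexing parents by child first makes each lookup constant-time; a sketch (the `grandparents` helper is mine, not part of the lab code):

```python
from collections import defaultdict

def grandparents(pairs):
    """pairs: (child, parent) tuples. Return (grandchild, grandparent)
    tuples found by joining the table with itself on the middle person."""
    parent_of = defaultdict(list)          # child -> list of parents
    for child, parent in pairs:
        parent_of[child].append(parent)
    result = []
    for child, parents in parent_of.items():
        for p in parents:                  # p's parents are child's grandparents
            for gp in parent_of.get(p, []):
                result.append((child, gp))
    return result
```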

# IV. Running the Python Programs on HDFS

Start HDFS:

```
cd /usr/local/hadoop
sbin/start-dfs.sh
```

Create the input directory in HDFS (then upload the data files into it with `bin/hdfs dfs -put`):

```
bin/hdfs dfs -mkdir /input
```

Remove any previous output directory, since the job fails if it already exists:

```
bin/hdfs dfs -rm -r /output
```

Locate the Hadoop Streaming jar:

```
ls /usr/local/hadoop/share/hadoop/tools/lib/
```

Run the Streaming job, naming the scripts with `-mapper`/`-reducer` and shipping them to the cluster with `-file`:

```
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.2.2.jar \
    -input /input/* \
    -output /output \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py" \
    -file mapper.py \
    -file reducer.py
```

View the job output:

```
bin/hdfs dfs -cat /output/*
```

THE END