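I. Lab Objectives

1. Master basic MapReduce programming methods through hands-on experiments;
2. Learn to use MapReduce to solve common data processing problems, including data deduplication, data sorting, and data mining.

II. Lab Platform

1. Operating system: Linux (Ubuntu 16.04 or Ubuntu 18.04 recommended)

III. Lab Content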

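Implement file merging and deduplication with MapReduce: merge two input files, A and B, remove duplicate lines, and produce a new output file C.

Sample input file A:

``````
20150101 x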
20150102 y
20150103 x
20150104 y
20150105 z
20150106 x
``````

Sample input file B:

``````
20150101 y
20150102 y
20150103 x
20150104 z
20150105 y
``````

Sample output file C, obtained by merging A and B:

``````
20150101 x
20150101 y
20150102 y
20150103 x
20150104 y
20150104 z
20150105 y
20150105 z
20150106 x
``````
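The expected file C is the sorted set union of the lines of A and B. As a reference point for checking a MapReduce result, the following plain-Java sketch computes that union without Hadoop. The class name MergeReference and the method mergeDistinct are hypothetical names chosen for illustration; the sketch assumes ASCII lines, for which Java string order matches the byte order a reducer would see.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical reference implementation (not Hadoop code):
// merges two lists of lines and removes duplicates.
class MergeReference {

    // Returns the distinct lines of both inputs in lexicographic order.
    static List<String> mergeDistinct(List<String> a, List<String> b) {
        TreeSet<String> merged = new TreeSet<>(a); // a TreeSet drops duplicates and sorts
        merged.addAll(b);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("20150101 x", "20150102 y", "20150103 x",
                                       "20150104 y", "20150105 z", "20150106 x");
        List<String> b = Arrays.asList("20150101 y", "20150102 y", "20150103 x",
                                       "20150104 z", "20150105 y");
        // Prints the nine lines of the sample file C, one per line.
        for (String line : mergeDistinct(a, b)) {
            System.out.println(line);
        }
    }
}
```

The MapReduce job in the steps that follow should produce the same set of lines, which makes this a convenient cross-check on small inputs.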

IV. Lab Steps

Start HDFS:

``````
cd /usr/local/hadoop
sbin/start-dfs.sh
``````

Create a working directory and use vim to create the input files A and B:

``````
sudo mkdir MapReduce && cd MapReduce
sudo vim A
sudo vim B
``````

Create the source file Merge.java:

``````
sudo vim Merge.java
``````

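``````
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Merges files A and B and removes duplicate lines,
 * producing a new output file C.
 */
public class Merge {

    // Override map: copy each input value (one line) to the output key.
    public static class Map extends Mapper<Object, Text, Text, Text> {
        private static final Text EMPTY = new Text("");

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, EMPTY);
        }
    }

    // Override reduce: write each input key to the output exactly once,
    // so duplicate lines collapse into a single record.
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        private static final Text EMPTY = new Text("");

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, EMPTY);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS replaces the deprecated fs.default.name
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        // Input and output paths are set directly instead of read from args
        String[] otherArgs = new String[] {"input", "output"};
        if (otherArgs.length != 2) {
            System.err.println("Usage: Merge <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "Merge and duplicate removal");
        job.setJarByClass(Merge.class);
        job.setMapperClass(Map.class);
        // The reducer only re-emits keys, so it can also serve as the combiner
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
``````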
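The Merge job deduplicates because the shuffle groups identical lines under one key and the reducer writes each key once. The following in-memory sketch models that data flow in plain Java, including the effect of registering the reducer as a combiner. It is not Hadoop code: the class DedupFlowSketch and its methods are hypothetical names, each inner list stands in for one input split, and a TreeMap stands in for the sorted shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of the job's data flow; not Hadoop code.
class DedupFlowSketch {

    // Map phase: every input line becomes a (line, "") pair.
    static List<String[]> mapPhase(List<String> lines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(new String[] {line, ""});
        }
        return pairs;
    }

    // Shuffle phase: group values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<String>> shuffle(List<String[]> pairs) {
        TreeMap<String, List<String>> groups = new TreeMap<>();
        for (String[] pair : pairs) {
            groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return groups;
    }

    // Reduce phase: emit each key once and ignore its values.
    static List<String[]> reducePhase(TreeMap<String, List<String>> groups) {
        List<String[]> out = new ArrayList<>();
        for (String key : groups.keySet()) {
            out.add(new String[] {key, ""});
        }
        return out;
    }

    // Runs the modeled job over several splits; when useCombiner is true,
    // the reduce logic is also applied to each split's map output.
    static List<String> run(List<List<String>> splits, boolean useCombiner) {
        List<String[]> intermediate = new ArrayList<>();
        for (List<String> split : splits) {
            List<String[]> mapped = mapPhase(split);
            intermediate.addAll(useCombiner ? reducePhase(shuffle(mapped)) : mapped);
        }
        List<String> keys = new ArrayList<>();
        for (String[] pair : reducePhase(shuffle(intermediate))) {
            keys.add(pair[0]);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("20150101 x", "20150102 y", "20150102 y"),
            Arrays.asList("20150101 y", "20150102 y"));
        System.out.println(run(splits, false)); // [20150101 x, 20150101 y, 20150102 y]
        System.out.println(run(splits, true));  // same list as without the combiner
    }
}
```

Both runs give the same result because re-emitting a key is idempotent: applying the reduce logic per split only shrinks the intermediate data, which is why reusing Reduce as the combiner is safe in this job.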

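Give the hadoop user ownership of the Hadoop directory, so that the files created with sudo above are accessible to it:

``````
sudo chown -R hadoop /usr/local/hadoop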
``````

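Set the HADOOP_HOME environment variable. Open ~/.bashrc:

``````
vim ~/.bashrc
``````

Add the following line:

``````
export HADOOP_HOME=/usr/local/hadoop
``````

Reload the configuration:

``````
source ~/.bashrc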
``````

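Compile Merge.java with the Hadoop libraries on the classpath (the hadoop classpath command prints them):

``````
javac -cp "$($HADOOP_HOME/bin/hadoop classpath)" Merge.java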
``````

Package the compiled classes into a jar:

``````
jar -cvf Merge.jar *.class
``````

Create the user's home directory in HDFS:

``````
/usr/local/hadoop/bin/hdfs dfs -mkdir -p /user/hadoop
``````

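Create the input directory in HDFS and upload both files A and B to it:

``````
/usr/local/hadoop/bin/hdfs dfs -mkdir -p input
/usr/local/hadoop/bin/hdfs dfs -put ./A ./B input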
``````

Run the job (the output directory must not already exist in HDFS, otherwise the job fails):

``````
/usr/local/hadoop/bin/hadoop jar Merge.jar Merge
``````

View the result:

``````
/usr/local/hadoop/bin/hdfs dfs -cat output/*
``````
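When comparing the job output against the expected file C, note that Merge sets no output format, so Hadoop's default text output is used: each line is the key, a tab, and the value, and because the value is an empty Text, every line ends with a tab. The following plain-Java sketch shows one way to check captured output lines under those assumptions, together with a single reducer and ASCII keys. OutputCheck and its methods are hypothetical helpers, not part of Hadoop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical checker for the job's text output; not part of Hadoop.
class OutputCheck {

    // Removes the trailing tab that separates the key from the empty value.
    static String stripEmptyValue(String line) {
        return line.endsWith("\t") ? line.substring(0, line.length() - 1) : line;
    }

    // True when every key is strictly greater than the previous one,
    // which means the keys are sorted and contain no duplicates.
    static boolean isSortedAndDistinct(List<String> outputLines) {
        String previous = null;
        for (String line : outputLines) {
            String key = stripEmptyValue(line);
            if (previous != null && previous.compareTo(key) >= 0) {
                return false;
            }
            previous = key;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> good = Arrays.asList("20150101 x\t", "20150101 y\t", "20150102 y\t");
        List<String> bad = Arrays.asList("20150101 x\t", "20150101 x\t");
        System.out.println(isSortedAndDistinct(good)); // prints true
        System.out.println(isSortedAndDistinct(bad));  // prints false
    }
}
```

With more than one reducer, each part file is sorted on its own but the concatenation printed by -cat is not necessarily globally sorted, so the check would then apply per part file.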

``````
hadoop@fzqs-Laptop:/usr/local/hadoop$ hdfs dfs -cat output/*
20170101 x
20170101 y
20170102 y
20170103 x
20170104 y
20170104 z
20170105 y
20170105 z
20170106 x