I have written code that produces the words in task1-input1.txt together with their frequencies, excluding the stop words listed in stopwords.txt:
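import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopKCommonWords {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final String STOP_WORD_PATH = "C:\\Users\\user\\Desktop\\CS4225\\TopKCommonWords\\input\\stopwords.txt";
        private static final IntWritable one = new IntWritable(1);

        private final Set<String> stopwords = new HashSet<String>();
        private final Text word = new Text();

        // Load the stop words once per mapper task, before any map() call.
        // An unreadable stop-word file now fails the task instead of being ignored.
        @Override
        protected void setup(Context context) throws IOException {
            Path path = new Path(STOP_WORD_PATH);
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = br.readLine()) != null) {
                    stopwords.add(line);
                }
            }
        }

        // Emit (word, 1) for every token that is not a stop word.
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                if (stopwords.contains(word.toString())) {
                    continue;
                }
                context.write(word, one);
            }
        }
    }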

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        // Sum all partial counts for a word. Also used as the combiner.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get(); // accumulate; plain assignment would keep only the last value
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "counter1");
        job.setJarByClass(TopKCommonWords.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the input file, args[3] the output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[3]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
These are my program arguments.

I understand that by changing the argument index in
FileInputFormat.addInputPath(job, new Path(args[0]));
from 0 to 1, I can get the words and their frequencies for task1-input2.txt.
For example, in my case the two outputs are:
task1-input1:    task1-input2:
coffee 3         coffee 2
happy 10         good 3
good 6           sweet 5
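How can I compare these two outputs and return only the words common to both, each with the lower of its two frequencies? The expected result is: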
coffee 2
good 3
A uj5u.com user replied:
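If you want to sum the word counts across all files, you do not need to merge the output files. Instead, you can call addInputPath multiple times, or use the MultipleInputs class, to read several files in one job.
Alternatively, you should be able to pass the input folder as the argument so that every file in it is read.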
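If you want to find the words with the lowest count in each file, you need a second reducer.
You already have the output location available as a variable: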
Path output1 = new Path(args[3]);
FileOutputFormat.setOutputPath(job, output1);
So create another job that reads from that location.
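What that second pass has to compute for the question's example can be sketched in plain Java, without the Hadoop API. This is only a minimal sketch, assuming each first-pass output has already been parsed into a word-to-count map. The class `CommonWordMin` and the method `commonMin` are hypothetical names, not part of the original code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the original post): models the comparison
// the second reduce step has to perform for each word.
final class CommonWordMin {

    // Keeps only words present in both maps, paired with the smaller count.
    static Map<String, Integer> commonMin(Map<String, Integer> first,
                                          Map<String, Integer> second) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : first.entrySet()) {
            Integer other = second.get(e.getKey());
            if (other != null) {
                result.put(e.getKey(), Math.min(e.getValue(), other));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample counts taken from the question.
        Map<String, Integer> input1 = new LinkedHashMap<>();
        input1.put("coffee", 3);
        input1.put("happy", 10);
        input1.put("good", 6);

        Map<String, Integer> input2 = new LinkedHashMap<>();
        input2.put("coffee", 2);
        input2.put("good", 3);
        input2.put("sweet", 5);

        // Prints "coffee 2" and "good 3", one per line.
        commonMin(input1, input2)
                .forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

In an actual second job the same comparison would presumably happen per key: each first-pass output holds at most one line per word, so a reducer that receives two values for a word knows the word is common to both files and can write the smaller value.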
However, if you use a combiner for the word count and use the file name as the key, you might be able to do it with just one job.
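One possible reading of the single-job idea is to tag every count with its source file, sum per (word, file) pair, and then take the minimum per word across files. The sketch below simulates those phases in memory with plain Java collections. It does not use the Hadoop API, the token lists are toy data, and `SingleJobSketch`, `countPerFile`, and `minAcrossFiles` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory simulation, not a Hadoop job.
final class SingleJobSketch {

    // Map + combine phases: count each non-stop word separately per file.
    // Result shape: word -> (file name -> count).
    static Map<String, Map<String, Integer>> countPerFile(
            Map<String, List<String>> tokensByFile, Set<String> stopwords) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : tokensByFile.entrySet()) {
            for (String token : file.getValue()) {
                if (stopwords.contains(token)) {
                    continue;
                }
                counts.computeIfAbsent(token, k -> new TreeMap<>())
                      .merge(file.getKey(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Reduce phase: keep words seen in every file, with their smallest count.
    static Map<String, Integer> minAcrossFiles(
            Map<String, Map<String, Integer>> counts, int fileCount) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            if (e.getValue().size() == fileCount) {
                result.put(e.getKey(), Collections.min(e.getValue().values()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy token streams standing in for the two input files.
        Map<String, List<String>> tokensByFile = new LinkedHashMap<>();
        tokensByFile.put("task1-input1.txt",
                Arrays.asList("good", "coffee", "the", "good", "happy"));
        tokensByFile.put("task1-input2.txt",
                Arrays.asList("coffee", "coffee", "good", "sweet"));
        Set<String> stopwords = new HashSet<>(Arrays.asList("the"));

        Map<String, Integer> result = minAcrossFiles(
                countPerFile(tokensByFile, stopwords), tokensByFile.size());
        System.out.println(result); // {coffee=1, good=1}
    }
}
```

The design point is that the per-file sums must stay separate until the final step. If counts from different files were summed together first, the per-file minimum could no longer be recovered.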
Tags: Java, Hadoop, MapReduce
