Reducing MapReduce output files

I have written code that produces the words in task1-input1.txt together with their frequencies, excluding the stop words listed in stopwords.txt:

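import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopKCommonWords {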

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable>{
    Set<String> stopwords = new HashSet<String>();
    private static final String STOP_WORD_PATH = "C:\\Users\\user\\Desktop\\CS4225\\TopKCommonWords\\input\\stopwords.txt";

    @Override
    protected void setup(Context context) {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                FileSystem.get(context.getConfiguration()).open(new Path(STOP_WORD_PATH))))) {
            String word;
            while ((word = br.readLine()) != null) {
                stopwords.add(word);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context
    ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            if (stopwords.contains(word.toString()))
                continue;
            context.write(word, one);
        }
    }
}

public static class IntSumReducer
        extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
    ) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();  // accumulate the counts instead of overwriting them
        }
        result.set(sum);
        context.write(key, result);
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "counter1");
    job.setJarByClass(TopKCommonWords.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[3]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

}

These are my program arguments.

I understand that by changing the argument index in

FileInputFormat.addInputPath(job, new Path(args[0]));

from 0 to 1, I can get the words and their frequencies for task1-input2.txt.

For example, in my case the two outputs are:

task1-input1:      task1-input2:

coffee 3           coffee 2
happy 10           good 3
good 6             sweet 5

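How can I compare these two outputs and return only the words common to both, each with the lower of its two frequencies? The expected result should be: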

coffee 2
good 3

Answer from a uj5u.com user:

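If you want to sum the words across all files, you don't need to merge the output files. Instead, you can call addInputPath several times, or use the MultipleInputs class, to read multiple files.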

Alternatively, you should be able to pass the input folder as the argument so that every file in it is read.

If you want to find the words with the lowest count in each file, you need a second reducer.

You already have the output location available as a variable:

Path output1 = new Path(args[3]);
FileOutputFormat.setOutputPath(job, output1);

So, create another job that reads from that location.
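As a rough illustration of what such a second pass would compute, here is a minimal plain-Java sketch. It assumes the first job was run once per input file and wrote the default tab-separated `word<TAB>count` lines. The class `MinCommonCounts` and its `minCommon` method are hypothetical names, and the Hadoop wiring (a mapper that re-emits each word with its count, a reducer that keeps the minimum) is deliberately left out; only the reduce-side logic is shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MinCommonCounts {

    // Hypothetical helper: takes one list of "word<TAB>count" lines per input
    // file and keeps only the words present in every file, with their lowest count.
    public static Map<String, Integer> minCommon(List<List<String>> outputs) {
        // Group all counts by word, as the shuffle phase would do for a reducer.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<String> lines : outputs) {
            for (String line : lines) {
                String[] parts = line.split("\t");
                if (parts.length != 2) {
                    continue; // skip malformed lines
                }
                grouped.computeIfAbsent(parts[0], k -> new ArrayList<>())
                       .add(Integer.parseInt(parts[1].trim()));
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            // A word occurs at most once per output file, so seeing as many
            // counts as there are files means the word is common to all of them.
            if (e.getValue().size() == outputs.size()) {
                result.put(e.getKey(), Collections.min(e.getValue()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The sample outputs from the question.
        List<String> out1 = List.of("coffee\t3", "happy\t10", "good\t6");
        List<String> out2 = List.of("coffee\t2", "good\t3", "sweet\t5");
        System.out.println(minCommon(List.of(out1, out2))); // prints {coffee=2, good=3}
    }
}
```

In a real second job, the body of the last loop would sit in the reducer: count the values received for a key and write the minimum only when every file contributed one.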

However, if you do the word count in a combiner and use the file name as the key, you may be able to get by with a single job.
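One possible reading of the single-job idea, simulated in plain Java rather than on Hadoop: every occurrence of a word is tagged with the name of the file it came from, counts are summed per word and file, and a final step keeps the minimum for words seen in every file. The class `SingleJobSketch`, the `run` method, and the sample lines are made up for illustration. This variant groups by word and carries the file name alongside the count, which is an assumption about what the answer intends.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class SingleJobSketch {

    // Hypothetical simulation: "files" maps a file name to the lines of that file.
    public static Map<String, Integer> run(Map<String, List<String>> files,
                                           Set<String> stopwords) {
        // "Map + combine": count occurrences per word and per file name.
        Map<String, Map<String, Integer>> perWord = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                    String word = itr.nextToken();
                    if (stopwords.contains(word)) {
                        continue;
                    }
                    perWord.computeIfAbsent(word, k -> new HashMap<>())
                           .merge(file.getKey(), 1, Integer::sum);
                }
            }
        }
        // "Reduce": keep a word only if every file contributed a count,
        // and report the smallest of those counts.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : perWord.entrySet()) {
            if (e.getValue().size() == files.size()) {
                result.put(e.getKey(), Collections.min(e.getValue().values()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input, not the question's actual files.
        Map<String, List<String>> files = Map.of(
                "task1-input1.txt", List.of("coffee coffee the good", "coffee happy"),
                "task1-input2.txt", List.of("coffee good good", "the sweet coffee"));
        System.out.println(run(files, Set.of("the"))); // prints {coffee=2, good=1}
    }
}
```

On Hadoop, a mapper reading through FileInputFormat can typically obtain the file name from its input split, for example via `((FileSplit) context.getInputSplit()).getPath().getName()`. Note that the question's `IntSumReducer` could not be reused unchanged as the combiner here, because the per-file counts must stay separate until the reducer.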
