An input format for processing individual Path objects one at a time. It is useful for spreading jobs that are inherently single-machine per file across the entire cluster: each piece of work still runs on one machine, but at runtime it is handed to an under-utilized node, and you get Hadoop's built-in resilience to task failure.
This input format takes a set of paths and produces a separate input split for each one. For example, if you need to unzip a collection of five files in HDFS, each file has to be unzipped on a single machine, but the five files can at least be unzipped on five different machines. Using this input format lets you process each file however you wish in your mapper.
- In the main/run method of your Hadoop job driver class, add the following (a complete driver sketch appears at the end of this section):
Job job = new Job(new Configuration());
...
job.setInputFormatClass(FileSetInputFormat.class);
FileSetInputFormat.addPath("/some/path/to/a/file");
FileSetInputFormat.addPath("/some/other/path/to/a/file");
// Also see FileSetInputFormat.addAllPaths(Collection paths);
- Then, make your mapper take a Path as the key and a NullWritable as the value:
public static class MyMapper extends Mapper<Path, NullWritable, ..., ...> {
  protected void map(Path key, NullWritable value, Context context)
      throws IOException, InterruptedException {
    // Do something with the path, e.g. open it and unzip it to somewhere.
    ...
  }
}
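Picking up the unzip example from above, the mapper might open the zip at the given Path and extract its entries back into HDFS. The sketch below is one way to do that and is not part of FileSetInputFormat itself: the mapper name, the "-unzipped" destination directory, and the 4096-byte copy buffer are illustrative choices.

// Imports needed at the top of the driver file:
//   java.io.IOException, java.io.OutputStream,
//   java.util.zip.ZipEntry, java.util.zip.ZipInputStream,
//   org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path,
//   org.apache.hadoop.io.IOUtils, org.apache.hadoop.io.NullWritable,
//   org.apache.hadoop.mapreduce.Mapper
public static class UnzipMapper extends Mapper<Path, NullWritable, NullWritable, NullWritable> {
  @Override
  protected void map(Path key, NullWritable value, Context context)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // Extract /some/dir/archive.zip into /some/dir/archive.zip-unzipped/ (illustrative layout).
    Path outDir = key.suffix("-unzipped");
    try (ZipInputStream zip = new ZipInputStream(fs.open(key))) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        Path dest = new Path(outDir, entry.getName());
        try (OutputStream out = fs.create(dest)) {
          // Copy this entry's bytes into a new HDFS file without closing the zip stream itself.
          IOUtils.copyBytes(zip, out, 4096, false);
        }
      }
    }
  }
}

Because the useful output here is produced as a side effect (the unzipped files), you would normally run the job map-only and discard the framework output, as in the driver sketch at the end of this section.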
The Path keys are the same paths passed in to FileSetInputFormat.addPath and FileSetInputFormat.addAllPaths. Duplicates are stripped, and one InputSplit is generated per unique Path. That's it!
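For reference, here is a sketch of a complete map-only driver that wires the pieces together, reusing the UnzipMapper sketch above. The class name, job name, and input paths are placeholders, and it assumes addAllPaths accepts path strings in the same form addPath takes above.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
// plus the import for FileSetInputFormat from your project

public class UnzipDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration());
    job.setJarByClass(UnzipDriver.class);
    job.setJobName("unzip-file-set");

    // One input split (and therefore one map task) per path added here.
    job.setInputFormatClass(FileSetInputFormat.class);
    FileSetInputFormat.addPath("/some/path/to/a/file");

    // Or collect the paths yourself, e.g. every file in a directory,
    // and hand them over in one call (hypothetical directory shown).
    FileSystem fs = FileSystem.get(job.getConfiguration());
    List<String> paths = new ArrayList<String>();
    for (FileStatus status : fs.listStatus(new Path("/some/dir/of/zips"))) {
      if (!status.isDir()) {
        paths.add(status.getPath().toString());
      }
    }
    FileSetInputFormat.addAllPaths(paths);

    // The work happens as a side effect in the mapper, so run map-only
    // and discard the (empty) framework output.
    job.setMapperClass(UnzipMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Since one InputSplit is generated per unique Path, the number of paths you add caps the parallelism: five zips means at most five map tasks running at once.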