devsearch-concat

Concatenate source files from the DevMine source repositories.

The size of a block on HDFS is typically at least 64MB. For that reason, if we want to run large computations with Spark or Hadoop's MapReduce, we need to concatenate small files into bigger ones that are better suited to HDFS.

devsearch-concat walks through the GitHub data made available by DevMine's crawld (https://github.com/DevMine/crawld) and filters out all files that are not text or that are too large to be human-readable code. It then creates tarballs of at least 128MB from the remaining files.
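The filtering and batching steps described above can be sketched roughly as follows. This is an illustration in Python, not the tool's actual Scala implementation: the NUL-byte text heuristic, the 1MB per-file cutoff (`MAX_FILE_BYTES`), and the function names are all assumptions made for the example, since the exact criteria are not stated here.

```python
import os
from typing import Iterable, Iterator, List, Tuple

MIN_BATCH_BYTES = 128 * 1024 * 1024  # target tarball size mentioned in the text
MAX_FILE_BYTES = 1024 * 1024         # hypothetical "too large to be code" cutoff


def looks_like_text(path: str, sample_size: int = 4096) -> bool:
    """Assumed heuristic: treat a file as text if its first bytes contain no NUL."""
    with open(path, "rb") as f:
        return b"\x00" not in f.read(sample_size)


def keep_file(path: str, max_bytes: int = MAX_FILE_BYTES) -> bool:
    """Keep only files that are small enough and look like text."""
    return os.path.getsize(path) <= max_bytes and looks_like_text(path)


def batch_by_size(
    sized_paths: Iterable[Tuple[str, int]], min_bytes: int = MIN_BATCH_BYTES
) -> Iterator[List[str]]:
    """Group (path, size) pairs into batches whose total size reaches min_bytes."""
    batch: List[str] = []
    total = 0
    for path, size in sized_paths:
        batch.append(path)
        total += size
        if total >= min_bytes:
            yield batch
            batch, total = [], 0
    if batch:
        # The last batch may fall short of min_bytes.
        yield batch
```

Each batch closes as soon as it reaches the threshold, so every batch except possibly the last is at least `min_bytes` in size.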

devsearch-concat assumes a directory structure as follows:

REPO_ROOT
└── Language Folder
    └── GitHub User
        └── Repository

The repositories can either be normal directories or tar archives.
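As a rough illustration of the layout above, the following Python sketch enumerates repositories three levels below REPO_ROOT and reports whether each one is a plain directory or a tar archive. The function name `iter_repositories` and the returned tuple shape are hypothetical; the actual tool is written in Scala and may traverse the tree differently.

```python
import os
import tarfile
from typing import Iterator, Tuple


def iter_repositories(repo_root: str) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (language, user, repo_path, kind) where kind is "dir" or "tar"."""
    for language in sorted(os.listdir(repo_root)):
        lang_dir = os.path.join(repo_root, language)
        if not os.path.isdir(lang_dir):
            continue
        for user in sorted(os.listdir(lang_dir)):
            user_dir = os.path.join(lang_dir, user)
            if not os.path.isdir(user_dir):
                continue
            for repo in sorted(os.listdir(user_dir)):
                repo_path = os.path.join(user_dir, repo)
                if os.path.isdir(repo_path):
                    yield language, user, repo_path, "dir"
                elif tarfile.is_tarfile(repo_path):
                    yield language, user, repo_path, "tar"
                # Anything else at this level is ignored in this sketch.
```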

All the files' paths in the resulting tar archives are relative to REPO_ROOT.
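To make the path convention concrete, here is a minimal Python sketch that stores each file in a tar archive under its path relative to REPO_ROOT. The helper name `write_tarball` is hypothetical and the example uses Python's standard `tarfile` module purely for illustration; it does not reflect how devsearch-concat itself writes archives.

```python
import os
import tarfile
from typing import Iterable


def write_tarball(repo_root: str, files: Iterable[str], out_path: str) -> None:
    """Write files into a tar archive, naming each entry relative to repo_root."""
    with tarfile.open(out_path, "w") as tar:
        for path in files:
            # e.g. <repo_root>/Java/alice/proj/Main.java -> Java/alice/proj/Main.java
            arcname = os.path.relpath(path, repo_root)
            tar.add(path, arcname=arcname, recursive=False)
```

With this convention, the language, user, and repository of any file can be recovered from its entry name alone.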

Build & Run

> sbt assembly
> java -jar target/scala-2.10/devsearch-concat-assembly-1.0.jar [-j=<numJobs>] <REPO_ROOT> <OUTPUT_FOLDER>
