Public Git Archive

Paper (accepted to MSR'18). Presentation.

This dataset consists of two parts:

Siva files with Git repositories.
Index file in CSV format.

There are also several auxiliary datasets:

configs.tar.xz - raw git config files for each siva.
heads.csv.xz - mapping from HEAD UUID to repository name.
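
As an illustration of how the heads.csv.xz mapping might be consumed, here is a minimal Python sketch using only the standard library. It assumes the decompressed file is a plain two-column CSV (HEAD UUID first, repository name second) with no header row. That layout is inferred from the description above and is not confirmed by this README; the function name load_heads is hypothetical.

```python
import csv
import lzma


def load_heads(path):
    """Return a dict mapping HEAD UUID -> repository name.

    ASSUMPTION: each row is `uuid,repository_name` with no header row.
    Adjust the column indices if the real file is laid out differently.
    """
    mapping = {}
    # lzma.open in text mode decompresses the .xz stream on the fly.
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            mapping[row[0]] = row[1]
    return mapping
```

Calling load_heads("heads.csv.xz") would then give a dictionary that can be queried by UUID, provided the assumed layout holds.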

Since the second version of PGA, we additionally provide a derived dataset of UASTs, extracted from the files in the latest revision of each repository.

Tools

pga - explore the dataset, or download its contents easily.
pga-create - reproduce PGA dataset generation.
borges-indexer - exports a CSV file with metadata from repositories fetched with Borges.
pga2uast - extracts Babelfish UASTs from the HEADs of siva files.
list_heads - lists files in each HEAD contained in siva.

Listing and downloading

To see the full list of repositories in the dataset or download it, you will need to install pga. Simply install Go and then run go get github.com/src-d/datasets/PublicGitArchive/pga.

Then to list all of the repositories in the dataset, simply run:

pga list siva

If you'd rather get a detailed dump of the dataset (not including the file contents), use either pga list siva -f json or pga list siva -f csv.
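
The CSV dump can also be consumed from a script. The following Python sketch runs the pga list siva -f csv command shown above and parses its output. It assumes that the pga binary is on PATH and that the output starts with a header row naming the columns; the README does not list the column names, so none are hard-coded. The function names are hypothetical.

```python
import csv
import io
import subprocess


def parse_listing(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row.

    ASSUMPTION: the first line of the text is a header row.
    """
    return list(csv.DictReader(io.StringIO(csv_text)))


def list_repositories():
    """Run `pga list siva -f csv` and return the parsed rows.

    Raises if pga is missing from PATH or exits with a non-zero status.
    """
    result = subprocess.run(
        ["pga", "list", "siva", "-f", "csv"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_listing(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to test the parser on a saved dump without downloading anything.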

To download the full dataset, execute:

pga get siva

Or, if you want to download only those repositories containing at least one line of Java code:

pga get siva -l java

The pga command accepts a -j/--workers argument that sets the number of download threads to run; it defaults to 10.
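
For scripted downloads, the language filter and the worker count can be combined in one invocation. The sketch below builds the argument list for pga get siva from the -l and -j flags described above. It assumes that both flags are accepted together by the get subcommand and that pga is on PATH; build_get_command and download are hypothetical helper names, not part of pga.

```python
import subprocess


def build_get_command(language=None, workers=None):
    """Build the argument list for `pga get siva`.

    language: value passed to -l (for example "java"), or None for no filter.
    workers:  value passed to -j, or None to keep pga's default of 10.
    """
    command = ["pga", "get", "siva"]
    if language is not None:
        command += ["-l", language]
    if workers is not None:
        if workers < 1:
            raise ValueError("workers must be a positive integer")
        command += ["-j", str(workers)]
    return command


def download(language=None, workers=None):
    """Run the download; raises if pga is missing or exits with an error."""
    subprocess.run(build_get_command(language, workers), check=True)
```

For example, build_get_command("java", 20) yields the equivalent of pga get siva -l java -j 20.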

For more information, check the pga documentation, or simply run pga -h.

Reproduction

Refer to the pga-create documentation for details on how PGA is generated.

Blacklist

We understand that some GitHub projects may become private or be deleted over time. Previous dataset snapshots will continue to include such dead code. If you are the author and want to remove your project from all present and future public snapshots, please send a request to datasets@sourced.tech.
