
Identifiers

Paper (accepted to ML4P'18).

The dataset was extracted from Public Git Archive and consists of:

1. 49 million distinct identifiers - 1 GB
2. Identifiers per language - 1 GB, same processing as (1) but extracted from files in specific programming languages: Python, JavaScript, C, C++, PHP, Ruby, C#, Java, Shell, Go, Objective-C.

CSV, columns:

All the stats correspond to the HEAD revision of each repository in PGA.

Jupyter notebook which reads the per-language identifiers (2) and plots the statistics.
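
Since each CSV is around 1 GB, it can be useful to look at the header and a few rows before loading a whole file. The following is a minimal sketch using only the Python standard library; the function name `peek_csv` and the file name in the usage comment are hypothetical, and it assumes the files are plain UTF-8 CSV with a header row, which is not confirmed here.

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) without reading the whole file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        # Assumption: the first line holds the column names.
        header = next(reader, [])
        # islice stops after n_rows, so a 1 GB file is never fully loaded.
        rows = list(islice(reader, n_rows))
    return header, rows


# Hypothetical usage; replace the path with the actual downloaded file:
# header, rows = peek_csv("identifiers.csv")
# print(header)
```

Streaming through `csv.reader` keeps memory use constant, which matters more here than parsing speed.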