The dataset was extracted from Public Git Archive (PGA) and consists of:
- 49 million distinct identifiers - 1 GB
- identifiers per language - 1 GB, same processing as (1) but extracted from files in specific programming languages: Python, JavaScript, C, C++, PHP, Ruby, C#, Java, Shell, Go, Objective-C.
CSV, columns:
- num_files: number of files where the identifier was found
- num_occ: number of times the identifier was found overall
- num_repos: number of repositories in which the identifier was found
- token: the value of the identifier
- token_split: the split parts of the identifier, produced by the sourced-ml heuristics
All the stats correspond to the HEAD revision of each repository in PGA.
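The column layout above can be read with the Python standard library. The following is a minimal sketch, not the authors' tooling: it assumes the CSV has a header row with the column names listed above and that token_split separates the parts with spaces. The helper names (`read_identifiers`, `top_identifiers`) and the sample rows are hypothetical and only illustrate the format.

```python
import csv
import heapq
import io

# Count columns from the description above; the header row is an assumption.
COUNT_COLUMNS = ("num_files", "num_occ", "num_repos")


def read_identifiers(csv_file):
    """Yield one dict per CSV row, with the count columns converted to int."""
    for row in csv.DictReader(csv_file):
        for col in COUNT_COLUMNS:
            row[col] = int(row[col])
        yield row


def top_identifiers(csv_file, n=3):
    """Return the n rows with the highest num_occ, streaming over the file."""
    return heapq.nlargest(n, read_identifiers(csv_file), key=lambda r: r["num_occ"])


# Made-up sample rows for illustration only (not taken from the dataset).
SAMPLE = """num_files,num_occ,num_repos,token,token_split
10,25,3,getValue,get value
4,40,2,i,i
7,9,5,parse_json,parse json
"""

top = top_identifiers(io.StringIO(SAMPLE), n=2)
for row in top:
    # Splitting on whitespace assumes space-separated parts in token_split.
    print(row["token"], row["num_occ"], row["token_split"].split())
```

For the real files, pass an open file object instead of `io.StringIO`, for example `open(path, newline="", encoding="utf-8")`. Rows are streamed through `heapq.nlargest`, so the full 1 GB file is never held in memory at once.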
- Jupyter notebook which reads the per-language identifiers (2) and plots the statistics.