Dockerfile Packages

This repository is a dinosaur dataset that provides the vsoch/dockerfiles package matrices: for each Docker URI that still exists on Docker Hub, a feature matrix of packages (columns) by containers (rows), covering URIs extracted in 2017 and 2019. It uses the container-tree library to generate container package trees, and then exports vectors for them.

0. Get Docker Container Names

We want to derive a list of Docker containers for 2019, to supplement the original vsoch/dockerfiles container names provided in dockerfiles-02-16-2017.pkl from two years earlier. To extract the new names, the script 0.find-containers.py uses the Docker command line client's search, along with the terms in search-terms.json, to extract a listing of approximately 423,541 new containers. We save this 2019 listing to dockerfiles-01-13-2019.pkl.
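For illustration, here is a minimal sketch of that search step; 0.find-containers.py is the authoritative version, and the assumptions here are that search-terms.json holds a flat list of query strings and that we keep only the NAME column of docker search output.

import json
import pickle
import subprocess

# Load the query terms (assumed to be a flat list of strings)
with open("search-terms.json") as fd:
    terms = json.load(fd)

names = set()
for term in terms:
    # docker search prints NAME, DESCRIPTION, STARS, OFFICIAL, AUTOMATED;
    # we keep the first column. --limit caps results at 100 per query.
    result = subprocess.run(
        ["docker", "search", "--limit", "100", term],
        capture_output=True, text=True,
    )
    for line in result.stdout.splitlines()[1:]:  # skip the header row
        names.add(line.split()[0])

with open("dockerfiles-01-13-2019.pkl", "wb") as fd:
    pickle.dump(sorted(names), fd)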

1. Generate Package Matrices

In 1.extract-matrices.py we first combine the two files to produce dockerfiles-2017+2019.pkl (N=129,463 + N=423,541, with some overlap, for a total of 505,234 unique containers). We then use Container Package Trees for Apt and Pip (thanks to Google ContainerDiff!) to create package trees for each container, save the trees, and export the final package matrices, not including versions.

Note that apt-vectors.* isn't added to the repository because it exceeds the GitHub file size limit. For both apt and pip vectors, the containers that were parsed are recorded in seen-containers.pkl.
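To make the matrix layout concrete, here is a hypothetical illustration (not the repository's export code): containers are rows, packages are binary columns, and an entry is 1 when the package appears in the container's tree.

import pandas as pd

# Two hypothetical containers and the pip packages found in their trees
packages = {
    "library/python": {"pip", "setuptools", "wheel"},
    "library/ubuntu": {"pip"},
}

# One column per package seen anywhere, 1 if present in the container
columns = sorted(set().union(*packages.values()))
matrix = pd.DataFrame(
    [[int(pkg in found) for pkg in columns] for found in packages.values()],
    index=list(packages),
    columns=columns,
)
print(matrix)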

2. Package Metadata

Pip

This approach was developed because it is no longer possible to scrape any sort of metadata or Dockerfile from the updated Docker Hub website. Instead, we retrieve metadata about packages and use that to say something about the container. To do this, we need to extract metadata for pip packages from PyPI. While PyPI has an API to get descriptions and basic package information, the dependency graph must come from libraries.io. We use the script 2.package-metadata.py to do this. Make sure that you have the libraries.io API token exported into the environment before running the script:

export LIBRARIESIO_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

When this step finishes, we have created pypi-metadata.pkl and pypi-metadata.json.
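As a sketch of the two lookups (the exact fields kept by 2.package-metadata.py are an assumption here), PyPI's public JSON endpoint supplies the description while libraries.io supplies the dependencies:

import os
import requests

package = "requests"

# Description and basic metadata from the public PyPI JSON API
pypi = requests.get("https://pypi.org/pypi/%s/json" % package).json()
summary = pypi["info"]["summary"]

# Dependency graph from libraries.io (requires the token exported above;
# the "latest" version shortcut is an assumption)
response = requests.get(
    "https://libraries.io/api/pypi/%s/latest/dependencies" % package,
    params={"api_key": os.environ["LIBRARIESIO_TOKEN"]},
)
dependencies = response.json().get("dependencies", [])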

Apt

Apt was harder to do because there isn't Apt package data available on libraries.io. Thus, I created a Docker container in the docker folder that serves one simple purpose: to install a package and then use apt-cache depends to extract its dependencies. I chose a base image of Ubuntu 16.04 as an intermediate between much older Ubuntu (Debian) bases (e.g., 14.04) and much newer ones (18.04). It's not a perfect approach, but it should serve to get a good set of data.

Usage of the Docker container looks like this:

$ docker run -it vanessa/apt-package-dependencies adduser

Since some of the runs seemed to time out, I used multiprocessing to set a limit of one minute for each, running 5 at a time.
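A sketch of that driver loop is below; here the one-minute limit is enforced with a subprocess timeout inside each worker, which is one way to realize the setup described above (the package list shown is hypothetical).

import multiprocessing
import subprocess

def get_depends(package):
    # Run the helper container with a hard one-minute timeout,
    # returning None for packages that hang or fail
    try:
        result = subprocess.run(
            ["docker", "run", "--rm", "vanessa/apt-package-dependencies", package],
            capture_output=True, text=True, timeout=60,
        )
        return package, result.stdout
    except subprocess.TimeoutExpired:
        return package, None

if __name__ == "__main__":
    # Five workers at a time, matching the description above
    with multiprocessing.Pool(5) as pool:
        results = dict(pool.map(get_depends, ["adduser", "curl", "git"]))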
