Thoth Datasets

This repo contains notebooks to download and explore the datasets provided by the Thoth Team.

The datasets are available on Kaggle under Thoth Station, where you can also use them directly through Kaggle kernels. If you want to work locally, download the datasets after signing in with your Kaggle account.
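
If you prefer to script the download, the official kaggle Python package can fetch a dataset once an API token is configured; the dataset slug below is a placeholder, so use the actual slug shown on the Thoth Station Kaggle page.

from kaggle.api.kaggle_api_extended import KaggleApi

# Requires a Kaggle API token in ~/.kaggle/kaggle.json.
api = KaggleApi()
api.authenticate()

# "thoth-station/solver-datasets" is a placeholder slug; replace it with
# the actual dataset slug from Kaggle.
api.dataset_download_files("thoth-station/solver-datasets", path=".", unzip=True)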

Overview

The goal of this repo is to provide widely available datasets that are useful to data scientists. The Thoth Team, within the Office of the CTO at Red Hat, has collected datasets from the IT domain that can be made open source for training machine learning models.

Videos

Datasets

Each dataset references the source of its data and the motivation behind its creation. Moreover, each class of datasets comes with notebooks that use the data to produce example results.

Thoth Solver Datasets

Thoth Solver Datasets are based on solver reports created using the Thoth Dependency Solver, which tries to answer a simple question: what packages will be installed (resolved by pip or any Python-compliant dependency resolver) for the provided stack?
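
As a rough illustration, a solver report is a JSON document that can be inspected directly; the file name and field names below are assumptions for illustration only, so check the actual schema shipped with the dataset.

import json

# Hypothetical file name; use an actual solver report from the dataset.
with open("solver-report.json") as f:
    report = json.load(f)

# The keys below are assumptions; print(report.keys()) to discover the
# real layout of the report.
for package in report.get("result", {}).get("tree", []):
    print(package.get("package_name"), package.get("package_version"))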

Thoth Performance Datasets

Thoth Performance Datasets are created with one of the components of Thoth called Amun. Amun is a service that executes a given application stack in a requested environment: it takes the list of packages that should be installed as well as the hardware requested to run the application. This service acts as an execution engine for Thoth, where applications are built and tested using Thoth Performance Indicators (PI). More information about the Amun API can be found here. Amun can be scheduled through another Thoth component called Dependency Monkey, which was designed to automate the evaluation of certain aspects of a software stack, such as code quality or performance. This component therefore aims to automatically verify software stacks and aggregate relevant observations.

Thoth Performance Datasets contain observations on the performance of software stacks for different types of applications (e.g. machine learning) and on code quality (e.g. PIimport, which can discover errors during a run of the application).
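
As a hedged example, such reports can be aggregated into a table for analysis; the directory layout and the "rate" field below are hypothetical placeholders for whatever metric the actual reports expose.

import json
from pathlib import Path

import pandas as pd

# Hypothetical directory of performance indicator reports.
rows = []
for path in Path("performance-reports").glob("*.json"):
    report = json.loads(path.read_text())
    rows.append({
        "report": path.name,
        # Replace "rate" with the metric actually present in the report.
        "rate": report.get("result", {}).get("rate"),
    })

df = pd.DataFrame(rows)
print(df.describe())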

Thoth Security Datasets

Thoth Security Datasets contain the outputs from two Thoth Security Indicators (SI) analyzers, as well as the results aggregated from the two:

  1. SI-bandit is an analyzer for security indicators based on the bandit Python package, a tool designed to find common security issues in Python code. This package groups its tests into different classes:

    • B1xx misc tests
    • B2xx application/framework misconfiguration
    • B3xx blacklists (calls)
    • B4xx blacklists (imports)
    • B5xx cryptography
    • B6xx injection
    • B7xx XSS

    Each test in a group has two manually assigned parameters:

    • a level of SEVERITY;
    • a level of CONFIDENCE.

    (A sketch after this list shows how to read these fields from a raw bandit report.)

  2. SI-cloc is an analyzer for security indicators based on the cloc RPM package, which counts blank lines, comment lines, and physical lines of source code in many programming languages. It is important to take into account some of this package's known limitations:

    • Lines containing both source code and comments are counted as lines of code.
    • Python docstrings can serve several purposes. They may contain documentation, comment out blocks of code, or they can be regular strings (when they appear on the right-hand side of an assignment or as a function argument). cloc is unable to infer the meaning of docstrings by context; by default, cloc treats all docstrings as comments. The switch --docstring-as-source treats all docstrings as code.
    • Language definition files read with --read-lang-def or --force-lang-def must be plain ASCII text files.
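
Both analyzers emit JSON, so their raw outputs can be explored directly. Below is a minimal sketch in Python, assuming a bandit report produced with "bandit -r <project> -f json -o bandit_report.json" and a cloc report produced with "cloc --json <project> > cloc_report.json"; the SI datasets themselves may aggregate these outputs in a different layout.

import json
from collections import Counter

# Tally bandit findings by test id, severity, and confidence.
with open("bandit_report.json") as f:
    bandit_report = json.load(f)

tally = Counter(
    (r["test_id"], r["issue_severity"], r["issue_confidence"])
    for r in bandit_report["results"]
)
for (test_id, severity, confidence), count in tally.most_common():
    print(test_id, severity, confidence, count)

# Summarize cloc line counts per language, skipping metadata entries.
with open("cloc_report.json") as f:
    cloc_report = json.load(f)

for language, counts in cloc_report.items():
    if language in ("header", "SUM"):
        continue
    print(language, counts["code"], counts["comment"], counts["blank"])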

Start working on the data

After cloning the repo, follow these steps:

  1. Install micropipenv:

pip install micropipenv

  2. Create and activate a virtualenv:

python3 -m venv venv/ && . venv/bin/activate

  3. Install the dependencies provided in Pipfile and Pipfile.lock into the environment:

(venv) $ micropipenv install --dev

  4. Start using the notebooks provided, or work on your own notebook with the dataset:

(venv) $ jupyter lab

Template notebooks

Template notebooks under notebooks/templates can be reused to analyze similar data from Thoth components' reports.

Some templates require specific Thoth environment variables (e.g. the adviser templates); please ask the Thoth Team for details.

Accessing Thoth Data on the Operate-First Public Bucket

A public instance of Thoth's database is available on the Operate-First Public Bucket for external contributors to start developing components of Thoth.

The credentials to the opf-datacatalog bucket are accessible here. Make sure you have installed the AWS command line interface; after saving the public credentials in your ~/.aws/credentials file, you can copy the database locally using:

aws s3 --endpoint https://s3-openshift-storage.apps.smaug.na.operate-first.cloud cp s3://opf-datacatalog/thoth/datasets/thoth_public_database.sql .

Optionally, specify --profile if the credentials were saved under a named profile in your credentials file.
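
If you prefer Python over the AWS CLI, the same object can be fetched with boto3; this is a minimal sketch assuming the public credentials are already saved in ~/.aws/credentials.

import boto3

# boto3 picks up ~/.aws/credentials automatically; use
# boto3.session.Session(profile_name=...) if you saved the credentials
# under a named profile.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3-openshift-storage.apps.smaug.na.operate-first.cloud",
)
s3.download_file(
    "opf-datacatalog",
    "thoth/datasets/thoth_public_database.sql",
    "thoth_public_database.sql",
)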

Be careful not to store any confidential or valuable information in this bucket as its content can be wiped out at any time.

You can run the generate-dump.sh script to automatically generate a dump of the database.

How you can use the Data

You can download and use this data for free for your own purposes; all we ask are three things:

  • you cite Thoth Team as the source if you use the data;
  • you accept that you are solely responsible for how you use the data;
  • you do not sell this data to anyone; it is free!