An open source project from Data to AI Lab at MIT.
MIT-Featuretools TA2 submission for the D3M program.
- Free software: MIT license
- Development Status: Pre-Alpha
- Documentation: https://HDI-Project.github.io/mit-d3m-ta2
This repository contains the TA2 submission for the Data Driven Discovery of Models (D3M) DARPA program developed by the DAI-Lab and Featuretools teams.
mit-d3m-ta2 has been developed and tested on Python 3.6.
Although it is not strictly required, using a virtualenv is highly recommended in order to avoid interfering with other software installed on the system where mit-d3m-ta2 is run.
These are the minimum commands needed to create a virtualenv using python3.6 for mit-d3m-ta2:
pip install virtualenv
virtualenv -p $(which python3.6) mit-d3m-ta2-venv
Afterwards, you have to execute this command to have the virtualenv activated:
source mit-d3m-ta2-venv/bin/activate
Remember to execute it every time you start a new console to work on mit-d3m-ta2!
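As a quick sanity check, the following Python sketch reports whether the current interpreter seems to be running inside a virtualenv and whether it is Python 3.6. The helper names `in_virtualenv` and `is_tested_python` are hypothetical and are not part of mit-d3m-ta2.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer ones (and venv)
    # make sys.base_prefix differ from sys.prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def is_tested_python(version_info=sys.version_info):
    """Return True for Python 3.6, the version the project was tested on."""
    return tuple(version_info[:2]) == (3, 6)


print("virtualenv active:", in_virtualenv())
print("Python 3.6:", is_tested_python())
```

If the first line prints `False`, the activation command above has probably not been run in the current console.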
In order to install mit-d3m-ta2, you will have to clone the repository and check out its stable branch:
git clone git@github.com:HDI-Project/mit-d3m-ta2.git
cd mit-d3m-ta2
git checkout stable
Once done, make sure that you have created and activated your virtualenv, and then simply execute:
make install
If you want to contribute to the project, a few more steps are required to make the project ready for development.
First, please head to the GitHub page of the project and make a fork of the project under your own username by clicking on the fork button in the upper right corner of the page.
Afterwards, clone your fork and create a branch from master with a descriptive name that includes the number of the issue that you are going to work on:
git clone git@github.com:{your username}/mit-d3m-ta2.git
cd mit-d3m-ta2
git branch issue-xx-cool-new-feature master
git checkout issue-xx-cool-new-feature
Finally, install the project with the following command, which will install some additional dependencies for code linting and testing.
make install-develop
Make sure to use them regularly while developing by running the commands `make lint` and `make test`.
Additional dependencies required to execute some of the TA1 primitives have been left out from the command above in order to keep maximum compatibility with the different types of systems and avoid dependency conflicts.
Because of this, some datasets, including timeseries and image data modalities, might not work properly.
In order to make them work, install the additional dependencies and download additional files with the following commands:
sudo apt-get install $(cat system_requirements.txt)
pip install -r devel_requirements.txt
mkdir -p static
python -m d3m.index download -o static
And keep in mind the following considerations:
- The command line script `ta2` explained in the usage section below will stop working and will need to be replaced with `python -m ta2` in all the examples.
- Some red warnings might show in the command line indicating that incompatible versions have been installed. These warnings can be safely ignored, as their only consequence is the previous point.
mit-d3m-ta2 runs on datasets in the D3M Format.
You can find a collection of datasets in the D3M format in the d3m-data-dai S3 Bucket in AWS, including the corresponding `TRAIN`, `TEST` and `SCORE` partitions following the schema specification.
More datasets in newer versions of the schema can also be found in the private datasets repository.
Our TA2 system is regularly evaluated over the collection of Seed Datasets found in the private datasets repository.
As specified in the `README` file from this repository, you will need `git-lfs` in order to download all the included files.
Note that the complete collection of seed datasets is around 60 GB, so the recommended approach is to download only those parts of the repository that will be used, following the instructions in the Partial Downloading section.
Once downloaded, the local testing commands can be used by passing the path of the `seed_datasets_current` root folder to the `--input` option.
Example: --input /path/to/d3m/datasets/repo/seed_datasets_current
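To see which dataset names are available under a given input folder, a minimal Python sketch such as the following can be used. It assumes that each dataset lives in its own sub-folder named after the dataset (for example `185_baseball`); the function name `list_datasets` is hypothetical and not part of mit-d3m-ta2.

```python
import os


def list_datasets(input_path):
    """Return the sorted names of the sub-folders found in input_path.

    Assumption: every sub-folder corresponds to one dataset.
    """
    return sorted(
        name for name in os.listdir(input_path)
        if os.path.isdir(os.path.join(input_path, name))
    )
```

Calling `list_datasets("/path/to/d3m/datasets/repo/seed_datasets_current")` would then return the folder names that can be passed as positional arguments to the local testing commands.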
The following leaderboard has been built using the TA2 Standalone Mode with `2` as the maximum number of tuning iterations to perform (`budget`) and `30` as the maximum time allowed for the tuning (`timeout`):
dataset | template | cv_score | test_score | elapsed_time | tuning_iterations | data_modality | task_type |
---|---|---|---|---|---|---|---|
30_personae | gradient_boosting_classification.all_hp.yml | 0.728894 | 0.619048 | 5.93087 | 2 | single_table | classification |
57_hypothyroid | gradient_boosting_classification.all_hp.yml | 0.862681 | 0.981003 | 38.6418 | 2 | single_table | classification |
185_baseball | gradient_boosting_classification.all_hp.yml | 0.646959 | 0.675132 | 17.3313 | 2 | single_table | classification |
313_spectrometer | gradient_boosting_classification.all_hp.yml | 0.281409 | 0.304201 | 45.3676 | 2 | single_table | classification |
27_wordLevels | gradient_boosting_classification.all_hp.yml | 0.268882 | 0.288937 | 169.197 | 2 | single_table | classification |
1491_one_hundred_plants_margin | gradient_boosting_classification.all_hp.yml | 0.00957403 | 0.451364 | 114.561 | 2 | single_table | classification |
This table can also be downloaded as a CSV file.
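Once the leaderboard is available as CSV, it can be inspected with the Python standard library. The sketch below is only illustrative: it assumes the CSV is comma-separated and uses the same column names as the table, it embeds a subset of the columns copied from the rows above instead of reading the real file, and it assumes that a higher `test_score` is better, which the table itself does not state.

```python
import csv
import io

# Subset of the leaderboard columns, copied from the table above.
LEADERBOARD_CSV = """dataset,cv_score,test_score,elapsed_time
30_personae,0.728894,0.619048,5.93087
57_hypothyroid,0.862681,0.981003,38.6418
185_baseball,0.646959,0.675132,17.3313
313_spectrometer,0.281409,0.304201,45.3676
27_wordLevels,0.268882,0.288937,169.197
1491_one_hundred_plants_margin,0.00957403,0.451364,114.561
"""


def rank_by_test_score(csv_file):
    """Return the leaderboard rows sorted by test_score, highest first."""
    rows = list(csv.DictReader(csv_file))
    return sorted(rows, key=lambda row: float(row["test_score"]), reverse=True)


ranking = rank_by_test_score(io.StringIO(LEADERBOARD_CSV))
for row in ranking:
    print(row["dataset"], row["test_score"])
```

To work with a downloaded file, the `io.StringIO` object can be replaced with an open file handle.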
Two scripts are included in the repository for local testing:
The TA2 Standalone mode can be executed locally using the `ta2` command line interface.
To use this, run the `ta2 test` command passing one or more dataset names as positional arguments, as well as either a budget, `-b`, or a timeout, `-t`.
For example, in order to process the datasets `185_baseball` and `196_autoMpg` for 60 seconds each, the following command would be used:
ta2 test -t60 185_baseball 196_autoMpg
This will search for and tune the best possible pipeline for each dataset for a maximum of 60 seconds and, at the end, print a table with all the results on stdout.
Additionally, the following options can be passed:
- `-i INPUT_PATH`: Path to the folder where the datasets can be found. Defaults to `input`.
- `-o OUTPUT_PATH`: Path to the folder where the output pipelines will be saved. Defaults to `output`.
- `-b BUDGET`: Maximum number of tuning iterations to perform.
- `-t TIMEOUT`: Maximum allowed time for the tuning, in seconds.
- `-a, --all`: Process all the datasets found in the input folder.
- `-v, --verbose`: Set logs to INFO level. Use it twice to increase verbosity to DEBUG.
- `-r CSV_PATH`: Store the results in the indicated CSV file instead of printing them on stdout.
- `-s STATIC_PATH`: Path to a directory with static files required by primitives. Defaults to `static`.
For a full description of the options, execute `ta2 test --help`.
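When the command has to be launched from a script, the options above can be assembled programmatically. The following Python sketch only builds the argument list and does not run anything; the function `build_ta2_test_command` and its parameter names are hypothetical, and only the `-t`, `-b`, `-i` and `-o` flags listed above are used, written in the `-t TIMEOUT` form shown in the options list.

```python
def build_ta2_test_command(datasets, timeout=None, budget=None,
                           input_path=None, output_path=None):
    """Build the argument list for a `ta2 test` call."""
    if timeout is None and budget is None:
        raise ValueError("Either a timeout or a budget is required")

    command = ["ta2", "test"]
    if timeout is not None:
        command.extend(["-t", str(timeout)])
    if budget is not None:
        command.extend(["-b", str(budget)])
    if input_path is not None:
        command.extend(["-i", input_path])
    if output_path is not None:
        command.extend(["-o", output_path])

    # Dataset names go last, as positional arguments.
    command.extend(datasets)
    return command


command = build_ta2_test_command(["185_baseball", "196_autoMpg"], timeout=60)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` from the standard library once the virtualenv with mit-d3m-ta2 installed is active.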
The TA2-TA3 API mode can be executed using the `ta2 server` command, followed by any of the optional named arguments that are needed.
This will start a ta2 server in the background ready to serve requests from a ta3 client.
ta2 server
For a full description of the script options, execute `ta2 server --help`.
In order to test the TA2-TA3 Server, a convenience `ta3` command line interface has been included, which allows testing one or more datasets by issuing a predefined sequence of calls to the TA2-TA3 Server.
To use it, run the `ta2 ta3` command passing one or more dataset names as positional arguments, as well as any of the optional arguments.
For example, in order to process the datasets `185_baseball` and `196_autoMpg` for 60 seconds each, the following command would be used:
ta2 ta3 -t60 185_baseball 196_autoMpg
NOTE: In order to be able to execute this command, a `ta2 server` process must already be running on the same machine.
This will start sending requests to the `ta3-server` to search and tune the best possible pipeline for each dataset for a maximum of 60 seconds.
For a full description of the script options, execute `ta2 ta3 --help`.
Also remember that a TA2-TA3 Server must be running when you execute this script!
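One way to check that a server is reachable before launching the `ta3` script is to try a plain TCP connection. This is only a sketch under the assumption that the TA2-TA3 Server listens on a TCP port on the local machine; the port number is not stated here, so it has to be supplied by the caller, and the helper `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A successful connection only shows that something is listening on that port, not that it is actually the TA2-TA3 Server.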
In order to run the TA2-TA3 server from docker, you first have to build the image and execute the `run_docker.sh` script.
After that, in a different console, you can run the `ta3` script passing it the `--docker` flag to adapt the input paths accordingly:
make build
./run_docker.sh
And, in a different terminal:
ta2 ta3 --docker <OPTIONS>
The submission steps are defined here: https://datadrivendiscovery.org/wiki/display/gov/Submission+Procedure+for+TA2
In our case, the submission steps consist of:
- Execute the `make submit` command locally. This will build the docker image and push it to the gitlab registry.
- Copy the `kubernetes/ta2.yaml` file to the Jump Server and execute the validation command `/performer-toolbox/d3m_runner/d3m_runner.py --yaml-file ta2.yaml --mode ta2 --debug`
- If successful, copy the `ta2.yaml` file over to the submission repository folder and commit/push it.
For winter-2019 evaluation, the submission repository was https://gitlab.datadrivendiscovery.org/ta2-submissions/ta2-mit/may2019