This folder contains the configuration, artefacts, and scripts used to perform the experiments in the paper.
We use a combination of Ansible and bash scripts to automate the configuration and execution of the experiments on a twelve-node cluster.
The folder `ansible` contains the Ansible configuration files and playbooks.
Each algorithm's experiment scripts are located in its corresponding folder.
DISTOD and FASTOD-BID read the same data format; their datasets should be located in the `experiments/data` folder.
Since DIST-FASTOD-BID reads JSON, it reads the transformed datasets from `experiments/fastod-spark/data`.
The original datasets can be downloaded from the HPI repeatability website and should be preprocessed with the `to-json.py` script, which substitutes all values with an integer representation and transforms the datasets to header-less CSV and JSON files.
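The preprocessing step can be illustrated with a minimal Python sketch. It is not the repository's `to-json.py`: the function names, the per-column dense-rank encoding, the assumption that the input has a header row, and the one-object-per-line JSON layout are all assumptions made for illustration.

```python
import csv
import io
import json


def encode_columns(rows):
    """Replace every value with a per-column integer code.

    Codes are dense ranks of the distinct string values of each column
    (an assumption; the real script may assign codes differently).
    """
    if not rows:
        return []
    n_cols = len(rows[0])
    codebooks = []
    for c in range(n_cols):
        distinct = sorted({row[c] for row in rows})
        codebooks.append({value: code for code, value in enumerate(distinct)})
    return [[codebooks[c][row[c]] for c in range(n_cols)] for row in rows]


def to_headerless_csv(encoded):
    """Serialize the encoded rows as CSV without a header line."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(encoded)
    return out.getvalue()


def to_json_lines(header, encoded):
    """Serialize the encoded rows as one JSON object per line (assumed layout)."""
    return "".join(json.dumps(dict(zip(header, row))) + "\n" for row in encoded)


if __name__ == "__main__":
    # Tiny made-up dataset with a header row, used only for demonstration.
    raw = "name,age\nbob,30\nalice,25\nbob,25\n"
    header, *rows = list(csv.reader(io.StringIO(raw)))
    encoded = encode_columns(rows)
    print(to_headerless_csv(encoded), end="")
    print(to_json_lines(header, encoded), end="")
```

Both outputs are derived from the same encoded rows, so the CSV and JSON variants of a dataset stay consistent with each other. Note that the codes here follow the string sort order of the values, which need not match the numeric order of the original data.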
Experiments are executed from the `experiments` folder using Ansible playbooks, for example:
cd experiments
ansible-playbook -i ansible/inventory.ini ansible/fastod.yml -e 'experiment=exp1-datasets'
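If several runs need to be scripted, the same command line can be assembled programmatically. The following is a small sketch under assumptions: the helper `playbook_command` is hypothetical and not part of the repository, and it only uses the inventory path, playbook path, and `experiment` variable shown in the command above.

```python
import shlex


def playbook_command(playbook, experiment, inventory="ansible/inventory.ini"):
    """Return the argument list for one experiment run.

    The paths are relative, so the command has to be executed from
    the `experiments` folder.
    """
    return [
        "ansible-playbook",
        "-i", inventory,
        playbook,
        "-e", f"experiment={experiment}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    argv = playbook_command("ansible/fastod.yml", "exp1-datasets")
    print(shlex.join(argv))
```

To actually start a run, the returned list could be passed to `subprocess.run(argv, cwd="experiments", check=True)`. Other experiment names from the table below can be supplied in the same way, provided the algorithm behind the chosen playbook supports them.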
Once the playbook has reached the `Wait until experiment finished` step, you can safely stop the Ansible driver process (on the local machine) using Ctrl-C.
After the experiment has finished, you can obtain the results by editing the `load-results.yml` playbook so that it refers to the executed experiment and then running:
ansible-playbook -i ansible/inventory.ini ansible/load-results.yml
Experiment | DISTOD | FASTOD-BID | DIST-FASTOD-BID | Description |
---|---|---|---|---|
exp1-datasets | ✔️ | ✔️ | ✔️ | Tests each algorithm in its most powerful configuration on all datasets. |
exp2-nodes | ✔️ | (n/a) | ✔️ | Scales the number of nodes on the adult dataset. |
exp3-cost | ✔️ | ❌ | ❌ | Scales the number of cores on the hepatitis and adult datasets. |
exp4-rows | ✔️ | ❌ | ❌ | Scales the number of rows on the adult, flight, and ncvoter datasets. |
exp5-columns | ✔️ | ❌ | ❌ | Scales the number of columns on the plista dataset. |
exp6-memory | Performed manually! | ✔️ | Performed manually! | Compares the runtime of all algorithms with different heap memory limits. |
exp7-caching | ✔️ | (n/a) | (n/a) | Compares the runtimes of DISTOD with partition caching turned off or on. |
exp8-jvms | ✔️ | (n/a) | (n/a) | Runs DISTOD on different JVMs and using different GCs and settings. |
exp9-dispatchers | ✔️ | (n/a) | (n/a) | Runs the DISTOD master and the workers on different dispatcher implementations to compare their impact. |
exp10-network | ✔️ | (n/a) | ❌ | Measures the network utilization while DISTOD is running on the full cluster. Requires password-less sudo and an installed iptraf (`iptraf-ng` in `PATH`). |