Examples: Running LAC
This page provides two different kinds of examples. First, a basic example is shown where only one algorithm is run. Second, a much more advanced example is detailed where the automation framework is also used.
This first example details how to run an existing algorithm. Specifically, the algorithm used is known as CMAR. It will be configured with two personalized parameters: min_sup will be set to 0.001, and min_conf will be set to 0.9. The algorithm will be run on the weather.nominal dataset in ARFF format, which can be downloaded from the Weka repository at https://storm.cis.fordham.edu/~gweiss/data-mining/datasets.html.
executions:
  - name_algorithm: "CMAR"
    configuration:
      min_sup: 0.001
      min_conf: 0.9
    train: "weather-training.arff"
    test: "weather-test.arff"
    reports: "results/weather"
    report_type:
      - "KlassReport"
      - "ClassifierReport"
To run this configuration file, supposing that it is saved as config.yml, LAC is invoked as follows:
$ java -jar lac-0.2.0.jar config.yml
Once the algorithm has finished its execution, LAC shows additional information beyond what is contained in the report files. For instance, it shows the runtime for the training and testing phases, both separately and aggregated. It also shows the number of rules used in the classifier, as well as quality metrics for the obtained solutions. The output of this execution is shown below.
**********************************************************************************************************
Algorithm: CMAR
Dataset: weather.tennis
Runtime: 1010 ms (Building classifier 905 ms; Test phase 105 ms)
Number of rules: 3
Training accuracy: 1.0
Test accuracy: 1.0
**********************************************************************************************************
This section explains a typical experimental study for a new proposal, where many comparisons have to be made. Whereas the previous section presented a basic example, this one performs a much more complex study. The goal is not only to prove that LAC makes this possible, but to show how easy it is thanks to its configuration file and design. It is also worth reviewing how several existing tools have tried to solve this problem, and how LAC addresses all of their shortcomings.
- Using many files. This is one of the most typical approaches among existing tools, where many configuration files are used. The goal of each configuration file is to run one algorithm on only one dataset. To better illustrate this case, suppose an experimental study where 10 algorithms are being compared using 30 datasets. With at least one configuration file per algorithm and per dataset, 10 x 30 = 300 files have to be maintained. In each of these files the specific configuration of the algorithm has to be repeated; thus, if one parameter of one algorithm needs to be adapted for all the datasets, at least 30 configuration files have to be edited. The duplication across those files is very high, hampering maintainability. Last but not least, running each of these files requires launching a separate execution, since very few tools allow passing multiple configuration files and running them sequentially, let alone in parallel. In summary, this approach hampers the automation of experimental studies, or it forces the development of external tools to facilitate that automation. Basic computer science principles such as DRY (Don't Repeat Yourself) are not respected by this methodology.
- A unique configuration file. To avoid dealing with so many configuration files, some tools join everything into a single file. Although this could ease making changes, since only one big file has to be edited, it is not a solution for large experimental studies, because the file grows too much to remain readable. For instance, in the previous example of 10 algorithms on 30 datasets, supposing that 10 lines are required to configure each algorithm on each dataset, the file would have 10 x 10 x 30 = 3000 lines. Such large files are very error prone and should be avoided as much as possible. Furthermore, this methodology again provides no way of respecting principles such as DRY.
- GUI. Some tools try to solve the design of experimental studies by providing users with graphical interfaces. At first sight this approach seems like a solution, since researchers do not have to deal with a huge number of configuration files, but it is not a real one. If the experimental analyses designed with the GUI cannot be exported to configuration files, they cannot be run on servers or clusters, where no graphical interface is typically installed. And when the GUI does allow exporting, the same problem of dealing with a huge number of files arises.
LAC has been specifically designed with all these problems in mind. First of all, the configuration can be split into many files or kept in a single one; it is entirely up to the user. The DRY problem is addressed by using YAML, which makes it possible to avoid repeating common parts, facilitating maintainability. In this sense, LAC can receive from one configuration file, as in java -jar lac-0.2.0.jar config1.yml, up to as many as required, as in java -jar lac-0.2.0.jar config1.yml ... configN.yml. It also allows wildcard patterns for selecting configuration files when running LAC. For instance, suppose a directory with the following configuration files:
config.cba.dataset1.yml
config.cba.dataset2.yml
...
config.cba.datasetN.yml
config.cpa.dataset1.yml
config.cpa.dataset2.yml
...
config.cpa.datasetN.yml
config.cmar.dataset1.yml
config.cmar.dataset2.yml
...
config.cmar.datasetN.yml
If only the configuration files of CBA need to be executed, the naive approach would be to write java -jar lac-0.2.0.jar config.cba.dataset1.yml config.cba.dataset2.yml ... config.cba.datasetN.yml. With LAC this is much easier thanks to wildcard patterns: the previous command is equivalent to the short version java -jar lac-0.2.0.jar config.cba.dataset*.yml. LAC will automatically detect which configuration files match the pattern and run each one of them.
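As a concrete illustration, assuming the files listed above exist in the current directory, the following session first runs every CBA configuration and then every CMAR configuration:
$ java -jar lac-0.2.0.jar config.cba.dataset*.yml
$ java -jar lac-0.2.0.jar config.cmar.dataset*.yml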
By default, LAC runs each algorithm sequentially; that is, if a configuration file has 10 executions, the second one will not start until the first has completely finished. This behavior is the most common among existing tools. However, LAC also allows running experimental studies in parallel, which can speed up execution on servers whose hardware has enough capabilities. The level of parallelism is controlled through an environment variable. Therefore, for an experimental study that should be executed with a parallelism of 5, LAC is run as follows:
$ LAC_THREADS=5 java -jar lac-0.2.0.jar config.yml
It will create 5 independent threads and run each execution on one of those threads. The internal parallelism of each algorithm is not changed, since they were designed to run sequentially; rather, the independent executions are parallelized. LAC will not finish until all the executions, that is, all the threads, have finished. An environment variable has been used to facilitate configuration: in this way, the number of threads can be changed on each use of LAC, or it can be configured at the system level. Using environment variables for this kind of configuration is a well-known and documented practice; in fact, the Twelve-Factor App methodology standardizes this way of configuring applications.
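For instance, instead of setting the variable per command, it can be exported once for a whole shell session (a minimal sketch; the value 8 and the file names are arbitrary choices):
$ export LAC_THREADS=8
$ java -jar lac-0.2.0.jar config1.yml config2.yml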
Finally, the usefulness of YAML is worth highlighting. When a complex experimental study is being performed, it becomes much easier to understand why it is so important not to repeat configuration. Typical complex experimental studies involve running multiple algorithms on different datasets, where each algorithm usually shares its configuration across datasets. Suppose an experimental study where 10 algorithms are being compared using 30 datasets. This means that executions has 10 x 30 = 300 elements, each one being a different execution. The naive approach would be as follows:
executions:
  - name_algorithm: "CBA"
    configuration:
      min_sup: 0.001
      min_conf: 0.9
    train: "dataset1-training.arff"
    test: "dataset1-testing.arff"
  - name_algorithm: "CBA"
    configuration:
      min_sup: 0.001
      min_conf: 0.9
    train: "dataset2-training.arff"
    test: "dataset2-testing.arff"
  ...
  - name_algorithm: "CMAR"
    configuration:
      min_conf: 0.9
      delta: 4
    train: "dataset1-training.arff"
    test: "dataset1-testing.arff"
  - name_algorithm: "CMAR"
    configuration:
      min_conf: 0.9
      delta: 4
    train: "dataset2-training.arff"
    test: "dataset2-testing.arff"
  ...
In this example, two things are repeated. First, the dataset names are duplicated for each algorithm; that is, since 10 algorithms are being run, the same dataset name is repeated in 10 different places of the configuration file. If the path of one of those files changes, the same path has to be updated in 10 different places. Second, the configuration of each algorithm is also repeated; thus, if it has to be changed, 30 different places have to be edited, hampering the maintainability of the file. Traditional computer science principles such as DRY are being totally violated. However, this problem is easily solved thanks to YAML anchors and merge keys, a form of inheritance. They make it possible not to repeat ourselves within the configuration file, facilitating whatever change is required. Inheritance in YAML works as follows:
.dataset1: &dataset1
  train: "dataset1-training.arff"
  test: "dataset1-testing.arff"
.dataset2: &dataset2
  train: "dataset2-training.arff"
  test: "dataset2-testing.arff"
...
.config_cba: &config_cba
  configuration:
    min_sup: 0.001
    min_conf: 0.9
.config_cmar: &config_cmar
  configuration:
    min_conf: 0.9
    delta: 4
...
executions:
  - name_algorithm: "CBA"
    <<: *config_cba
    <<: *dataset1
  - name_algorithm: "CBA"
    <<: *config_cba
    <<: *dataset2
  ...
  - name_algorithm: "CMAR"
    <<: *config_cmar
    <<: *dataset1
  - name_algorithm: "CMAR"
    <<: *config_cmar
    <<: *dataset2
  ...
First, an alias is defined for each dataset, named datasetX, where both the train and test files are declared. This alias is included in each execution using the syntax <<: *datasetX. In this way, if the path of datasetX changed, only the alias definition would have to be edited, in one single place. Second, an alias is also declared for each algorithm's configuration, with the form config_nameAlgorithm, and it is included in every use of nameAlgorithm. In this sense, if one parameter of nameAlgorithm changed, it would only have to be changed in one single place (where the alias was defined).
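Note that repeating the <<: merge key twice inside the same mapping, as above, is tolerated by some YAML parsers but not by all of them, since it is formally a duplicate key. If the parser in use complains, the YAML merge key specification offers an equivalent form where a single key merges a sequence of aliases (a sketch of the first execution only):
executions:
  - name_algorithm: "CBA"
    <<: [*config_cba, *dataset1]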