# GP-EMaL
GP-EMaL is our method for producing interpretable trees for nonlinear dimensionality reduction (GP for NLDR). It extends an existing multi-objective method, GPMaLMO, which optimises NLDR embeddings while minimising embedding dimensionality. We replace the dimensionality objective with one that minimises tree complexity. To measure tree complexity, we surveyed existing GP tree complexity metrics in the literature, then implemented and compared the most promising ones; a metric that minimises the semantic complexity of the tree performed best. We extended this metric to be flexible, letting the user specify how semantic complexity is defined according to what counts as interpretable for their dataset, domain knowledge and methodology, and we further modified it to explicitly favour small, symmetrical trees. The result is a flexible, efficient method suited to the subjective nature of interpretability. Applied to the same datasets as the original GPMaLMO algorithm, the constructed features give similar classification accuracy with far more interpretable trees and minimal added compute cost.
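For orientation, here is a minimal sketch of that bi-objective evaluation. Every name below is a hypothetical stand-in, not the repository's actual API; the real embedding loss and complexity metric are considerably richer.

```python
# Hypothetical sketch of GP-EMaL's two minimised objectives. These toy
# stand-ins only illustrate the shape of the evaluation, not the paper's
# actual formulas.

def neighbourhood_loss(orig_neighbours, embed_neighbours):
    # Toy embedding objective: fraction of each point's nearest neighbours
    # that the embedding fails to preserve (lower is better).
    kept = sum(len(set(a) & set(b)) / len(a)
               for a, b in zip(orig_neighbours, embed_neighbours))
    return 1.0 - kept / len(orig_neighbours)

def tree_cost(ops_used, op_costs):
    # Toy complexity objective: total cost of the operators in the tree
    # (lower is better). GP-EMaL uses this objective slot in place of
    # GPMaLMO's dimensionality objective.
    return sum(op_costs.get(op, 1.0) for op in ops_used)

neighbourhood_loss([["b"], ["a"]], [["b"], ["c"]])  # -> 0.5
tree_cost(["vadd", "sigmoid"], {"vadd": 1.0, "sigmoid": 5.0})  # -> 6.0
```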
Run the following commands from the `src/` directory.
To run on `iris.data` (in the `src/datasets` directory) for 10 generations, using our functional metric as the secondary objective to minimise (alongside the neighbourhood-structure embedding loss):

```bash
make run DATASET=iris GENS=10 DIM=2 OBJ=functional DIR=datasets/
```
To clean up output files in the directory:

```bash
make clean
```
To run multiple jobs in parallel:

```bash
jobs=10
dataset=wine
gens=1000
for i in $(seq 1 "$jobs"); do
    make run DATASET=$dataset GENS=$gens OBJ=functional DIR="./datasets/" &
done
```
A main feature of GP-EMaL is the ability to specify the operations in your function set and their associated costs. These are set by the following command-line arguments (parsed with argparse):
```python
# src/gptools/util.py, lines 64-67
parser.add_argument("-fs", "--funcset", help="function set",
                    type=str, default="vadd,vsub,vmul,vdiv,max,min,relu,sigmoid,abs")
parser.add_argument("-oc", "--opcosts", help="node operation costs",
                    type=str, default="sum,sum,prod,prod,exp,exp,exp,exp,exp")
```
The function set specification is straightforward, but the cost arguments deserve a closer look. Each cost argument sets how a node's complexity scales with the complexity of its children: `sum` costs scale additively, `prod` costs multiplicatively, and `exp` costs exponentially. The defaults therefore treat simple vector arithmetic as cheap and the harder-to-interpret non-linear operators as expensive.
Note:
- Datasets used in the paper are in `datasets/`
- Add your own datasets in CSV format with a header line of the form `classPosition,#features,#classes,separator`, e.g. `classLast,1024,20,comma` (from `COIL20.data`); see the parsing sketch after this list
- Most GP parameters are configured in `gpmalmo/rundata.py`
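To make the header format concrete, here is a small, hypothetical reader for that first line; the repository's own loader may behave differently, and only `comma` is attested as a separator keyword by the example above.

```python
# Hypothetical reader for the dataset header line described above.
# The non-comma separator keywords are guesses.
SEPARATORS = {"comma": ",", "space": " ", "tab": "\t"}

def read_header(path):
    with open(path) as f:
        class_pos, n_features, n_classes, sep = f.readline().strip().split(",")
    return {
        "class_last": class_pos == "classLast",  # position of the class column
        "n_features": int(n_features),
        "n_classes": int(n_classes),
        "separator": SEPARATORS[sep],
    }

# read_header("datasets/COIL20.data")
# -> {'class_last': True, 'n_features': 1024, 'n_classes': 20, 'separator': ','}
```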