
Learnings from NNI #126

Open
nginyc opened this issue Jun 21, 2019 · 1 comment
Labels: question (Further information is requested)

nginyc commented Jun 21, 2019

As our project is similar to NNI by Microsoft, I thought it might be good to study how they're doing things, compare it with how we're doing things, and derive some learnings.

How model's hyperparameter search space is defined

  • NNI calls a set of knob values a "Configuration" and the knob config a "Search Space"
  • In NNI, the "Search Space" is defined as JSON in a separate file
  • In Rafiki, Knob Config is defined in Python with typed "Knob" classes as part of the model code
  • My opinion:
    • Rename "knob config" -> "knob space" for clarity?
    • More flexible & powerful to configure dynamically with Python
    • Edits to search space are simpler if written in Python alongside model code in the same file
    • Submitting a separate configuration file can be more troublesome
    • Using JSON is more convenient if the hyperparameter search space is to be tweaked often, independently of model code (both styles are sketched below)
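
To make the contrast concrete, here's a minimal side-by-side sketch. The NNI half mirrors the "_type"/"_value" layout of its search_space.json (written here as the equivalent Python dict); the Rafiki half assumes knob classes along the lines of FloatKnob / CategoricalKnob in rafiki.model, so treat the exact names and import path as illustrative:

    # NNI: "Search Space", normally a separate search_space.json file
    # (shown here as the equivalent Python dict)
    nni_search_space = {
        'lr': {'_type': 'loguniform', '_value': [1e-4, 1e-1]},
        'batch_size': {'_type': 'choice', '_value': [32, 64, 128]},
    }

    # Rafiki: "Knob Config", declared in Python alongside the model code
    # (knob class names and import path are illustrative)
    from rafiki.model import FloatKnob, CategoricalKnob

    def get_knob_config():
        return {
            'lr': FloatKnob(1e-4, 1e-1),
            'batch_size': CategoricalKnob([32, 64, 128]),
        }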

How model developers configure the AutoML algorithm

  • In NNI, model developers need to configure an "Experiment" in a YAML file
    • Configuration includes: which "Tuner" to use, configuration for that chosen tuner, max no. of trials, max training duration, no. of GPUs, and the platform to train on (e.g. local machine, Kubernetes)
    • A single model is trained for each experiment
  • In NNI, pointers to datasets are hard-coded in model code, and there's no concept of "task"
  • In Rafiki, application developers configure a train job by simply submitting a task, a budget, datasets and optionally model IDs in Python
    • Rafiki matches the task to a set of models, and trains these models concurrently
    • Rafiki manages provisioning of the training platform & GPUs
    • Rafiki automatically selects & configures which advisor to use based on the hyperparameter search space
  • Due to differences between designs of Rafiki and NNI:
    • In Rafiki, a non-expert application developer initiates training instead of a model developer, so configuring training should be non-technical and should, as much as possible, abstract away the complexity of model selection & tuning configuration
    • Rafiki is designed to be more end-to-end, as an ML-as-a-service
  • My opinion:
    • As with NNI, should model developers in Rafiki be able to optionally configure how their models are tuned, e.g. which advisor to use and how that advisor is configured?
      • This allows model developers to select more appropriate / empirically better AutoML algorithms for their models, but puts more burden on them
      • Maybe via another static class method (sketched below)
    • Current abstraction & definition of budget in Rafiki is appropriate
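
To illustrate the difference in who configures what, here's a rough sketch. The client call and budget keys on the application-developer side, and the get_advisor_config() hook on the model-developer side, are illustrative only (the hook is the proposed optional static method, not an existing Rafiki API):

    # Application developer: submit a train job with just a task, datasets and a budget.
    # (client method and argument names below are illustrative, not the exact Rafiki API)
    from rafiki.client import Client

    client = Client()
    client.login(email='app_dev@rafiki', password='rafiki')
    client.create_train_job(
        app='fashion_mnist_app',
        task='IMAGE_CLASSIFICATION',
        train_dataset_uri='data/fashion_mnist_train.zip',
        test_dataset_uri='data/fashion_mnist_test.zip',
        budget={'TIME_HOURS': 1, 'GPU_COUNT': 1},
    )

    # Model developer (proposed): optionally declare how the model should be tuned,
    # via another static class method next to get_knob_config(). Hypothetical hook.
    class MyModel:  # in Rafiki this would subclass the model base class
        @staticmethod
        def get_advisor_config():
            return {'advisor': 'BAYESIAN_OPT', 'num_startup_trials': 10}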

How the model interfaces with the AutoML system

  • In NNI, the AutoML system calls upon model code by simply running the main Python file (i.e. it triggers the main method); a directory of Python files is supported
  • In Rafiki, the system calls upon model code by importing a given class from a single Python file, then appropriately running methods on instances of that class
  • In NNI, model code calls upon the AutoML system by importing the nni module and calling e.g. nni.get_next_parameter() to get the hyperparameters for a trial, and nni.report_final_result(metrics) to pass a trial's final metrics back to be interpreted by the tuner
  • In Rafiki, model code imports the utils module and calls e.g. utils.dataset..., utils.logger... for helper/logging methods. The return value of e.g. evaluate(dataset) passes the final score back to the system
  • My opinion:
    • NNI's interface maximises portability of existing model code - there's no need to rewrite it into a class definition as in Rafiki
    • NNI's interface couples model code & the AutoML system more loosely
    • But Rafiki's well-defined model class gives more flexibility/power to tuning algorithms (e.g. better control flow), and is more appropriate for our design
      • Unlike NNI, Rafiki needs to support predictions, and loading & saving of model parameters
    • Consider documenting how to port existing model code to Rafiki, or brainstorming tweaks to the API to improve portability? (both interface styles are sketched below)
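
The two interface styles side by side. The nni.get_next_parameter() / nni.report_final_result() calls are NNI's documented API; the Rafiki class skeleton uses method names inferred from the description above (plus hypothetical helper functions), so treat it as a sketch rather than the exact interface:

    # --- NNI style: a plain script; the system just runs this file ---
    import nni

    def main():
        params = nni.get_next_parameter()      # hyperparameters for this trial
        model = build_and_train(params)        # hypothetical existing training code, unchanged
        acc = evaluate_model(model)            # hypothetical evaluation code
        nni.report_final_result(acc)           # final metric, interpreted by the tuner

    # --- Rafiki style: the system imports this class and drives its methods ---
    class MyModel:  # in Rafiki this subclasses a model base class; method names are a sketch
        @staticmethod
        def get_knob_config(): ...
        def train(self, dataset_uri): ...      # system passes the dataset in
        def evaluate(self, dataset_uri): ...   # return value = final score for the advisor
        def predict(self, queries): ...        # unlike NNI, predictions are supported
        def dump_parameters(self): ...         # saving & loading of parameters are supported
        def load_parameters(self, params): ...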

How the AutoML system configures the model's training behaviour

  • NNI configures a model's training behaviour only through hyperparameters and its early-stopping framework, built around the concept of an "Assessor"
    • Model code can optionally call nni.report_intermediate_result(metrics); the assessor interprets these intermediate results and kills the trial when they are poor
    • There's no explicit support or extension point for other ways to configure the model's training behaviour besides early stopping, e.g. loading shared parameters or using a downscaled model
  • In Rafiki, we're thinking of configuring a model's training behaviour with PolicyKnob(policy_name) as part of the model's knobs, so that model code can switch between different "modes" (e.g. early stop vs. don't early stop)
    • My opinion: this can support more advanced tuning strategies, e.g. we can introduce more policies in the future without changing Rafiki's code (a sketch is below)
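
A minimal sketch of how PolicyKnob(policy_name) could be used in model code, assuming the advisor resolves the policy knob to a boolean per trial and that the knob classes are importable from Rafiki's model module; knob names and the exact semantics are illustrative:

    # Assumes FloatKnob and PolicyKnob come from Rafiki's model module;
    # knob names and semantics below are illustrative.
    class MyModel:  # in Rafiki this would subclass the model base class
        @staticmethod
        def get_knob_config():
            return {
                'lr': FloatKnob(1e-4, 1e-1),
                'early_stop': PolicyKnob('EARLY_STOP'),  # training-behaviour "mode", set by the advisor
            }

        def __init__(self, **knobs):
            self._knobs = knobs

        def train(self, dataset_uri):
            max_epochs = 5 if self._knobs['early_stop'] else 100
            for epoch in range(max_epochs):
                ...  # usual training loop; could also stop on a validation-loss plateau here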

How the AutoML system supports architecture tuning

  • NNI currently only supports GA-based architecture tuning, where the model code & tuner depend on a custom graph abstraction & architecture space definition as the "Search Space"
    • This may not be general/flexible enough

  • As in the previous section, NNI won't be able to formally support an implementation of ENAS, as the tuner needs to tell the model code to load shared parameters and to switch between "train for 1 epoch" & "just evaluate on a subset of the validation dataset"

  • For Rafiki, we're thinking of representing architecture as an array of categorical values

    • More general (it's up to the model developer to define the encoding), but low-level and less "informative" for the architecture tuning algorithm
    • E.g.:
    l0 = KnobValue(0)   # Input layer as input connection
    l1 = KnobValue(1)   # Layer 1 as input connection
    l2 = KnobValue(2)   # Layer 2 as input connection
    ops = [KnobValue('conv3x3'), KnobValue('conv5x5'), KnobValue('avg_pool'), KnobValue('max_pool')]
    arch_knob = ArchKnob([
        [l0], ops, [l0], ops,                  # To form layer 1: choose input 1, op on input 1, input 2, op on input 2, then combine the post-op inputs as preferred
        [l0, l1], ops, [l0, l1], ops,          # To form layer 2, ...
        [l0, l1, l2], ops, [l0, l1, l2], ops,  # To form layer 3, ...
    ])
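
For reference, a sketch of how model code might decode one sampled architecture under this encoding, assuming the advisor returns the chosen value for each slot of arch_knob as a flat list (four slots per layer, as in the comments above):

    def decode_arch(arch_values):
        """Group the flat list into (input 1, op 1, input 2, op 2) per layer."""
        layers = []
        for i in range(0, len(arch_values), 4):
            in1, op1, in2, op2 = arch_values[i:i + 4]
            layers.append({'inputs': (in1, in2), 'ops': (op1, op2)})
        return layers

    # e.g. layer 2 applies conv5x5 to layer 1's output and avg_pool to the input layer
    print(decode_arch([0, 'conv3x3', 0, 'max_pool',
                       1, 'conv5x5', 0, 'avg_pool',
                       2, 'conv3x3', 1, 'max_pool']))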
nginyc added the question label on Jun 21, 2019
nudles (Collaborator) commented Jul 11, 2019

Thanks for the comparison. I'll list some comments (not in order):

  1. Search space: NNI has another way of defining the search space, which uses annotations. This moves the hyper-parameter definition closer to where it is used, and the Python code can run both with and without hyper-parameter tuning (a sketch of the annotation style follows this list).
  2. By making NNI a library, local development and debugging become easier, and the running flow is controlled by the model developer. Rafiki provides a platform for hyper-parameter search, so Rafiki controls the flow. Like MapReduce, the system controls the flow and the developers fill in the code of map and reduce.
  3. It would be good to decouple the system into modular components: resource management, filesystem or datastore, hyper-parameter tuning, inference queueing, etc.
  4. We may not be able to unify the architecture tuning algorithms and the hyper-parameter tuning algorithms. E.g., it is difficult even to unify the ENAS and DARTS algorithms.
  5. MLflow and Kubeflow are two other projects with a hyper-parameter tuning feature.
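
For point 1, a sketch of NNI's annotation style, from memory of NNI's docs (the exact annotation syntax may differ). The annotations are plain string literals, so without NNI the script runs unchanged with the default values; under NNI, they are rewritten so that the values come from the tuner:

    def train_and_evaluate(lr, batch_size):  # stand-in for real training code
        return 0.9

    '''@nni.variable(nni.choice(0.01, 0.1, 1.0), name=learning_rate)'''
    learning_rate = 0.1

    '''@nni.variable(nni.choice(32, 64, 128), name=batch_size)'''
    batch_size = 64

    acc = train_and_evaluate(learning_rate, batch_size)

    '''@nni.report_final_result(acc)'''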
