The goal of this repository is to show an example implementation and usage of Weights & Biases, Pytorch Lightning and Hydra, and of their interaction. This implementation is inspired by the Reproducible Deep Learning PhD course at Sapienza University (Rome).
The main design principles driving this project are:
- Modularity: each part of the training/evaluation/logging procedure is implemented as an independent block with pre-defined interactions.
- Extensibility: the framework can be easily extended to include new experiments, models, logging procedures, architectures and datasets.
- Clarity/Readability: each model contains only the data-agnostic and architecture-agnostic logic.
Main features:
- Get all the perks of a Pytorch Lightning Trainer
- Use the Weights & Biases sweep tool to easily define hyper-parameter searches and launch multiple agents across different devices.
- Easily handle complex configuration thanks to the powerful Hydra configuration management.
- No need to write any training/configuration script.
As a general concept, the framework is designed so that each piece can be maximally re-used: the same model can be used with different architectures and datasets, the same architectures and evaluation procedures can be used with different models, and the same optimization procedure can be used for different models without the need to re-write any code.
Create a new dl-kit environment using conda:
conda env create -f environment.yml
and activate it:
conda activate dl-kit
Assign a name to the device that will be running the experiments and make sure the config/device folder contains a configuration file with the same name:
export DEVICE_NAME=<DEVICE_NAME>
You can add the previous export to the .bashrc file to avoid running it every time a new session is created.
The corresponding config/device/<DEVICE_NAME>.yaml
configuration file contains device-specific information
regarding hardware and paths.
Here we report an example for a device configuration:
# Example of the content of config/device/ivi-cluster.yaml to run experiments on a SLURM cluster
gpus: 1 # Specify the number of GPUs to use for the lightning trainer
data_root: /ssdstore/datasets # Root dataset directory
experiments_root: /hddstore # Root experiment directory
download_files: False # Flag to disable dataset download (from code)
num_workers: 16 # Number of workers used for data-loading
pin_memory: True # See pin_memory flag for the pytorch DataLoader
With this setup, the same code can be used on different machines since all the hardware-dependent configuration
is grouped into the device .yaml
configuration file. Further details can be found in the device section.
For Weights & Biases logging run:
wandb init
and log in with your credentials. This step is optional since TensorBoard logging is also implemented.
The CLI for training models is based on Hydra. See this simple example and the basic grammar for parameter overrides for further information about Hydra.
To run an experiment use:
python train.py +experiment=<EXPERIMENT_NAME> <OTHER_OPTIONS>
As an example, the following command can be used to run a Variational Autoencoder on the MNIST dataset for 20 epochs with the value of the hyper-parameter beta set to 0.1.
python train.py +experiment=VAE_MNIST train_for="20 epochs" params.beta=0.1
See the Experiment Definition section for further information regarding the experimental setup.
The default logging uses Weights & Biases, but it is possible to switch to TensorBoard with
python train.py +experiment=<EXPERIMENT_NAME> logging=tensorboard
Alternative loggers can be defined in the config/logging
configuration .yaml
files.
The train.py script is defined to be compatible with wandb hyper-parameter sweeps.
Each sweep definition can directly access the properties and hyper-parameters defined in the configuration files. The following file reports an example sweep for the MNIST Variational Autoencoder experiment:
program: train.py
command:
  - ${env}
  - ${interpreter}
  - ${program}
  - +experiment=MNIST_VAE       # The experiment to launch
  - train_for="10 epochs"       # Training for 10 epochs
  - ${args_no_hyphens}          # Parameters from the sweep (params.beta in this case)
method: bayes                   # Bayesian optimization
metric:
  goal: maximize
  name: ELBO/Validation         # Metric logged and defined in config/experiment/MNIST_VAE.yaml
parameters:
  params.beta:                  # The hyper-parameter beta
    distribution: log_uniform   # will be sampled uniformly in the log space
    min: -20                    # from e^-20
    max: 2                      # to e^2
The sweep can be created by running:
wandb sweep sweeps/VAE_MNIST.yaml
which will return the corresponding <SWEEP_ID>. Agents can then be started with:
wandb agent <WANDB_USER>/<WANDB_PROJECT>/<SWEEP_ID>
The results of the corresponding sweep can be visualized on the Weights & Biases dashboard.
The repository contains a few examples of SLURM sbatch scripts that can be used to run experiments or sweeps on a SLURM cluster (such as Das5 or the ivi-cluster).
The run_sweep_2h file reports the example used to run a sweep for a VAE model on MNIST once the corresponding wandb sweep has been created.
It is possible to start an agent on the das5 cluster using the provided scripts with:
sbatch --export=DEVICE_NAME=$DEVICE_NAME,SWEEP=<SWEEP_ID> scripts/run_sweep_2h.sbatch
in which the --export
flag is used to pass variables to the script, and <SWEEP_ID> refers to the string
produced when running the wandb sweep
command (see previous section).
The configuration for each run is composed of the following main components:
- data: the data used for training the models. See the data section for further information.
- model: the model to train. Each model must implement the logic regarding the loss computation (e.g. VAE, GAN, VIB, ...) and functionalities (e.g. sample, reconstruct, classify, ...) in an architecture- and data-agnostic fashion. Each model is an instance of a Pytorch Module.
- optimization: the procedure used for optimizing the model. It defines how the model is updated by the optimizer (e.g. standard step update, adversarial training, joint training of two models, optimizer type, batch-creation procedure). Each optimization procedure is an instance of a Lightning Module.
- params: collection of the model, architecture, optimization and data hyper-parameters (e.g. number of layers, learning rate, batch size, regularization strength, ...). This design allows for easy definition of sweeps and hyper-parameter tuning.
- evaluation: dictionary containing the metrics that need to be logged and their corresponding logging frequency.
- device: definition of the hardware-specific parameters (such as paths, CPU cores, number of GPUs).
- trainer: extra parameters passed to the Lightning Trainer.
- callbacks: the callbacks called during training. Different callbacks can be used for logging, model checkpointing or early stopping. See the corresponding documentation for further details. Note that callbacks are fully optional.
- logger: definition of the logging procedure. Both TensorBoard and Weights & Biases are supported.
- run: properties of the run, such as its name and the corresponding project.
While data, model, architectures, optimization procedure, parameters, evaluation, callbacks and run are experiment-specific, device, logging and trainer define global properties: the device on which the experiments are running, the logging, and the training parameters respectively.
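To see how these components are composed for a concrete run, one can print the configuration assembled by Hydra. The following is a minimal sketch, not part of the repository; the config_path and config_name values are assumptions about the repository layout:

import hydra
from omegaconf import DictConfig, OmegaConf

# Hypothetical helper: print the configuration that Hydra composes from the
# components listed above (data, model, optimization, params, ...).
@hydra.main(config_path="config", config_name="config")
def show_config(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    show_config()

Running such a script with the same overrides used for train.py (e.g. +experiment=VAE_MNIST) would print the fully composed configuration.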
Each experiment .yaml
configuration file contains a definition of data, model, architectures, optimization procedure,
hyper-parameters and callbacks.
Here we report an example for the
Variational Autoencoder Model trained on the MNIST dataset.
# @package _global_
defaults:
  - /data: MNIST                  # On the MNIST dataset
  - /optimization: batch_ADAM     # Use batch-based optimization with the Adam optimizer
  - /model: VAE                   # For a VAE model
  - /architectures: MNIST         # With the appropriate architectures
  - /evaluation:                  # And evaluate:
      - reconstruction            # The reconstructions (on the validation set)
      - generation                # The images sampled from the model
      - elbo                      # The value of the ELBO (on train and validation)

seed: 42                          # Set a fixed seed for reproducibility

train_for: 50 epochs              # Definition of the training time in 'epochs'
                                  # 'global_steps', 'iterations', 'seconds', 'minutes', 'hours' or 'days'
                                  # are also available

run:                              # Names for the run and project
  project: VAE_experiments        # Set the name of the project (`noname` is the default)
  name: My_First_VAE              # Name of the run (used by Weights & Biases)

params:                           # Values of the hyper-parameters
  z_dim: 64                       # Number of latent dimensions
  beta: 0.5                       # Value of the regularization strength
  lr: 1e-3                        # Learning rate
  batch_size: 128                 # Batch size
  encoder_layers: [ 1024, 128 ]   # Layers of the encoder model
  decoder_layers: [ 128, 1024 ]   # Layers of the decoder model

eval_params:                      # Parameters for the evaluation (such as frequency or number of samples)
  elbo:
    every: 1 minute               # Compute the ELBO every minute
    n_samples: 2048               # using 2048 samples
  samples:
    every: 1 minute               # Visualize samples from the generative model every minute
    n_samples: 10                 # Number of visualized samples
  reconstruction:
    every: 1 minute               # Show the reconstructions every minute
    evaluate_on: valid            # using pictures from the validation set
    n_samples: 10                 # pick 10 of them
The # @package _global_ line is used to indicate that the following keys are global, while defaults specifies the values for the data, optimization, model, architectures and evaluation procedures respectively.
In other words, the dictionaries defined in the files config/data/MNIST.yaml, config/optimization/batch_ADAM.yaml, config/model/VAE.yaml, config/architectures/MNIST.yaml and config/evaluation/reconstruction.yaml (and the other evaluation metrics) are added to the respective keys.
All the configuration that is used to launch an experiment will be stored inside a hydra
folder for the
corresponding run.
Further information regarding configuration packages and overrides can be found here.
All the hyper-parameters regarding architectures and optimization are grouped under the params
dictionary.
This structure makes the configuration file extremely easy to read and understand, with the added value of making it easy
to change the value of hyper-parameters.
The parameters required by the evaluation metrics are grouped in the eval_params
dictionary,
which is responsible for defining the details of the evaluation procedure (and the logging frequency).
To summarize, each experiment defines the following:
- data: data configuration
- architectures: which set of architectures to use
- model: which model (objective, what architectures and hyper-parameters are needed)
- optimization: definition of the optimization procedure (SGD, Adversarial, others...)
- params: list of hyper-parameters (from model, architectures, data and optimization procedure)
- evaluation (optional): list of evaluation metrics to log
- eval_params (optional): list of evaluation parameters
Note that the values defined in the experiments can be altered directly from the command line. For example, by running
python train.py +experiment=MNIST_VAE params.beta=0.0001 train_for="30 minutes"
one can overwrite the value of params.beta, changing it from 0.01 to 0.0001, and set the total training time to 30 minutes.
The corresponding log is visualized below.
Further details regarding the aforementioned configuration components can be found in the following sections.
Adding new models, datasets and architectures to the framework requires implementing the code and creating the corresponding configuration files. Here we report the conventions used to define the different components:
The dataset definitions are collected in the data configuration folder. Each data object consists of a dictionary specifying the parameters for the different splits. In this example, we consider train, valid, and test for training, validation and testing purposes respectively, but different keys can be added to the data dictionary if necessary:
# Content of /config/data/MNIST.yaml. The corresponding keys are added under `data`
train:                                        # Definition of the training set
  _target_: code.data.MNIST.MNISTWrapper      # Class
  root: ${device.data_root}                   # Initialization parameters (passed to the constructor)
  split: train
  download: ${device.download_files}
valid:
  _target_: code.data.MNIST.MNISTWrapper      # Definition of the validation split
  root: ${device.data_root}
  split: valid
  download: ${device.download_files}
test:
  _target_: code.data.MNIST.MNISTWrapper      # Test split
  root: ${device.data_root}
  split: test
  download: ${device.download_files}
While reading the configuration, the variables ${NAME.ATTRIBUTE}
will be resolved
to the corresponding value.
All the device-dependent parameters (such as the data directory and the flag to enable downloading) refer to the ${device} variable. This strategy makes it easy to deploy the same model on different devices. Further details can be found in the device section.
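As a small illustration of how these interpolations behave, here is a sketch using OmegaConf directly; the paths and flags are placeholders:

from omegaconf import OmegaConf

# Sketch: interpolations such as ${device.data_root} are resolved against the
# full configuration when the value is accessed.
cfg = OmegaConf.create({
    "device": {"data_root": "/tmp/datasets", "download_files": True},
    "data": {"train": {"root": "${device.data_root}",
                       "download": "${device.download_files}"}},
})
print(cfg.data.train.root)      # -> /tmp/datasets
print(cfg.data.train.download)  # -> True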
TorchVision, TorchAudio or other existing dataset class definitions can be referenced directly by specifying
the appropriate _target_
(e.g. _target_: torchvision.datasets.MNIST
for the default torchvision MNIST dataset).
The instantiated data
dictionary is passed to the constructor of the optimization procedure,
in which the data-loaders are defined.
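As a rough sketch of what happens under the hood (the torchvision entry, paths and flags below are placeholders), Hydra's instantiate imports the class referenced by _target_ and calls it with the remaining keys:

from hydra.utils import instantiate
from omegaconf import OmegaConf

# Sketch: each `_target_` node is turned into an object by importing the class
# and passing the remaining keys to its constructor.
cfg = OmegaConf.create({
    "train": {"_target_": "torchvision.datasets.MNIST",
              "root": "/tmp/datasets", "train": True, "download": True},
})
data = {split: instantiate(node) for split, node in cfg.items()}
# The resulting `data` dictionary is what gets passed to the optimization procedure,
# which builds the corresponding DataLoaders.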
The model configuration defines the parameters and the references to the required architectures. The model code is designed to be completely data-agnostic so that the same logic can be used across different experiments without any re-writing or patching.
Here we report the example code for the VariationalAutoencoder
model:
# the VariationalAutoencoder class is a torch.nn.Module that:
# - inherits the `sample()` method from GenerativeModel,
# - inherits the `encode(x)` method from RepresentationLearningModel.
# This inheritance structure is useful to better define what models can be used for and how
class VariationalAutoencoder(GenerativeModel, RepresentationLearningModel):
    def __init__(
        self,
        encoder: ConditionalDistribution,
        decoder: ConditionalDistribution,
        prior: MarginalDistribution,
        beta: float
    ):
        '''
        Variational Autoencoder Model
        :param encoder: the encoder architecture (as a nn.Module : (Tensor) -> Distribution)
        :param decoder: the decoder architecture (as a nn.Module : (Tensor) -> Distribution)
        :param prior: architecture representing the prior (as a nn.Module : () -> Distribution)
        :param beta: trade-off between regularization and reconstruction coefficient
        '''
        super(VariationalAutoencoder, self).__init__()
        # The data-dependent architectures are passed as parameters
        self.encoder = encoder
        self.decoder = decoder
        self.prior = prior
        # Store the value of the hyper-parameter beta
        self.beta = beta

    # Definition of the procedure to compute reconstruction and regularization loss
    def compute_loss_components(self, data):
        x = data['x']
        # Encode a batch of data
        q_z_given_x = self.encoder(x)
        # Sample the representation using the re-parametrization trick
        z = q_z_given_x.rsample()
        # Compute the reconstruction distribution
        p_x_given_z = self.decoder(z)
        # The reconstruction loss is the expected negative log-likelihood of the input
        # - E[log p(X=x|Z=z)]
        rec_loss = - torch.mean(p_x_given_z.log_prob(x))
        # The regularization loss is the KL-divergence between posterior and prior
        # KL(q(Z|X=x)||p(Z)) = E[log q(Z=z|X=x) - log p(Z=z)]
        reg_loss = torch.mean(q_z_given_x.log_prob(z) - self.prior().log_prob(z))
        return {'reconstruction': rec_loss, 'regularization': reg_loss}

    # Function called by the optimization procedure to compute the loss for one batch
    def compute_loss(self, data, data_idx):
        loss_components = self.compute_loss_components(data)
        loss = loss_components['reconstruction'] + self.beta * loss_components['regularization']
        return {
            'loss': loss,                                                 # The 'loss' key is used for gradient computation
            'reconstruction': loss_components['reconstruction'].item(),  # The other keys are returned for logging purposes
            'regularization': loss_components['regularization'].item()
        }

    # Function implemented by representation learning models to define the encoding procedure
    def encode(self, x) -> Distribution:
        return self.encoder(x)

    # Function implemented by generative models to generate new samples
    def sample(self, sample_shape: torch.Size = torch.Size([]), sample_output=False) -> torch.Tensor:
        # Sample from the prior
        z = self.prior().sample(sample_shape)
        # Compute p(X|Z=z) for the given sample
        p_x_given_z = self.decoder(z)
        # Return the mean or a sample from p(X|Z=z) depending on the sample_output flag
        if sample_output:
            x = p_x_given_z.sample()
        else:
            x = p_x_given_z.mean
        return x
Modularity and re-usability are the key design principles that allow the model code to be re-used and read in a completely task-agnostic fashion. All the task-dependent code is contained in the parameters (such as encoder and decoder) that are passed to the model.
Note that the code for the Variational Autoencoder model is completely independent of the specific dataset or architectures. Therefore, the same model can be used on all datasets without any need to rewrite or change the code as long as the different data types are handled correctly by the respective encoder and decoder architectures.
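For illustration, a hypothetical composition could look as follows; only VariationalAutoencoder and Encoder correspond to classes shown in this document, the other names and values are assumptions, and in practice this wiring is performed by Hydra through the configuration shown next:

# Hypothetical sketch: the same VariationalAutoencoder logic composed with
# dataset-specific architectures (Decoder and Prior are assumed class names).
from code.architectures.MNIST import Encoder, Decoder, Prior

model = VariationalAutoencoder(
    encoder=Encoder(z_dim=64, layers=[1024, 128]),
    decoder=Decoder(z_dim=64, layers=[128, 1024]),
    prior=Prior(z_dim=64),
    beta=0.5,
)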
The corresponding configuration file defines where the required architectures and parameters are defined:
_target_: code.models.unsupervised.VariationalAutoencoder # The VAE class defined above
prior: ${architectures.prior} # pass the architecture.prior as the prior
decoder: ${architectures.decoder} # architectures.decoder as the decoder
encoder: ${architectures.encoder} # architectures.encoder as the encoder
beta: ${params.beta} # and params.beta for the regularization strength
Note that this configuration acts as a linker that defines where the different components have to be looked up in other parts of the configuration. This is useful because one can use the VAE model on completely different datasets just by swapping in and out different values for the architectures, without modifying the code defining the logic of the model.
Different architectures are implemented in the code/architectures folder. Each architecture is designed for a specific role (e.g. Encoder, Decoder, Predictor, ...); as a result, the same architecture can be used in multiple models.
Since the architecture code is data-dependent, each dataset will correspond to a different set of architectures.
Here we report the example for the VariationalAutoencoder Encoder on MNIST:
INPUT_SHAPE = [1, 28, 28]
N_INPUTS = 28*28
N_LABELS = 10
# Model for q(Z|X)
# The forward method is designed to map from a Tensor to a Distribution
class Encoder(ConditionalDistribution):
    def __init__(self, z_dim: int, layers: list):
        '''
        Encoder network used to parametrize a conditional distribution
        :param z_dim: number of dimensions for the latent distribution
        :param layers: list describing the layers
        '''
        super(Encoder, self).__init__()
        # Create a stack of layers with ReLU activations as specified
        nn_layers = make_stack([N_INPUTS] + list(layers))
        self.net = nn.Sequential(
            Flatten(),                                     # Layer to flatten the input
            *nn_layers,                                    # The previously created stack
            nn.ReLU(True),                                 # A ReLU activation
            StochasticLinear(layers[-1], z_dim, 'Normal')  # A layer that returns a factorized Normal distribution
        )

    def forward(self, x) -> Distribution:
        # Note that the encoder returns a factorized normal distribution and not a vector
        return self.net(x)
The corresponding configuration links the parameters of the encoder model to the
params
values:
# @package _global_
architectures:
  encoder:
    _target_: code.architectures.MNIST.Encoder  # Class to instantiate
    layers: ${params.encoder_layers}             # List passed as the layers definition
    z_dim: ${params.z_dim}                       # Size of latent dimensions
  # The rest of the architectures are defined below...
Note that the architecture
file defines the linking for the architectures used by any model (and not only the VAE
).
At run time only the required keys will be dynamically looked up.
The optimization procedure is designed to contain the logic regarding how the model is updated over time. This includes the definition of optimizers, data-loaders (or environments for reinforcement learning) and learning-rate schedulers. Once again, the optimization procedure is designed to be modular and as independent from the other components as possible (model-agnostic). Here we report the example for a batch-based training procedure with the Adam optimizer:
# Each optimization procedure is a pytorch lightning module
class AdamBatchOptimization(Optimization):
    def __init__(self,
                 model: Model,            # The model to optimize (as a nn.Module)
                 data: dict,              # The dictionary of Datasets defined in the previous 'Data' section
                 num_workers: int,        # Number of workers for the data_loader
                 batch_size: int,         # Batch size
                 lr: float,               # Learning rate
                 pin_memory: bool = True  # Flag to enable memory pinning
                 ):
        super(AdamBatchOptimization, self).__init__()
        # Assign the variables passed to the constructor as internal attributes
        self.model = model
        self.data = data
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.lr = lr
        self.pin_memory = pin_memory

    # Overrides the pl.LightningModule train_dataloader which is used by the Trainer
    def train_dataloader(self) -> DataLoader:
        # Return a DataLoader for the `train` dataset with the specified batch_size
        return DataLoader(self.data['train'],
                          batch_size=self.batch_size,
                          num_workers=self.num_workers,
                          shuffle=True,
                          pin_memory=self.pin_memory)

    # The training step simply returns the computation from the model
    # and logs the loss entries
    def training_step(self, data, data_idx) -> STEP_OUTPUT:
        # Compute the loss using the compute_loss function from the model
        loss_items = self.model.compute_loss(data, data_idx)
        # Log the loss components
        for name, value in loss_items.items():
            self.log('Train/%s' % name, value)
        # Increment the iteration counts.
        # The self.counters dictionary can be used to define custom counts
        # (e.g. number of adversarial/generator iterations during adversarial training)
        self.counters['iteration'] += 1
        # Add the internal counters to the log
        self.log_counters()
        return loss_items

    # Instantiate the Adam optimizer passing the model trainable parameters
    def configure_optimizers(self):
        return Adam(self.model.parameters(), lr=self.lr)
Each optimization procedure is a Pytorch Lightning module, therefore it is possible to extend all the corresponding functions for customized training/data loading.
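For example, a hypothetical subclass (not part of the repository) could extend the procedure shown above with a validation loader and a validation step:

from torch.utils.data import DataLoader

# Hypothetical extension of AdamBatchOptimization: adds a validation DataLoader
# and logs the same loss components on the validation set.
class AdamBatchOptimizationWithValidation(AdamBatchOptimization):
    def val_dataloader(self) -> DataLoader:
        return DataLoader(self.data['valid'],
                          batch_size=self.batch_size,
                          num_workers=self.num_workers,
                          shuffle=False,
                          pin_memory=self.pin_memory)

    def validation_step(self, data, data_idx):
        loss_items = self.model.compute_loss(data, data_idx)
        for name, value in loss_items.items():
            self.log('Validation/%s' % name, value)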
The corresponding configuration file simply defines the references for the parameters of the constructor, accessing either the hyper-parameters in params or the device-specific configuration in device (which depends on the number of CPU cores and the presence of a GPU):
_target_: code.optimization.batch_ADAM.AdamBatchOptimization
model: ${model} # Reference to the model
data: ${data} # And dataset(s)
lr: ${params.lr} # Pointing to the value of learning rate
batch_size: ${params.batch_size} # and batch size in `params`
num_workers: ${device.num_workers} # Looking up the number of workers
pin_memory: ${device.pin_memory} # and if memory pinning is used from the device config
The optimization.model and optimization.data components point to the model and data global keys respectively, which are defined in the data and model sections.
For easy and flexible evaluation, we defined a customized and extensible evaluation procedure using Pytorch Lightning callbacks.
A custom EvaluationCallback calls an EvaluationProcedure at pre-defined time intervals (in global_steps, epochs, seconds, minutes, hours, days or any unit defined in the counters dictionary attribute of the OptimizationProcedure). This makes it easy to define Evaluation procedures that map an instance of Optimization (containing model and data) to a LogEntry. Here we report the example for a class creating image samples for a generative model:
# The evaluation must extend the base class Evaluation
class ImageSampleEvaluation(Evaluation):
    # Constructor containing the evaluation parameters
    def __init__(self, n_pictures=10, sampling_params=None):
        self.n_pictures = n_pictures
        self.sampling_params = sampling_params if sampling_params is not None else dict()

    # Definition of the evaluation function
    def evaluate(self, optimization: pl.LightningModule) -> LogEntry:
        # Get the model from the optimization procedure
        model = optimization.model
        # And evaluate the model
        return self.evaluate_model(model)

    # Evaluation function for the model (which needs to be a GenerativeModel).
    # This means that it has a `sample()` function to generate samples
    def evaluate_model(self, model: GenerativeModel):
        # Generate the samples
        with torch.no_grad():
            x_gen = model.sample([self.n_pictures], **self.sampling_params).to('cpu')
        # Return a log-entry with the type IMAGE_ENTRY
        return LogEntry(
            data_type=IMAGE_ENTRY,                        # Type of the logged object, to be interpreted by the logger
            value=make_grid(x_gen, nrow=self.n_pictures)  # Value to log
        )
The corresponding configuration file defines the name that will be visualized in the TensorBoard/Weights & Biases log and the logging frequency as follows:
Images/Samples:                                            # The key is used to define the entry name
  evaluate_every: ${eval_params.samples.every}             # Frequency of evaluation
  evaluator:                                               # Evaluator object as defined above
    _target_: code.evaluation.image.ImageSampleEvaluation  # Class for the evaluation
    n_pictures: ${eval_params.samples.n_samples}           # Parameters passed to __init__()
The parameters for the evaluation metrics are collected in the eval_params dictionary, which is defined in the experiment file. Note that the same experiment can have multiple evaluation metrics as long as the entry names (e.g. Images/Samples) are different from each other.
Each callback in callbacks must be an instance of a Pytorch Lightning callback. Callbacks are mainly used for checkpointing or early stopping.
Here we include an implementation of a TrainDurationCallback, which allows one to define a custom training time (in iterations, global_steps, epochs, seconds, minutes, hours or days, as for the evaluation metrics). The TrainDurationCallback is added to the list of used callbacks by default and reads the train duration from the train_for variable in the configuration.
Additional callbacks can be used for other custom functions that need to act on the optimization procedure, model or its components.
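As a minimal sketch of what such a custom callback could look like (this hypothetical example is not part of the repository):

import time
import pytorch_lightning as pl

# Hypothetical callback: measures the wall-clock duration of a training run.
# Any pl.Callback hook can be used to act on the trainer or the optimization procedure.
class TrainTimerCallback(pl.Callback):
    def on_train_start(self, trainer, pl_module):
        self._start_time = time.time()

    def on_train_end(self, trainer, pl_module):
        elapsed = time.time() - self._start_time
        print('Training took %.1f seconds' % elapsed)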
The current implementation makes use of the Pytorch Lightning Checkpoint Callback with a slight adaptation for Weights & Biases. Basic checkpoint callbacks are added in the config/logging configuration.
Since, in the original Pytorch Lightning implementation, the code for logging differs across loggers, we implement an extension of the TensorBoard and Weights & Biases loggers that exposes a unified interface log(name, log_entry, global_step).
The different log_entry.data_type values are handled differently by the different loggers. Currently only scalar, scalars and image are implemented, but the wrapper can be easily extended to other data types.
Here we report the example for the Wandb Logger wrapper:
class WandbLogger(loggers.WandbLogger):
    # Custom log function designed to handle typed log-entries.
    # If defined, any additional counter in the optimization procedure is also logged.
    def log(self, name: str, log_entry: LogEntry, global_step: int = None, counters: dict = None) -> None:
        if counters is None:
            entry = {}
        else:
            entry = {k: v for k, v in counters.items()}
        entry['trainer/global_step'] = global_step

        if log_entry.data_type == SCALAR_ENTRY:
            entry[name] = log_entry.value
            self.experiment.log(entry, commit=False)
        elif log_entry.data_type == SCALARS_ENTRY:
            for sub_name, v in log_entry.value.items():
                entry['%s/%s' % (name, sub_name)] = v
            self.experiment.log(entry, commit=False)
        elif log_entry.data_type == IMAGE_ENTRY:
            entry[name] = wandb.Image(log_entry.value)
            self.experiment.log(data=entry, step=global_step, commit=False)
            plt.close(log_entry.value)
        elif log_entry.data_type == PLOT_ENTRY:
            entry[name] = log_entry.value
            self.experiment.log(data=entry, step=global_step, commit=False)
            plt.close(log_entry.value)
        else:
            raise Exception('Data type %s is not recognized by WandBLogWriter' % log_entry.data_type)
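For illustration, a call to this unified interface could look as follows; this is only a sketch (the value is a placeholder), since in the repository the evaluation callbacks invoke the logger internally:

# Hypothetical usage of the unified interface defined above
# (LogEntry and SCALAR_ENTRY are the repository's log-entry types):
logger = WandbLogger(project='VAE_experiments')
logger.log('ELBO/Validation',
           LogEntry(data_type=SCALAR_ENTRY, value=-98.7),
           global_step=1000)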
The model training is based on the Pytorch Lightning Trainer, therefore all the corresponding parameters can be accessed
and modified.
This can be done from the configuration files (such as the config/logging/wandb.yaml or the config/device/laptop.yaml files):
# @package _global_
trainer:
  checkpoint_callback: False # Disable the default model checkpoints
Or from the command line when launching the train script (e.g. enabling a fast dev run):
python train.py +experiment=MNIST_VAE +trainer.fast_dev_run=True
The device-specific configuration is defined in a separate .yaml configuration file in config/device; this includes (but is not limited to) directories and hardware-specific options.
The device configuration is selected depending on the environment variable DEVICE_NAME. As an example, if DEVICE_NAME is set to laptop, the configuration in config/device/laptop.yaml will be used.
This design allows us to define multiple devices (for deployment, training, testing) that are dynamically selected based on the local value of DEVICE_NAME. Adding a new configuration is as easy as creating a new .yaml file in the config/device folder and setting the corresponding DEVICE_NAME on the device of interest.
The run configuration object is used to define the name associated with the run (run.name) and the name of the corresponding project (run.project). These properties can be accessed and modified from the command line or by specifying them in the experiment definition.
The loading.ipynb notebook reports an example of how models can be easily retrieved directly from the Weights & Biases API together with the Hydra instantiate function.
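As a rough sketch of the idea (the run path and keys are placeholders, and it assumes the full composed configuration was stored with the run; see the notebook for the actual procedure):

import wandb
from hydra.utils import instantiate
from omegaconf import OmegaConf

# Sketch: fetch the configuration logged with a run and re-build the model from it.
api = wandb.Api()
run = api.run("<WANDB_USER>/<WANDB_PROJECT>/<RUN_ID>")
cfg = OmegaConf.create(dict(run.config))  # The Hydra configuration stored by the run
model = instantiate(cfg.model)            # Re-instantiate the model (and its architectures)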