Releases: airaria/TextBrewer
TextBrewer 0.2.1
New Features
- More flexible distillation: the student and the teacher can now be fed different batches, i.e., their batches no longer need to be identical. This can be used to distill between models with different vocabularies (e.g., from RoBERTa to BERT). See the documentation for details.
- Faster distillation: users can pre-compute and cache the teacher's outputs, then feed the cache to the distiller, saving the time of the teacher's forward passes. See the documentation for details.
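The caching idea can be sketched in plain Python: run the expensive teacher once per batch, store the result, and serve the stored output in later epochs. `CachedTeacher` and `toy_teacher` below are made-up names for illustration, not TextBrewer's API.

```python
class CachedTeacher:
    """Memoizes a teacher forward function by batch id (illustrative)."""

    def __init__(self, teacher_forward):
        self.teacher_forward = teacher_forward
        self.calls = 0          # counts real forward passes
        self._cache = {}

    def __call__(self, batch_id, batch):
        # Compute the teacher forward pass only the first time we see
        # this batch; afterwards, serve the cached output.
        if batch_id not in self._cache:
            self.calls += 1
            self._cache[batch_id] = self.teacher_forward(batch)
        return self._cache[batch_id]


def toy_teacher(batch):
    # Stand-in for an expensive model: returns per-example "logits".
    return [x * 2.0 for x in batch]


teacher = CachedTeacher(toy_teacher)
batches = {0: [1.0, 2.0], 1: [3.0, 4.0]}
for epoch in range(3):                      # 3 distillation epochs
    for batch_id, batch in batches.items():
        logits = teacher(batch_id, batch)

print(teacher.calls)  # → 2 (one real forward pass per batch, not per epoch)
```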
Improvements
- `MultiTaskDistiller` is now a subclass of `GeneralDistiller` and supports the intermediate feature matching loss.
- TensorBoard now records more detailed losses (KD loss, hard-label loss, matching losses, ...).
- `pkd_loss` now accepts tensors of shape (batch_size, length, hidden_size) or (batch_size, hidden_size). In the latter case, the loss is computed directly on the input tensors, without taking the hidden states at the first position.
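A toy sketch of this shape handling, using nested Python lists in place of tensors: a rank-3 input is reduced to the hidden vector at the first position, while a rank-2 input is used as-is. `pkd_like_loss` is a made-up, simplified stand-in, not TextBrewer's implementation.

```python
import math


def pkd_like_loss(state_s, state_t):
    """Toy PKD-style loss: mean squared distance between L2-normalized
    first-position vectors. Illustrative only."""

    def first_position(x):
        # Rank-3 (batch, length, hidden): take position 0 of each example.
        # Rank-2 (batch, hidden): use directly.
        if isinstance(x[0][0], list):
            return [example[0] for example in x]
        return x

    def normalize(v):
        norm = math.sqrt(sum(c * c for c in v)) or 1.0
        return [c / norm for c in v]

    loss = 0.0
    for vs, vt in zip(first_position(state_s), first_position(state_t)):
        vs, vt = normalize(vs), normalize(vt)
        loss += sum((a - b) ** 2 for a, b in zip(vs, vt))
    return loss / len(state_s)


# A rank-2 input and a rank-3 input whose first position holds the same
# vectors give the same loss; the extra positions are ignored.
s2 = [[1.0, 0.0], [0.0, 1.0]]
t2 = [[0.0, 1.0], [1.0, 0.0]]
s3 = [[v, [9.0, 9.0]] for v in s2]
t3 = [[v, [8.0, 8.0]] for v in t2]
print(pkd_like_loss(s2, t2) == pkd_like_loss(s3, t3))  # → True
```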
TextBrewer 0.2.0.1
Bug Fixes
- Fixed bugs in `MultiTaskDistiller`.
- Fixed the endless training loop when training for `num_steps`. Distillers now stop correctly.
TextBrewer 0.2.0
New Features
-
Now supports distributed data-parallel training with
torch.nn.parallel.DistributedDataParallel
! You can passlocal_rank
to theTrainingConfig
to setup for the distributed training. The detailed usage ofDistributedDataParallel
can be found at the PyTorch docs. -
We also added an example (Chinese NER task) to demonstrate how to use TextBrewer with distributed data-parallel training.
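A non-runnable sketch of how the launcher-provided rank might be wired in: `local_rank` and `TrainingConfig` come from the release note above; the argument-parsing pattern and everything else are assumptions, following the usual `torch.distributed.launch` convention.

```python
# Sketch only: each process launched by torch.distributed.launch
# receives a --local_rank argument, which is passed on to TextBrewer.
import argparse
from textbrewer import TrainingConfig

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

train_config = TrainingConfig(local_rank=args.local_rank)
```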
TextBrewer 0.1.10
New Features
- Now supports mixed-precision training with Apex! Just set `fp16` to `True` in `TrainingConfig`. See the documentation of `TrainingConfig` for details.
- Added a `data_parallel` option to `TrainingConfig` to enable data-parallel training within TextBrewer.
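Put together, the two options might be enabled like this non-runnable fragment; the option names come from this release note, and any other details are assumptions:

```python
# Sketch only: enable Apex mixed precision and in-library data parallelism.
from textbrewer import TrainingConfig

train_config = TrainingConfig(fp16=True, data_parallel=True)
```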
TextBrewer 0.1.9
New Features
- Added an `is_caching_logits` option to `DistillationConfig`. If `is_caching_logits` is `True`, the distiller caches the batches and the teacher model's output logits, so the logits are calculated only once. This speeds up the distillation process. The feature is only available for `BasicDistiller` and `MultiTeacherDistiller`. Be cautious about setting it to `True` on large datasets, since the batches and logits are stored in memory.
Improvements
- Added a new argument `max_grad_norm` to the distillers' `train` method. It sets the strength of gradient clipping; the default is -1, i.e., no gradient clipping.
- Added new arguments `scheduler_class` and `scheduler_args` to the distillers' `train` method. The old `scheduler` argument may cause convergence problems and is deprecated in favor of `scheduler_class` and `scheduler_args`. See the documentation for details.
- Removed the `print` call in `display_parameters`. It no longer prints the statistics directly to the screen.
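The effect of a `max_grad_norm`-style setting can be illustrated with a plain-Python global-norm clipper; this is a sketch of the technique, not TextBrewer's implementation:

```python
import math


def clip_grad_norm(grads, max_norm):
    """Global-norm gradient clipping on a flat list of gradients.
    A negative max_norm (e.g. -1) disables clipping, matching the
    release note's default. Illustrative only."""
    if max_norm < 0:          # -1 means "no clipping"
        return grads
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]


print(clip_grad_norm([3.0, 4.0], 1.0))   # scaled down to global norm 1
print(clip_grad_norm([3.0, 4.0], -1))    # → [3.0, 4.0] (unchanged)
```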
Bug Fixes
- Fixed an incorrect call to `zero_grad()`.
TextBrewer 0.1.8
Improvements
- `TrainingConfig.log_dir` can be set to `None` to disable TensorBoard.
- Added a `print_freq` attribute to the distiller to control the frequency of logging.
- Added a new argument `num_steps` to the `train` method of the distiller. If `num_steps` is specified, the distiller ignores `num_epochs` and allows a dataloader of unknown size (i.e., one without a `__len__` attribute).
- Added a new argument `batch_postprocessor` to the `train` method of the distiller to allow post-processing of batches.
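A batch postprocessor is just a callable applied to each batch after it leaves the dataloader and before it reaches the models. A toy sketch, where `truncate_batch` and the dict-based batch format are made up for illustration:

```python
def truncate_batch(batch, max_len=4):
    """Example postprocessor: trim over-long sequences in every field
    of a dict-shaped batch before the forward pass (illustrative)."""
    return {k: [seq[:max_len] for seq in v] for k, v in batch.items()}


batch = {"input_ids": [[1, 2, 3, 4, 5, 6], [7, 8]]}
print(truncate_batch(batch))  # → {'input_ids': [[1, 2, 3, 4], [7, 8]]}
```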
TextBrewer 0.1.7
This is the first release of TextBrewer.