Releases · minimaxir/aitextgen
Performance improvements for training large datasets
The `TokenDataset` is now backed by numpy, which means that aitextgen is now compatible with large (>1GB) datasets without going OOM on the host machine!
- Loading the dataset uses preallocated numpy arrays populated by tokenized minibatches, ensuring constant memory usage.
- Training also has constant memory usage on the host system, since native numpy/PyTorch interoperability avoids creating copied arrays.
- Loading the dataset now has a progress bar!
- For single texts, aitextgen uses a trick to parse the text as multiple texts (delimited by newlines), allowing multithreaded tokenization, at the minor cost of slightly different tokenization at newline boundaries. (To disable this behavior and parse the text as a single text, set `text_delim = "\r"`; see the sketch after this list.)
- Smaller file sizes when compressing TokenDatasets.
Additionally, progress bar refresh rates (for `train()` and dataset loading) now update every 10 steps by default, to avoid extra bandwidth usage when using a cloud-based notebook (e.g. Colab). As an unexpected side effect, this change results in ~25% faster training speeds when using a GPU.
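As a sketch of how the interval might be tuned, assuming a `progress_bar_refresh_rate` keyword on `train()` (the parameter name is an assumption, not confirmed by these notes):

```python
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset

ai = aitextgen()  # loads the default GPT-2 model
data = TokenDataset("input.txt")

# Refresh the progress bar every 10 steps (the new default);
# raise the interval on metered connections to save more bandwidth.
ai.train(data, num_steps=5000, progress_bar_refresh_rate=10)
```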
Breaking Changes
- Loading datasets saved with previous versions will not work. This is a side effect of the project being in beta, and not something I intend to break often.
- `shuffle`/`seed` on TokenDatasets no longer works; shuffle your data before loading the dataset (see the sketch below).
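One workaround is to shuffle the raw texts yourself before constructing the dataset; the `texts` keyword argument below is an assumption about the `TokenDataset` constructor:

```python
import random

from aitextgen.TokenDataset import TokenDataset

# Read newline-delimited texts and shuffle them up front,
# since TokenDataset no longer shuffles for you.
with open("input.txt", encoding="utf-8") as f:
    texts = f.read().splitlines()

random.seed(42)  # reproducible ordering
random.shuffle(texts)

data = TokenDataset(texts=texts)
```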
Changed generation defaults
Changed the generation defaults to `max_length=256` and `temperature=0.7` for a balance between user friendliness and speed.
Notes about the `max_length` of 1024 for GPT-2 are placed where appropriate.
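The old behavior can still be requested explicitly at generation time; a minimal sketch using `generate()` parameters (the prompt string is illustrative):

```python
from aitextgen import aitextgen

ai = aitextgen()

# Uses the new defaults: max_length=256, temperature=0.7.
ai.generate()

# Restore GPT-2's full 1024-token context and a hotter sample.
ai.generate(prompt="The meaning of life is",
            max_length=1024,
            temperature=1.0)
```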
Initial beta launch!
v0.1: Clean up outstanding files for v0.1