
Releases: minimaxir/aitextgen

Performance improvements for training large datasets

02 Jun 04:22

The TokenDataset is now backed by numpy, which means that aitextgen is now compatible with larger (>1GB) datasets without going OOM on the host machine!

  • Loading the dataset uses preallocated numpy arrays populated by tokenized minibatches, ensuring constant memory usage.
  • Training also has constant memory usage on the host system (by using native numpy/Torch integration and not creating copied arrays).
  • Loading the dataset now has a progress bar!
  • For single texts, aitextgen uses a trick to parse the text as multiple texts (delimited by newlines), allowing multithreaded tokenization, at the minor cost of slightly different tokenization at newline boundaries. (To disable this behavior and parse the text as a single text, set text_delim = "/r".)
  • Smaller file sizes when compressing TokenDatasets. A short loading/caching sketch follows this list.
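A minimal sketch of the loading and caching workflow described above. TokenDataset and text_delim come from the notes in this list; the save_cache/from_cache parameters and the dataset_cache.tar.gz filename follow the project documentation but are assumptions that may differ between versions.

```python
from aitextgen.TokenDataset import TokenDataset

# Tokenize a large plain-text file. By default the file is parsed as many
# texts split on newlines, so tokenization can run across multiple threads.
data = TokenDataset("input.txt")

# To parse the file as one single text instead, pass a delimiter that never
# appears in it (per the note above):
# data = TokenDataset("input.txt", text_delim="/r")

# Compress the tokenized dataset to disk so it can be reloaded quickly later.
# save_cache / from_cache and the default cache filename are assumptions here.
data = TokenDataset("input.txt", save_cache=True)
reloaded = TokenDataset("dataset_cache.tar.gz", from_cache=True)
```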

Additionally, progress bar refresh rates (for train() and dataset loading) now update every 10 steps by default, avoiding extra bandwidth usage when using a cloud-based notebook (e.g. Colab). Unexpectedly, this change also results in ~25% faster training speeds when using a GPU.
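A hedged sketch of how the refresh rate interacts with train(). The progress_bar_refresh_rate name follows the PyTorch Lightning convention the trainer is built on and is an assumption to verify against your installed version.

```python
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset

ai = aitextgen()                  # default 124M GPT-2 model
data = TokenDataset("input.txt")  # shows a progress bar while tokenizing

# The bar now redraws every 10 steps by default; a larger value cuts output
# bandwidth further on remote notebooks such as Colab.
# (progress_bar_refresh_rate is an assumed parameter name; check your version.)
ai.train(data, num_steps=5000, progress_bar_refresh_rate=50)
```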

Breaking Changes

  • Datasets saved with previous versions cannot be loaded. This is a side effect of being in beta, and not something I intend to break often.
  • The shuffle/seed options on TokenDatasets no longer work; shuffle the data before loading the dataset (a short sketch follows this list).
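Since TokenDataset no longer shuffles on load, one straightforward workaround is to shuffle the source texts yourself before tokenizing. A minimal sketch using only the standard library (file names are hypothetical):

```python
import random

# Shuffle the newline-delimited source texts before building the TokenDataset,
# since shuffle/seed are no longer handled at load time.
random.seed(42)

with open("input.txt", encoding="utf-8") as f:
    lines = f.readlines()

random.shuffle(lines)

with open("input_shuffled.txt", "w", encoding="utf-8") as f:
    f.writelines(lines)
```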

Change Generate defaults

17 May 21:38

Changed the generation defaults to max_length=256 and temperature=0.7 for a balance between user friendliness and speed.

Notes about the max_length limit of 1024 for GPT-2 are placed where appropriate.
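A short usage sketch of the new defaults. The generate() parameters shown (n, prompt, max_length, temperature) mirror the documented API, but double-check them against your installed version.

```python
from aitextgen import aitextgen

ai = aitextgen()

# Uses the new defaults: max_length=256, temperature=0.7
ai.generate()

# Override them explicitly; GPT-2's context window caps max_length at 1024.
ai.generate(n=3, prompt="The meaning of life is", max_length=1024, temperature=1.0)
```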

Initial beta launch!

17 May 17:26
v0.1

Clean up outstanding files for v0.1