Cheap-ML-models

A list of methods/resources/links on how to optimise training, inference and throughput of expensive ML models.


Motivation

  • Relying on scale alone to improve performance means that resource consumption grows with it, which motivates research into more efficient methods.
  • This project is an attempt to collect such methods and findings in one place.
  • Efficiency is especially important on resource-constrained devices such as smartphones and embedded systems.

Methods

  • Pruning removes unnecessary weights from the network by zeroing them out (see the pruning sketch after this list).
  • Quantisation reduces the computational cost of a model by lowering the precision used to represent its weights.
  • Distillation (teacher-student training) trains a smaller student network to reproduce the behaviour of a larger teacher network (see the distillation sketch after this list).
  • ONNX Runtime is designed with a focus on performance and scalability, in order to support heavy workloads in high-scale production scenarios.
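
As a minimal sketch of pruning, the snippet below uses PyTorch's torch.nn.utils.prune to zero out the 30% smallest-magnitude weights of a linear layer; the layer sizes and the 30% amount are arbitrary choices for illustration, not recommendations.

```python
import torch
import torch.nn.utils.prune as prune

# A toy linear layer; sizes are arbitrary, chosen only for illustration.
layer = torch.nn.Linear(128, 64)

# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent (removes the reparameterisation hooks).
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of zeroed weights: {sparsity:.2f}")  # ~0.30
```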
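And a hedged sketch of distillation: the student is trained to match the teacher's softened output distribution via KL divergence, blended with the usual hard-label loss. The temperature T=2.0 and the 0.5 mixing weight are illustrative assumptions, not prescriptions from this list.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=2.0, alpha=0.5):
    """Blend soft-target KL divergence with hard-label cross-entropy.

    T (temperature) and alpha (mixing weight) are illustrative values.
    """
    # Soft targets: teacher distribution softened by temperature T.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard T^2 scaling keeps gradient magnitudes comparable
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```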

Quantisation

  • Quantisation does not have to be applied uniformly to all parts of a model. For example, the forward and backward passes can be computed in half precision while the parameters are stored and updated in full precision (mixed-precision training; see the mixed-precision sketch after this list).
  • In neural networks, latency (the time taken to process inputs and generate outputs) is the sum of two components: data movement and arithmetic operations. Quantisation improves both: lower precision moves data through the GPU faster, and it unlocks the specialised hardware in modern GPUs that accelerates matrix multiplications.
  • However, quantising LLMs has proven significantly more challenging as they grow in size.
  • There are several different types of quantisation method:
    • Fixed-point quantisation, in which each parameter or computation is represented by a fixed number of bits.
    • Floating-point quantisation, in which some parameters or computations are kept at higher precision than others.
    • Dynamic quantisation, in which the weights are quantised ahead of time while the activations are quantised on the fly at inference, using ranges observed at runtime (quantising during training is instead called quantisation-aware training); see the dynamic-quantisation sketch after this list.
    • Post-training quantisation, in which a trained model's weights, and optionally its activation values, are quantised after training without any retraining.
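
A minimal sketch of the mixed-precision pattern described above, using PyTorch's torch.cuda.amp; the model, optimiser, and batch below are placeholders invented for illustration, and a CUDA-capable GPU is assumed.

```python
import torch

model = torch.nn.Linear(512, 10).cuda()        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()           # rescales loss to avoid fp16 underflow

x = torch.randn(32, 512, device="cuda")        # dummy batch
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                # forward pass runs in half precision
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()                  # backward on the scaled loss
scaler.step(optimizer)                         # master weights stay in fp32
scaler.update()
```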
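And as a concrete sketch of dynamic quantisation, PyTorch can convert the linear layers of a trained model to int8 in one call; the toy model below is an illustrative placeholder standing in for a trained network.

```python
import torch

# Placeholder float model standing in for a trained network.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Quantise Linear weights to int8; activations are quantised
# dynamically at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(qmodel(x).shape)  # same interface, smaller and faster linear layers
```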

Libraries

  • ggml is a tensor library for machine learning that enables large models and high performance on commodity hardware.

Articles


Blogs

