
FP8 questions... #242

Answered by ptrendx
vgoklani asked this question in Q&A

The activations that are in FP8 are 2x smaller, yes, although how much this impacts end-to-end memory usage depends on the network. In the large language model case it is typically the weights and the optimizer state that take up most of the memory.
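
To make that concrete, here is a back-of-envelope sketch of the training-memory breakdown; the 7B parameter count and the Adam mixed-precision layout are illustrative assumptions, not numbers from this thread:

```python
# Rough per-parameter memory for mixed-precision training with Adam
# (illustrative assumption: 7B-parameter model, FP32 master weights,
# two FP32 Adam moments, BF16 compute copy of the weights).
params = 7e9

weights_bf16 = params * 2       # BF16 copy used for forward/backward
master_fp32 = params * 4        # FP32 master weights
adam_moments = params * 2 * 4   # first and second moments, both in FP32
optimizer_state = master_fp32 + adam_moments

print(f"weights:         {weights_bf16 / 1e9:.0f} GB")    # ~14 GB
print(f"optimizer state: {optimizer_state / 1e9:.0f} GB")  # ~84 GB
# Activations scale with batch size and sequence length; even halving
# them with FP8 may move the total only modestly for a model like this.
```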
As for the weights, the answer is more complicated and really depends on the usage.

  • For inference:
    • In the ideal case (either by using a "standalone" inference solution like FasterTransformer, or once frameworks start natively supporting FP8 as a type), the 2x reduction in memory usage should mostly hold.
    • In the current TE case, since FP8 is not yet supported as a native type, the network still keeps the parameters in the framework-native type (lik…
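
As a minimal sketch of the current-TE point above: the layer's parameters stay in the framework-native dtype (BF16 here), and FP8 is only used for the compute inside `fp8_autocast`. The layer sizes and recipe settings are illustrative assumptions, and running this requires an FP8-capable GPU (Hopper or newer):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Illustrative recipe; HYBRID uses E4M3 for the forward pass and E5M2
# for the backward pass.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

# Parameters are created in the framework-native dtype (BF16 here);
# the sizes are arbitrary for the example.
layer = te.Linear(4096, 4096, bias=True).to(device="cuda", dtype=torch.bfloat16)
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# Only the compute inside this context uses FP8; TE keeps additional FP8
# copies of the weights internally, so memory is not simply halved.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(layer.weight.dtype)  # torch.bfloat16 -- parameters stay in the native type
```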

Answer selected by ksivaman
This discussion was converted from issue #131 on May 22, 2023 19:57.