Replies: 2 comments
-
Hi! I'm in a similar boat. I think that would make sense. The main selling points of TE are specialised kernels and FP8 support, right? Since the only computation in the embedding layer is indexing, I'm not sure TE would have any fancy kernel for it. As for FP8 support, the two aspects are storage and computation. I don't think TE supports FP8 storage, and again there isn't much computation going on. Perhaps I'm missing some advantage of TE modules?
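To illustrate what I mean about the embedding layer only doing indexing, here's a rough sketch in plain PyTorch (the sizes are just placeholders): the module forward is the same as indexing rows of the weight matrix, so there is no GEMM for a specialised kernel to accelerate.

```python
import torch

# Illustrative sizes only, not from the thread.
vocab_size, hidden_size = 32000, 1024
embedding = torch.nn.Embedding(vocab_size, hidden_size)

token_ids = torch.randint(0, vocab_size, (2, 8))  # (batch, seq_len)

out_module = embedding(token_ids)        # module forward
out_index = embedding.weight[token_ids]  # plain row lookup, same values

assert torch.equal(out_module, out_index)
```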
-
There are a few challenges with FP8 embedding layers:
-
I want to port my end-to-end transformer implementation to TE, but I'm missing the embedding layer; everything else is available here (Linear, Softmax, MHA, MLP). Is this expected, or is it out of scope? Should I use raw PyTorch embedding layers for token and position embeddings?
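Concretely, is something like the following the intended pattern? This is only a rough sketch of what I have in mind, assuming an FP8-capable GPU; the layer sizes, module choices, and recipe are placeholders, not anything taken from the TE docs.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

# Illustrative sizes only.
vocab_size, max_seq_len, hidden = 32000, 2048, 1024

# Embeddings stay as plain PyTorch modules: the forward pass is just a lookup.
tok_emb = torch.nn.Embedding(vocab_size, hidden).cuda()
pos_emb = torch.nn.Embedding(max_seq_len, hidden).cuda()

# Compute-heavy parts use TE modules (created on CUDA by default).
proj = te.Linear(hidden, hidden, bias=True)
mlp = te.LayerNormMLP(hidden, 4 * hidden)

token_ids = torch.randint(0, vocab_size, (4, 128), device="cuda")  # (batch, seq)
positions = torch.arange(128, device="cuda").unsqueeze(0)

# Embedding lookups happen outside the FP8 region.
x = tok_emb(token_ids) + pos_emb(positions)

# FP8 is only enabled for the TE modules (requires FP8-capable hardware).
fp8_recipe = DelayedScaling()
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    x = proj(x)
    x = mlp(x)
```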