
Implement T5 decoding #864

Merged
merged 7 commits into huggingface:main on Sep 15, 2023

Conversation

@jbochi (Contributor) commented on Sep 15, 2023

T5 can be used for several tasks out of the box, such as translation and summarization, as requested in #543.

Translation to German:

$ cargo run --example t5 -- --model-id "t5-small" --prompt "translate to German: A beautiful candle." --decode
Running on CPU, to run on GPU, build this example with `--features cuda`
 Eine schöne Kerze.
9 tokens generated (2.39 token/s)

Perhaps this is not the best example of summarization, but it matches the output from huggingface/transformers:

$ cargo run --example t5 -- --model-id "t5-base" --prompt "summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississipi." --decode
    Finished dev [unoptimized + debuginfo] target(s) in 0.26s
     Running `target/debug/examples/t5 --model-id t5-base --prompt 'summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississipi.' --decode`
Running on CPU, to run on GPU, build this example with `--features cuda`
 mississippi authorities dispatch emergency crews to survey damage . severe weather in mississippi has caused extensive damage .
38 tokens generated (0.33 token/s)
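
For reference, decoding here is a plain greedy (argmax) loop over the decoder. The sketch below is illustrative only: the trait and method names are made up for the example and are not the actual candle T5 API; the decoder start token (T5 uses the pad token, id 0) and EOS (id 1) are the standard T5 values.

```rust
use anyhow::Result;

/// Hypothetical interface for an encoder-decoder model such as T5.
trait Seq2SeqModel {
    /// Run the encoder once over the tokenized prompt.
    fn encode(&self, input_ids: &[u32]) -> Result<Vec<f32>>;
    /// Run the decoder over all tokens generated so far and return the
    /// logits for the next token (one entry per vocabulary item).
    fn decode(&self, decoder_ids: &[u32], encoder_out: &[f32]) -> Result<Vec<f32>>;
}

/// Greedy (temperature zero) decoding: always pick the highest-scoring token.
fn greedy_decode(model: &impl Seq2SeqModel, input_ids: &[u32], max_len: usize) -> Result<Vec<u32>> {
    const DECODER_START_TOKEN_ID: u32 = 0; // T5 starts decoding from the pad token.
    const EOS_TOKEN_ID: u32 = 1;

    let encoder_out = model.encode(input_ids)?;
    let mut decoder_ids = vec![DECODER_START_TOKEN_ID];
    for _ in 0..max_len {
        let logits = model.decode(&decoder_ids, &encoder_out)?;
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32)
            .unwrap();
        if next == EOS_TOKEN_ID {
            break;
        }
        decoder_ids.push(next);
    }
    Ok(decoder_ids[1..].to_vec())
}
```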

I have also compared the last hidden state against the output of the torch-based implementation.

This is terribly slow for larger models because I didn't implement any optimizations (see the sketch after the list):

  • The encoder output can be fully cached after the first pass
  • The decoder output of past tokens can also be cached
  • KV cache
  • flash attention
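
As an illustration of the KV-cache item above, here is a minimal per-layer cache sketch. It assumes keys/values shaped (batch, num_heads, seq_len, head_dim) and uses candle's `Tensor::cat`; the struct and method names are purely illustrative, not what this PR implements.

```rust
use candle_core::{Result, Tensor};

/// Per-layer cache of the keys/values of previously decoded tokens.
#[derive(Default)]
struct KvCache {
    k: Option<Tensor>,
    v: Option<Tensor>,
}

impl KvCache {
    /// Append this step's keys/values and return the full cached tensors,
    /// so attention only has to project the newly generated token.
    fn append(&mut self, k: &Tensor, v: &Tensor) -> Result<(Tensor, Tensor)> {
        let k = match &self.k {
            Some(prev) => Tensor::cat(&[prev, k], 2)?, // concat along seq_len
            None => k.clone(),
        };
        let v = match &self.v {
            Some(prev) => Tensor::cat(&[prev, v], 2)?,
            None => v.clone(),
        };
        self.k = Some(k.clone());
        self.v = Some(v.clone());
        Ok((k, v))
    }

    /// Must be called between prompts, otherwise stale entries leak into
    /// the next generation.
    fn reset(&mut self) {
        self.k = None;
        self.v = None;
    }
}
```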

@LaurentMazare merged commit 3e49f8f into huggingface:main on Sep 15, 2023
@LaurentMazare (Collaborator) commented:

Looks great, thanks for adding this!

@LaurentMazare (Collaborator) commented:

Just to mention that caching of the encoder output and a decoder KV cache have just been added, and they speed things up quite significantly. Thanks again @jbochi for adding this, looking forward to more models being added!
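
To illustrate where the speedup comes from, this is roughly what the cached loop looks like: the encoder runs once, and each decoder step is fed only the newest token because the KV cache already holds the keys/values of earlier positions. The trait and method names below are hypothetical, not the actual candle API.

```rust
use anyhow::Result;

/// Hypothetical interface for a model that keeps an internal KV cache.
trait CachedSeq2SeqModel {
    fn encode(&self, input_ids: &[u32]) -> Result<Vec<f32>>;
    /// Decode a single new token; keys/values of previous tokens come from
    /// the model's internal KV cache.
    fn decode_next(&mut self, last_token: u32, encoder_out: &[f32]) -> Result<Vec<f32>>;
    /// Clear the KV cache before starting a new prompt.
    fn reset_cache(&mut self);
}

fn greedy_decode_cached(
    model: &mut impl CachedSeq2SeqModel,
    input_ids: &[u32],
    max_len: usize,
) -> Result<Vec<u32>> {
    const DECODER_START_TOKEN_ID: u32 = 0;
    const EOS_TOKEN_ID: u32 = 1;

    model.reset_cache();
    let encoder_out = model.encode(input_ids)?; // run the encoder once, reuse every step
    let mut out = vec![DECODER_START_TOKEN_ID];
    for _ in 0..max_len {
        // Only the most recent token is fed in; the cache covers the rest.
        let logits = model.decode_next(*out.last().unwrap(), &encoder_out)?;
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32)
            .unwrap();
        if next == EOS_TOKEN_ID {
            break;
        }
        out.push(next);
    }
    Ok(out[1..].to_vec())
}
```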

@jbochi (Contributor, Author) commented on Sep 17, 2023 via email

@jbochi (Contributor, Author) commented on Sep 18, 2023

Something seems off with the cache. With temperature zero and no cache, for the prompt "summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississipi." I get:

 mississippi authorities dispatch emergency crews to survey damage . severe weather in mississippi has caused extensive damage .

With the cache, I get:

 mississippipi authorities dispatchesipi state emergency crews are dispatched to survey the damage after an ons severe weather forecasters are called to survey the damage tued the damage tues .

The first difference is in "mississippipi".

Edit: I opened #892 to add the option of disabling the cache.
