Falcon port #24523
Conversation
Thanks for working on this! Some initial comments.
Update: Slightly delayed because there are some breaking architecture changes between the different Falcon checkpoints - I'm merging the various layers and using config variables to switch between the behaviours.
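For a rough sense of the config-switch approach, something like the following (the flag names match the final configuration_falcon.py, but the branching shown is an illustrative sketch, not the merged implementation):

```python
# Illustrative sketch: checkpoint variants are selected via config flags rather
# than separate modeling classes. The flag names are real config fields; the
# branching below is a simplification for illustration only.
from transformers import FalconConfig

config_7b_style = FalconConfig(multi_query=True, parallel_attn=True, new_decoder_architecture=False)
config_40b_style = FalconConfig(new_decoder_architecture=True)

def describe_layout(config: FalconConfig) -> str:
    if config.new_decoder_architecture:
        return "40B-style decoder block (grouped KV heads, parallel attention)"
    if config.multi_query:
        return "7B-style decoder block (multi-query attention)"
    return "original RW-style decoder block"
```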
feel free to ping me for a review anytime!
Thanks a lot for working on this! Already clean 🤗
- Tokenizer: it's missing from the auto mapping, and I have no idea which tokenizer is used or how it will be converted to a fast tokenizer. If it was trained with `tokenizers` and has no dedicated class, this should be specified somewhere.
- Tests: let's improve the integration tests. Not every architecture variant that can be enabled by the config is tested, so that should be a bit clearer, and we need some bfloat16 conversion / generation tests (see the sketch below). I don't have the issue numbers at hand, but let's link the related issues and mark them as resolved, since we received several that are linked to these models!
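As a rough sketch of the kind of bfloat16 generation check being asked for (the checkpoint name and prompt are examples, not part of the PR's test suite):

```python
import torch
from transformers import AutoTokenizer, FalconForCausalLM

# Example checkpoint; any Falcon checkpoint compatible with the ported code would do.
checkpoint = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = FalconForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

inputs = tokenizer("My favorite food is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```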
expected_output = (
    "My favorite food is pizza. I love it so much that I have a pizza party every year for my birthday."
)
Suggested change:
-expected_output = (
+EXPECTED_OUTPUT = (
     "My favorite food is pizza. I love it so much that I have a pizza party every year for my birthday."
 )
Also, can we do a batch with padding to make sure rotary works as expected (see the sketch below)? Plus let's test both the small and the big checkpoints.
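A sketch of what such a batched, padded check might look like (checkpoint name and prompts are placeholders):

```python
import torch
from transformers import AutoTokenizer, FalconForCausalLM

checkpoint = "tiiuae/falcon-7b"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so generation continues from real tokens

model = FalconForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

prompts = ["My favorite food is", "Once upon a time there was a"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
for text in tokenizer.batch_decode(output_ids, skip_special_tokens=True):
    print(text)
```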
Will see what I can add!
Hi @Rocketknight1
best
Hey all! The main modeling code should be ready for final review now. Thanks @ArthurZucker for the comprehensive review - it was really helpful! There's one bug left that's causing a failing test, but I think it's a one-line fix that I can track down tomorrow. This may also be the issue that's causing assisted generation to fail, but those tests are currently skipped. I also need to figure out porting the tokenizer, and then once this is merged I'll need to prepare the repos to transition over to the library code. cc @amyeroberts for core maintainer review!
Very nice 🤗 Thanks for all the work porting this model!
Mostly just small comments, mainly missing or incorrect model docstrings that need to be added or fixed.
self.word_embeddings = nn.Embedding(config.vocab_size, self.embed_dim)

# Transformer blocks
self.h = nn.ModuleList([FalconDecoderLayer(config) for _ in range(config.num_hidden_layers)])
    self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
    model = FalconModel(config=config)
    model.to(torch_device)
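The fragment above is cut off; a typical completion inside the model tester class, in the usual transformers ModelTester style, might look like this (the method name and assertions are assumptions, not the PR's actual test code):

```python
def create_and_check_model(
    self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
):
    # Hypothetical completion: run a forward pass and check the output shape.
    model = FalconModel(config=config)
    model.to(torch_device)
    model.eval()
    result = model(input_ids, attention_mask=input_mask)
    self.parent.assertEqual(
        result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)
    )
```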
Would love to enable and test torch_xla as a target device backend on Falcon. Happy to do it as a follow-up PR or as part of this PR. Any suggestions?
cc @JackCaoG
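For reference, a minimal sketch of running the ported model on a torch_xla device (assumes torch_xla is installed; the checkpoint is an example):

```python
import torch_xla.core.xla_model as xm
from transformers import AutoTokenizer, FalconForCausalLM

device = xm.xla_device()  # e.g. a TPU core
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = FalconForCausalLM.from_pretrained("tiiuae/falcon-7b").to(device)

inputs = tokenizer("Hello world", return_tensors="pt").to(device)
outputs = model(**inputs)
xm.mark_step()  # flush the lazily-built XLA graph
print(outputs.logits.shape)
```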
@miladm sounds great, but I think it should definitely go in a follow-up PR after this is merged! We want to make sure the initial launch and the transition from remote_code checkpoints to in-library checkpoints goes smoothly, and then we can start adding features like that.
@miladm The PR is now merged - feel free to start on any follow-up PR!
Thanks for all the work porting this!
Just small comments on my side - mainly that the config docstring doesn't match the config init. I also saw at least one comment from another reviewer still outstanding in the PR - so these should be resolved too before merging.
Only outstanding questions I have are:
- Have we loaded the remote model and this port and checked their outputs are equivalent?
- What is the default model code used when calling e.g. AutoModel.from_pretrained(hub_checkpoint)? I'm assuming the one on this branch?
vocab_size (`int`, *optional*, defaults to 65024):
    Vocabulary size of the Falcon model. Defines the number of different tokens that can be represented by the
    `inputs_ids` passed when calling [`FalconModel`]
hidden_size (`int`, *optional*, defaults to 64):
The docstring doesn't match the default values or some of the variable names.
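For illustration, the corrected entries would follow the usual config-docstring style, with defaults matching the actual `__init__` signature (the 4544 default below is taken from the released configuration and should be double-checked):

```python
    vocab_size (`int`, *optional*, defaults to 65024):
        Vocabulary size of the Falcon model. Defines the number of different tokens that can be represented by
        the `inputs_ids` passed when calling [`FalconModel`].
    hidden_size (`int`, *optional*, defaults to 4544):
        Dimension of the hidden representations.
```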
Fixed!
new_decoder_architecture=False,
multi_query=True,
parallel_attn=True,
From a quick search, I think `new_decoder_architecture` and `parallel_attn` are new params, so they need documentation.
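A sketch of the requested documentation, in the config-docstring style (the wording is illustrative, not the merged text):

```python
    new_decoder_architecture (`bool`, *optional*, defaults to `False`):
        Whether to use the new (Falcon-40B-style) decoder architecture. If `True`, the `multi_query` and
        `parallel_attn` arguments are ignored, as the new decoder always uses parallel attention.
    parallel_attn (`bool`, *optional*, defaults to `True`):
        Whether to compute attention in parallel with the feedforward layer. If `False`, they are computed
        consecutively, as in the original Transformer architecture.
```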
Documented!
self.sin_cached: torch.Tensor | None = None

def cos_sin(self, seq_len: int, device="cpu", dtype=torch.bfloat16) -> torch.Tensor:
    if seq_len != self.seq_len_cached:
You're completely right - my mind slipped. I was thinking about NaNs, not Nones 🤦‍♀️ Ironically, that's what was in my head at the time.
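For context, a simplified sketch of the cos/sin caching that the fragment above belongs to (an illustration, not the exact merged implementation):

```python
import torch

class RotaryEmbeddingSketch(torch.nn.Module):
    """Simplified illustration of cached rotary cos/sin tables."""

    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.seq_len_cached = -1
        self.cos_cached = None
        self.sin_cached = None

    def cos_sin(self, seq_len: int, device="cpu", dtype=torch.bfloat16):
        # Recompute the cached tables only when the sequence length changes.
        if seq_len != self.seq_len_cached:
            t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
            freqs = torch.einsum("i,j->ij", t, self.inv_freq.to(device))
            emb = torch.cat((freqs, freqs), dim=-1)
            self.cos_cached = emb.cos().to(dtype)
            self.sin_cached = emb.sin().to(dtype)
            self.seq_len_cached = seq_len
        return self.cos_cached, self.sin_cached
```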
[`PreTrainedTokenizer.__call__`] for details.

[What are input IDs?](../glossary#input-ids)
past_key_values (`Tuple[Tuple[torch.Tensor]]` of length `config.n_layers`):
I don't think there is an `n_layers` attribute?
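Presumably the docstring should reference the attribute that does exist, e.g. (sketch only):

```python
    past_key_values (`Tuple[Tuple[torch.Tensor]]` of length `config.num_hidden_layers`):
        Contains precomputed key and value hidden states of the attention blocks, which can be used to speed up
        sequential decoding.
```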
Fixed!
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Initial commit
* Update src/transformers/models/falcon/configuration_falcon.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/models/falcon/configuration_falcon.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Cleanup config docstring
* Update src/transformers/models/falcon/configuration_falcon.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Convert to relative imports
* Remove torch < 1.8 warning
* Restructure cos_sin header
* qkv -> query, key, value
* Refactor attention calculation
* Add a couple of config variables to account for the different checkpoints
* Successful merging of the code paths!
* Fix misplaced line in the non-parallel attention path
* Update config and tests
* Add a pad_token_id when testing
* Support output_attentions when alibi is None
* make fixup
* Skip KV cache shape test
* No more _keys_to_ignore_on_load_missing
* Simplify self attention a bit
* Simplify self attention a bit
* make fixup
* stash commit
* Some more attention mask updates
* Should pass all tests except assisted generation!
* Add big model generation test
* make fixup
* Add temporary workaround for test
* Test overrides for assisted generation
* Update src/transformers/models/falcon/modeling_falcon.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/falcon/modeling_falcon.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/falcon/modeling_falcon.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update tests/models/falcon/test_modeling_falcon.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Test overrides for assisted generation
* Add generation demo
* Update copyright
* Make the docstring model actually small
* Add module-level docstring
* Remove all assertions
* Add copied from bloom
* Reformat the QKV layer
* Add copied from bloom
* Update src/transformers/models/falcon/modeling_falcon.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Remove unused line and reformat
* No single letter variables
* Cleanup return names
* Add copied from line
* Remove the deprecated arguments blocks
* Change the embeddings test to an alibi on/off test
* Remove position_ids from FalconForQA
* Remove old check for token type IDs
* Fix the alibi path when multi_query is False
* Update src/transformers/models/falcon/modeling_falcon.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/falcon/modeling_falcon.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update tests/models/falcon/test_modeling_falcon.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update config naming
* Fix typo for new_decoder_architecture
* Add some comments
* Fix docstring
* Fix docstring
* Create range in the right dtype from the start
* Review comment cleanup
* n_head_kv -> num_kv_heads
* self.alibi -> self.use_alibi
* self.num_kv -> self.num_kv_heads
* Reorder config args
* Made alibi arguments Optional
* Add all model docstrings
* Add extra checkpoints
* Add author info for Falcon
* Stop removing token_type_ids because our checkpoints shouldn't return it anymore
* Add one hopeful comment for the future
* Fix typo
* Update tests, fix cache issue for generation
* Use -1e9 instead of -inf to avoid float overflow
* Recompute the rotary embeddings much less often
* Re-enable disabled tests
* One final fix to attention mask calculation, and update tests
* Cleanup targeting falcon-40b equivalency
* Post-rebase docs update
* Update docstrings, especially in the config
* More descriptive variable names, and comments where we can't rename them

---------

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Enable the `use_flash_attention` configuration flag for Falcon models. When `use_flash_attention` is set to `true`, the [FalconAttention.forward()](https://github.com/huggingface/transformers/blob/c965d302791cf935d6ea7776428749be678cf509/src/transformers/models/falcon/modeling_falcon.py#L281) method is replaced with a variant that uses Tri Dao's flash_attention instead of PyTorch's `scaled_dot_product_attention` function. At the moment the patch only works for falcon-7b, but technically it will also work for falcon-40b with the right configuration.

The Falcon model situation is currently a bit messy: the Falcon model was recently added to Hugging Face transformers (see [PR transformers#24523](huggingface/transformers#24523)), but the Falcon models on the Hugging Face Hub still use the code shipped together with the weights (a PR to change this [was reverted](https://huggingface.co/tiiuae/falcon-7b/discussions/66)). Falcon-7b and Falcon-40b each use slightly different code, which was unified in the HF transformers implementation, where the variant can be selected via a configuration member called `new_decoder_architecture` (see [configuration_falcon.py#L65-L67](https://github.com/huggingface/transformers/blob/main/src/transformers/models/falcon/configuration_falcon.py#L65-L67)). The HF Falcon implementation also uses different names in the configuration class, e.g. compare the new [configuration_falcon.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/falcon/configuration_falcon.py) with the old [configuration_RW.py](https://huggingface.co/tiiuae/falcon-7b/blob/main/configuration_RW.py).

Model configurations compatible with the HF Falcon implementation can be found here: 7B: [config.json](https://huggingface.co/tiiuae/falcon-7b/blob/4e2d06f0a7c6370ebabbc30c6f59377ae8f73d76/config.json), 40B: [config.json](https://huggingface.co/tiiuae/falcon-40b/blob/f1ba7d328c06aa6fbb4a8afd3c756f46d7e6b232/config.json).
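As a rough sketch of the core substitution involved (the helper names and tensor-layout handling below are illustrative assumptions; `flash_attn_func` is the real entry point of the flash-attn package):

```python
import torch.nn.functional as F
from flash_attn import flash_attn_func

def sdpa_path(query, key, value):
    # query/key/value: (batch, num_heads, seq_len, head_dim)
    return F.scaled_dot_product_attention(query, key, value, is_causal=True)

def flash_path(query, key, value):
    # flash_attn_func expects (batch, seq_len, num_heads, head_dim) tensors in fp16/bf16.
    q, k, v = (t.transpose(1, 2) for t in (query, key, value))
    out = flash_attn_func(q, k, v, causal=True)
    return out.transpose(1, 2)  # back to (batch, num_heads, seq_len, head_dim)
```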
This PR adds the Falcon model to the main library. It's still a work in progress, and integration tests / model checkpoints still need to be added!
TODO:
- AutoTokenizer for all checkpoints
- output_attentions
- output_hidden_states