Add Zamba #30950

pglorio · 2024-05-22T05:09:31Z

What does this PR do?

Please include support for Zamba architecture created by Zyphra Technologies.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

pglorio · 2024-05-23T06:03:34Z

Just for future reference: we measured the latency of a single mamba layer in Zamba and compared it with that of a single layer in Mamba. The two have very similar implementations (Zamba adds some reshapes, which should be no-ops, and a concatenation). We found that the mamba layer in Zamba has the same speed in a single forward pass but is slower during generation.

More specifically, we instantiated these two (random) models:

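import torch
from transformers import MambaConfig, MambaForCausalLM, ZambaConfig, ZambaForCausalLM

# Randomly initialized models with matching depth and width
config = ZambaConfig(num_hidden_layers=81, hidden_size=3712, n_mamba_heads=1, use_cache=True)
model_1 = ZambaForCausalLM(config).cuda()
config = MambaConfig(num_hidden_layers=81, hidden_size=3712, use_cache=True)
model_2 = MambaForCausalLM(config).cuda()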

(here n_mamba_heads=1 corresponds to the original Mamba architecture), and used this code for generation:

model.eval()
input_ids = torch.randint(1000, (1, 2048)).to(device=model.device)
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=300, return_dict_in_generate=False, output_scores=False, use_cache=True, num_beams=1, do_sample=False)

We found that the total time spent computing this line

transformers/src/transformers/models/zamba/modeling_zamba.py

Line 1009 in 87ec872

hidden_states = self.mamba(

is 8.1s, and for this line

transformers/src/transformers/models/mamba/modeling_mamba.py

Line 341 in 87ec872

hidden_states = self.mixer(hidden_states, cache_params=cache_params)

is 6.3s.

ArthurZucker · 2024-05-23T09:51:55Z

cc @younesbelkada !

younesbelkada

Thanks a lot for this PR! I left some minor suggestions in the modeling code for general improvements.
Can you also rebase on main and make sure make fixup passes locally? Let me know if you need any assistance!

amazingvince · 2024-06-04T05:32:52Z

I tried running a basic training script on this fork, both with and without gradient accumulation, and I am getting this error:
File "/home/user/transformers_zamba/src/transformers/models/zamba/modeling_zamba.py", line 1051, in forward
hidden_states = hidden_states + from_tf if from_tf is not None else hidden_states
RuntimeError: The size of tensor a (7424) must match the size of tensor b (3712) at non-singleton dimension 2

from_tf is not well described in the docstrings, so I am not sure what is going wrong here.

pglorio · 2024-06-04T06:42:18Z

Thanks for spotting this. We fixed the issue in the most recent push. Please try again and let us know if you still encounter issues.

We are adding more docstrings to explain various parts of the architecture. We will add the description below for from_tf around this line:

from_tf is the output of shared transformer + linear layer (these layers are shown in fig. 2 in https://arxiv.org/pdf/2405.16712). from_tf is then added to the input to the mamba layer (as described in eq. (6) of https://arxiv.org/pdf/2405.16712, where y_l in that equation is from_tf).
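As a toy illustration of the addition described above, the sketch below adds from_tf to the hidden states only when it is provided and checks that the shapes agree. It uses nested Python lists of shape (seq_len, hidden_size) instead of tensors, and the function name `add_from_tf` is hypothetical; it is not the code in modeling_zamba.py.

```python
def add_from_tf(hidden_states, from_tf=None):
    """Return hidden_states + from_tf elementwise, or hidden_states if from_tf is None.

    Both arguments are (seq_len, hidden_size) nested lists in this toy version.
    """
    if from_tf is None:
        return hidden_states
    if len(hidden_states) != len(from_tf):
        raise ValueError(
            f"sequence length mismatch: {len(hidden_states)} vs {len(from_tf)}"
        )
    for row_h, row_t in zip(hidden_states, from_tf):
        if len(row_h) != len(row_t):
            # Analogous to the RuntimeError reported above (7424 vs 3712).
            raise ValueError(f"hidden size mismatch: {len(row_h)} vs {len(row_t)}")
    return [
        [h + t for h, t in zip(row_h, row_t)]
        for row_h, row_t in zip(hidden_states, from_tf)
    ]


hidden = [[1.0, 2.0], [3.0, 4.0]]
summed = add_from_tf(hidden, [[0.5, 0.5], [0.5, 0.5]])
unchanged = add_from_tf(hidden)
```

In the reported error the two sizes differ by exactly a factor of two (7424 = 2 × 3712), which is the kind of mismatch this check would catch before the addition.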

pglorio · 2024-06-05T00:50:55Z

Thank you for the thorough review!

We ran make fixup and make fix-copies. Running make fixup again gives this output:

Checking/fixing src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py src/transformers/models/zamba/configuration_zamba.py src/transformers/models/zamba/modeling_zamba.py tests/models/roc_bert/test_tokenization_roc_bert.py
All checks passed!
4 files left unchanged
python utils/custom_init_isort.py
python utils/sort_auto_mappings.py
python utils/check_doc_toc.py --fix_and_overwrite
running deps_table_update
updating src/transformers/dependency_versions_table.py
python utils/check_copies.py
Traceback (most recent call last):
  File "/workspace/transformers_zamba/utils/check_copies.py", line 1106, in <module>
    check_copies(args.fix_and_overwrite, args.file)
  File "/workspace/transformers_zamba/utils/check_copies.py", line 856, in check_copies
    raise Exception(
Exception: Found the following copy inconsistencies:
- tests/models/roc_bert/test_tokenization_roc_bert.py: copy does not match models.bert.test_tokenization_bert.BertTokenizationTest.test_is_whitespace at line 167
Run `make fix-copies` or `python utils/check_copies.py --fix_and_overwrite` to fix them.

It looks like make fix-copies is trying to correct parts of the code that are outside of our PR, and some of those fixes still fail. However, we no longer seem to get errors related to our PR.

We pushed the fixes done by make fix-copies.

amazingvince · 2024-06-06T21:34:49Z

Tried training again and am now getting this:

/trainer.py", line 3250, in training_step
self.accelerator.backward(loss)
File "/home/user/mambaforge/envs/zamba/lib/python3.10/site-packages/accelerate/accelerator.py", line 2127, in backward
loss.backward(**kwargs)
File "/home/user/mambaforge/envs/zamba/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/home/user/mambaforge/envs/zamba/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/home/user/mambaforge/envs/zamba/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/user/mambaforge/envs/zamba/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
return user_fn(self, *args)
File "/home/user/mambaforge/envs/zamba/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 320, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/user/mambaforge/envs/zamba/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/home/user/mambaforge/envs/zamba/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to create tensor with negative dimension -274340773632: [-274340773632]

Quentin-Anthony · 2024-06-07T00:27:51Z

Hey there! We have recently been using huggingface/alignment-handbook@main...Zyphra:alignment-handbook:zamba-instruct successfully to do SFT of Zamba. Does your setup meaningfully differ from this? I can't seem to reproduce the error. Can you provide us with a reproducer?

ArthurZucker · 2024-06-07T09:53:52Z

cc @younesbelkada should I review this or do you want to do another pass? 🤗

amazingvince · 2024-06-09T20:56:45Z

I am trying to extend the max context length.
{
"max_position_embeddings": 32768,
"rope_theta": 192144,
}

I also tried 16k.

I tried running in your fork of alignment handbook and saw the same results.

younesbelkada

Thanks very much for your great work on this! I left a few minor improvements to address and some file changes to revert. Can you make sure our CI is happy, i.e. that make fixup passes and that the tests in pytest tests/models/zamba/ pass? Let me know if you need any help or have any questions.

pglorio · 2024-06-18T05:19:40Z

Thank you for your help, @younesbelkada!

We believe we have addressed most of the concerns you raised; we still have two pending questions:

pytest tests/models/zamba/: Pytest flags only test_initialization as failing. The failure involves x_proj_weight and dt_proj_weight, whose mean is approximately 10^-2 rather than the expected 10^-9. This discrepancy is expected: it comes from the initialization scheme, which uses a variance of (d_input)^(-0.5), where d_input is approximately 100 in the test configuration. We initialize these parameters with nn.Parameter(torch.rand(...)), which we verified is equivalent to the Kaiming initialization typically used for nn.Linear. It seems that transformers may apply additional initialization steps for various layer types, and these might not extend to parameters such as x_proj_weight. We have adjusted the tolerance for these parameters to 10^-2 in the initialization test in this line of the test script. Please let us know if additional steps are required.
make fixup: After running it, we have this output:

- src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py: copy does not match models.big_bird.modeling_big_bird.BigBirdBlockSparseAttention at line 720
- tests/models/roc_bert/test_tokenization_roc_bert.py: copy does not match models.bert.test_tokenization_bert.BertTokenizationTest.test_chinese at line 76
- tests/models/roc_bert/test_tokenization_roc_bert.py: copy does not match models.bert.test_tokenization_bert.BertTokenizationTest.test_basic_tokenizer_lower at line 85
- tests/models/roc_bert/test_tokenization_roc_bert.py: copy does not match models.bert.test_tokenization_bert.BertTokenizationTest.test_basic_tokenizer_lower_strip_accents_false at line 94
- tests/models/roc_bert/test_tokenization_roc_bert.py: copy does not match models.bert.test_tokenization_bert.BertTokenizationTest.test_basic_tokenizer_lower_strip_accents_true at line 103
- tests/models/roc_bert/test_tokenization_roc_bert.py: copy does not match models.bert.test_tokenization_bert.BertTokenizationTest.test_basic_tokenizer_lower_strip_accents_default at line 112
Run `make fix-copies` or `python utils/check_copies.py --fix_and_overwrite` to fix them.
make: *** [Makefile:38: repo-consistency] Error 1

All of these lines relate to files outside of our PR, so we did not change those files, although I do see that the CircleCI tests run on this PR still fail. Please let us know if further action is needed here and what steps we would need to take.
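On the test_initialization point above, a rough sanity check of the magnitudes can be sketched as follows. This assumes the weights are drawn uniformly from a symmetric interval of half-width d_input^(-0.5); that distribution is an assumption made here for illustration, since the thread does not spell it out. The helper name `sample_mean_uniform` and the count of 1000 values are likewise hypothetical.

```python
import math
import random


def sample_mean_uniform(d_input, n_values, seed=0):
    """Draw n_values from U(-b, b) with b = d_input ** -0.5; return (mean, b)."""
    bound = d_input ** -0.5
    rng = random.Random(seed)
    values = [rng.uniform(-bound, bound) for _ in range(n_values)]
    return sum(values) / n_values, bound


mean, bound = sample_mean_uniform(d_input=100, n_values=1000)

# A single draw from U(-b, b) has standard deviation b / sqrt(3), so the mean
# of n independent draws has standard deviation b / sqrt(3 * n).
expected_scale = bound / math.sqrt(3 * 1000)
```

With d_input = 100 the bound is 0.1 and the typical size of the sample mean is on the order of 10^-3, many orders of magnitude above 10^-9. That is consistent with the observation that a near-zero-mean check fails for parameters whose scale is not controlled by the test's initializer settings.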

Thank you so much for your time and help!

younesbelkada

Thanks a lot for iterating! I left one minor suggestion. Can you also merge your branch with the upstream main branch? This should make the CI happy, and the tests should be green.

pglorio · 2024-06-21T19:52:49Z

Thank you so much for your guidance. We tried to rebase our PR and ran into an error related to model generation. It looks like the rebased GenerationMixin.generate method instantiates Zamba's cache as a DynamicCache class https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1775. This is different from HybridMambaAttentionDynamicCache which would be the expected class for Zamba's cache (defined here https://github.com/Zyphra/transformers_zamba/blob/main/src/transformers/models/zamba/modeling_zamba.py#L130). In the non-rebased fork, the cache is instantiated by GenerationMixin.generate in this line: https://github.com/Zyphra/transformers_zamba/blob/main/src/transformers/generation/utils.py#L2379, which correctly instantiates cache as HybridMambaAttentionDynamicCache.

For reference, these are the calls performed from model.generate to the instantiation of the cache object:
using the rebased fork:

-> output = model.generate(**tokenized_prompt, max_new_tokens=300, return_dict_in_generate=False, output_scores=False, use_cache=True, num_beams=1, do_sample=False)
  /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py(115)decorate_context()
-> return func(*args, **kwargs)
  /workspace/transformers_zamba_rebased/src/transformers/generation/utils.py(1775)generate()
-> model_kwargs["past_key_values"] = DynamicCache()
> /workspace/transformers_zamba_rebased/src/transformers/cache_utils.py(305)__init__()

and using the fork before rebasing:

-> output = model.generate(**tokenized_prompt, max_new_tokens=300, return_dict_in_generate=False, output_scores=False, use_cache=True, num_beams=1, do_sample=False)
  /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py(115)decorate_context()
-> return func(*args, **kwargs)
  /workspace/transformers_zamba/src/transformers/generation/utils.py(1743)generate()
-> result = self._sample(
  /workspace/transformers_zamba/src/transformers/generation/utils.py(2379)_sample()
-> model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  /workspace/transformers_zamba/src/transformers/models/zamba/modeling_zamba.py(1588)prepare_inputs_for_generation()
-> past_key_values = HybridMambaAttentionDynamicCache(
> /workspace/transformers_zamba/src/transformers/models/zamba/modeling_zamba.py(146)__init__()

Could you please let us know how we can force Zamba's cache to be HybridMambaAttentionDynamicCache in the rebased fork?

Thanks very much!

ArthurZucker · 2024-08-03T16:25:47Z

Hey! Super late in coming back to you. You should be able to force it by setting cache_class = "" in the ZambaPreTrainedModel class!

pglorio · 2024-08-06T07:51:12Z

Hello @ArthurZucker, thank you for the suggestion! We tried adding cache_class = "" to this line but we still couldn't make generation work. As an alternative fix, we added "zamba" to this line, similarly to what was done with Jamba, in which case generation works fine.

We are happy either to keep this fix or to use the one you suggested, in which case we would appreciate it if you could say a few more words on how to implement it.
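To make the alternative fix easier to picture, here is a purely illustrative sketch of the dispatch idea: the generic generation code skips creating a default cache for model types that build their own, and the model-specific input preparation creates the hybrid cache instead. Everything in it is a stand-in written for this example (the two placeholder classes, `MODEL_TYPES_WITH_CUSTOM_CACHE`, `default_cache_for`, and `prepare_inputs`); it is not the actual code in generation/utils.py.

```python
class DynamicCache:
    """Placeholder for the generic key/value cache."""


class HybridMambaAttentionDynamicCache:
    """Placeholder for a hybrid mamba + attention cache."""


# Hypothetical registry of model types that build their own cache.
MODEL_TYPES_WITH_CUSTOM_CACHE = {"jamba", "zamba"}


def default_cache_for(model_type):
    """Return a generic cache, or None when the model creates its own later."""
    if model_type in MODEL_TYPES_WITH_CUSTOM_CACHE:
        return None
    return DynamicCache()


def prepare_inputs(model_type, past_key_values=None):
    """Create the hybrid cache on first use for model types that need it."""
    if past_key_values is None and model_type in MODEL_TYPES_WITH_CUSTOM_CACHE:
        past_key_values = HybridMambaAttentionDynamicCache()
    return past_key_values


zamba_cache = prepare_inputs("zamba", default_cache_for("zamba"))
other_cache = prepare_inputs("llama", default_cache_for("llama"))
```

The point of the design is that the generic path never hands a model a cache type it cannot use, while other models keep the default behavior.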

Meanwhile, we rebased our fork. All the local tests with make fixup have passed, except for a few warnings shown below which are unrelated to the updates we implemented:

/workspace/transformers_zamba/src/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
/workspace/transformers_zamba/utils/check_repo.py:376: UserWarning: Full repo consistency checks require all backends to be installed (with `pip install -e '.[dev]'` in the Transformers repo, the following are missing: TensorFlow, Flax. While it's probably fine as long as you didn't make any change in one of those backends modeling files, you should probably execute the command above to be on the safe side.
/workspace/transformers_zamba/src/transformers/models/deit/image_processing_deit.py:87: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  resample: PILImageResampling = PIL.Image.BICUBIC,
/workspace/transformers_zamba/src/transformers/models/chameleon/image_processing_chameleon.py:116: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
  resample: PILImageResampling = PIL.Image.LANCZOS,
/workspace/transformers_zamba/src/transformers/models/efficientnet/image_processing_efficientnet.py:92: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  resample: PILImageResampling = PIL.Image.NEAREST,

I see that most of the CircleCI tests for this PR are still failing. Please let me know if more needs to be done to fix those.

Thank you so much!

ArthurZucker · 2024-08-08T11:26:20Z

I'll have a look!

HuggingFaceDocBuilderDev · 2024-10-01T00:13:08Z

Hey! 🤗 Thanks for your contribution to the transformers library!

Before merging this pull request, slow tests CI should be triggered. To enable this:

Add the run-slow label to the PR
When your PR is ready for merge and all reviewers' comments have been addressed, push an empty commit with the command [run-slow] followed by a comma separated list of all the models to be tested, i.e. [run-slow] model_to_test_1, model_to_test_2
- If the pull request affects a lot of models, put at most 10 models in the commit message
A transformers maintainer will then approve the workflow to start the tests

(For maintainers) The documentation for slow tests CI on PRs is here.

hg0428 · 2024-10-01T02:24:16Z

Does this include support for Zamba2?

pglorio · 2024-10-01T04:47:34Z

@hg0428 thanks for asking! Support for Zamba2 will be added in a follow-up PR. Meanwhile, you can install Zyphra's local transformers as described in the Zamba2 model card.

ArthurZucker

Thanks for your contribution! 🤗

hg0428 · 2024-10-01T12:19:28Z

Unfortunately, that does not work on my device. Zamba2 transformers runs on mamba_ssm, which requires an NVIDIA GPU. I have Apple Silicon. See my issue: Zyphra/transformers_zamba2#3

ArthurZucker · 2024-10-01T12:29:50Z

Just waiting for https://github.com/huggingface/transformers/actions/runs/11116137341/job/30897696145?pr=30950#step:12:64 to be fixed! (related to accelerate and auto device, good that we have this test!)

pglorio · 2024-10-04T05:33:14Z

Hi @Arthur, thank you again for reviewing.

The test mentioned above test_multi_gpu_data_parallel_forward now passes. We had to change some of the shared layers logic for it to work. Previously, self.mamba_layers and self.linear_layers were both nn.ModuleList objects and self.layers was not, which prevented most of the layers from being scattered across devices. Now only self.layers is nn.ModuleList and everything seems to work.
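As general background on why the container type matters, the sketch below shows that PyTorch only registers submodules held in an nn.ModuleList, not those held in a plain Python list. This is a generic illustration with hypothetical class names (`PlainListModel`, `ModuleListModel`); it is not the Zamba code and does not reproduce the exact multi-GPU failure.

```python
from torch import nn


class PlainListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: these layers are NOT registered as submodules.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]


class ModuleListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList: each layer is registered and visible to .parameters(),
        # .to(device), state_dict(), and device-placement utilities.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])


plain_count = len(list(PlainListModel().parameters()))
registered_count = len(list(ModuleListModel().parameters()))
```

Here `plain_count` is 0 and `registered_count` is 6 (a weight and a bias for each of the three linear layers), which is why utilities that walk the registered module tree can miss layers stored in an unregistered container.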

Additionally, we updated all the model's checkpoints on the hub since this involved changing some of the weight keys related to the shared layers. Separately, given we updated the checkpoints, we also swapped up<->gate in the MLP weight keys as well as in the forward pass so this issue is now addressed.

All tests related to zamba appear to pass. Thank you!

hg0428 · 2024-10-04T11:31:56Z

Does this Zamba support work on Apple Silicon?

Quentin-Anthony · 2024-10-04T18:36:49Z

I don't believe so. We're working on MLX support in a separate (private for now) vein of work from this PR, which just seeks to get basic GPU integration into upstream HuggingFace Transformers.

Quentin-Anthony · 2024-10-04T20:29:26Z

🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉

ArthurZucker · 2024-10-04T20:47:39Z

🚀

fakerybakery · 2024-10-14T22:11:55Z

Hi,
Are there any plans to add Zamba2 to Transformers?
Thanks!

ArthurZucker · 2024-10-15T11:39:01Z

I think the Zyphra team is already working on it!

hg0428 · 2024-10-15T12:16:37Z

Hopefully we get Apple Silicon support for Zamba and Zamba2 soon.

amyeroberts added the New model label May 22, 2024

younesbelkada reviewed May 23, 2024

younesbelkada reviewed Jun 10, 2024

younesbelkada reviewed Jun 19, 2024

pglorio force-pushed the main branch 2 times, most recently from 0ec2417 to 18e8372 Compare June 21, 2024 03:28

pglorio force-pushed the main branch from d7507d9 to c0f1ddc Compare August 6, 2024 06:22

pglorio force-pushed the main branch from 6281d93 to 1c00938 Compare August 16, 2024 01:24

pglorio and others added 8 commits August 24, 2024 01:10

Update index.md

7eff1cc

Rebase

14961a2

Rebase

b67ff24

Updates from make fixup

0aa1003

Update zamba.md

5e88653

Batched inference

123d959

Update

f35bdf9

Fix tests

1ec90d1

Quentin-Anthony added 2 commits September 30, 2024 11:36

Merge branch 'huggingface:main' into main

634837f

Merge branch 'huggingface:main' into main

4e8db07

ArthurZucker approved these changes Oct 1, 2024

pglorio and others added 8 commits October 3, 2024 23:08

Fixed shared layer logic, swapped up<->gate in mlp

1504774

fix shared layer logic, swap up<->gate in mlp

06e3a7a

shared_transformer -> shared_transf

267530d

reformat HybridLayer __init__

0a90fc7

Merge branch 'huggingface:main' into main

fabaaec

fix docstrings in zamba config

75f0d89

added definition of _get_input_ids_and_config

b9545eb

fixed formatting of _get_input_ids_and_config

cdbd690

Quentin-Anthony added 2 commits October 4, 2024 09:23

Merge branch 'huggingface:main' into main

6fabb6a

Merge branch 'huggingface:main' into main

b9f6cce

ArthurZucker merged commit f319ba1 into huggingface:main Oct 4, 2024
17 of 21 checks passed
