GPT-J #12243
Conversation
Speed seems to be the same: https://gist.github.com/finetuneanon/5b2186c3555b652f387c86160cd89b55
The main thing I'm uncertain about is how to handle unimplemented functionality. GPT-J uses the same tokenizer as GPT-2, so I removed the tokenizer definition. Is that correct, or no? Relatedly, there were many types of modeling that GPT-J was not designed for, and @finetuneanon's PR just deleted the boilerplate for them. Is this correct?
Amazing, thanks a lot for the PR, Stella!
I left a few comments in the modeling file.
Regarding your questions:

- "I'm uncertain about how to handle unimplemented functionality"

The modeling template adds all types of head models (ForMLM, ForMultipleChoice, ...); any such functionality that is not needed for GPT-J can be removed.

- "GPT-J uses the same tokenizer as GPT-2, so I removed the tokenizer definition. Is that correct, or no?"

Yes, we don't need to add a new tokenizer in this case. We can define the tokenizer association in the tokenization_auto.py file, as is done for GPTNeo:

(GPTNeoConfig, (GPT2Tokenizer, GPT2TokenizerFast)),
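For illustration, the analogous GPT-J entry would be a one-line addition to the same mapping (a sketch only; `GPTJConfig` is the assumed name of the config class, which this PR has not finalized):

```python
# Hypothetical addition to the tokenizer mapping in tokenization_auto.py;
# GPTJConfig is an assumed class name, mirroring the GPTNeo entry above.
(GPTJConfig, (GPT2Tokenizer, GPT2TokenizerFast)),
```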
Another important thing is to add tests for the model. We could reuse GPT-2's tests from test_modeling_gpt2.py.
Also, sorry to ask this again, but could we not modify generation in this PR, since it seems it's not related to GPT-J?
But great that you took over the PR, let us know if there's anything else we can help with :)
def load_tf_weights_in_gptj(model, config, tf_checkpoint_path):
    """Load tf checkpoints in a pytorch model."""
    try:
        import re

        import numpy as np
        import tensorflow as tf
This can be removed, there is no TF model.
def fixed_pos_embedding(dim=None, seq_len=None):
    inv_freq = 1. / (10000 ** (torch.arange(0, dim, 2) / dim))
    sinusoid_inp = torch.einsum('i , j -> i j', torch.arange(seq_len), inv_freq).float()
    return torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)


def rotate_every_two(x):
    x1 = x[:, :, :, ::2]
    x2 = x[:, :, :, 1::2]
    x = torch.stack((-x2, x1), axis=-1)
    return rearrange(x, '... d j -> ... (d j)')


def apply_rotary_pos_emb(x, sincos, offset=0):
    sin, cos = map(lambda t: repeat(t[offset:x.shape[1] + offset, :], "n d -> () n () (d j)", j=2), sincos)
    return (x * cos) + (rotate_every_two(x) * sin)
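As a quick sanity check, the helpers above can be exercised like this (a sketch assuming torch and einops are installed; the (batch, seq, heads, head_dim) layout follows what the snippet's indexing implies):

```python
import torch

# Sin/cos tables for a 64-dim rotary embedding over 128 positions: each is (128, 32).
sin, cos = fixed_pos_embedding(dim=64, seq_len=128)

# Queries laid out as (batch, seq, heads, head_dim).
q = torch.randn(1, 128, 16, 64)
q_rot = apply_rotary_pos_emb(q, (sin, cos))
print(q_rot.shape)  # torch.Size([1, 128, 16, 64])
```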
It would be nice to not add an einops dependency. Also, we could add this as a static method in the attention class so that it can be tested easily. We could probably reuse this implementation:
def apply_rotary_position_embeddings(sinusoidal_pos, query_layer, key_layer, value_layer=None):
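For context, here is a minimal sketch of what the two helpers above could look like without einops, using only torch ops (untested, and assuming the same (batch, seq, heads, head_dim) layout as the snippet):

```python
import torch


def rotate_every_two(x):
    # Pair up even/odd features and interleave (-x2, x1) along the last dim;
    # flatten(-2) reproduces rearrange(x, "... d j -> ... (d j)").
    x1 = x[:, :, :, ::2]
    x2 = x[:, :, :, 1::2]
    x = torch.stack((-x2, x1), dim=-1)
    return x.flatten(-2)


def apply_rotary_pos_emb(x, sincos, offset=0):
    # repeat_interleave duplicates each frequency for the (d j) pairing, and the
    # [None, :, None, :] indexing replaces einops' "n d -> () n () (d j)" repeat.
    sin, cos = (
        t[offset:x.shape[1] + offset, :].repeat_interleave(2, dim=-1)[None, :, None, :]
        for t in sincos
    )
    return (x * cos) + (rotate_every_two(x) * sin)
```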
class GPTJAttentionMixin:
    """
    A few attention related utilities for attention modules in GPT Neo, to be used as a mixin.
    """

    def _split_heads(self, tensor, num_heads, attn_head_size, rotary):
        """
        Splits hidden_size dim into attn_head_size and num_heads
Since there is no local attention, all of this can go in the GPTJSelfAttention class.
if len(tensor.shape) == 5:
    return tensor.permute(0, 1, 3, 2, 4)  # (batch, blocks, head, block_length, head_features)
This can also be removed, no local attention here :/
if len(tensor.shape) == 5:
    tensor = tensor.permute(0, 1, 3, 2, 4).contiguous()
same comment as above.
if attention_type == "local":
    self.register_buffer(
        "bias",
        bias ^ torch.tril(bias, -config.window_size),
    )
same comment as above
if self.attention_type in ["global", "local"]:
    self.attention = GPTJSelfAttention(self.attention_type, config)
We could remove all this GPTNeo related code from here.
Damn. It looks like I messed something up... this was supposed to not include @finetuneanon's commits. I might close this and create a replacement PR with the correct commit history.

Mmm, I was wondering how this has been going. I would love to try a stable version of this!

Hey @sualehasif, a stable version will be available in a week, stay tuned!

@StellaAthena any idea when you would be adding a new PR? We are also running some experiments, so maybe we could help.

I'm taking over the PR. But feel free to post your findings :)

In #12106 @finetuneanon reports the results of some evaluations of the ported model on EleutherAI's evaluation harness. The numbers were a little lower than what we had found using the original implementation, but both he and I felt this was likely due to FP16. I can now confirm that the ported model achieves the same performance as the original model when evaluated in FP32. The absolute differences in performance on LAMBADA, HellaSwag, PiQA, and Winogrande are all less than 0.5% when evaluated in FP32.
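For reference, the FP32/FP16 distinction comes down to how the weights are loaded. A sketch with transformers (the checkpoint name is illustrative, not something this PR ships):

```python
import torch
from transformers import AutoModelForCausalLM

# Default from_pretrained load keeps the weights in FP32, the setting in which
# the eval numbers matched the original implementation.
model_fp32 = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# Half precision halves memory, but the small numerical differences can show up
# as slightly lower benchmark scores, as observed in #12106.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
)
```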
Cool, that's good to know. |
@patil-suraj can you mark this as a draft, as it is not ready to merge in its current state?

Hi @patil-suraj, thanks so much for working on this. Is there any progress on integration into huggingface transformers?

Just chiming in here: all of the .py files with dashes will not be importable :) So I'd suggest renaming them.
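For anyone unfamiliar with the issue: a dash parses as a minus sign in Python, so a module file whose name contains one can never be loaded with a plain import statement. A quick, runnable illustration (the module name is hypothetical):

```python
# A dash is parsed as a minus sign, so a file like "modeling_gpt-j.py" can
# never be loaded with a plain import statement:
try:
    compile("import modeling_gpt-j", "<demo>", "exec")
except SyntaxError:
    print("SyntaxError: Python reads the name as 'modeling_gpt - j'")
# Renaming the file to modeling_gptj.py (no dashes) avoids the problem.
```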
Any updates on this and any help required?

@patil-suraj What is the status of this?

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

I would still love to see this happen.

This is going to happen any day now, see #13022
This is a work-in-progress focused on reconciling styles and may break without warning. If you want to use GPT-J with the HF interface, you can do that by installing transformers from here. The purpose of this PR is to make progress on converting that repo to the style HF prefers.
What does this PR do?
This is my attempt to reconcile #12106 with the HF style guidelines as described by @sgugger. The original PR was created by @finetuneanon and @kurumuz.
This implementation has not been thoroughly tested yet, but I wanted to get something out as a starting point for continuing the conversation before too much momentum is lost. I need to reread HF documentation a bit more to figure out the things that are wrong, or hopefully one of you lovely people can help me out.
For comparison, a frozen version of the code in the original PR can be found here.
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@patrickvonplaten @patil-suraj @sgugger