Tapas tokenization Different from Tensorflow Code #13244

Open

Doreenruirui opened this issue Aug 24, 2021 · 10 comments

Comments
@Doreenruirui

Environment info

  • transformers version: 4.9.1

Who can help

@LysandreJik @sgugger @NielsRogge

Information

Model I am using (Bert, XLNet ...): Tapas

When trying to replicate the TAPAS table retrieval results with the Hugging Face TAPAS implementation, I found that the tokenization in Hugging Face differs from the original TensorFlow code. The original code first checks whether a table cell is "n/a", "?", or empty; if so, it returns the "[EMPTY]" token. The Hugging Face code implements the same tokenization as the TensorFlow code, but it is not actually used when tokenizing tables. This can be fixed by changing all calls to self.tokenize to self._tokenize in the _tokenize_table function. After this fix, I could use the released table retrieval model to replicate their results on the NQ dataset with Hugging Face TAPAS.
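For context, a minimal sketch of the discrepancy (the expected outputs in the comments reflect the behavior described in this issue; the call to the private _tokenize method is purely for illustration):

from transformers import TapasTokenizer

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")

# The public tokenize(), inherited from PreTrainedTokenizer, returns an
# empty list for whitespace-only input before TapasTokenizer._tokenize()
# is ever reached, so empty cells are dropped instead of mapped to [EMPTY].
print(tokenizer.tokenize(""))       # []
print(tokenizer._tokenize(""))      # ['[EMPTY]']
print(tokenizer._tokenize("n/a"))   # ['[EMPTY]']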

@Doreenruirui changed the title from "Tapas tokenization" to "Tapas tokenization Different from Tensorflow Code" on Aug 24, 2021
@NielsRogge
Contributor

NielsRogge commented Aug 25, 2021

Hi,

Thanks for your interest in TAPAS. However, I do think the _tokenize() method is effectively used by TapasTokenizer: TapasTokenizer inherits from PreTrainedTokenizer, which defines the tokenize() method, and that method in turn calls _tokenize().

You can also verify this using a simple example:

import pandas as pd
from transformers import TapasTokenizer

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")

# Two cell values ("n/a" and "?") count as empty for TAPAS
data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "n/a"], 'Number of movies': ["?", "53", "69"]}
queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
table = pd.DataFrame.from_dict(data)
inputs = tokenizer(table=table, queries=queries)
# Decode the first encoded (query, table) pair to inspect the tokenization
print(tokenizer.decode(inputs.input_ids[0]))

As you can see, I've replaced two cell values with n/a and ?, i.e. there are some empty cells in the table. This returns:

[CLS] what is the name of the first actor? [SEP] actors number of movies brad pitt [EMPTY] leonardo di caprio 53 [EMPTY] 69

The empty cells are correctly replaced by the [EMPTY] token.

@Doreenruirui
Author


Thank you very much for your reply!

It seems that "n/a" and "?" are tokenized into the [EMPTY] token, but if a cell is an empty string, it is ignored by the tokenizer.
For this example,
data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "n/a"], 'Number of movies': ["", "53", "69"]}
the tokenization result is
[CLS] what is the name of the first actor? [SEP] actors number of movies brad pitt leonardo di caprio 53 [EMPTY] 69
If I directly call self._tokenize, it is tokenized into
[CLS] what is the name of the first actor? [SEP] actors number of movies brad pitt [EMPTY] leonardo di caprio 53 [EMPTY] 69
I suspect this is because the tokenize function returns an empty list when the token is an empty string, before it is ever passed to the _tokenize function; this differs from the TAPAS TensorFlow implementation.
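For reference, a sketch of what the proposed change inside TapasTokenizer._tokenize_table might look like (the surrounding loop structure is an assumption for illustration, not the exact upstream source; the call swap is the actual fix):

# Hypothetical excerpt of TapasTokenizer._tokenize_table (loop structure assumed)
for idx, row in table.iterrows():
    tokenized_row = []
    for cell in row:
        # was: tokenized_row.append(self.tokenize(cell))
        # _tokenize() maps "", "n/a" and "?" cells to [EMPTY], whereas the
        # inherited tokenize() silently drops empty strings
        tokenized_row.append(self._tokenize(cell))
    tokenized_rows.append(tokenized_row)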

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@NielsRogge
Contributor

That's interesting @Doreenruirui, are you interested in making a PR to fix this?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@NielsRogge
Contributor

unstale

@NielsRogge
Contributor

Hi @Doreenruirui,

After fixing this, I could use the released table retrieval model to replicate their results on NQ dataset with Huggingface Tapas.

This is very interesting, thanks for letting me know. Are you interested in opening a PR that includes the fix?

We could perhaps also add the table retrieval models to the hub.

Thanks!

@github-actions

github-actions bot commented Dec 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@andy1122

andy1122 commented Jan 4, 2022

Hi @NielsRogge

This is very interesting, thanks for letting me know. Are you interested in opening a PR that includes the fix?

I would like to work on this; I can start if nobody else is already working on it.

Thanks

@oneraghavan
Contributor

@NielsRogge @Doreenruirui This issue seems to be fixed. We can close it.
