Tapas tokenization Different from Tensorflow Code #13244

Open

Doreenruirui opened this issue Aug 24, 2021 · 10 comments

Comments
@Doreenruirui

Environment info

  • transformers version: 4.9.1

Who can help

@LysandreJik @sgugger @NielsRogge

Information

Model I am using (Bert, XLNet ...): Tapas

When trying to replicate the TAPAS table retrieval results with the Hugging Face TAPAS implementation, I found that the tokenization in Hugging Face differs from the original TensorFlow code. The original code first checks whether a table cell is "n/a", "?", or empty; if so, it returns the "[EMPTY]" token. The Hugging Face code implements the same tokenization as the TensorFlow code, but it is not actually used when tokenizing tables. This can be fixed by changing all calls to self.tokenize to self._tokenize in the _tokenize_table function. After this fix, I could use the released table retrieval model to replicate their results on the NQ dataset with Hugging Face TAPAS.
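For context, a minimal sketch of the discrepancy (the expected outputs in the comments reflect the behavior described in this issue; the call to the private _tokenize method is purely for illustration):

from transformers import TapasTokenizer

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")

# The public tokenize(), inherited from PreTrainedTokenizer, returns an
# empty list for whitespace-only input before TapasTokenizer._tokenize()
# is ever reached, so empty cells are dropped instead of mapped to [EMPTY].
print(tokenizer.tokenize(""))       # []
print(tokenizer._tokenize(""))      # ['[EMPTY]']
print(tokenizer._tokenize("n/a"))   # ['[EMPTY]']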

@Doreenruirui changed the title from "Tapas tokenization" to "Tapas tokenization Different from Tensorflow Code" on Aug 24, 2021
@NielsRogge
Contributor

NielsRogge commented Aug 25, 2021

Hi,

Thanks for your interest in TAPAS. However, I do think the _tokenize() method is effectively used by TapasTokenizer: TapasTokenizer inherits from PreTrainedTokenizer, which defines the tokenize() method, and that method in turn calls _tokenize().

You can also verify this using a simple example:

import pandas as pd
from transformers import TapasTokenizer

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")

# Two cell values ("n/a" and "?") count as empty for TAPAS
data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "n/a"], 'Number of movies': ["?", "53", "69"]}
queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
table = pd.DataFrame.from_dict(data)
inputs = tokenizer(table=table, queries=queries)
# Decode the first encoded (query, table) pair to inspect the tokenization
print(tokenizer.decode(inputs.input_ids[0]))

As you can see, I've replaced two cell values with n/a and ?, i.e. there are some empty cells in the table. This returns:

[CLS] what is the name of the first actor? [SEP] actors number of movies brad pitt [EMPTY] leonardo di caprio 53 [EMPTY] 69

The empty cells are correctly replaced by the [EMPTY] token.

@Doreenruirui
Author


Thank you very much for your reply!

It seems that "n/a" and "?" are tokenized into the [EMPTY] token, but if a cell is an empty string, it is ignored by the tokenizer.
For this example,
data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "n/a"], 'Number of movies': ["", "53", "69"]}
the tokenization result is
[CLS] what is the name of the first actor? [SEP] actors number of movies brad pitt leonardo di caprio 53 [EMPTY] 69
If I directly call self._tokenize, it is tokenized into
[CLS] what is the name of the first actor? [SEP] actors number of movies brad pitt [EMPTY] leonardo di caprio 53 [EMPTY] 69
I suspect this is because the tokenize function returns an empty list when the token is an empty string, before it is ever passed to the _tokenize function; this differs from the TAPAS TensorFlow implementation.
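For reference, a sketch of what the proposed change inside TapasTokenizer._tokenize_table might look like (the surrounding loop structure is an assumption for illustration, not the exact upstream source; the call swap is the actual fix):

# Hypothetical excerpt of TapasTokenizer._tokenize_table (loop structure assumed)
for idx, row in table.iterrows():
    tokenized_row = []
    for cell in row:
        # was: tokenized_row.append(self.tokenize(cell))
        # _tokenize() maps "", "n/a" and "?" cells to [EMPTY], whereas the
        # inherited tokenize() silently drops empty strings
        tokenized_row.append(self._tokenize(cell))
    tokenized_rows.append(tokenized_row)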

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@NielsRogge
Contributor

That's interesting @Doreenruirui, are you interested in making a PR to fix this?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@NielsRogge
Contributor

unstale

@NielsRogge
Contributor

Hi @Doreenruirui,

After fixing this, I could use the released table retrieval model to replicate their results on NQ dataset with Huggingface Tapas.

This is very interesting, thanks for letting me know. Are you interested in opening a PR that includes the fix?

We could perhaps also add the table retrieval models to the hub.

Thanks!

@github-actions

github-actions bot commented Dec 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@andy1122

andy1122 commented Jan 4, 2022

Hi @NielsRogge

This is very interesting, thanks for letting me know. Are you interested in opening a PR that includes the fix?

I would like to work on this; I can start if nobody else is already working on it.

Thanks

@oneraghavan
Contributor

@NielsRogge @Doreenruirui This issue seems to be fixed. We can close it.
