
How the function insert_target() in data.py works? #7

Open · yuboona opened this issue Jun 2, 2020 · 8 comments


yuboona commented Jun 2, 2020

  • Question:

I found that when insert_target() in data.py is used, the input data is split into many sequences that share a lot of overlapping words.

I would like to know why the data is processed like this. It seems to produce a lot of repeated data.

thaitrinh commented:

I have the same question. @yuboona, have you figured out why we need to do that? Thank you!


yuboona commented Nov 17, 2020

> I have the same question. @yuboona, have you figured out why we need to do that? Thank you!

I think it's a trick to use more contextual information. In my earlier solution, I treated punctuation restoration as a sequence labeling task: feed one sentence into BERT at a time and output the labels of all the words in that sentence. This repo, by contrast, feeds in one sequence but predicts only the label of the word in the middle of that sequence. I hope this is clear enough.
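For concreteness, here is a minimal sketch of the sequence-labeling framing described above (one sequence in, one punctuation label per token out). The model name and label set are illustrative assumptions, not taken from either repo:

```python
from torch import nn
from transformers import AutoModel, AutoTokenizer

LABELS = ["0", "PERIOD", "COMMA", "QUESTION"]  # hypothetical label set

class SeqLabelPunctuator(nn.Module):
    """Predict a punctuation label for every token in the input."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, len(LABELS))

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)  # (batch, seq_len, num_labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer("this is the first sentence this is the second sentence",
                  return_tensors="pt")
logits = SeqLabelPunctuator()(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # one label distribution per token, in a single pass
```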

thaitrinh commented:

Thank you very much!

Is your solution here: https://github.com/yuboona/punctuation-restoration-pytorch?

You meant:

  • In your solution: Input: "this is the first sentence this is the second sentence" -> Output: "This is the first sentence. This is the second sentence."
  • In this repo: Input: "this is the first sentence this is the second sentence" -> Output: "0 0 0 0 PERIOD 0 0 0 0 PERIOD".

Do I understand you correctly?


yuboona commented Nov 17, 2020

> Thank you very much!
>
> Is your solution here: https://github.com/yuboona/punctuation-restoration-pytorch?
>
> You meant:
> In your solution: Input: "this is the first sentence this is the second sentence" -> Output: "This is the first sentence. This is the second sentence."
> In this repo: Input: "this is the first sentence this is the second sentence" -> Output: "0 0 0 0 PERIOD 0 0 0 0 PERIOD".
> Do I understand you correctly?

Not really. I actually meant:

  • In my repo: Input: "this is the first sentence this is the second sentence" -> Output: "0 0 0 0 PERIOD 0 0 0 0 PERIOD".
  • In this repo:
    1. Input: "this is the first sentence this is the second sentence" -> Output: "PERIOD"
    2. Input: "is the first sentence this is the second sentence this" -> Output: "0"
    3. Input: "the first sentence this is the second sentence this is" -> Output: "0"
    4. .......
    5. .......

insert_target() generates input data and labels like this (sketched below).
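For illustration, here is a rough sketch of the windowing these examples describe: slide a fixed-size window over the word stream, wrapping around at the end, and pair each window with the label of its middle word. The window size and exact centering are assumptions; the real insert_target() in data.py may pad or center slightly differently:

```python
def make_windows(words, labels, window_size=10):
    """Yield one (window, target_label) pair per word in the corpus."""
    n = len(words)
    for i in range(n):
        # fixed-size window starting at word i, wrapping past the end
        window = [words[(i + j) % n] for j in range(window_size)]
        # the label belongs to the word in the middle of the window
        target = labels[(i + window_size // 2 - 1) % n]
        yield window, target

words = "this is the first sentence this is the second sentence".split()
labels = ["0", "0", "0", "0", "PERIOD", "0", "0", "0", "0", "PERIOD"]
for window, target in list(make_windows(words, labels))[:3]:
    print(" ".join(window), "->", target)
# this is the first sentence this is the second sentence -> PERIOD
# is the first sentence this is the second sentence this -> 0
# the first sentence this is the second sentence this is -> 0
```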

thaitrinh commented:

I got it. Thank you so much!

Have you compared the performance of the two models? Do you think the data preparation in this repo helps the model's performance?


yuboona commented Nov 17, 2020

> I got it. Thank you so much!
>
> Have you compared the performance of the two models? Do you think the data preparation in this repo helps the model's performance?

In fact, using BERT in these two ways gives nearly the same performance, so I don't really think this data preparation helps. Moreover, it makes the total input much longer than in the first way, so training time increases.
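(As a rough illustration, not from the thread: with a 10-word window, a corpus of N words becomes N overlapping sequences of 10 tokens each, so the model processes about 10×N tokens per epoch, versus roughly N tokens in the sequence-labeling setup.)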

thaitrinh commented:

Great! Many thanks, Yuboona! I will try your approach as well!


kotikkonstantin commented Nov 21, 2021

@thaitrinh @yuboona Guys, I've made a visualization of the target preparation. You can see here.
Also, in my version I use stacked hidden states instead of LM logits. For Russian this works better, and it decreases the input dimension of the FC layer.
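A minimal sketch of what classifying from stacked hidden states might look like, assuming a HuggingFace BERT and a stack of the last 4 layers (both are illustrative assumptions; the exact setup isn't shown in this thread):

```python
import torch
from torch import nn
from transformers import AutoModel

class StackedStatesPunctuator(nn.Module):
    """Classify the middle token from concatenated hidden layers."""
    def __init__(self, model_name="bert-base-uncased",
                 num_layers=4, num_labels=4):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name,
                                              output_hidden_states=True)
        self.num_layers = num_layers
        # 4 * 768 = 3072 FC inputs, versus a ~30k-dim LM-logit vector
        self.fc = nn.Linear(num_layers * self.bert.config.hidden_size,
                            num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_states: tuple of (num_layers + 1) tensors,
        # each of shape (batch, seq_len, hidden_size)
        stacked = torch.cat(out.hidden_states[-self.num_layers:], dim=-1)
        center = stacked[:, stacked.size(1) // 2, :]  # the target word
        return self.fc(center)
```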
