preprocessing suggestions #3
Original question:

Thank you for creating this wonderful package. I just had a quick question about improving the accuracy of the alignment. Do you have any suggestions about text preprocessing, especially regarding symbols and punctuation? Would removing specific punctuation marks from the texts have a significant impact on performance? Thanks!

Reply:

I'm not sure whether removing punctuation would improve the accuracy. It's very easy to give it a try, though: just change the code in aligner.py to strip punctuation from the source and target sentences, along the lines of the sketch below.
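Here is a minimal sketch of that kind of preprocessing. The strip_punct helper is hypothetical (it is not part of Bertalign); the assumption is that aligner.py has access to the source and target sentence lists before they are embedded:

```python
import unicodedata

def strip_punct(sents):
    # Drop every character whose Unicode category is punctuation ("P*").
    # Using Unicode categories also catches CJK punctuation, which plain
    # string.punctuation would miss in bilingual data.
    return ["".join(c for c in s if not unicodedata.category(c).startswith("P")).strip()
            for s in sents]

# Example: clean both sides before computing sentence embeddings.
src_clean = strip_punct(["Hello, world!", "How are you?"])
tgt_clean = strip_punct(["你好，世界！", "你好吗？"])
print(src_clean)  # ['Hello world', 'How are you']
print(tgt_clean)  # ['你好世界', '你好吗']
```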
Instead of tweaking the preprocessing, I think using other sentence similarity measurements may improve the alignment accuracy. Bertalign currently calculates the similarity between sentence pairs based on sentence embeddings. However, recent studies (Zhang et al., 2019; Wang & Yu, 2023) show that token-level similarity performs better on Semantic Textual Similarity tasks; see the sketch below.
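As a taste of what token-level scoring looks like, here is a sketch using the bert-score package, which implements BERTScore (Zhang et al., 2019). The model checkpoint and the idea of scoring cross-lingual sentence pairs this way are assumptions; wiring the scores into Bertalign's similarity matrix would be a separate change:

```python
from bert_score import score  # pip install bert-score

# Source/target sentences whose similarity we want; in Bertalign these
# would come from the two texts being aligned.
src = ["The cat sat on the mat."]
tgt = ["Le chat était assis sur le tapis."]

# BERTScore greedily matches tokens by the cosine similarity of their
# contextual embeddings and aggregates the matches into P/R/F1.
# Assumption: a multilingual checkpoint is adequate for this language pair.
P, R, F1 = score(src, tgt, model_type="bert-base-multilingual-cased")
print(F1.item())  # token-level similarity score for the pair
```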
References

Wang, H. and Yu, D. (2023). Going Beyond Sentence Embeddings: A Token-Level Matching Algorithm for Calculating Semantic Textual Similarity. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 563-570.

Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q. and Artzi, Y. (2019). BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.