preprocessing suggestions #3
Original question:

Thank you for creating this wonderful package. I just had a quick question about improving the accuracy of the alignment. Do you have any suggestions about text preprocessing, especially regarding symbols and punctuation? Would removing specific punctuation marks from the texts have a significant impact on performance? Thanks!

Reply:

I'm not sure whether removing punctuation would improve the accuracy. It's very easy to give it a try, though: just change the code in aligner.py to strip punctuation from the source and target sentences, along the lines of the sketch below.
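Here is a minimal sketch of that kind of preprocessing. The strip_punct helper is hypothetical (it is not part of Bertalign); the assumption is that aligner.py has access to the source and target sentence lists before they are embedded:

```python
import unicodedata

def strip_punct(sents):
    # Drop every character whose Unicode category is punctuation ("P*").
    # Using Unicode categories also catches CJK punctuation, which plain
    # string.punctuation would miss in bilingual data.
    return ["".join(c for c in s if not unicodedata.category(c).startswith("P")).strip()
            for s in sents]

# Example: clean both sides before computing sentence embeddings.
src_clean = strip_punct(["Hello, world!", "How are you?"])
tgt_clean = strip_punct(["你好，世界！", "你好吗？"])
print(src_clean)  # ['Hello world', 'How are you']
print(tgt_clean)  # ['你好世界', '你好吗']
```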
Instead of tweaking the preprocessing, I think using other sentence similarity measurements may improve the alignment accuracy. Bertalign currently calculates the similarity between sentence pairs based on sentence embeddings. However, recent studies (Zhang et al., 2019; Wang & Yu, 2023) show that token-level similarity performs better on Semantic Textual Similarity tasks; see the sketch below.
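As a taste of what token-level scoring looks like, here is a sketch using the bert-score package, which implements BERTScore (Zhang et al., 2019). The model checkpoint and the idea of scoring cross-lingual sentence pairs this way are assumptions; wiring the scores into Bertalign's similarity matrix would be a separate change:

```python
from bert_score import score  # pip install bert-score

# Source/target sentences whose similarity we want; in Bertalign these
# would come from the two texts being aligned.
src = ["The cat sat on the mat."]
tgt = ["Le chat était assis sur le tapis."]

# BERTScore greedily matches tokens by the cosine similarity of their
# contextual embeddings and aggregates the matches into P/R/F1.
# Assumption: a multilingual checkpoint is adequate for this language pair.
P, R, F1 = score(src, tgt, model_type="bert-base-multilingual-cased")
print(F1.item())  # token-level similarity score for the pair
```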
References

Wang, H. and Yu, D. (2023). Going Beyond Sentence Embeddings: A Token-Level Matching Algorithm for Calculating Semantic Textual Similarity. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 563-570.

Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q. and Artzi, Y. (2019). BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.