
preprocessing suggestions #3

Open
alvinntnu opened this issue Jul 20, 2023 · 1 comment

@alvinntnu

Thank you for creating this wonderful package. I just had a quick question about improving the accuracy of the alignment. Do you have any suggestions about text preprocessing, especially regarding symbols and punctuation?
Would removing specific punctuation marks from the texts have a significant impact on performance? Thanks!

@bfsujason
Owner

bfsujason commented Jul 21, 2023

I'm not sure whether removing punctuation would improve the accuracy. It's very easy to give it a try, though: just change the code in aligner.py to strip the punctuation from the source and target sentences.
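
As a rough sketch of what that preprocessing could look like (the helper name `strip_punct` and the example sentences are just illustrations, not Bertalign's actual code):

```python
# Hypothetical preprocessing step: strip punctuation before alignment.
# Where exactly to call this inside aligner.py is left open.
import unicodedata

def strip_punct(sent: str) -> str:
    """Remove all Unicode punctuation (covers ASCII and CJK marks alike)."""
    return "".join(
        ch for ch in sent if not unicodedata.category(ch).startswith("P")
    ).strip()

src_sents = ["Hello, world!", "How are you?"]
tgt_sents = ["你好，世界！", "你好吗？"]

src_clean = [strip_punct(s) for s in src_sents]
tgt_clean = [strip_punct(s) for s in tgt_sents]
print(src_clean)  # ['Hello world', 'How are you']
print(tgt_clean)  # ['你好世界', '你好吗']
```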

Instead of tweaking the preprocessing, I think using a different sentence similarity measure may improve the alignment accuracy. Currently, Bertalign calculates the similarity between sentence pairs based on sentence embeddings. However, recent studies (Zhang et al., 2019; Wang & Yu, 2023) show that token-level similarity performs better on Semantic Textual Similarity tasks.
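
For illustration only, token-level similarity could be computed with the bert-score package. This is just a sketch: the choice of a multilingual model (and its layer) is an assumption, and wiring the scores into Bertalign's alignment search would still require changes to aligner.py.

```python
# Sketch of BERTScore-style token-level similarity for sentence pairs.
# Assumes the bert-score package is installed and that
# bert-base-multilingual-cased (layer 9) is a reasonable cross-lingual choice.
from bert_score import score

src_sents = ["The cat sat on the mat."]
tgt_sents = ["Le chat était assis sur le tapis."]

P, R, F1 = score(
    src_sents,
    tgt_sents,
    model_type="bert-base-multilingual-cased",
    num_layers=9,
)
print(F1)  # one token-level similarity score per sentence pair
```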

References

Wang, H. and Yu, D. 2023. Going Beyond Sentence Embeddings: A Token-Level Matching Algorithm for Calculating Semantic Textual Similarity. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 563–570.

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. and Artzi, Y. 2019. BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.

matgille pushed a commit to matgille/mutilingual_collator that referenced this issue May 23, 2024