Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change parametrization #5

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

ChristianGeng
Copy link

Thanks for the nice package! Following up on the issue concerning preprocessing suggestions, I have implemented an alternative parametrization that I would like to discuss. I hope you have time to discuss this,

Parametrize the shape in which text comes differently

The implementation has been used the is_split boolean flag to determine the form in which the input comes along.

As discussed already in the issue concerning preprocessing suggestions, it sometimes might be useful to have other options in which the data are passed to Bertalign.

In my special case it turns out that it is better to pass over src and target as lists. This comes from the fact that I need to postprocess the data outside of Bertalign. Passing lists avoids some idempotency issues that I have seen. I am not going into deta
ils here can of course if needed.

So I would feel better to reparamrtrize the is_split into a (ternary) split_type option:

split_type is_split equivalent description default
raw is_split=False splitting has to be done *
lines is_split=True
tokenized n.a. tokenized sentences as lists

Parametrize src and target languanges differently

Allow pass get src and target languages as parameters. The current implementation relies on google translate to detect language id which is an external dependency.

In order not to remove it, I have added very inelegant code that keeps the parametrization intact as much as possible

Afaics, the language id is only used when using split_type=='lines' resp. is_split=True. So maybe there is a better alternative?

Tests

I have also added basic unit tests to show that the parametrization is ok . These can be run using pytest -sv tests/test_results.py after having the test requirements installed, assuming that the package is installed - what I have done using pip install -e \ ..

matgille pushed a commit to matgille/mutilingual_collator that referenced this pull request May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant