Add language support to the OCR manager #7465

Merged
merged 14 commits into from
Dec 2, 2024


Joao-vi
Collaborator

@Joao-vi Joao-vi commented Nov 21, 2024

fixes #7439

It's not very clear in the code, but the Uwazi context and the OCR context use different language codes:
Uwazi = ISO639_3
OCR = ISO639_1

We were using getLanguage, which I believe was meant to convert Uwazi language codes to Elasticsearch language codes. Because Ukrainian is not supported in Elasticsearch, it was returning null and the OCR service was never executed.

I've created LanguageCodeMapper, which can convert between language codes.
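To illustrate the idea, here is a minimal sketch of what such a mapper could look like. The class name `LanguageCodeMapperSketch`, its method names, and the four-language table are assumptions made for this example; they are not taken from the merged code.

```typescript
// Hypothetical sketch: the class name, method names, and table contents are
// illustrative assumptions, not the code merged in this PR.

// Illustrative subset of [ISO639_3, ISO639_1] pairs.
const LANGUAGE_PAIRS: Array<[string, string]> = [
  ['eng', 'en'],
  ['spa', 'es'],
  ['fra', 'fr'],
  ['ukr', 'uk'],
];

const iso3ToIso1 = new Map<string, string>(LANGUAGE_PAIRS);
const iso1ToIso3 = new Map<string, string>(
  LANGUAGE_PAIRS.map(([iso3, iso1]): [string, string] => [iso1, iso3])
);

class LanguageCodeMapperSketch {
  // Uwazi code (ISO639_3) -> OCR code (ISO639_1), or null when unmapped.
  static toISO639_1(code: string): string | null {
    return iso3ToIso1.get(code) ?? null;
  }

  // OCR code (ISO639_1) -> Uwazi code (ISO639_3), or null when unmapped.
  static toISO639_3(code: string): string | null {
    return iso1ToIso3.get(code) ?? null;
  }
}

console.log(LanguageCodeMapperSketch.toISO639_1('ukr')); // uk
console.log(LanguageCodeMapperSketch.toISO639_3('es')); // spa
```

Because the lookup table is independent of Elasticsearch, a language that Elasticsearch does not analyze (such as Ukrainian in the bug described above) can still be resolved for OCR.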

A better solution would be to make the OCR models and properties more explicit in the code, so we can draw a well-defined boundary between what belongs to Uwazi and what belongs to OCR.

With those well-defined models we can create an Anti-Corruption Layer, and whenever we need a feature from OCR we go through this layer to get the OCR models.
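A rough sketch of how such a layer might look is shown below. Everything here is hypothetical: the `UwaziFile` and `OcrRequest` shapes, the `OcrAntiCorruptionLayer` class, and the inline mapping function are assumptions for illustration and do not describe existing Uwazi or OCR service code.

```typescript
// Hypothetical sketch of an Anti-Corruption Layer between Uwazi and OCR.
// All type and class names below are illustrative assumptions.

// Uwazi-side model: language is an ISO639_3 code.
interface UwaziFile {
  filename: string;
  language: string;
}

// OCR-side model: language is an ISO639_1 code.
interface OcrRequest {
  filename: string;
  language: string;
}

type LanguageTranslator = (uwaziCode: string) => string | null;

class OcrAntiCorruptionLayer {
  private readonly toOcrLanguage: LanguageTranslator;

  constructor(toOcrLanguage: LanguageTranslator) {
    this.toOcrLanguage = toOcrLanguage;
  }

  // Translate a Uwazi model into the OCR model, failing loudly instead of
  // silently skipping OCR when the language cannot be translated.
  toOcrRequest(file: UwaziFile): OcrRequest {
    const language = this.toOcrLanguage(file.language);
    if (language === null) {
      throw new Error(`Language not supported by OCR: ${file.language}`);
    }
    return { filename: file.filename, language };
  }
}

// Illustrative translator with a tiny table; a real one would cover more codes.
const sampleCodes: { [iso3: string]: string } = { eng: 'en', ukr: 'uk' };
const sampleTranslator: LanguageTranslator = code =>
  Object.prototype.hasOwnProperty.call(sampleCodes, code) ? sampleCodes[code] : null;

const ocrLayer = new OcrAntiCorruptionLayer(sampleTranslator);
console.log(ocrLayer.toOcrRequest({ filename: 'report.pdf', language: 'ukr' }));
```

Throwing on an unmapped language is a design choice in this sketch: it makes the kind of failure described above visible, rather than letting a null quietly prevent the OCR call.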

@Joao-vi Joao-vi requested review from daneryl and RafaPolit November 21, 2024 18:17
Collaborator

@daneryl daneryl left a comment

I think we are good to go with this one 👍
The whole languages list and how we use it in every component is still very confusing. Let's merge this one and try to fix that part in the next issue?

Collaborator Author

Joao-vi commented Dec 2, 2024

> I think we are good to go with this one 👍 The whole languages list and how we use it in every component is still very confusing. Let's merge this one and try to fix that part in the next issue?

Yes, let me know if you want me to create a new issue addressing that.

Update: new issue addressing that #7514

@Joao-vi Joao-vi merged commit 54a4a29 into production Dec 2, 2024
18 of 19 checks passed
@Joao-vi Joao-vi deleted the refactor/ocr-manager-language-support branch December 2, 2024 11:36
RafaPolit pushed a commit that referenced this pull request Dec 11, 2024
* add LanguageCodeMapper

* add ukrainian language to elastic search

* simplify languages support

* update languageSchema

* revert changes

* use legacy elastic mapping

* fix language code

* fix test case

* add other language to available languages array

* fix test errors

---------

Co-authored-by: Joan Gallego Girona <daneryl@gmail.com>
2 participants