Add language support to the OCR manager #7465

Joao-vi · 2024-11-21T18:17:02Z

It's not obvious from the code, but the Uwazi context and the OCR context use different language codes:
Uwazi = ISO639_3
OCR = ISO639_1

We were using getLanguage, which I believe was meant to convert Uwazi language codes to Elasticsearch language codes. Because Ukrainian is not supported in Elasticsearch, it was returning null and the OCR service was not being executed.

I've created LanguageCodeMapper, which can convert between language codes.
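To illustrate the idea, here is a minimal sketch of what such a mapper could look like. It is not the LanguageCodeMapper merged in this PR: the function names `toISO639_1` and `toISO639_3`, the `UnsupportedLanguageError` class, and the four-entry table are all hypothetical. The one deliberate choice is that an unknown code throws instead of returning null, so an unsupported language cannot silently skip OCR.

```typescript
// Hypothetical sketch; names and table contents are assumptions, not the PR's code.

// Illustrative subset of ISO 639-3 -> ISO 639-1 pairs.
const ISO639_3_TO_ISO639_1: Record<string, string> = {
  eng: 'en',
  spa: 'es',
  fra: 'fr',
  ukr: 'uk',
};

// Build the reverse table once so both directions stay in sync.
const ISO639_1_TO_ISO639_3: Record<string, string> = {};
for (const iso3 of Object.keys(ISO639_3_TO_ISO639_1)) {
  ISO639_1_TO_ISO639_3[ISO639_3_TO_ISO639_1[iso3]] = iso3;
}

class UnsupportedLanguageError extends Error {}

// Uwazi (ISO 639-3) -> OCR (ISO 639-1).
function toISO639_1(iso3: string): string {
  const iso1 = ISO639_3_TO_ISO639_1[iso3];
  if (iso1 === undefined) {
    throw new UnsupportedLanguageError(`No ISO 639-1 code known for "${iso3}"`);
  }
  return iso1;
}

// OCR (ISO 639-1) -> Uwazi (ISO 639-3).
function toISO639_3(iso1: string): string {
  const iso3 = ISO639_1_TO_ISO639_3[iso1];
  if (iso3 === undefined) {
    throw new UnsupportedLanguageError(`No ISO 639-3 code known for "${iso1}"`);
  }
  return iso3;
}
```

With this table, `toISO639_1('ukr')` yields `'uk'` regardless of whether Elasticsearch has an analyzer for Ukrainian, which is the decoupling the fix is after.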

A better solution would be to make the OCR models and properties more explicit in code, so we can draw a well-defined boundary between what belongs to Uwazi and what belongs to OCR.

With those well-defined models we can create an Anti-Corruption Layer, and whenever we need a feature from OCR we go through this Anti-Corruption Layer to get the OCR models.
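As a rough sketch of that proposal (nothing like this exists in the PR): the `UwaziFile` and `OcrTask` shapes, the `OcrAntiCorruptionLayer` class, and the language table below are hypothetical and not taken from the Uwazi or OCR codebases. The point is only that each side gets its own model and a single place translates between them.

```typescript
// Hypothetical models; field names are assumptions for illustration only.

// Uwazi-side model: language is an ISO 639-3 code.
interface UwaziFile {
  filename: string;
  language: string;
}

// OCR-side model: language is an ISO 639-1 code.
interface OcrTask {
  filename: string;
  language: string;
}

// Illustrative subset of the Uwazi -> OCR language translation.
const UWAZI_TO_OCR_LANGUAGE: Record<string, string> = {
  eng: 'en',
  spa: 'es',
  ukr: 'uk',
};

// The only place that knows both vocabularies.
class OcrAntiCorruptionLayer {
  toOcrTask(file: UwaziFile): OcrTask {
    const language = UWAZI_TO_OCR_LANGUAGE[file.language];
    if (language === undefined) {
      throw new Error(`No OCR language for Uwazi code "${file.language}"`);
    }
    return { filename: file.filename, language };
  }
}
```

Callers on the Uwazi side would build an `OcrTask` only through `toOcrTask`, so an ISO 639-3 code never reaches the OCR service by accident.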


daneryl

I think we are good to go with this one 👍
The whole languages list, and how we use it in every component, is still very confusing. Let's merge this one and try to fix that part in the next issue?

Joao-vi · 2024-12-02T11:14:03Z

> I think we are good to go with this one 👍 The whole languages list, and how we use it in every component, is still very confusing. Let's merge this one and try to fix that part in the next issue?

Yes, let me know if you want me to create a new issue addressing that.

Update: I opened a new issue addressing that: #7514

* add LanguageCodeMapper
* add ukrainian language to elastic search
* simplify languages support
* update languageSchema
* revert changes
* use legacy elastic mapping
* fix language code
* fix test case
* add other language to available languages array
* fix test errors

---------

Co-authored-by: Joan Gallego Girona <daneryl@gmail.com>

add LanguageCodeMapper

bc3e953

Joao-vi requested review from daneryl and RafaPolit November 21, 2024 18:17

Joao-vi and others added 13 commits November 25, 2024 15:32

add ukrainian language to elastic search

95477c5

Merge branch 'production' of github.com:huridocs/uwazi into refactor/ocr-manager-language-support

3e96d44

simplify languages support

b50c30d

update languageSchema

ca3cc49

Merge branch 'production' of github.com:huridocs/uwazi into refactor/ocr-manager-language-support

34ffd6d

revert changes

5f27890

use legacy elastic mapping

856a95f

fix language code

a52b9d2

fix test case

f360b23

Merge branch 'production' of github.com:huridocs/uwazi into refactor/ocr-manager-language-support

48cd9b4

add other language to available languages array

2f30bf2

fix test errors

3a9b545

Merge branch 'production' into refactor/ocr-manager-language-support

0577c33

daneryl approved these changes Dec 2, 2024


Joao-vi merged commit 54a4a29 into production Dec 2, 2024
18 of 19 checks passed

Joao-vi deleted the refactor/ocr-manager-language-support branch December 2, 2024 11:36
