Use these options in order to ameliorate the results on Arabic datasets #279

Tailor2019 · 2021-09-26T17:24:54Z

Hello!
@ChWick
Do these options affect the results when training Calamari on an Arabic dataset, and is it advisable to set them in the training command?
--data.pre_proc.processors.0.modes
--data.pre_proc.processors.1.modes
--data.pre_proc.processors.1.extra_params
--data.pre_proc.processors.1.line_height
--data.pre_proc.processors.2.modes
--data.pre_proc.processors.2.normalize DATA.PRE_PROC.PROCESSORS.2.NORMALIZE
--data.pre_proc.processors.2.invert DATA.PRE_PROC.PROCESSORS.2.INVERT
--data.pre_proc.processors.2.transpose DATA.PRE_PROC.PROCESSORS.2.TRANSPOSE
--data.pre_proc.processors.2.pad DATA.PRE_PROC.PROCESSORS.2.PAD
--data.pre_proc.processors.2.pad_value DATA.PRE_PROC.PROCESSORS.2.PAD_VALUE
--data.pre_proc.processors.3.modes
--data.pre_proc.processors.3.bidi_direction {LTR,RTL,AUTO,L,R,auto}
--data.pre_proc.processors.4.modes
--data.pre_proc.processors.5.modes
--data.pre_proc.processors.5.unicode_normalization DATA.PRE_PROC.PROCESSORS.5.UNICODE_NORMALIZATION
--data.pre_proc.processors.6.modes
--data.pre_proc.processors.6.replacement_groups
--data.pre_proc.processors.7.modes
Thanks for your reply

andbue · 2021-09-27T09:55:57Z

The only thing that is worth worrying about when training on Arabic data is the bidi_direction. Most of the time, "auto" works just fine as it identifies strings containing only RTL or neutral characters and correctly sets the writing direction. There may be, however, some strings that can't be identified easily as RTL or LTR. In those cases, the algorithm defaults to LTR if you don't set bidi_direction=RTL:

>>> from bidi.algorithm import get_display
>>> list(get_display("سلام"))
['م', 'ا', 'ل', 'س']
>>> list(get_display("(1) 125", base_dir="L"))
['(', '1', ')', ' ', '1', '2', '5']
>>> list(get_display("(1) 125", base_dir="R"))
['1', '2', '5', ' ', '(', '1', ')']
>>> list(get_display("(1) 125"))
['(', '1', ')', ' ', '1', '2', '5']

If you know that all of your text is RTL, the safe option is to set bidi_direction=RTL.
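To get a feeling for how many lines in a dataset fall into the ambiguous case above, one rough approach is to look for the first character with a strong direction. This is only a minimal sketch using the standard library's unicodedata module: first_strong_direction is a hypothetical helper, not part of Calamari or python-bidi, and it only approximates how a bidi implementation picks a base direction.

```python
import unicodedata

# Bidi classes that carry a strong direction in the Unicode Character Database.
STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def first_strong_direction(text):
    """Return "LTR" or "RTL" for the first strongly directional character,
    or None if the string contains only neutral or weak characters.

    Hypothetical helper for inspecting ground truth; not Calamari's own code.
    """
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in STRONG_LTR:
            return "LTR"
        if bidi_class in STRONG_RTL:
            return "RTL"
    return None


if __name__ == "__main__":
    for line in ["سلام", "(1) 125", "abc"]:
        print(repr(line), first_strong_direction(line))
```

Lines for which this returns None (such as "(1) 125") give the automatic detection nothing to go on. If a dataset contains many of them and all of the text is known to be RTL, that is an argument for setting bidi_direction=RTL explicitly.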

Tailor2019 · 2021-09-27T15:16:42Z

@andbue
Thanks for your reply.
Do these options have an effect on the architecture if we change their default values?
--data.pre_proc.processors.0.modes
--data.pre_proc.processors.1.modes
--data.pre_proc.processors.1.extra_params
--data.pre_proc.processors.1.line_height
--data.pre_proc.processors.2.modes
--data.pre_proc.processors.2.normalize DATA.PRE_PROC.PROCESSORS.2.NORMALIZE
--data.pre_proc.processors.2.invert DATA.PRE_PROC.PROCESSORS.2.INVERT
--data.pre_proc.processors.2.transpose DATA.PRE_PROC.PROCESSORS.2.TRANSPOSE
--data.pre_proc.processors.2.pad DATA.PRE_PROC.PROCESSORS.2.PAD
--data.pre_proc.processors.2.pad_value DATA.PRE_PROC.PROCESSORS.2.PAD_VALUE
--data.pre_proc.processors.3.modes
Thanks in advance!

andbue · 2021-09-28T09:25:44Z

They don't have any effect on the network architecture (if that is what you mean). The only thing coming close to that might be the line_height parameter that changes the height in pixels the line images are scaled to (defaults to 48). The rest only affect the image preprocessing (center_normalizer parameters, image normalization, inverting, transposing, padding).
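To illustrate why line_height changes the input size but not the layers, here is a small sketch of the size calculation. It assumes that lines are scaled to the target height while keeping their aspect ratio; that is an assumption for illustration, and scaled_size is a hypothetical helper, not a Calamari function.

```python
def scaled_size(width, height, line_height=48):
    """Return (new_width, new_height) after scaling a line image so that it is
    line_height pixels tall, assuming the aspect ratio is preserved.

    Hypothetical helper for illustration only.
    """
    if height <= 0:
        raise ValueError("height must be positive")
    scale = line_height / height
    # Keep at least one pixel column so the result is never empty.
    new_width = max(1, round(width * scale))
    return new_width, line_height


if __name__ == "__main__":
    # A 1200x96 scan and a 300x24 scan end up with the same size.
    print(scaled_size(1200, 96))
    print(scaled_size(300, 24))
```

Under this assumption every line reaches the network with the same height, and only the width varies from line to line.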

Tailor2019 · 2021-09-28T18:03:44Z

Thanks a lot for your reply!
@andbue
Do the numbers 0, 1, 2, 3 refer to the preprocessing of the image in layer 0 (conv) and so on?
For example, if I change --data.pre_proc.processors.2.pad to 32, will it have a large effect on my system?

andbue · 2021-09-29T09:06:45Z

No, the numbers are only there to put the preprocessing functions in the correct order. If you set "pad" to 32, it will add 32 columns of empty pixels (each line_height pixels tall) to each side of the text line, instead of the default 16, if I'm not mistaken. Nothing more.
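As a rough picture of what the pad option does, the following sketch pads a line image on its left and right side. It is written in plain Python on a list of pixel rows; pad_line is a hypothetical helper and not the actual Calamari implementation, which works on arrays and may order its steps differently.

```python
def pad_line(image, pad=16, pad_value=0):
    """Add `pad` columns filled with `pad_value` to the left and right of a
    line image.

    `image` is a list of rows, each row a list of pixel values.
    Hypothetical helper for illustration only.
    """
    if pad < 0:
        raise ValueError("pad must not be negative")
    border = [pad_value] * pad
    return [border + list(row) + border for row in image]


if __name__ == "__main__":
    line = [[1, 2, 3] for _ in range(4)]  # 4 pixels tall, 3 pixels wide
    padded = pad_line(line, pad=2)
    print(len(padded), len(padded[0]))  # height is unchanged, width grows by 2 * pad
```

The height of the line stays the same; only empty columns are added before and after the text.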

Tailor2019 · 2021-09-30T04:36:54Z

Thanks for your reply!
@andbue
For example, for preprocessing function number 2 there are three options:
--data.pre_proc.processors.2.modes
--data.pre_proc.processors.2.normalize DATA.PRE_PROC.PROCESSORS.2.NORMALIZE
--data.pre_proc.processors.2.invert DATA.PRE_PROC.PROCESSORS.2.INVERT
Why do all of these options share the number "2" instead of each having its own number?
What is the role of the option "--data.pre_proc.processors.2.modes"?
Thanks in advance!

ChWick · 2021-09-30T07:02:10Z

During preprocessing there is a (customizable) list of preprocessors that are applied to the line images. Each of the preprocessors has an ID (the number in the command line arguments). Some have additional parameters (e.g. the NormalizeProcessor that can invert/transpose/pad... images).
The modes parameter is valid for every processor and states when to apply it (Training, Evaluation, Prediction). By default, a processor is always applied, but there are processors, for example DataAugmentation, that should only be applied during training.
The defaults are already sane, so you should not change these settings unless you know what you are doing.
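The idea of an ordered processor list with per-processor modes can be sketched as follows. This is a simplified illustration, not Calamari's code: Mode, Processor, run_pipeline and PIPELINE are hypothetical names, and plain strings stand in for line images and ground truth text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, FrozenSet, List


class Mode(Enum):
    TRAINING = "training"
    EVALUATION = "evaluation"
    PREDICTION = "prediction"


ALL_MODES = frozenset(Mode)


@dataclass
class Processor:
    name: str
    func: Callable[[str], str]
    # By default a processor is applied in every mode.
    modes: FrozenSet[Mode] = ALL_MODES


def run_pipeline(processors: List[Processor], sample: str, mode: Mode) -> str:
    """Apply the processors in list order, skipping those whose modes do not
    include the current mode. The list index plays the role of the ID."""
    for processor in processors:
        if mode in processor.modes:
            sample = processor.func(sample)
    return sample


PIPELINE = [
    Processor("strip", str.strip),
    # Stand-in for data augmentation: only active during training.
    Processor("augment", lambda s: s + "*", modes=frozenset({Mode.TRAINING})),
]

if __name__ == "__main__":
    print(run_pipeline(PIPELINE, "  abc ", Mode.TRAINING))
    print(run_pipeline(PIPELINE, "  abc ", Mode.PREDICTION))
```

In this sketch the options of one processor all share its list index, and changing modes only decides in which scenarios that processor runs.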

Tailor2019 · 2021-09-30T10:00:54Z

Thanks a lot for this explanation!
@ChWick
From the documentation we can guess that there are 8 preprocessors, but is the role of the modes parameter to activate the appropriate preprocessor?
What is its effect?
For example, take the preprocessors "--data.pre_proc.processors.5.modes" and "--data.pre_proc.processors.4.modes":
what does the modes parameter contribute in these two preprocessors?
Thanks in advance!

andbue · 2021-09-30T11:24:25Z

The modes parameter, as @ChWick already stated, activates the processor for a specific scenario, e.g. for training or for prediction. You would only need to change it if your input data were already preprocessed (e.g. your images are already normalized, inverted and transposed or your text is already transformed into display order).

It is a bit harder to see which of the default preprocessors corresponds to which function. To find out, try something like

>>> from calamari_ocr.ocr.dataset.data import Data
>>> params = Data.default_params()
>>> list(enumerate(params.pre_proc.processors))
[
(0, CenterNormalizerProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}, extra_params=(4, 1.0, 0.3), line_height=-1)), 
(1, FinalPreparationProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}, normalize=True, invert=True, transpose=True, pad=16, pad_value=0)), 
(2, BidiTextProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.TARGETS: 'targets'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}, bidi_direction=<BidiDirection.AUTO: 'auto'>)), 
(3, StripTextProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.TARGETS: 'targets'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>})), 
(4, TextNormalizerProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.TARGETS: 'targets'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}, unicode_normalization='NFC')), 
(5, TextRegularizerProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.TARGETS: 'targets'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}, replacement_groups=[<ReplacementGroup.Spaces: 'spaces'>], replacements=None)), 
(6, AugmentationProcessorParams(modes={<PipelineMode.TRAINING: 'training'>}, augmenter=DefaultDataAugmenterParams(), n_augmentations=0)), 
(7, PrepareSampleProcessorParams(modes={<PipelineMode.TRAINING: 'training'>, <PipelineMode.PREDICTION: 'prediction'>, <PipelineMode.EVALUATION: 'evaluation'>}))
]

The code for each of these preprocessors can be found in /imageprocessors (0, 1, 6, 7) or /textprocessors (2-5) at https://github.com/Calamari-OCR/calamari/tree/master/calamari_ocr/ocr/dataset.
@ChWick : maybe it would be helpful if paiargparse could somehow include the name of the preprocessor classes in the help strings?

twerkmeister mentioned this issue Nov 29, 2021

Exchanging preprocessor #297

Open

bertsky closed this as completed Oct 2, 2024
