Support Python 3.9+ #304

Closed
mikegerber opened this issue Feb 23, 2022 · 20 comments

Comments

@mikegerber
Contributor

The current requirements.txt wants TensorFlow 2.4.x - which is not available on PyPI for Python 3.9+.

(Side note: We have been using TensorFlow >= 2.5.0 with Calamari 1.0.x for this reason.)

@andbue
Member

andbue commented Feb 23, 2022

Alright, thanks for the note! I just uploaded calamari version 2.2.0 to PyPI. Let's hope we don't get too many problems with models being unable to update between tf 2.4 and 2.5...

@mikegerber
Contributor Author

mikegerber commented Feb 23, 2022

I have not had any issues with my Calamari 1.0 models with respect to TF 2.4 or TF 2.5, but we're going to have to test anyway.

Learning from my experience with combinations of PyPI TensorFlow versions and Python versions, I'll also update the ocrd_calamari tests to test on all Python versions 3.7+ (3.6 is EOL and AFAICT also not compatible with tfaip)

@bertsky
Collaborator

bertsky commented Feb 23, 2022

BTW, should there be problems with deserializing HDF5 models in other Python / TF releases, consider switching to SavedModel format. Conversion is as simple as an interactive load+save session (with a path name without the .h5 suffix). I have recently done it for ocrd_anybaseocr's tiseg and layout-analysis models and got them working under newer Pythons.
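The load+save conversion described above could look roughly like the sketch below. It is not Calamari's own converter. It assumes TensorFlow 2.x with the bundled Keras 2, where saving to a path without the `.h5` suffix writes the SavedModel format, and it has to run in an environment where the HDF5 file still loads. The helper names and the example path are hypothetical, and models with custom layers would additionally need `custom_objects`.

```python
from pathlib import Path


def savedmodel_dir_for(h5_path: str) -> Path:
    """Return the target path: the same path without the .h5 suffix."""
    path = Path(h5_path)
    if path.suffix != ".h5":
        raise ValueError(f"expected a .h5 file, got {h5_path!r}")
    return path.with_suffix("")


def convert_h5_to_savedmodel(h5_path: str) -> Path:
    """Load an HDF5 Keras model and re-save it as a SavedModel directory."""
    # Imported lazily so the path helper works without TensorFlow installed.
    import tensorflow as tf

    target = savedmodel_dir_for(h5_path)
    # compile=False skips restoring the optimizer state, which is not
    # needed for prediction.
    model = tf.keras.models.load_model(h5_path, compile=False)
    # Assumption: TF 2.x / Keras 2, where a path without the .h5 suffix
    # selects the SavedModel format.
    model.save(str(target))
    return target


# Hypothetical file name, shown for the path handling only.
print(savedmodel_dir_for("models/example.h5"))
```

Calling `convert_h5_to_savedmodel("models/example.h5")` would then write a `models/example` directory next to the original file.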

@mikegerber
Contributor Author

> Alright, thanks for the note! I just uploaded calamari version 2.2.0 to PyPI.

I think this release is missing here on GitHub!

@mikegerber
Contributor Author

Can't install Calamari 2.2.2 on Python 3.10: it depends on tfaip==1.2.6, which in turn depends on tensorflow<2.7.0,>=2.4.0, but PyPI only has 2.8.0+ available for Python 3.10.
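To make the conflict concrete, here is a minimal stdlib-only sketch of the range check a resolver has to perform. The list of available versions is a hypothetical stand-in for "2.8.0+", not a PyPI query, and the parser only handles plain `X.Y.Z` versions. Real tooling would use something like `packaging.specifiers.SpecifierSet` instead.

```python
from typing import Iterable, List, Tuple

Version = Tuple[int, ...]


def parse(version: str) -> Version:
    """Turn '2.6.5' into (2, 6, 5); pre-release suffixes are not handled."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version < upper."""
    return parse(lower) <= parse(version) < parse(upper)


def installable(available: Iterable[str], lower: str, upper: str) -> List[str]:
    """Return the available versions inside the half-open range."""
    return [v for v in available if satisfies(v, lower, upper)]


# Hypothetical availability for Python 3.10, standing in for "2.8.0+".
available_py310 = ["2.8.0", "2.9.0", "2.10.0"]

# tfaip==1.2.6 requires tensorflow<2.7.0,>=2.4.0
print(installable(available_py310, "2.4.0", "2.7.0"))  # prints []
```

An empty result means no TensorFlow wheel can satisfy the pin on that interpreter, which is why the install fails.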

@bertsky
Collaborator

bertsky commented Aug 15, 2023

> BTW, should there be problems with deserializing HDF5 models in other Python / TF releases, consider switching to SavedModel format.

So since with newer Python versions (>=3.8) the old HDF5 models don't load anymore (Bad marshal data), we do need the SavedModel format conversion for all published 2.x models now (calamari_models, calamari_models_experimental).

Luckily, @andbue already solved this in #321 in the usual on-load on-demand converter – fantastic, thanks!
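For background on the error message: the thread does not spell out the cause, but Keras stores plain Python functions (for example in Lambda layers) as `marshal`-serialized bytecode, and the `marshal` format is specific to the Python version. Whether that is what happens in the Calamari models is an assumption here. The sketch below only shows the mechanism with the standard library, using a toy function and a deliberately invalid byte string.

```python
import marshal
import types


def double(x):
    return 2 * x


# Serialize the function's bytecode, similar in spirit to what Keras does
# for Lambda-layer functions (assumption: the models contain such functions).
payload = marshal.dumps(double.__code__)

# Within the same interpreter the round trip works.
restored = types.FunctionType(marshal.loads(payload), globals())
print(restored(21))  # prints 42


def try_load(data: bytes) -> str:
    """Report how marshal.loads reacts to the given bytes."""
    try:
        marshal.loads(data)
        return "ok"
    except (ValueError, EOFError, TypeError) as exc:
        return f"{type(exc).__name__}: {exc}"


# A byte that is not a valid marshal type code is rejected on load, which
# is the kind of failure a newer interpreter reports for bytecode written
# by an older one.
print(try_load(b"!"))
```

Because the bytecode is only valid for the interpreter that wrote it, loading has to happen once under the original Python version, after which the model can be re-saved in a format that does not embed bytecode.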

@mikegerber
Contributor Author

The Calamari 1.0.x (!) branch works with Python 3.7-3.11. There is only a small issue with old 1.0.x models on Python 3.11; the model upgrade procedure is described in OCR-D/ocrd_calamari#91 (the replacement regexes in the model now need to have their global flags at the front).
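The regex change refers to Python 3.11 rejecting inline global flags such as `(?i)` anywhere but at the start of a pattern. The sketch below illustrates one way to hoist such flags. It is not the upgrade procedure from OCR-D/ocrd_calamari#91, the function name is hypothetical, and it ignores corner cases such as escaped parentheses or character classes that happen to contain `(?i)`.

```python
import re

# Matches global inline flag groups like (?i) or (?ms); scoped groups such
# as (?i:...) are left alone because the flags are not followed by ")".
_GLOBAL_FLAGS = re.compile(r"\(\?([aiLmsux]+)\)")


def hoist_global_flags(pattern: str) -> str:
    """Move all global inline flags to the front of the pattern."""
    found = "".join(_GLOBAL_FLAGS.findall(pattern))
    if not found:
        return pattern
    # Deduplicate while keeping first-seen order.
    flags = "".join(dict.fromkeys(found))
    body = _GLOBAL_FLAGS.sub("", pattern)
    return f"(?{flags}){body}"


print(hoist_global_flags("abc(?i)"))  # prints (?i)abc
print(bool(re.compile(hoist_global_flags("abc(?i)")).match("ABC")))  # prints True
```

Before 3.11 a global flag applied to the whole expression wherever it appeared, so moving it to the front should not change what the pattern matches.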

@jbarth-ubhd

jbarth-ubhd commented May 27, 2024

I get ValueError: bad marshal data (unknown type code) with

calamari-predict --checkpoint ~/calamari_models-2.0/gt4histocr/*.ckpt.json --data.images *.tif

calamari_models-2.0 is from 2.0.zip linked from README

(calamari-ocr) xx@yyy:~/calamari-ocr> pip3 list
Package                 Version
----------------------- -----------
absl-py                 0.15.0
adabelief-tf            0.2.1
appdirs                 1.4.4
astunparse              1.6.3
cachetools              4.2.4
calamari-ocr            2.2.2
certifi                 2024.2.2
charset-normalizer      3.3.2
clang                   5.0
colorama                0.4.6
dataclasses-json        0.5.5
edit-distance           1.0.6
editdistance            0.8.1
et-xmlfile              1.1.0
flatbuffers             1.12
gast                    0.4.0
gitdb                   4.0.11
GitPython               3.1.43
google-auth             1.35.0
google-auth-oauthlib    0.4.6
google-pasta            0.2.0
grpcio                  1.64.0
h5py                    3.1.0
idna                    3.7
imageio                 2.34.1
importlib_metadata      7.1.0
keras                   2.6.0
Keras-Preprocessing     1.1.2
Levenshtein             0.25.1
lxml                    5.2.2
Markdown                3.6
MarkupSafe              2.1.5
marshmallow             3.21.2
marshmallow-enum        1.5.1
mypy-extensions         1.0.0
networkx                3.1
nptyping                1.4.4
numpy                   1.19.5
oauthlib                3.2.2
opencv-python-headless  4.9.0.80
openpyxl                3.1.2
opt-einsum              3.3.0
packaging               24.0
paiargparse             1.1.2
pandas                  1.4.4
pillow                  10.3.0
pip                     24.0
pkg_resources           0.0.0
pooch                   1.4.0
prettytable             3.10.0
protobuf                3.19.6
pyasn1                  0.6.0
pyasn1_modules          0.4.0
python-bidi             0.4.2
python-dateutil         2.9.0.post0
python-Levenshtein      0.25.1
pytz                    2024.1
PyWavelets              1.4.1
rapidfuzz               3.9.1
requests                2.32.2
requests-oauthlib       2.0.0
rsa                     4.9
scikit-image            0.19.3
scipy                   1.10.1
setuptools              44.0.0
six                     1.15.0
smmap                   5.0.1
tabulate                0.9.0
tensorboard             2.6.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
tensorflow              2.6.5
tensorflow-addons       0.16.1
tensorflow-estimator    2.6.0
termcolor               1.1.0
tfaip                   1.2.6
tifffile                2023.7.10
tqdm                    4.66.4
typeguard               2.13.3
typing-extensions       3.10.0.2
typing-inspect          0.9.0
typish                  1.9.3
urllib3                 2.2.1
wcwidth                 0.2.13
Werkzeug                3.0.3
wheel                   0.43.0
wrapt                   1.12.1
xlrd                    1.2.0
XlsxWriter              3.2.0
zipp                    3.19.0

@jbarth-ubhd

PS: using Python 3.8.10 from Ubuntu 20.04

@jbarth-ubhd

jbarth-ubhd commented May 27, 2024

Just tried with Ubuntu18.04 (in schroot) & python3.7:

(calamari-ocr-ubuntu18.04-python3.7) jbjb@pers16:~/calamari-ocr-ubuntu18.04-python3.7> calamari-predict --version
calamari-predict v2.2.2
(calamari-ocr-ubuntu18.04-python3.7) jbjb@pers16:~/calamari-ocr-ubuntu18.04> calamari-predict
...
...
  File "/usr/lib/python3.7/typing.py", line 238, in <genexpr>
    for t2 in all_params - {t1}):
  File "/usr/lib/python3.7/abc.py", line 143, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
  File "/home/jb/calamari-ocr-ubuntu18.04/lib/python3.7/site-packages/typing_extensions.py", line 1545, in _proto_hook
    raise TypeError("Instance and class checks can only be used with"
TypeError: Instance and class checks can only be used with @runtime protocols

@jbarth-ubhd

With Ubuntu 18.04 & python3.6 I get calamari-predict v1.0.6 - how do I install 2.x?

@jbarth-ubhd

ubuntu 20.04 & python3.9: bad marshal data

@andbue
Member

andbue commented May 27, 2024

Have a look at #304 (comment), #356 and the demo notebook at https://github.com/andbue/calamari_demo. I think that loading the models with the python version they've been created with (3.7 in most cases) and a calamari containing the commits in #321 will convert them to the SavedModel format. The converted models should work in python 3.8.

@jbarth-ubhd

So this is not available with pip install?

@andbue
Member

andbue commented May 27, 2024

Unfortunately only in the current master branch: pip install git+https://github.com/Calamari-OCR/calamari.git

@jbarth-ubhd

ah the secret ingredient...

Are the models very position dependent? I tried using Abbyy-segmented image snippets with calamari2, and the output is much worse than with the OCR-D workflow using calamari(1).

@andbue
Member

andbue commented May 27, 2024

Dependent on the position of the text in the line snippet? There is a certain dependency concerning the height of the lines. I often have errors in the first line of the page if the line segment there contains a lot of empty space above the text. Could be the same problem here if abbyy creates overlapping line segments.

Also, the old models are trained on binarized input. Binarizing the images (ocropus-nlbin) does improve the results with these models. The OCR-D workflow might run the binarization automatically. Newer models like deep3_lsh4 in calamari_models_experimental are trained on grayscale (am I right, @chreul?).
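To illustrate what binarizing a line image means, here is a minimal pure-Python sketch using a global Otsu threshold. This is not the algorithm of ocropus-nlbin, which additionally normalizes the image and estimates its thresholds differently. It is a simpler, swapped-in technique on a toy row of gray values, and the function names are hypothetical.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the gray level (0..255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break  # all pixels are on the dark side
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels: Sequence[int]) -> List[int]:
    """Map pixels at or below the threshold to 0 (ink), the rest to 255."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]


# Toy data: three dark "ink" pixels and three bright "paper" pixels.
print(binarize([10, 12, 11, 200, 210, 205]))  # prints [0, 0, 0, 255, 255, 255]
```

For real scans a dedicated tool such as ocropus-nlbin is the better choice, since a single global threshold copes poorly with uneven illumination.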

@chreul
Member

chreul commented May 29, 2024

yes, new models like lsh4 were trained on various preprocessing outputs including different binarizations and also the normalized grayscale output of ocropus

@jbarth-ubhd

ocropus-nlbin helped a lot. Thanks.

@bertsky
Collaborator

bertsky commented Oct 2, 2024

> yes, new models like lsh4 were trained on various preprocessing outputs including different binarizations and also the normalized grayscale output of ocropus

@chreul but Ocropus' nrm is not really grayscale as you know, and AFAICS all the models use the Ocropus CenterNormalizer (line dewarper) as preprocessor, so actual grayscale images would quickly degrade the training as the dewarper would hallucinate weird center lines and therefore distort the input.

As a user, I don't even know how to deactivate the dewarper during training or prediction, yet.

But my question to you is: why has true grayscale not been done anywhere, not even in your Gothic handwriting models?

Anyway, the issue was about Py39 support and other dependencies, as well as HDF5 problems, which are solved as of Calamari 2.3

@bertsky bertsky closed this as completed Oct 2, 2024

5 participants