RFC: Tesseract 4.0.0 – open tasks #1423

Closed
stweil opened this issue Mar 24, 2018 · 194 comments

@stweil
Contributor

stweil commented Mar 24, 2018

I'd like to collect open tasks which should be addressed before tagging the official release 4.0.0.

These tasks are on my own list and to be discussed whether we consider them important for the new release or not:

  • Remove deprecated code. This does not include OpenCL or the old Tesseract engine.
  • Add --version parameter for all command line commands.
  • Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.
  • Add an option to select the implementation for the dot product (CPU, SSE, AVX, ...).
  • Relative includes for traineddata: tessedit_load_sublangs should search for the sublangs relative to the parent, not starting in tessdata dir.
  • Maybe more fixes for compiler warnings and issues reported by Coverity Scan.
  • (list still incomplete)
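
The relative lookup proposed for tessedit_load_sublangs could be sketched roughly as follows. This is only an illustration in shell: the function name sublang_path and the example paths are made up here and are not Tesseract code.

```shell
#!/bin/bash
# Hypothetical helper (not part of Tesseract): resolve a sub-language
# relative to the traineddata file that requested it, instead of
# starting the search in the top-level tessdata directory.
sublang_path() {
    local parent_file="$1" sublang="$2"
    printf '%s/%s.traineddata\n' "$(dirname "$parent_file")" "$sublang"
}

# Example with made-up paths: the sub-language is looked up next to its parent.
sublang_path /usr/share/tessdata/script/Devanagari.traineddata Latin
```

With this rule, a script model stored in a subdirectory would find its sub-languages in that same subdirectory.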
@amitdo
Collaborator

amitdo commented Mar 24, 2018

Add an option to select the implementation for the dot product (CPU, SSE, AVX, ...).

SSE and AVX are also done on CPU :)
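
A minimal sketch of what such a selection could look like, written as a shell function for illustration only. The function name select_dotproduct, the option values (auto, generic, sse, avx) and the fallback to the generic code are assumptions, not the Tesseract implementation; the CPU flag names are the ones Linux reports in /proc/cpuinfo.

```shell
#!/bin/bash
# Hypothetical sketch: choose a dot product implementation from a user
# request and the list of CPU feature flags. Unsupported or unknown
# requests fall back to the generic code path.
select_dotproduct() {
    local requested="$1" flags=" $2 "
    local has_sse=no has_avx=no
    case "$flags" in *" sse4_1 "*) has_sse=yes ;; esac
    case "$flags" in *" avx "*) has_avx=yes ;; esac

    case "$requested" in
        avx) [ "$has_avx" = yes ] && { echo avx; return; } ;;
        sse) [ "$has_sse" = yes ] && { echo sse; return; } ;;
        ""|auto)
            [ "$has_avx" = yes ] && { echo avx; return; }
            [ "$has_sse" = yes ] && { echo sse; return; }
            ;;
    esac
    echo generic
}

# On Linux, the flags of the first CPU can be read from /proc/cpuinfo.
if [ -r /proc/cpuinfo ]; then
    select_dotproduct auto "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi
```

An explicit request such as `select_dotproduct sse "$flags"` would make it possible to compare implementations on the same machine.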

@amitdo
Collaborator

amitdo commented Mar 24, 2018

Remove deprecated code. This does not include OpenCL or the old Tesseract engine.

Adding a compile option NO_LEGACY_OCR_ENGINE would be nice.

@amitdo
Collaborator

amitdo commented Mar 24, 2018

  • Fix the autotools build so that the debug mode uses -O0 as intended (instead of -O2).
    It can probably be adapted from "Refactor Autotools build" (#974).

I'll do it.

@Shreeshrii
Collaborator

Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.

My suggestion would be to leave --list-langs as is,

and add this as --list-langs-details

or as --list-lang-details for one language file based on lang-code.

@Shreeshrii
Collaborator

--list-langs should also display the directory it is using. This is useful when tessdata files are installed in multiple directories, e.g. by a PPA or Linux distribution versus a locally built version.
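
Until something like this exists in tesseract itself, a rough approximation is possible with a small wrapper script. The sketch below is an assumption-laden illustration: the function name list_langs_details is made up, and it relies on combine_tessdata -d (from the training tools) to print the components of each traineddata file, skipping that step if the tool is not installed.

```shell
#!/bin/bash
# Hypothetical wrapper: show the tessdata directory in use and, for each
# traineddata file in it, the component listing from combine_tessdata -d.
list_langs_details() {
    local dir="${1:-${TESSDATA_PREFIX:-.}}"
    local file lang
    echo "tessdata directory: $dir"
    for file in "$dir"/*.traineddata; do
        [ -e "$file" ] || continue      # no traineddata files found
        lang=$(basename "$file" .traineddata)
        echo "== $lang"
        if command -v combine_tessdata >/dev/null 2>&1; then
            combine_tessdata -d "$file" 2>&1 | sed 's/^/   /'
        fi
    done
}
```

Calling `list_langs_details ../tessdata_fast` would then list that directory; as noted above, opening every file makes this slower than the plain --list-langs.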

@Shreeshrii
Collaborator

Re: tessdata,
The configs and tessconfigs directories and pdf.ttf are needed in the directory which is being used via TESSDATA_PREFIX or --tessdata-dir.

E.g. when doing LSTM training, the lstm.train config file is not found if one uses tessdata_best as the continue_from dir.

My workaround has been to copy these to both tessdata_fast and tessdata_best repos.
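
That workaround can be scripted. The following is a minimal sketch; the function name copy_tessdata_support and the directory names in the usage example are assumptions about a local checkout layout.

```shell
#!/bin/bash
# Hypothetical helper: copy the support files which tesseract expects
# next to the traineddata files (configs, tessconfigs, pdf.ttf) from one
# tessdata directory into another.
copy_tessdata_support() {
    local src="$1" dst="$2" item
    mkdir -p "$dst"
    for item in configs tessconfigs pdf.ttf; do
        if [ -e "$src/$item" ]; then
            cp -R "$src/$item" "$dst/"
        else
            echo "warning: $src/$item not found" >&2
        fi
    done
}
```

For example, `copy_tessdata_support ./tessdata ../tessdata_best` followed by the same call for ../tessdata_fast.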

@Shreeshrii
Collaborator

Add/implement install-langs.

@jbreiden
Contributor

A week with no API changes.

@Shreeshrii
Collaborator

Add a simple bash script for building tesseract.

I use the following; it should probably also offer to download the osd and eng traineddata files for first-time users.

#!/bin/bash
# Build and install Tesseract and the training tools.
./autogen.sh
./configure --disable-openmp --disable-graphics --disable-opencl
make
sudo make install
sudo ldconfig
make training
sudo make training-install

# Fetch a fresh googletest submodule and run the unit tests.
rm -rf ./googletest
git submodule update --init
autoreconf -fiv
#export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
export TESSDATA_PREFIX=../tessdata_fast
make check

@zdenop
Contributor

zdenop commented Mar 25, 2018

I would add this:

@amitdo
Collaborator

amitdo commented Mar 25, 2018

A week with no API changes.

Mission impossible.


Edit: That was a joke.

@zdenop
Contributor

zdenop commented Mar 26, 2018

There was an online tool that monitors API changes for Tesseract, but I cannot find the link. Does somebody have it? Can somebody show the changes between 4.0.0-beta.1 and the current code?

@Shreeshrii
Collaborator

Please see #793

The tracker is at https://abi-laboratory.pro/tracker/timeline/tesseract/
Currently it is tracking stable release 3.05.01

@zdenop Please tag another release for the 3.05 branch, since 3.05.01 had a couple of problems which have been fixed in later commits.

@stweil
Contributor Author

stweil commented Mar 26, 2018

The good news is that the latest Debian / Ubuntu tesseract-ocr does not include the development files, so there will not be any API between that version and the future 4.0.0 which we have to take care of.

Sorry, I was wrong: there is libtesseract-dev.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 27, 2018

@zdenop I suggest adding labels to issues with the following proposed list of keywords, so that it is easy to see related issues and see if there are any critical pending issues.

4.0.0 for the final release
4.0x for 4.00.00alpha and 4.0.0-beta.1
3.0x for 3.05/3.04

LSTM training
training for 3.0x legacy tesseract training

Accuracy for reports of incorrect recognition
Performance for questions related to speed
Crashes for asserts and program crashes

Build related to compile and build from source

This is a suggested list.

@amitdo
Collaborator

amitdo commented Mar 27, 2018

IMO, our final 4.0.0 should not significantly diverge from the version that will be shipped in Ubuntu 18.04.

  • No ABI & API changes.
  • No changes to user interface (command line).

A new branch should be created for 4.0.0.
Only commits that follow the above rules should be backported from master.
4.0.0 should have at least rc.1 before final release.

We can decide that 4.1.0 will be released 2-3 months after 4.0.0 (still with legacy?).

@stweil
Contributor Author

stweil commented Mar 27, 2018

How do you define "significantly"? There are some changes with the latest Git master:

  • Trained data for scripts was moved.
  • Some deprecated functions, parameters and command line options were removed.
  • The Tesseract specific integer data types (inT32, ...) and macros (MIN_INT32, ...) were removed.

Would you suggest reverting these changes? They are major changes which require a step of the major version, so I think 4.0.0 is a good candidate to include those changes. Otherwise we would have to wait for 5.0.0.

I would even go further and fix potential name space problems with the 58 include files which are part of the Tesseract programming API in 4.0.0-beta.1, although that is a significant change, too.

@amitdo
Collaborator

amitdo commented Mar 27, 2018

How do you define "significantly"?

Basically, any bug fix is OK as long as it follows the two conditions I specified. No new features.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 27, 2018 via email

@amitdo
Collaborator

amitdo commented Mar 27, 2018

18.04 is much more significant because it's LTS - supported for 5 years.
18.10 will be supported for only 9 months. We should not care about it.

@amitdo
Collaborator

amitdo commented Mar 27, 2018

What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C

We tagged it as 4.0.0-beta.1.

@amitdo
Collaborator

amitdo commented Mar 27, 2018

Another option is to skip final 4.0.0 and go straight to 5.0.0.

@Shreeshrii
Collaborator

Shreeshrii commented Mar 27, 2018 via email

@amitdo
Collaborator

amitdo commented Mar 27, 2018

@zdenop, your thoughts about these two options?

@Shreeshrii
Collaborator

Shreeshrii commented Mar 27, 2018 via email

@Shreeshrii
Collaborator

Shreeshrii commented Mar 27, 2018 via email

@amitdo
Collaborator

amitdo commented Mar 27, 2018

Jeff just said that the version in Ubuntu won't change in final 18.04.

We are talking about what we want to do in Tesseract's official GitHub repo.
We are the upstream, not Ubuntu!

@Shreeshrii
Collaborator

Shreeshrii commented Mar 27, 2018 via email

@amitdo
Collaborator

amitdo commented Mar 27, 2018

I want to hear @zdenop's and @jbreiden's opinions.

I think that as maintainers, they will understand (but not necessarily agree with) my proposal.

@Shreeshrii
Collaborator

@stweil Thank you for all your work in getting 4.0.0 ready for release.

One thing that would be useful, IMO, is if the version info from traineddata files could also be displayed when using tesseract for OCR. It might require updating the version strings to include the repo name as well.

It would be useful when people report issues.

However, this is only a nice-to-have feature and could wait for 4.1.0.

@zdenop
Contributor

zdenop commented Oct 16, 2018

https://github.com/tesseract-ocr/tesseract/milestones/4.0.0 shows only one open topic. ;-)

It would be great if the following issues were solved:

@stweil
Contributor Author

stweil commented Oct 23, 2018

@zdenop, are you planning a rc4 before the final 4.0.0? Maybe rc4 today, 4.0.0 next weekend?

I'm afraid that we won't be able to solve the issues in your list for 4.0.0.

@egorpugin
Contributor

Don't hurry. Do as many betas and rcs as needed.

@zdenop
Contributor

zdenop commented Oct 23, 2018

@stweil: rc4 could be tagged, if issue #736 is solved/tested...

@zdenop
Contributor

zdenop commented Oct 24, 2018

rc4 released.
BTW: for the final release I want to omit the git SHA info (autotools build), so the version will be just plain "4.0.0". After the release, git-rev will be restored. Any objections?

@stweil
Contributor Author

stweil commented Oct 24, 2018

That works automatically, also for the release candidates:

$ git describe 
4.0.0-rc4

It's not necessary to omit and restore something. Just update VERSION and ChangeLog.
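
As an illustration of why this works without manual steps, here is a minimal sketch of a version lookup. The function name project_version and the fallback order are assumptions, not the actual autotools logic: it prefers git describe and falls back to the first line of the VERSION file when no Git metadata is available (e.g. in a release tarball).

```shell
#!/bin/bash
# Hypothetical sketch: report the version of a source tree.
# On a tagged commit, git describe prints just the tag name; on later
# commits it appends the commit count and an abbreviated hash.
project_version() {
    local dir="${1:-.}" v
    if v=$(git -C "$dir" describe 2>/dev/null) && [ -n "$v" ]; then
        echo "$v"
    elif [ -r "$dir/VERSION" ]; then
        head -n 1 "$dir/VERSION"
    else
        echo unknown
    fi
}
```

Run inside a checkout as `project_version .`; the result depends on the state of the repository.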

@stweil
Contributor Author

stweil commented Oct 24, 2018

What about replacing ChangeLog with a very short file which just links to the release notes in the Tesseract Wiki?

@amitdo
Collaborator

amitdo commented Oct 24, 2018

+1

You can add:

To get the git changelog, run this command:
git log 3.04.01..4.0.0

@stweil
Contributor Author

stweil commented Oct 24, 2018

https://github.com/tesseract-ocr/tesseract/commits/4.0.0-rc4 shows the commit list for rc4, so users who don't have a git command line can look at https://github.com/tesseract-ocr/tesseract/commits/4.0.0 for the commits of 4.0.0. Such information can be added to the Wiki, so it would be sufficient to refer to the Wiki in the ChangeLog file.

@amitdo
Collaborator

amitdo commented Oct 30, 2018

Congratulations on the release of 4.0.0 🎉

Thanks to everyone who contributed: developers, testers, documentation writers, bug reporters.

@zdenop
Contributor

zdenop commented Oct 31, 2018

Closing because 4.0.0 was released.

@zdenop zdenop closed this as completed Oct 31, 2018
@Shreeshrii
Collaborator

@zdenop Any plans for a bug fix release?

@stweil Should another issue be opened to discuss plans for next release?

Thanks!

@zdenop
Contributor

zdenop commented Feb 10, 2019

Well, we broke API/ABI compatibility, so a bug-fix release is not easy (we would have to remove some fixes/improvements to keep it).

Maybe we should think about the next release (4.1.0), or not care about compatibility (release 4.0.1), which is IMO not right, but in line with Tesseract history ;-)

@stweil
Contributor Author

stweil commented Feb 10, 2019

We decided to use semantic versioning (which I think is good), so a new release based on Git master would have to be 4.1.0. @AlexanderP, is that a problem for the Debian tesseract-ocr packages? Maybe /usr/share/tesseract-ocr/4.00/tessdata would have to be renamed (I suggest using /usr/share/tesseract-ocr/4/tessdata).
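
To make the numbering concrete, here is a small sketch of the semantic versioning bump and of a tessdata path that depends only on the major version. The function names bump_version and tessdata_dir are made up for this illustration; the path pattern is the one suggested above.

```shell
#!/bin/bash
# Hypothetical helpers illustrating MAJOR.MINOR.PATCH version bumps.
bump_version() {
    local major minor patch
    IFS=. read -r major minor patch <<< "$1"
    case "$2" in
        major) echo "$((major + 1)).0.0" ;;
        minor) echo "${major}.$((minor + 1)).0" ;;
        patch) echo "${major}.${minor}.$((patch + 1))" ;;
        *) echo "usage: bump_version X.Y.Z major|minor|patch" >&2; return 1 ;;
    esac
}

# A data directory keyed on the major version only stays valid across
# minor and patch releases.
tessdata_dir() {
    echo "/usr/share/tesseract-ocr/${1%%.*}/tessdata"
}

bump_version 4.0.0 minor      # prints 4.1.0
tessdata_dir 4.1.0            # prints /usr/share/tesseract-ocr/4/tessdata
```

With such a path, packages for 4.0.x and 4.1.x could share the same tessdata directory.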

@Shreeshrii
Collaborator

Shreeshrii commented Feb 11, 2019 via email

stweil referenced this issue Jul 3, 2019
The function pointers and callbacks file_reader_, file_writer_,
checkpointer_reader_ and checkpoint_writer_ are always set to
the same values. Replacing them by direct function calls
simplifies the code and allows removing more code from tesscallback.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor Author

stweil commented Jul 3, 2019

Debian will start with a new stable release in a few days, and as far as I see that new release will include Tesseract 4.0 for the next few years. Should we backport important fixes to the 4.0 branch? What does that mean for Tesseract 4.1? Are there still interested parties who need it? Or should we focus on Tesseract 5 which may drop or replace old code? @AlexanderP, what upgrade path do you see for Debian?

@amitdo
Collaborator

amitdo commented Jul 3, 2019

This project has limited resources, so I suggest releasing 4.1 soon (in 1-6 weeks), then concentrating on 5.0 and abandoning 4.x.

@zdenop
Contributor

zdenop commented Jul 3, 2019

I planned to release 4.1 on the first of July. Unfortunately, I found out there are problems with backwards API compatibility...

@AlexanderP

I think it is necessary to upload version 4.1 and to upgrade to version 5.0 closer to its release.

@zdenop
Contributor

zdenop commented Jul 5, 2019

@AlexanderP : Does it mean that if we make 4.1 backwards compatible, you can get it to Debian?

@AlexanderP

@zdenop I think it can get into Debian Backports.

@stweil
Contributor Author

stweil commented Jul 7, 2019

So Debian Buster will keep using Tesseract 4.0 for the next few years? Then a 4.0.1 with carefully selected bug fixes will be required.

@AlexanderP

AlexanderP commented Jul 8, 2019

So Debian Buster will keep using Tesseract 4.0 for the next years?

Yes, but it is necessary to ask @jbreiden.
I think 4.1.0 can enter Debian buster-backports.

@jbreiden
Contributor

jbreiden commented Jul 9, 2019 via email
