Showing operational state. #39

Open
edegraaff opened this issue Nov 2, 2024 · 10 comments

@edegraaff

Is your feature request related to a problem? Please describe.
Not really a problem, but when processing documents I have no clue what the program is doing. I see a "washing machine" icon, but I can't tell whether the program is not responding or just busy.

Describe the solution you'd like
Some sort of queue management or process state.

Describe alternatives you've considered
I am now using a process monitor to see whether the program is operational: when it uses 0% CPU it is waiting for input, and when it is at 10% it is doing something.

Additional context
I don't think they are needed, but if that's the case I am able to add them.

@rffrasca rffrasca self-assigned this Nov 3, 2024
@rffrasca
Owner

rffrasca commented Nov 3, 2024

The upload process and its folder maintenance run as asynchronous tasks. This was designed so that the user is not blocked from using the application while documents are being uploaded. The "green clock" on the status bar appears while the upload is running to provide status to the user.

@edegraaff
Author

edegraaff commented Nov 9, 2024

OK, thanks.
For our association, we want to process 100 or more PDF documents to build an index of all the work done over the last 15 years, ... and I want to make sure it is processed in the right way. I am still puzzled about the approach to get it working. The PDFs contain indexes and all kinds of subjects that have been discussed in the documents over the years. The tool could give us access to all those PDFs, let us find old subjects described in detail, and also prevent us from writing an article that is already covered in older documents.

@rffrasca
Owner

"I want to make sure it is processed in the right way."
Are you looking for logging to be added to the upload process?

@rffrasca
Owner

Another user opened a request for a progress bar during the upload process. Would this be helpful for you and address your issue? I was thinking a marquee style progress bar that would appear on the status bar during the upload. This progress bar would replace the "green clock".

@edegraaff
Author

Hi, I think something that gives insight into the progress of importing the PDFs would be great. On the other hand, I am also looking for some more advanced functions. We are trying to get an overview of all the journals that have been published over time, and hashtags or other markers to search or index on could be needed. I understand you never developed the program for this purpose, and maybe I need to use something else. But being able to see progress, and also to add special words to search and/or index on, could be practical. In your tax example, I want to see everything for year 2020, or everything related to healthcare, or buying equipment. As a non-profit organisation, we want to help our members look back at what interesting articles have been published on which subjects. Regards, Eelco

@rffrasca
Owner

Hi Eelco,
I want to focus on the upload progress first, then we can move on to the next part.

When you say "Insights" are you referring to information such as the filename being processed and the operation being performed on each page like text extraction, OCR, etc.?

@edegraaff
Author

Yes, something like that. It could be a progress bar somewhere, and/or a log file that says: picking up "filename.pdf", processing OCR, storing, indexing, done, next ... It all depends on the time and effort you want to put into this. I could imagine that when you have 100 files in the inbox, you poll, let's say, every 10 seconds. I drop 100 files, then you get something like "for i is 1 to 100", and then you could make a progress bar or a counter. The issue is timing: when you process a lot of files and someone adds new ones ... on the other hand, who cares. Hmm, what about this approach: poll the inbox, pick up a file, and count the files. OK, there are 100 files, so processing 1 of 100. When processing is ready, do the next poll and count the files: 10 added, so count 99+10 and continue from there. After a while the inbox is empty and the system is in a steady state. I am not a programmer, so I have no clue what problems you will find if you build something like this ...
keep it simple :-|)
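
The poll-and-recount idea described above can be sketched in a few lines. This is only an illustration in Python, not PDFKeeper code: the names `UploadProgress`, `drain_inbox`, `process`, and `report` are hypothetical, and the sketch assumes that processing a file removes it from the inbox folder.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class UploadProgress:
    done: int = 0    # files fully processed so far
    total: int = 0   # processed files plus files still waiting in the inbox

    def message(self) -> str:
        return f"Uploading PDF {self.done + 1} of {self.total}"


def drain_inbox(inbox: Path,
                process: Callable[[Path], None],
                report: Callable[[str], None] = print) -> UploadProgress:
    """Process PDFs one at a time, re-counting the inbox before each file.

    Assumption: `process` (hypothetical) removes the file from the inbox
    when it finishes; otherwise this loop would never end.
    """
    progress = UploadProgress()
    while True:
        waiting = sorted(inbox.glob("*.pdf"))
        if not waiting:
            return progress  # inbox empty: steady state
        # Files added since the last poll simply raise the total.
        progress.total = progress.done + len(waiting)
        report(progress.message())
        process(waiting[0])
        progress.done += 1
```

Because the total is recomputed on every pass, files dropped in during a run only move the "of #" figure upward; the counter never has to be reset.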

@rffrasca
Owner

rffrasca commented Nov 12, 2024

Thanks for the detailed explanation. With that, I came up with the following solution:

I will be implementing a marquee style progress bar for another request that will be visible when the upload is running. So, for your request, I was thinking of:

  • Adding to the status bar, a message like this "Uploading PDF # of #" that would be visible during the upload.
  • I was also thinking of logging the upload with "Processing \filename.pdf" and for each page, log messages like "Extracting PDF properties.", "Extracting text annotations". For each page, log status such as "Page 1 extracting text", or "Page 1 extracting text using OCR.". When a page is not processed, that will be logged also.
  • During each upload cycle, a new log will be created that can be viewed using a log viewer that I can add to the application. The log viewer could have a Delete button that you can use to delete the log after you have reviewed it.
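
As a rough illustration of the per-cycle log described in the list above, here is a minimal Python sketch. It is not PDFKeeper's implementation: the function name `log_upload_cycle`, the log file naming, and the way pages are described (`"text"`, `"ocr"`, or `None` for a page that was not processed) are all assumptions made for the example. The property and annotation messages are written once per document here, which is also an assumption.

```python
import logging
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical input shape: (filename, one extraction method per page).
Document = Tuple[str, Sequence[Optional[str]]]


def log_upload_cycle(log_dir: Path, documents: Iterable[Document]) -> Path:
    """Write one log file for a single upload cycle and return its path."""
    log_path = log_dir / f"upload-{datetime.now():%Y%m%d-%H%M%S}.log"
    logger = logging.getLogger(f"upload.{log_path.stem}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep cycle messages out of the root logger
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    try:
        for filename, pages in documents:
            logger.info("Processing %s", filename)
            logger.info("Extracting PDF properties.")
            logger.info("Extracting text annotations.")
            for number, method in enumerate(pages, start=1):
                if method == "text":
                    logger.info("Page %d extracting text.", number)
                elif method == "ocr":
                    logger.info("Page %d extracting text using OCR.", number)
                else:
                    logger.info("Page %d not processed.", number)
    finally:
        # Close the file so a log viewer can open or delete it afterwards.
        logger.removeHandler(handler)
        handler.close()
    return log_path
```

One file per cycle keeps each log small enough to review and delete from a viewer, which matches the Delete button idea above.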

If you like this plan, give it a thumbs up and I will schedule it in a future release.

@rffrasca
Owner

rffrasca commented Nov 12, 2024

Regarding the advanced searching and indexing.

PDFKeeper creates a full-text index of the following columns:

Title – Extracted from the PDF during the upload.
Author – Extracted from the PDF during the upload.
Subject – Extracted from the PDF during the upload.
Keywords – Extracted from the PDF during the upload.
Added – Date and time the document was added to the database.
Notes – User-editable notes, indexed after every update.
PDF – The PDF document contents as a byte array. This column is indexed by Oracle Text after being added to the database. This is specific to Oracle Database. For SQLite, this column is not indexed.
Category – Optional category of the document. (user assigned)
Tax Year – Optional tax year of the document. (user assigned)
Text Annotations – Extracted from the PDF during the upload.
Text – Extracted from the PDF during the upload.

If you haven't already done so, I recommend trying out some searches with your uploaded documents to identify what needs to be enhanced. The help file gives an overview on the search options available in PDFKeeper. As for searching by search term, you can search on any of the columns above. Full-text search can be very powerful, but the help file only provides basic examples because of how complex full text searching can be. You may be able to look up some advanced searches that can be performed by search term in the online documentation of the database platform being used. If you need help finding the documentation, let me know which database you're using and the version if Oracle.

If the advanced searches you're trying cannot be performed within PDFKeeper and you have experience with SQL, you can try SQL*Plus if you are using Oracle, or the SQLite command-line tool if you are using SQLite. Let me know if you need the links to these. If you do use SQL, keep track of the queries that work for you. I will need these if I need to add support for this in PDFKeeper.
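
To give a feel for the kind of query worth keeping track of, here is a small, self-contained Python sketch against an in-memory SQLite full-text table. The table `docs`, its columns, and the sample rows are hypothetical and are not PDFKeeper's actual schema; the sketch also assumes the SQLite build bundled with Python includes the FTS5 extension. The queries mirror the examples mentioned earlier in this thread (a given year, healthcare, buying equipment).

```python
import sqlite3
from typing import List


def build_demo_index() -> sqlite3.Connection:
    """Create a throwaway FTS5 table with a few made-up journal entries."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema for illustration only.
    conn.execute(
        "CREATE VIRTUAL TABLE docs USING fts5(title, keywords, notes, tax_year)"
    )
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?, ?, ?)",
        [
            ("Spring journal", "healthcare insurance",
             "Article on healthcare costs", "2020"),
            ("Autumn journal", "equipment",
             "When to buy equipment for the club", "2020"),
            ("Winter journal", "healthcare",
             "Older healthcare overview", "2019"),
        ],
    )
    return conn


def search(conn: sqlite3.Connection, query: str) -> List[str]:
    """Return the titles of matching documents, sorted alphabetically."""
    rows = conn.execute("SELECT title FROM docs WHERE docs MATCH ?", (query,))
    return sorted(row[0] for row in rows)


conn = build_demo_index()
print(search(conn, "tax_year:2020"))                 # restrict to one column
print(search(conn, "healthcare"))                    # any column
print(search(conn, "healthcare AND tax_year:2020"))  # combine conditions
print(search(conn, '"buy equipment"'))               # exact phrase
```

Oracle Text uses a different query syntax, so queries written this way would need to be adapted for an Oracle database.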

Once you know the specifics of what you need, feel free to open an enhancement request and we can discuss further. Keep in mind that I won't be able to accommodate changes that will not benefit other users.

Hope this information helps.

@edegraaff
Author

Thanks, yes I will. I received some sample documents in all sizes and will play with them. I also figured out how to reset the data and rebuild the content of the database. I will test it later this week, thanks!

@rffrasca rffrasca added this to the 10.2.0 milestone Nov 13, 2024
Projects
Status: Backlog
Development

No branches or pull requests

2 participants