
Progress callback is called too soon? #741

Closed
eblis opened this issue Jul 29, 2022 · 11 comments
eblis commented Jul 29, 2022

Python version

('python=3.9.12 (tags/v3.9.12:b28265d, Mar 23 2022, 23:52:46) [MSC v.1929 64 '
 'bit (AMD64)]')
'os=Windows-10-10.0.19044-SP0'
'numpy=1.23.1'
ldf is not supported
xls is not supported
xlsx is not supported
'asammdf=7.1.0'

Code

import functools
import logging

import tqdm
from asammdf import MDF

logger = logging.getLogger(__name__)

def load_progress(progress_bar: tqdm.tqdm, current, total):
    if progress_bar.total != total:
        progress_bar.clear()
        progress_bar.reset(total=total)
    progress_bar.update(1)
    progress_bar.refresh()

# `file` is a pathlib.Path set elsewhere in the surrounding code
loading = tqdm.tqdm(total=100, leave=False, desc=f"Loading {file.name}")
progress_function = functools.partial(load_progress, loading)
try:
    with MDF(file, callback=progress_function) as mdf_file:
        loading.close()  # done with load progress
        logger.info(f"Reading finished for {file}")
finally:
    loading.close()  # make sure the bar is closed even if loading fails

Description

The progress bar for loading the file finishes much faster than the file is actually loaded.
For my file (though the file itself doesn't seem to matter), total is set to 13903 or similar and the progress reaches 100% within a few seconds, but the progress bar is only closed, and the log message saying Reading finished for {file} only written, minutes later.

@danielhrisca (Owner)

current and total are not confined to the interval 0..100.
Could this be a problem for your progress bar?
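Since current and total are raw counts rather than percentages, a callback can normalize them itself. A minimal sketch (illustrative names, not asammdf API):

```python
def make_percent_callback(report):
    """Wrap a percent-based reporter so it accepts raw (current, total) counts."""
    last = {"pct": -1}

    def callback(current, total):
        pct = int(current * 100 / total) if total else 0
        if pct != last["pct"]:  # only report when the percentage changes
            last["pct"] = pct
            report(pct)

    return callback

seen = []
cb = make_percent_callback(seen.append)
for i in range(1, 14):
    cb(i, 13)  # simulate 13 progress calls with total=13
# `seen` now holds 13 distinct percentages ending at 100
```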


eblis commented Jul 29, 2022

If the total value reported by asammdf differs from the current one (default 100), the progress bar is reset to use the new total as its maximum, and every callback increments the progress by 1.
If you add a progress_bar.write(f"{current}: {total}") to the callback function you can see that current goes from 0 to whatever total holds, but the file hasn't finished loading at that point; it still takes a lot longer after the callback has been called for the last time.


eblis commented Aug 1, 2022

So it turns out that the problem is the sort call after reading the data.

I added some quick print statements to the _read() method (for mdf_v4) and measured three phases: from the start of the method to the self._sort() call (which seems to load the data), the self._sort() call itself, and the time after sorting until the end (which seems to handle some kind of events):

Reading took 9.03205919265747 seconds
Sorting took 1453.9986987113953 seconds
Events took 0.37953495979309 seconds

The bulk of the processing time is spent on DTBLOCKs.
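The per-phase timings above can be collected with a simple perf_counter pattern (a generic sketch, not the actual _read() code; the phase functions are stand-ins):

```python
import time

def timed_phases(phases):
    """Run (name, fn) pairs in order and return how long each took, in seconds."""
    timings = {}
    for name, fn in phases:
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings

# stand-in phases; in the real case these would be reading, self._sort(), events
timings = timed_phases([
    ("reading", lambda: time.sleep(0.01)),
    ("sorting", lambda: time.sleep(0.02)),
    ("events", lambda: None),
])
for name, seconds in timings.items():
    print(f"{name.capitalize()} took {seconds} seconds")
```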


eblis commented Aug 1, 2022

It takes around 4s for one iteration of the following for loop (I added the progress calls):

            for info in group.get_data_blocks():
                if self._callback:
                    progress += 1
                    self._callback(progress, size)

                dtblock_address, dtblock_raw_size, dtblock_size, block_type, param = (
                    info.address,
                    info.original_size,
                    info.compressed_size,
                    info.block_type,
                    info.param,
                )
...

I have 575 groups in total, so it will take about 30-35 minutes.

@danielhrisca (Owner)

> It takes around 4s for one iteration of the following for loop [...] I have 575 groups in total so it will take about 30-35 minutes.

Damn, how big is the file?


eblis commented Aug 1, 2022

About 574 MB

@danielhrisca (Owner)

Please try the development branch code.


eblis commented Aug 5, 2022

I had a look at the changes.
From what I can tell, the new code adds 2 new steps to the progress, one of which covers the whole sort() call. But in my case sorting took the longest time by far, so with these changes the progress bar will appear stuck on the second-to-last step for a very long time.

I tried to think of a way to update the progress while sorting is taking place.
Would something like this be possible?

1. Add 100 steps (or some other large, fixed number) for the sorting of the data.
2. Compute the progress increment for one data block: increment = sort_steps / data_blocks, where sort_steps is the total number of sorting steps (100) and data_blocks is the total number of data blocks, obtained by counting the data blocks in each group and summing them. For example, with 55 data blocks and 100 steps, increment = 100 / 55 -> increment = 1.818181.
3. Call the progress function inside the sort() call, advancing by increment every time.

This would make the progress smoother, but it means the increment value is no longer an integer, it's a float (you could round the value when calling the progress function) if the callback function really expects ints (but then again, why should it?).
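The fractional-increment idea above can be sketched roughly like this (made-up numbers and function names, not the actual asammdf sort() internals):

```python
def sort_with_progress(data_blocks, callback, sort_steps=100):
    """Report sort progress in `sort_steps` steps spread over all data blocks."""
    increment = sort_steps / data_blocks  # e.g. 100 / 55 ≈ 1.818181
    progress = 0.0
    for _ in range(data_blocks):
        # ... sort one data block here ...
        progress += increment
        # round so the callback still receives an integer
        callback(round(progress), sort_steps)

calls = []
sort_with_progress(55, lambda cur, total: calls.append(cur))
# the callback fires once per data block and the last value reaches 100
```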

eblis added a commit to eblis/asammdf that referenced this issue Sep 27, 2022
Added progress information while sorting all the groups.
 - Added 102 extra steps to the load progress: 100 for sorting and 2 after sorting is completed.
 - While sorting, we first compute a step from the total number of sorting steps (100), the number of groups, and the total number of items, and then advance the progress by that step for each item sorted.
 - The callback is only invoked when the current step count (as an integer) exceeds the previous one. This avoids calling the callback function with floating-point values and avoids calling it repeatedly with the same current value of the progress bar, i.e. we only call it when the progress has actually advanced.
@eblis
Copy link
Author

eblis commented Sep 27, 2022

I have created a pull request with my proposed solution implemented.
I've tweaked the solution a bit so that it calls the callback function with integers instead of floats, and only calls it when the progress (as an int) has actually increased.
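The integer-only tweak can be sketched as: keep the float accumulator internally, but only invoke the callback when the integer part grows (illustrative code, not the exact PR implementation):

```python
def sort_with_int_progress(data_blocks, callback, sort_steps=100):
    """Call `callback` only when the integer progress value increases."""
    increment = sort_steps / data_blocks
    progress = 0.0
    last_reported = -1
    for _ in range(data_blocks):
        # ... sort one data block here ...
        progress += increment
        current = int(progress)
        if current > last_reported:  # skip duplicate integer values
            last_reported = current
            callback(current, sort_steps)

calls = []
sort_with_int_progress(200, lambda cur, total: calls.append(cur), sort_steps=100)
# with more blocks than steps, each integer value is reported exactly once
```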


eblis commented Oct 6, 2022

You can use the following script to check the progress updates.

import functools
import pathlib

import asammdf
from tqdm import tqdm


def load_progress(progress_bar: tqdm, current, total):
    if progress_bar.total != total:
        progress_bar.clear()
        progress_bar.reset(total=total)
    progress_bar.update(1)


file = pathlib.Path("d:/tmp/mf4/scramble.scrambled.mf4")
loading = tqdm(total=100, leave=False, desc=f"Loading {file.name}")

progress_function = functools.partial(load_progress, loading)
with asammdf.MDF(file, callback=progress_function) as mf4_file:
    loading.close()  # done with load progress

danielhrisca added a commit that referenced this issue Oct 7, 2022
@danielhrisca (Owner)

In version 7.3.0 you will have to change the argument name to progress:

with asammdf.MDF(file, progress=progress_function) as mf4_file:
    loading.close()  # done with load progress
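If you need code that works both before and after the rename, one option is to inspect the installed signature and pass whichever keyword it accepts. A sketch, assuming only the keyword name changed (`open_with_progress` and the stand-in class are made up for illustration):

```python
import inspect

def open_with_progress(MDF, file, progress_function):
    # asammdf 7.3.0 renamed the MDF keyword from `callback` to `progress`;
    # pick whichever name the installed version's __init__ accepts.
    params = inspect.signature(MDF.__init__).parameters
    keyword = "progress" if "progress" in params else "callback"
    return MDF(file, **{keyword: progress_function})

# usage with a stand-in class mimicking the pre-7.3.0 signature
# (the real call would pass asammdf.MDF):
class FakeOldMDF:
    def __init__(self, file, callback=None):
        self.file, self.callback = file, callback

mdf = open_with_progress(FakeOldMDF, "test.mf4", print)
```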
