
Responsive and thread-safe track analysis (ad-hoc + batch) #1413

Closed
wants to merge 146 commits into from

Conversation

@uklotzde (Contributor) commented Dec 13, 2017

...that might prematurely abort a batch analysis. The blocking operation must be enclosed in a loop.

Related bug reports:

Existing implementation:

  • Summary: Hardly usable and unsafe, inappropriate for a release
  • Awkward API design causes unresponsiveness of the UI when starting and during an analysis
  • Unneeded and inefficient prioritization of scheduled tracks
  • Progress update gets confused and displays invalid or inconsistent values when adding tracks during an analysis (you might already have seen negative values)
  • Race conditions between scheduler (= queue) and worker thread

New implementation:

  • Re-designed scheduler API (formerly known as "AnalyzerQueue") for efficiently scheduling multiple tracks for analysis
  • Lock-free worker thread design that does not block the scheduler/UI, either when scheduling tracks or during progress updates
  • No more inconsistencies during progress updates caused by inappropriate parameterization
  • Thread-safe creation and isolation of fidlib filters
  • Thread-safe isolation of Vamp plugins after experiencing spurious crashes when running multiple worker threads concurrently
  • Enabling multi-threaded analysis is completely transparent and does not require any changes in the remaining code base

The PR got too big because, after detecting some minor issues, we discovered new and even worse ones. Thank you all for your valuable input 👍 In the end a complete re-design was needed to get the whole analysis code into a stable, safe, and usable state. Had we known this in advance, we could have started by fixing single issues in isolation. Unfortunately I don't have the time to decompose this PR again into smaller parts that could be reviewed one by one; that would cost me at least another day, if not more.

@Be-ing (Contributor) commented Dec 14, 2017

This has not fixed my issue with batch analysis :/ I still see the same stuff in my log.

@Be-ing (Contributor) commented Dec 14, 2017

I tried renaming my mixxxdb.sqlite file and that didn't fix it either.

@uklotzde uklotzde changed the title Fix potential race condition in AnalyzerQueue... Fix potential race condition in AnalyzerQueue Dec 14, 2017
@JosepMaJAZ (Contributor) commented Dec 14, 2017

And... what about my multithreaded branch? In there, the isLoadedTrackWaiting method is not present. I removed it because I couldn't really understand what it tried to do. It seems as if it tries to remove a track from the list if it is about to analyze it.
It's not really something that needs special treatment, except, at most, to avoid analyzing the same track on two threads at the same time (although I'm not 100% sure it even checks for that scenario).

@uklotzde (Contributor, Author) commented:

The AnalyzerQueue still seems to have serious issues. The more tracks are added to the queue, the more sluggish the interface becomes. I have not yet finished analyzing why this happens. Maybe it is an unexpected side-effect of the recently added TrackCache.

Here I just fixed some minor issues that should not cause any more regressions. Reducing lock contention on the TrackCache is work in progress and a different aspect.

@uklotzde (Contributor, Author) commented:

@JosepMaJAZ I plan to review your multi-threaded analysis work! But please don't expect that we are able to finish it very soon. If we don't grasp and tame the dynamics of a single concurrent task within the boundaries of the current architecture, I doubt that we are able to get it right for multiple threads.

First we need a rock solid solution for a single concurrent thread that never fails, neither in real life nor in theory. Then we may generalize that for multiple threads.

@daschuer (Member) commented:

It does not hurt to use atomics here.
But plain ints are atomic as well, so the latest commits should not make a difference.

I still have not got the point: where was the race condition in the first commit? Could you explain it a bit more?

For me this PR is, up to now, (only) a nice refactoring. Can we merge it as it is, to keep a possible final solution PR readable?

@daschuer (Member) commented:

Suspicion:
We have a pending race condition with a global Qt mutex protecting the GUI thread's signal queue.
If I recall correctly, the engine locks it whenever writing a CO (control object). This low-priority analyzer also uses Qt signals to the GUI thread. So maybe we are seeing a priority inversion caused by this mutex.
One idea to solve it is to pass all engine signals (CO valueChanged) to a second thread in a non-locking way. This thread can then signal the GUI thread and hold the mutex without blocking the engine.

This is of course no solution for priority inversion between the analysis and other threads signaling the GUI, but at least it will prevent sound dropouts.

Could this be your issue?

@uklotzde (Contributor, Author) commented:

👍 Even after reducing the lock contention on TrackCache (will be published soon) the analysis is still very "sluggish". I had no idea why, but this could be the reason!!

@uklotzde (Contributor, Author) commented Dec 15, 2017

Race condition: The queue might already be empty again by the time the thread is actually unblocked. You should always enclose those lock/wait/check patterns in a loop!

Rethinking this special case: The worker thread is currently the only consumer that is able to take items from the queue. But even then we should apply the common pattern now. In the future the same queue might be consumed by multiple threads. Better to get this correct right from the start!
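The lock/wait/check-in-a-loop pattern can be sketched with standard C++ primitives. This is purely illustrative, not the actual AnalyzerQueue code (which uses Qt classes); the `TrackIdQueue` name and plain `int` track ids are assumptions for the sketch.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// Illustrative sketch of the wait-in-a-loop pattern. A consumer must
// re-check the queue condition after waking, because another consumer
// may have emptied the queue in the meantime, and condition variables
// are allowed to wake up spuriously.
class TrackIdQueue {
  public:
    void push(int trackId) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push_back(trackId);
        }
        m_condition.notify_one();
    }

    // Blocks until an item is available. The predicate overload of
    // wait() is exactly the required loop: it re-evaluates the check
    // every time the thread wakes up while holding the lock.
    int pop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_condition.wait(lock, [this] { return !m_queue.empty(); });
        int trackId = m_queue.front();
        m_queue.pop_front();
        return trackId;
    }

  private:
    std::mutex m_mutex;
    std::condition_variable m_condition;
    std::deque<int> m_queue;
};
```

With this form, the pattern remains correct even if multiple consumer threads are added later, which is exactly the generalization discussed here.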

@daschuer (Member) commented:

This explains a one-time dropout, but it does not explain why the GUI starts to become sluggish. Is there anything new that is continuously sending signals?

@daschuer (Member) commented:

I did not read your latest comment before writing mine.
Thank you for explaining. Does this mean we can merge this now? LGTM, by the way.

@uklotzde (Contributor, Author) commented:

Those atomic variables are read and written from different threads and should use atomic types with memory fences/barriers to avoid race conditions. It is at least good style: it prevents undefined behaviour, and the atomic type documents that the variable is shared between threads.
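A minimal sketch of this point using `std::atomic` from the C++ standard library. The `AnalysisProgress` name and counter are illustrative assumptions, not the actual Mixxx code; the point is that a plain `int` shared like this would be a data race (undefined behavior) even on hardware where word-sized loads and stores happen to be atomic.

```cpp
#include <atomic>

// Illustrative sketch: a progress counter written by a worker thread
// and read by the UI thread. std::atomic makes the sharing explicit
// in the type and guarantees race-free access.
struct AnalysisProgress {
    std::atomic<int> framesProcessed{0};

    // Called from the worker thread.
    void advance(int frames) {
        // Relaxed ordering suffices for a monotonic counter that
        // carries no other data; the access is still race-free.
        framesProcessed.fetch_add(frames, std::memory_order_relaxed);
    }

    // Called from the UI thread.
    int current() const {
        return framesProcessed.load(std::memory_order_relaxed);
    }
};
```

Even where relaxed ordering is enough, the atomic type itself serves as documentation that the variable crosses thread boundaries.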

@uklotzde (Contributor, Author) commented Dec 15, 2017

The analyzer thread continuously sends progress updates, at least every 60 ms. But this didn't change, so I wonder why it now seems to be an issue.

@JosepMaJAZ (Contributor) commented:

I still don't see what you are all trying to address with this PR.

If the problem is that the UI blocks when starting a new analysis with hundreds (if not thousands) of tracks, then that has two causes:
First, we still load the full tracks instead of only their IDs.
Second, we use a simple queue in which we test whether a track has previously been added. The more tracks there are, the longer it takes to add one more track.

If you are trying to reduce glitches while the analysis is running (I wonder which ones), then what you need to redo is how the UI updates are done. I didn't modify that part in my multithreaded PR, but I described that the mechanism in place is a bit stupid, because what it blocks are the emitted messages from the analyzer thread to another thread (supposedly the main thread), which in turn emits another message for the UI update.

The UI thread (maybe tied to the refresh rate) should report whether it has updated the UI or not, and only after that has been done allow another update. But if the 60 ms wait is in place, there isn't any reason to worry about UI updates (that's around 16 Hz).

So, really... what is being fixed here?

@uklotzde uklotzde changed the title Fix potential race condition in AnalyzerQueue Fix ~~potential~~ race condition in AnalyzerQueue Dec 16, 2017
@uklotzde uklotzde changed the title Fix ~~potential~~ race condition in AnalyzerQueue Fix race condition in AnalyzerQueue Dec 16, 2017
@uklotzde (Contributor, Author) commented:

@Be-ing Does the last commit fix your issue with tracks not being analyzed? Enqueuing was broken since ???, so this is no regression.

@uklotzde (Contributor, Author) commented:

@JosepMaJAZ I added a TODO with a hint for the O(n²) complexity when adding multiple tracks in a row. But this is currently not the main issue.

There is already a semaphore in place that tells the analyzer if the UI is ready for the next update. I don't know if this is appropriate or not. One step after the other.
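The O(n²) complexity mentioned above arises when every enqueue performs a linear membership scan over the queue: n insertions then cost O(n²) in total. One possible fix is to maintain an auxiliary hash set of enqueued ids alongside the queue, making each membership test amortized O(1). The `DedupTrackQueue` name and `int` track ids below are illustrative assumptions, not the actual implementation.

```cpp
#include <deque>
#include <unordered_set>

// Illustrative sketch: a FIFO queue that rejects duplicate track ids.
// The hash set mirrors the queue's contents, so the duplicate check
// is amortized O(1) instead of an O(n) scan of the queue.
class DedupTrackQueue {
  public:
    // Returns false if the track was already enqueued.
    bool enqueue(int trackId) {
        if (!m_enqueuedIds.insert(trackId).second) {
            return false; // already present, skip
        }
        m_queue.push_back(trackId);
        return true;
    }

    bool empty() const { return m_queue.empty(); }

    int dequeue() {
        int trackId = m_queue.front();
        m_queue.pop_front();
        m_enqueuedIds.erase(trackId);
        return trackId;
    }

  private:
    std::deque<int> m_queue;
    std::unordered_set<int> m_enqueuedIds;
};
```

The trade-off is a second container to keep consistent with the queue, which is why this was left as a TODO rather than folded into the bug-fix PR.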

@Be-ing (Contributor) commented Dec 16, 2017

> @Be-ing Does the last commit fix your issue with tracks not being analyzed? Enqueuing was broken since ???, this is no regression.

No :/

@uklotzde uklotzde added this to the 2.1.0 milestone Dec 16, 2017
@JosepMaJAZ (Contributor) commented:

@Be-ing: can you add more information about what this PR tries to fix?
I built master on both Windows and Linux, used different skins, and analyzed a bunch of tracks, and I haven't noticed anything strange. (Linux was run under VirtualBox, so it isn't exactly a good test environment, but I didn't see anything that wasn't working as expected.)

@Be-ing (Contributor) commented Dec 16, 2017

@JosepMaJAZ: I renamed my mixxxdb.sqlite because the schema got messed up somewhere along the way testing various PRs. Upon doing batch analysis for my whole library, I found a serious bug that @uklotzde is trying to fix here.

@uklotzde (Contributor, Author) commented:

The "sluggish" response is actually caused by the way the worker thread communicates with the analyzer component in the main UI thread. I need to decide how to decouple them to avoid blocking on the semaphore. Shared state between threads with different priorities is a bad idea. The semaphore always seemed odd to me, and I should have noticed earlier!

This might take some time, also for testing, so I'm not sure if it still fits into this PR, which already contains some bug fixes and refactorings to make the code comprehensible and maintainable.

@Be-ing (Contributor) commented Dec 16, 2017

Hanging the GUI is a problem, but not as big of a problem as skipping analysis of all tracks. Let's prioritize that.

@uklotzde uklotzde changed the title Fix race condition in AnalyzerQueue [WiP] Fix race condition in AnalyzerQueue Dec 17, 2017
@uklotzde (Contributor, Author) commented Dec 17, 2017

We also need to consider the fact that tracks are still saved (metadata export + database update) from the low-priority analysis thread when it owns the last track pointer reference. All clear: save actions are executed within the [Main] thread. Nevertheless, this needs to be analyzed and documented thoroughly.

@uklotzde (Contributor, Author) commented:

Bulk updates for many tracks are still an issue in the main thread, but the lock-free implementation avoids blocking the main thread from a low-priority background thread. I also reduced the amount of inter-thread communication.

Updates during analysis don't all need to be consumed by the main thread, since each subsequent update replaces and includes its predecessor. The only crucial point is that the last update from the analysis thread must finally reach the main thread. I need to check that this requirement is still met.
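Because each update supersedes its predecessor, the worker can overwrite a single shared slot instead of queueing every update; the UI thread then reads the latest value at its own pace and neither side ever blocks the other. A minimal sketch under that assumption (`ProgressSlot` is an illustrative name, not the actual Mixxx code):

```cpp
#include <atomic>

// Illustrative sketch of a "latest value wins" progress channel.
// The worker overwrites the slot on every update; intermediate
// values may be skipped by the reader, which is fine because each
// update replaces and includes its predecessor.
class ProgressSlot {
  public:
    // Called from the worker thread for every progress update.
    void publish(int percent) {
        m_latest.store(percent, std::memory_order_release);
    }

    // Called from the UI thread, e.g. on a periodic timer tick.
    int latest() const {
        return m_latest.load(std::memory_order_acquire);
    }

  private:
    std::atomic<int> m_latest{0};
};
```

Delivering the final update still needs a separate guarantee (e.g. one last read after the worker signals completion), which is the requirement mentioned above.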

Note: During testing with massive loads I experienced one or two crashes within the Qt event loop that could not be backtraced into Mixxx code. Those crashes could be caused by any code that is not respecting the rules, or even by a bug in Qt itself.

@Be-ing (Contributor) commented Mar 19, 2018

Since the immediate issue of the GUI freezing during batch analysis was fixed in #1485, I do not think we should rush this for 2.1.

@uklotzde (Contributor, Author) commented:

I'll keep this PR aligned with and targeted to 2.1 rather than master. It contains so many fixes that we might need at some point for a 2.1.x fix release 😉

@Be-ing (Contributor) commented Apr 18, 2018

I'm not comfortable merging a +2,100 −1,078 pull request into a released branch unless it solves a critical bug that makes Mixxx unusable. There will be bugs from this and it needs wider testing before going into a release. Let's target it to master.

@Be-ing Be-ing mentioned this pull request Apr 18, 2018
@Be-ing Be-ing changed the base branch from 2.1 to master April 18, 2018 05:18
@uklotzde (Contributor, Author) commented:

This will not be an issue until serious merge conflicts arise.

I would like to reap the benefits of this work for my personal, customized, and stable 2.1 version. Anyone is free to fork this branch at any time and maintain it by themselves. But whoever decides to do so takes all the responsibilities. There is no such thing as a free lunch 😉

@uklotzde (Contributor, Author) commented:

I'm thinking about creating a new branch without all the intermediate changes and dead ends. The change set is much bigger than it ought to be, although the history of commits and comments gives some insight into how we finally got here.

@Be-ing (Contributor) commented Apr 18, 2018

I would like to reap the benefits of this work for my personal, customized, and stable 2.1 version.

Of course, you can do whatever you want on your own computer. But I am not comfortable merging such a big branch into the upstream 2.1 branch. Splitting this into smaller branches would be nice.

9 participants