
lp1842679: Fix caching of audio data #2265

Merged
merged 13 commits into from
Sep 15, 2019

Conversation

uklotzde (Contributor) commented Sep 4, 2019

https://bugs.launchpad.net/mixxx/+bug/1842679

Trying to mitigate the issue by reserving more chunks; the current number of chunks seems to be insufficient when using multiple decks, including sample decks! Reverted: my initial guess was wrong. But I found issues with the doubly-linked MRU/LRU list, where at least some assertions failed. Fixed.

The comment was outdated, I updated it.

@uklotzde uklotzde added this to the 2.2.3 milestone Sep 4, 2019
daschuer (Member) commented Sep 5, 2019

I have some doubts whether this fixes the linked bug. Do we have an idea what needs to happen so that none of the 80 chunks can be freed?

uklotzde (Contributor, Author) commented Sep 5, 2019

It will not fix the bug if something else is blocking the processing of chunks and they get stuck in one of the FIFOs.

Nevertheless, it doesn't hurt to increase the number; 16 MB should be acceptable. And the size of the FIFOs should match exactly this number! This is fixed now.

uklotzde (Contributor, Author) commented Sep 5, 2019

The logs attached to the bug report clearly show a strong temporal correlation between enabling recording and running out of chunks. Everything went back to normal after disabling recording. Something in the recording path of the engine seems to block and cause a chain reaction.

daschuer (Member) commented Sep 5, 2019

https://github.com/uklotzde/mixxx/blob/0c663d1729c0b0c3ddb6fdf4309389c36740ae5a/src/engine/cachingreader.cpp#L134

Why can this ever be true? If I understand the code correctly, lru and mru are the tail and head of the linked list in use. If all 80 chunks are in use, lru and mru must have a value and hold all chunks.

So maybe the recording triggers a bug that loses the lru?

uklotzde (Contributor, Author) commented Sep 6, 2019

> https://github.com/uklotzde/mixxx/blob/0c663d1729c0b0c3ddb6fdf4309389c36740ae5a/src/engine/cachingreader.cpp#L134
>
> Why can this ever be true? If I understand the code correctly, lru and mru are the tail and head of the linked list in use. If all 80 chunks are in use, lru and mru must have a value and hold all chunks.
>
> So maybe the recording triggers a bug that loses the lru?

If all allocated chunks are on their way between CachingReaderWorker and CachingReader, both the head and tail pointers of the MRU/LRU list become nullptr. See CachingReaderChunk::removeFromList()/insertIntoListBefore() and the FIFOs in CachingReader that serve as channels.

uklotzde (Contributor, Author) commented Sep 6, 2019

If CachingReaderWorker got stuck, probably due to some file system I/O, then we will eventually run out of chunks if CachingReader keeps receiving requests but doesn't get any chunks back from the worker, and all chunks end up in the state READ_PENDING.

This should only happen for files > 5 MB, our current limit.

@uklotzde uklotzde changed the title lp1842679: Increase number of cached chunks [WiP] lp1842679: Increase number of cached chunks Sep 6, 2019
@uklotzde uklotzde changed the title [WiP] lp1842679: Increase number of cached chunks lp1842679: Increase number of cached chunks Sep 6, 2019
uklotzde (Contributor, Author) commented Sep 6, 2019

Looks like there were indeed some(?) bugs in the doubly-linked list management! I experienced assertion failures when reducing the total number of chunks to just 1, 2, or 3 to test all edge cases.

Now it should be super-safe. The code has been cleaned up and is guarded by lots of assertions. I did not experience a single issue (apart from massive drop-outs and warning logs) when shrinking the cache to its minimum capacity (see the comment in the last commit).

I have not analyzed if and how blocking could occur during recording.

@uklotzde uklotzde changed the title lp1842679: Increase number of cached chunks lp1842679: Increase number of cached chunks / fix caching Sep 6, 2019
daschuer (Member) commented Sep 6, 2019

Thank you for double-checking the list functions.

I think the audio dropouts themselves are caused by the blocking debug messages.
Maybe we can add a flag to issue only one per Mixxx run?
We did this with other engine debug messages as well.

Due to all the sample decks, the increased number of chunks raises the memory footprint from
345 MB to ~1 GB. If this leads to memory being pushed to the swap partition on low-end systems, all the caching efforts are for nothing.

Without having a closer look at why the CachingReaderWorker is blocked, I guess this is related to blocking disk I/O, which we cannot avoid. So it is probably a good idea to deal with it. In case the worker is blocked, all chunks are stored in the FIFO until it is full. After unblocking, the worker processes probably outdated chunks first.

Normally ~6 chunks are requested per callback, and normally only one chunk changes. All the other 72 chunks are not interesting, except for random seeks, which we cannot predict anyway.

It will take a while until the reader recovers from an overflow. If this happens with two decks, they will likely block each other again.

From this point of view it would be better to limit the worker FIFO to a reasonably low number and flush it in case of overflow. This way, likely required chunks are read after recovering instead of old, unwanted chunks. We may also reduce the overall number of chunks.

What do you think?

uklotzde (Contributor, Author) commented Sep 6, 2019

Cache size
Oh, I forgot that we have one reader per deck 🙈 Then we should indeed reduce the number of chunks to the original or an even lower amount.

Debug logs
Thanks for the hint! Leaving the debug logs enabled in production was never a good idea; it causes all kinds of issues, as we see here once again. Adding a flag for every single debug log is error-prone, so I'd prefer to delete them entirely. I will check how many are affected.

FIFOs
Messing around with the FIFOs is difficult; the whole design is still fragile. We must take the ownership between reader and worker into account and ensure that no chunks get lost. Let's leave this as it is. I'm not keen on spending hours or days on this again.

@uklotzde uklotzde changed the title lp1842679: Increase number of cached chunks / fix caching lp1842679: Fix caching of audio data Sep 6, 2019
daschuer (Member) commented Sep 6, 2019

> Leaving the debug logs enabled in production was never a good idea.

We should use them wisely. Imagine if we did not have this message here; we would probably have a hard time solving the problem.

> Messing around with the FIFOs ...

What if we just reduce the FIFO size to a smaller number? ~20?

uklotzde (Contributor, Author) commented Sep 6, 2019

We must not reduce the size of the back channel m_readerStatusFIFO, because the worker uses writeBlocking()! Reducing the size of the request FIFO would drop incoming requests when it is full.

uklotzde (Contributor, Author) commented Sep 6, 2019

Done with very detailed comments.

uklotzde (Contributor, Author) commented Sep 6, 2019

OMG, FIFO::writeBlocking() only works for count = 1!!

uklotzde (Contributor, Author) commented Sep 6, 2019

boost::lockfree::queue is not ideal, but it would work. After a failed bounded_push(), the CachingReader could invoke pop() (multi-consumer) to free the slot holding the oldest request and retry. The 2nd bounded_push() attempt must then succeed.

uklotzde (Contributor, Author) commented Sep 6, 2019

But I'm not planning to integrate boost.

daschuer (Member) commented

Works well now. Hopefully it actually fixes the user's issue.
LGTM, thank you.

@daschuer daschuer merged commit 4000c66 into mixxxdj:2.2 Sep 15, 2019
uklotzde (Contributor, Author) commented

@daschuer Thanks for merging these fixes! I was really concerned about the findings.
