Skip to content
This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

OfflineMapManager is crashing on serial queue access #15536

Closed
pappalar opened this issue Sep 2, 2019 · 13 comments
Closed

OfflineMapManager is crashing on serial queue access #15536

pappalar opened this issue Sep 2, 2019 · 13 comments
Labels
bug crash iOS Mapbox Maps SDK for iOS

Comments

@pappalar
Copy link

pappalar commented Sep 2, 2019

Crash:

mbgl::DefaultFileSource::setOfflineRegionObserver(mbgl::OfflineRegion&, std::__1::unique_ptr<mbgl::OfflineRegionObserver, std::__1::default_delete<mbgl::OfflineRegionObserver> >) + 162

crash.txt

Hello,

in order to remove all packs that are not needed anymore, we use a serial queue to remove all packs as soon as the Mabbox DB is loaded.

so on keyPath packs change (NSKeyValueChange.settings)
we dispatch to a SerialQueue and remove the pack one by one and wait for the first one to be completed before proceeding to reduce the load on the Database access.

serialQueue.async { [weak self] in
     guard let packs = storage.packs else {
        return []
    }
    let validPacks = packs.filter { $0.state != .invalid }

    for pack in validPacks {
        let packGroup = DispatchGroup()
        packGroup.enter()
        storage.removePack(pack) { [weak self] (error) in
            packGroup.leave()                
        }
        // Wait until the pack completion handler is executed
        // Marking the pack as removed for us
        packGroup.wait()
   }
}

However this seems to crash for many users.

Could you help us out?
we know the storage is a singleton, there is any indications on thread-safety that we are not aware about?

is the singleton expecting to be accessed from the main thread or any other type of single thread-queue access?

Thanks

@julianrex julianrex added iOS Mapbox Maps SDK for iOS bug crash labels Sep 3, 2019
@julianrex
Copy link
Contributor

@racer1988 thanks for the report. A few comments/questions:

  1. Is there a reason why you want to remove packs one-by-one on a background thread? removePack is already asynchronous, so you should be able to use the same kind of DispatchGroup mechanism to notify you when all the packs have been removed.

  2. MGLOfflineStorage.packs shouldn't be considered thread-safe (the property is marked with nonatomic). Can you try capturing packs earlier, before the async call? So something like

    if let packs = storage.packs  {
        serialQueue.async { [weak self] in
            let validPacks = packs.filter { $0.state != .invalid }
    
            <existing code>
        }
    }
  3. Looking at the crash report, it looks like either MGLOfflinePack is accessing a null or dangling pointer. We definitely have a bug here in that removePack:withCompletionHandler: can't be called twice with the same pack. I suspect that in your scenario above, there's a race condition where this is happening.

/cc @jmkiley

@pappalar
Copy link
Author

pappalar commented Sep 5, 2019

@julianrex Good questions! I'll give the best answers I have at the moment:

  1. Yes! There is a good question why I want to remove the packs one by one:
    removePack is asynchronous hence I want to wait for one removal to end to start the next one, this cannot be done on the mainQueue for obvious reason.
    The question is Why. This is because of a separate performance issue which I did not report until now and I would now track separately in Performance: Running removePack on multiple packs in parallel makes the DefaultFileSource queue stuck to 100% for a long time #15573
    Basically removing packs in parallel is WAY slower (of a x100 factor IMHO, from milliseconds to 10+seconds) and I cannot wait the DB to finish else the maps wont render as reported in Map is not rendering while Offline packs are getting deleted #15297

  2. I can try, but basically I am trying to run all operation that access the storage.packs on a serial queue to increase the performances and avoid race conditions.
    it might happen that the packs will be modified from another block executed before this one, so capturing the packs outside of the block might also be a issue with old broken references?

  3. This is a good question. I implemented the serial queue exactly to avoid race conditions on the storage.packs changes.
    I also don't want to use key-value observers on the packs because then I would simply make a copy of the storage in my code and that doens't make sense for me, I'd rather keep accessing the updated packs variable.
    There is something fishy going on there, as I noticed sometime I get crashes in the progress update for a pack because I try to read pack.state and this is .invalid and triggers an assertion (even for reading the state) which makes me think there is a mismatch in the pointers.

@julianrex
Copy link
Contributor

Re 3. - I'm able to reproduce a similar crash, and am investigating solutions. Stay tuned.

@julianrex
Copy link
Contributor

julianrex commented Sep 6, 2019

I suspect what's happening in your crash is that an MGLOfflinePack is being invalidated on the main thread (by reloadPacks or addContentsOfURL:withCompletionHandler:), while it's also being invalidated on your serial queue. There is a genuine bug here being addressed in #15582, though the larger issue of accessing packs from two threads is problematic.

In your scenario I would to do two things:

  1. Wait to remove the packs after the first call to sharedOfflineStorage: sharedOfflineStorage calls reloadPacks, which takes some time. You may have this covered by your use of KVO (I've not tested this).

  2. Ensure the calls to removePack occur on the main thread. You can still use a serial background queue to coordinate this and remove packs one-by-one - it's just a little more dancing around 😞.

@pappalar
Copy link
Author

@julianrex Ok Thanks.
I can try (2.). Do you think the idea is the same if I access storage.packs always on a background thread (serial) or you see any issue with that too?

I know the singleton is not thread safe, however it adds a huge layer of complexity as everything is a "race condition" coming either with KVO or notifications while I might be iterating on the packs.

I understand the need of this but I wonder if accessing am invalid pack or removing a invalid pack and similar scenarios wouldn't be better handled with a throw rather than with a crash.

This way if there is some sort of reloadPacks in the middle of my iteration (caused by something else) I could still safely fallback to do nothing on that pack and avoid a user crash.

The main problematic thing for me is the storage.packs variable, it gives a direct access to the underlying level, updates automatically, and I get information about how it changes with KVO and Notifications.
However I either track a copy of the entire DB state by using the KVO on a local structure, or I fully trust the storage.pack to return me the valid packs.

@pappalar
Copy link
Author

by the way all of my operations are run after the first load of packs in KVO (I use that to trigger the cleanup section) (so on keyPath packs change (NSKeyValueChange.settings))

@julianrex
Copy link
Contributor

julianrex commented Sep 10, 2019

Do you think the idea is the same if I access storage.packs always on a background thread (serial) or you see any issue with that too?

Yep, since addContents(of:withCompletionHandler:) modifies packs on a background thread that differs from your serial queue and the main queue:

[strongSelf.packs enumerateObjectsUsingBlock:^(MGLOfflinePack * _Nonnull pack, NSUInteger idx, BOOL * _Nonnull stop) {
MGLOfflinePack *newPack = packsByIdentifier[@(pack.mbglOfflineRegion->getID())];
if (newPack) {
MGLOfflinePack *previousPack = [mutablePacks objectAtIndex:idx];
[previousPack invalidate];

I understand the need of this but I wonder if accessing am invalid pack or removing a invalid pack and similar scenarios wouldn't be better handled with a throw rather than with a crash.

Absolutely - we don't want it to crash! 😄 Fixing the crash is happening in #15582 - it won't trigger an exception (at the moment) given the current situation, however longer term I think that's the right thing to do.

I think the solution here is to:

  1. Add a queue parameter somewhere, that defines where completion blocks should be dispatched too. (Maybe something like MGLOfflineStorage.sharedWithQueue(_:) 🤔)
  2. Ensure all accesses to the packs property is serial (probably using another internal queue)
  3. Clarify/fix what happens if a reloadPacks happens while modifying, e.g. through an add or remove (test here though still WIP)

These three will be addressed in follow-on work from #15582.

@julianrex
Copy link
Contributor

For tracking, here's a related ticket: #14206

@pappalar
Copy link
Author

pappalar commented Sep 10, 2019

Ah Thanks for the explanation.

What I would think of: (I hope random brainstorming inputs are welcome 😄 )

  • reloadPacks changes the full underlying memory structure too often (done I think on each removal?) making it really hard both inside and outside to ensure address are kept consistent and avoiding dangling pointers. What is the real need behind it? Could the state .invalid be used as a source of truth until the DB is reloaded for other reasons? (like garbage collected but not forcefully re-read each time)
  • Could the storage.packs being handled internally with a concurrent sync queue (to implement a write-read lock) in order to keep read performances as high as possible but ensure any deletion and change of state to a pack will make readers wait (with implication on read access from the main queue)

combining the above together will make the database storage more thread-safe and probably avoid race conditions.

EDIT:

just saw your comment about #14206. If there is no other way to make thread safety a possibility (due to caching etc) maybe the entire storage should either be documented as mandatory to be called on the mainThread and raise exception if not) or should internally dispatch to it's own queue to ensure thread-safety.

As with the points above and my previous comments, I am totally for removing the storage.packs property and make it an asynchronous call (so that internally it can be dispatched thread-safe but doesn't block the mainQueue for example)

Every other interaction method with the storage (add, remove, invalidate) is already having a completionBlock.
making the read also using completion block could be the step to make the singleton internals to work completely independent from the outside context.
(would probably require a major version update)

@julianrex
Copy link
Contributor

I hope random brainstorming inputs are welcome

Always! As are PRs 😃

Could the storage.packs being handled internally with a concurrent sync queue

Yep, that's what I'm thinking. We already use this pattern elsewhere in the SDK so it's 👍.

I am totally for removing the storage.packs

This may not be necessary, though it could become a convenience property. Either way this may involve a SEMVER change, though ideally we could do the bulk of the changes without needing one.

I added #15598 as a catchall ticket declaring our desire for refactoring - it'll make our testing more stable regardless.

Re #14206 - I think this class should be able to be used from a background thread. At the moment it's precarious as you've found. Hopefully with the crash fixed, that'll get you closer to where you want to be. The refactoring is going to take a little longer.

@pappalar
Copy link
Author

@julianrex I did move my entire implementation today to access all MGLOfflineStorage API calls (packs, addPack, removePack etc..) plus all packs internal calls (pack.resume, pack.suspend etc..) all from the MainThread.

Due to the old implementation I have main Dispatch one inside the other, always to the main queue..

however I wanted to report that even this way I get the same crash.

See below a screenshot of the same stack trace I reported before, but all coming from the main thread

Screenshot 2019-09-10 at 18 16 49

@julianrex
Copy link
Contributor

Yep, I think this is still to be expected. The crash will be fixed in #15582.

julianrex pushed a commit that referenced this issue Sep 12, 2019
julianrex pushed a commit that referenced this issue Sep 18, 2019
@julianrex
Copy link
Contributor

The crash has been addressed in #15536, so closing this ticket.

However, for the larger refactor (and support background queues) please see #15598 and its child issues.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug crash iOS Mapbox Maps SDK for iOS
Projects
None yet
Development

No branches or pull requests

2 participants