Move shared memory block swapping control to osrm-datastore #2570

Closed
danpat opened this issue Jun 21, 2016 · 9 comments · Fixed by #3015
@danpat
Member

danpat commented Jun 21, 2016

OSRM supports a hot-swap feature - you can use osrm-datastore to replace an old dataset with a new one while osrm-routed (or node-osrm) remains running, with only a very brief hitch in request handling performance.

However, the design is a bit ugly: osrm-datastore simply loads the new data, and it's actually up to the reader process (osrm-routed) to perform the data exchange, mark the new block as in use, and release the old memory.

This has a few problems:

  • the locking logic is hard to grasp
  • you can't have multiple readers of the shared memory block and still support hot-swapping
  • readers need write-access to the block

We should change this. I picture it working as follows:

  • shared memory segments must be named via the command-line, no more "magic numbers"
  • users of a shared memory block register their interest via a semaphore
  • osrm-datastore loads new data, then signals all registered listeners that new data is available, then waits
  • listeners swap to the new data address, and release interest in the old one
  • osrm-datastore removes the old data once all listeners have moved, then exits

Some additional locking would be required to avoid problems with a second osrm-datastore run starting during an exchange, and to handle listeners starting up or shutting down mid-exchange.
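
For illustration, a minimal sketch of what the announce/acknowledge handshake above could look like with named semaphores. This is not the eventual implementation: the semaphore names are made up, and the count of registered listeners is assumed to be tracked elsewhere (e.g. in a small control segment).

```cpp
// Sketch only: osrm-datastore announces a new dataset and waits for all
// registered listeners to acknowledge before freeing the old block.
// Semaphore names are hypothetical; num_listeners is assumed to be known.
#include <boost/interprocess/sync/named_semaphore.hpp>

namespace bip = boost::interprocess;

// osrm-datastore side
void announce_and_wait(int num_listeners)
{
    bip::named_semaphore new_data(bip::open_or_create, "osrm_new_data", 0);
    bip::named_semaphore switched(bip::open_or_create, "osrm_switched", 0);

    for (int i = 0; i < num_listeners; ++i)
        new_data.post(); // wake every registered listener

    for (int i = 0; i < num_listeners; ++i)
        switched.wait(); // block until each listener has swapped over

    // now safe to remove the old shared memory block and exit
}

// listener side (osrm-routed / node-osrm), e.g. in a background thread
void wait_and_swap()
{
    bip::named_semaphore new_data(bip::open_or_create, "osrm_new_data", 0);
    bip::named_semaphore switched(bip::open_or_create, "osrm_switched", 0);

    new_data.wait(); // block until datastore announces new data
    // ...swap to the new data address, drop references to the old block...
    switched.post(); // release interest in the old block
}
```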

@emiltin
Contributor

emiltin commented Jun 21, 2016

related #1221

@TheMarex
Member

TheMarex commented Oct 5, 2016

As a first step I would like to implement this without named memory segments, since those would definitely require an API change. I believe that to get this right we cannot operate at process granularity but need to consider every thread.

Currently we have a piece of code that waits until all requests are finished before it swaps the data. However, that is unnecessary: what we really want is to wait until all requests that reference the old dataset have finished before we deallocate that shared memory region.

To allow this we would need to keep multiple versions of the shared-memory data facade in memory. Every request holds shared ownership of its data facade. Once all ownership is given up, the facade is deallocated and we can signal the release of the shared memory region.

Things that need to change to enable this:

  • All stored references to data facades (e.g. for routing algorithms or inside the plugins) need to be changed to non-owning references while the request is executed
  • The CheckAndReloadFacade() logic can't live in the shared datafacade anymore, because every datafacade is immutable
  • Engine will store a shared_ptr to the current datafacade, protected by a (process-local) reader-writer lock. Every time a new facade is created, a write lock will need to be acquired.
  • Every request will first copy that shared_ptr (acquiring and releasing the read lock) and hold it for the duration of the whole request (see the sketch below).
  • An IPC reader/writer lock for every dataset in use. On creation, every datafacade enters the critical section as a reader and leaves it when the destructor is called.
  • An IPC reader/writer lock for the shared memory section that contains information about the newest dataset.

@daniel-j-h do you think this sounds sane at all?
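
For illustration, a minimal sketch of the process-local half of this (the shared_ptr behind a reader-writer lock), not actual OSRM code; it uses C++17's std::shared_mutex for brevity where the real thing might use boost::shared_mutex, and the type and member names are hypothetical.

```cpp
// Sketch only: Engine keeps a shared_ptr to the current immutable datafacade
// behind a process-local reader-writer lock; each request copies the pointer
// once and holds it for the whole request. Names are hypothetical.
#include <memory>
#include <mutex>
#include <shared_mutex>

struct DataFacade; // immutable view over one dataset in shared memory

class Engine
{
  public:
    // request path: shared ownership of whatever facade is current right now
    std::shared_ptr<DataFacade> GetFacade() const
    {
        std::shared_lock<std::shared_mutex> lock(facade_mutex);
        return current_facade;
    }

    // swap path: install the facade for a newly loaded dataset; the old one
    // is destroyed when the last in-flight request releases its copy
    void SetFacade(std::shared_ptr<DataFacade> new_facade)
    {
        std::unique_lock<std::shared_mutex> lock(facade_mutex);
        current_facade = std::move(new_facade);
    }

  private:
    mutable std::shared_mutex facade_mutex;
    std::shared_ptr<DataFacade> current_facade;
};
```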

@TheMarex TheMarex self-assigned this Oct 5, 2016
TheMarex added a commit that referenced this issue Oct 6, 2016
This is the first step to having fine grained locking on data updates,
see issue #2570.
@MoKob

MoKob commented Oct 6, 2016

First of all, I agree with the assessment in #2570 (comment).

Still, I'd like to raise the question of whether we are implementing a solution here that is unnecessarily complicated simply to avoid changing the API.

> Currently we have a piece of code that waits until all requests are finished before it swaps the data. However, that is unnecessary: what we really want is to wait until all requests that reference the old dataset have finished before we deallocate that shared memory region.

This is definitely a bottleneck, but given our usual request times it's not a big problem to me. We don't lose much here.

What I am more concerned about is the complexity of the solution. Could we do better if we target this for a 6.0 release and make it an API change? Are we doing a major change here to work around an API change?

@daniel-j-h
Member

@TheMarex sounds good overall, shared ownership makes sense.

> Currently we have a piece of code that waits until all requests are finished before it swaps the data. However, that is unnecessary: what we really want is to wait until all requests that reference the old dataset have finished before we deallocate that shared memory region.

The issue I'm seeing here is the following: at the moment we only ever have one dataset loaded. With your proposed design we potentially have multiple datasets loaded into memory as long as there are requests holding onto them. This multiplies memory requirements, or am I missing something here?

For the record, I played around with shared memory reader / writer locks for hotswapping over at
https://github.com/daniel-j-h/shm-hotswap

@MoKob

MoKob commented Oct 6, 2016

> The issue I'm seeing here is the following: at the moment we only ever have one dataset loaded. With your proposed design we potentially have multiple datasets loaded into memory as long as there are requests holding onto them. This multiplies memory requirements, or am I missing something here?

Is this not already the case? Loading a dataset takes quite some time. So I guess the hot-swap would be a real cold swap if we don't already load two sets into memory.

@TheMarex
Member

TheMarex commented Oct 6, 2016

> This is definitely a bottleneck, but given our usual request times it's not a big problem to me. We don't lose much here.

We sometimes hit outliers on our production machines where requests take several seconds and clog the pipeline. This makes dataset switching times quite non-deterministic. Combine that with the fact that osrm-datastore doesn't actually wait until the data is completely switched over, but just exits after the shared memory region is written, and we end up with hacks in place just to ensure that the data was loaded correctly.

> What I am more concerned about is the complexity of the solution.

This is actually less complex than our current locking setup, because we needed to add several locks to work around edge cases (concurrent datastore runs, non-committed data changes, several race conditions between the two). And multiple clients are just not supported at all, which is a major downside for a shared-memory based system.

We can replace that tangle of mutexes with established multi-processing idioms. This will make the code both easier to understand and less error-prone. And as a cherry on top, we will get support for multiple clients.

> Could we do better if we target this for a 6.0 release and make it an API change?

As I outlined above, this is a strict feature subset of something that would require an API change. This approach can be extended to use named datastores, but that is actually completely independent of how we do the locking.

> This multiplies memory requirements, or am I missing something here?

Both yes and no. It multiplies the memory requirement for storing a SharedDataFacade object, which can now have two instances (old and new dataset). However, since that object does not actually own any of the significant memory but is only a collection of pointers and wrappers, this will not impact our memory usage.
As @MoKob already pointed out, when using shared memory there are always two complete datasets in shared memory at the time of switching, so this does not change that part of our memory requirements.

@TheMarex
Member

TheMarex commented Oct 8, 2016

I found a limitation with the design:

  1. Start osrm-routed with shared memory
  2. Run osrm-datastore with another dataset
  3. Observe that osrm-datastore is blocked because it can't free the old dataset

Even though there are no queries referencing the old dataset, osrm-datastore is blocked. This is because the Engine stores a reference to the dataset to be able to serve queries from it. If there are no queries, no code in Engine is executed, so there is no way for it to detect a new dataset. osrm-datastore is therefore actually waiting for the next query to arrive, since only then does the Engine detect the new dataset and do the switch (thereby releasing ownership of the old dataset).

This of course breaks our test setup, where we start an osrm-routed instance, wait for osrm-datastore to finish and then send queries.

We could side-step that problem by not needing any client intervention to switch datasets. That would mean we don't keep state (e.g. save the current SharedDataFacade) but always create a new one for every request. However, right now creating a new data facade is a rather expensive operation, involving the creation of a lot of wrapper objects and mapping a shared memory region (that is at least one syscall, probably more).

I'm going to benchmark this to see what we would be looking at in terms of overhead. We could potentially save some overhead by keeping the shared memory mapped at constant locations (one syscall fewer). The downside would probably be a constant memory requirement of datasize*2 (not just during updates). That could be improved by just switching the memory lock on/off, letting the OS do its job and page the data out.
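
A rough sketch of the kind of micro-benchmark meant here (not the one actually run): construct a facade per iteration and report the average cost. DataFacade and MakeSharedDataFacade() are stubbed stand-ins for whatever factory would map the shared memory and build the wrappers.

```cpp
// Sketch only: time facade construction per request. DataFacade and
// MakeSharedDataFacade() are stubbed stand-ins, not OSRM's real types.
#include <chrono>
#include <cstdio>
#include <memory>

struct DataFacade { /* in reality: wrappers + a mapped shared memory region */ };

std::shared_ptr<DataFacade> MakeSharedDataFacade()
{
    return std::make_shared<DataFacade>(); // real version would map/attach here
}

int main()
{
    const int iterations = 10000;
    const auto start = std::chrono::steady_clock::now();

    for (int i = 0; i < iterations; ++i)
    {
        auto facade = MakeSharedDataFacade(); // built and dropped per request
        (void)facade;
    }

    const auto elapsed = std::chrono::steady_clock::now() - start;
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    std::printf("%.2f us per facade construction\n", double(us) / iterations);
}
```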

@TheMarex
Member

I did a quick benchmark on this and creating SharedDataFacade objects is surprisingly expensive, about half a millisecond. As such I think adapting the design in the following way should get us closest to where we want to go:

  1. osrm-datastore always tries to overwrite the oldest dataset. It can never overwrite the most recent dataset. osrm-datastore runs can't happen in parallel.
  2. If the oldest dataset is still in use (read lock held), it blocks and waits for those requests to finish before updating.
  3. A read lock on a shared memory region is only held for the duration of a query, not bound to the lifetime of a SharedDataFacade (see the sketch below).
  4. Every time a SharedDataFacade is deallocated, it also deallocates the shared memory it points to if:
    • the shared memory is not the newest dataset
    • there is no read lock on it (i.e., it is not still in use, or in use again, by another query)
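
A minimal sketch of point 3, holding the IPC read lock only while a query runs, using Boost.Interprocess primitives; this is not OSRM's actual code, and the mutex name is made up.

```cpp
// Sketch only: hold the IPC read lock on a dataset's region just for the
// duration of one query. The mutex name is hypothetical; in practice it
// would be derived from the region currently in use.
#include <boost/interprocess/sync/named_sharable_mutex.hpp>
#include <boost/interprocess/sync/sharable_lock.hpp>

namespace bip = boost::interprocess;

void handle_query(/* request, facade, ... */)
{
    bip::named_sharable_mutex region_mutex(bip::open_or_create, "osrm_region_1");

    // shared (read) side held only while this query runs; osrm-datastore
    // takes the exclusive side before it may overwrite this region
    bip::sharable_lock<bip::named_sharable_mutex> read_lock(region_mutex);

    // ...run the routing query against the dataset in this region...

} // read_lock released here, independent of the SharedDataFacade lifetime
```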

@TheMarex
Member

I now finally have this design working. The only remaining issues are around hardening: how do we detect and recover from dangling locks? This can happen if either osrm-routed or osrm-datastore crashes inside a critical section of a shared memory lock. We have a tool for that, osrm-unlock-all, but some resilience for server deployments is important.
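
One possible hardening approach, sketched with Boost.Interprocess (not what OSRM ships): take the lock with a deadline and treat a timeout as a likely stale lock left behind by a crashed process. The names and the 30-second timeout are made up.

```cpp
// Sketch only: detect a probably-dangling lock by locking with a deadline.
// Names and the timeout are hypothetical; recovery would still be manual
// (osrm-unlock-all) or handled by a watchdog.
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/interprocess/sync/named_sharable_mutex.hpp>
#include <cstdio>

namespace bip = boost::interprocess;

bool lock_or_report_stale(bip::named_sharable_mutex &mutex)
{
    const auto deadline = boost::posix_time::microsec_clock::universal_time() +
                          boost::posix_time::seconds(30);

    if (!mutex.timed_lock(deadline)) // exclusive lock with a deadline
    {
        // nothing should hold this lock for that long: most likely a process
        // crashed inside the critical section and left the lock dangling
        std::fprintf(stderr, "possible dangling shared memory lock detected\n");
        return false;
    }
    return true;
}
```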
