Move shared memory block swapping control to osrm-datastore #2570

Closed
danpat opened this issue Jun 21, 2016 · 9 comments · Fixed by #3015
@danpat
Member

danpat commented Jun 21, 2016

OSRM supports a hot-swap feature - you can use osrm-datastore to replace an old dataset with a new one while osrm-routed (or node-osrm) remains running, with only a very brief hitch in request handling performance.

However, the design is a bit ugly: osrm-datastore simply loads the new data, and it's actually up to the reader process (osrm-routed) to perform the data exchange, mark the new block as in use, and release the old memory.

This has a few problems:

  • the locking logic is hard to grasp
  • you can't have multiple readers of the shared memory block and still support hot-swapping
  • readers need write-access to the block

We should change this. I picture it working as follows:

  • shared memory segments must be named via the command-line, no more "magic numbers"
  • users of a shared memory block register their interest via a semaphore
  • osrm-datastore loads new data, then signals all registered listeners that new data is available, then waits
  • listeners swap to the new data address, and release interest in the old one
  • osrm-datastore removes the old data once all listeners have moved, then exits

Some additional locking would be required to avoid problems with a second osrm-datastore run starting during an exchange, and to handle listeners starting up or shutting down mid-exchange.
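
For illustration, a minimal sketch of what the announce/acknowledge handshake above could look like with named semaphores. This is not the eventual implementation: the semaphore names are made up, and the count of registered listeners is assumed to be tracked elsewhere (e.g. in a small control segment).

```cpp
// Sketch only: osrm-datastore announces a new dataset and waits for all
// registered listeners to acknowledge before freeing the old block.
// Semaphore names are hypothetical; num_listeners is assumed to be known.
#include <boost/interprocess/sync/named_semaphore.hpp>

namespace bip = boost::interprocess;

// osrm-datastore side
void announce_and_wait(int num_listeners)
{
    bip::named_semaphore new_data(bip::open_or_create, "osrm_new_data", 0);
    bip::named_semaphore switched(bip::open_or_create, "osrm_switched", 0);

    for (int i = 0; i < num_listeners; ++i)
        new_data.post(); // wake every registered listener

    for (int i = 0; i < num_listeners; ++i)
        switched.wait(); // block until each listener has swapped over

    // now safe to remove the old shared memory block and exit
}

// listener side (osrm-routed / node-osrm), e.g. in a background thread
void wait_and_swap()
{
    bip::named_semaphore new_data(bip::open_or_create, "osrm_new_data", 0);
    bip::named_semaphore switched(bip::open_or_create, "osrm_switched", 0);

    new_data.wait(); // block until datastore announces new data
    // ...swap to the new data address, drop references to the old block...
    switched.post(); // release interest in the old block
}
```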

@emiltin
Contributor

emiltin commented Jun 21, 2016

related #1221

@TheMarex
Member

TheMarex commented Oct 5, 2016

As a first step I would like to implement this without named memory segments, since those would definitely require an API change. I believe that to get this right we cannot operate at process granularity but need to consider every thread.

Currently we have a piece of code that waits until all requests are finished before it swaps the data. However, that is unnecessary: what we really want is to wait until all requests that reference the old dataset have finished before we deallocate that shared memory region.

To allow this we would need to keep multiple versions of the shared-memory data facade in memory. Every request holds shared ownership of its data facade. Once all ownership is given up, the facade is deallocated and we can signal the release of the shared memory region.

Things that need to change to enable this:

  • All stored references to data facades (e.g. for routing algorithms or inside the plugins) need to be changed to non-owning references while the request is executed
  • The CheckAndReloadFacade() logic can't live in the shared datafacade anymore, because every datafacade is immutable
  • Engine will store a shared_ptr to the current datafacade, protected by a (process-local) reader-writer lock. Every time a new facade is created, a write lock will need to be acquired.
  • Every request will first copy that shared_ptr (acquiring and releasing the read lock) and hold it for the duration of the whole request (see the sketch below).
  • An IPC reader/writer lock for every dataset in use. On creation, every datafacade enters the critical section as a reader and leaves it when the destructor is called.
  • An IPC reader/writer lock for the shared memory section that contains information about the newest dataset.

@daniel-j-h do you think this sounds sane at all?
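
For illustration, a minimal sketch of the process-local half of this (the shared_ptr behind a reader-writer lock), not actual OSRM code; it uses C++17's std::shared_mutex for brevity where the real thing might use boost::shared_mutex, and the type and member names are hypothetical.

```cpp
// Sketch only: Engine keeps a shared_ptr to the current immutable datafacade
// behind a process-local reader-writer lock; each request copies the pointer
// once and holds it for the whole request. Names are hypothetical.
#include <memory>
#include <mutex>
#include <shared_mutex>

struct DataFacade; // immutable view over one dataset in shared memory

class Engine
{
  public:
    // request path: shared ownership of whatever facade is current right now
    std::shared_ptr<DataFacade> GetFacade() const
    {
        std::shared_lock<std::shared_mutex> lock(facade_mutex);
        return current_facade;
    }

    // swap path: install the facade for a newly loaded dataset; the old one
    // is destroyed when the last in-flight request releases its copy
    void SetFacade(std::shared_ptr<DataFacade> new_facade)
    {
        std::unique_lock<std::shared_mutex> lock(facade_mutex);
        current_facade = std::move(new_facade);
    }

  private:
    mutable std::shared_mutex facade_mutex;
    std::shared_ptr<DataFacade> current_facade;
};
```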

@TheMarex TheMarex self-assigned this Oct 5, 2016
TheMarex added a commit that referenced this issue Oct 6, 2016
This is the first step to having fine grained locking on data updates,
see issue #2570.
@MoKob

MoKob commented Oct 6, 2016

First of all, I agree with the assessment in #2570 (comment).

Still, I'd like to raise the question of whether we are implementing a solution here that is unnecessarily complicated simply to avoid changing the API.

> Currently we have a piece of code that waits until all requests are finished before it swaps the data. However, that is unnecessary: what we really want is to wait until all requests that reference the old dataset have finished before we deallocate that shared memory region.

This is definitely a bottleneck, but given our usual request times it's not a big problem to me. We don't lose much here.

What I am more concerned about is the complexity of the solution. Could we do better if we target this for a 6.0 release and make it an API change? Are we doing a major change here to work around an API change?

@daniel-j-h
Member

@TheMarex sounds good overall, shared ownership makes sense.

> Currently we have a piece of code that waits until all requests are finished before it swaps the data. However, that is unnecessary: what we really want is to wait until all requests that reference the old dataset have finished before we deallocate that shared memory region.

The issue I'm seeing here is the following: at the moment we only ever have one dataset loaded. With your proposed design we potentially have multiple datasets loaded into memory as long as there are requests holding onto them. This multiplies memory requirements, or am I missing something here?

For the record, I played around with shared memory reader / writer locks for hotswapping over at
https://github.com/daniel-j-h/shm-hotswap

@MoKob

MoKob commented Oct 6, 2016

> The issue I'm seeing here is the following: at the moment we only ever have one dataset loaded. With your proposed design we potentially have multiple datasets loaded into memory as long as there are requests holding onto them. This multiplies memory requirements, or am I missing something here?

Is this not already the case? Loading a dataset takes quite some time. So I guess the hot-swap would be a real cold swap if we don't already load two sets into memory.

@TheMarex
Member

TheMarex commented Oct 6, 2016

> This is definitely a bottleneck, but given our usual request times it's not a big problem to me. We don't lose much here.

We sometimes hit outliers on our production machines where requests take several seconds and clog the pipeline. This makes dataset switching times quite non-deterministic. Combine that with the fact that osrm-datastore doesn't actually wait until the data is completely switched over, but just exits after the shared memory region is written, and we end up with hacks in place just to ensure that the data was loaded correctly.

> What I am more concerned about is the complexity of the solution.

This is actually less complex than our current locking setup, because we needed to add several locks to work around edge cases (concurrent datastore runs, non-committed data changes, several race conditions between the two). And multiple clients are just not supported at all, which is a major downside for a shared-memory based system.

We can replace that tangle of mutexes with established multi-processing idioms. This will make the code both easier to understand and less error-prone. And as a cherry on top, we will get support for multiple clients.

> Could we do better if we target this for a 6.0 release and make it an API change?

As I outlined above, this is a strict feature subset of something that would require an API change. This approach can be extended to use named datastores, but that is actually completely independent of how we do the locking.

> This multiplies memory requirements, or am I missing something here?

Both yes and no. It multiplies the memory requirement for storing a SharedDataFacade object, which can now have two instances (old and new dataset). However, since that object does not actually own any of the significant memory but is only a collection of pointers and wrappers, this will not impact our memory usage.
As @MoKob already pointed out, when using shared memory there are always two complete datasets in shared memory at the time of switching, so this does not change that part of our memory requirements.

@TheMarex
Member

TheMarex commented Oct 8, 2016

I found a limitation with the design:

  1. Start osrm-routed with shared memory
  2. Run osrm-datastore with another dataset
  3. Observe that osrm-datastore is blocked because it can't free the old dataset

Even though there are no queries referencing the old dataset, osrm-datastore is blocked. This is because the Engine stores a reference to the dataset to be able to serve queries from it. If there are no queries, no code in Engine is executed, so there is no way for it to detect a new dataset. osrm-datastore is therefore actually waiting for the next query to arrive, since only then does the Engine detect the new dataset and do the switch (thereby releasing ownership of the old dataset).

This of course breaks our test setup, where we start an osrm-routed instance, wait for osrm-datastore to finish and then send queries.

We could side-step that problem by not needing any client intervention to switch datasets. That would mean we don't keep state (e.g. save the current SharedDataFacade) but always create a new one for every request. However, right now creating a new data facade is a rather expensive operation, involving the creation of a lot of wrapper objects and mapping a shared memory region (that is at least one syscall, probably more).

I'm going to benchmark this to see what we would be looking at in terms of overhead. We could potentially save some overhead by keeping the shared memory mapped at constant locations (one syscall fewer). The downside would probably be a constant memory requirement of datasize*2 (not just during updates). That could be improved by just switching the memory lock on/off, letting the OS do its job and page the data out.
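
A rough sketch of the kind of micro-benchmark meant here (not the one actually run): construct a facade per iteration and report the average cost. DataFacade and MakeSharedDataFacade() are stubbed stand-ins for whatever factory would map the shared memory and build the wrappers.

```cpp
// Sketch only: time facade construction per request. DataFacade and
// MakeSharedDataFacade() are stubbed stand-ins, not OSRM's real types.
#include <chrono>
#include <cstdio>
#include <memory>

struct DataFacade { /* in reality: wrappers + a mapped shared memory region */ };

std::shared_ptr<DataFacade> MakeSharedDataFacade()
{
    return std::make_shared<DataFacade>(); // real version would map/attach here
}

int main()
{
    const int iterations = 10000;
    const auto start = std::chrono::steady_clock::now();

    for (int i = 0; i < iterations; ++i)
    {
        auto facade = MakeSharedDataFacade(); // built and dropped per request
        (void)facade;
    }

    const auto elapsed = std::chrono::steady_clock::now() - start;
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    std::printf("%.2f us per facade construction\n", double(us) / iterations);
}
```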

@TheMarex
Member

I did a quick benchmark on this and creating SharedDataFacade objects is surprisingly expensive, about half a millisecond. As such I think adapting the design in the following way should get us closest to where we want to go:

  1. osrm-datastore always tries to overwrite the oldest dataset. It can never overwrite the most recent dataset. osrm-datastore runs can't happen in parallel.
  2. If the oldest dataset is still in use (read lock held), it blocks and waits for those requests to finish before updating.
  3. A read lock on a shared memory region is only held for the duration of a query, not bound to the lifetime of a SharedDataFacade (see the sketch below).
  4. Every time a SharedDataFacade is deallocated, it also deallocates the shared memory it points to if:
    • the shared memory is not the newest dataset
    • there is no read lock on it (i.e., it is not still in use, or in use again, by another query)
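
A minimal sketch of point 3, holding the IPC read lock only while a query runs, using Boost.Interprocess primitives; this is not OSRM's actual code, and the mutex name is made up.

```cpp
// Sketch only: hold the IPC read lock on a dataset's region just for the
// duration of one query. The mutex name is hypothetical; in practice it
// would be derived from the region currently in use.
#include <boost/interprocess/sync/named_sharable_mutex.hpp>
#include <boost/interprocess/sync/sharable_lock.hpp>

namespace bip = boost::interprocess;

void handle_query(/* request, facade, ... */)
{
    bip::named_sharable_mutex region_mutex(bip::open_or_create, "osrm_region_1");

    // shared (read) side held only while this query runs; osrm-datastore
    // takes the exclusive side before it may overwrite this region
    bip::sharable_lock<bip::named_sharable_mutex> read_lock(region_mutex);

    // ...run the routing query against the dataset in this region...

} // read_lock released here, independent of the SharedDataFacade lifetime
```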

@TheMarex
Member

I now finally have this design working. The only remaining issues are around hardening: how do we detect and recover from dangling locks? This can happen if either osrm-routed or osrm-datastore crashes inside a critical section of a shared memory lock. We have a tool for that, osrm-unlock-all, but some resilience for server deployments is important.
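
One possible hardening approach, sketched with Boost.Interprocess (not what OSRM ships): take the lock with a deadline and treat a timeout as a likely stale lock left behind by a crashed process. The names and the 30-second timeout are made up.

```cpp
// Sketch only: detect a probably-dangling lock by locking with a deadline.
// Names and the timeout are hypothetical; recovery would still be manual
// (osrm-unlock-all) or handled by a watchdog.
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/interprocess/sync/named_sharable_mutex.hpp>
#include <cstdio>

namespace bip = boost::interprocess;

bool lock_or_report_stale(bip::named_sharable_mutex &mutex)
{
    const auto deadline = boost::posix_time::microsec_clock::universal_time() +
                          boost::posix_time::seconds(30);

    if (!mutex.timed_lock(deadline)) // exclusive lock with a deadline
    {
        // nothing should hold this lock for that long: most likely a process
        // crashed inside the critical section and left the lock dangling
        std::fprintf(stderr, "possible dangling shared memory lock detected\n");
        return false;
    }
    return true;
}
```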
