Refactor mapping of files to memory locations #4952

Closed
TheMarex opened this issue Mar 12, 2018 · 2 comments

@TheMarex
Member

Currently the logic to map a section of a file to a location in memory is hand-written in storage.cpp for every file, in the function PopulateLayout. This is tedious and error-prone, as the logic needs to be adapted every time we change the structure of the files.

To change this, we should move the logic for figuring out the blocks of memory contained in each file into its own function.

Unblocks mapping separate files in #1947 and makes #4873 easier.

@TheMarex TheMarex self-assigned this Mar 12, 2018
@TheMarex
Member Author

While thinking about how to implement this, it quickly became clear that the main blocker is our on-disk format: it is relatively unstructured and does not allow us to compute the size of memory segments or to read/map the data in a general way.

To address these problems we need to change our file format. Seems like #2242 is rearing its ugly head again. Our new constraints would be:

  • List content of file and its size
  • Content of file mmap-able to memory (e.g. no decoding step with memory buffer)
  • Writable in a streamable way
  • Data should have a human-readable name
  • Data should be hierarchical

This already sounds very much like a tar file, and any implementation of it would probably end up very close to a tar-lite format. However, using actual tar files would additionally solve a lot of problems around tooling: they would be easy to inspect, extend and modify. Ultimately we would not need to care about how data is packaged; we could accept any split of data.
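To illustrate how a plain tar file covers the constraints above, here is a minimal sketch using Python's standard `tarfile` and `mmap` modules as a stand-in for whatever tar library we end up using. The function names are hypothetical, the block names are written without a leading slash, and `offset_data` is the attribute CPython's `TarInfo` uses for the start of a member's data. It assumes an uncompressed archive, which is what makes the payload mappable without a decoding step.

```python
import io
import mmap
import tarfile


def write_blocks(path, blocks):
    """Write each named byte block to an uncompressed tar, one member at a time."""
    with tarfile.open(path, mode="w:") as archive:
        for name, payload in blocks.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


def list_blocks(path):
    """Return {name: (data offset, size)} for every regular file in the tar."""
    blocks = {}
    with tarfile.open(path, mode="r:") as archive:
        for member in archive.getmembers():
            if member.isfile():
                blocks[member.name] = (member.offset_data, member.size)
    return blocks


def read_block_mapped(path, name):
    """Fetch one block through a read-only mapping of the whole archive."""
    offset, size = list_blocks(path)[name]
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            return bytes(mapping[offset:offset + size])
```

Since tar stores each payload uncompressed after a fixed-size header, the offset and size from the listing are enough to locate the data inside the mapping.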

Storage

Our investment in a general abstraction layer with FileWriter and FileReader seems to pay off: we can swap out the current implementation for one that writes named data. As a low-level library to read/write tar files we can use microtar, which we can easily wrap and bundle as a third-party library.

We can use a "filesystem" like structure to get a human-readable hierarchical representation of data. For example the current .osrm.mldgr file could be represented as:

/mld/multilevelgraph/node_array
/mld/multilevelgraph/edge_array
/mld/multilevelgraph/node_to_edge_offsets
OSRM_VERSION
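The flat names above already encode the hierarchy, so a tree view can be derived from them without any extra metadata. A small sketch, assuming slash-separated names and a hypothetical helper `build_tree`:

```python
def build_tree(names):
    """Group flat, slash-separated block names into nested dictionaries."""
    tree = {}
    for name in names:
        parts = [part for part in name.split("/") if part]
        if not parts:
            continue  # skip empty names such as "" or "/"
        node = tree
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = None  # leaf: a named block of data
    return tree
```

Applied to the four names above, this gives one `mld` subtree with a `multilevelgraph` node holding the three arrays, plus a top-level `OSRM_VERSION` leaf.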

Memory

In conjunction with the new on-disk format we can now make osrm-datastore a lot more "stupid" as it only needs to care about the following things:

  1. Discovering data in files
  2. Building an index where to find the data in files
  3. Allocating enough in-memory storage (potentially using multiple memory blocks)
  4. Building an index to find the data in memory
  5. Reading the data to memory

The last three steps would be optional if we go for an mmap based approach in the future.
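The five steps can be sketched roughly as follows. This is not how osrm-datastore is implemented; it is an illustration that uses Python's `tarfile` module and a single `bytearray` in place of shared memory, with hypothetical function names, and it ignores alignment and multiple memory blocks.

```python
import tarfile


def discover(paths):
    """Steps 1 and 2: find named blocks and record where each lives on disk."""
    file_index = {}
    for path in paths:
        with tarfile.open(path, mode="r:") as archive:
            for member in archive.getmembers():
                if member.isfile():
                    file_index[member.name] = (path, member.offset_data, member.size)
    return file_index


def load(file_index):
    """Steps 3 to 5: allocate one buffer, index it, and copy the data in."""
    memory = bytearray(sum(size for _, _, size in file_index.values()))
    memory_index = {}
    cursor = 0
    for name, (path, offset, size) in file_index.items():
        memory_index[name] = (cursor, size)
        with open(path, "rb") as handle:
            handle.seek(offset)
            memory[cursor:cursor + size] = handle.read(size)
        cursor += size
    return memory, memory_index
```

Nothing in `load` depends on what a block contains, only on its name, offset and size, which is the point of making the datastore "stupid". With an mmap-based approach, `file_index` alone would be enough.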

Data organization

Hierarchical naming makes it very easy to load multiple datasets/profiles with osrm-datastore by putting each one in its own namespace:

osrm-datastore bike_data.osrm --name=bike
osrm-datastore walk_data.osrm --name=walk

This would create data in the namespaces:

/bike/ch/*
/bike/*
/walk/ch/*
/walk/*
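Namespacing then amounts to prefixing every block name with the dataset name. A minimal sketch, where `add_namespace` and the block names in the usage note are made up for illustration:

```python
def add_namespace(namespace, block_names):
    """Map each namespaced name back to the block name it came from."""
    prefix = "/" + namespace.strip("/")
    return {prefix + "/" + name.lstrip("/"): name for name in block_names}
```

For example, `add_namespace("bike", ["ch/graph"])` yields the key `/bike/ch/graph`, so the same block from the bike and walk datasets never collides.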

How this data is split up between shared memory segments could be determined inside osrm-datastore by simple rules, e.g. all files matching */metric/routability/* get their own shared memory segment.
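Such rules could be as simple as an ordered list of glob patterns. The sketch below uses Python's `fnmatch` module; the rule table, the segment labels and the block names in the usage note are assumptions made up for illustration, and it reads "own segment" as one segment per rule.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: the first matching pattern decides the segment.
RULES = [
    ("*/metric/routability/*", "metric"),
    ("*", "default"),
]


def assign_segments(names, rules=RULES):
    """Group block names by the segment of the first rule they match."""
    segments = {}
    for name in names:
        for pattern, segment in rules:
            if fnmatchcase(name, pattern):
                segments.setdefault(segment, []).append(name)
                break
    return segments
```

With these rules, `/bike/metric/routability/weights` lands in the `metric` segment, while `/bike/ch/graph` falls through to `default`.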

Impact

Moving forward with this refactor will make a range of issues much easier to implement:

#4007
#10
#2242
#4873

/cc @oxidase @danpat

@TheMarex
Member Author

TheMarex commented Apr 6, 2018

This shipped.

@TheMarex TheMarex closed this as completed Apr 6, 2018