Add "priority" concept to v2 file format #2863

westonpace · 2024-09-11T16:22:31Z

Currently the v2 scheduler and decoder rely on calculating a priority for each I/O request. It is important that the scheduler and the decoder agree on this priority, because that agreement is what allows us to implement backpressure (lower-priority I/O is blocked if too much high-priority I/O is in progress).

The algorithm we use for tabular data is "the lower the top-level row, the higher the priority". This greedy algorithm is pretty much optimal and works well. However, the priority is rather difficult to calculate. For example, if we are fetching a page of items that belongs to a List<List> column, we might be fetching the 50,000th item in the column while still being on the 10th top-level row (if the lists are large). As a result, we need to do a lot of (potentially costly) bookkeeping to keep track of what the priority is.
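To make the bookkeeping concrete, here is a minimal sketch of mapping a leaf item index back to its top-level row for a doubly nested list column. It assumes Arrow-style offset arrays; the function name `top_level_row` and the sample data are hypothetical and are not taken from the Lance code base.

```python
from bisect import bisect_right

def top_level_row(item_index, inner_offsets, outer_offsets):
    """Return the top-level row that contains the given leaf item.

    inner_offsets[i] is the index of the first item of inner list i.
    outer_offsets[r] is the index of the first inner list of row r.
    Both arrays end with a final entry equal to the total count.
    """
    # Step 1: find which inner list holds the item.
    inner_list = bisect_right(inner_offsets, item_index) - 1
    # Step 2: find which top-level row holds that inner list.
    return bisect_right(outer_offsets, inner_list) - 1

# Hypothetical column: 3 top-level rows, each with 2 inner lists of 10,000 items.
inner_offsets = [i * 10_000 for i in range(7)]  # 0, 10000, ..., 60000
outer_offsets = [0, 2, 4, 6]

# Item 50,000 is deep into the column but only belongs to top-level row 2.
row = top_level_row(50_000, inner_offsets, outer_offsets)
```

Every extra level of nesting adds another offsets lookup, and a real reader would have to do this per page while only part of the offsets are in memory, which is where the cost comes from.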

A simpler approach would be to write down the priority associated with each page when we are writing the file. Then the scheduler/decoder won't need to do any bookkeeping and can schedule / decode pages in priority order much more easily.
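As a rough sketch of what this could look like on the read side, assume each page's metadata carries a `priority` integer written at file-creation time, where a lower value means more urgent. The `PageRequest` type, its field names, and `schedule_order` are hypothetical illustrations, not the actual Lance format or API.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class PageRequest:
    # Only the stored priority takes part in ordering (lower = sooner).
    priority: int
    column: str = field(compare=False)
    page_index: int = field(compare=False)

def schedule_order(pages):
    """Return pages in the order a scheduler would issue them."""
    heap = list(pages)
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap))
    return order

# Hypothetical pages from two columns; priorities were read from the file,
# so no offsets need to be inspected to rank them.
pages = [
    PageRequest(0, "a", 0),
    PageRequest(1000, "a", 1),
    PageRequest(0, "b", 0),
    PageRequest(10, "b", 1),
    PageRequest(500, "b", 2),
]
order = schedule_order(pages)
priorities = [p.priority for p in order]
```

Because the decoder reads the same stored values, it would rank pages identically to the scheduler without either side recomputing anything.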

Minor note: I say "priority" and not "top-level row" because this is a file format concept and I want to leave the door open for non-tabular data (e.g. columns that don't have the same length), but it's just a name.

westonpace mentioned this issue Sep 11, 2024

Lance File Format v2.1 #2856

Open

7 tasks
