
services/horizon/ingest/filtering: Backfill low level design #4267

Open
sreuland commented Mar 9, 2022

What problem does your feature solve?

Design proposal for backfill behavior of filters.

What would you like to see?

A design that defines user experience with filter rule changes and backfill behaviors in horizon.
Low level design writeup for filtered data range behavior.

Acceptance Criteria:

  • Low level design that provides dev path for supporting:
    • As a Horizon user, I want to configure the range of ingested historical data within which my filter rules will be effective (aka the 'backfill' timeframe) by using the existing --history-retention-count flag.
    • As a Horizon user, I want any transactions related to entities that are recognized as allowed by filter rules to be included in the 'backfill' timeframe.
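To make the first criterion concrete, here is a minimal sketch of how the existing --history-retention-count value could define the backfill ledger range. The function and its conventions (backfilling from ledger 2 when retention is unlimited) are illustrative assumptions, not Horizon's actual API.

```go
// Hypothetical sketch: deriving the 'backfill' ledger range from the
// existing --history-retention-count flag. Names are illustrative.
package main

import "fmt"

// backfillRange returns the oldest and newest ledger sequences that filter
// rules would apply to, given the latest ingested ledger and the retention
// count (0 means retain everything, so backfill from ledger 2, the first
// ledger Horizon ingests).
func backfillRange(latestLedger, retentionCount uint32) (from, to uint32) {
	if retentionCount == 0 || retentionCount >= latestLedger {
		return 2, latestLedger
	}
	return latestLedger - retentionCount + 1, latestLedger
}

func main() {
	from, to := backfillRange(40_000_000, 100_000)
	fmt.Println(from, to) // 39900001 40000000
}
```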

Pre-requisites:

@sreuland sreuland self-assigned this Mar 9, 2022
@2opremio

As a Horizon user, I want to configure the range of ingested historical data within which my filter rules will be effective (aka the 'backfill' timeframe).

Is this really necessary?

Can't we piggyback on the retention count cmdline flag?

2opremio commented Mar 10, 2022

Here's an idea of how we can implement backfilling:

  1. We create indices for the filtered-out data to backfill:

    history_ledger_participants (similar to the already existing history_operation_participants), indicating in what ledgers an account participated, and history_ledger_assets, doing the same for assets.

    Hopefully these tables won't occupy much space (less than history_operation_participants) since they are kept at the ledger level. But we should check experimentally. If they do, we could refine this further to only include information about the entities (accounts/assets) that were filtered out.

  2. We create a table to keep backfilling state (say backfill_queue) to indicate what accounts/ledgers need to be backfilled based on changes to the filter configuration. We need this to make sure backfilling works across restarts.

  3. We modify the ingestion system to walk over the backfill_queue and the history_ledger_* to perform the backfilling.

(1) could be replaced by a txmeta archive + index once we have it.

Note that (1) needs to happen even when filtering isn't enabled, since, otherwise, backfilling won't work if filtering is enabled later on.
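The three steps above can be sketched as a small in-memory walk; Go maps stand in for the proposed history_ledger_participants and backfill_queue tables. All names are illustrative assumptions, not Horizon's actual schema.

```go
// In-memory sketch of the proposed backfill walk. Real tables would be
// queried via SQL; maps stand in for them here.
package main

import "fmt"

// ledgersToBackfill walks the queue of newly-allowed accounts (steps 2/3)
// and the per-ledger participation index (step 1), returning every ledger
// that needs re-ingestion.
func ledgersToBackfill(queue []string, ledgersByAccount map[string][]uint32) []uint32 {
	var out []uint32
	for _, account := range queue {
		out = append(out, ledgersByAccount[account]...)
	}
	return out
}

func main() {
	// history_ledger_participants: account -> ledgers it participated in.
	index := map[string][]uint32{"GABC": {100, 250, 300}}
	// backfill_queue: accounts newly allowed by filter rules, persisted so
	// the work survives Horizon restarts.
	queue := []string{"GABC"}
	for _, seq := range ledgersToBackfill(queue, index) {
		fmt.Println("reingesting ledger", seq)
	}
}
```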

@2opremio

@bartekn thoughts?

sreuland commented Mar 10, 2022

Can't we piggyback on the retention count cmdline flag?

yes, definitely, updated description to be clear about that usage.

@sreuland

@2opremio , @bartekn , wondering if we could remove the backfill_queue aspect just to explore simplicity: the new backfill process invokes the ingestion engine regularly on scheduled intervals, triggering the latest filter rules to be executed by processors over a range based on --history-retention-count. The end result would be idempotent, as ingestion processors just upsert to history, correct?

the backfill_queue increases in complexity with the upcoming transitive rules, like 'all trustlines from account x' or 'sponsored accounts of account x', where the filters would need to compute the effective list dynamically and then convey it into the queue. If we instead just let the ingest processors run the filters, the effective scope of the filters is exercised by default through the existing ingest->processor->filter.FilterTransaction() path.
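A minimal sketch of this stateless alternative: a scheduled loop that periodically re-runs ingestion (and with it the current filter rules) over the retention window. Idempotency comes from processors upserting into history tables, so repeated runs converge on the same state. All names are illustrative, not Horizon's actual API.

```go
// Hypothetical stateless backfill loop: no backfill_queue, just periodic
// re-ingestion of the retention window with the latest filter rules.
package main

import (
	"fmt"
	"time"
)

// reingestWindow stands in for invoking the ingestion engine over the
// retention window; filters run via the normal
// ingest->processor->filter.FilterTransaction() path.
func reingestWindow(latest, retention uint32) (from, to uint32) {
	from, to = latest-retention+1, latest
	fmt.Printf("reingesting ledgers %d..%d with current filter rules\n", from, to)
	return from, to
}

func main() {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	// A single scheduled round, for illustration; a real process would
	// loop until shutdown.
	<-ticker.C
	reingestWindow(40_000_000, 100_000)
}
```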

2opremio commented Mar 10, 2022

The end result would be idempotent, as ingestion processors just upsert to history, correct?

Yes it would but I am not sure we can afford that (performance-wise). I also think that a stateless design is cleaner.

BTW, we may also need to think about garbage collection (what if a filter is updated to not include certain accounts anymore, should we care about the data left behind?)

In this case I think you do need state; otherwise you won't know what to garbage-collect if Horizon is restarted.

@2opremio

Note that (1) needs to happen even when filtering isn't enabled, since, otherwise, backfilling won't work if filtering is enabled later on.

Actually, it doesn't need to happen when filtering is disabled, because then everything is ingested anyway and thus there is nothing to backfill.

@sreuland

BTW, we may also need to think about garbage collection (what if a filter is updated to not include certain accounts anymore, should we care about the data left behind?)

@jcx120 , I think we discussed this 'stale' or 'garbage' state as part of the filter lifecycle earlier, but I can't find the reference. Can you confirm the product expectation for when a filter rule is changed to reduce its scope? Any already-ingested history data that no longer fits within the filter rule scope becomes 'stale' but is still left in the history db. Is that acceptable, or does the 'stale' data need to be purged?

jcx120 commented Mar 10, 2022

@sreuland Purging of 'garbage' data (left behind due to a change in filter rules) can, for this current phase, be treated as a low priority / nice-to-have (something we can come back to after all top-priority features are implemented). The assumption is that for the majority of users who apply filters, the dataset will be sufficiently small that doing a clean re-ingest from scratch would be fast and cheap.

bartekn commented Mar 15, 2022

history_ledger_participants (similar to the already existing history_operation_participants), indicating in what ledgers an account participated, and history_ledger_assets

This will be a very similar structure to the indexes we've been working on with @paulbellamy: they determine in which checkpoints a given account was active (also wrt payment/all operations and successful/all txs). So this is less granular than what you propose (ledger vs checkpoint), but from the Captive Core perspective, if you want to ingest a specific ledger you need to catch up to the last checkpoint before it and then apply ledgers, so it should take the same amount of time to get to the ledger.

So when (or if ever) these indexes are ready in production (they are built but not benchmarked/tested/checked for correctness), you won't need extra tables and the process will be simpler.
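For reference, the checkpoint arithmetic behind this comment can be sketched as follows: Stellar history-archive checkpoints are published every 64 ledgers, at sequences where (seq+1) % 64 == 0 (63, 127, 191, ...), and Captive Core catches up to the last checkpoint at or before a target ledger before replaying forward. The helper function name is ours, not a Horizon API.

```go
// Sketch of locating the checkpoint Captive Core would catch up to before
// replaying toward a target ledger.
package main

import "fmt"

// lastCheckpointBefore returns the newest checkpoint ledger <= seq,
// or 0 if seq precedes the first checkpoint (ledger 63).
func lastCheckpointBefore(seq uint32) uint32 {
	if seq < 63 {
		return 0
	}
	return (seq+1)/64*64 - 1
}

func main() {
	fmt.Println(lastCheckpointBefore(100)) // 63
	fmt.Println(lastCheckpointBefore(127)) // 127
	// Ledgers 64..126 all resolve to checkpoint 63, so reaching any of
	// them means replaying up to 63 ledgers past the checkpoint.
	fmt.Println(lastCheckpointBefore(126)) // 63
}
```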

2opremio commented Mar 16, 2022

So this is less granular than what you propose (ledger vs checkpoint), but from the Captive Core perspective, if you want to ingest a specific ledger you need to catch up to the last checkpoint before it and then apply ledgers, so it should take the same amount of time to get to the ledger.

This is good, in the sense that we can share the solution.

However, I also think this may be a showstopper for the approach I proposed, since it may make backfilling too slow (due to the slowness of Captive Core jumping around).

So, we may need to wait for the implementation of a txmeta archive before backfilling can be sufficiently performant.
