
services/horizon/ingest/filtering: Backfill low level design #4267

Open
sreuland commented Mar 9, 2022

What problem does your feature solve?

Design proposal for backfill behavior of filters.

What would you like to see?

A design that defines user experience with filter rule changes and backfill behaviors in horizon.
Low level design writeup for filtered data range behavior.

Acceptance Criteria:

  • Low level design that provides dev path for supporting:
    • As a Horizon user, I want to configure the range of ingested historical data within which my filter rules will be effective (aka the 'backfill' timeframe) by using the existing --history-retention-count flag.
    • As a Horizon user, I want any transactions related to entities that are recognized as allowed by filter rules to be included in the 'backfill' timeframe.
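To make the first criterion concrete, here is a minimal sketch of how the existing --history-retention-count value could define the backfill ledger range. The function and its conventions (backfilling from ledger 2 when retention is unlimited) are illustrative assumptions, not Horizon's actual API.

```go
// Hypothetical sketch: deriving the 'backfill' ledger range from the
// existing --history-retention-count flag. Names are illustrative.
package main

import "fmt"

// backfillRange returns the oldest and newest ledger sequences that filter
// rules would apply to, given the latest ingested ledger and the retention
// count (0 means retain everything, so backfill from ledger 2, the first
// ledger Horizon ingests).
func backfillRange(latestLedger, retentionCount uint32) (from, to uint32) {
	if retentionCount == 0 || retentionCount >= latestLedger {
		return 2, latestLedger
	}
	return latestLedger - retentionCount + 1, latestLedger
}

func main() {
	from, to := backfillRange(40_000_000, 100_000)
	fmt.Println(from, to) // 39900001 40000000
}
```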

Pre-requisites:

@sreuland sreuland self-assigned this Mar 9, 2022
@2opremio

As a Horizon user, I want to configure the range of ingested historical data within which my filter rules will be effective (aka the 'backfill' timeframe).

Is this really necessary?

Can't we piggyback on the retention count cmdline flag?

2opremio commented Mar 10, 2022

Here's an idea of how we can implement backfilling:

  1. We create indices for the filtered-out data to backfill:

    history_ledger_participants (similar to the already existing history_operation_participants), indicating in what ledgers an account participated, and history_ledger_assets, doing the same for assets.

    Hopefully these tables won't occupy much space (less than history_operation_participants) since they are kept at the ledger level. But we should check experimentally. If they do, we could refine this further to only include information about the entities (accounts/assets) that were filtered out.

  2. We create a table to keep backfilling state (say backfill_queue) to indicate what accounts/ledgers need to be backfilled based on changes to the filter configuration. We need this to make sure backfilling works across restarts.

  3. We modify the ingestion system to walk over the backfill_queue and the history_ledger_* to perform the backfilling.

(1) could be replaced by a txmeta archive + index once we have it.

Note that (1) needs to happen even when filtering isn't enabled, since, otherwise, backfilling won't work if filtering is enabled later on.
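The three steps above can be sketched as a small in-memory walk; Go maps stand in for the proposed history_ledger_participants and backfill_queue tables. All names are illustrative assumptions, not Horizon's actual schema.

```go
// In-memory sketch of the proposed backfill walk. Real tables would be
// queried via SQL; maps stand in for them here.
package main

import "fmt"

// ledgersToBackfill walks the queue of newly-allowed accounts (steps 2/3)
// and the per-ledger participation index (step 1), returning every ledger
// that needs re-ingestion.
func ledgersToBackfill(queue []string, ledgersByAccount map[string][]uint32) []uint32 {
	var out []uint32
	for _, account := range queue {
		out = append(out, ledgersByAccount[account]...)
	}
	return out
}

func main() {
	// history_ledger_participants: account -> ledgers it participated in.
	index := map[string][]uint32{"GABC": {100, 250, 300}}
	// backfill_queue: accounts newly allowed by filter rules, persisted so
	// the work survives Horizon restarts.
	queue := []string{"GABC"}
	for _, seq := range ledgersToBackfill(queue, index) {
		fmt.Println("reingesting ledger", seq)
	}
}
```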

@2opremio

@bartekn thoughts?

sreuland commented Mar 10, 2022

Can't we piggyback on the retention count cmdline flag?

yes, definitely, updated description to be clear about that usage.

@sreuland

@2opremio , @bartekn , wondering if we could remove the backfill_queue aspect just to explore simplicity: the new backfill process invokes the ingestion engine regularly on scheduled intervals, triggering the latest filter rules to be executed by processors over a range based on --history-retention-count. The end result would be idempotent, as ingestion processors just upsert to history, correct?

the backfill_queue increases in complexity with the upcoming transitive rules, like 'all trustlines from account x' or 'sponsored accounts of account x', where the filters would need to compute the effective list dynamically and then convey it into the queue. If we instead just let the ingest processors run the filters, the effective scope of the filters is exercised by default through the existing ingest->processor->filter.FilterTransaction() path.
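A minimal sketch of this stateless alternative: a scheduled loop that periodically re-runs ingestion (and with it the current filter rules) over the retention window. Idempotency comes from processors upserting into history tables, so repeated runs converge on the same state. All names are illustrative, not Horizon's actual API.

```go
// Hypothetical stateless backfill loop: no backfill_queue, just periodic
// re-ingestion of the retention window with the latest filter rules.
package main

import (
	"fmt"
	"time"
)

// reingestWindow stands in for invoking the ingestion engine over the
// retention window; filters run via the normal
// ingest->processor->filter.FilterTransaction() path.
func reingestWindow(latest, retention uint32) (from, to uint32) {
	from, to = latest-retention+1, latest
	fmt.Printf("reingesting ledgers %d..%d with current filter rules\n", from, to)
	return from, to
}

func main() {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	// A single scheduled round, for illustration; a real process would
	// loop until shutdown.
	<-ticker.C
	reingestWindow(40_000_000, 100_000)
}
```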

2opremio commented Mar 10, 2022

The end result would be idempotent, as ingestion processors just upsert to history, correct?

Yes it would but I am not sure we can afford that (performance-wise). I also think that a stateless design is cleaner.

BTW, we may also need to think about garbage collection (what if a filter is updated to not include certain accounts anymore, should we care about the data left behind?)

In this case I think you do need state; otherwise you won't know what to garbage-collect if Horizon is restarted.

@2opremio

Note that (1) needs to happen even when filtering isn't enabled, since, otherwise, backfilling won't work if filtering is enabled later on.

Actually, it doesn't need to happen when filtering is disabled, because then everything is ingested anyway and thus there is nothing to backfill.

@sreuland

BTW, we may also need to think about garbage collection (what if a filter is updated to not include certain accounts anymore, should we care about the data left behind?)

@jcx120 , I think we discussed this 'stale' or 'garbage' state as part of the filter lifecycle earlier, but I can't find the reference. Can you confirm the product expectation for when a filter rule is changed to reduce its scope? Any already-ingested history data that no longer fits within the filter rule scope becomes 'stale' but is still left in the history db. Is that acceptable, or does the 'stale' data need to be purged?

jcx120 commented Mar 10, 2022

@sreuland Purging of 'garbage' data (left behind due to a change in filter rules) can, for this current phase, be treated as a low priority / nice-to-have (something we can come back to after all top-priority features are implemented). The assumption is that for the majority of users who apply filters, the dataset will be sufficiently small that doing a clean re-ingest from scratch would be fast and cheap.

bartekn commented Mar 15, 2022

history_ledger_participants (similar to the already existing history_operation_participants), indicating in what ledgers an account participated, and history_ledger_assets

This will be a very similar structure to the indexes we've been working on with @paulbellamy: they determine in which checkpoints a given account was active (also wrt payment/all operations and successful/all txs). So this is less granular than what you propose (ledger vs checkpoint), but from the Captive Core perspective, if you want to ingest a specific ledger you need to catch up to the last checkpoint before it and then apply ledgers, so it should take the same amount of time to get to the ledger.

So when (or if ever) these indexes are ready in production (they are built but not benchmarked/tested/checked for correctness), you won't need extra tables and the process will be simpler.
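For reference, the checkpoint arithmetic behind this comment can be sketched as follows: Stellar history-archive checkpoints are published every 64 ledgers, at sequences where (seq+1) % 64 == 0 (63, 127, 191, ...), and Captive Core catches up to the last checkpoint at or before a target ledger before replaying forward. The helper function name is ours, not a Horizon API.

```go
// Sketch of locating the checkpoint Captive Core would catch up to before
// replaying toward a target ledger.
package main

import "fmt"

// lastCheckpointBefore returns the newest checkpoint ledger <= seq,
// or 0 if seq precedes the first checkpoint (ledger 63).
func lastCheckpointBefore(seq uint32) uint32 {
	if seq < 63 {
		return 0
	}
	return (seq+1)/64*64 - 1
}

func main() {
	fmt.Println(lastCheckpointBefore(100)) // 63
	fmt.Println(lastCheckpointBefore(127)) // 127
	// Ledgers 64..126 all resolve to checkpoint 63, so reaching any of
	// them means replaying up to 63 ledgers past the checkpoint.
	fmt.Println(lastCheckpointBefore(126)) // 63
}
```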

2opremio commented Mar 16, 2022

So this is less granular than what you propose (ledger vs checkpoint), but from the Captive Core perspective, if you want to ingest a specific ledger you need to catch up to the last checkpoint before it and then apply ledgers, so it should take the same amount of time to get to the ledger.

This is good, in the sense that we can share the solution.

However, I also think this may be a showstopper for the approach I proposed, since it may make backfilling too slow (due to the slowness of Captive Core jumping around).

So, we may need to wait for the implementation of a txmeta archive before backfilling can be sufficiently performant.
