Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove index counter functionality #2101

Merged
merged 2 commits into from
Jul 13, 2023
Merged

Remove index counter functionality #2101

merged 2 commits into from
Jul 13, 2023

Conversation

gammazero
Copy link
Collaborator

This counter is very expensive to maintain and ends up offering little value.

  • Remove index logic
  • Cleanup datastore to remove index count records

This counter is very expensive to maintain and ends up offering little value.

- Remove index logic
- Cleanup datastore to remove index count records
@codecov-commenter
Copy link

codecov-commenter commented Jul 8, 2023

Codecov Report

Patch coverage: 32.14% and project coverage change: -0.57 ⚠️

Comparison is base (fd9d5c0) 47.95% compared to head (4e57259) 47.38%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2101      +/-   ##
==========================================
- Coverage   47.95%   47.38%   -0.57%     
==========================================
  Files          94       92       -2     
  Lines       10306    10003     -303     
==========================================
- Hits         4942     4740     -202     
+ Misses       4790     4719      -71     
+ Partials      574      544      -30     
Impacted Files Coverage Δ
command/daemon.go 0.00% <0.00%> (ø)
command/update_datastore.go 0.00% <0.00%> (ø)
config/indexer.go 47.50% <ø> (ø)
internal/ingest/linksystem.go 67.43% <ø> (-0.44%) ⬇️
internal/metrics/server.go 68.42% <ø> (-0.81%) ⬇️
server/find/options.go 23.68% <ø> (-11.20%) ⬇️
internal/ingest/ingest.go 72.45% <100.00%> (+1.02%) ⬆️
internal/registry/apiconv.go 84.09% <100.00%> (-0.12%) ⬇️
server/find/handler/handler.go 90.44% <100.00%> (+3.10%) ⬆️
server/find/server.go 62.36% <100.00%> (ø)
... and 1 more

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@willscott
Copy link
Member

Does this remove the ability to know overall size of the index?
If so we should consider if that remains a thing product want to have a clear sense of and if there's a cheaper way to get a decent approximation of index size before removing it.

@gammazero
Copy link
Collaborator Author

Does this remove the ability to know overall size of the index?

It removes the ability to know the number of mappings multihash=>(providerID+contextID) that have been stored by particular indexer. So it does not give an indication of the overall size of the index, unless all indexers are taken into consideration and there is no overlap in what any indexer has stored.

It seems like this index count is a useful metric, but as far as I can tell it has only been a source of questions about what it means and has never provided an answer to anything that anyone wanted to know. It seems that what we really want is the number of unique multihashes indexed, and how many of those are retrievable. ... which this is not.

@gammazero gammazero requested a review from masih July 10, 2023 08:05
@masih
Copy link
Member

masih commented Jul 10, 2023

I am curious to hear @TorfinnOlsen 's take on this.

Here is how it occurs for me:
I think this is an important metric that we want to provide going forward. Based on previous conversations, it is perhaps not as important as the unique CID count but it is useful nevertheless. Where the unique CID count tells us about the "number of destinations", the index count tells us "how many ways there are to get to them". From product perspective, the number of ways to get to CID would help reason about content distribution across providers. So overall I think we want to retain the ability to measure it, but not at the cost of hindering ingest rate.

The work is already done to disable the current measurement mechanism by default. In terms of code removal, we probably wan to retain the hooks that are already in place to increment counters, since whatever the implementation the hooks in code will be needed.

In terms of improving efficiency, the counter should ideally limit the impact per provider and avoid global locks. One can think of other more elaborate designs, e.g. roll up sums, to further reduce the footprint of synchronization when incrementing counters. But right now this does not seem to be a priority.

@TorfinnOlsen
Copy link
Collaborator

TorfinnOlsen commented Jul 10, 2023

I think the index count is valuable to know, but I think the problem with our present index count is that it counts everything and we don't have an easy way to isolate non duplicated or inactive CIDs.

The way I see it the 'count' is currently made up of a strata of value pairs:

  1. CIDs that active storage providers who are still providing, have advertised to us, and are retrievable at the CID:ProviderID advertised. Can be retrieved, from the location implied by this key value pair.

  2. CIDs active storage providers are no longer providing. ProviderID has not changed in anyway but the advertised CID:ProviderID pair is no longer relevant. CID:ProviderID remains advertised via IPNI. Cannot be retrieved.

  3. CIDs advertised to IPNI by inactive ProviderIDs. ProviderID meets our inactivity criteria(2 weeks?). Nothing can be retrieved from this ProviderID all CID's associated with it are dead value pairs.

  4. Duplications of the above incurred by kicking off ingestion(on cid.contact).

I think the count function would be much more valuable if we could project this stratification in the count. Presently we have a bar chart which simply states the count every week. There's no nuance however, into what the makeup of that ingestion is so when we have big spikes in ingestion we know it's because we are building up a new index, but to folks outside the team they perceive that as growth of the index which isn't really accurate. The count has grown but the index itself has not.

Right now the index count has surpassed 2 trillion key value pairs counted, but we know that the number of unique CID's is in the ~200 billion range I believe. At least having a consistent unique count to contrast the ingestion count would help people understand that the count is not really representative of network growth. I believe that's how it's widely perceived now in spite of our explanations that this is the case people naturally bias towards understanding our count in this way.

Summary: The count is valuable, but is to vague in it's current implementation. The count is not high enough resolution and creates a potential for misunderstanding outside of our team. Unique CID counts are also a valuable representation of how many CID's are on the network, but what people really want to know is "how big the index is" which means to me the number of unique CID:ProviderID's which are associated with active providers.

CID's which no longer exist, duplicates, and CID's associated with inactive providers are basically historical records of what has happened. As far as I can tell the only utility they serve is possibly reconstructing prior states of the network. While this may be an important function, including it in the counts we share creates a false impression of how big the "index" is.

@gammazero
Copy link
Collaborator Author

gammazero commented Jul 10, 2023

We should at least remove this counter from the information returned from the /providers API, and instead make it either a metric or make it separately queryable.

Even though the new provider cache eliminates performance issues around accessing these counters, I want this out of the provider information because this counter is confusing and has little value without other measurements that give it context. These counters should be presented in context with other index metrics, so the /providers information is not the place for these. @TorfinnOlsen @masih WDYT?

@TorfinnOlsen
Copy link
Collaborator

@gammazero I can't think of any reason why that would be a loss to us. I assume our grafana panel here: https://protocollabs.grafana.net/d/pPROWOD7z/providers?orgId=1&refresh=5s&from=1688446116224&to=1689050916224&viewPanel=23

will be broken but we can change up our reporting for the time being.

I think if we manage to sort out the distance of providers, and sometime soon thereafter the unique cid count this puts us in a position of higher visibility than where we are at present.

I'm curious this is probably a somewhat ignorant question, but considering the index is a key value store can't we just query it for unique CID's? I know this is greatly simplifying the problem but someone asked me this question and I found myself without a good answer.

@masih
Copy link
Member

masih commented Jul 11, 2023

can't we just query it for unique CID's

No. The scale at which the key value store operates can't simply return the number of unique keys.

@gammazero gammazero merged commit e46f6e6 into main Jul 13, 2023
@gammazero gammazero deleted the remove-index-counters branch July 13, 2023 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants