Remove index counter functionality #2101

gammazero · 2023-07-08T17:41:46Z

This counter is very expensive to maintain and ends up offering little value.

Remove index logic
Cleanup datastore to remove index count records

This counter is very expensive to maintain and ends up offering little value. - Remove index logic - Cleanup datastore to remove index count records

codecov-commenter · 2023-07-08T17:45:25Z

Codecov Report

Patch coverage: 32.14% and project coverage change: -0.57 ⚠️

Comparison is base (fd9d5c0) 47.95% compared to head (4e57259) 47.38%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2101      +/-   ##
==========================================
- Coverage   47.95%   47.38%   -0.57%     
==========================================
  Files          94       92       -2     
  Lines       10306    10003     -303     
==========================================
- Hits         4942     4740     -202     
+ Misses       4790     4719      -71     
+ Partials      574      544      -30

Impacted Files	Coverage Δ
command/daemon.go	`0.00% <0.00%> (ø)`
command/update_datastore.go	`0.00% <0.00%> (ø)`
config/indexer.go	`47.50% <ø> (ø)`
internal/ingest/linksystem.go	`67.43% <ø> (-0.44%)`	⬇️
internal/metrics/server.go	`68.42% <ø> (-0.81%)`	⬇️
server/find/options.go	`23.68% <ø> (-11.20%)`	⬇️
internal/ingest/ingest.go	`72.45% <100.00%> (+1.02%)`	⬆️
internal/registry/apiconv.go	`84.09% <100.00%> (-0.12%)`	⬇️
server/find/handler/handler.go	`90.44% <100.00%> (+3.10%)`	⬆️
server/find/server.go	`62.36% <100.00%> (ø)`
... and 1 more

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

willscott · 2023-07-08T18:12:57Z

Does this remove the ability to know overall size of the index?
If so we should consider if that remains a thing product want to have a clear sense of and if there's a cheaper way to get a decent approximation of index size before removing it.

gammazero · 2023-07-09T19:11:29Z

Does this remove the ability to know overall size of the index?

It removes the ability to know the number of mappings multihash=>(providerID+contextID) that have been stored by particular indexer. So it does not give an indication of the overall size of the index, unless all indexers are taken into consideration and there is no overlap in what any indexer has stored.

It seems like this index count is a useful metric, but as far as I can tell it has only been a source of questions about what it means and has never provided an answer to anything that anyone wanted to know. It seems that what we really want is the number of unique multihashes indexed, and how many of those are retrievable. ... which this is not.

masih · 2023-07-10T08:44:19Z

I am curious to hear @TorfinnOlsen 's take on this.

Here is how it occurs for me:
I think this is an important metric that we want to provide going forward. Based on previous conversations, it is perhaps not as important as the unique CID count but it is useful nevertheless. Where the unique CID count tells us about the "number of destinations", the index count tells us "how many ways there are to get to them". From product perspective, the number of ways to get to CID would help reason about content distribution across providers. So overall I think we want to retain the ability to measure it, but not at the cost of hindering ingest rate.

The work is already done to disable the current measurement mechanism by default. In terms of code removal, we probably wan to retain the hooks that are already in place to increment counters, since whatever the implementation the hooks in code will be needed.

In terms of improving efficiency, the counter should ideally limit the impact per provider and avoid global locks. One can think of other more elaborate designs, e.g. roll up sums, to further reduce the footprint of synchronization when incrementing counters. But right now this does not seem to be a priority.

TorfinnOlsen · 2023-07-10T15:45:58Z

I think the index count is valuable to know, but I think the problem with our present index count is that it counts everything and we don't have an easy way to isolate non duplicated or inactive CIDs.

The way I see it the 'count' is currently made up of a strata of value pairs:

CIDs that active storage providers who are still providing, have advertised to us, and are retrievable at the CID:ProviderID advertised. Can be retrieved, from the location implied by this key value pair.
CIDs active storage providers are no longer providing. ProviderID has not changed in anyway but the advertised CID:ProviderID pair is no longer relevant. CID:ProviderID remains advertised via IPNI. Cannot be retrieved.
CIDs advertised to IPNI by inactive ProviderIDs. ProviderID meets our inactivity criteria(2 weeks?). Nothing can be retrieved from this ProviderID all CID's associated with it are dead value pairs.
Duplications of the above incurred by kicking off ingestion(on cid.contact).

I think the count function would be much more valuable if we could project this stratification in the count. Presently we have a bar chart which simply states the count every week. There's no nuance however, into what the makeup of that ingestion is so when we have big spikes in ingestion we know it's because we are building up a new index, but to folks outside the team they perceive that as growth of the index which isn't really accurate. The count has grown but the index itself has not.

Right now the index count has surpassed 2 trillion key value pairs counted, but we know that the number of unique CID's is in the ~200 billion range I believe. At least having a consistent unique count to contrast the ingestion count would help people understand that the count is not really representative of network growth. I believe that's how it's widely perceived now in spite of our explanations that this is the case people naturally bias towards understanding our count in this way.

Summary: The count is valuable, but is to vague in it's current implementation. The count is not high enough resolution and creates a potential for misunderstanding outside of our team. Unique CID counts are also a valuable representation of how many CID's are on the network, but what people really want to know is "how big the index is" which means to me the number of unique CID:ProviderID's which are associated with active providers.

CID's which no longer exist, duplicates, and CID's associated with inactive providers are basically historical records of what has happened. As far as I can tell the only utility they serve is possibly reconstructing prior states of the network. While this may be an important function, including it in the counts we share creates a false impression of how big the "index" is.

gammazero · 2023-07-10T17:45:12Z

We should at least remove this counter from the information returned from the /providers API, and instead make it either a metric or make it separately queryable.

Even though the new provider cache eliminates performance issues around accessing these counters, I want this out of the provider information because this counter is confusing and has little value without other measurements that give it context. These counters should be presented in context with other index metrics, so the /providers information is not the place for these. @TorfinnOlsen @masih WDYT?

TorfinnOlsen · 2023-07-11T04:51:03Z

@gammazero I can't think of any reason why that would be a loss to us. I assume our grafana panel here: https://protocollabs.grafana.net/d/pPROWOD7z/providers?orgId=1&refresh=5s&from=1688446116224&to=1689050916224&viewPanel=23

will be broken but we can change up our reporting for the time being.

I think if we manage to sort out the distance of providers, and sometime soon thereafter the unique cid count this puts us in a position of higher visibility than where we are at present.

I'm curious this is probably a somewhat ignorant question, but considering the index is a key value store can't we just query it for unique CID's? I know this is greatly simplifying the problem but someone asked me this question and I found myself without a good answer.

masih · 2023-07-11T09:49:00Z

can't we just query it for unique CID's

No. The scale at which the key value store operates can't simply return the number of unique keys.

Remove index counter functionality

4dc73ba

This counter is very expensive to maintain and ends up offering little value. - Remove index logic - Cleanup datastore to remove index count records

rm unused function

4e57259

gammazero requested a review from masih July 10, 2023 08:05

masih approved these changes Jul 11, 2023

View reviewed changes

gammazero merged commit e46f6e6 into main Jul 13, 2023

gammazero deleted the remove-index-counters branch July 13, 2023 17:50

gammazero mentioned this pull request Jul 13, 2023

High CPU usage to the extent where the node becomes unresponsive #2062

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove index counter functionality #2101

Remove index counter functionality #2101

gammazero commented Jul 8, 2023

codecov-commenter commented Jul 8, 2023 •

edited

Loading

willscott commented Jul 8, 2023

gammazero commented Jul 9, 2023

masih commented Jul 10, 2023

TorfinnOlsen commented Jul 10, 2023 •

edited

Loading

gammazero commented Jul 10, 2023 •

edited

Loading

TorfinnOlsen commented Jul 11, 2023

masih commented Jul 11, 2023 •

edited

Loading

Remove index counter functionality #2101

Remove index counter functionality #2101

Conversation

gammazero commented Jul 8, 2023

codecov-commenter commented Jul 8, 2023 • edited Loading

Codecov Report

willscott commented Jul 8, 2023

gammazero commented Jul 9, 2023

masih commented Jul 10, 2023

TorfinnOlsen commented Jul 10, 2023 • edited Loading

gammazero commented Jul 10, 2023 • edited Loading

TorfinnOlsen commented Jul 11, 2023

masih commented Jul 11, 2023 • edited Loading

codecov-commenter commented Jul 8, 2023 •

edited

Loading

TorfinnOlsen commented Jul 10, 2023 •

edited

Loading

gammazero commented Jul 10, 2023 •

edited

Loading

masih commented Jul 11, 2023 •

edited

Loading