Remove index counter functionality #2101
This counter is very expensive to maintain and ends up offering little value.
- Remove index logic
- Clean up datastore to remove index count records
Codecov Report
@@            Coverage Diff             @@
##             main    #2101      +/-   ##
==========================================
- Coverage   47.95%   47.38%   -0.57%
==========================================
  Files          94       92       -2
  Lines       10306    10003     -303
==========================================
- Hits         4942     4740     -202
+ Misses       4790     4719      -71
+ Partials      574      544      -30
Does this remove the ability to know the overall size of the index?
It removes the ability to know the number of multihash=>(providerID+contextID) mappings that have been stored by a particular indexer. So it does not give an indication of the overall size of the index, unless all indexers are taken into consideration and there is no overlap in what any indexer has stored.
It seems like this index count is a useful metric, but as far as I can tell it has only been a source of questions about what it means and has never provided an answer to anything that anyone wanted to know. It seems that what we really want is the number of unique multihashes indexed, and how many of those are retrievable, which this is not.
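To make the distinction above concrete, here is a minimal Go sketch. It is not storetheindex code: the `mapping` type, both helper functions, and the sample data are hypothetical, and it assumes the removed counter roughly tracked distinct mappings per indexer. It shows how summing per-indexer mapping counts overstates the number of unique multihashes when indexers overlap.

```go
package main

import "fmt"

// mapping is a hypothetical stand-in for one stored index entry:
// multihash => (providerID + contextID).
type mapping struct {
	multihash  string
	providerID string
	contextID  string
}

// countMappings returns the number of distinct mappings held by one indexer.
func countMappings(entries []mapping) int {
	seen := make(map[mapping]struct{})
	for _, e := range entries {
		seen[e] = struct{}{}
	}
	return len(seen)
}

// countUniqueMultihashes returns the number of distinct multihashes
// across any number of indexers, ignoring provider and context.
func countUniqueMultihashes(indexers ...[]mapping) int {
	seen := make(map[string]struct{})
	for _, entries := range indexers {
		for _, e := range entries {
			seen[e.multihash] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	indexerA := []mapping{
		{"mh1", "provX", "ctx1"},
		{"mh1", "provY", "ctx1"}, // same multihash, different provider
		{"mh2", "provX", "ctx1"},
	}
	indexerB := []mapping{
		{"mh1", "provX", "ctx1"}, // overlaps with indexerA
		{"mh3", "provZ", "ctx9"},
	}

	sum := countMappings(indexerA) + countMappings(indexerB)
	fmt.Println("sum of per-indexer mapping counts:", sum) // 5
	fmt.Println("unique multihashes:", countUniqueMultihashes(indexerA, indexerB)) // 3
}
```

The in-memory sets only work because the sample is tiny; at real index scale neither number can be computed this way.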
I am curious to hear @TorfinnOlsen's take on this. Here is how it occurs to me: The work is already done to disable the current measurement mechanism by default. In terms of code removal, we probably want to retain the hooks that are already in place to increment counters, since those hooks will be needed whatever the implementation. In terms of improving efficiency, the counter should ideally limit the impact per provider and avoid global locks. One can think of other, more elaborate designs, e.g. roll-up sums, to further reduce the footprint of synchronization when incrementing counters. But right now this does not seem to be a priority.
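One way to read the "per provider, no global lock" idea is sketched below. This is an illustration only, not the design proposed in this PR or anything in storetheindex: the `providerCounters` type and its `Add`, `Get`, and `Total` methods are hypothetical names. Each provider gets its own atomic counter, so increments for different providers never contend on a shared mutex, and the overall total is rolled up only when someone asks for it.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// providerCounters keeps one atomic counter per provider (hypothetical type).
type providerCounters struct {
	counts sync.Map // providerID (string) -> *atomic.Int64
}

// Add is the hook ingestion code would call. delta may be negative
// when mappings are removed.
func (pc *providerCounters) Add(providerID string, delta int64) {
	c, ok := pc.counts.Load(providerID)
	if !ok {
		// First write for this provider; LoadOrStore resolves races
		// between goroutines creating the same counter.
		c, _ = pc.counts.LoadOrStore(providerID, new(atomic.Int64))
	}
	c.(*atomic.Int64).Add(delta)
}

// Get returns the current count for one provider, or 0 if unknown.
func (pc *providerCounters) Get(providerID string) int64 {
	if c, ok := pc.counts.Load(providerID); ok {
		return c.(*atomic.Int64).Load()
	}
	return 0
}

// Total rolls up all per-provider counts on demand.
func (pc *providerCounters) Total() int64 {
	var total int64
	pc.counts.Range(func(_, v any) bool {
		total += v.(*atomic.Int64).Load()
		return true
	})
	return total
}

func main() {
	pc := &providerCounters{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			for n := 0; n < 1000; n++ {
				pc.Add(id, 1)
			}
		}(fmt.Sprintf("prov%d", i%2))
	}
	wg.Wait()
	fmt.Println(pc.Get("prov0"), pc.Get("prov1"), pc.Total()) // 4000 4000 8000
}
```

The sketch keeps counts in memory only. Persisting them to the datastore, which is where much of the real cost lies, is deliberately left out.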
I think the index count is valuable to know, but I think the problem with our present index count is that it counts everything and we don't have an easy way to isolate non-duplicated or inactive CIDs. The way I see it, the 'count' is currently made up of strata of value pairs:
I think the count function would be much more valuable if we could project this stratification in the count. Presently we have a bar chart which simply states the count every week. There's no nuance, however, into what the makeup of that ingestion is. So when we have big spikes in ingestion, we know it's because we are building up a new index, but folks outside the team perceive that as growth of the index, which isn't really accurate. The count has grown but the index itself has not.
Right now the index count has surpassed 2 trillion key-value pairs counted, but we know that the number of unique CIDs is in the ~200 billion range, I believe. At least having a consistent unique count to contrast with the ingestion count would help people understand that the count is not really representative of network growth. I believe that's how it's widely perceived now: in spite of our explanations, people naturally bias towards understanding our count in this way.
Summary: The count is valuable, but is too vague in its current implementation. The count is not high enough resolution and creates a potential for misunderstanding outside of our team. Unique CID counts are also a valuable representation of how many CIDs are on the network, but what people really want to know is "how big the index is", which to me means the number of unique CID:ProviderID pairs which are associated with active providers. CIDs which no longer exist, duplicates, and CIDs associated with inactive providers are basically historical records of what has happened. As far as I can tell, the only utility they serve is possibly reconstructing prior states of the network. While this may be an important function, including it in the counts we share creates a false impression of how big the "index" is.
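The stratified view described above could look roughly like the following Go sketch. Everything in it is an assumption made for illustration: the `entry` and `breakdown` types, the `stratify` function, and the `active` provider set are hypothetical, and CIDs that no longer exist are not modeled. It reports three numbers side by side: everything ingested, unique CIDs, and unique CID:ProviderID pairs that belong to active providers.

```go
package main

import "fmt"

// entry is a hypothetical ingested record: one CID advertised by one provider.
type entry struct {
	cid        string
	providerID string
}

// breakdown separates the raw ingestion count from the narrower
// counts discussed in the comment above.
type breakdown struct {
	Ingested    int // every record counted, duplicates included
	UniqueCIDs  int // distinct CIDs regardless of provider
	ActivePairs int // distinct CID:ProviderID pairs with an active provider
}

// stratify computes the three counts in one pass over the records.
func stratify(entries []entry, active map[string]bool) breakdown {
	cids := make(map[string]struct{})
	pairs := make(map[entry]struct{})
	for _, e := range entries {
		cids[e.cid] = struct{}{}
		if active[e.providerID] {
			pairs[e] = struct{}{}
		}
	}
	return breakdown{
		Ingested:    len(entries),
		UniqueCIDs:  len(cids),
		ActivePairs: len(pairs),
	}
}

func main() {
	entries := []entry{
		{"cid1", "provA"},
		{"cid1", "provA"}, // re-ingested duplicate
		{"cid1", "provB"},
		{"cid2", "provB"},
		{"cid3", "provC"}, // provC is inactive
	}
	active := map[string]bool{"provA": true, "provB": true}

	fmt.Printf("%+v\n", stratify(entries, active))
	// {Ingested:5 UniqueCIDs:3 ActivePairs:3}
}
```

Reporting the three numbers together would show an ingestion spike from rebuilding an index as a rise in `Ingested` with little change in the other two.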
We should at least remove this counter from the information returned from the Even though the new provider cache eliminates performance issues around accessing these counters, I want this out of the provider information because this counter is confusing and has little value without other measurements that give it context. These counters should be presented in context with other index metrics, so the |
@gammazero I can't think of any reason why that would be a loss to us. I assume our Grafana panel here: https://protocollabs.grafana.net/d/pPROWOD7z/providers?orgId=1&refresh=5s&from=1688446116224&to=1689050916224&viewPanel=23 will be broken, but we can change up our reporting for the time being. I think if we manage to sort out the distance of providers, and sometime soon thereafter the unique CID count, this puts us in a position of higher visibility than where we are at present.
This is probably a somewhat ignorant question, but considering the index is a key-value store, can't we just query it for unique CIDs? I know this is greatly simplifying the problem, but someone asked me this question and I found myself without a good answer.
No. At the scale at which the key-value store operates, it can't simply return the number of unique keys.