Clarify the lifecycle of indexes with the lifecycle of a deal #1002

Closed
jacobheun opened this issue Nov 23, 2022 · 30 comments · Fixed by #1191
Labels: area/indexing, area/retrieval, hint/needs-decision, hint/needs-team-input

@jacobheun (Contributor) commented Nov 23, 2022:

The purpose of this issue is to clarify and discuss the various potential state changes for indexing over the lifecycle of a deal that has been marked for indexing. This is not intended to cover the existing lifecycle of indexing, but the desired lifecycle. Once solidified, we can create accompanying issues to resolve discrepancies in the current implementation to match the desired state.

Legend

  • 🚧 - Needs discussion & decision
  • 🍏 - Alignment reached

The Lifecycle of Indexing

🚧 Indexing a new Deal

Note: this section does not cover how deals are identified for indexing, as there is a separate effort to solidify requirements around that. See #689 and filecoin-project/notary-governance#666 for more details.

For discussion purposes, let’s assume that once the above issues are complete, there will be some way to identify if a specific deal should be indexed or not, and that there will be a mechanism to account for this for existing deals.

When a new deal has been successfully published, if an unsealed copy exists and the deal is marked for indexing, it should be immediately registered with the index provider/marked for indexing.
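
For illustration, a minimal sketch of what such a hook could look like, in Go. All names here (Deal, IndexProvider, AnnounceIndex) are hypothetical placeholders, not Boost's actual API:

```go
package indexlifecycle

import "context"

// Illustrative types only; not Boost's actual API.
type Deal struct {
	PieceCID          string
	MarkedForIndexing bool // did the client ask for this deal to be indexed?
	HasUnsealedCopy   bool // does an unsealed copy exist right now?
}

type IndexProvider interface {
	// AnnounceIndex publishes the piece's block index to the network indexers.
	AnnounceIndex(ctx context.Context, pieceCID string) error
}

// OnDealPublished registers a freshly published deal for indexing when it
// is both marked for indexing and backed by an unsealed copy.
func OnDealPublished(ctx context.Context, d Deal, prov IndexProvider) error {
	if !d.MarkedForIndexing || !d.HasUnsealedCopy {
		return nil // nothing to announce
	}
	return prov.AnnounceIndex(ctx, d.PieceCID)
}
```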

🚧 Deletion of an unsealed copy

When an unsealed copy is deleted today, indexes are not removed. There is currently support in the Network Indexers to include metadata on whether or not the data is unsealed, but it’s not being leveraged correctly today (all announced indexes are being marked as unsealed).

[Screenshot of cid.contact displaying unsealed status]

We need a mechanism to detect the removal of unsealed copies (as they can be rm'd manually). The section on Repairing Indexes below speaks to how this might be accomplished. Upon detecting a deletion, we can perform one of the following actions (we need to decide between the options):

Option 1 - Remove the indexes (recommended): When an unsealed copy is removed, as the unsealing process is a non-trivial operation, we should assume the copy will not become available in a short time frame. As such, the local indexes should be removed and we should announce the deletions to the network indexers. This frees up space both locally and on indexers. If unsealed copies were expected to be created/deleted often, then this option might be less reasonable, but this is not the case today.

Option 2 - Update Index Metadata: When an unsealed copy is removed we can update the metadata of the indexes for that deal to specify that no unsealed copy exists. This would still allow discovery of the SP who has the content, but retrieval would not function without an unseal. The advantage of this option is that a client could pay the unseal price to get the data, knowing who has it. However, it's worth noting that retrieval flows requiring unsealing are not particularly clear and would likely need further work to become viable.

If this option is chosen, we may want to change the indexing logic of sector expiration and will definitely need to change how removed sectors are handled.
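
To make the trade-off concrete, here is a hedged sketch contrasting the two options, again with hypothetical interfaces (LocalIndex, Announcer) rather than Boost's real ones:

```go
package indexlifecycle

import "context"

// Illustrative interfaces; not Boost's real ones.
type LocalIndex interface {
	Remove(pieceCID string) error
}

type Announcer interface {
	AnnounceRemoval(ctx context.Context, pieceCID string) error
	// AnnounceMetadata updates the sealed/unsealed flag on existing indexes.
	AnnounceMetadata(ctx context.Context, pieceCID string, unsealed bool) error
}

// OnUnsealedCopyDeleted reacts to the deletion of an unsealed copy.
// removeIndexes selects Option 1 (true) or Option 2 (false).
func OnUnsealedCopyDeleted(ctx context.Context, pieceCID string, removeIndexes bool, idx LocalIndex, ann Announcer) error {
	if removeIndexes {
		// Option 1: drop local indexes and announce the deletion,
		// freeing space both locally and on the network indexers.
		if err := idx.Remove(pieceCID); err != nil {
			return err
		}
		return ann.AnnounceRemoval(ctx, pieceCID)
	}
	// Option 2: keep the indexes but flag the piece as sealed-only,
	// so clients can still discover the SP and pay the unseal price.
	return ann.AnnounceMetadata(ctx, pieceCID, false)
}
```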

🚧 A sector is unsealed

When we detect a sector has been unsealed, and that sector is eligible for indexing, it should be registered with the index provider for reindexing (assuming unsealed deletion option 1 is selected).

🚧 Expiration of a sector

As long as the unsealed copy of the sector exists, the indexes should also exist. No changes should occur until the unsealed copy is removed.

🚧 Removal of a sector

Same as sector expiration, this should be a no-op, as index changes would be triggered from changes to the unsealed copy only.

If unsealed deletion option 2 is selected, removal of a sector when there is no unsealed copy will require deletion of indexes and announcement to the network indexers.

🚧 Repairing Indexes

One of the issues facing retrieval reliability is index metadata getting out of sync with unsealed copies (or the lack thereof). There are several reasons this may be occurring, but it often requires manual intervention by SPs to repair, and the visibility into when this needs to happen is poor. A proposal that has been discussed recently is to have an automatic repair job for indexing, to ensure unsealed copies eligible for indexing receive an integrity check and are repaired if there is an issue. This would NOT include automatic unsealing of data, as this is a resource-intensive process.

An extension of this proposal, given that unsealed copies may be deleted or created manually by SPs, is to have new index creation and repair both belong to this "repair" service. This service could be a background process that is continually repairing/registering/removing indexes with limited resource consumption. This could remove some operational overhead for common errors reported with retrievals. Specifics of how this could/should work can be fleshed out in a followup issue.
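
A rough sketch of what such a background repair loop might look like, assuming hypothetical PieceStore and Repairer interfaces; a real service would also need rate limiting, logging, and metrics:

```go
package indexlifecycle

import (
	"context"
	"time"
)

type PieceStore interface {
	// ListUnsealedPieces returns pieces with an unsealed copy on disk.
	ListUnsealedPieces(ctx context.Context) ([]string, error)
}

type Repairer interface {
	// CheckAndRepair verifies index integrity for a piece and rebuilds it
	// from the unsealed copy if it is missing or corrupt.
	CheckAndRepair(ctx context.Context, pieceCID string) error
}

// RunRepairLoop is a continual, low-intensity scan-and-repair pass.
func RunRepairLoop(ctx context.Context, store PieceStore, rep Repairer, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pieces, err := store.ListUnsealedPieces(ctx)
			if err != nil {
				continue // transient error; try again next tick
			}
			for _, p := range pieces {
				_ = rep.CheckAndRepair(ctx, p) // best-effort; errors surfaced via logs/metrics in practice
			}
		}
	}
}
```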

Related Issues & Discussions

@jacobheun jacobheun added this to Boost Nov 23, 2022
@jacobheun jacobheun added hint/needs-team-input Hint: Needs Team Input hint/needs-decision Hint: Needs Decision labels Nov 23, 2022
@LaurenSpiegel (Collaborator):

For option 1, why is the local index removed? Is there a local index for the sealed copy that is separate from the unsealed copy and this operation would just remove the local index for the unsealed copy?

@davidd8 (Collaborator) commented Nov 23, 2022:

One general observation is that the data lifecycle is bifurcating into sealed and unsealed data, which adds complexity. The more we can maintain the same state across sealed and unsealed data, the simpler it will be to reason about these systems. In that vein, I lean towards option (1) with the assumption that SPs only remove unsealed copies when also removing sealed copies.

This may imply expiration and removal of sectors should also require the removal of unsealed copies, and the time before the unsealed copy gets deleted is insignificant in the grand scheme of things. Why would an SP hold on to unsealed copies of data that they are no longer storing on-chain?

I'm not clear how the repair functionality works - is it to re-index unsealed copies, or find unsealed copies that are indexed? If this is possible, another idea is to trigger a repair from a failed retrieval attempt. But again, I'm not sure how the SP would know whether or not they have the unsealed copy of the requested CID, if they can't find it in their local index.

@dirkmc (Contributor) commented Nov 24, 2022:

For option 1, why is the local index removed?

When a client queries the SP to ask for the data, the SP:

  1. Gets the index for the data
  2. Gets a reader over the data
  3. Serves the data using the index + reader

If the unsealed copy has been removed, it's not possible to serve the deal's data. So we should return a "not found" response to queries at step 1.
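
A small sketch of that query path, with placeholder BlockIndex and UnsealedStore types; the point is only that a missing index short-circuits to "not found" at step 1:

```go
package indexlifecycle

import (
	"context"
	"errors"
	"io"
)

var ErrNotFound = errors.New("not found")

// Placeholder types; real code would use multihash/CID types.
type BlockIndex interface {
	// Lookup returns the byte offset and size of a payload CID in the piece.
	Lookup(payloadCID string) (offset int64, size int, err error)
}

type UnsealedStore interface {
	// ReaderAt returns a random-access reader over the unsealed copy,
	// or an error if no unsealed copy exists.
	ReaderAt(ctx context.Context, pieceCID string) (io.ReaderAt, error)
}

// ServeQuery mirrors the three steps above, failing fast with "not found"
// at step 1 when the index (and hence the unsealed copy) is gone.
func ServeQuery(ctx context.Context, payloadCID, pieceCID string, idx BlockIndex, store UnsealedStore, w io.Writer) error {
	offset, size, err := idx.Lookup(payloadCID) // step 1: index lookup
	if err != nil {
		return ErrNotFound
	}
	rdr, err := store.ReaderAt(ctx, pieceCID) // step 2: reader over the data
	if err != nil {
		return ErrNotFound
	}
	buf := make([]byte, size) // step 3: serve the block via index + reader
	n, err := rdr.ReadAt(buf, offset)
	if err != nil && err != io.EOF {
		return err
	}
	_, err = w.Write(buf[:n])
	return err
}
```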

@dirkmc (Contributor) commented Nov 24, 2022:

This may imply expiration and removal of sectors should also require the removal of unsealed copies

I think the answer depends on our expectations of unsealing. The system was designed with the expectation that unsealing would soon become cheap and fast; however, that hasn't happened in practice. Today it takes multiple hours to unseal data. If we assume that's not going to change soon, then in practical terms retrievability is unrelated to sealed data. So for the purposes of retrieval, what matters is the unsealed copy, and keeping the index in sync with the unsealed copy.

Why would an SP hold on to unsealed copies of data that they are no longer storing on-chain?

They may have separate contractual obligations for retrieval vs storage, and they may earn money from retrieval (e.g. serving popular data). I agree that this is more complicated conceptually, so we need to decide if it's worth the additional complexity to allow those use cases.

I'm not clear how the repair functionality works - is it to re-index unsealed copies, or find unsealed copies that are indexed?

It is to re-index unsealed copies. There may not be an index for an unsealed copy of data if

  • the SP never created an index (when the dagstore was introduced, SPs had to run a manual process to add indexes)
  • something went wrong with indexing
  • the unsealed copy was deleted, and later restored from a sealed copy

Note also that indexes may become corrupted (we've seen this happen in practice with dagstore indexes) so we want a way to detect and repair them.

@jacobheun (Contributor, Author) commented Nov 24, 2022:

So for the purposes of retrieval, what matters is the unsealed copy, and keeping the index in sync with the unsealed copy.

Exactly this. Option 2 is the thing that muddies the water here, and why I advocate for just sticking with option 1 until we have a very clear story with good UX around handling retrieval for sealed data. Option 1 makes the logic pretty straightforward:

"For all unsealed data keep the local indexes up to date, and for all unsealed copies marked for index announcement, keep network indexers up to date.".

We don't need to care about sealed data at all for indexing purposes. It doesn't matter if the sealed sector is expired/removed/etc; we only ever care about the unsealed data, and once it's gone we clean up the PieceDirectory (soon to replace the dagstore).

Is there a local index for the sealed copy that is separate from the unsealed copy and this operation would just remove the local index for the unsealed copy?

To help clarify this:

  • No indexes ever exist for sealed data.
    • If we've generated indexes from the unsealed data, we can associate any indexed CID with its sealed sector via the associated Piece CID, but as long as there is no unsealed copy, retrieval will fail until it's unsealed.

Note also that indexes may become corrupted (we've seen this happen in practice with dagstore indexes) so we want a way to detect and repair them.

+1. We need to assume corruption is going to happen; we're dealing with a lot of data. Automating a "scan and repair" approach could help us mitigate failures and reduce some operational overhead for SPs.

@LaurenSpiegel (Collaborator) commented Nov 28, 2022:

If the sealed data is not tracked in the Network Indexer and not in the local index:

  1. How does the client get the CID if they know which SP holds it (what does the flow look like)?
  2. How do they get the CID if they do not know which SP holds it (what does the flow look like)?

@jacobheun (Contributor, Author):

  1. How does the client get the CID if they know which SP holds it (what does the flow look like)?
  2. How do they get the CID if they do not know which SP holds it (what does the flow look like)?

I'm going to assume that by CID you mean the actual content of the cid for these (correct me if that's wrong).

For both of these, I'll start by addressing the sealed data problem. If there is sealed data that a client wants back, there are two options for getting it:

  • Contacting the SP and telling them to unseal it. - This could be a reasonable option for scenarios like disaster recovery or archival, where the intent is that hopefully you won't actually need to access the data, but you can contact the SP to regain access to it. This could keep storage costs low, while users might pay a premium for the unsealing and/or retrieval. AFAIK, this is likely the go-to option for SPs today, but I could be wrong.
  • Retrieval triggers an unseal - This logic exists today, where retrieval attempts will automatically trigger an unseal, but the UX for either party is not all that clear. As unsealing is resource intensive, SPs often set very high unseal prices to prevent it from being triggered. We could and should eventually improve this flow, but my sense is it's a low priority, as the expectation is that data that should be readily accessible be unsealed. The idea is that as a client I can pay the SP to unseal the data, and come back later to retrieve the data once unsealing has happened. If we want to dig into this option, we should create another issue/discussion as there's a lot to unpack here.
  1. How does the client get the CID if they know which SP holds it (what does the flow look like)?
  • The unseal is triggered by one of the above-mentioned mechanisms (taking ~3h+).
  • The unsealed copy is indexed (At this point the client could retrieve directly from the SP)
    • This could be regenerating indexes (option 1) or "repairing" them (option 2)
  • Indexing updates are announced to the network indexers
    • At this point any client could query the indexers for the data
  2. How do they get the CID if they do not know which SP holds it (what does the flow look like)?

The client should ideally know the PieceCID containing their content. This can be looked up on chain to find the SP, and then the flow from the previous scenario can occur. If for some reason the client does not know the PieceCID their content is in, there's no reasonable way to find the data.

The other option, as described in option 2 in the issue description, would be:

  • Indexes are generated for the original unsealed copy
  • The unsealed copy is deleted
  • The indexers are notified that the indexes for this deal are now sealed
  • The client queries the indexers and discovers that SP Alice has the content, and that the indexes are sealed
  • The client triggers unsealing via one of the above mentioned options
  • The unsealed copy is recreated
  • Local indexes are automatically "repaired" as needed.
    • The client already knows the SP, so they could retrieve at this point
  • Indexers are notified that the indexes for the deal are now unsealed (along with any potential index updates)
    • The client can query the indexers to find the unsealed index location if they haven't already retrieved

@LaurenSpiegel (Collaborator):

If for some reason the client does not know the PieceCID their content is in, there's no reasonable way to find the data.

This strikes me as a problem we should be thinking through. For example, if I know a content CID is stored with an SP for archival or backup purposes, I might not know the Piece CID but would still want to find it, have it unsealed and retrieve it.

Rather than deleting references to sealed data, shouldn't we consider keeping CID to Piece mappings and noting whether there is a sealed copy and unsealed copy?

@jacobheun (Contributor, Author):

Rather than deleting references to sealed data, shouldn't we consider keeping CID to Piece mappings and noting whether there is a sealed copy and unsealed copy?

This is option 2 mentioned in the description. It has better UX properties for clients, but definitely adds some complexities. Let's walk through the lifecycle of what that would look like in more depth.

Quickly though to clarify one thing (which we should ensure #689 covers):

  • Not all clients will need their data indexed (announced or not), and SPs may wish to charge different rates for indexed or non-indexed data. This may be especially true in the future when making smaller deals becomes more feasible.
    • As such, a client should have the ability to specify whether or not their data should be indexed locally when making a deal. This is separate from whether or not it should be announced.

The Lifecycle of Indexing w/ retained indexes for sealed data

🚧 Indexing a new Deal

This is mostly the same as option 1 in the description but there are some additional things we need to think about.

  • If a deal is marked for indexing, we need to ensure indexes are generated before unsealed copies are/can be deleted. We'll need to look at the storage pipeline to get better clarity around this: if we generate the indexes too early we may need to delete them if the deal fails later; if we generate them too late the unsealed copy may get deleted.
    • As soon as the unsealed copy is gone, we lose our ability to index until we unseal.
    • It might be worth establishing a clear signal in the deal whether it is "cold storage" or not. If it is cold storage and the client wants it indexed, we could automatically handle cleanup of the unsealed sector. I'm not sure if the "fast retrieval" flag would be sufficient for this.

🚧 Deletion of an unsealed copy

  • Check if a sealed copy of the data exists:
    • If there is no sealed copy, delete indexes and announce the deletion.
    • If there is still a sealed copy, update indexers that the indexes are sealed.

🚧 A sector is unsealed

  • When an unseal happens we should register for a "repair" of the indexes as an integrity check.
  • Update indexers that the indexes are unsealed.

🚧 Expiration of a sector

  • No action is needed. We shouldn't need to care until a sector is actually removed.

🚧 Removal of a sector

If the sector is removed but an unsealed copy is still available, we need to determine the expected behavior here. As @dirkmc mentioned above, storage and retrieval could be negotiated separately, in which case we should keep indexes around as long as there is still an unsealed copy. So the logic would be (sketched in code after the list):

  • Check if an unsealed copy still exists:
    • If there is an unsealed copy, do nothing.
    • If there is not an unsealed copy, delete indexes and announce to the network.
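
Reusing the hypothetical LocalIndex and Announcer interfaces from the earlier sketch, the decision logic for these two events might look like:

```go
// OnUnsealedDeleted: under this lifecycle, indexes are kept as long as any
// copy exists, with a sealed/unsealed metadata flag kept truthful.
func OnUnsealedDeleted(ctx context.Context, pieceCID string, sealedExists bool, idx LocalIndex, ann Announcer) error {
	if !sealedExists {
		// No copy left at all: drop indexes and announce the deletion.
		if err := idx.Remove(pieceCID); err != nil {
			return err
		}
		return ann.AnnounceRemoval(ctx, pieceCID)
	}
	// A sealed copy remains: keep the indexes, mark them sealed-only.
	return ann.AnnounceMetadata(ctx, pieceCID, false)
}

// OnSectorRemoved mirrors the bullet points above.
func OnSectorRemoved(ctx context.Context, pieceCID string, unsealedExists bool, idx LocalIndex, ann Announcer) error {
	if unsealedExists {
		return nil // retrieval can still be served from the unsealed copy
	}
	if err := idx.Remove(pieceCID); err != nil {
		return err
	}
	return ann.AnnounceRemoval(ctx, pieceCID)
}
```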

🚧 Repairing Indexes

Same as option 1 above; however, the integrity of indexes for sealed-only copies could not be checked without unsealing the data. Assuming appropriate replication/backup of the PieceDirectory once it's released, this should be a minimal risk.

@rvagg (Member) commented Dec 2, 2022:

fwiw I'm fine with either option from a retrieval perspective as long as I can count on either of these things being true (which map to the 2 options): (1) having some basic assurance (not rock solid of course) that the SPs the indexer tells me have a CID have an unsealed copy, or (2) that I can count on that FastRetrieval flag to mean something (it's also graphsync-specific metadata btw, maybe it shouldn't be?) because then I can trivially filter it out. I'm currently in that code for lassie & autoretrieve right now and I'd love to be able to plumb it through as a filter.

I suspect though, if we go with option 1, that it's going to be harder to put sealed data back in to the indexers at a future date once we have sealed->unsealed request/retrieval UX sorted out, because by that time the assumption will be baked in to the indexer data that it's sealed, so we have a compounded UX problem of ensuring that requests to the indexer API or responses from it account for sealed status. However, if we go with option 2 today, then FastRetrieval means something and we can bake that into our retrieval code to ignore sealed data entirely, the indexers have a ton more data from the network (and we get juicy metrics).

Maybe we should have a cursory discussion about unsealed request/retrieval UX to cover that base? Maybe it ends up being a simple matter that we can put on a roadmap and know it's not going to be a massive job. (e.g. a new endpoint that takes a retrieval-like proposal that I can ping with my unsealing request and it returns a status code and a % of unseal progress [so I can ping it repeatedly to see how it's going]).
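
For the filtering rvagg describes, a client-side sketch could look like the following; ProviderResult and its FastRetrieval field are illustrative stand-ins for whatever per-provider metadata the indexer actually returns:

```go
// ProviderResult is a hypothetical per-provider indexer query result.
type ProviderResult struct {
	Provider      string
	FastRetrieval bool // advertised "unsealed copy available" flag
}

// FilterUnsealed keeps only providers that claim a retrievable (unsealed)
// copy, so sealed-only records can be skipped up front.
func FilterUnsealed(results []ProviderResult) []ProviderResult {
	var out []ProviderResult
	for _, r := range results {
		if r.FastRetrieval {
			out = append(out, r)
		}
	}
	return out
}
```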

@jacobheun (Contributor, Author) commented Dec 2, 2022:

Based on discussions this week I think option 2 is likely to win out, as it does have the better usability properties. There is a good chunk of integration work we need to do here between Lotus and Boost to get better visibility into Sealed and Unsealed state changes to manage the indexes well. We need it regardless for the unsealed data, but will also need to account for sealing state changes for option 2.

if we go with option 1, that it's going to be harder to put sealed data back in to the indexers at a future date once we have sealed->unsealed request/retrieval UX sorted out because by that time the assumption will be baked in to the indexer data that it's sealed so we have a compounded UX problem of ensuring that requests to the indexer API or responses from it account for sealed status.

This shouldn't be the case. For option 1, the indexer only has the notion of unsealed data. If the unsealed copy goes away, we just remove that whole deal set from the indexers. For option 1 you don't need to filter on retrieval because all indexes are unsealed. You do need to filter in option 2, because they may be sealed or unsealed.

The only bit that should be "harder" is that all the indexes have to be re-ingested for that deal instead of just being able to update the metadata for the deal.

Maybe we should have a cursory discussion about unsealed request/retrieval UX to cover that base?

I think it's a good call while it's top of mind; I'll open a discussion for it.

@jacobheun (Contributor, Author):

Started an open discussion in #1027. Not a priority, but wanted to get some thoughts written down while it's fresh in my mind.

@LaurenSpiegel (Collaborator):

Some data points to consider.

  1. I checked with a large data SP as to what metadata they store separately from Boost:

we definitely store all the piece CIDs on the CAR files involved. On a more granular level, since a CAR file can contain files (chunks) of multiple "client's filenames", we also store the CID of each of these chunks to allow for partial retrievals.

  2. Singularity creates and stores on IPFS or in mongo an index of the data. See https://github.com/tech-greedy/singularity/wiki/Design

So, Boost does not have to be the source of truth on where to find individual sealed CIDs. The software being built on top of Boost is not expecting that and maps the user file to a Piece.

@rvagg (Member) commented Dec 5, 2022:

@jacobheun

This shouldn't be the case. For option 1, the indexer only has the notion of unsealed data

My point was that if we choose option 1 now and then later fix up the UX issues and decide we need sealed-only data put back into the indexers then it's going to be harder to expand the scope of indexer data later (and therefore change query patterns for users) than it is to keep it expanded today but carefully flagged for user consumption (option 2).

Maybe not though; maybe by the time we get to that, the number of users of the indexers is still quite small and it's a trivial change. We could also make queries more explicit, like having the endpoint identifier say "unsealed" today, so it's clear that's all you're getting and any future addition has to introduce a new identifier if you want all the data.

@brendalee (Collaborator):

Some thoughts on my side:

For clients that are storing directly with SPs (aka not going through some onboarding tooling like Estuary, web3.storage, etc.), it appears they implicitly trust the SPs to help them track where all their data is / which pieces it is in. It's unclear to me how sophisticated the tracking tools are for SPs, as the variance in SP sophistication can be large. For unannounced data sets, they likely have their own systems in place already.

When looking at this from a broader lens, I think it makes sense to have option 2 to discover sealed data as well, but I struggle a bit to figure out a real use case for this. The cases I'm thinking about are super broad, such as FVM / compute over data / other longer-term developments in which additional use cases come up, and clients interacting with the network want to find existence of data on the network (whether sealed or unsealed) or have some governance around paying for / maintaining sealed/unsealed data.

For the existing public (announced) data retrieval / data access that we are seeing, I am wondering if there are any cold storage type use cases. I know there are SPs who have cold storage type offerings (such as PiKNiK), but they are likely for client data and not for public data. In this sense, to enable current retrieval flows smoothly, option 1 makes more sense.

The callout I would like to make (which has been mentioned earlier as well) is that once we establish/push a pattern in the ecosystem, it will require a lot of work to change, which is why I'm leaning a bit more towards option 2 even though it is the more complex option.

@brendalee (Collaborator):

It would be helpful to get more information on usage of the network indexers etc., though I'm not sure if we can have visibility into where the requests are coming from / a meaningful breakdown (such as IPFS gateways, SPs, data on-ramps, etc.).

@LaurenSpiegel (Collaborator) commented Dec 13, 2022:

I propose an Option 3:

  1. We only index unsealed data but we note that it is unsealed (to handle @rvagg's concern and allow flexibility in the future).
  2. If unsealed data is deleted before the expiration or removal of a sector, nothing is done. But Boost returns an informative error message when asked to retrieve if it does not have access to unsealed data. This would allow SPs to share unsealed copies, as some are doing now, without Boost caring if the data is stored locally. This would also prevent lots of churn from happening in the indexer and minimize complexity in tracking.
  3. After the expiration or removal of a sector, if the unsealed data is deleted, we remove it from the indexer.

Interested in @willscott 's and @masih 's views.

@willscott (Collaborator):

  • Boost doesn't always know when unsealed data is deleted: the deletion is files being removed from the hard drive, and we typically realize they're gone the next time we get a request and try to access them.

I think the states you propose make sense, @LaurenSpiegel

@brendalee (Collaborator):

@LaurenSpiegel I think that's a good proposal / next step given the current state of the network, and provides enough optionality for the future :)

@masih (Member) commented Dec 13, 2022:

@LaurenSpiegel, on the option 3 proposal:

  1. Makes sense.

  2. Question: what would the rate of indexing churn be if we were to keep metadata in advertisements up-to-date? I would love to understand:

    • how often and how fast the unsealed data is deleted.
    • how often would it be resurrected by unsealing?

    Depending on the churn rate, Option 2 may make more sense.
    See train of thought below.

  3. Makes sense.


Thoughts, as they occur to me:

  • Most of the problems are caused by the discrepancy between the actual state of data compared to what's advertised to the indexers.
  • There is already indexing metadata in place and in use that differentiates sealed vs unsealed data.
  • Retrieval Client seems like the right place to bake in logic that adjusts the course of data retrieval depending on whether the data is sealed or unsealed.
  • On the Storage Provider side, I would lean towards a direction that facilitates client-side decisions by exposing data points, instead of making an implicit decision for the client.
  • Even though it is hard to reason about use cases for a sealed/unsealed differentiator today, my vote is to indeed continue to advertise it, and keep it up-to-date with the state of the data. Because:
    • Even though that has not been the case so far, we (the wider organisation) should work towards making sealing/unsealing cheaper and faster in the future.
    • Retaining information about the sealed vs. unsealed differentiator would then help us deliver impact quickly on retrievability without significant re-engineering or expensive migrations.
    • The cost of keeping the metadata on sealed/unsealed up-to-date vs. publishing a removal advertisement is negligible.
      • From the SP side, they still need to perform some action in response to unsealed data removal.
      • From the indexers' perspective, the net effect on the backing store is similar if not cheaper than removal advertisements.

@LaurenSpiegel (Collaborator):

@masih, I believe some SPs delete the unsealed copies immediately. Others likely do it when they need more disk space. Right now, there is very little to no resurrection by unsealing, so the churn rate is likely low.

However, we did recently learn that some SPs are sharing unsealed copies, so even if they deleted their local copy, it does not mean that they won't honor a retrieval request (and cache locally again once they've served that retrieval?). We are getting further info on how they are doing this, but it seems like something we should keep in mind, and we should not remove a reference to the SP having access to the unsealed data just because boost/lotus doesn't see it on disk at any point in time.

@masih (Member) commented Dec 14, 2022:

some SPs delete the unsealed copies immediately.

Ouch. Not great for retrieval. Though, understandable as far as the (current) game theory goes?

we should keep in mind and not remove a reference to the SP having access to the unsealed data.

Agreed.

Continuing the Option 3 discussion, based on the low churn level, and keeping in mind the need to reduce complexity on the Boost side, would it make sense to change Option 3.2 to the following (the proposed error responses are sketched in code after the list):

  • If unsealed data is deleted before the expiration or removal of a sector, Boost publishes a Metadata update advertisement for it as soon as it learns about its removal:
    • I am not entirely clear if there is an explicit event of sorts that signals removal of unsealed data in the Lotus <-> Boost interaction. If there is, then that signal should trigger the metadata update ad publication.
    • If there is no such signal, then Boost would passively publish those metadata updates upon failure to respond to a retrieval request.
    • If it is a mix then we should do both.
    • TL;DR: we should optimistically keep metadata up-to-date to avoid expensive "Not Found" errors on the retrieval side, since they have a sizeable impact on TTFB.
  • Additionally, all retrieval requests to Boost that are not handleable because of missing unsealed copies should return an explicit response that conveys the absence of unsealed data, say "Unsealed Data no longer present", instead of a generic "Not Found" error.
    • The rationale here is to create opportunity for specialisation/optimisation for SPs that share unsealed data.
      • The recommendation to such SPs is then to not update metadata, and instead respond with "Come back in a bit", assuming the time it takes to share unsealed copies among SPs is longer than the retrieval timeout.
  • Once we understand this unsealed data sharing among SPs better, we can adopt it in Boost, respond with "Come back in a bit" instead of "Unsealed Data no longer present" and remove the need for metadata update publication.
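
A sketch of the explicit responses proposed here; the error names and texts are illustrative, not an agreed API:

```go
package retrievalerrors

import "errors"

// Illustrative error values for the proposal above.
var (
	// The SP no longer holds an unsealed copy and won't fetch one.
	ErrUnsealedDataGone = errors.New("unsealed data no longer present")
	// The data is temporarily unavailable, e.g. an unseal is in progress
	// or a shared unsealed copy is being fetched from another SP.
	ErrComeBackLater = errors.New("come back in a bit")
)

// ClassifyRetrievalFailure picks the explicit response instead of a
// generic "not found".
func ClassifyRetrievalFailure(unsealInProgress, hasUnsealedCopy bool) error {
	switch {
	case unsealInProgress:
		return ErrComeBackLater
	case !hasUnsealedCopy:
		return ErrUnsealedDataGone
	default:
		return nil
	}
}
```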

@LaurenSpiegel (Collaborator):

On changing 3.2 to keep the indexer up to date on the state of unsealed data, we would only want to do this if there is a real impact on TTFB. Is this really the case, especially with bitswap, where multiple addresses can be checked and it's noisy anyway?

It's possible some reputation schemes might want to take into account whether the SP serves the data it indexes. If we clean up the records, we remove the ability to do that. Though I agree that if there is a real impact on TTFB, it's worth the sacrifice.

@masih (Member) commented Dec 15, 2022:

Is this really the case especially with bitswap where multiple addresses can be checked and it's noisy anyway?

I don't know if there are autoretrieve metrics that would answer that. Maybe @hannahhoward or @rvagg know? I am curious how we would know this and could truly measure it if we are returning a generic "Not Found" in response to an unsealed copy being deleted. So maybe the first iteration here is to change error codes such that we can observe the effect of this on TTFB.

If we clean up the records we remove the ability to do that.

I am not sure I follow; how would the records reflecting the true state remove the ability to measure reputation?

@masih masih closed this as completed Dec 15, 2022
@masih masih moved this to Done in Boost Dec 15, 2022
@masih masih reopened this Dec 15, 2022
@masih (Member) commented Dec 15, 2022:

I am sorry I must have clicked close issue by mistake!

@LaurenSpiegel (Collaborator) commented Dec 16, 2022:

I don't know if there are autoretrieve metrics that would answer that. Maybe @hannahhoward or @rvagg know? I am curious how we would know this and could truly measure it if we are returning a generic "Not Found" in response to an unsealed copy being deleted. So maybe the first iteration here is to change error codes such that we can observe the effect of this on TTFB.

Totally agree we should have clearer error messaging if possible. Autoretrieve will not help answer performance on bitswap though.

I am not sure I follow; how would the records reflecting the true state removes the ability to measure reputation?

The idea is that SPs would potentially stop announcing CIDs so that they don't get hit with retrieval requests for them, and then don't get dinged on reputation for not serving them. The counter is that a thorough reputation system needs to have a better source of what should be indexed, check that, and check retrieval (but not all reputation checkers might have that better source, and having more signals is better than fewer).

I am sorry I must have clicked close issue by mistake!

Could have been a Freudian click. 😄 We should discuss this in real-time in the new year to get to some resolution.

@jacobheun (Contributor, Author):

We have a sync discussion to talk about this later today, so to quickly summarize: we're effectively deciding between option 2 and the newly proposed option 3.

As far as I see it, option 3 is a partial implementation of option 2. We still care about announced indexes for sealed data, except that option 3 isn't notifying the indexers when unsealed copy removals are detected.

As I see these as progressive implementations, I'm not overly opposed to starting with option 3, but I do think the UX here is bad for retrieval clients:

  • I query IPNI and discover that SP X has the cid and that it's unsealed
  • I attempt a retrieval from SP X, only to have Boost tell me that there is no unsealed copy and the data can't be retrieved

It's not a payment or access problem, it's an eventual consistency problem that will never become consistent. From a retrieval client standpoint, I am now inclined to not trust what's in IPNI if that data never gets updated. It's not IPNI's fault, but will likely get some blame as it's the first step in the retrieval process.

If Boost can't detect an unsealed copy, that data isn't retrievable and someone has to manually intervene to make it so.

What is option 3 optimizing for?

"This would allow SP's to share unsealed copies as some are doing now without boost caring if the data is stored locally. This would also prevent lots of churn from happening in the indexer and minimize complexity in tracking."

I don't understand the first part of this: how are they sharing unsealed copies in a way that allows retrieval to work without manual intervention?

@LaurenSpiegel (Collaborator):

Based on live discussion today, we propose the following (a sketch of the notification interface follows the list):

  1. Build an interface between the sealer and Boost that actively informs Boost if retrieval data is not available. [unclear on scope of work; need to sync with lotus team]
  2. The interface must be flexible enough to allow for "shared" unsealed copies by SPs.
  3. Boost to keep IPNI updated with the "facts". IPNI to reflect whether an unsealed copy is available or not (IPNI does not delete an entry just because there is no unsealed copy). Client apps can make decisions based on what is reflected in IPNI.
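
A hypothetical shape for the interface in point 1; all names are placeholders pending the sync with the lotus team:

```go
// SectorStateNotifier is a hypothetical sealer -> Boost notification
// interface; nothing here is an agreed or existing API.
type SectorStateNotifier interface {
	// OnUnsealedCopyRemoved registers a callback fired when the sealer
	// detects an unsealed copy is no longer available locally.
	OnUnsealedCopyRemoved(cb func(pieceCID string))
	// OnUnsealedCopyShared registers a callback for "shared" unsealed
	// copies held elsewhere (point 2), so Boost can keep the IPNI record
	// truthful even though the data is not on local disk.
	OnUnsealedCopyShared(cb func(pieceCID, provider string))
}
```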

@brendalee to set up sync with lotus team

@jacobheun (Contributor, Author):

Note for implementation: Ensure SPs can manage pricing/access for retrievals where a sector has expired. An SP may wish to change retrieval requirements until a sector is renewed. This should be doable with deal filters, but we should ensure that flow works well and that it's clearly documented.

Example: SP Sally is serving free retrievals of Piece bafy2 in Sector 1. When Sector 1 expires, Sally would like to set a non-zero retrieval price until bafy2 or Sector 1 is extended. This ensures Sally can still be rewarded via retrieval for holding the content, since she is no longer being paid for storage. Once the sector is extended, Sally wants to resume free retrievals to meet the storage requirements of the client.
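
An illustrative sketch of the filter logic for Sally's scenario; the function name and price constant are made up, and in practice this decision would live in an SP-configured retrieval/deal filter:

```go
// Made-up price constant; real SPs would configure their own policy.
const expiredSectorPriceAttoFIL = 1_000_000

// retrievalPriceAttoFIL: free while the storage deal is live, a non-zero
// price once the sector has expired, free again once it is extended.
func retrievalPriceAttoFIL(sectorExpiredAndNotExtended bool) int64 {
	if sectorExpiredAndNotExtended {
		// Reward holding the content while no storage payment accrues.
		return expiredSectorPriceAttoFIL
	}
	return 0 // free retrieval while the storage deal is active
}
```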

@jacobheun (Contributor, Author):

As we've reached consensus here and as I should have originally created this as a discussion thread, I'm going to move this to the Discussion board and create an issue to track the engineering effort here.

@filecoin-project filecoin-project locked and limited conversation to collaborators Feb 17, 2023
@jacobheun jacobheun converted this issue into discussion #1198 Feb 17, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in Boost Feb 17, 2023
