Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to retrieve when root payload CID is an identity CID #715

Merged
merged 4 commits into from
Jan 26, 2023

Conversation

hannahhoward
Copy link
Collaborator

@hannahhoward hannahhoward commented Aug 20, 2022

What

Any CAR file with an identity CID as the root payload CID is unretrievable if written with CarV2.

When a file is imported via UnixFS, the default builders have a condition where any block less than 126 bytes becomes an identity CID. This is true for the most common CAR generation utils used by Lotus & boostx. So this is likely applicable to real world pieces.

Car files written with CarV1 with identity CIDs may or may not be retrievable but as they are written back out with CarV2 writers they likely won't match commP if trying to do whole payload / piece retrieval over GraphSync.

How

This PR is actually an issue with a test to demonstrate. This issue affacts the default integration tests, and you can demonstrate simply by adding a retrieval to the boost deal test. This produces an error of:

Retrieval query: getPieceInfoFromCid: getting pieces for cid bafyaa4asfyfcmalqudsaeiclsn2n56lmy7rolvh3hcmbb7lailbxpowb4quv6ggvqayqo7pj7mjaagejsbbrelqkeyaxbiheaiqp7e4gsn3ajlb533z2wp377int6fxiluov6il34itoaemeb5w5qeasaamk35b4bihaqaqyqcexuieaqbacbaejhi: getting pieces containing block bafyaa4asfyfcmalqudsaeiclsn2n56lmy7rolvh3hcmbb7lailbxpowb4quv6ggvqayqo7pj7mjaagejsbbrelqkeyaxbiheaiqp7e4gsn3ajlb533z2wp377int6fxiluov6il34itoaemeb5w5qeasaamk35b4bihaqaqyqcexuieaqbacbaejhi: failed to lookup index for mh 0070122e0a260170a0e402204b9374def96cc7e2e5d4fb389810fd6042c377bac1e4295f18d58031077de9fb120018899043122e0a260170a0e40220ff9386937604ac3ddef3ab3f7ffa1b3f16e85d1d5f217be226e011840f6dd810120018adf43c0a0e08021880897a208080402080893a, err: datastore: key not found

(probably will produce a different CID if you run it yourself)

You can then fix the test (or at least get it past the above issue) by removing the InlineBuilder in the test car file generator: https://github.com/filecoin-project/boost/blob/main/testutil/car.go#L142

Looking at the error, I believe the issue is that the identity CID is simply not present in the car index generated by the DAGStore, as it's not in the underlying CARv2. So when you ask "what information do you have on this CID?", it responds saying "I don't have any"

@rvagg rvagg self-assigned this Aug 22, 2022
@rvagg
Copy link
Member

rvagg commented Aug 23, 2022

filecoin-project/lotus#2487

Even lotus import is still apparently producing CARs like this, and has always been.

@rvagg
Copy link
Member

rvagg commented Aug 23, 2022

Status of investigation so far:

@rvagg
Copy link
Member

rvagg commented Aug 23, 2022

btw, the 1 hack only works because the CARv2 blockstore interface already properly handles identity CIDs by returning the block data without looking it up, so it doesn't matter what the offset is, it would be returned successfully .. but having an offset that's within the CARv1 header seems preferable to having one somewhere else in the file - and 0 won't work because the indexer has issues with 0 (see linked go-car PR for commentary on this).

@jacobheun
Copy link
Contributor

I'm not super familiar with the dag store and deals DB, but I'm wondering if we can do a fallback lookup by payload cid if the multihash lookup fails as a workaround.

rvagg added a commit to filecoin-project/go-fil-markets that referenced this pull request Aug 24, 2022
There are valid cases where a CAR may have an identity CID as its root that is
not represented as a 'block' within the CAR body and therefore isn't indexed
by the dagstore. In this case, we inspect the identity CID content and treat
the query as a query for the intersection of all of the links within the block.

Ref: filecoin-project/boost#715
rvagg added a commit to filecoin-project/go-fil-markets that referenced this pull request Aug 24, 2022
There are valid cases where a CAR may have an identity CID as its root that is
not represented as a 'block' within the CAR body and therefore isn't indexed
by the dagstore. In this case, we inspect the identity CID content and treat
the query as a query for the intersection of all of the links within the block.

Ref: filecoin-project/boost#715
@rvagg
Copy link
Member

rvagg commented Aug 24, 2022

Another alternative approach to consider, suggested by @dirkmc: when the lookup fails, decode the identity CID and query for all of the links it contains and return the result for that. Implemented in filecoin-project/go-fil-markets#747 and working if this branch links against that branch of go-fil-markets.

Things to consider about this approach:

  • There's zero guarantee that the PayloadCID we're considering is original root CID of a CAR (i.e. the original PayloadCID), you could be crafting a custom identity CID with all of the links you want to retrieve and making it find you a piece that has them all—this would be a new feature I guess, is that OK?
  • There's no limit on the number here other than practical constraints of querying with a very large identity CID, it could contain a large number of links and we're doing lookups for them all. The way I've written it, it'll bail when it stops finding common pieces that contain them all, but in theory you could craft a PayloadCID that references all of the CIDs in a particular piece and it'd go back and forth to the dagstore to check that they're all there before giving you a 👍. Maybe limits are reasonable here? I'm not sure what though.

@rvagg
Copy link
Member

rvagg commented Aug 25, 2022

One more piece of learning from my investigations today: online deals handle identity CIDs differently to the default CAR creation process that we've been stuck with elsewhere:

I'm not sure but I believe that prior to the go-car/v2 blockstore interface, we would have been writing these out with identity CIDs in them because that's what go-car always did.

I discovered this by extending lotus/itest/deals_anycid_test.go and inserting some identity CIDs into the source data. It transfers OK and I get a CAR out the SP end and it has an identity CID that I put in the root but it doesn't have the intermediate ones as blocks in the CAR so fails the CommP comparison.

Depending on how users are generating CARs, this may or may not be a problem in the wild but it could explain some CommP mismatch errors. Lotus itself sets up an inline CID builder for UnixFS creation (which I believe is used for import, maybe more?): https://github.com/filecoin-project/lotus/blob/f1f31d10671c5076943d3c47e92a514dce4697de/lib/unixfs/filestore.go#L37-L40 - so if you make a DAG and it has an identity CID not at the root, then your data transfer is going to fail CommP checking.

@jacobheun
Copy link
Contributor

so if you make a DAG and it has an identity CID not at the root, then your data transfer is going to fail CommP checking.

This sounds like filecoin-project/lotus#8663

@rvagg
Copy link
Member

rvagg commented Aug 29, 2022

Summary of options

  • Faux-index the root identity CIDs as if they were contained in the data block of the CAR
  • Make the Dagstore register root identity CIDs in its inverted index when initialising a new shard
    • feat: register identity hash roots as existing in shards dagstore#138
    • Similar to the car/v2 "hack" but less of a hack, it doesn't require simulating an index offset
    • Requires a re-index of existing pieces to get access to old data (could get away with a scan of just headers to add this data in, but we'd also need a signal that we haven't performed such a "migration")
  • Make the Dagstore expose an API that lets you ask for shards that have specific roots
    • (not implemented)
    • A major API change to the Dagstore and a non-trivial internal change since it'll need a place to store root information
    • Requires a scan of existing pieces to get access to old data - not a full re-index since we just need to load CAR headers if we don't have the data
    • Enables a new API option for Lotus that establishes a direct link between a CID and a piece if we want it - "find a piece with exactly this root"
  • Special-case identity CIDs for retrieval requests when we fail to find a Dagstore shard with that CID by looking for shards that contain all of the CIDs within the identity CID
    • feat: handle retrieval queries for unindexed identity payload CIDs go-fil-markets#747
    • Gives us access to old data without needing a migration/re-index
    • Doesn't establish a clear link between the CID and the Piece - you can craft a custom identity CID that'll take advantage of this - but does that matter? We can add limits if we want (size, number of links, codec). If we go with this option, we establish a new retrieval mode ("find a piece that contains all of these CIDs").

A combination of these may be appropriate.

Other ideas?

@aschmahmann
Copy link
Collaborator

aschmahmann commented Aug 29, 2022

@rvagg forgive any ignorance here, but IIUC there's a fourth option that looks a little bit like your third one "looking for shards that contain all of the CIDs within the identity CID", but is a little less hard-coded to identity CIDs.

  • Allow for retrieving data via GraphSync without caring which Piece it's in as long as the SP has the data
    • When encountering a block during a traversal use the DAGStore's top level index to find the CARs the block is in and fetch from one of them. Prefetch more data from that CAR if a) the traversal sees lots of requests that target the same CAR b) there's a hint in go-data-transfer about the DealID (and therefore CAR) to fetch.
    • IIUC the Bitswap support work ([Epic] Add support for the Bitswap transport #709) will end up doing this kind of thing anyway since it won't care about which Piece a given block is in because the same block can be in multiple Pieces and there's no go-data-transfer related signal about the Piece.
  • As a bonus this also happens to make it very easy to enable cross-sector retrieval from the same SP (e.g. a 33GB file stored across two sectors)
  • Might end up either more complicated than "looking for shards that contain all the CIDs", or with a performance hit depending on how important the prefetching optimization is and what the implementation looks like.

@rvagg
Copy link
Member

rvagg commented Aug 30, 2022

@aschmahmann yeah, this is the bit that slightly confuses me and I hope @hannahhoward can provide additional context - I think that the main source of complexity is that retrieval deals are tied to pieces, which may or may not be sealed. So when setting up a retrieval deal (the "query"), you're giving a payload CID and it's using that to figure out where it's going to get it from and sends you back the deal parameters for it.

So these difficulties all hinge from that point, where we're tying a Payload CID to a Piece CID to set up a retrieval deal that's acceptable to both parties.

Maybe we can make this go away, or mostly go away with query-ask v2? #671

@hannahhoward
Copy link
Collaborator Author

@aschmahmann @rvagg

I was thinking about this, and it's possible we could solve this this as part of moving graphsync retrievals out of the main process once we have a load balancer. Then GraphSync could use the same remote blockstore used by bitswap and employ the same checks -- i.e. don't ask for CIDs that are identity CIDs across the process boundary.

@rvagg
Copy link
Member

rvagg commented Aug 30, 2022

FYI I'm working on tidying up filecoin-project/go-fil-markets#747 to propose as a workaround in lieu of a clearer path forward. Possibly we just need to patch this over to make current style deals work and then move on to focusing on less piece-centric retrievals with bitswap and independent graphsync?

rvagg added a commit to filecoin-project/go-fil-markets that referenced this pull request Aug 31, 2022
There are valid cases where a CAR may have an identity CID as its root that is
not represented as a 'block' within the CAR body and therefore isn't indexed
by the dagstore. In this case, we inspect the identity CID content and treat
the query as a query for the intersection of all of the links within the block.

Ref: filecoin-project/boost#715
rvagg added a commit to filecoin-project/go-fil-markets that referenced this pull request Sep 1, 2022
There are valid cases where a CAR may have an identity CID as its root that is
not represented as a 'block' within the CAR body and therefore isn't indexed
by the dagstore. In this case, we inspect the identity CID content and treat
the query as a query for the intersection of all of the links within the block.

Ref: filecoin-project/boost#715
@rvagg rvagg force-pushed the test/demonstrate-identity-cid-car-issue branch 3 times, most recently from 17b9b4f to 648fd75 Compare September 1, 2022 07:08
@rvagg
Copy link
Member

rvagg commented Sep 1, 2022

This is passing now with the only "fix" applied being to use go-fil-markets @ filecoin-project/go-fil-markets#747 so you can now do retrievals with a root identity CID and it'll work it out for you on the SP end.

@rvagg
Copy link
Member

rvagg commented Sep 1, 2022

I think we should merge this in to get the test changes included, although it might need to wait till filecoin-project/go-fil-markets#747 is merged

rvagg added a commit to filecoin-project/go-fil-markets that referenced this pull request Sep 2, 2022
There are valid cases where a CAR may have an identity CID as its root that is
not represented as a 'block' within the CAR body and therefore isn't indexed
by the dagstore. In this case, we inspect the identity CID content and treat
the query as a query for the intersection of all of the links within the block.

Ref: filecoin-project/boost#715
@jacobheun
Copy link
Contributor

Let's merge this into the release/lotus1.17.2 branch. That's the target release for the breaking libp2p updates, so the plan is to include all our updates dependent on that release there.

@rvagg
Copy link
Member

rvagg commented Sep 5, 2022

👍 filecoin-project/go-fil-markets#747 needs someone with merge access though. It's got a 👍 although there were a few changes based on our conversations since that 👍 but it should be fine to merge and update here. Note that that's going into go-fil-markets master which has those libp2p upgrades in it too so that'll need to come back here.

dirkmc pushed a commit to filecoin-project/go-fil-markets that referenced this pull request Sep 13, 2022
)

* feat: handle retrieval queries for unindexed identity payload CIDs

There are valid cases where a CAR may have an identity CID as its root that is
not represented as a 'block' within the CAR body and therefore isn't indexed
by the dagstore. In this case, we inspect the identity CID content and treat
the query as a query for the intersection of all of the links within the block.

Ref: filecoin-project/boost#715

* fix: refactor out multiple calls to dagStore.GetPiecesContainingBlock

1. to support identity PayloadCID without having to duplicate decode & lookup
   logic
2. because it's not cheap, especially for identity PayloadCIDs with lots of
   links

The tradeoff is that in some cases we end up calling the PieceStore more than
we otherwise would.

* feat: impose limits on identity PayloadCIDs

* Byte limit (2048)
* Link limit (32)

* feat: handle retrievals for nested identity CIDs

* chore: expand testing to cover dag-pb identity CIDs
@rvagg
Copy link
Member

rvagg commented Sep 14, 2022

filecoin-project/go-fil-markets#747 is merged, but it still needs to bubble up through the stack, which might take a while given the lotus dependency issues

@jacobheun jacobheun added the area/retrieval Area: Retrieval label Nov 24, 2022
@jacobheun jacobheun force-pushed the test/demonstrate-identity-cid-car-issue branch from 9fcb9be to 0412500 Compare November 24, 2022 16:25
@jacobheun
Copy link
Contributor

I ran across this checking the retrieval backlog so I went ahead and rebased with minor updates since the dependency chain has been resolved now.

Tests are all passing so this should be good for a final review.

@jacobheun jacobheun requested a review from nonsense November 24, 2022 16:53
// Disable this part of the test for now because there is a bug in lotus
// introduced in this PR that causes the test to fail:
// https://github.com/filecoin-project/lotus/pull/9174/files#top
// *********************************************************************
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nonsense This was previously commented out in a separate update. I added it back in to see if things are passing now which seems to be the case.

@jacobheun jacobheun merged commit 188081a into main Jan 26, 2023
@jacobheun jacobheun deleted the test/demonstrate-identity-cid-car-issue branch January 26, 2023 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retrieval Area: Retrieval
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants