multiformat code for CARs #239
Comments
Seems like a reasonable format / type of CID. Agreed, we should have carv1 and carv2 codecs for when we talk about the multihash of a CAR file. |
I don't understand the value add here. How is this different from someone asking for a code for MP4, AVI, jpeg, BMP, zip, ... file formats? There are magic bytes at the beginning of the CAR file that tell you what it is (as with various other formats). If you have a database only storing the hashes you can just augment with a signal, either way when the data is fetched the user can know what it is. Generally, this proposal sounds like #4 unless I'm missing something. |
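To make the "magic bytes" point concrete, here is a minimal sketch of sniffing a CAR version from the first bytes of a file. It assumes, as I read the CAR specs, that CARv2 starts with a fixed 11-byte pragma and that CARv1 starts with a varint length followed by a DAG-CBOR header whose last entry is `version: 1`. The v1 check is therefore a heuristic, not a true magic number, and `sniff_car_version` is a hypothetical helper name.

```python
import io
from typing import Optional

# CARv2 pragma: varint(10) followed by the DAG-CBOR map {"version": 2}.
CARV2_PRAGMA = bytes.fromhex("0aa16776657273696f6e02")

# DAG-CBOR sorts map keys by length, so in a CARv1 header "version"
# comes after "roots" and the header should end with these bytes.
CARV1_HEADER_TAIL = b"\x67version\x01"


def read_varint(stream: io.BytesIO) -> int:
    """Read an unsigned LEB128 varint from the stream."""
    result = 0
    shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise ValueError("unexpected end of data in varint")
        result |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            return result
        shift += 7


def sniff_car_version(data: bytes) -> Optional[int]:
    """Return 2, 1, or None if the bytes do not look like a CAR file."""
    if data.startswith(CARV2_PRAGMA):
        return 2
    stream = io.BytesIO(data)
    try:
        header_len = read_varint(stream)
    except ValueError:
        return None
    header = stream.read(header_len)
    # Heuristic only: a complete header that ends with `version: 1`.
    if len(header) == header_len and header.endswith(CARV1_HEADER_TAIL):
        return 1
    return None
```

A database that stores only hashes cannot run this check, which is the gap the proposal is trying to close. Once the bytes are fetched, the check is cheap.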
^ Agreed, this is basically the same as the mime types proposals - which we've been mostly favourable toward adding, we just haven't actually pulled the trigger on it (#159 is close). For all of these general file type types, we can consider them to be "codecs", in the same way that a filename extension tells you what's in the file and how to open (decode) it. It's a bit uncomfortable if we try and conceive of that strictly within the bounds of what we think of as "IPFS" (where large files as UnixFS would get chunked and the root of a chunked DAG is not going to be the same multihash as the whole thing, nor would it be able to use this multicodec code). But the table is meant to be useful well beyond those bounds, and we're even pushing the definition of CIDs beyond comfortable boundaries (like with the special Filecoin CIDs, or even identity CIDs). I'm personally comfortable with the notion that just because you have a valid CID doesn't mean you can do anything useful with it inside an IPFS, or even arbitrary IPLD, context. It's just a Content IDentifier that's useful for something within your specific system, separate from the actual contents (which I could inspect for magic bytes if I had it in this, and many other cases). Specifically on CAR mime types:
Maybe we need to push that forward as well as mime types in the multicodec table? |
There seems to be an assumption that these will correspond to IPLD codecs, and they most likely won’t ;) Multiformats is bigger than just IPLD codecs. We have a thing, that we hash, and we’d like to reference that thing we’re hashing with an identifier. It’s a pretty obvious case for a multiformat :) |
This is not necessarily a bad assumption when you're using it for a "codec code" in a CID; the primary purpose of that field is to look up the IPLD codec. You're just suggesting abusing that field for other purposes, which I think is where the disagreement hinges: do we care about how users abuse these things for their own purposes, or do we try to retain some amount of purity? 🤷 I'm pretty relaxed about it, but I'm also fine with making our own tooling hostile to abuses of the standards like this if you show up with your own funky use and expect everything to work. |
How did you get here? I was pretty directly comparing these to MIME types and that we should treat them similarly.
Why do you want to use a CID here instead of
IIUC the main point of putting all the codes in a single table, instead of having a multiaddr, multihash, namespace, ipld-codec, mimetype, ... set of tables, was to make it easier to figure out what a code means, since there are fewer collisions that can only be resolved through context. So when adding a new code it seems fair game to ask "what is it, what's it for and what category does it belong to". So how does this fit in? It seems an awful lot like a mimetype or just a random file format to me, and should get handled similarly. Another note on having purposes for codes: Perhaps I'm an outlier here, but even the dual purposing of |
I find this a very good point. For me, one of the central points of a CID is that the multicodec code before the multihash is an IPLD codec (I know that the spec isn't clear about that; I hope it will be some day). If it is not an IPLD codec, it's not a CID anymore, though of course you can have a thing that could be parsed like a CID. But instead you could just use |
You’re right. I’m sold. |
FYI #258 adds a |
Quoting @lidel here from the PR
And my response here
|
@lidel @rvagg just to clarify, our current intention is to do what @aschmahmann suggested and what @mikeal agreed
That said, I personally think there is value in turning them into CIDs and packaging CARs as IPLD codecs. That way they can have first-class representation in IPLD. However, that would mean we can have CIDs for things that are larger than the current block size limit. Maybe that is OK, because you'd know it from the CID. I'd love to hear your opinions in that regard though. |
My guess is that objections to CID come either from disagreements that this is an "IPLD format", or that this is a CID for a thing larger than a happy libp2p block size (although I've only seen this one from you @Gozala, so maybe it's not a live objection?). I'm on the side of agreeing that CAR can fit the definition of an "IPLD format", partly because this isn't a definition that we've ratified, but mainly because it's a binary encode format that maps nicely to the data model and also has links. It's even more of an IPLD format than JSON or CBOR, which we've tagged "serialization" in the table, because of the links thing. It's a restricted format, like dag-pb, in that it has a fixed schema, but it yields a map ( But anyway, that's all just to say that I'd not have a problem with CIDs being used to define CARs. We have the CID spec so we can append codec codes to multihashes, so why not use it? CID is just a versioned There may be a valid concern of people treating these things in the wrong way over libp2p. But what are they going to do? Request them from the DHT and not get an answer? Even if you made one small enough, go-ipfs and js-ipfs won't know what to do with this codec code anyway (although it's interesting to consider the possibility of adding support at that layer ...). |
I don't have a particular objection to this being an IPLD codec since basically anything can be an IPLD codec if it can turn bytes into the IPLD Data model (ideally in a way more sophisticated than just an array of bytes 😄). A couple questions I have here, which have been raised about
These questions, and the stated purpose of the request for a code, made me think this looks basically like a MIME type request and should've gotten treated as such. If folks want to use IPLD codecs for CAR files though then no objections from me as long as we have some answers to the above. Side comment about block size limits and IPLD - IMO they have nothing to do with libp2p and they are instead related to the feasibility of incrementally verifiable downloads of large blocks of data. I've collected some thoughts on block limits here, which I'd love feedback on. |
That is one of the reasons why I felt classifying it as "serialization" made more sense right now. I suspect we'll get a better idea of what we'd like the IPLD data model to look like sometime in the future. I thought that once we have a spec, that is when it would make most sense to propose changing the classification to "IPLD Codec". I also suspect we may find that, instead of a general CAR codec, we define a codec for a more constrained CAR variant, in which case we may be more interested in defining an IPLD codec for that variant as opposed to a CAR codec.
My argument had been that we could have version agnostic CODE along with versioned ones. That way software can choose to support arbitrary car and decode with appropriate version based on CAR header or can choose to support specific version. So far our use cases mostly had been around "block sets" where CAR version seemed to not matter as long as we can ingest the blocks. That is also why I'm biased towards version agnostic code. If I'm missing something crucial please call out.
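As a rough illustration of the "version-agnostic code alongside versioned ones" idea, the sketch below maps a codec code to the CAR version it requires, with `None` meaning any version is acceptable. `0x0202` is the `car` entry as I understand the current multicodec table. The two versioned codes are HYPOTHETICAL placeholders taken from what I believe is the table's private-use range; they are not registered entries.

```python
from typing import Dict, Optional

CAR_ANY = 0x0202        # "car" in the multicodec table (verify before relying on it)
CAR_V1_ONLY = 0x300001  # HYPOTHETICAL placeholder, not a registered code
CAR_V2_ONLY = 0x300002  # HYPOTHETICAL placeholder, not a registered code

# None means "version agnostic": trust whatever the CAR header declares.
REQUIRED_VERSION: Dict[int, Optional[int]] = {
    CAR_ANY: None,
    CAR_V1_ONLY: 1,
    CAR_V2_ONLY: 2,
}


def accepts(code: int, header_version: int) -> bool:
    """Check whether a CAR whose header declares `header_version`
    is consistent with the codec code it was labelled with."""
    if code not in REQUIRED_VERSION:
        raise ValueError(f"not a CAR codec code: {code:#x}")
    required = REQUIRED_VERSION[code]
    return required is None or required == header_version
```

Software that only ingests block sets would label everything `CAR_ANY` and dispatch on the header. Software that needs a specific version can reject a mismatch before reading any blocks.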
This seems ok to me. I think it's reasonable to have an overlap in some cases and not in others. I'm not sure about CAR specific case, but if we end up with canonical IPLD model for CAR I think it would be reasonable. I think it is also reasonable to have alternative IPLD models for the CAR with different codes.
I hope my answers above provided some clarity on this. More concretely, I think we want to evaluate CAR as an IPLD codec, but do not want to claim an IPLD Codec code until we have more confidence, a spec, and a prototype of what that might look like. It seemed reasonable to recognize CAR as a serialization format and evaluate other ideas from there. If you really feel like it should be designated as "ipld" as opposed to "serialization", I'm happy to send another PR to change that. |
I see we have multiple codes now(?)
Can this be closed? |
yeah, should have been closed with #258 |
In nft/web3.storage we accept a lot of CAR files, and I’m starting to worry that we aren’t doing enough to prepare for future version upgrades of the CAR format.
Since we routinely produce multiple CAR files for partial DAGs we end up with a lot of CAR files with the same CID. As a result, we’ve taken to writing the multihash of the CAR file into the database where we track each upload along with the root CID.
It would be much nicer if we could replace the CAR multihash with a CID that included the CAR version (we’d still store the root CID separately). I can see in the multiformats table that we have entries for some of the extensions we’ve done within the CAR format but not for each complete version of the CAR format. Any objections to adding them?
Often we assume that, if it’s a CID, there’s a reasonable max size for the payload. This would obviously violate that assumption, but the codec is a usable signal to adjust behavior and people do need to hash CAR files so this seems like something we should do.
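For illustration, a minimal sketch of what storing a CID for the CAR file itself could look like, instead of a bare multihash: a CIDv1 is the version, a codec code, and the multihash, each prefixed as a varint. The sketch uses only the standard library and assumes sha2-256 (`0x12`) plus the `car` code `0x0202` from the multicodec table as I understand it. `car_cid` is a hypothetical helper, not an existing API, and a per-version code would simply be passed as `codec`.

```python
import base64
import hashlib

CAR_CODEC = 0x0202  # "car" in the multicodec table (verify before relying on it)
SHA2_256 = 0x12     # multihash code for sha2-256


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def car_cid(car_bytes: bytes, codec: int = CAR_CODEC) -> str:
    """Return a base32 CIDv1 string identifying the whole CAR file."""
    digest = hashlib.sha256(car_bytes).digest()
    multihash = encode_varint(SHA2_256) + encode_varint(len(digest)) + digest
    cid = encode_varint(1) + encode_varint(codec) + multihash
    # Multibase base32: prefix "b", lowercase RFC 4648 alphabet, no padding.
    return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")
```

The root CID would still be stored separately. This value identifies the bytes of one particular CAR file, so two partial CARs for the same DAG get different identifiers.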