Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A general question about the terminology of a content-addressable storage #137

Open
CMCDragonkai opened this issue Mar 12, 2019 · 5 comments
Labels

Comments

@CMCDragonkai
Copy link

CMCDragonkai commented Mar 12, 2019

The content store used here and in Nix, is a store where the data is addressed by some hash. But this hash is not calculated by the content of the data, but by some reproducible expression that produces the data.

Is it still correct to call it a content-addressable storage, or is there some better name for this or is there some qualifier that qualifies this content-addressable storage as something different? The wikipedia article for CAS only mentions that the hash is built from the content that is being stored, not some input expression that produces the content.

Some other thoughts that I'm hoping others can help put in a more formal way:

Looking at this in a more general way. There is some notion of equivalence between the expression that produces a value, and the value itself. However the mapping is surjective, as many kinds of input expressions may map to the same value (which is why canonicalisation/normalisation of the input expression is useful). We assume that hashing the expression and looking it up is faster than running the expression and hashing the resulting value and looking it up.

@mpickering
Copy link
Contributor

The hash in funflow is precisely calculated from the output data. The hash in nix is calculated from just the input expression. This is sometimes known as the difference between an "intensional" and "extensional" store (see section 6 of the thesis) and is one of the things that differentiates funflow's store from nix's store.

@CMCDragonkai
Copy link
Author

@mpickering Oh in that case, how come it says this on your README:

Funflow's caching is based around the content store. This stores all artifacts according to the hash of the inputs and computations which produced them. Funflow uses this to know when exactly a computation must be rerun and when previous results can be re-used.

@aherrmann
Copy link
Member

@CMCDragonkai funflow uses two kinds of hashes. It hashes the inputs to a task and links them to the resulting output so that we can cache computations. It also uses content hashes to deduplicate the store.

The next paragraph in the README points to the CAS aspect

The content store also acts as a CAS system. This means that, if multiple inputs produce the same output, that output will be stored only once on disk (and subsequent computations will not be rerun).

@CMCDragonkai
Copy link
Author

I've reread the nix phd thesis on intensional store, I understand what you mean now. @aherrmann Can you point out in the code where is the table/datastructure that links the extensional hash with the intensional hash?

@aherrmann
Copy link
Member

@CMCDragonkai Sure, the links are stored on the file system as symbolic links. These are created here. On lookup we check for such a symbolic link and if it exists resolve it to determine the CAS item here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants