A general question about the terminology of a content-addressable storage #137

CMCDragonkai · 2019-03-12T04:01:44Z

The content store used here and in Nix, is a store where the data is addressed by some hash. But this hash is not calculated by the content of the data, but by some reproducible expression that produces the data.

Is it still correct to call it a content-addressable storage, or is there some better name for this or is there some qualifier that qualifies this content-addressable storage as something different? The wikipedia article for CAS only mentions that the hash is built from the content that is being stored, not some input expression that produces the content.

Some other thoughts that I'm hoping others can help put in a more formal way:

Looking at this in a more general way. There is some notion of equivalence between the expression that produces a value, and the value itself. However the mapping is surjective, as many kinds of input expressions may map to the same value (which is why canonicalisation/normalisation of the input expression is useful). We assume that hashing the expression and looking it up is faster than running the expression and hashing the resulting value and looking it up.

mpickering · 2019-03-12T08:19:08Z

The hash in funflow is precisely calculated from the output data. The hash in nix is calculated from just the input expression. This is sometimes known as the difference between an "intensional" and "extensional" store (see section 6 of the thesis) and is one of the things that differentiates funflow's store from nix's store.

CMCDragonkai · 2019-03-12T09:54:59Z

@mpickering Oh in that case, how come it says this on your README:

Funflow's caching is based around the content store. This stores all artifacts according to the hash of the inputs and computations which produced them. Funflow uses this to know when exactly a computation must be rerun and when previous results can be re-used.

aherrmann · 2019-03-12T10:01:01Z

@CMCDragonkai funflow uses two kinds of hashes. It hashes the inputs to a task and links them to the resulting output so that we can cache computations. It also uses content hashes to deduplicate the store.

The next paragraph in the README points to the CAS aspect

The content store also acts as a CAS system. This means that, if multiple inputs produce the same output, that output will be stored only once on disk (and subsequent computations will not be rerun).

CMCDragonkai · 2019-03-18T04:00:33Z

I've reread the nix phd thesis on intensional store, I understand what you mean now. @aherrmann Can you point out in the code where is the table/datastructure that links the extensional hash with the intensional hash?

aherrmann · 2019-03-18T09:01:23Z

@CMCDragonkai Sure, the links are stored on the file system as symbolic links. These are created here. On lookup we check for such a symbolic link and if it exists resolve it to determine the CAS item here.

CMCDragonkai mentioned this issue Mar 12, 2019

Hashing the expression that produces the value dgilland/hashfs#13

Closed

dorranh added the question label Jul 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A general question about the terminology of a content-addressable storage #137

A general question about the terminology of a content-addressable storage #137

CMCDragonkai commented Mar 12, 2019 •

edited

Loading

mpickering commented Mar 12, 2019

CMCDragonkai commented Mar 12, 2019

aherrmann commented Mar 12, 2019

CMCDragonkai commented Mar 18, 2019

aherrmann commented Mar 18, 2019

A general question about the terminology of a content-addressable storage #137

A general question about the terminology of a content-addressable storage #137

Comments

CMCDragonkai commented Mar 12, 2019 • edited Loading

mpickering commented Mar 12, 2019

CMCDragonkai commented Mar 12, 2019

aherrmann commented Mar 12, 2019

CMCDragonkai commented Mar 18, 2019

aherrmann commented Mar 18, 2019

CMCDragonkai commented Mar 12, 2019 •

edited

Loading