Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ContentHashable for Hash{Map, Set} is not stable #215

Open
guibou opened this issue Sep 1, 2023 · 4 comments
Open

ContentHashable for Hash{Map, Set} is not stable #215

guibou opened this issue Sep 1, 2023 · 4 comments

Comments

@guibou
Copy link
Contributor

guibou commented Sep 1, 2023

Describe the bug

  • use ContentHashable to hash the content of an HashMap (directly or indirectly)
  • update my haskell packages. A new version of hashable is released, which change the hash function. The order of HashMap is hence changed, and my nicely content hashed object suddently change their hash.

See the instance for HashMap in the code, it uses toList, the order is unspecified.

https://github.com/tweag/funflow/blob/master/cas/hashable/src/Data/CAS/ContentHashable.hs#L361-L369

To Reproduce

main = do
   print =<< contentHash $ HashMap.fromList [("hello", 10), ("goodbye", 20)]

Run this code with different version of hashable (You may have to try different keys in the map, or a bigger map, in order to trigger the effect).

Expected behavior

It is impossible to guarantee that the hash won't change, but either:

  • Do not expose an instance for a type when it is known that the hash will change systematically on each release of a library as fundamental as hashable
  • (or) sort the values, at the cost of an Ord, Eq constraints on the keys of the HashMap.
@guibou guibou changed the title ContentHashable for HashMap is not stable ContentHashable for Hash{Map, Set} is not stable Sep 1, 2023
guibou added a commit to guibou/funflow that referenced this issue Sep 1, 2023
Close tweag#215

`unordered-containers` structures, such as `HashMap` and `HashSet` uses
`Hashable` `hash` function to maintain their internal structure. Hence
the order of the items (when converting using `toList`, or `elems`, or
just `traverse` or `fold`) depends on the version of `hashable` used.

This results on a change of hash of theses structure when `hashable`
library is updated.

It is not possible to ensure that hash will never change for any type,
because it relies on other instances, `Ord/Eq` for `containers` `Map`
and `Set`, or even the implementation of `ContentHashable` itself.
However `hashable` is known for its frequent changes when on the other
hand, `Ord/Eq` instances are most of the time generated by `deriving`
and are hence perfectly reproducible.

This commit removes the dependency on the implicit order from `hashable`
at the cost of an introduce `Ord` constraint on the keys of the
`Hash{Map,Set}`. This is a tradeoff:

- The user can still use `Hash{Map,Set}` based structure in their type
  for whatever reason they want (usually performance).
- They will need to provide an `Ord` instance for `ContentHashable`
- If an `Ord` instance already exists, this change will just bring
  stability. If the instance does not exist, user will now be aware of
  the problem.

Note: this invalidate previous hash computed on `Hash{Map, Set}`.
guibou added a commit to guibou/funflow that referenced this issue Sep 1, 2023
Close tweag#215

`unordered-containers` structures, such as `HashMap` and `HashSet` uses
`Hashable` `hash` function to maintain their internal structure. Hence
the order of the items (when converting using `toList`, or `elems`, or
just `traverse` or `fold`) depends on the version of `hashable` used.

This results on a change of hash of theses structure when `hashable`
library is updated.

It is not possible to ensure that hash will never change for any type,
because it relies on other instances, `Ord/Eq` for `containers` `Map`
and `Set`, or even the implementation of `ContentHashable` itself.
However `hashable` is known for its frequent changes when on the other
hand, `Ord/Eq` instances are most of the time generated by `deriving`
and are hence perfectly reproducible.

This commit removes the dependency on the implicit order from `hashable`
at the cost of an introduce `Ord` constraint on the keys of the
`Hash{Map,Set}`. This is a tradeoff:

- The user can still use `Hash{Map,Set}` based structure in their type
  for whatever reason they want (usually performance).
- They will need to provide an `Ord` instance for `ContentHashable`
- If an `Ord` instance already exists, this change will just bring
  stability. If the instance does not exist, user will now be aware of
  the problem.

Note: this invalidate previous hash computed on `Hash{Map, Set}`.
guibou added a commit to guibou/funflow that referenced this issue Sep 1, 2023
Close tweag#215

`unordered-containers` structures, such as `HashMap` and `HashSet` uses
`Hashable` `hash` function to maintain their internal structure. Hence
the order of the items (when converting using `toList`, or `elems`, or
just `traverse` or `fold`) depends on the version of `hashable` used.

This results on a change of hash of theses structure when `hashable`
library is updated.

It is not possible to ensure that hash will never change for any type,
because it relies on other instances, `Ord/Eq` for `containers` `Map`
and `Set`, or even the implementation of `ContentHashable` itself.
However `hashable` is known for its frequent changes when on the other
hand, `Ord/Eq` instances are most of the time generated by `deriving`
and are hence perfectly reproducible.

This commit removes the dependency on the implicit order from `hashable`
at the cost of an introduce `Ord` constraint on the keys of the
`Hash{Map,Set}`. This is a tradeoff:

- The user can still use `Hash{Map,Set}` based structure in their type
  for whatever reason they want (usually performance).
- They will need to provide an `Ord` instance for `ContentHashable`
- If an `Ord` instance already exists, this change will just bring
  stability. If the instance does not exist, user will now be aware of
  the problem.

Note: this invalidate previous hash computed on `Hash{Map, Set}`.
@guibou
Copy link
Contributor Author

guibou commented Sep 1, 2023

The branch https://github.com/guibou/funflow/tree/stable_unordered-containers-hash-cas-hashable contains a commit which fix for this issue: it pre-sort the HashMap before hashing.

It adds an Ord constraint, so the design is discutable. I can open an MR if you agree with the idea.

@guibou
Copy link
Contributor Author

guibou commented Sep 1, 2023

However, removing cas-hashable from our codebase took me less time than fixing this (I'm now using a one line function which compute the sha256 sum of the output of cereal serialization, which is enough for my needs and should be more robust and easier to maintain), so I won't continue my efforts here unless you ask / continue the discussion.

Note that I wrote a custom instance for HashMap / HashSet for Serialize which sort the data first, as described here.

@dorranh
Copy link
Contributor

dorranh commented Sep 5, 2023

Hi @guibou, thanks for reporting this issue - I imagine that was a tricky one to track down 😬. We will take a closer look and get back to you regarding possible fix you linked. I would also be interested in hearing your use case for cas-hashable and funflow if you don't mind sharing.

@guibou
Copy link
Contributor Author

guibou commented Sep 5, 2023

Hi @dorranh.

We are solving virtual population of patient and the effect of treatments on them. So basically, we have a list of tasks (i.e. simulating a patient), some tasks depends on other task (i.e. comparing treatment effect versus control arm need to wait for the result of both task on the same patient. Or generating complete trial statistics needs to wait on all tasks) and we want to run theses tasks on a cluster of machines on AWS and we want to have an efficient and robust cache.

Caching is important for us because imagine the same simulation, we can run it with 1 million patient on two protocol arms (i.e. placebo versus treatment), that's 2 millions patients.. At the end of this run, we realize that we want to test another treatment (for example, double dose). With caching, we can avoid recomputing 2 millions patients (placebo / first treatment) and only recompute the new treatment.

We use content hash of the patient information / model / treatment.

Initially all of that was done with funflow and hence cas-hashable. It was completely rewritten with porcupine (https://github.com/YPares/porcupine). cas-hashable was still used to compute the hash of our resources until recently.

edit: that's why robust hash is important for us (see also #202) because when hash change, caching is invalidated and it may lead to additional computation cost, storage cost and user frustration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants