CRDT for counting dislikes #707

k98kurz · 2022-07-30T19:32:22Z

k98kurz
Jul 30, 2022

Has anyone considered using a CRDT for counting dislikes? In particular, the ORSet and PNCounter might be useful here.

CRDT Overview

A Conflict-free Replicated Data Type (also called Convergent and Commutative Replicated Data Type) is a collaborative data structure that guarantees a form of correctness called "strong eventual consistency" which essentially guarantees eventual consistency under all network behaviors and structures; i.e. it guarantees availability and eventual consistency in the presence of temporary partitions, a pragmatic solution for the CAP theorem problem. Importantly, unlike systems such as git, merging is always automatic and guarantees no conflicts; however, like git, the result is not always semantically correct, but semantics are not part of the problems that CRDTs can solve for this extension.

A CRDT is formed by finding data types that have an operation that is associative, commutative, and idempotent and applying only those operation for synchronizing updates; only one operation can be applied to an underlying data type, but multiple data types and operations can be used to generate a CRDT object.

Associative: (a ○ b) ○ c = a ○ (b ○ c); or f(f(a, b), c) = f(a, f(b, c))
Commutative: a ○ b = b ○ a; or f(a, b) = f(b, a)
Idempotent: (a ○ b) ○ b = a ○ b; or f(f(a, b), b) = f(a, b)

An example of this is the (unordered) set union operator.
Associative: ({a} U {b}) U {c} = {a} U ({b} U {c}) = {a, b, c}
Commutative: {a} U {b} = {b} U {a} = {a, b}
Idempotent: ({a} U {b}) U {b} = {a} U {b} = {a, b}

Another example of this is the int max operator:
Associative: max(max(a, b), c) = max(a, max(b, c))
Commutative: max(a, b) = max(b, a)
Idempotent: max(max(a, b), b) = max(a, b)

A useful feature for advanced uses is composability: given any number of proven-correct CRDTs, composing them together yields another CRDT. ("Proven correctness" in this case refers to a mathematical proof. See A comprehensive study of Convergent and Commutative Replicated Data Types for a comprehensive exploration of this proven correctness.) The CRDTs I propose are simple compositions: the ORSet, a composition of two sets, and the PNCounter, a composition of two integers.

ORSet

The Observed-Remove Set is a simple CRDT for creating a Set with dynamic membership. It is composed of two sets each using the union operator, with the final Set being the Observed set minus the Remove set. Lamport timestamp metadata is used to ensure that an individual item can be added, removed, re-added, re-removed, etc: given an item that is a member of both sets, the version with the higher Lamport timestamp takes precedence; if the timestamps are identical (i.e. concurrent updates), we apply a tie breaker rule (either add-biased or remove-biased); when an item is superseded, its old version is simply dropped from the data structure, and any later applications of that old item will also be dropped. Any number of collaborators can issue updates and guarantee that all participants will eventually have the same output set. (Lamport timestamp requires that each node synchronizes to the observed Lamport timestamp of other nodes, i.e. local_time = max(local_time, received_time) + 1; each write or read event causes the local timestamp to increment to preserve causality/ordering.)

This could be used to track the extension users who have disliked or undisliked a video, but the data storage requirements would grow very quickly, so I do not believe this should be used for this particular issue. Also, counting the number of dislikes would require counting the size of the final Set, and updating would require iterating over both underlying sets, so it would have a large computational cost. However, this could be used on a per-user basis to store the dislike/un-dislike state for each video using that video's ID, and all synchronization/merge logic could be done client side with the servers providing backups/message passing if user account synchronization is a desired feature. (IndexedDB could be used to maintain both underlying sets client-side.) Synchronization logic could also be implemented server-side without much difficulty due to the simplicity of this CRDT design. Effectively, the server-side data structure would be a map between user_id and ORSet, where only the user with the given user_id could update their ORSet, and all updates would be synchronized across their extension installations.

PNCounter

The Positive-Negative Counter is used by some large scale companies (e.g. SoundCloud) for solving the problem of counting likes with high availability and without the high latency of using a central global coordinator to guarantee consensus ("the speed of light is too slow"). This CRDT is composed of two counters using the int max operator: a positive counter and a negative counter. The positive counter increases every time the like button is pressed, and the negative counter is increased every time the unlike button is pressed; the final value is the positive counter minus the negative counter. Technically, this is implemented with vectors of ints, where each replica node has a unique index in each vector for issuing updates, then the sum of each vector becomes the int (either positive or negative) used to calculate the final Int value. Instead of flat arrays with int indices, this would in practice use maps/registers where the replica node ID is the key and the counter int is the value.

The updates would be tracked on a per-API server basis rather than on a per-user basis, allowing the API server infrastructure to scale (i.e. scale the number of replicas) without growing the vector sizes linearly with the extension user base. With a serverless infrastructure, some server would need to be active as a coordinator, and some form of instance identifier would be necessary: either instances would need to have reusable IDs allocated to ensure no concurrent use of an ID, which could cause a bottleneck, or periodic pruning would be necessary by the coordinator to avoid unbounded growth in the int vectors. Periodic pruning could be accomplished by a task that has its own replica node id within the CRDT, deletes old replica node ID:counter pairs, and issues updates for its own counters summing these with its prior state: PNCounter[pruned][p] += sum(pruned_p).

I believe the PNCounter can be used for tracking dislikes: each time a video is disliked, the positive counter increments; each time a video is un-disliked, the negative counter increments; the result is an eventually consistent and accurate count of the number of actual extension dislikes, regardless of the number of replicas used for scaling. For this to work, each client would need to know its state of whether it has previously disliked or un-disliked an individual video. This state could be implemented client-side or server-side, with obvious scalability implications for each. I propose should using the ORSet to track this state for each user to enable account synchronization.

Optimization

CRDTs have high metadata storage requirements in their unoptimized forms. However, optimization is possible. See Optimizing state-based CRDTs (part 1) and CRDT optimizations for some discussions on this topic.

k98kurz · 2022-07-31T20:09:50Z

k98kurz
Jul 31, 2022
Author

@Anarios @sy-b what is the current system in place for counting extension user dislikes? I do not believe the server source code is public, so I cannot determine this myself. I'm willing to help implement this if the decision is made to use this technique.

2 replies

Anarios Jul 31, 2022
Maintainer

@k98kurz

I like the idea, but since the system is stable and doesn't experience any issues - I wouldn't rewrite it right now.

Current implementation is a simple postgres DB.

k98kurz Jul 31, 2022
Author

Sounds reasonable. Just keep this in mind if you experience any scaling, availability, or consistency issues in the future. There are plenty of nodejs packages that already implement CRDT logic, and these particular CRDTs would be easy to port into other languages from the code I've seen. (I think there are some in rust as well, but I don't use rust, so I can't analyze their quality.) If you ever want a proof-of-concept in python, php, or js, just let me know.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CRDT for counting dislikes #707

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{title}}

Select a reply

CRDT for counting dislikes #707

k98kurz Jul 30, 2022

CRDT Overview

ORSet

PNCounter

Optimization

Replies: 1 comment · 2 replies

k98kurz Jul 31, 2022 Author

Anarios Jul 31, 2022 Maintainer

k98kurz Jul 31, 2022 Author

k98kurz
Jul 30, 2022

Replies: 1 comment 2 replies

k98kurz
Jul 31, 2022
Author

Anarios Jul 31, 2022
Maintainer

k98kurz Jul 31, 2022
Author