Sync issues with missing Yjs steps/structs #4600

mejo- · 2023-07-28T21:39:58Z

Describe the bug
In a test session with four participants we were able to create a situation were changes of two participants were no longer applied to the document of the two others.

Some first insights after analysing the HAR file of three participants

Synchronisation of client (a) ended at exactly the same Yjs client clock count for two other clients (c) and (d). Synchronisation of client (a) kept working on client (b). We thus assume that changes of client (a) were propagated to the server.
Both clients (c) and (d) received Yjs steps with a higher clock count by client (a). They were not applied to the document state due to two missing steps.
Apparently client (a) reconnected and changed its Yjs client ID just in the moment when synchronisation stopped working on clients (c) and (d).
Client (a) sent a Yjs query request (starting with AAA...) that's filtered out by the server in the moment the synchronisation stopped for clients (c) and (d).

The text was updated successfully, but these errors were encountered:

max-nextcloud · 2023-10-24T13:49:44Z

I think we are missing a mechanism for resyncing clients once they get out of sync. yjs-websocket provides this in https://github.com/yjs/y-websocket/blob/master/src/y-websocket.js#L300-L311 - but that only works with a server that understands the y-js protocol and can answer the sync type 1 messages from the client.

All a shared resync with multiple clients would take would be:

One client (A) sends its full current yjs state incl. history and a matching sync type 1 message.
These are distributed to all clients.
All clients reply to the sync type 1 if they have additional steps
These replies are again distributed to all clients.

1. and 2. ensure all other clients are at least in sync with (A).
3. and 4. ensures all steps the other clients have in addition are also synced.

At the same time this sync scales fairly well:

grows linear with the compressed document size (some 10k)
grows linear with the number of clients (n) and the compressed document size ( some 10k per client)
will be small as it's only the diff to 1. - so at most the compressed steps since the last resync
grows with n^2 - times the small diff.

For this to scale nicely it's important that only one client performs 1. Otherwise everything will be multiplied by n.

Since we already have an autosave request that is send by one client every 30 seconds (?) I propose the following:
a) in addition to the autosave content the saving client sends the content of 1.
b) the server adds these two messages to the distributed steps even though one of them is a sync type 1 which we normally drop.

This should be enough to provide the resync mechanism. We could rely on it and simplify other parts of the sync:

c) When autosaving we don't need to upload the doc state anymore as that's encoded in the first message in 1.
d) When connecting we don't need to distribute the doc state anymore.
e) During new connections / when receiving a sync type 1 message from a client the server can (re)send all messages since the last autosave including the last resync.

mejo- · 2023-10-25T12:33:09Z

@max-nextcloud: how would 1. be triggered? In other words: when exactly would client (A) send the full current yjs state? When it thinks it's out of sync? Or when it's requested by some other party to do so?

As far as I understand, you would make the client do 1. along with the autosave every 30 seconds. How would you ensure that only one client does so? And how do we decide which client is responsible to do it? What if this client becomes unresponsive?

Apart from these questions, your approach sounds sensible to me.

mejo- · 2023-10-25T12:51:31Z

This morning, @juliushaertl and me went through the code and identified two more potential causes for problems:

There's a possibility for a race condition in DocumentService->addStep(). First we get $stepsVersion via StepMapper->getLatestVersion(), then we do some more stuff, then we write back the steps via StepMapper->insert(), calculating $newVersion from the formerly retrieved $stepsVersion. If another client writes steps in the same moment, the new calculated steps versions might conflict:

text/lib/Service/DocumentService.php

Lines 265 to 274 in aad168b

    
           $stepsVersion = $this->stepMapper->getLatestVersion($document->getId()); 
        
           $newVersion = $stepsVersion + count($steps); 
        
           $this->logger->debug("Adding steps to " . $document->getId() . ": bumping version from $stepsVersion to $newVersion"); 
        
           $this->cache->set('document-version-' . $document->getId(), $newVersion); 
        
           $step = new Step(); 
        
           $step->setData($stepsJson); 
        
           $step->setSessionId($session->getId()); 
        
           $step->setDocumentId($document->getId()); 
        
           $step->setVersion($newVersion); 
        
           $this->stepMapper->insert($step);

We should change this to be atomic, by calculating the new version within the same query that writes them. E.g. something like:

INSERT INTO text_steps (version, [...]) SUM(SELECT MAX(version) FROM text_steps + $newStepCount), ...

In DocumentService->addStep() we fetch all steps if there's queries in pushes of type YJS_SYNC_MESSAGE_STEP1:

text/lib/Service/DocumentService.php

Lines 232 to 234 in aad168b

    
           // If there were any queries in the steps send the entire history 
        
           $getStepsSinceVersion = count($querySteps) > 0 ? 0 : $version; 
        
           $allSteps = $this->getSteps($documentId, $getStepsSinceVersion);

But this only fetches the first 1000 steps due to setMaxResults() in StepMapper->find():

text/lib/Db/StepMapper.php

Line 50 in aad168b

->setMaxResults(1000)

This should be changed to return all steps.

We're not sure whether any of these two causes the problems people experience, but they should be fixed nevertheless. @max-nextcloud and @juliushaertl if you agree, I could prepare patches for these two problems.

juliusknorr · 2023-10-25T12:58:57Z

Regarding the y.js based resyncing mechanism, we were wondering earlier when this would actually be triggered. So far I think that y.js would only send the SyncStep1 on:

initial connect
reconnect of the websocket
the resync interval is disabled by default https://github.com/yjs/y-websocket/blob/master/src/y-websocket.js#L300-L311

Now my main question to consider implementing the above resync handling would be if we can find a resoning on why the existing methods to get steps to the clients could miss steps:

The text app sync polling should always catch up with the next steps
The reconnect of y.js would load 1000 steps (as pointed above), but still the sync should continue where the last step version was left before
One further idea: Since we return steps on push, maybe there is an issue with aggregating the own added steps which could leave out inbetween ones, that would also explain why a reload of the page brings back proper sync state which means that the db state seems consistent enough to continue when fully re-initializing the client states

max-nextcloud · 2023-10-25T18:19:03Z

In DocumentService->addStep() we fetch all steps if there's queries in pushes of type YJS_SYNC_MESSAGE_STEP1:

If i remember correctly we never used them on the cient side. So while i agree that we should fetch all steps for now this should not matter. Plus the whole approach of resending the entire history seems inefficient.

max-nextcloud · 2023-10-25T18:20:53Z

I could prepare patches for these two problems.

A patch for 1. seems like a good idea to eliminate race conditions. With 2. as i said i'd check if this is actually being used before investing time in something that I hope will be replaced anyway.

juliusknorr · 2023-10-25T19:21:52Z

I could prepare patches for these two problems.

A patch for 1. seems like a good idea to eliminate race conditions. With 2. as i said i'd check if this is actually being used before investing time in something that I hope will be replaced anyway.

Maybe we can have a quick logging patch to at least see an indication if someone runs into the issue

max-nextcloud · 2023-10-25T21:01:08Z

As far as I understand, you would make the client do 1. along with the autosave every 30 seconds. How would you ensure that only one client does so? And how do we decide which client is responsible to do it? What if this client becomes unresponsive?

I had another look how autosave actually works. The autosave function inside SyncService is debounced so it's executed at most every 30 seconds. It's triggered from Editor.vue whenever there are local changes (inside on onStateChange).

My understanding now is that every client will trigger the autosave and the server will just ignore them for a while to avoid saving all of the time. But I have not looked into the server side code here yet. This approach seems doable for the resync as well. All clients trigger a resync at most x seconds and the server only distributes one of them every x seconds. There may be better times to trigger a resync than the onStateChange handler though. Ideally it would happen if the client notices y.js updates it cannot apply due to missing steps - however that would require some upstream changes as far as I understand ( yjs/yjs#550 ). Another option might be to trigger the resync whenever a client reconnects. That would cover the laptop reopening pretty well.

max-nextcloud · 2023-10-27T10:19:21Z

Attempts to reproduce this - ideally in a reliable way

Failed

### Oct 27th - two clients against local dev server

I tried to reproduce this problem on a local instance. So far i failed - but also tried with the server and both clients running on the same machine:

turned the network off on one client - everything recovered fine.
stopped one client with a js breakpoint - everything recovered fine.
closed the laptop lid - send it to sleep - everything recovered fine.

The laptop sleep obviously affected both clients and the server. Will try with different machines next.

### Oct 28th - three clients against local dev and remote server, 12h sleep

Tried some more things to reproduce this. This time with two authenticated users and one guest against the local dev server and cloud.nextcloud.com. Still failing. Either I just did not run into the problem or disconnecting and reconnecting network is not enough and neither is sending the laptop to sleep.

I had the laptop sleep over night but the editing session was still working and in sync afterwards.

Other bugs found / reproduced

Not possible to set a guest name in public share #4252
reconnect button will take you to a loading screen that does not go away when network comes back.

Prevent a possible race condition when two clients add steps at the same time. See #4600. Rely on the autoincrementing id in order to provide a canonical order that steps can be retrieved in. When two clients push steps at the same time the entries receive destinct ids that increment. So if another client fetches steps in between it will see the smaller id as the version of the fetched step and fetch the other step later on. Transition: In the future we can drop the version column entirely but currently there are still steps stored in the database that make use of the old column. So we need to transition away from that. In order to find entries that are newer than version x we select those that have both a version and an id larger than x. Entries of the new format are newer than any entry of the old format. So we set their version to the largest possible value. This way they will always fulfill the version condition and the condition on the id is more strict and therefore effective. For the old format the version will be smaller than the id as it's incremented per document while the id is unique accross documents. Therefore the version condition is the more strict one and effective. The only scenario where the version might be larger than the id would be if there's very few documents in the database and they have had a lot of steps stored in single database entries. Signed-off-by: Max <max@nextcloud.com>

Prevent a possible race condition when two clients add steps at the same time. See #4600. Rely on the autoincrementing id in order to provide a canonical order that steps can be retrieved in. When two clients push steps at the same time the entries receive destinct ids that increment. So if another client fetches steps in between it will see the smaller id as the version of the fetched step and fetch the other step later on. Transition: In the future we can drop the version column entirely but currently there are still steps stored in the database that make use of the old column. So we need to transition away from that. In order to find entries that are newer than version x we select those that have both a version and an id larger than x. Entries of the new format are newer than any entry of the old format. So we set their version to the largest possible value. This way they will always fulfill the version condition and the condition on the id is more strict and therefore effective. For the old format the version will be smaller than the id as it's incremented per document while the id is unique accross documents. Therefore the version condition is the more strict one and effective. The only scenario where the version might be larger than the id would be if there's very few documents in the database and they have had a lot of steps stored in single database entries. Signed-off-by: Max <max@nextcloud.com> Signed-off-by: Jonas <jonas@freesources.org>

MrRinkana · 2023-11-09T22:32:13Z

I can confirm that there seems to be a way for people connecting to get out of sync. I had one document being edited by 5 people and a 6th only got a old version of the document, even if he refreshed the page (text 3.8.0).

Unfortunately I have not found a way to reproduce the issue consistently.

max-nextcloud · 2023-12-21T07:25:12Z

We have a fix for the one scenario that we could reproduce reliably. It will ship with the next releases - that is 28.0.2, 27.1.6 and 26.0.11.

I'm curious to see if it also fixes the harder to reproduce cases. I'll keep this issue open until we got some feedback.

MrRinkana · 2024-02-23T12:08:14Z

I tested this a little bit now by creating a .md file and sharing it publicly. I then opened 8 different connections to the file trough the use of Firefox containers (same computer, different sets of cookes etc) and one from my phone (different ip, latencies) of which one was logged in (and owned the file).

I'm on 28.0.2

In summary it seems to work much better!

I did however manage to get the file out of sync, namely I managed to create a table (managed to make it happen twice) that did not appear on all clients, while other changes (after the creation of the table) did. Reloading fixes the sync.

First time only my phone didn't see the table, nor changes to text in it, but could both edit and se other edits. On the clients that saw the table, the edits from the phone appeared normally above or under the table, so the file managed it well! Reloading introduced the table.

Second time I made a new table on one tab (client), added text into one square from another tab, waited a bit then opened the file again from one tab that had closed the file (file owner). The file owner did not see the tab. Three more tabs did not see the table, but the two that edited it and one more did. Reloading the page fixed the issue, and it seemed like the other tabs fixed themselves without needing to reload after that. See picture:

I did not manage to find a consistent trigger. I just played around in the file for about 30 mins, triggering it twice. I suspect a test with multiple users with different latencies/ips would be a good idea, as in my case only the phone had different ip and latency. I unfortunately will not be able to test it in such a way for a while.

No errors where logged (with log level 2) in the Nextcloud log.

mejo- · 2024-03-13T20:48:13Z

Thanks everyone for your input here and @MrRinkana for trying to further hunt this down. Given the feedback of users who suffered from this issue we expect this to be fixed with most recent Nextcloud 27 and 28 releases.

@MrRinkana could you please try whether you're able to create an out of sync state by adding tables and open a new issue about it if so?

mejo- added bug Something isn't working 0. Needs triage feature: sync high labels Jul 28, 2023

mejo- assigned mejo- and max-nextcloud Jul 28, 2023

mejo- added this to 📝 Office team Jul 28, 2023

github-project-automation bot moved this to 🧭 Planning evaluation (don't pick) in 📝 Office team Jul 28, 2023

mejo- moved this from 🧭 Planning evaluation (don't pick) to 📄 To do (~10 entries) in 📝 Office team Jul 28, 2023

juliusknorr moved this from 📄 To do (~10 entries) to 🏗️ In progress in 📝 Office team Oct 30, 2023

max-nextcloud mentioned this issue Oct 30, 2023

fix(sync): prevent race condition by relying on autoincrement #4938

Merged

4 tasks

mejo- closed this as completed Mar 13, 2024

github-project-automation bot moved this from 🏗️ In progress to ☑️ Done in 📝 Office team Mar 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync issues with missing Yjs steps/structs #4600

Sync issues with missing Yjs steps/structs #4600

mejo- commented Jul 28, 2023

max-nextcloud commented Oct 24, 2023

mejo- commented Oct 25, 2023

mejo- commented Oct 25, 2023 •

edited

Loading

juliusknorr commented Oct 25, 2023

max-nextcloud commented Oct 25, 2023

max-nextcloud commented Oct 25, 2023

juliusknorr commented Oct 25, 2023

max-nextcloud commented Oct 25, 2023

max-nextcloud commented Oct 27, 2023 •

edited

Loading

MrRinkana commented Nov 9, 2023

max-nextcloud commented Dec 21, 2023 •

edited by juliusknorr

Loading

MrRinkana commented Feb 23, 2024 •

edited

Loading

mejo- commented Mar 13, 2024

Sync issues with missing Yjs steps/structs #4600

Sync issues with missing Yjs steps/structs #4600

Comments

mejo- commented Jul 28, 2023

max-nextcloud commented Oct 24, 2023

mejo- commented Oct 25, 2023

mejo- commented Oct 25, 2023 • edited Loading

juliusknorr commented Oct 25, 2023

max-nextcloud commented Oct 25, 2023

max-nextcloud commented Oct 25, 2023

juliusknorr commented Oct 25, 2023

max-nextcloud commented Oct 25, 2023

max-nextcloud commented Oct 27, 2023 • edited Loading

Attempts to reproduce this - ideally in a reliable way

Failed

Other bugs found / reproduced

MrRinkana commented Nov 9, 2023

max-nextcloud commented Dec 21, 2023 • edited by juliusknorr Loading

MrRinkana commented Feb 23, 2024 • edited Loading

mejo- commented Mar 13, 2024

mejo- commented Oct 25, 2023 •

edited

Loading

max-nextcloud commented Oct 27, 2023 •

edited

Loading

max-nextcloud commented Dec 21, 2023 •

edited by juliusknorr

Loading

MrRinkana commented Feb 23, 2024 •

edited

Loading