
Unnecessary MongoDB retries when observer writes anchor CIDs during recovery flow #1086

Open
DRK3 opened this issue Feb 2, 2022 · 0 comments

DRK3 commented Feb 2, 2022

When an Observer writes anchor CID data to the store (this line of code), we explicitly identify which suffix anchors are brand new and should not already exist in the store. We do this because, when using MongoDB, the bulk storage operation can be made much faster if we "tell it" in advance that the data is new, so that the underlying storage code does an insert rather than an upsert.
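
A minimal sketch of why that hint matters, using the MongoDB Go driver directly (the function and parameter names are illustrative, not Orb's actual code): documents known to be new can go into the bulk write as plain inserts, while anything that might already exist has to go through an upsert.

```go
package example

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// putBulk writes one document per entry in docs. isNew is the caller's promise that
// none of the documents exist yet (roughly what the IsNewKey flag mentioned in the
// error message below expresses).
func putBulk(ctx context.Context, coll *mongo.Collection, docs []bson.M, isNew bool) error {
	models := make([]mongo.WriteModel, 0, len(docs))

	for _, doc := range docs {
		if isNew {
			// Fast path: a plain insert, no lookup for an existing document.
			models = append(models, mongo.NewInsertOneModel().SetDocument(doc))
		} else {
			// Slow path: an upsert, which first has to check whether _id already exists.
			models = append(models, mongo.NewReplaceOneModel().
				SetFilter(bson.M{"_id": doc["_id"]}).
				SetReplacement(doc).
				SetUpsert(true))
		}
	}

	_, err := coll.BulkWrite(ctx, models, options.BulkWrite().SetOrdered(false))

	return err
}
```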

If an Orb server is shut down in the middle of writing anchor CID data to the store, then after restarting it may attempt that operation again. In this case it's possible that some of those anchor CIDs have already been written, but the Orb server will still mark them as "new" when telling the MongoDB storage code to execute the bulk write. This causes a failure. If Orb detects this failure (as reported by the underlying MongoDB storage code), it retries the batch operation with every operation labelled as an upsert, which is guaranteed to succeed but is much slower. This works as expected, but in the logs you'll see this warning message several times:

 [didanchor-store] 2022/02/02 12:55:47 UTC - didanchor.(*Store).PutBulk -> WARNING Failed to add cid[hl:uEiCojP47SeohcFG03jCPcfddX7Cy3cLyM9ID3MGBup4OxQ:uoQ-BeFxodHRwczovL29yYi0zLnZtZS1vcmItcGVyZi50cnVzdGJsb2MuZGV2L2Nhcy91RWlDb2pQNDdTZW9oY0ZHMDNqQ1BjZmRkWDdDeTNjTHlNOUlEM01HQnVwNE94UQ] to suffixes[EiDk_C06jNl_FSkBVOs-Tz4O2lVPbs8k9xfVtIYjGtMWVQ EiDre_Rovj0KnCWaiJvEFtWPuTSalKQpWuKZxVvpExVmBA EiDKiw2716jgK2moKNb-hSqQ_wVRbspbvKE3JQUTCcNMTQ ...] using the batch speed optimization. This can happen if this Orb server is in a recovery flow. Will retry without the optimization now (will be slower). Underlying error message: failed to perform batch operations after 4 attempts: duplicate key. Either the IsNewKey optimization flag has been set to true for a key that already exists in the database, or, if using MongoDB 4.0.0, then this may be a transient error due to another call storing data under the same key at the same time. Underlying error message: bulk write exception: write errors: [E11000 duplicate key error collection: c index: _id_]
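
The retry-without-the-optimization behaviour described above boils down to a pattern like the following (again an illustrative sketch reusing the hypothetical putBulk helper from the previous snippet, not the actual didanchor store code):

```go
package example

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// putBulkWithFallback tries the insert-only fast path first and falls back to upserts
// only when the failure is a duplicate key error.
func putBulkWithFallback(ctx context.Context, coll *mongo.Collection, docs []bson.M) error {
	// Fast path: assume every entry is new.
	err := putBulk(ctx, coll, docs, true)
	if err == nil {
		return nil
	}

	if !mongo.IsDuplicateKeyError(err) {
		return err
	}

	// Recovery flow: some entries were already written before the restart, so the
	// "new data" promise was wrong. Redo the whole batch as upserts, which succeeds
	// in this case but is slower.
	return putBulk(ctx, coll, docs, false)
}
```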

Note the "after 4 attempts" part. We should instead fall back to the "slow" route immediately after a single failure. The problem is that the underlying storage code can't tell whether this error is due to data being marked as "new" when it actually isn't, or due to a different transient error (in which case retrying is the right thing to do). This "transient error" only seems to happen with MongoDB 4.0.0 (or DocumentDB 4.0.0), but the underlying storage code also doesn't know which version of MongoDB it's talking to.
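
If knowing the server version would help, the storage code could query it itself. Here is one possible way, via the buildInfo command (an assumption about a potential fix, not something aries-framework-go-ext does today; DocumentDB reports a MongoDB-compatible version string here):

```go
package example

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// serverVersion asks the server for its version string (e.g. "4.0.0" or "4.4.10")
// using the buildInfo command.
func serverVersion(ctx context.Context, db *mongo.Database) (string, error) {
	var info struct {
		Version string `bson:"version"`
	}

	if err := db.RunCommand(ctx, bson.D{{Key: "buildInfo", Value: 1}}).Decode(&info); err != nil {
		return "", err
	}

	return info.Version, nil
}
```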

If we can guarantee that the transient error scenario won't happen, then we could fail immediately without doing any retries. There are a few possible solutions, but we'll have to think carefully about how to do it.

Some ideas:
- Pass a flag to the underlying storage to limit retries.
- Somehow detect the MongoDB version so that the code can rule out the "transient error" scenario if it doesn't apply (e.g. via buildInfo, as sketched above).
- Detect whether a batch operation is doing only inserts (new data), in which case we know for sure the transient error doesn't apply (see the sketch after the note below).

(Note that more than one of the above solutions may apply. They all involve changing the underlying storage code in Aries-Framework-Go-Ext.)
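
A sketch of the third idea, purely to illustrate the shape of the change (the real implementation would live in the aries-framework-go-ext MongoDB provider, whose internals may differ):

```go
package example

import (
	"context"

	"go.mongodb.org/mongo-driver/mongo"
)

// bulkWriteWithRetry retries a failed bulk write, except when the batch contains only
// inserts and the failure is a duplicate key error: that combination can't be the
// MongoDB 4.0.0 transient upsert race, so retrying is pointless.
func bulkWriteWithRetry(ctx context.Context, coll *mongo.Collection,
	models []mongo.WriteModel, maxAttempts int) error {
	insertsOnly := true

	for _, m := range models {
		if _, ok := m.(*mongo.InsertOneModel); !ok {
			insertsOnly = false
			break
		}
	}

	var err error

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if _, err = coll.BulkWrite(ctx, models); err == nil {
			return nil
		}

		// An insert-only batch that hits a duplicate key means the caller's "new data"
		// promise was wrong. Fail straight away so the caller can fall back to upserts.
		if insertsOnly && mongo.IsDuplicateKeyError(err) {
			return err
		}
	}

	return err
}
```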
