
Fluid runtime summary error caused by large change #22395

Open
nmsimons opened this issue Sep 5, 2024 · 2 comments
Labels: bug Something isn't working

nmsimons commented Sep 5, 2024

Describe the bug

If we try sending a large set of data (an array of one million nodes) via SharedTree (insert/remove op), the socket communication breaks and the web client just hangs.
The other client connected to the session does not receive the data.
After some time interval, we start getting Summarize_cancel and Summarize_failed errors.

To Reproduce

Steps to reproduce the behavior:

Using any Fluid app, insert a very large string, or create an array with around 1 million nodes. nmsimons repro'd with the array using the FluidExamples code.

Expected behavior

No error.

@nmsimons nmsimons added the bug Something isn't working label Sep 5, 2024
@kian-thompson kian-thompson self-assigned this Sep 5, 2024
kian-thompson (Contributor) commented:

This issue is due to AFR having a limit of 28 MB on summary uploads: Azure Fluid Relay limits - Azure Fluid Relay | Microsoft Learn

I would recommend doing summary compression at the driver layer to resolve this issue. To do this, simply wrap the document service factory using the applyStorageCompression(...) function from the @fluidframework/driver-utils package.
https://github.com/microsoft/FluidFramework/blob/main/packages/loader/driver-utils/src/adapters/predefinedAdapters.ts#L23
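For reference, the suggested wrapping would look roughly like the sketch below. This is a non-runnable illustration only: applyStorageCompression is currently an internal API (see the next comment), and the exact export path, config shape, and surrounding service-client setup are assumptions based on the linked source, not a verified recipe.

    // SKETCH ONLY — applyStorageCompression is internal; names and
    // signatures here are assumptions based on predefinedAdapters.ts.
    import { applyStorageCompression } from "@fluidframework/driver-utils";

    // Wherever the app constructs its IDocumentServiceFactory today:
    const baseFactory = createDocumentServiceFactory(/* service-specific args */);

    // Wrap it so summaries are compressed before upload, which can keep
    // them under the relay's size limit for compressible data.
    const compressedFactory = applyStorageCompression(baseFactory);

    // Then pass compressedFactory wherever baseFactory was used before.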

CraigMacomber (Contributor) commented:

Issues which I think need to be addressed related to this:

  1. The above-mentioned limits are for AFR. The two repros I'm aware of for this were with different services (a custom service implementation, and azure-local-service). Details on whether and how these limits apply to other services, and how to configure them, would be useful. Ideally that page should link to a more general document about how Fluid size limits work and can be configured.
  2. The linked method applyStorageCompression is an internal API. This issue is impacting customers, so if it needs to be used, we either have to make changes on our end to use it, or publish it.
  3. Compression can effectively raise the limits in some cases, but it does not actually allow documents to scale to the sizes some customers require. It is a useful mitigation to know about, but it is not a solution to the general problem of large documents breaking: we should clarify our status and plans for large-document support somewhere.
  4. The linked document Azure Fluid Relay limits - Azure Fluid Relay | Microsoft Learn has me concerned about data loss. If incremental summaries are working, it seems like you can make a document large enough that, if the user reloads the page, they will be unable to access the document content. Basically, if you manage to get 100 MB of data into a document, you win the prize of 100 MB of data loss from the user's perspective. This seems bad, and we need some way to prevent it from occurring. Ideally, large documents would degrade in performance, not become inaccessible. If this is not feasible, we should have a way to surface errors that prevent the document from becoming so large that it is inaccessible, or at least provide an easy way to roll back to a version that is accessible. Any time we have public-facing documentation describing something like "If the size of the document grows above 95 MB, subsequent client load or join requests will fail", we really should pair it with a feature (or a roadmap for delivering a feature) that prevents such a document-breaking failure mode, so that potential customers have some confidence Fluid won't just eat their users' data.
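One stopgap for the data-loss concern above could live in application code: estimate the serialized size of a payload before inserting it and refuse inserts past a configured cap, rather than letting the summary fail server-side later with "request entity too large". Everything below is a hypothetical sketch (canInsert and estimatePayloadBytes are not Fluid APIs; the 95 MB figure is the documented AFR load/join limit):

```typescript
// Hypothetical client-side guard (not an existing Fluid API).
const MAX_DOC_BYTES = 95 * 1024 * 1024; // documented AFR load/join limit

// Rough size estimate via JSON serialization; real op/summary encoding
// will differ, so treat this as an upper-bound heuristic.
function estimatePayloadBytes(payload: unknown): number {
	return new TextEncoder().encode(JSON.stringify(payload)).length;
}

// Returns false when an insert would push the document past the cap,
// letting the app surface an error instead of corrupting the session.
function canInsert(currentDocBytes: number, payload: unknown): boolean {
	return currentDocBytes + estimatePayloadBytes(payload) <= MAX_DOC_BYTES;
}
```

The app would track (or periodically estimate) the current document size and call canInsert before each large edit.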

Some good news:

My original repro for this issue (inserting 400,000 nodes containing "x" into an array in the tree at https://github.com/microsoft/FluidExamples/blob/main/item-counter/src/schema.ts#L25 using this.insertAtStart(TreeArrayNode.spread(Array.from("x".repeat(400000))));) is no longer failing.

I now have to insert much more data to fail, and the error I get now indicates that the summary was too large:

fluid:telemetry:fluid:telemetry:Summarizer:Running Summarize_cancel {"category":"error","fullTree":false,"finalAttempt":false,"latestSummaryRefSeqNum":224,"timeSinceLastAttempt":13862,"timeSinceLastSummary":13862,"ackWaitDuration":1200,"ackNackSequenceNumber":1340,"summarySequenceNumber":1335,"referenceSequenceNumber":817,"minimumSequenceNumber":780,"opsSinceLastAttempt":593,"opsSinceLastSummary":593,"stage":"submit","dataStoreCount":1,"summarizedDataStoreCount":1,"gcStateUpdatedDataStoreCount":0,"gcBlobNodeCount":0,"gcTotalBlobsSize":0,"summaryNumber":4,"treeNodeCount":10,"blobNodeCount":9,"handleNodeCount":3,"totalBlobSize":131040326,"unreferencedBlobSize":0,"generateDuration":616,"handle":"e43cc6bc13f59c36cfaf9ece06542fa4d65fe096","uploadDuration":13994,"clientSequenceNumber":12,"hasMissingOpData":true,"opsSizesSinceLastSummary":96391413,"nonRuntimeOpsSinceLastSummary":14,"runtimeOpsSinceLastSummary":103,"reason":"summaryNack: Server rejected summary via summaryNack op","duration":15811,"error":"summaryNack: Server rejected summary via summaryNack op","errorMessage":"A non-fatal error happened when trying to write client summary. 
Error: {"message":"request entity too large","canRetry":false,"isFatal":false,"source":"[post] request to [http://localhost:7070/repos/local] failed with [413] status code"}","message":"summaryNack: Server rejected summary via summaryNack op","errorInstanceId":"1f7767f0-7028-42da-8e5d-5a9b9f53e97a","clientType":"noninteractive/summarizer","containerId":"4cb4b2da-6496-4f07-aa1c-14527ae3924e","docId":"b4e21778-ff37-4a34-88e6-0bf42f285104","containerAttachState":"Attached","containerLifecycleState":"loaded","containerConnectionState":"Connected","serializedContainer":false,"runtimeVersion":"2.3.1","summarizeCount":1,"summarizerSuccessfulAttempts":0,"summarizeReason":"maxOps","summaryAttempts":1,"dmInitialSeqNumber":224,"dmLastProcessedSeqNumber":1342,"dmLastKnownSeqNumber":1342,"containerLoadedFromVersionId":"2cda505ebef7e8c33823a5e925535f35276019b7","containerLoadedFromVersionDate":"2024-10-10T19:53:06.000Z","dmLastMsqSeqNumber":1342,"dmLastMsqSeqTimestamp":1728590021463,"dmLastMsqSeqClientId":"3f233329-bbf3-494d-8019-9db1cac97eac","dmLastMsgClientSeq":18,"connectionStateDuration":29673,"loaderId":"f3f65460-c352-4531-a2ef-51fce36187d8","loaderVersion":"2.3.1"} tick=35404 Error
RetriableSummaryError@http://localhost:8080/bundle.js:72084:9
summarizeCore@http://localhost:8080/bundle.js:72299:31
asyncsummarize@http://localhost:8080/bundle.js:72109:14
attemptSummarize@http://localhost:8080/bundle.js:69648:52
trySummarizeWithRetries@http://localhost:8080/bundle.js:69674:51
./node_modules/@fluidframework/container-runtime/lib/summary/runningSummarizer.js/trySummarize/<@http://localhost:8080/bundle.js:69614:25
lockedSummaryAction@http://localhost:8080/bundle.js:69555:16
trySummarize@http://localhost:8080/bundle.js:69611:14
./node_modules/@fluidframework/container-runtime/lib/summary/runningSummarizer.js/RunningSummarizer/this.heuristicRunner<@http://localhost:8080/bundle.js:69336:171
./node_modules/@fluidframework/container-runtime/lib/summary/summarizerHeuristics.js/SummarizeHeuristicRunner/this.runSummarize@http://localhost:8080/bundle.js:70346:29
run@http://localhost:8080/bundle.js:70371:29
./node_modules/@fluidframework/container-runtime/lib/summary/runningSummarizer.js/handleOp/<@http://localhost:8080/bundle.js:69461:42
promise callback
handleOp@http://localhost:8080/bundle.js:69460:18
./node_modules/@fluidframework/container-runtime/lib/summary/runningSummarizer.js/RunningSummarizer/this.runtimeListener@http://localhost:8080/bundle.js:69376:18
emit@http://localhost:8080/bundle.js:6402:17
validateAndProcessRuntimeMessage@http://localhost:8080/bundle.js:60697:14
./node_modules/@fluidframework/container-runtime/lib/containerRuntime.js/processInboundMessages/</<@http://localhost:8080/bundle.js:60577:30
ensureNoDataModelChanges@http://localhost:8080/bundle.js:59465:20
./node_modules/@fluidframework/container-runtime/lib/containerRuntime.js/processInboundMessages/<@http://localhost:8080/bundle.js:60575:22
processInboundMessages@http://localhost:8080/bundle.js:60574:22
process@http://localhost:8080/bundle.js:60545:18
processRemoteMessage@http://localhost:8080/bundle.js:52581:22
process@http://localhost:8080/bundle.js:52450:40
processInboundMessage@http://localhost:8080/bundle.js:54064:22
./node_modules/@fluidframework/container-loader/lib/deltaManager.js/DeltaManager/this._inbound<@http://localhost:8080/bundle.js:53561:18
processDeltas@http://localhost:8080/bundle.js:54306:18
./node_modules/@fluidframework/container-loader/lib/deltaQueue.js/ensureProcessing/this.processingPromise<@http://localhost:8080/bundle.js:54271:37
promise callback*ensureProcessing@http://localhost:8080/bundle.js:54269:18
push@http://localhost:8080/bundle.js:54243:18
enqueueMessages@http://localhost:8080/bundle.js:53977:31
incomingOpHandler@http://localhost:8080/bundle.js:53536:26
./node_modules/@fluidframework/container-loader/lib/connectionManager.js/ConnectionManager/this.opHandler@http://localhost:8080/bundle.js:49872:24
emit@http://localhost:8080/bundle.js:6402:17
emit@http://localhost:8080/bundle.js:101259:26
./node_modules/@fluidframework/driver-base/lib/documentDeltaConnection.js/DocumentDeltaConnection/</<@http://localhost:8080/bundle.js:75744:30
./node_modules/@socket.io/component-emitter/lib/esm/index.js/Emitter.prototype.emit@http://localhost:8080/bundle.js:147759:20
emitEvent@http://localhost:8080/bundle.js:156614:20
onevent@http://localhost:8080/bundle.js:156601:18
onpacket@http://localhost:8080/bundle.js:156571:22
+0ms localhost:8080:13851:25

I'm also no longer seeing an error reported from the azure-local-service console output.

I produced this error by inserting 100+ MB of uncompressible data into the tree. The error indicates "errorMessage":"A non-fatal error happened when trying to write client summary. Error: {\"message\":\"request entity too large\"

That was done using:

	public insertNew = () => {
		const data = [];
		for (let index = 0; index < 100; index++) {
			const inner = [];
			for (let i = 0; i < 10000; i++) {
				inner.push(Math.random().toString(36));
			}
			data.push(inner.join("x"));
		}
		this.insertAtStart(TreeArrayNode.spread(data));
		console.log(
			`${[...this].map((a) => a.length).reduce((a, b) => a + b, 0) / 1024 / 1024} MB`,
		);
	};

and inserting data with the button hooked up to that method many times.
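As a sanity check on the numbers, the size math of insertNew above can be reproduced standalone, with no Fluid dependencies. This sketch assumes only that Math.random().toString(36) yields strings like "0.k2j4h5g6f7d8" (roughly 10 to 19 characters), so each call inserts on the order of 10-20 MB, which is why a handful of clicks crosses the server's upload limit:

```typescript
// Standalone estimate of how much data one insertNew() call generates,
// using the same 100 x 10000 loop as the repro above.
function estimateInsertSizeMB(): number {
	const data: string[] = [];
	for (let index = 0; index < 100; index++) {
		const inner: string[] = [];
		for (let i = 0; i < 10000; i++) {
			inner.push(Math.random().toString(36));
		}
		data.push(inner.join("x"));
	}
	// Sum the character counts of the outer strings, in MB.
	return data.map((s) => s.length).reduce((a, b) => a + b, 0) / 1024 / 1024;
}
```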
