WIP: Tensor zero point sparsity #1304

Closed
wants to merge 10 commits

Conversation

@kylesayrs commented Jun 21, 2024

@@ -1301,6 +1310,17 @@ onnx.Context.Graph = class {
node = new onnx.Node(this, node, inputs, outputs);
this._nodes.push(node);

switch (op_type) {
@kylesayrs (Author) commented Jun 21, 2024

In order for a tensor to know its zero point, it must access the sibling zero-point initializer on its quantized node.

The ONNX format doesn't allow for easy lookup from initializer to node or from initializer to initializer, so it's best to call setZero after onnx.Tensor has been constructed.
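A minimal sketch of what this could look like inside the `switch (op_type)` above, assuming the QLinearConv/QLinearMatMul input layout from the ONNX operator spec; the `inputs` indexing and the `setZeroPoint` helper are illustrative assumptions, not Netron's actual internals:

```javascript
// Illustrative sketch only: pair a quantized node's weight initializer with
// its sibling zero-point initializer once both tensors have been constructed.
// Input layout per the ONNX operator spec for QLinearConv / QLinearMatMul:
//   x, x_scale, x_zero_point, w, w_scale, w_zero_point, y_scale, y_zero_point, ...
const attachZeroPoint = (opType, inputs) => {
    switch (opType) {
        case 'QLinearConv':
        case 'QLinearMatMul': {
            const weight = inputs[3];    // w (or b for QLinearMatMul)
            const zeroPoint = inputs[5]; // w_zero_point (or b_zero_point)
            if (weight && zeroPoint) {
                weight.setZeroPoint(zeroPoint); // hypothetical helper on the tensor
            }
            break;
        }
        default:
            break;
    }
};
```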

@lutzroeder (Owner) commented

An initializer is a parameter to a node. A node should not modify an initializer.

@kylesayrs (Author) commented Jun 22, 2024

This code runs within onnx.Context.Graph, so it is the graph context, not the node, that modifies initializers. Setting the zero point from onnx.Context.Graph seems like the best option?

Attempting to force the zero point to be set during initializer construction seems like a bad idea. Initializers must know the context of their parent node in order to know their zero point.

  1. Looking up the parent node of each initializer is O(N*M), where N is the number of nodes and M is the number of tensors, which seems like an unacceptable runtime.
  2. Attempting to create a parent-node lookup table requires that table to be filled outside of onnx.Tensor (i.e., in some place like onnx.Context.Graph), which is similar to the current implementation. It also requires all nodes to be ingested into the lookup table before any tensors are constructed, which also seems like an unacceptable solution.

@lutzroeder (Owner) commented

The initializer has no zero point. Only the context of the operation or result of the operation does.

x = a + z0
y = a + z1

@kylesayrs (Author) commented

@lutzroeder The zero point is needed by the initializer/tensor to correctly calculate the sparsity metric. Without the zero point, the sparsity metric for quantized weights is meaningless. It would be nice if the sparsity metric for quantized tensors reflected the true number of zeros that would be skipped by a sparsity-aware inference engine.
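As a hedged illustration of what "true zeros" means here (a standalone sketch, not code from this PR): sparsity of a quantized tensor counted against its zero point rather than the literal value 0.

```javascript
// Sketch: fraction of quantized values equal to the zero point, i.e. the
// entries a sparsity-aware engine could skip after dequantization.
const sparsity = (data, zeroPoint) => {
    let zeros = 0;
    for (let i = 0; i < data.length; i++) {
        if (data[i] === zeroPoint) {
            zeros++;
        }
    }
    return data.length > 0 ? zeros / data.length : 0;
};

// With zero point 128 (uint8), counting literal zeros would report 0 sparsity:
console.log(sparsity(new Uint8Array([128, 130, 128, 97, 128]), 128)); // 0.6
```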

@kylesayrs (Author) commented Jun 22, 2024

@lutzroeder Sorry, let me see if I can summarize what you're suggesting

  1. You're saying that calculating sparsity and flops is equivalent to built-in runtime execution or partial execution?

While it's true that these calculations are somewhat similar to runtime execution, I don't consider them to be the same since calculating sparsity and flops can be done much faster and can be done lazily/piece-wise behind expanders as previously suggested.

I agree that work will need to be done to support different formats. Metrics calculations may go unsupported for some formats. I'm not sure if I understand the alternatives though.

  2. You're suggesting an alternative could be to have metrics handled by the runtime itself and results written to a "metrics" file.

When you refer to the runtime itself, I assume you mean ORT or other third-party runtimes? Would the intention be to have the user run the model outside of Netron? The problem would obviously be the non-standardization or absence of metrics output files. Even if there were a sparsity-aware runtime for each format that output metrics files, these files would be in different formats, and we would have the same problem as in 1, where we now have to do work to support all runtime output formats.

  3. You're suggesting a third alternative of attaching to a runtime.

This part I do not understand. This seems similar to the job of a profiler specific to the runtime? How would Netron know the internals of how a runtime performed on a network? What would it look like for the user to "attach a runtime"?

I truly apologize if I am missing something obvious

@lutzroeder (Owner) commented

Exactly. These would be the harder questions that need better answers. It isn't clear what the right path is.

> sparsity and flops can be done much faster

Are these a few exceptions, or the beginning of discovering that many useful metrics require deeper runtime integration or crossing over into profiling?

> The problem would obviously be the non-standardization or absence of metrics output files.

Which metrics output file formats and tools do exist? How are these problems solved today?

> Metrics calculations may go unsupported for some formats. I'm not sure if I understand the alternatives though.

The UX approach isn't clear. How can users intuitively understand what's happening? The sparsity is not the sparsity of this tensor; it's the result of an operation. Having some formats tweak source tensors and others not is confusing at best. Users wouldn't be able to tell without reading the source.

@kylesayrs (Author) commented

@lutzroeder I have implemented sparsity and ops calculations for ONNX in the past here: neuralmagic/sparsezoo#173. The only metric that was somewhat runtime-specific was the FLOPs calculation for conv.

If we choose not to calculate skipped padding for conv ops, then we don't need to iterate through tensor values at all and can calculate ops from the raw number of zeros. This makes metrics for all common ONNX operations computable in a single linear pass.
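For instance, here is a hedged sketch of deriving effective ops from a zero count alone; the 2*M*K*N dense-ops formula and the linear scaling by weight sparsity are my assumptions for illustration, not this PR's code:

```javascript
// Sketch: once the zero count of a MatMul weight (shape K x N) is known,
// effective ops follow without another pass over the tensor.
const effectiveMatMulOps = (m, k, n, zeroCount) => {
    const denseOps = 2 * m * k * n;        // multiply-accumulate pairs
    const sparsity = zeroCount / (k * n);  // fraction of skippable weights
    return denseOps * (1 - sparsity);
};

// Half-sparse 128x256 weight, batch of 64 rows:
console.log(effectiveMatMulOps(64, 128, 256, 16384)); // 2097152
```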

In terms of performance, I suspect that calculating most individual tensor metrics will be negligible, especially if they're behind expanders. Calculating metrics for all tensors/nodes so that they can be searched will likely take some time no matter which approach is taken. I believe users would be willing to endure some wait time for these metrics if Netron:

  • Provides a loading bar for large-tensor / metrics-search cases
  • Lazily loads metrics behind expanders
  • Refuses to calculate metrics for tensors above a certain size threshold
  • Computes metrics in a separate thread to avoid UI slowdown and enable the option to cancel the search (see the sketch below)

Giving users the option to use these metrics in exchange for some wait time seems like a valuable feature.
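A hedged sketch of that last bullet, assuming a plain Web Worker; the inline Blob worker and the message shape are illustrative, not Netron's actual architecture:

```javascript
// Sketch: compute a tensor metric off the main thread so the UI stays
// responsive and the computation can be cancelled.
const workerSource = `
    onmessage = (e) => {
        const { data, zeroPoint } = e.data;
        let zeros = 0;
        for (const value of data) {
            if (value === zeroPoint) {
                zeros++;
            }
        }
        postMessage({ sparsity: zeros / data.length });
    };
`;
const blob = new Blob([workerSource], { type: 'text/javascript' });
const worker = new Worker(URL.createObjectURL(blob));
worker.onmessage = (e) => console.log(`sparsity: ${e.data.sparsity}`);
worker.postMessage({ data: new Int8Array([0, 3, 0, -2, 0]), zeroPoint: 0 }); // 0.6
// worker.terminate() cancels an in-flight computation.
```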

I think 2 and 3 are viable, but having users rely on an external engine or attach a runtime will likely be slower than if Netron calculates the metrics directly, not to mention more inconvenient.

@kylesayrs (Author) commented

Just because Netron supports 1 doesn't mean it can't support 2 and/or 3 down the line. Indeed, I think Netron should support 2 in some capacity. All of these options require format-specific implementations; at least with 1, more of the logic is shared.

@lutzroeder (Owner) commented Jul 13, 2024

Yes, trying to summarize:

This pull request should not merge: modifying the source tensors amounts to running the computation and storing the result in the source tensor, which will lead to conflicts. It would not be transparent to users that the source tensor was modified, and it would be confusing that this happens in some cases but not others. A better way to model this might be as sparsity of the operation (e.g., can Node have a sparsity?).
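A hedged sketch of that modeling suggestion; the class shape and names are illustrative, not Netron's actual Node class. Sparsity is computed on demand as a property of the operation, leaving source tensors untouched:

```javascript
// Sketch: sparsity lives on the operation, derived from the node's own
// inputs at read time; the shared source tensor is never mutated.
class QuantizedNode {
    constructor(weightData, zeroPoint) {
        this.weightData = weightData; // typed array of quantized weights
        this.zeroPoint = zeroPoint;   // resolved from this node's inputs
    }
    get sparsity() {
        let zeros = 0;
        for (const value of this.weightData) {
            if (value === this.zeroPoint) {
                zeros++;
            }
        }
        return zeros / this.weightData.length;
    }
}

// Two nodes can share weight data but disagree on zero point (x = a + z0, y = a + z1):
const weights = new Int8Array([5, 0, 5, 3]);
console.log(new QuantizedNode(weights, 5).sparsity); // 0.5
console.log(new QuantizedNode(weights, 0).sparsity); // 0.25
```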

This leads to the next level of questions: how is the computation performed, and what tensor access is needed to perform it? It isn't clear whether these issues need to be solved immediately to make metrics a minimum viable feature, or whether this can be more exploratory. There seem to be more basic changes to make first, like lazy expansion for tensor properties and basic metrics like min, max, and stddev. Also, now that some effort went into tensor properties, would tensor visualizations, for example, be more valuable and broadly applicable than handling the zero point in special operators in a single framework?

On the exploration side, is it possible to generalize metrics that need tensor data and computation to avoid implementations for each framework, or is that too far-fetched? If the latter, will this lead to more of an execution layer that needs to be implemented for each framework (similar question for shape inference), and should this code always be loaded? Starting down this route will create the expectation that this gets implemented and maintained more broadly, e.g., how to guarantee these "partial execution implementations" are more than prototypes adding maintenance scope. Other routes worth exploring would be driving metric implementations via external tools or the runtime and serializing to a metrics file (which might also be desirable for performance metrics), or attaching to Python runtimes to avoid duplicate execution implementations and enable live execution.

@lutzroeder (Owner) left a comment

This would tie a specific tensor to another tensor via a relationship that is only established during execution. For example, the same source tensor could be used multiple times in a model.
