WIP: Tensor zero point sparsity #1304

Closed
wants to merge 10 commits

Conversation

@kylesayrs commented Jun 21, 2024

@@ -1301,6 +1310,17 @@ onnx.Context.Graph = class {
node = new onnx.Node(this, node, inputs, outputs);
this._nodes.push(node);

switch (op_type) {
@kylesayrs (Author) commented Jun 21, 2024

In order for a tensor to know its zero point, it must access the sibling zero-point initializer on its quantized node.

The ONNX format doesn't allow for easy lookup from initializer to node or from initializer to initializer, so it's best to call setZero after onnx.Tensor has been constructed.
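A minimal sketch of what this could look like inside the `switch (op_type)` above, assuming the QLinearConv/QLinearMatMul input layout from the ONNX operator spec; the `inputs` indexing and the `setZeroPoint` helper are illustrative assumptions, not Netron's actual internals:

```javascript
// Illustrative sketch only: pair a quantized node's weight initializer with
// its sibling zero-point initializer once both tensors have been constructed.
// Input layout per the ONNX operator spec for QLinearConv / QLinearMatMul:
//   x, x_scale, x_zero_point, w, w_scale, w_zero_point, y_scale, y_zero_point, ...
const attachZeroPoint = (opType, inputs) => {
    switch (opType) {
        case 'QLinearConv':
        case 'QLinearMatMul': {
            const weight = inputs[3];    // w (or b for QLinearMatMul)
            const zeroPoint = inputs[5]; // w_zero_point (or b_zero_point)
            if (weight && zeroPoint) {
                weight.setZeroPoint(zeroPoint); // hypothetical helper on the tensor
            }
            break;
        }
        default:
            break;
    }
};
```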

@lutzroeder (Owner) commented

An initializer is a parameter to a node. A node should not modify an initializer.

@kylesayrs (Author) commented Jun 22, 2024

This code runs within onnx.Context.Graph, so it is the graph context, not the node, that modifies initializers. Setting the zero point from onnx.Context.Graph seems like the best option?

Attempting to force the zero point to be set during initializer construction seems like a bad idea. Initializers must know the context of their parent node in order to know their zero point.

  1. Looking up the parent node of each initializer is O(N*M), where N is the number of nodes and M is the number of tensors, which seems like an unacceptable runtime.
  2. Attempting to create a parent-node lookup table requires that table to be filled outside of onnx.Tensor (i.e., in some place like onnx.Context.Graph), which is similar to the current implementation. It also requires all nodes to be ingested into the lookup table before any tensors are constructed, which also seems like an unacceptable solution.

@lutzroeder (Owner) commented

The initializer has no zero point. Only the context of the operation or result of the operation does.

x = a + z0
y = a + z1

@kylesayrs (Author) commented

@lutzroeder The zero point is needed by the initializer/tensor to correctly calculate the sparsity metric. Without the zero point, the sparsity metric for quantized weights is meaningless. It would be nice if the sparsity metric for quantized tensors reflected the true number of zeros that would be skipped by a sparsity-aware inference engine.
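As a hedged illustration of what "true zeros" means here (a standalone sketch, not code from this PR): sparsity of a quantized tensor counted against its zero point rather than the literal value 0.

```javascript
// Sketch: fraction of quantized values equal to the zero point, i.e. the
// entries a sparsity-aware engine could skip after dequantization.
const sparsity = (data, zeroPoint) => {
    let zeros = 0;
    for (let i = 0; i < data.length; i++) {
        if (data[i] === zeroPoint) {
            zeros++;
        }
    }
    return data.length > 0 ? zeros / data.length : 0;
};

// With zero point 128 (uint8), counting literal zeros would report 0 sparsity:
console.log(sparsity(new Uint8Array([128, 130, 128, 97, 128]), 128)); // 0.6
```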

@kylesayrs (Author) commented Jun 22, 2024

@lutzroeder Sorry, let me see if I can summarize what you're suggesting

  1. You're saying that calculating sparsity and flops is equivalent to built-in runtime execution or partial execution?

While it's true that these calculations are somewhat similar to runtime execution, I don't consider them to be the same since calculating sparsity and flops can be done much faster and can be done lazily/piece-wise behind expanders as previously suggested.

I agree that work will need to be done to support different formats. Metrics calculations may go unsupported for some formats. I'm not sure if I understand the alternatives though.

  2. You're suggesting an alternative could be to have metrics handled by the runtime itself and results written to a "metrics" file.

When you refer to the runtime itself, I assume you mean ORT or other third-party runtimes? Would the intention be to have the user run the model outside of Netron? The problem would obviously be the non-standardization or absence of metrics output files. Even if there were a sparsity-aware runtime for each format that output metrics files, these files would be in different formats, and we would have the same problem as in 1, where we now have to do work to support all runtime output formats.

  3. You're suggesting a third alternative of attaching to a runtime.

This part I do not understand. This seems similar to the job of a profiler specific to the runtime? How would Netron know the internals of how a runtime performed on a network? What would it look like for the user to "attach a runtime"?

I truly apologize if I am missing something obvious

@lutzroeder (Owner) commented

Exactly. These would be the harder questions that need better answers. It isn't clear what the right path is.

> sparsity and flops can be done much faster

Are these a few exceptions, or the beginning of discovering that many useful metrics require deeper runtime integration or crossing over into profiling?

> The problem would obviously be the non-standardization or absence of metrics output files.

Which metrics output file formats and tools do exist? How are these problems solved today?

> Metrics calculations may go unsupported for some formats. I'm not sure if I understand the alternatives though.

The UX approach isn't clear. How can users intuitively understand what's happening? The sparsity is not the sparsity of this tensor; it's the result of an operation. Having some formats tweak source tensors and others not is confusing at best. Users wouldn't be able to tell without reading the source.

@kylesayrs (Author) commented

@lutzroeder I have implemented sparsity and ops calculations for ONNX in the past here: neuralmagic/sparsezoo#173. The only metric that was somewhat runtime-specific was the FLOPs calculation for conv.

If we choose not to calculate skipped padding for conv ops, then we don't need to iterate through tensor values at all and can calculate ops from the raw number of zeros. This makes metrics for all common ONNX operations computable in a single linear pass.
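For instance, here is a hedged sketch of deriving effective ops from a zero count alone; the 2*M*K*N dense-ops formula and the linear scaling by weight sparsity are my assumptions for illustration, not this PR's code:

```javascript
// Sketch: once the zero count of a MatMul weight (shape K x N) is known,
// effective ops follow without another pass over the tensor.
const effectiveMatMulOps = (m, k, n, zeroCount) => {
    const denseOps = 2 * m * k * n;        // multiply-accumulate pairs
    const sparsity = zeroCount / (k * n);  // fraction of skippable weights
    return denseOps * (1 - sparsity);
};

// Half-sparse 128x256 weight, batch of 64 rows:
console.log(effectiveMatMulOps(64, 128, 256, 16384)); // 2097152
```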

In terms of performance, I suspect that calculating most individual tensor metrics will be negligible, especially if they're behind expanders. Calculating metrics for all tensors/nodes so that they can be searched will likely take some time no matter which approach is taken. I believe users would be willing to endure some wait time for these metrics if Netron:

  • Provides a loading bar for large-tensor / metrics-search cases
  • Lazily loads metrics behind expanders
  • Refuses to calculate metrics for tensors above a certain size threshold
  • Computes metrics in a separate thread to avoid UI slowdown and enable the option to cancel the search (see the sketch below)

Giving users the option to use these metrics in exchange for some wait time seems like a valuable feature.
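A hedged sketch of that last bullet, assuming a plain Web Worker; the inline Blob worker and the message shape are illustrative, not Netron's actual architecture:

```javascript
// Sketch: compute a tensor metric off the main thread so the UI stays
// responsive and the computation can be cancelled.
const workerSource = `
    onmessage = (e) => {
        const { data, zeroPoint } = e.data;
        let zeros = 0;
        for (const value of data) {
            if (value === zeroPoint) {
                zeros++;
            }
        }
        postMessage({ sparsity: zeros / data.length });
    };
`;
const blob = new Blob([workerSource], { type: 'text/javascript' });
const worker = new Worker(URL.createObjectURL(blob));
worker.onmessage = (e) => console.log(`sparsity: ${e.data.sparsity}`);
worker.postMessage({ data: new Int8Array([0, 3, 0, -2, 0]), zeroPoint: 0 }); // 0.6
// worker.terminate() cancels an in-flight computation.
```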

I think 2 and 3 are viable, but having users rely on an external engine or attach a runtime will likely be slower than if Netron calculates the metrics directly, not to mention more inconvenient.

@kylesayrs (Author) commented

Just because Netron supports 1 doesn't mean it can't support 2 and/or 3 down the line. Indeed, I think Netron should support 2 in some capacity. All of these options require format-specific implementations; at least with 1, more of the logic is shared.

@lutzroeder (Owner) commented Jul 13, 2024

Yes, trying to summarize:

This pull request should not merge: modifying the source tensors amounts to running the computation and storing the result in the source tensor, which will lead to conflicts. It would not be transparent to users that the source tensor was modified, and it would be confusing that this happens in some cases but not others. A better way to model this might be as sparsity of the operation (e.g., can Node have a sparsity?).
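A hedged sketch of that modeling suggestion; the class shape and names are illustrative, not Netron's actual Node class. Sparsity is computed on demand as a property of the operation, leaving source tensors untouched:

```javascript
// Sketch: sparsity lives on the operation, derived from the node's own
// inputs at read time; the shared source tensor is never mutated.
class QuantizedNode {
    constructor(weightData, zeroPoint) {
        this.weightData = weightData; // typed array of quantized weights
        this.zeroPoint = zeroPoint;   // resolved from this node's inputs
    }
    get sparsity() {
        let zeros = 0;
        for (const value of this.weightData) {
            if (value === this.zeroPoint) {
                zeros++;
            }
        }
        return zeros / this.weightData.length;
    }
}

// Two nodes can share weight data but disagree on zero point (x = a + z0, y = a + z1):
const weights = new Int8Array([5, 0, 5, 3]);
console.log(new QuantizedNode(weights, 5).sparsity); // 0.5
console.log(new QuantizedNode(weights, 0).sparsity); // 0.25
```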

This leads to the next level of questions: how is the computation performed, and what tensor access is needed to perform it? It isn't clear whether these issues need to be solved immediately to make metrics a minimum viable feature, or whether this can be more exploratory. There seem to be more basic changes to make first, like lazy expansion for tensor properties and basic metrics like min, max, and stddev. Also, now that some effort went into tensor properties, would tensor visualizations, for example, be more valuable and broadly applicable than handling the zero point in special operators in a single framework?

On the exploration side, is it possible to generalize metrics that need tensor data and computation to avoid implementations for each framework, or is that too far-fetched? If the latter, will this lead to more of an execution layer that needs to be implemented for each framework (similar question for shape inference), and should this code always be loaded? Starting down this route will create the expectation that this gets implemented and maintained more broadly, e.g., how to guarantee these "partial execution implementations" are more than prototypes adding maintenance scope. Other routes worth exploring would be driving metric implementations via external tools or the runtime and serializing to a metrics file (which might also be desirable for performance metrics), or attaching to Python runtimes to avoid duplicate execution implementations and enable live execution.

@lutzroeder (Owner) left a comment

This would tie a specific tensor to another tensor via a relationship that is only established during execution. For example, the same source tensor could be used multiple times in a model.
