### Sanity Checks for Saliency Maps
Model parameter randomization test: compare the output of a saliency method on the trained model with its output when the network gets random weights (but the same architecture).
Expectation: quite different results.
If the results are similar, the method is not connected to the model (bad).
Data randomization test: compare the saliency maps of two trained networks.
The first network is trained on the training data, the second on the same training data but with randomly shuffled labels.
Again, we would expect the saliency maps to differ quite a lot.
If not, the method is broken.
These are very basic checks.
If a method passes these checks, it could still be a wrong attribution method.
But if it fails these checks, the method is definitely broken and should not be used.
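A minimal PyTorch sketch of the model parameter randomization test (my own sketch, not code from the paper; `model`, `image` and `target_class` are assumed to be given, e.g. a trained CNN and one preprocessed input image, and vanilla gradient stands in for whatever saliency method is being tested):

```python
import copy
import torch

def vanilla_gradient(model, image, target_class):
    # Saliency map as |d score / d input|, maxed over color channels.
    image = image.clone().detach().requires_grad_(True)
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()
    return image.grad.abs().max(dim=0).values

def randomize_parameters(model):
    # Same architecture, freshly re-initialized (random) weights.
    randomized = copy.deepcopy(model)
    for module in randomized.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()
    return randomized

def rank_correlation(a, b):
    # Spearman-style rank correlation between two saliency maps.
    ra = a.flatten().argsort().argsort().float()
    rb = b.flatten().argsort().argsort().float()
    return torch.corrcoef(torch.stack([ra, rb]))[0, 1]

# sal_trained = vanilla_gradient(model, image, target_class)
# sal_random  = vanilla_gradient(randomize_parameters(model), image, target_class)
# print(rank_correlation(sal_trained, sal_random))  # should be low for a sane method
```

A high correlation between the two maps would mean the method barely looks at the learned weights.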
Problems with guided backpropagation and similar methods:
Images closely resemble the output of edge detectors.
Edge detectors are independent of the model and the data, so this is really bad for guided backpropagation.
Gradients and GradCAM passed the sanity checks (model and data randomization).
Guided BackProp & Guided GradCAM failed.
Other cases (gradient \* input, integrated gradients, SmoothGrad) were not clear-cut.
The results also depend strongly on the activation function that was used (ReLU, tanh, softmax, ...).
Unfortunately, many of those methods have issues.
In an experiment where the labels were shuffled and the model was retrained, the explanations were still very similar.
Only GradCAM was okay.
### List of problems
Before we start with all the approaches, let's look at their problems.
I know, a bit unusual, but it helps to weed out some of the approaches.
Some problems are also more "axiomatic" in the sense that they are violations of properties that are desirable.
- Saturated gradient (explained in [^deeplift]).
Caused by the activation function.
In the flat part of an activation function (e.g. ReLU for negative pre-activations), the output no longer changes with the input.
So the gradient of a pixel might be zero even though the pixel contributed a lot; the gradient is simply saturated (see the small numeric sketch after this list).
- Discontinuities in the gradients, also called the thresholding artifact. Stems from the jump in the ReLU derivative at zero.
Exists in Gradient and Gradient x Input.
- Insensitivity to Model
- Insensitivity to Training Data
- Failure to highlight negative contributions (due to backpropagation through ReLUs)
- Input sensitivity failure: If input and baseline differ in one feature and have different predictions, the relevance of that feature has to be non-zero. Some methods fail this [^integrated-gradients].
Lack of input sensitivity leads to focus on irrelevant features.
Gradient saturation causes input sensitivity failure for DeConvNets and Guided Backpropagation.
- Failure of Implementation Invariance: An attribution should be identical when two networks give exactly the same predictions for any input.
This means that no matter what the network does internally, as long as the output remains the same, so should the attributions to the pixels.
If an attribution method suffers from failure of implementation invariance, it may attribute to irrelevant inputs.
This happens because DeepLift (and also LRP) replace gradients (which are implementation invariant) with discrete gradients (i.e. finite differences) plus a modified form of backpropagation.
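A tiny numeric sketch of the saturated-gradient problem (my own toy example): for the one-dimensional "network" $f(x)=\max(1-x,0)$, the gradient at $x=2$ is zero even though moving from the baseline $x_0=0$ to $x=2$ changed the prediction from 1 to 0, while a path method such as Integrated Gradients recovers the contribution:

```python
import numpy as np

# Toy "network": f(x) = ReLU(1 - x); baseline x0 = 0, input x = 2.
f = lambda x: np.maximum(1.0 - x, 0.0)
grad = lambda x: -1.0 if 1.0 - x > 0 else 0.0   # saturated for x >= 1

x0, x = 0.0, 2.0
print("prediction change f(x) - f(x0):", f(x) - f(x0))  # -1.0
print("gradient at x:                 ", grad(x))        # 0.0 (saturated)
print("gradient * input:              ", x * grad(x))    # 0.0, also blind

# Integrated gradients along the straight line from x0 to x:
m = 50
ig = (x - x0) * np.mean([grad(x0 + k / m * (x - x0)) for k in range(1, m + 1)])
print("integrated gradients:          ", ig)              # close to -1.0
```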
### Some more methods:
Visualizing Deep Neural Network Decisions: Prediction Difference Analysis:
- Based on the Shapley value for explaining individual predictions.
- The relevance of a feature is estimated by measuring how the prediction changes if the feature is unknown, by simulating that the feature is unknown.
- What this paper does differently: instead of simulating ...
- Implementation: https://github.com/lmzintgraf/DeepVis-PredDiff
- DeepSHAP (gradient-based)
- Version called DeepExplainer: there is a connection between SHAP and DeepLift.
- Version called GradientExplainer: connection between SHAP and the Gradient \* Input algorithm.
- Occlusion (perturbation based) https://arxiv.org/abs/1311.2901
- (epsilon) Layer-Wise Relevance Propagation (gradient-based)
- Gradient * Input (gradient-based, surprise!) [^integrated-gradients] https://arxiv.org/abs/1605.01713
- Shapley Value Sampling (perturbation based)
- LIME (perturbation based)
- Grad-CAM class activation maps (gradient-based) [^grad-cam]
- Guided Backpropagation
- Smoothgrad [^smoothgrad] https://arxiv.org/abs/1706.03825
- Deep Inside: Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- All-CNN: Springenberg, J.; Dosovitskiy, A.; Brox, T.; Riedmiller, M. Striving for Simplicity: The All Convolutional Net. ICLR (workshop track), 2015.
- Deep Visualization: Yosinski, J.; Clune, J.; Nguyen, A.; Fuchs, T.; Lipson, H. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
- VBP: Bojarski, M.; Choromanska, A.; Choromanski, K.; Firner, B.; Jackel, L.; Muller, U.; Zieba, K. VisualBackProp: visualizing CNNs for autonomous driving. arXiv preprint arXiv:1611.05418, 2016.
- Meaningful: Fong, R.C.; Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. Proceedings of the IEEE International Conference on Computer Vision, 2017.
- code for above: https://github.com/jacobgil/pytorch-explain-black-box
- Grad-CAM++: Chattopadhyay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V. Grad-CAM++: Generalized Gradient-based Visual Explanations for Deep Convolutional Networks, 2017.
### Further Papers analyzing saliency methods
- https://arxiv.org/pdf/1901.09392.pdf
- https://arxiv.org/pdf/1912.01451.pdf: this paper argues that the methods used to analyze saliency maps are themselves not very reliable.
### DeconvNet, Deconvolution 2013-11-12
Paper: Visualizing and Understanding Convolutional Networks
By: Matthew D Zeiler, Rob Fergus
Date: 2013-11-12
The authors propose a Deconvolutional Network (DeconvNet).
A DeconvNet kind of uses the same components as a Convolutional Neural Network, but reverses the operations such as filtering, pooling and activation.
These nets have already been suggested two years earlier, but not for the purpose of visualization.
We start with an already trained neural network.
For a given image, we compute all the feature maps.
The approach constructs a second neural network that covers the convolutional layers, but in reverse.
We start with the last convolutional layer (the feature maps) and go back until we reach a reconstruction of the input image.
Pooling layers are reversed by unpooling:
Although this unpooling is not reversible (since max-pooling loses information), it is possible to obtain an approximation.
This works by recording the location of the maxima within each pooling region.
They call those "switch" variables.
In the DeconvNet unpooling, these switches are used to map the reconstructions from the layer above to the maximally activated location below.
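A minimal 1D NumPy sketch (my own toy example) of how the recorded switches are used to unpool:

```python
import numpy as np

# 1D max-pooling with "switches" (argmax positions), pool size 2.
a = np.array([0.2, 0.9, 0.4, 0.1, 0.7, 0.6])      # activations in a feature map
pooled   = a.reshape(-1, 2).max(axis=1)            # [0.9, 0.4, 0.7]
switches = a.reshape(-1, 2).argmax(axis=1)         # [1, 0, 0]

# Unpooling in the DeconvNet: place the reconstruction coming from the
# layer above back at the recorded maximum location, zeros elsewhere.
recon_above = np.array([1.0, 2.0, 3.0])            # values coming from above
unpooled = np.zeros_like(a)
unpooled[np.arange(3) * 2 + switches] = recon_above
print(unpooled)                                     # [0. 1. 2. 0. 3. 0.]
```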
The activation layer -- here the Rectified Linear Unit (ReLU) -- is reversed by again applying a ReLU to the signal on the backward pass.
The filters are simply transposed, but applied to the rectified maps, not to the output of the layer beneath.
[On this github README](https://github.com/vdumoulin/conv_arithmetic) you can see some animations of how this looks.
We do not cover DeconvNet in much depth here, because it has some problems and the following, simpler approaches work better.
[^perplexing-behavior] showed that Guided Backpropagation and DeConvNet do partial image reconstruction, which is undesirable.
They also have a nice image explaining the difference between backpropagation of activation between saliency, deconvnet and guided backpropagation.
DeConvNet uses a "Backward ReLU" operation.
DeConvNet suffers from:
- Input sensitivity failure: the reason is that such methods back-propagate through a ReLU only if the ReLU is "turned on" for this input. I guess this is the same issue as gradient saturation.
DeConvNet can also be used for [Feature Visualization](#feature-visualization).
DeConvNet is like vanilla gradient, except for the behavior at the ReLU units.
### DeepLIFT
Paper: Learning Important Features Through Propagating Activation Differences
Authors: Avanti Shrikumar, Peyton Greenside, Anshul Kundaje
Date: 2019-10-12 (arxiv, v2)
Link: Learning important features through propagating activation differences
[^deeplift]
Also backpropagation based approach.
The name DeepLIFT stands for Deep Learning Important FeaTures.
DeepLIFT explains a prediction as the difference from the prediction at some reference point, and attributes this difference among all input features.
This reference is a neutral or default input.
The choice depends on the task.
For an image this can be all black pixels.
Notation:
- $c$ is class of interest
- $x$ is vector of input pixels
- $x_0$ is reference image pixels
- $f$ is the neural network prediction or score just before classification layer (before softmax)
- $f(x)$ is the prediction for the input image
- $f(x_0)$ is the prediction for the reference image
- $\Delta{}f=f(x)-f(x_0)$ is difference in prediction to reference image
- $C_j$ is attribution assigned to the $j$-th pixel $x_j$
DeepLIFT assigns a contribution score to each input feature.
The explained quantity is the difference in prediction $\Delta{}f=f(x)-f(x_0)$, which is distributed among the $p$ features:
$$\Delta{}f=\sum_{j=1}^p{}C_j$$
where $C_j$ is the contribution of the $j$-th feature.
DeepLift works by a set of rules that are applied to calculate the individual contributions.
$$\Delta{}y=\Delta{}y^++\Delta{}y^-$$
For dense and convolutional layers, the positive and negative parts of an activation difference are:
$$\Delta{}y^+=\sum_{i}\mathbb{1}\{w_i\Delta{}x_i>0\}w_i\Delta{}x_i=\sum_{i}\mathbb{1}\{w_i\Delta{}x_i>0\}w_i(\Delta{}x_i^{+}+\Delta{}x_i^{-})$$
$$\Delta{}y^-=\sum_{i}\mathbb{1}\{w_i\Delta{}x_i<0\}w_i\Delta{}x_i=\sum_{i}\mathbb{1}\{w_i\Delta{}x_i<0\}w_i(\Delta{}x_i^{+}+\Delta{}x_i^{-})$$
Contributions are then calculated based on these differences (the "Linear Rule"):
$$C_{\Delta{}x_i^+\Delta{}y_i^+}=\mathbb{1}\{w_i\Delta{}x_i>0\}w_i\Delta{}x_i^+$$
$$C_{\Delta{}x_i^-\Delta{}y_i^+}=\mathbb{1}\{w_i\Delta{}x_i>0\}w_i\Delta{}x_i^-$$
$$C_{\Delta{}x_i^+\Delta{}y_i^-}=\mathbb{1}\{w_i\Delta{}x_i<0\}w_i\Delta{}x_i^+$$
$$C_{\Delta{}x_i^-\Delta{}y_i^-}=\mathbb{1}\{w_i\Delta{}x_i<0\}w_i\Delta{}x_i^-$$
where $y$ is the activation before the activation layer.
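A small NumPy sketch of the Linear Rule for a single linear unit (my own toy example, using the notation above); it checks that the contribution terms add up to $\Delta{}y^+$ and $\Delta{}y^-$:

```python
import numpy as np

rng = np.random.default_rng(2)

# One linear unit y = w . x with reference input x0 and actual input x.
w  = rng.normal(size=4)
x0 = rng.normal(size=4)
x  = rng.normal(size=4)

dx     = x - x0
dx_pos = np.maximum(dx, 0.0)          # Delta x_i^+
dx_neg = np.minimum(dx, 0.0)          # Delta x_i^-
pos    = (w * dx) > 0                 # indicator w_i Delta x_i > 0
neg    = (w * dx) < 0                 # indicator w_i Delta x_i < 0

dy_pos = np.sum(pos * w * dx)         # Delta y^+
dy_neg = np.sum(neg * w * dx)         # Delta y^-

# The four contribution terms of the Linear Rule:
C_pp = pos * w * dx_pos               # Delta x^+ -> Delta y^+
C_np = pos * w * dx_neg               # Delta x^- -> Delta y^+
C_pn = neg * w * dx_pos               # Delta x^+ -> Delta y^-
C_nn = neg * w * dx_neg               # Delta x^- -> Delta y^-

print(dy_pos + dy_neg, w @ dx)            # Delta y^+ + Delta y^- = Delta y
print((C_pp + C_np).sum(), dy_pos)        # contributions to Delta y^+ add up
print((C_pn + C_nn).sum(), dy_neg)        # contributions to Delta y^- add up
```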
Nonlinear activation functions such as ReLU, tanh and sigmoid are processed with the following rule:
$$m_{\Delta{}x\Delta{}y}=\frac{\Delta{}y}{\Delta{}x}$$
This Rescale rule only addresses the saturation and thresholding problems, but not problems with min/AND-like functions.
Alternatively, one can use the RevealCancel rule, which addresses all three.
But the RevealCancel rule can be sensitive to noise, so the Rescale rule can sometimes be preferable (see https://www.youtube.com/watch?v=f_iAM0NPwnM).
The Rescale rule makes no distinction between positive and negative deltas.
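A minimal single-neuron sketch (my own toy example) of why the Rescale rule handles saturation while the plain gradient does not:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# One saturated neuron y = sigmoid(x), reference input x0 = 0, actual input x = 10.
x0, x = 0.0, 10.0
dx = x - x0                               # Delta x = 10
dy = sigmoid(x) - sigmoid(x0)             # Delta y ~ 0.5, the neuron clearly changed

m_rescale = dy / dx                       # Rescale multiplier m_{Delta x, Delta y}
grad = sigmoid(x) * (1.0 - sigmoid(x))    # ordinary gradient, ~4.5e-5 (saturated)

print("Rescale contribution m * Delta x:", m_rescale * dx)  # = Delta y ~ 0.5
print("gradient * Delta x:              ", grad * dx)       # ~ 0.00045, misses it
```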
There is a connection between DeepLIFT and Shapley values:
DeepLIFT can be seen as an approximation of Shapley values, but with a reference point, where the reference input plays the role of the "not participating" player.
How to choose the reference?
Either there is a natural reference, such as all zeros for MNIST (the background color), or one can use multiple references and average the results.
DeepLift breaks implementation invariance [^integrated-gradients].
This happens because DeepLift (and also LRP) replace gradients (which are implementation invariant) with discrete gradients (i.e. finite differences) plus a modified form of backpropagation.
For normal backpropagation, implementation invariance follows because $\delta{}f/\delta{}x$ could, in theory, be computed directly without the chain rule, and the chain rule mathematically has to yield the same result.
The chain rule uses implementation details (i.e. intermediate layers), but arrives at the same result in the end, which is the definition of implementation invariance.
The chain rule does not hold for discrete gradients, and thus implementation invariance breaks.
DeepLift and $\epsilon$-LRP can both be re-formulated as backpropagation with a modified gradient function (Ancona et al. 2018).
Some tips and tricks (for LRP): Methods for Interpreting and Understanding Deep Neural Networks.
LRP should work better on ReLU networks.
These gradient-based methods all differ across activation functions, since when the chain rule is applied, they replace the derivative of the non-linear activation with a function $g(x)$ that differs between methods.
### Deep Taylor
Deep Taylor is a reference method, i.e. attributions are relative to some reference data point.
First, the attribution of the output neuron $j$ of interest is initialized to be equal to the output of that neuron.
The attribution of all other output neurons is set to zero: $s_j^{output}=y$, $s^{output}_{k\neq{}j}=0$.
The attribution then is backpropagated to inputs with the following rule:
$$s^{l-1}_j=\frac{w\cdot(x-x_0)}{w^Tx}s_j^l$$
where $x_0$ is the pixel vector of the reference image.
When a constant is added to the image, the method no longer works, depending on the choice of reference point [^unreliable-saliency].
Deep Taylor is equivalent to LRP when an all-zero reference point is chosen.
When all zeros are chosen, the method is sensitive to a constant shift in the image -> bad.
### Integrated Gradients 2017-06-13
Citation: [^integrated-gradients]
Paper: Axiomatic Attribution for Deep Networks
Link: https://arxiv.org/pdf/1703.01365.pdf
Integrated Gradients, informally, can be seen as combining gradients with reference-based approaches such as LRP and DeepLIFT.
The basic idea: choose some baseline input (e.g. a black image), consider a straight line between the baseline and the actual image (in pixel space), and integrate the gradients while moving the pixel values from one image to the other.
It's a path integral.
If we had a two-pixel image [0.1, 0.2] and a baseline image [0, 0], intermediate points on the straight line between them would be, for example, [0.07, 0.14] and [0.01, 0.02].
They defined some desiderata, which, of course, Integrated Gradients all fulfil:
- Sensitivity: If input and baseline differ in one feature and have different predictions, the relevance of that feature has to be non-zero.
I think that is a fair requirement.
The gradient alone does not fulfill this axiom, because the gradient can be zero even though the feature values differ.
DeconvNet and Guided Backpropagation also break sensitivity.
- Implementation Invariance: For two networks that give exactly the same predictions, no matter what the input looks like, the attributions should be the same.
Even if the networks work differently inside.
I think this is also a fair assumption.
LRP and DeepLift don't satisfy implementation invariance.
If methods are not implementation invariant, they are sensitive to irrelevant internal details of the network.
- Completeness: The attributions / relevance scores add up to the difference between the prediction for the input x and the prediction for the chosen baseline.
Integrated Gradients, DeepLift and LRP satisfy this.
- Dummy: if a network does not depend on a feature at all, its relevance should be zero.
- Linearity: If we linearly combine two networks (e.g. weighted sum of the prediction of both), then the attribution for a feature should also be a weighted sum (with same weights as in linear combination).
- Symmetry: if two features play symmetric roles in the function, swapping them should yield the same attributions (see Shapley value).
It's important to do the linear interpolation to achieve symmetry.
There are infinitely many paths one could take, e.g. first fully change one pixel, then the next, and so on.
Each path would lead to a different attribution.
Integrated Gradients chose the linear path.
(most axioms are like their shapley equivalents)
Integrated gradients correspond to a cost-sharing method called Aumann-Shapley.
$$E_{IG}=(x-x_0)\cdot\int_0^1\frac{\delta{}f(x_0+\alpha(x-x_0))}{\delta{}x}d\alpha$$
Here $x_0$ is a reference image that stands for the absence of features.
Usually a black or otherwise single-colored image.
IG is a reference point method, i.e. attributions are relative to a reference.
IG is sensitive to a constant shift [^unreliable-saliency], which is bad, but only when an all-zero reference point is chosen.
When a black image is chosen as the reference (set all pixels to the minimum in the data), it is fine.
The numerical integral is, of course, some computational overhead compared to other methods.
It is computed by summation.
We cut the path into $m$ parts and replace the integral with a sum (the Riemann approximation of the integral):
$$E_{IG}=(x-x_0)\cdot\sum_{k=1}^m\frac{\delta{}f(x_0+\frac{k}{m}\cdot(x-x_0))}{\delta{}x}\cdot\frac{1}{m}$$
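A short NumPy sketch of this Riemann approximation (the toy model and its hand-written gradient are my own; in practice the gradient would come from automatic differentiation):

```python
import numpy as np

def integrated_gradients(grad_f, x, x0, m=50):
    # Riemann approximation: average the gradient along the straight line
    # from the baseline x0 to the input x, then scale by (x - x0).
    total = np.zeros_like(x)
    for k in range(1, m + 1):
        total += grad_f(x0 + (k / m) * (x - x0))
    return (x - x0) * total / m

# Toy model f(x) = ReLU(w . x) with its (sub)gradient.
w = np.array([1.0, -2.0])
f = lambda x: max(w @ x, 0.0)
grad_f = lambda x: w * float(w @ x > 0)

x, x0 = np.array([3.0, 1.0]), np.zeros(2)
attr = integrated_gradients(grad_f, x, x0)
print(attr)                      # per-pixel attributions
print(attr.sum(), f(x) - f(x0))  # completeness: the two numbers should match
```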
Integrated Gradients address the saturation and thresholding problems.
Has a reference point.
Can still give misleading results, according to the DeepLift paper.
https://distill.pub/2020/attribution-baselines/
Integrated Gradients are motivated by Aumann-Shapley value from cooperative game theory.
You can also average contributions over many different reference points, i.e. a reference distribution.
### Guided Backpropagation 2015-04-13
Paper: Striving for Simplicity: The all Convolutional Net
Date: 2015-04-13
Link: https://arxiv.org/pdf/1412.6806.pdf
The paper proposes a CNN architecture built solely from convolutional layers, without max-pooling layers.
But they also introduce a new variant of DeConvNet.
It combines Vanilla Gradient and DeConvNet.
Guided Backpropagation computes gradients, but any gradient that becomes negative during the backward pass is discarded at the ReLUs (i.e. set to zero).
Guided Backpropagation breaks input sensitivity (and gradient saturation?) [^integrated-gradients].
Builds on Deconvolution.
Negative gradients are set to zero.
The gradient is only passed back through a ReLU if the forward activation is positive (forward ReLU, as in vanilla gradient) and the incoming gradient is positive (backward ReLU, as in DeConvNet).
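The three ReLU backward rules side by side in a minimal NumPy sketch (my own summary, matching the figure below):

```python
import numpy as np

# Forward pre-activations at a ReLU layer and the gradient arriving from above.
z = np.array([ 1.0, -2.0,  3.0, -0.5])      # forward-pass input to the ReLU
g = np.array([ 0.7,  0.4, -0.9,  0.1])      # incoming backward signal

vanilla = g * (z > 0)                        # pass only where the ReLU was active
deconv  = g * (g > 0)                        # "backward ReLU": keep positive signal
guided  = g * (z > 0) * (g > 0)              # both conditions -> more zeros

print("vanilla gradient :", vanilla)   # [ 0.7  0.  -0.9  0. ]
print("DeConvNet        :", deconv)    # [ 0.7  0.4  0.   0.1]
print("guided backprop  :", guided)    # [ 0.7  0.   0.   0. ]
```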
[^perplexing-behavior] showed that Guided Backpropagation and DeConvNet do partial image reconstruction, which is undesirable.
By combining Vanilla Gradient and DeConvNet we end up with more zero attributions.
Like Deconvnet, Guided Backpropagation can also be used for [Feature Visualization](#feature-visualization).
![](images/backpass.png)
### Layerwise Relevance Propagation LRP 2016
Paper: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation
Authors: Alexander Binder, Sebastian Bach, Gregoire Montavon, Klaus-Robert Müller, and Wojciech Samek
Date: 2015
Link: http://iphome.hhi.de/samek/pdf/BinICISA16.pdf
[^lrp]
First, you do a forward pass of the image to get the prediction.
Then, you do a backward pass and at each layer, redistribute the outcome among the layer below:
Idea: $f(x)\approx\sum_{i=1}^pR_i$
where each contribution $R_i$ belongs to an input pixel.
Contributions can be positive or negative.
$$R_i^{(l)}=\sum_{j}\frac{z_{ij}}{\sum_{i'}z_{i'j}}R_j^{(l+1)}$$
with $z_{ij}=x_i^{(l)}w_{ij}^{(l,l+1)}$.
where $i$ is the index of a neuron at layer $l$.
The $j$ index runs over all upper-layer neurons, to which neuron $i$ contributes.
Using this satisfies the efficiency property: $\sum_pR_p^{(1)}=f(x)$.
In their paper, they call it conservation property.
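A minimal NumPy sketch of this redistribution rule for a two-layer ReLU network without biases (my own toy example), including a check of the conservation property:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer ReLU network without biases: f(x) = w2 . ReLU(W1 x)
W1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=4)
x  = rng.normal(size=3)

z1 = W1 @ x                      # pre-activations of the hidden layer
a1 = np.maximum(z1, 0.0)         # hidden activations
f  = w2 @ a1                     # network output (relevance to redistribute)

eps = 1e-12                      # tiny stabilizer to avoid division by zero

# Output -> hidden: z_j = a1_j * w2_j, denominator is their sum (= f).
z_out = a1 * w2
R_hidden = z_out / (z_out.sum() + eps) * f

# Hidden -> input: z_ij = x_i * w_ij, denominator is the pre-activation z1_j.
Z = W1 * x[None, :]              # shape (hidden, input)
R_input = (Z / (z1[:, None] + eps) * R_hidden[:, None]).sum(axis=0)

print(R_input)                   # pixel-wise relevances
print(R_input.sum(), f)          # conservation: both should be (almost) equal
```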
They introduce two more variants, LRP-$\beta$ and LRP-$\epsilon$, which relax the efficiency property to get some better numerical properties.
LRP for ReLU networks is equivalent -- up to some factor -- to Gradient \* Input.
And Gradient x Input does not address the saturation problem.
LRP breaks implementation invariance [^integrated-gradients].
### Gradient \* Input, DeepLift, $\epsilon$-LRP
It is the same as the gradient method, but multiplied with the value of the feature / pixel.
Helps with the problem of gradient saturation and reduces visual diffusion, as can be seen in the examples later.
$$E_{GI}=x\cdot\frac{\delta{}f}{\delta{}x}\Big|_{x=x_0}$$
All three methods are equivalent, as shown by Ancona et al. 2018 (in ReLU networks, no bias, zero baseline).
When a constant is added to the image, it no longer works [^unreliable-saliency].
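A small numeric check of this equivalence for a bias-free ReLU network (my own toy example, in the spirit of the Ancona et al. result): Gradient \* Input and $\epsilon$-LRP give the same pixel scores.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bias-free ReLU network: f(x) = w2 . ReLU(W1 x)
W1 = rng.normal(size=(5, 3))
w2 = rng.normal(size=5)
x  = rng.normal(size=3)

z1 = W1 @ x
a1 = np.maximum(z1, 0.0)
f  = w2 @ a1

# Gradient * Input
grad_x = W1.T @ (w2 * (z1 > 0))    # backpropagated gradient d f / d x
gi = x * grad_x

# epsilon-LRP (z-rule with epsilon -> 0)
eps = 1e-12
R_hidden = a1 * w2                 # hidden relevances; their sum equals f
Z = W1 * x[None, :]                # z_ij = x_i * w_ij
lrp = (Z / (z1[:, None] + eps) * R_hidden[:, None]).sum(axis=0)

print(gi)
print(lrp)   # matches gi up to numerical error
```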