Concerns about Applying `stride=2` in `CausalConv` and Padding Strategy for ASR #9883

raman-r-4978 · 2024-07-25T11:31:57Z

raman-r-4978
Jul 25, 2024

Hello,

I have been exploring the implementation of CausalConv for Streaming ASR tasks and have encountered a couple of points that I would like to discuss:

Stride Application in CausalConv2D:
Is it sensible to apply stride=2 for CausalConv in the context of ASR? Considering the streaming signals, using a stride of 2 could potentially skip over and see the future information. Can someone provide insights into the practical implications of using stride=2 in Streaming ASR case?

Reference:

>>> asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_hybrid_large_streaming_multi")
>>> print(asr_model)
...
(encoder): ConformerEncoder(
    (pre_encode): ConvSubsampling(
      (out): Linear(in_features=2816, out_features=512, bias=True)
      (conv): Sequential(
        (0): CausalConv2D(1, 256, kernel_size=(3, 3), stride=(2, 2))
        (1): ReLU(inplace=True)
        (2): CausalConv2D(256, 256, kernel_size=(3, 3), stride=(2, 2), groups=256)
        (3): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
        (4): ReLU(inplace=True)
        (5): CausalConv2D(256, 256, kernel_size=(3, 3), stride=(2, 2), groups=256)
        (6): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
        (7): ReLU(inplace=True)
      )
    )
...

Padding Strategy:
I noticed that the current implementation involves adding both left and top padding to the time axis and dim axis. Wouldn't it be sufficient to add padding only to the time axis? What is the rationale behind applying both left and top padding, and how does it impact the performance and accuracy of the ASR model?

Ref: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/submodules/causal_convs.py#L67

def forward(self, x,):
    x = F.pad(x, pad=(self._left_padding, self._right_padding, self._left_padding, self._right_padding))
    x = super().forward(x)
    return x

Looking forward to understanding the idea behind this.
Thank you!

@VahidooX @Slyne @titu1994

Answered by scarecrow1123

Aug 8, 2024

It just moves the kernel by 2 instead of 1.

the Test 2 scenario isn't causal because it allows access to future information.

@VahidooX 's point still holds good with Test 2 @raman-r-4978 . One way to explain your example is that, for inference, every alternate input frame has to wait till the next frame is arrived. Hence 2 is held till 3 arrives for further processing. So in your example of Second window [1, 2, 3] , 3 is the current frame and not future frame. With causal convolutions, the current input is always at the right end. When 2 was the current frame in the previous step, there was no real output.

View full answer

VahidooX · 2024-08-02T05:31:26Z

VahidooX
Aug 2, 2024
Collaborator

Considering the streaming signals, using a stride of 2 could potentially skip over and see the future information.

Not sure why strided convolutions may cause the model to see the future. It just moves the kernel by 2 instead of 1. Can you provide a simple example of one dimension vector with strided of 2 where a timestep is enabled to see the future? The following script simulates the streaming and it has shown that the outputs of the model in streaming mode (no future exists) is exactly the same as when you pass the whole audio at once. So it is very unlikely that such an issue has happened.

https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py

Wouldn't it be sufficient to add padding only to the time axis? What is the rationale behind applying both left and top padding, and how does it impact the performance and accuracy of the ASR model?

The padding on the left (time axis) is needed to make the convolution causal. The padding on the top is added to make sure all channels are seen by the convolution. When you have an even number of channels (like 80 in most of our models) and you use strided convolution of 2 (the convolution does the striding on all dimensions, not just the time), then the last channel may get skipped and not seen by the convolution, By adding one padding on the top, we make the channels odd and make sure all channels are being seen. Adding extra paddings does not affect the results as values are always zero and model can easily understand it.

4 replies

raman-r-4978 Aug 5, 2024
Author

Hi @VahidooX , Thanks for the reply

The padding on the left (time axis) is needed to make the convolution causal. The padding on the top is added to make sure all channels are seen by the convolution. When you have an even number of channels (like 80 in most of our models) and you use strided convolution of 2 (the convolution does the striding on all dimensions, not just the time), then the last channel may get skipped and not seen by the convolution, By adding one padding on the top, we make the channels odd and make sure all channels are being seen. Adding extra paddings does not affect the results as values are always zero and model can easily understand it.

I can understand this part. Thanks

Can you provide a simple example of one dimension vector with strided of 2 where a timestep is enabled to see the future?

Note:
The below CausalConv1D expects inputs as [batch, mel_feat_dim, seq_length].
Example: For a 30-second audio, the input would be [1, 80, 3000].

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride, dilation=1):
        super(CausalConv1d, self).__init__()
        self.conv = nn.Conv1d(
            in_channels,
            out_channels,
            kernel_size,
            stride=stride,
            dilation=dilation,
        )

    def forward(self, x):
        # Input `x` is already padded for this experiment
        return self.conv(x)

batch_size = 1
seq_length = 10

in_channels = 1
out_channels = 1
kernel_size = 3
dilation = 1

# Test 1: `stride=1`
stride = 1

input_tensor = torch.arange(1, seq_length + 1, dtype=torch.float32).view(batch_size, in_channels, seq_length)

padding = (kernel_size - 1) * dilation
input_tensor = F.pad(input_tensor, (padding, 0))

causal_conv = CausalConv1d(in_channels, out_channels, kernel_size, stride, dilation)
output = causal_conv(input_tensor)

print("Input shape:", input_tensor.shape)
print("Input:", input_tensor.squeeze().tolist())
print("Output shape:", output.shape)
print("Output:", output.squeeze())

######### Output  #########
# Input shape: torch.Size([1, 1, 12])
# Input: [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
# Output shape: torch.Size([1, 1, 10])
# Output: tensor([-0.1508, -0.6553, -1.6221, -2.5889, -3.5556, -4.5224, -5.4892, -6.4560, -7.4227, -8.3895], grad_fn=<SqueezeBackward0>)

# Step by step output
x = input_tensor.squeeze().squeeze()
for i in range(0, output.shape[2]):
    start_idx = i * stride
    new_input = x[start_idx:start_idx+kernel_size]
    print(f"\nTime step {i}:")
    print(f"  Input used: {new_input.tolist()}")

    new_output = causal_conv(new_input.view(batch_size, in_channels, kernel_size))
    print(f"  New output: {new_output.item():.4f}")
    print(f"  Original output: {output[0][0][i].item():.4f}")
    print(f"  Outputs match: {torch.isclose(new_output, output[0][0][i]).item()}")


######### Output 1 #########
# Time step 0:
#   Input used: [0.0, 0.0, 1.0]
#   New output: -0.1508
#   Original output: -0.1508
#   Outputs match: True

# Time step 1:
#   Input used: [0.0, 1.0, 2.0]
#   New output: -0.6553
#   Original output: -0.6553
#   Outputs match: True

# Time step 2:
#   Input used: [1.0, 2.0, 3.0]
#   New output: -1.6221
#   Original output: -1.6221
#   Outputs match: True

# Time step 3:
#   Input used: [2.0, 3.0, 4.0]
#   New output: -2.5889
#   Original output: -2.5889
#   Outputs match: True

# Time step 4:
#   Input used: [3.0, 4.0, 5.0]
#   New output: -3.5556
#   Original output: -3.5556
#   Outputs match: True

# Time step 5:
#   Input used: [4.0, 5.0, 6.0]
#   New output: -4.5224
#   Original output: -4.5224
#   Outputs match: True

# Time step 6:
#   Input used: [5.0, 6.0, 7.0]
#   New output: -5.4892
#   Original output: -5.4892
#   Outputs match: True

# Time step 7:
#   Input used: [6.0, 7.0, 8.0]
#   New output: -6.4560
#   Original output: -6.4560
#   Outputs match: True

# Time step 8:
#   Input used: [7.0, 8.0, 9.0]
#   New output: -7.4227
#   Original output: -7.4227
#   Outputs match: True

# Time step 9:
#   Input used: [8.0, 9.0, 10.0]
#   New output: -8.3895
#   Original output: -8.3895
#   Outputs match: True


# Test 2: `stride=2`

stride = 2

causal_conv = CausalConv1d(in_channels, out_channels, kernel_size, stride, dilation)

input_tensor = torch.arange(1, seq_length + 1, dtype=torch.float32).view(batch_size, in_channels, seq_length)

padding = (kernel_size - 1) * dilation
input_tensor = F.pad(input_tensor, (padding, 0))

output = causal_conv(input_tensor)

print("Input shape:", input_tensor.shape)
print("Input:", input_tensor.squeeze().tolist())
print("Output shape:", output.shape)
print("Output:", output.squeeze())

######### Output  #########
# Input shape: torch.Size([1, 1, 12])
# Input: [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
# Output shape: torch.Size([1, 1, 5])
# Output: tensor([-0.4276, -0.9126, -1.4725, -2.0324, -2.5923], grad_fn=<SqueezeBackward0>)

x = input_tensor.squeeze().squeeze()
for i in range(0, output.shape[2], 1):
    start_idx = i * stride
    new_input = x[start_idx:start_idx+kernel_size]
    print(f"\nTime step {i}:")
    print(f"  Input used: {new_input.tolist()}")

    new_output = causal_conv(new_input.view(batch_size, in_channels, kernel_size))
    print(f"  New output: {new_output.item():.4f}")
    print(f"  Original output: {output[0][0][i].item():.4f}")
    print(f"  Outputs match: {torch.isclose(new_output, output[0][0][i]).item()}")

######### Output 2 #########
# Time step 0:
#   Input used: [0.0, 0.0, 1.0]
#   New output: -0.4276
#   Original output: -0.4276
#   Outputs match: True

# Time step 1:
#   Input used: [1.0, 2.0, 3.0]
#   New output: -0.9126
#   Original output: -0.9126
#   Outputs match: True

# Time step 2:
#   Input used: [3.0, 4.0, 5.0]
#   New output: -1.4725
#   Original output: -1.4725
#   Outputs match: True

# Time step 3:
#   Input used: [5.0, 6.0, 7.0]
#   New output: -2.0324
#   Original output: -2.0324
#   Outputs match: True

# Time step 4:
#   Input used: [7.0, 8.0, 9.0]
#   New output: -2.5923
#   Original output: -2.5923
#   Outputs match: True

I have no issues with Test 1, as it functions as intended.
But, as you can see, the Test 2 scenario isn't causal because it allows access to future information.

Here's an illustration

Input vector with pad: [0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Convolutional operation (kernel size = 3, stride = 2):

At timestep 1, First window: [0, 0, 1] -> Output 0
At timestep 2, Second window: [1, 2, 3] -> Output 1 (sees the future frame 3)
At timestep 3,Third window: [3, 4, 5] -> Output 2 (sees the future frame 4, 5)
.. so on

scarecrow1123 Aug 8, 2024

It just moves the kernel by 2 instead of 1.

the Test 2 scenario isn't causal because it allows access to future information.

@VahidooX 's point still holds good with Test 2 @raman-r-4978 . One way to explain your example is that, for inference, every alternate input frame has to wait till the next frame is arrived. Hence 2 is held till 3 arrives for further processing. So in your example of Second window [1, 2, 3] , 3 is the current frame and not future frame. With causal convolutions, the current input is always at the right end. When 2 was the current frame in the previous step, there was no real output.

Answer selected by VahidooX

VahidooX Aug 8, 2024
Collaborator

@scarecrow1123 gave a good explanation. Current step is the right frame in causal convolutions.

As I mentioned, as the results have been tested in streaming mode and compared with offline mode, it is very unlikely that any leaking happens from future.

raman-r-4978 Aug 8, 2024
Author

Understood. Thanks to both of you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Concerns about Applying `stride=2` in `CausalConv` and Padding Strategy for ASR #9883

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 4 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Concerns about Applying stride=2 in CausalConv and Padding Strategy for ASR #9883

raman-r-4978 Jul 25, 2024

Replies: 1 comment · 4 replies

VahidooX Aug 2, 2024 Collaborator

raman-r-4978 Aug 5, 2024 Author

scarecrow1123 Aug 8, 2024

VahidooX Aug 8, 2024 Collaborator

raman-r-4978 Aug 8, 2024 Author

Concerns about Applying `stride=2` in `CausalConv` and Padding Strategy for ASR #9883

raman-r-4978
Jul 25, 2024

Replies: 1 comment 4 replies

VahidooX
Aug 2, 2024
Collaborator

raman-r-4978 Aug 5, 2024
Author

VahidooX Aug 8, 2024
Collaborator

raman-r-4978 Aug 8, 2024
Author