Optimization using cuFFT Callbacks

Algorithmic Steps

We'll consider 3 of the kernels within the inner loop of the PyCBC analysis code.

correlate

multiples a complex template and data segment together to create the 'correlation' vector
ifft

Using cuFFT, we take the correlation vector and inverse FFT it to produce an SNR time series. This is a C2C out of place transform (2^20 points unless otherwise noted)
threshold

Peaks in the SNR timeseries are located by windows maximums within a window size.

Callback Variations

No callback used
Fuse the correlate and iFFT together
(2) plus simply return zero for the second half of the correlation vector.

Future Variations that should be safe

Store template as real phase, calculate both phases within input callback
Store SNR timeseries as fp16
Move portion of thresholding code into output callback.
Use exact non-zero input boundaries (How do we pass in the boundaries?)
Use exact non-corrupted output boundaries (How do we pass in the boundaries?)

Future Variations that need accuracy checking

Load template and/or data as fp16

Callback wishlist

Callbacks for input/output of intermediate steps in the FFT

Notes on Benchmarking

CUDA 6.5
Vector sizes are 2^20 points complex single precision unless otherwise noted
FFT batching is not used.
in-situ analysis on Gaussian data
host performance and GPU dead time are ignored.
Results obtained by running under COMPUTE_PROFILE=1 and parsing the resultant log files.

Tesla K10 Results

No Callbacks

Operation	Kernel	Average gputime (µs)
FFT	_ZN12spRadix0256B10k	208.00 x 2
FFT	_ZN12spRadix0016B10k	180.25
Correlate	correlate	175.18
Threshold	threshold_and_cluste	75.41
Threshold	threshold_and_cluste	5.14

Fuse Correlate FFT

Operation	Kernel	Average gputime (µs)
FFT	_ZN12spRadix0256C18k (callback)	319.51
FFT	_ZN12spRadix0256B10k	198.09
FFT	_ZN12spRadix0016B10k	181.27
Threshold	threshold_and_cluste	75.48
Threshold	threshold_and_cluste	5.23

__device__ cufftComplex in_call(void* input, size_t offset,
                            void* caller_info, void* shared) {
    cufftComplex r;

    cufftComplex s = ((cufftComplex*) input)[offset];
    cufftComplex h = ((cufftComplex*) %s)[offset];

    r.x = h.x * s.x + h.y * s.y;
    r.y = h.x * s.y - h.y * s.x;

    return r;
}

Fuse Correlate FFT, and return zeros for second half of vector

Operation	Kernel	Average gputime (µs)
FFT	_ZN12spRadix0256C18k (callback)	254.62
FFT	_ZN12spRadix0256B10k	198.03
FFT	_ZN12spRadix0016B10k	181.27
Threshold	threshold_and_cluste	75.58
Threshold	threshold_and_cluste	5.14

__device__ cufftComplex in_call(void* input, size_t offset,
                            void* caller_info, void* shared) {
    if (offset > %s)
        return (cufftComplex){0, 0};
    else{
        cufftComplex r;

        cufftComplex s = ((cufftComplex*) input)[offset];
        cufftComplex h = ((cufftComplex*) %s)[offset];

        r.x = h.x * s.x + h.y * s.y;
        r.y = h.x * s.y - h.y * s.x;

        return r;
    }

}

Curious Results using Store Callback

Adding a callback for the store operation has very strange results at the moment. Using the (3) load callback and adding a simple store callback that just writes out the results with no changes made, results in a very large performance regression in that last FFT kernel. One might expect there to be some overhead using the store callback, and so would be slightly slower than the (3) case alone.

Store callback that simply sets the output

Operation	Kernel	Average gputime (µs)
FFT	_ZN12spRadix0256C18k (load callback)	254.61
FFT	_ZN12spRadix0256B10k	198.12
FFT	_ZN12spRadix0016A18k (store callback)	558.46
Threshold	threshold_and_cluste	75.65
Threshold	threshold_and_cluste	5.14

__device__ void out_call(void *out, size_t offset, cufftComplex element,
                         void *caller_info, void *shared){
       ((cufftComplex*) out)[offset] = element;
}

Store callback that does nothing (no output written at all!)

Even if the store callback is a NOOP so that nothing is written to the output memory, it still seems to be anomalously slow. Even with some overhead, I would have expected this to be faster than the non-callback case as you save on the memory writes.

Operation	Kernel	Average gputime (µs)
FFT	_ZN12spRadix0256C18k (load callback)	254.97
FFT	_ZN12spRadix0256B10k	197.76
FFT	_ZN12spRadix0016A18k (store callback)	478.47
Threshold	threshold_and_cluste	76.08
Threshold	threshold_and_cluste	4.64

__device__ void out_call(void *out, size_t offset, cufftComplex element,
                         void *caller_info, void *shared){
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly