The framework is a crude approximation of the CMS data processing software, CMSSW. CMSSW is a rather generic framework to process independent chunks of data. In CMS these chunks of data correspond to triggered proton-proton collisions and are called events. The events are processed by modules, which in this project are all "producers" that can read objects from the event and insert new objects into the event (in CMSSW there are, in addition, analyzers, which can only read objects from the event, and filters, which can decide to stop the processing of the event).
The modules form a DAG based on their data dependencies. The modules are implemented in C++ (C++17 in general; the CUDA code is in C++14). A CMSSW job is configured in Python (this project does not provide a configuration mechanism).
The CMSSW framework is multi-threaded using Threading Building Blocks (TBB). An integral part of the multi-threading is the concept of a "concurrent event processor" that we call "a stream" (to disambiguate from CUDA streams, these streams are called "EDM streams" from now on). An EDM stream processes one event at a time ("processing" meaning that each module in the DAG is run on the event in some order respecting the data dependencies). A job may have multiple EDM streams, in which case the EDM streams process their events concurrently. Furthermore, modules processing the same event that are independent in the DAG are also run concurrently. All this potential concurrency is exposed as tasks to be run by the TBB task scheduler. We do not make any assumptions on how TBB runs these tasks in threads (e.g. the number of EDM streams and the number of threads may differ). For more information on the CMSSW framework see e.g.
- CMS TWiki pages:
  - SWGuideFrameWork
  - MultithreadedFrameworkDesignDiscussions
- The CMS Offline WorkBook (WorkBook)
- The CMS Offline SW Guide (SWGuide)
- Papers
The processing time and memory requirements can vary a lot across events. In addition, the filtering capability may affect which modules in the DAG can be run.
The main approximations of the framework in this project with respect to CMSSW are
- producer modules only, and only stream and stream-ExternalWork producers
- modules and the event products have no labels, and therefore the event can hold at most one product of each C++ type
- no run time configuration mechanism
- input data are fully read in memory at the beginning of the job
- the EventSetup system has a single (implicit) record and one IOV; all products are read from binary dumps at the beginning of the job.
Our overall aims are to avoid blocking synchronization as much as possible, and to keep all processing units (CPU cores, GPUs) as busy as we can doing useful work. It follows that we try to make all operations (memory transfers, kernel calls) asynchronous with the use of CUDA streams, and that we use callback functions (cudaStreamAddCallback()) to notify the CMSSW framework when the asynchronous work has finished.
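A minimal sketch of this pattern follows. The struct AsyncState, the function names, and the kernel are illustrative placeholders, not the framework API; only the CUDA runtime calls (cudaMemcpyAsync(), cudaStreamAddCallback()) are real.

```cuda
#include <cuda_runtime.h>

// Illustrative stand-in for whatever the framework needs to resume the
// waiting task (in CMSSW this would re-enqueue a TBB task).
struct AsyncState {
  void (*notifyFramework)(void*);
  void* frameworkData;
};

// Executed by the CUDA runtime on a driver thread once all work queued
// before the callback in this stream has completed.
static void CUDART_CB workDoneCallback(cudaStream_t /*stream*/, cudaError_t status, void* data) {
  auto* state = static_cast<AsyncState*>(data);
  if (status == cudaSuccess) {
    state->notifyFramework(state->frameworkData);
  }
}

__global__ void processKernel(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.f * in[i];  // placeholder computation
}

// Queue the whole chain asynchronously and return immediately; h_in must be
// pinned host memory for the copy to be truly asynchronous.
void launchAsyncWork(const float* h_in, float* d_in, float* d_out, int n,
                     cudaStream_t stream, AsyncState* state) {
  cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, stream);
  processKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
  cudaStreamAddCallback(stream, workDoneCallback, state, 0);
}
```

This is the essence of the ExternalWork pattern mentioned above: the module's acquire() queues the asynchronous work and returns, and produce() runs only after the callback has signalled completion.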
We use a "caching allocator" (based on the one from
CUB library) for both device and
pinned host memory allocations. This approach allows us to amortize
the cost of the cudaMalloc()
/cudaFree()
etc, while being able to
easily re-use device/pinned host memory regions for temporary
workspaces, to avoid "conservative" overallocation of memory, and to
avoid constraining the scheduling of modules to multiple devices.
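A minimal sketch of the caching idea is shown below. It is greatly simplified with respect to the CUB-based allocator actually used: device selection, stream association, and error handling are all omitted, and the class name is illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <map>
#include <mutex>
#include <vector>

// Greatly simplified caching device allocator: blocks are binned by
// power-of-two size and re-used instead of being returned to the driver.
class SimpleCachingDeviceAllocator {
public:
  void* allocate(std::size_t bytes) {
    std::size_t bin = roundUp(bytes);
    std::lock_guard<std::mutex> lock(mutex_);
    auto& freeList = freeBlocks_[bin];
    if (!freeList.empty()) {
      void* ptr = freeList.back();
      freeList.pop_back();
      return ptr;              // re-use a cached block, no cudaMalloc()
    }
    void* ptr = nullptr;
    cudaMalloc(&ptr, bin);     // cold path: allocate a new block
    sizes_[ptr] = bin;
    return ptr;
  }

  void free(void* ptr) {
    std::lock_guard<std::mutex> lock(mutex_);
    freeBlocks_[sizes_[ptr]].push_back(ptr);  // cache the block for later re-use
  }

private:
  static std::size_t roundUp(std::size_t bytes) {
    std::size_t bin = 256;
    while (bin < bytes) bin *= 2;
    return bin;
  }
  std::mutex mutex_;
  std::map<std::size_t, std::vector<void*>> freeBlocks_;
  std::map<void*, std::size_t> sizes_;
};
```

The real allocator additionally keeps per-device caches and associates each freed block with a CUDA stream and event, so that a block is handed out again only to work that is guaranteed to run after its previous user has finished.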
We use one CUDA stream for each EDM stream ("concurrent event") and each linear chain of GPU modules that pass data from one to the other in the device memory. In case of branches in the DAG of modules, additional CUDA streams are used since there is sub-event concurrency in the DAG that we want to expose to CUDA runtime.
For more information see cms-sw/cmssw:HeterogeneousCore/CUDACore/README.md.
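As an illustration of the sub-event concurrency mentioned above, the sketch below shows how a branch in the module DAG can be exposed to the CUDA runtime with an additional CUDA stream: the second branch runs in its own stream, and CUDA events order it with respect to the producer and the consumer. The kernels are placeholders, not modules of this project.

```cuda
#include <cuda_runtime.h>

__global__ void producer(float* a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] = static_cast<float>(i);
}
__global__ void branchA(const float* a, float* b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) b[i] = a[i] + 1.f;
}
__global__ void branchB(const float* a, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] * 2.f;
}
__global__ void consumer(const float* b, const float* c, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = b[i] + c[i];
}

void scheduleBranches(float* a, float* b, float* c, float* out, int n,
                      cudaStream_t mainStream, cudaStream_t extraStream) {
  constexpr int threads = 256;
  int blocks = (n + threads - 1) / threads;

  producer<<<blocks, threads, 0, mainStream>>>(a, n);

  // Record an event after the producer so the extra stream can wait on it.
  cudaEvent_t produced;
  cudaEventCreateWithFlags(&produced, cudaEventDisableTiming);
  cudaEventRecord(produced, mainStream);

  // Independent branches of the DAG run concurrently in two CUDA streams.
  branchA<<<blocks, threads, 0, mainStream>>>(a, b, n);
  cudaStreamWaitEvent(extraStream, produced, 0);
  branchB<<<blocks, threads, 0, extraStream>>>(a, c, n);

  // Join: the consumer in the main stream waits for the branch in the extra stream.
  cudaEvent_t branchBDone;
  cudaEventCreateWithFlags(&branchBDone, cudaEventDisableTiming);
  cudaEventRecord(branchBDone, extraStream);
  cudaStreamWaitEvent(mainStream, branchBDone, 0);
  consumer<<<blocks, threads, 0, mainStream>>>(b, c, out, n);

  cudaEventDestroy(produced);
  cudaEventDestroy(branchBDone);
}
```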
The pixel tracking GPU prototype consists of five modules that are run according to their data dependencies (roughly in the following order):
- BeamSpotToCUDA (A)
- SiPixelRawToClusterCUDA (B)
- SiPixelRecHitCUDA (C)
- CAHitNtupletCUDA (D)
- PixelVertexProducerCUDA (E)
The data dependencies of the modules form the following DAG
In addition, there are two modules to transfer the tracks and vertices back to the CPU (enabled only with the --transfer parameter).
With the transfers the module DAG becomes
The application reads uncompressed raw data for just the pixel detector (about 250 kB/event). This configuration is somewhat artificial: for example, almost nothing is transferred back to the CPU, at the moment there are no modules that would consume the data in the format produced by this workflow, and in offline data processing the input data are compressed.
For more information on the application see
- arXiv:2008.13461, A. Bocci et al.: Heterogeneous reconstruction of tracks and primary vertices with the CMS pixel tracker
- https://patatrack.web.cern.ch/patatrack/wiki/
- CERN EP Software Seminar 2020-07-01: D. vom Bruch and F. Pantaleo: Real-time heterogeneous event reconstruction with GPUs at CMS and LHCb during LHC Run-3
- CHEP 2019: A. Bocci: Heterogeneous online reconstruction at CMS
- ACAT 2019: A. Bocci: Towards a heterogeneous High Level Trigger farm for CMS
- Connecting The Dots 2019: F. Pantaleo: Patatrack: accelerated Pixel Track reconstruction in CMS
- F. Pantaleo: New Track Seeding Techniques for the CMS Experiment (PhD thesis)
BeamSpot transfer (BeamSpotToCUDA)
This module transfers information about the average beam collision region ("beam spot") from the host to the device for each event.
Operation | Description |
---|---|
memcpy host-to-device 44 B | Transfer BeamSpotCUDA::Data for position and other information about the beam spot |
This module essentially only transfers data from the CPU to the GPU.
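A minimal sketch of such a transfer follows. The struct is an illustrative stand-in, not the actual BeamSpotCUDA::Data definition; the point is that the host copy lives in pinned memory so that the small host-to-device copy can be queued asynchronously in the module's CUDA stream.

```cuda
#include <cuda_runtime.h>

// Illustrative stand-in for the beam spot payload; the real
// BeamSpotCUDA::Data has its own fields adding up to 44 bytes.
struct BeamSpotSketch {
  float x, y, z;                 // beam spot position
  float sigmaZ;                  // longitudinal spread
  float beamWidthX, beamWidthY;  // transverse widths
  float dxdz, dydz;              // tilt of the beam line
};

// h_bs must live in pinned host memory (e.g. from the caching allocator) and
// stay alive until the copy completes; only then is the copy truly asynchronous.
void transferBeamSpot(const BeamSpotSketch* h_bs, BeamSpotSketch* d_bs, cudaStream_t stream) {
  cudaMemcpyAsync(d_bs, h_bs, sizeof(BeamSpotSketch), cudaMemcpyHostToDevice, stream);
}
```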
Raw-to-cluster (SiPixelRawToClusterCUDA)
This module unpacks and reformats the raw data into something usable downstream, applies some calibration, and forms clusters of pixels on each pixel detector module.
The following memory transfers are done once per job on the first event from SiPixelRawToClusterCUDA.cc
Operation | Description |
---|---|
memcpy host-to-device 1.44 MB | Transfer SiPixelFedCablingMapGPU for pixel detector cabling map |
memcpy host-to-device 57.6 kB | Transfer an array of module indices to be unpacked. This is to support regional unpacking in CMSSW, even though this functionality is not used in this project. |
memcpy host-to-device 3.09 MB | Transfer gain calibration data |
memcpy host-to-device 24.05 kB | Transfer SiPixelGainForHLTonGPU for gain calibration |
memcpy host-to-device 8 B | Set the gain calibration data pointer in SiPixelGainForHLTonGPU struct |
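The last 8 B transfer in the table is an instance of a common pattern: a device-resident struct holds a pointer to a separately allocated device buffer, so the value of that device pointer is copied from the host into the corresponding field of the struct. A minimal sketch (names are illustrative, not the actual SiPixelGainForHLTonGPU code):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative stand-in for a conditions struct whose bulk payload lives in a
// separately allocated device buffer (in the spirit of SiPixelGainForHLTonGPU).
struct GainCalibrationSketch {
  const char* payload;  // device pointer to the gain calibration data
  // ... other members ...
};

// Write the value of the device pointer d_payload into the 'payload' field of
// the struct that already resides in device memory (an 8-byte transfer).
void linkPayload(GainCalibrationSketch* d_view, const char* d_payload) {
  cudaMemcpy(reinterpret_cast<char*>(d_view) + offsetof(GainCalibrationSketch, payload),
             &d_payload, sizeof(const char*), cudaMemcpyHostToDevice);
}
```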
The following CUDA operations are issued for each event from SiPixelRawToClusterGPUKernel.cu
Operation | Description |
---|---|
memcpy host-to-device 40 B | Transfer SiPixelDigisCUDA::DeviceConstView for SoA of pixel digis (= unpacked raw data) |
memset 3.6 MB | Zero unpacking error array |
memcpy host-to-device 16 B | Transfer GPU::SimpleVector<PixelErrorCompact> to provide a std::vector-like interface for the unpacking error array |
memcpy host-to-device 32 B | Transfer SiPixelClustersCUDA::DeviceConstView for SoA of pixel clusters |
memcpy host-to-device X B | Transfer raw data words |
memcpy host-to-device X B | Transfer IDs for FEDs that provided data |
pixelgpudetails::RawToDigi_kernel() | Kernel to unpack the raw data into a data structure usable by subsequent kernels |
memcpy device-to-host 16 B | Transfer GPU::SimpleVector<PixelErrorCompact>, i.e. essentially the number of errors, to host |
gpuCalibPixel::calibDigis() | Calibrate pixel digis (ADC counts) |
gpuClustering::countModules() | Fill the starting index into the ADC (etc.) arrays for each active module |
memcpy device-to-host 4 B | Transfer number of active modules to host |
gpuClustering::findClus() | Cluster digis on each pixel module |
gpuClustering::clusterChargeCut() | Select clusters whose aggregated electric charge is above a given threshold |
pixelgpudetails::fillHitsModuleStart() | Find the "location" of the first hit of each module |
memcpy device-to-host 4 B | Transfer number of pixel clusters to host |
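As an illustration of how the clustering step is organized, below is a simplified sketch in the spirit of gpuClustering::countModules() (not the actual code, which e.g. also skips invalid module ids): the digis are stored in SoA arrays with digis from the same module contiguous, and the first digi of each module is found by comparing each entry with its predecessor.

```cuda
#include <cstdint>

// Simplified sketch: find the index of the first digi of each active module.
// Assumes digis from the same module are stored contiguously in moduleId.
__global__ void countModulesSketch(const uint16_t* moduleId,  // per-digi module id
                                   uint32_t* moduleStart,     // [0] = #modules, then start indices
                                   int32_t* clusterId,        // per-digi cluster id, seeded here
                                   int numDigis) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numDigis)
    return;
  clusterId[i] = i;  // each digi starts as its own cluster
  if (i == 0 || moduleId[i] != moduleId[i - 1]) {
    // First digi of a new module: reserve a slot in moduleStart atomically.
    uint32_t slot = atomicAdd(&moduleStart[0], 1u);
    moduleStart[slot + 1] = i;
  }
}
```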
Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.
Kernel | V100 |
---|---|
pixelgpudetails::RawToDigi_kernel() | 4.8-8.9 (mean 6.1 stdev 0.5) us |
gpuCalibPixel::calibDigis() | 3.1-4.6 (mean 3.7 stdev 0.2) us |
gpuClustering::countModules() | 2.9-4.1 (mean 3.3 stdev 0.2) us |
gpuClustering::findClus() | 43-490 (mean 82 stdev 34) us |
gpuClustering::clusterChargeCut() | 12-15 (mean 12.4 stdev 0.6) us |
pixelgpudetails::fillHitsModuleStart() | 5.4-6.8 (mean 5.8 stdev 0.3) us |
RecHits (SiPixelRecHitCUDA)
This module computes the 3D position estimate for each cluster.
The following memory transfers are done once per job on the first event from SiPixelRecHitCUDA
Operation | Description |
---|---|
memcpy host-to-device 32 B | Transfer pixelCPEforGPU::ParamsOnGPU main struct of detector module parameters |
memcpy host-to-device 16 B | Transfer pixelCPEforGPU::ParamsOnGPU parameters common to all modules (thickness, pitch) |
memcpy host-to-device 3.56 kB | Transfer phase1PixelTopology::AverageGeometry parameters of module ladders in barrel, and z positions of first forward disks |
memcpy host-to-device 160 B | Transfer pixelCPEforGPU::LayerGeometry layer indices |
memcpy host-to-device 208 kB | Transfer pixelCPEforGPU::DetParams detector module parameters |
The following CUDA operations are issued for each event from PixelRecHits.cu
Operation | Description |
---|---|
memcpy host-to-device 152 B | Transfer TrackingRecHit2DSOAView for SoA of pixel hits |
gpuPixelRecHits::getHits() | Calculate the 3D position for each cluster |
setHitsLayerStart() | Set the index of the first hit for each pixel detector layer |
cudautils::countFromVector() | First kernel of four to fill a phi-binned histogram of the hits. This kernel counts the number of elements for each bin |
cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
cudautils::fillFromVector() | Last kernel of four to fill a phi-binned histogram of the hits. This kernel fills each bin with the hits |
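The last four kernels in the table implement a common count / prefix-sum / fill pattern for building the phi-binned histogram of hits. A simplified sketch of the two custom passes follows (not the actual cms::cuda histogram code); the prefix sum in between, which turns per-bin counts into bin offsets, is the cub::DeviceScan call listed above.

```cuda
#include <cstdint>

// Pass 1: count how many hits fall into each phi bin.
__global__ void countHitsPerBin(const float* phi, uint32_t* binCount,
                                int numHits, int numBins) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numHits)
    return;
  // Map phi in [-pi, pi) to a bin index in [0, numBins).
  int bin = static_cast<int>((phi[i] + 3.14159265f) / (2.f * 3.14159265f) * numBins);
  if (bin >= numBins) bin = numBins - 1;
  atomicAdd(&binCount[bin], 1u);
}

// (In between: a prefix sum over binCount produces binOffset.)

// Pass 2: fill each bin with the indices of its hits.
__global__ void fillBins(const float* phi, const uint32_t* binOffset,
                         uint32_t* binFill, uint32_t* content,
                         int numHits, int numBins) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numHits)
    return;
  int bin = static_cast<int>((phi[i] + 3.14159265f) / (2.f * 3.14159265f) * numBins);
  if (bin >= numBins) bin = numBins - 1;
  uint32_t pos = binOffset[bin] + atomicAdd(&binFill[bin], 1u);
  content[pos] = i;  // store the hit index in its phi bin
}
```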
Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.
Kernel | V100 |
---|---|
gpuPixelRecHits::getHits() | 35-39 (mean 39 stdev 2) us |
setHitsLayerStart() | 1.7-3.8 (mean 1.8 stdev 0.1) us |
cudautils::countFromVector() | 2.4-4.2 (mean 3.1 stdev 0.3) us |
cub::DeviceScanInitKernel() | 1.1-1.5 (mean 1.18 stdev 0.06) us |
cub::DeviceScanKernel() | 3.1-4.1 (mean 3.5 stdev 0.2) us |
cudautils::fillFromVector() | 2.8-4.5 (mean 3.3 stdev 0.3) us |
Pattern recognition (with Cellular Automaton) (CAHitNtupletCUDA)
This module performs the "pattern recognition":
- create pairs of pixel hits on adjacent layers
- connect the pairs to form ntuplets (triplets or quadruplets)
- fit the ntuplets with a helix
The following CUDA operations are issued for each event from CAHitNtupletGeneratorKernels.cu and BrokenLineFitOnGPU.cu
Operation | Description |
---|---|
memset 4 B | Zero the number of doublets (CA cells) |
memset 36 B | Zero counter of ntuplets |
memset 197 kB | Zero association structure from hits to ntuplets |
gpuPixelDoublets::initDoublets() | Initialize the pair/doublet finding data structure |
gpuPixelDoublets::getDoubletsFromHisto() | Create the hit pairs/doublets |
memset 131 kB | Zero ntuplet finding data structure |
kernel_connect() | Connect compatible pairs/doublets |
gpuPixelDoublets::fishbone() | Identify duplicate ntuplets (a single particle can induce multiple hits per layer because of redundancies in the detector) |
kernel_find_ntuplets() | Find ntuplets from the doublet connection graph (the actual Cellular Automaton) |
cudautils::finalizeBulk() | Finalize data structures (Histogrammer/OneToManyAssoc) filled in the previous kernel |
kernel_earlyDuplicateRemover() | Clean duplicate ntuplets |
kernel_countMultiplicity() | Count the number of ntuplets with different numbers of hits |
cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
kernel_fillMultiplicity() | Fill an nhit-binned histogram of the ntuplets |
kernel_fillHitDetIndices() | Fill, for each hit-tuple, a tuple with just the detector indices |
kernelBLFastFit<3>() | First step of fitting triplets |
kernelBLFit<3>() | Second step of fitting triplets |
kernelBLFastFit<4>() | First step of fitting quadruplets |
kernelBLFit<4>() | Second step of fitting quadruplets |
kernelBLFastFit<4>() | First step of fitting pentuplets (only first 4 hits) |
kernelBLFit<4>() | Second step of fitting pentuplets (only first 4 hits) |
kernel_classifyTracks() | Classify fitted tracks according to certain quality criteria |
kernel_fastDuplicateRemover() | Identify duplicate tracks |
kernel_countHitInTracks() | Count the number of times each hit is used in any ntuplet |
cub::DeviceScanInitKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
cub::DeviceScanKernel() | Part of cub::DeviceScan::InclusiveSum() called from cms::cuda::launchFinalize() |
kernel_fillHitInTracks() | Fill a histogram per hit with the ntuplets using that hit |
kernel_tripletCleaner() | Clean triplets |
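The heart of the Cellular Automaton step is kernel_find_ntuplets(): starting from cells (doublets) on the innermost layer pairs, chains of connected cells are followed outwards, and chains with enough hits become ntuplet candidates. A much-simplified recursive sketch of that idea follows; the data layout and names are illustrative, not the actual implementation, which runs one walk per root cell in parallel and stores the ntuplets in device-side containers.

```cuda
#include <cstdint>

// Illustrative doublet ("cell") with its outer-neighbor adjacency list.
struct CellSketch {
  int innerHit;
  int outerHit;
  const int* outerNeighbors;  // indices of cells attached to this cell's outer hit
  int numOuterNeighbors;
};

// Depth-first walk from one root cell, collecting hit chains of length >= minHits.
__device__ void findNtupletsFrom(const CellSketch* cells, int cellIndex,
                                 int* hitStack, int depth, int minHits,
                                 uint32_t* ntupletCount) {
  const CellSketch& cell = cells[cellIndex];
  hitStack[depth] = cell.innerHit;      // consecutive cells share one hit,
  hitStack[depth + 1] = cell.outerHit;  // so the chain has depth + 2 hits
  if (cell.numOuterNeighbors == 0) {
    if (depth + 2 >= minHits)
      atomicAdd(ntupletCount, 1u);  // a completed chain: would be stored as an ntuplet
    return;
  }
  for (int k = 0; k < cell.numOuterNeighbors; ++k)
    findNtupletsFrom(cells, cell.outerNeighbors[k], hitStack, depth + 1,
                     minHits, ntupletCount);
}
```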
Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.
Kernel | V100 |
---|---|
gpuPixelDoublets::initDoublets() | 1.6-8.2 (mean 4 stdev 1) us |
gpuPixelDoublets::getDoubletsFromHisto() | 28.2-195 (mean 85 stdev 27) us |
kernel_connect() | 34.0-248 (mean 83 stdev 31) us |
gpuPixelDoublets::fishbone() | 12.1-101 (mean 47 stdev 14) us |
kernel_find_ntuplets() | 53.6-366 (mean 177 stdev 51) us |
cudautils::finalizeBulk() | 1.8-2.4 (mean 2.0 stdev 0.1) us |
kernel_earlyDuplicateRemover() | 6.1-27 (mean 15 stdev 3) us |
kernel_countMultiplicity() | 2-10 (mean 4.5 stdev 1.4) us |
cub::DeviceScanInitKernel() | 0.9-1.7 (mean 1.07 stdev 0.09) us |
cub::DeviceScanKernel() | 1.6-3.9 (mean 1.7 stdev 0.1) us |
kernel_fillMultiplicity() | 2.3-10 (mean 4.5 stdev 1.2) us |
kernel_fillHitDetIndices() | 3.0-3.7 (mean 3.2 stdev 0.2) us |
kernelBLFastFit<3>() | 6.1-12 (mean 7.1 stdev 0.5) us |
kernelBLFit<3>() | 47-66 (mean 50 stdev 3) us |
kernelBLFastFit<4>() | 6.4-10 (mean 7.7 stdev 0.7) us |
kernelBLFit<4>() | 60-76 (mean 62 stdev 2) us |
kernelBLFastFit<4>() | 5.0-8.7 (mean 5.8 stdev 0.4) us |
kernelBLFit<4>() | 21-27 (mean 23 stdev 1) us |
kernel_classifyTracks() | 3.2-6.3 (mean 3.9 stdev 0.4) us |
kernel_fastDuplicateRemover() | 8.8-33 (mean 21 stdev 4) us |
kernel_countHitInTracks() | 2.2-4.2 (mean 3.5 stdev 0.2) us |
cub::DeviceScanInitKernel() | 1.1-1.7 (mean 1.2 stdev 0.1) us |
cub::DeviceScanKernel() | 3.8-5.0 (mean 4.2 stdev 0.2) us |
kernel_fillHitInTracks() | 3.2-4.6 (mean 2.6 stdev 0.2) us |
kernel_tripletCleaner() | 3.7-11 (mean 6.4 stdev 1.1) us |
Vertexing (PixelVertexProducerCUDA)
This module reconstructs vertices, i.e. it finds clusters of track "origin points" that likely correspond to proton-proton collision points.
The following CUDA operations are issued for each event from gpuVertexFinderImpl.h
Operation | Description |
---|---|
gpuVertexFinder::init() | Zero the data structures |
gpuVertexFinder::loadTracks() | Fill the vertexing data structures with information from the tracks |
gpuVertexFinder::vertexFinderOneKernel() | Vertex reconstruction in one kernel: cluster tracks into vertices, fit vertex parameters (first pass), split vertices that are likely merged from multiple proton-proton interactions, fit vertex parameters again (second pass, after splitting), and sort the vertices by the sum of the squared transverse momenta of their tracks |
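The final ranking step inside vertexFinderOneKernel() can be illustrated with a simple sketch (not the actual code): each track adds the square of its transverse momentum to the vertex it is associated with, and the vertices are then sorted by this sum.

```cuda
#include <cstdint>

// Simplified sketch of the vertex ranking criterion: accumulate sum(pT^2)
// of the tracks assigned to each vertex.  The vertex indices are then
// sorted by this score (highest sum(pT^2) first).
__global__ void sumPt2PerVertex(const float* trackPt,       // pT of each track
                                const int32_t* trackVertex, // vertex index per track, -1 if unassigned
                                float* vertexSumPt2,        // per-vertex sum of pT^2 (zero-initialized)
                                int numTracks) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numTracks)
    return;
  int iv = trackVertex[i];
  if (iv < 0)
    return;  // track not used by any vertex
  float pt = trackPt[i];
  atomicAdd(&vertexSumPt2[iv], pt * pt);
}
```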
Times below report minimum, maximum, mean, and standard deviation in microseconds/event. They are measured with one concurrent event (with two threads in practice) on an otherwise empty node, over the 1000 events.
Kernel | V100 |
---|---|
gpuVertexFinder::init() | 1.1-1.7 (mean 1.13 stdev 0.07) us |
gpuVertexFinder::loadTracks() | 2.7-4.1 (mean 3.4 stdev 0.2) us |
gpuVertexFinder::vertexFinderOneKernel() | 63.9-237 (mean 100 stdev 21) us |
Transfer tracks to host (PixelTrackSoAFromCUDA)
This module transfers the pixel tracks from the device to the host.
Operation | Description |
---|---|
memcpy device-to-host 3.97 MB | Transfer pixeltrack::TrackSoA |
Transfer vertices to host (PixelVertexSoAFromCUDA)
This module transfers the pixel vertices from the device to the host.
Operation | Description |
---|---|
memcpy device-to-host 117 kB | Transfer ZVertexSoA |