Julian PREIN
Jorge KORGUT Junior
The project contains 2 majors optimizations that can be found in the following branches:
📌 final-unchanged : This branch, contains the first version without any modification. This was the starting point of our project.
📌 final-custombitset : This was the first optimization implementation. The idea was to encode the DNA in a bitset, to gain performance by fastly comparing the wanted sequences.
📌 master : After gaining performance with the bitset, the main goal of this step was to convert all the particularities from bitset to the GPU. The propretary language CUDA was used.
For being able to launch our project, you will need CUDA nvcc compiler installed and the boost library installed inside boost folder.
Cuda : CUDA installation guide
Boost : BOOST Getting started
Extract and install the files from boost following the following folder hierarchic example. The boost folder can be found in the root of the project.
Ex : boost/boost_1_83_0
error: parameter packs not expanded with ‘...’
The error comes probably because you have installed cuda-toolkit from the unix package manager. The solution I found to this error is to uninstall every package related to cuda and then reinstall it following the exact instructions from CUDA installation guide
Could NOT find CUDA (missing: CUDA_NVCC_EXECUTABLE CUDA_CUDART_LIBRARY)
The reason for this error is the missing CUDA path directory for nvcc and the CUDA library. To fix it just add this lines into your ~/.bashrc
export PATH=/usr/local/cuda-12.3/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
After cloning the repository execute bootstrap.sh
to install and setup our
boost fork dependency.
As mentioned before, every branch is tagged with a prefix final- before the optimization type used. Therefore, for launching the project with a specific optimization, change branches to the wished one.
git checkout final-#
Then create a new folder inside projects.
mkdir project/build
Launch cmake inside the folder you have created.
cd project/build
cmake .. -DUSE_CUDA=on
Build the project
make -j
To measure the application execution time, you can execute the following unix application.
time project/build/micro_aevol_#
The objective of this project is to optimize the run of the MICRO-AEVOL application. This wide scope was purpose chosen to allow us to explore different solutions either with hardware and software optimization. In what concerns this project, 2 main axes will be focused on : algorithm run time and memory space.
By analyzing the initial code and running it on profiler tools (INTEL VTUNE), we remark that a lot of the runtime was spent traversing the DNA sequence. Therefore this had become our first optimization target.
Micro-aevol run profiled with Intel VTune showing the application hot paths
Diving into the existing code we discover one of the reasons why the application spends a significant amount of time on the DNA sequence. In summary multiples algorithms used in our application iterates over the DNA to find patterns. For example, in one of them, we need to extract the promoter in a determined position. For doing so, we cover all the DNA iterating position after position and comparing the starting sequence to a reference and constant pattern, saving its distance for later on.
For that reason, using the classical character vector c++ implementation forces the iteration over the array every time we want to reach our objectives. This approach waste a lot of memory and causes unwanted cache misses to fetch data from further and slower memories.
Therefore, the first optimization idea that pops up is to replace this structure by a bitset. But what is a bitset ?
A bitset is a fundamental data structure in computer science that is used to represent a fixed-size sequence of binary digits.
However, for our purpose, the number of digits is defined on run-time. Thus we are not able to use the simple implementation to build our data in, we are constraint to implement a dynamic version of this data structure.
With a little bit of research, we step into a dynamic bitset library called BOOST. This allows us to save up some development time and discover the meticulous details of its implementation. BOOST implements the dynamic bitset as a dynamic array of blocks that are user defined. These blocks are values and represent one unit of storage supporting the same amount of elements as in its bit length. For example : a block of type uint64_t has 64 bits of length and supports 64 elements inside of it.
The next step is to adapt the original MICRO-AEVOL code, replacing the character vector by our dynamic bitset. However, this data structure limit our use case to only 2 different genetics codes. If we want to implement the 4 different ones, the dynamic bitset would still be possible, but an adaptation to 2 instead of 1 bits is required.
Micro-aevol run time execution showing lost of performance with boost dynamic bitset. This graph was constructed running 20 times each version and then plotting the result.
The loss of time performance was expected since we simply replaced the structure but we did not use the full potential of the bitset. Even if we did not win the performance we wished for, we saved up a good amount of space in memory. Indeed, before we used the bit length of a character times the length of the sequence to represent 1 DNA and now we use the ceiling of the length of the sequence divided by the bit length of a block in the dynamic bitset.
To illustrate, if we use a block of uint_64, we economize 8 * n / (n / 64) = 512
. That is 512 times less memory to represent the same DNA in the previous version.
Knowing the full potential of bitset, the focus is on modifying the algorithm's logic to work with our new approach. To be precise, we changed the loops of 5 main functions :
- promoter_at
- terminator_at
- shine_dal_start
- protein_stop
- codon_at
The common thing we remark on all these functions is an inner loop to check if a specific pattern is found. Knowing that these patterns are constant and that they are smaller than our max available block bit size, we could heavily optimize these functions.
To benefit from the optimizations that BOOST's engineers had already think of, we added a new function to their library :
This function returns a block with the starting bit defined by pos and the last one defined by pos + len - 1. The rest of its bits are 0. In short, function uses binary operators to quickly retrieve, construct and return our target sequence.
With this new helper function, we could adapt our previous functions and surprisingly we had a really nice interface to work with.
This optimization led our program to execute faster than the previous version and the unchanged one.
Run time comparison of the starting version of Micro-Aevol and the one with custom dynamic bitset. This graph was constructed running 20 times each version and then plotting the result.
Micro-aevol custom dynamic bitset version run profiled with Intel VTune showing the application hot paths
It is interesting to notice the loss of the relative time spent on the DNA sequence and the total speedup obtained using this optimization method.
After optimization the CPU executable we switched over to the GPU. We started by
a very naive replacement of all the char
, *int8_t
and uint
arrays we could
find to a bitset in the form of block *
(i.e. uint64_t
). This obviously
broke the functionality of the program.
From here on we started to also slowly port the functionality of every function
using those arrays. This consisted mostly of adjustments in uses of indices and
lengths in loops, but also of more complex parts like introducing specific
atomic operations since different threads might need to access the same bytes
because of the denser memory layout. In this step we also partially reverted
back to uint
arrays for the arrays holding indices (e.g. terminator
,
prot_start
). At the end of this we had working code again.
When comparing the performance we first benchmarked the state of the program when we got it. For this we just turned on cuda support and increased the C++ version to 20.
$ git diff origin/dev-master-cuda 80ed9301ddd0 -- CMakeLists.txt
diff --git a/CMakeLists.txt b/CMakeLists.txt
index d55343c24c74..8efb394019db 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -1,13 +1,11 @@
cmake_minimum_required(VERSION 3.13)
project(pdc_mini_aevol)
-set(CMAKE_CXX_STANDARD 20)
-set(CMAKE_BUILD_TYPE Release)
-set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -march=native")
+set(CMAKE_CXX_STANDARD 14)
set(DO_TRACES OFF CACHE BOOL "Whether to enable the Traces library")
set(USE_OMP OFF CACHE BOOL "Whether to enable OpenMP parallelization")
-set(USE_CUDA ON CACHE BOOL "Whether to enable CUDA parallelization")
+set(USE_CUDA OFF CACHE BOOL "Whether to enable CUDA parallelization")
if ( DO_TRACES )
add_definitions(-DTRACES)
The benchmarked commit can be found on the dev-master-cuda branch. This commit would run in our tests for 28 seconds and our bitset version for 17.2 seconds.
After this we profiled our code with nvprof and got following results:
==10520== NVPROF is profiling process 10520, command: ./micro_aevol_gpu
==10520== Profiling application: ./micro_aevol_gpu
==10520== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 51.09% 8.72178s 1001 8.7131ms 8.5950ms 13.767ms sparse_meta(unsigned int, cuIndividual*)
13.63% 2.32774s 1 2.32774s 2.32774s 2.32774s clean_population_metadata(unsigned int, cuIndividual*)
12.71% 2.17036s 1001 2.1682ms 2.1056ms 3.4977ms gather_genes(unsigned int, cuIndividual*)
8.07% 1.37777s 1001 1.3764ms 1.3479ms 2.3350ms search_patterns(unsigned int, cuIndividual*)
6.86% 1.17070s 1001 1.1695ms 28.512us 2.0323ms clean_metadata(unsigned int, cuIndividual*)
3.46% 590.94ms 1001 590.35us 571.59us 877.67us find_gene_per_RNA(unsigned int, cuIndividual*)
1.73% 294.78ms 1001 294.49us 290.37us 482.63us compute_fitness(unsigned int, cuIndividual*, double const *)
1.26% 214.40ms 1001 214.18us 208.61us 356.20us compute_phenotype(unsigned int, cuIndividual*)
0.47% 80.508ms 2 40.254ms 31.915ms 48.593ms check_result(unsigned int, cuIndividual*)
0.31% 52.315ms 1001 52.263us 50.464us 82.561us translation(unsigned int, cuIndividual*)
0.18% 30.995ms 1001 30.964us 29.600us 41.632us transcription(unsigned int, cuIndividual*)
0.08% 13.131ms 1000 13.130us 12.576us 19.968us selection(unsigned int, unsigned int, cuIndividual const *, RandService*, int*)
0.06% 11.020ms 1000 11.020us 10.464us 16.160us reproduction(unsigned int, cuIndividual*, int const *, unsigned long const *)
0.06% 9.5282ms 1000 9.5280us 8.2880us 17.088us do_mutation(unsigned int, cuIndividual*, double, RandService*)
0.01% 2.4967ms 1 2.4967ms 2.4967ms 2.4967ms init_device_population(int, unsigned int, unsigned int, int, cuIndividual*, unsigned long*, unsigned char*, unsigned long*, unsigned int*, unsigned long*, unsigned int*, cuRNA*)
0.01% 2.4171ms 1001 2.4140us 2.2400us 12.192us swap_parent_child_genome(unsigned int, cuIndividual*, unsigned long*)
0.01% 1.5842ms 1028 1.5410us 1.2800us 3.2640us [CUDA memcpy DtoH]
0.00% 425.78us 1027 414ns 320ns 2.4000us [CUDA memcpy HtoD]
API calls: 85.72% 14.7989s 13013 1.1372ms 5.2830us 48.925ms cudaDeviceSynchronize
13.58% 2.34485s 2055 1.1410ms 3.4150us 2.32776s cudaMemcpy
0.31% 53.503ms 1 53.503ms 53.503ms 53.503ms cudaDeviceSetLimit
0.23% 39.463ms 13014 3.0320us 2.0730us 7.5902ms cudaLaunchKernel
0.11% 19.057ms 1 19.057ms 19.057ms 19.057ms cudaDeviceReset
0.03% 4.9337ms 13 379.51us 1.6610us 3.2099ms cudaFree
0.01% 1.7501ms 13013 134ns 85ns 300.55us cudaGetLastError
0.00% 668.90us 13 51.454us 1.7030us 160.60us cudaMalloc
0.00% 203.28us 114 1.7830us 207ns 79.210us cuDeviceGetAttribute
0.00% 17.983us 1 17.983us 17.983us 17.983us cuDeviceGetName
0.00% 11.497us 1 11.497us 11.497us 11.497us cuDeviceGetPCIBusId
0.00% 2.1560us 3 718ns 329ns 1.4750us cuDeviceGetCount
0.00% 1.1050us 2 552ns 224ns 881ns cuDeviceGet
0.00% 532ns 1 532ns 532ns 532ns cuDeviceTotalMem
0.00% 411ns 1 411ns 411ns 411ns cuModuleGetLoadingMode
0.00% 360ns 1 360ns 360ns 360ns cuDeviceGetUuid
As a next step we parallelized clean_population_metadata
with a very simple
patch to use a thread per individual instead of just one thread doing
everything:
diff --git a/project/cuda/cuExpManager.cu b/project/cuda/cuExpManager.cu
index 110ea097d75f..95f4fcffb0e5 100644
--- a/project/cuda/cuExpManager.cu
+++ b/project/cuda/cuExpManager.cu
@@ -344,7 +344,7 @@ void cuExpManager::transfer_to_host() const {
}
void cuExpManager::device_data_destructor() {
- clean_population_metadata<<<1, 1>>>(nb_indivs_, device_individuals_);
+ clean_population_metadata<<<ceil((float)nb_indivs_ / 32.0), 32>>>(nb_indivs_, device_individuals_);
RandService tmp_rand;
checkCuda(cudaMemcpy(&tmp_rand, rand_service_, sizeof(RandService), cudaMemcpyDeviceToHost));
checkCuda(cudaFree(tmp_rand.rng_counters));
@@ -568,10 +568,10 @@ __global__ void compute_fitness(uint size, cuIndividual* individuals, const doub
// Interface Host | Device
__global__ void clean_population_metadata(uint nb_indivs, cuIndividual* individuals) {
- if (threadIdx.x + blockIdx.x == 0) {
- for (int i = 0; i < nb_indivs; ++i) {
- individuals[i].clean_metadata();
- }
+ // On thread per individual
+ auto indiv_idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if (indiv_idx < nb_indivs) {
+ individuals[indiv_idx].clean_metadata();
}
}
With this we gained over 2 seconds with a final runtime of 14.9 seconds. Thus we achieved a speedup of almost 1.9x.
When looking at the output of nvprof one can clearly see that the majority of
the runtime is spent in sparse_meta
:
==10694== NVPROF is profiling process 10694, command: ./micro_aevol_gpu
==10694== Profiling application: ./micro_aevol_gpu
==10694== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 59.15% 8.70406s 1001 8.6954ms 8.5703ms 13.858ms sparse_meta(unsigned int, cuIndividual*)
14.72% 2.16590s 1001 2.1637ms 2.0989ms 3.4946ms gather_genes(unsigned int, cuIndividual*)
9.32% 1.37108s 1001 1.3697ms 1.3398ms 2.3108ms search_patterns(unsigned int, cuIndividual*)
7.96% 1.17136s 1001 1.1702ms 28.513us 1.9597ms clean_metadata(unsigned int, cuIndividual*)
4.01% 589.85ms 1001 589.26us 572.04us 886.05us find_gene_per_RNA(unsigned int, cuIndividual*)
2.00% 294.39ms 1001 294.10us 289.73us 484.13us compute_fitness(unsigned int, cuIndividual*, double const *)
1.45% 213.97ms 1001 213.76us 208.13us 357.86us compute_phenotype(unsigned int, cuIndividual*)
0.55% 80.416ms 2 40.208ms 31.854ms 48.562ms check_result(unsigned int, cuIndividual*)
0.35% 52.030ms 1001 51.978us 50.113us 82.817us translation(unsigned int, cuIndividual*)
0.21% 30.956ms 1001 30.925us 29.664us 41.728us transcription(unsigned int, cuIndividual*)
0.09% 12.962ms 1000 12.961us 12.352us 20.191us selection(unsigned int, unsigned int, cuIndividual const *, RandService*, int*)
0.08% 11.050ms 1000 11.049us 10.528us 16.352us reproduction(unsigned int, cuIndividual*, int const *, unsigned long const *)
0.06% 9.3883ms 1000 9.3880us 8.2240us 16.512us do_mutation(unsigned int, cuIndividual*, double, RandService*)
0.02% 2.5412ms 1001 2.5380us 2.3670us 12.608us swap_parent_child_genome(unsigned int, cuIndividual*, unsigned long*)
0.02% 2.5118ms 1 2.5118ms 2.5118ms 2.5118ms init_device_population(int, unsigned int, unsigned int, int, cuIndividual*, unsigned long*, unsigned char*, unsigned long*, unsigned int*, unsigned long*, unsigned int*, cuRNA*)
0.01% 1.6073ms 1028 1.5630us 1.2800us 9.5040us [CUDA memcpy DtoH]
0.01% 983.56us 1 983.56us 983.56us 983.56us clean_population_metadata(unsigned int, cuIndividual*)
0.00% 396.70us 1027 386ns 288ns 2.4320us [CUDA memcpy HtoD]
API calls: 99.00% 14.7695s 13013 1.1350ms 5.2760us 48.960ms cudaDeviceSynchronize
0.44% 65.061ms 1 65.061ms 65.061ms 65.061ms cudaDeviceSetLimit
0.27% 39.591ms 13014 3.0420us 2.1670us 6.7657ms cudaLaunchKernel
0.13% 18.682ms 1 18.682ms 18.682ms 18.682ms cudaDeviceReset
0.12% 18.269ms 2055 8.8900us 4.0870us 996.16us cudaMemcpy
0.03% 4.9751ms 13 382.70us 1.9580us 3.2105ms cudaFree
0.01% 1.5044ms 13013 115ns 81ns 298.24us cudaGetLastError
0.00% 674.46us 13 51.881us 1.7180us 160.61us cudaMalloc
0.00% 192.25us 114 1.6860us 169ns 79.173us cuDeviceGetAttribute
0.00% 16.932us 1 16.932us 16.932us 16.932us cuDeviceGetName
0.00% 10.792us 1 10.792us 10.792us 10.792us cuDeviceGetPCIBusId
0.00% 2.0680us 3 689ns 295ns 1.4280us cuDeviceGetCount
0.00% 934ns 2 467ns 196ns 738ns cuDeviceGet
0.00% 464ns 1 464ns 464ns 464ns cuModuleGetLoadingMode
0.00% 444ns 1 444ns 444ns 444ns cuDeviceTotalMem
0.00% 296ns 1 296ns 296ns 296ns cuDeviceGetUuid
This is the case since sparse_meta
only uses 3 threads instead of all of the
available ones. Our plan was to parallelize sparse_meta by combining most of its
work with search_patterns
. The problem here was that the order of the
resulting lists terminator_idxs
and prot_start_idxs
would be different (i.e.
non-deterministic) which changed the end-result. Additionally we couldn't gain
any runtime through our approach. Our implementation can be found on the
further branch. We believe that with more time one could bring
those two functions together and increase the GPU performance even further.