This project uses kernel methods to classify ethnic groups from the father's and mother's allele sequences. For details of the implementation, refer to report.pdf.
The code for the data preprocessing can be found in the preprocessing directory.
The code for building the kernel matrices from the processed data can be found in the kernel_matrices directory.
The code for training the models and making predictions can be found in the kernel_methods directory.
Some data exploration and visualizations can be found in the Jupyter notebooks and R files in the notebooks directory.
The data directory contains the processed data used for modeling and discussed in the report.
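As a rough illustration of what the kernel matrix and kernel method steps involve, the sketch below builds a Gram matrix from a simple position-match kernel and classifies a new sample by its nearest class mean in the kernel's feature space. This is not the code from kernel_matrices or kernel_methods: the kernel, the classifier, the function names, and the toy sequences and labels are all assumptions chosen for illustration. See report.pdf for the methods actually used.

```python
def match_kernel(x, y):
    """Count the positions at which two equal-length allele sequences agree.

    Illustrative kernel only (a sum of per-position delta kernels);
    it is not necessarily the kernel used in this project.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(x, y))


def gram_matrix(samples, kernel):
    """Build the symmetric matrix K[i][j] = kernel(samples[i], samples[j])."""
    n = len(samples)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = float(kernel(samples[i], samples[j]))
    return K


def predict_nearest_mean(x, train, labels, kernel):
    """Assign x to the class whose mean is closest in the kernel feature space.

    Squared distance to a class mean, written with kernel evaluations only:
    k(x, x) - 2 * mean_i k(x, x_i) + mean_{i,j} k(x_i, x_j)
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(labels)):
        members = [s for s, l in zip(train, labels) if l == label]
        m = len(members)
        cross = sum(kernel(x, s) for s in members) / m
        within = sum(kernel(s, t) for s in members for t in members) / (m * m)
        dist = kernel(x, x) - 2.0 * cross + within
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy, made-up allele sequences and labels (not taken from the project's data).
train = [
    ("A1", "B7", "C3"),
    ("A1", "B7", "C4"),
    ("A2", "B8", "C4"),
]
labels = ["group_a", "group_a", "group_b"]

K = gram_matrix(train, match_kernel)
prediction = predict_nearest_mean(("A2", "B8", "C3"), train, labels, match_kernel)
```

Precomputing the Gram matrix once is what makes it possible to keep kernel construction and model training in separate steps, as the directory layout above does.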
To reproduce the results explained in the report, first download the raw data files that belong in the data/raw_data directory; they are not included here because their total size is around 5 GB. The four files are CSVs:
- African American Population Data
- Estonian Population Data
- Korean Population Data
- Palestinian Population Data
Once downloaded, they must be copied into the data/raw_data directory so that the data preprocessing scripts in the preprocessing folder work properly and generate the processed_data.csv file in the data directory.
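Before running the preprocessing scripts, it can help to confirm that all four raw files are in place. The following is a minimal sketch of such a check. The file names in EXPECTED_RAW_FILES are hypothetical placeholders, since the actual names of the downloaded CSVs are not stated here; adjust them to match the files you downloaded.

```python
from pathlib import Path

# Hypothetical file names: replace with the real names of the downloaded CSVs.
EXPECTED_RAW_FILES = [
    "african_american.csv",
    "estonian.csv",
    "korean.csv",
    "palestinian.csv",
]


def missing_raw_files(raw_dir, expected=EXPECTED_RAW_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files("data/raw_data")
    if missing:
        print("Missing raw data files:", ", ".join(missing))
    else:
        print("All raw data files found.")
```

Run the check from the repository root so that the relative path data/raw_data resolves correctly.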