- Brief description of the scripts used to prepare the RDS from UCSC
- Brief description of the DEGs scripts
- Workflow
- UCSC_metadata_parsing.R - R script to collect metadata about the UCSC potential datasets to analyze
- Eze_Nowa_rds.R - R script to create the SeuratObjects from the expression matrices for the Nowakowski and Eze projects
- Eze_Nowa_integration.R - R script to integrate Eze_2021 and Nowakowski_2017 datasets
- Velmeshev_split.R - script to prepares files for split_big_ExprMtx.sh
- split_big_ExprMtx.sh - bash script to split a heavy TSV.GZ file into multiple smaller TSV.GZ files according to indexes provided in .TXT files
- Velmeshev_furu_rds.R - R script to create the RDS SeuratObjects for the larger Velmeshev subsets (2nd trimester, 10-20 years)
- Velmeshev_rds.R - R script to create the RDS SeuratObjects for the smaller Velmeshev subsets
- prep_DEG_files.R - script containing functions for preparing files for the DEG analysis, run in Velmeshev_furu_rds.R and Velmeshev_rds.R
- UCSC_integration_2nd_trimester_all.R - R script to integrate all datasets with data from the 2nd trimester (Eze, Nowakowski, Velmeshev) - just to verify that we do not have separation based on dataset alone
- UCSC_Velmeshev_all_ages_furu.R - R script to integrate all previous RDS SeuratObjects from Velmeshev
The RDS from UCSC were filtered according to the following steps:
- excluded projects that did not contain data from both sexes (at least 2-3/samples per sex) -> Velmeshev 4-10 years excluded, and Eze-Nowakowski integrated
- for each dataset, kept only the cell types that had samples from both sexes
- lastly, cell types with less than 100 cells per project and sex were not analyzed
Before we ran this bash script for the Velmeshev dataset, we first removed all rows which contained only 0s, which represent genes with no expression across all cells. We did it by running the following command:
gunzip -c exprMatrix_Velmeshev_2022.tsv.gz | awk '{s=0; for (i=2;i<=NF;i++) s+=$i; if (s!=0)print}' | gzip > exprMtx_filt_Velmeshev_2022.tsv.gz
The following script does not mantain the header; however, we can easily retrieve it from the metadata in R, since the order of the samples in the column in the expression matrix and in the rows in the metadata cell column is the same.
Arguments of the script
The script takes the following arguments:
- -i - input expression matrix to be split in multiple subsets (extract different columns)
- -b - input directory where the indexes of the columns to be extracted for each subset are saved (separate .TXT files, indexes listed in one line with commas as separator -> see the object split_indexes in Velmeshev_rds.R
- -o - the output directory where to save the split expression matrices
Example
An example of how to run the script can be found below:
./split_big_ExprMtx.sh -i exprMtx_filt_Velmeshev_2022.tsv.gz -b Velmeshev_split -o Velmeshev_outs
Notes
- Please note that if no argument is given, the script may raise error or overwrite files
- Please note that the script may raise errors if the indexes are not sorted in ascending order
- In our case, we had to split the indexes of three subsets in two since the list of columns was too long
- Please note that for 3 subsets (0_1_years_1, 2nd_trimester_2, 3rd_trimester_2), the expression matrices were obtained in R instead of bash, since the script was still raising errors - we ackowledge this is a limitation of the bash script and if this step becomes a routine step in the workflow, improvements in the script will be tested
From the UCSC Cell Browser, we used the following dataset for the second trimester integration:
- Nowakowski et al. 2017 - Spatiotemporal Gene Expression Trajectories Reveal Developmental Hierarchies of the Human Cortex (paper, UCSC dataset)
- Eze et al. 2021 - Heterogeneity of Human Neuroepithelial Cells and Early Radial Glia (paper, UCSC dataset)
These two datasets all contained fetal samples, with both female and male samples. However, since the studies were likely not designed with a sex comparison in mind, we could find little number of age-matching samples between females and males within the same dataset. Therefore, we decided to integrate the two datasets and group the samples according to the gestational trimester, instead of by gestational week. This strategy also allowed for better comparison with the results from the Velmeshev analysis. The scripts for the RDS and integration are found above.
- 01B_plot_num_genes_func.R - functions to extract the common DEGs among F and M DEGs per each subtype, and plot the results
- 01C_num_chr_func.R - functions to map the DEGs obtained from 01B to the genome and plot the fraction of DEGs belonging to X, Y or autosomial chromosome; it also plots a heatmap of the X- and Y-genes expression across the different subtypes
- 01D_Xpar1,2_func.R - script to calculate and plot the number of genes belonging to Xpar1 and Xpar2
- 02A_HyperGeom_func.R - functions to calculate the enrichment of sex chromosomes compared to autosomial genes - used in 01C to add significance to plot
- 02B_ARE_ERE_func.R - functions to calculate and plot the ARE and ERE sites percentages in the cell types, separated F and M, but in the individual projects instead of on the intersected DEGs
- 02C_Conservation_func.R - functions to plot the conserved fraction of DEGs across mammals and primates
- all_scripts_Velmeshev_2nd_trim.R - script containing the previous scripts, 01A-02C, to be run for the Velmeshev 2nd trimester
- all_scripts_Velmeshev_3rd_trim.R - script containing the previous scripts, 01A-02C, to be run for the Velmeshev 3rd trimester
- all_scripts_Velmeshev_0_1_years.R - script containing the previous scripts, 01A-02C, to be run for the Velmeshev 0-1 years
- all_scripts_Velmeshev_1_2_years.R - script containing the previous scripts, 01A-02C, to be run for the Velmeshev 1-2 years
- all_scripts_Velmeshev_2_4_years.R - script containing the previous scripts, 01A-02C, to be run for the Velmeshev 2-4 years
- all_scripts_Velmeshev_10_20_years.R - script containing the previous scripts, 01A-02C, to be run for the Velmeshev 10-20 years
- all_scripts_Velmeshev_Adult.R - script containing the previous scripts, 01A-02C, to be run for the Velmeshev Adults
- DEGs_Eze_Nowa.R - script to generate the SG-biased DEGs used in the DEGs_2nd_trimester.R script
The scripts should be run in the following order:
- UCSC_metadata_parsing.R
- Eze_Nowa_rds.R BEFORE Eze_Nowa_integration.R
- Velmeshev_split.R
- split_big_ExprMtx.sh
- Velmeshev_furu_rds.R, Velmeshev_rds.R
- UCSC_integration_2nd_trimester_all.R
- UCSC_Velmeshev_all_ages_furu.R
- all_scripts_Velmeshev_2nd_trim.R, all_scripts_Velmeshev_3rd_trim.R, all_scripts_Velmeshev_0_1_years.R, all_scripts_Velmeshev_1_2_years.R, all_scripts_Velmeshev_2_4_years.R, all_scripts_Velmeshev_10_20_years.R, all_scripts_Velmeshev_Adult.R, DEGs_Eze_Nowa.R - in no particular order