diff --git a/_data/tool_and_resource_list.yml b/_data/tool_and_resource_list.yml index 77a6ff07..4e0fe0ba 100644 --- a/_data/tool_and_resource_list.yml +++ b/_data/tool_and_resource_list.yml @@ -1051,3 +1051,13 @@ url: https://biit.cs.ut.ee/gprofiler/gost regsitry: biotools: gprofiler +- description: EuroHPC Joint Undertaking is a joint initiative between the EU, European countries and private partners to develop a World Class Supercomputing Ecosystem in Europe. + id: eurohpc + name: EuroHPC + url: https://eurohpc-ju.europa.eu/ + regsitry: +- description: BEAUti is a graphical user-interface (GUI) application for generating BEAST XML files. + id: beauti + name: BEAUti + url: https://beast.community/beauti.html + regsitry: diff --git a/data-analysis/pathogen-characterisation.md b/data-analysis/pathogen-characterisation.md index bed9d797..63173a0a 100644 --- a/data-analysis/pathogen-characterisation.md +++ b/data-analysis/pathogen-characterisation.md @@ -1,7 +1,7 @@ --- title: Pathogen characterisation description: Generic workflows for different data types. -contributors: [Francesco Messina, Rafael Andrade Buono] +contributors: [Eva Garcia Alvarez, Francesco Messina, Fotis Psomopoulos, Rafael Andrade Buono] page_id: pc_data_analysis redirect_from: /pathogen-characterisation/data-analysis related_pages: @@ -17,7 +17,7 @@ training: registry: Other url: https://gxy.io/GTN:T00437 rdmkit: - - name: “Your tasks: Data Analysis” + - name: Data Analysis url: https://rdmkit.elixir-europe.org/data_analysis faircookbook: - name: @@ -65,6 +65,83 @@ When analysing pathogen data involved in a health emergency or epidemic outbreak - Accompanied by documentation that lists all parameters and other relevant information to reproduce the findings +### Existing approaches +- **Container and environments**: Consider using containers and environments to collect and isolate dependencies for tools and pipelines. 
Environment management systems, such as Conda, help with reproducibility but are not inherently portable across platforms. Containers provide a higher level of portability, being able to encapsulate both the software and its dependencies. +- **Web-based code collaboration platform**: Consider using a centralised location for software developers to store, manage, collaborate, and share their code. For instance, {% tool "github" %}, {% tool "gitlab" %}, or {% tool "bitbucket" %}. +- **Workflow management systems**: Allow you to formalise your workflows in a standardised format and execute them locally or on a remote computer infrastructure. Popular systems are {% tool "nextflow" %} and {% tool "snakemake" %}. +- **Workflow platforms**: Allow users to manage data, run formalised workflows, and review their results. Platforms, such as {% tool "galaxy" %}, may offer multiple interfaces, e.g. web, GUI, and APIs. +- **Reference databases**: Collect suitable reference data about the pathogens to be investigated. {% tool "european-nucleotide-archive" %} and {% tool "gisaid" %} are examples of genomic databases to which researchers submit their data. In this context, the European Pathogens Portal aggregates databases relating to pathogens, as well as hosts and their vectors. Other countries host their own instance of the {% tool "pathogens-portal" %}, e.g. see the {% tool "swedish-pathogens-portal" %} [showcase](https://www.infectious-diseases-toolkit.org/showcase/swedish-pathogens-portal). +- **Workflow registries**: Register workflows in platforms, such as {% tool "workflowhub" %}, that facilitate sharing, versioning, and authorship attribution of the pipelines. + + +For more general information and solutions on data analysis, you may have a look at the content available on the +[RDMkit data analysis page](https://rdmkit.elixir-europe.org/data_analysis#what-are-the-best-practices-for-data-analysis). 
+While the examples on this page focus on the genomic characterisation of pathogens, similar principles apply to other data types. + +## Preprocessing + +Data preprocessing is an initial step in data analysis involving the preparation of raw data for the main analysis. It is an important factor in quality control, and involves steps for the cleaning of the data, with the identification of inconsistencies, errors, and missing values. Preprocessing may also include data conversion and transformation steps to get the data in a format compatible with the expected inputs of the chosen analysis pipelines. + +### Considerations + +Some typical considerations involved in this step: +- **Data cleaning**: Finds and corrects errors in the data. For example, eliminating duplicates, removing genomic reads that are too short, and trimming uninformative content such as contaminating host data. +- **Quality control checks**: Should be conducted at each step to ensure that the data is suitable for the intended analysis. +- **Exclusion of low-quality samples**: Samples with low-quality scores should be marked and removed. In genomics studies, samples with missing values, low sequencing depth, or contamination might be removed. + +### Existing approaches + +Preprocessing steps may depend on the technology used and the pathogen being studied and thus should be adjusted accordingly. Some common approaches in genomics studies include: + +- Raw sequence quality check: {% tool "fastqc" %} +- Trimming out adapters and low-quality sequences: {% tool "trimmomatic" %} +- Quality checks: further information can be found on the [Quality control - Pathogen characterisation](/quality-control/pathogen-characterisation) page. + +## Analysis + +The analysis of data to characterise a pathogen of interest can involve methodologies from different fields. 
While genomics approaches are of common interest, analysis of other data types, such as proteomics and metabolomics, and their combination can be of special importance. + +### Considerations + +- **The computational resources**: Verify that the appropriate computational resources are available. Depending on the volume and complexity of the data, you might need to make use of large computing clusters or cloud computing resources. +- **The location of your data**: Ensure that the chosen computing infrastructure and platforms have access to the data. It is important to consider the distance between the data storage and computing, as it can significantly impact transfer times and costs. +- **Document the steps**: Report every step of the data analysis process, including the software versions employed, parameters utilised, the computing environment employed, reference genome used, as well as any “manual” data curation steps. More information on recording provenance can be found on the [Provenance pages](/provenance/). +- **Collaborative analysis**: It is important that partners have access to the data, tools, and workflows. It is crucial that systems are in place to track changes to the tools and workflows used, and that the history of modifications is accessible to all collaborators. + +### Existing approaches + +There are several types of analysis that can be performed on pathogen-related data, depending on the specific research question and type of data being analysed. Here are some solutions: +- Consider using the available computational infrastructure to scale up your analysis capabilities. This may include applying for access to large computing cluster resources with e.g. {% tool "eurohpc" %} or making use of public Galaxy servers such as {% tool "galaxy-europe" %}. +- **Genomic analysis**: Including whole genome sequencing (WGS), this analysis allows the interpretation of genetic information encoded along the genome (DNA or RNA). 
Genomic analysis can be used for a wide range of applications to characterise many aspects of pathogen variability, such as Variants of Concern (VOC) and antimicrobial resistance profiles in bacteria (AMR). Examples of tools that allow us to take into account the genomic characteristics of pathogens (e.g. genomic structure and size, gene annotations, mobile genetic elements) are: + - Sequence Alignment + - {% tool "bowtie2" %} + - {% tool "bwa" %} + - {% tool "samtools" %} + - Genome Assembly + - {% tool "canu"%} + - {% tool "velvet" %} + - {% tool "spades" %} + - Phylogenetic Analysis + - {% tool "clustalw" %} + - {% tool "muscle" %} + - {% tool "mafft" %} + - {% tool "raxml" %} + - {% tool "iqtree" %} + - Molecular Clock + - {% tool "mrbayes" %} + - {% tool "beast" %} + - {% tool "beauti" %} + - Variant calling + - {% tool "dragen-gatk" %} + - {% tool "freebayes" %} + - {% tool "varscan" %} + - Annotation + - {% tool "annovar" %} + - {% tool "snpeff" %} + - {% tool "vep" %} + - {% tool "dbnsfp" %} + - All-in-one Bioinformatic Tools + - {% tool "snippy" %} ## Concrete topic 1 Short explanation of what this topic is about and why it is important, with an emphasis on infectious diseases and the category that you selected e.g. pathogen characterisation.