Merge branch 'pathogen-data-analysis' of github.com:elixir-europe/infectious-diseases-toolkit into pathogen-data-analysis
bedroesb committed Sep 18, 2024
2 parents e981834 + 8f78aa7 commit 003700f
Showing 2 changed files with 89 additions and 2 deletions.
10 changes: 10 additions & 0 deletions _data/tool_and_resource_list.yml
@@ -1051,3 +1051,13 @@
  url: https://biit.cs.ut.ee/gprofiler/gost
  regsitry:
    biotools: gprofiler
- description: EuroHPC Joint Undertaking is a joint initiative between the EU, European countries and private partners to develop a World Class Supercomputing Ecosystem in Europe.
  id: eurohpc
  name: EuroHPC
  url: https://eurohpc-ju.europa.eu/
  regsitry:
- description: BEAUti is a graphical user-interface (GUI) application for generating BEAST XML files.
  id: beauti
  name: BEAUti
  url: https://beast.community/beauti.html
  regsitry:
81 changes: 79 additions & 2 deletions data-analysis/pathogen-characterisation.md
@@ -1,7 +1,7 @@
---
title: Pathogen characterisation
description: Generic workflows for different data types.
contributors: [Francesco Messina, Rafael Andrade Buono]
contributors: [Eva Garcia Alvarez, Francesco Messina, Fotis Psomopoulos, Rafael Andrade Buono]
page_id: pc_data_analysis
redirect_from: /pathogen-characterisation/data-analysis
related_pages:
@@ -17,7 +17,7 @@ training:
registry: Other
url: https://gxy.io/GTN:T00437
rdmkit:
- name: “Your tasks: Data Analysis
- name: Data Analysis
url: https://rdmkit.elixir-europe.org/data_analysis
faircookbook:
- name: <!---the title of the FAIR Cookbook recipe--->
@@ -65,6 +65,83 @@ When analysing pathogen data involved in a health emergency or epidemic outbreak
- Accompanied by documentation that lists all parameters and other relevant information to reproduce the findings


### Existing approaches
- **Containers and environments**: Consider using containers and environments to collect and isolate dependencies for tools and pipelines. Environment management systems, such as Conda, help with reproducibility but are not inherently portable across platforms. Containers provide a higher level of portability because they encapsulate both the software and its dependencies.
- **Web-based code collaboration platform**: Consider using a centralised location for software developers to store, manage, collaborate, and share their code. For instance, {% tool "github" %}, {% tool "gitlab" %}, or {% tool "bitbucket" %}.
- **Workflow management systems**: Allow you to formalise your workflows in a standardised format and execute them locally or on a remote computer infrastructure. Popular systems are {% tool "nextflow" %} and {% tool "snakemake" %}.
- **Workflow platforms**: Allow users to manage data, run formalised workflows, and review their results. Platforms, such as {% tool "galaxy" %}, may offer multiple interfaces, e.g. web, GUI, and APIs.
- **Reference databases**: Collect suitable reference data for the pathogens under investigation. {% tool "european-nucleotide-archive" %} and {% tool "gisaid" %} are examples of genomic databases to which researchers submit their data. In this context, the European Pathogens Portal aggregates databases relating to pathogens, as well as hosts and their vectors. Other countries host their own instance of the {% tool "pathogens-portal" %}, e.g. see the {% tool "swedish-pathogens-portal" %} [showcase](https://www.infectious-diseases-toolkit.org/showcase/swedish-pathogens-portal).
- **Workflow registries**: Register workflows in platforms, such as {% tool "workflowhub" %}, that facilitate sharing, versioning, and authorship attribution of the pipelines.


For more general information and solutions on data analysis, you may have a look at the content available on the [RDMkit data analysis page](https://rdmkit.elixir-europe.org/data_analysis#what-are-the-best-practices-for-data-analysis).
While the examples on this page focus on the genomic characterisation of pathogens, similar principles apply to other data types.

## Preprocessing

Data preprocessing is an initial step in data analysis involving the preparation of raw data for the main analysis. It is an important factor in quality control, and involves steps for the cleaning of the data, with the identification of inconsistencies, errors, and missing values. Preprocessing may also include data conversion and transformation steps to get the data in a format compatible with the expected inputs of the chosen analysis pipelines.

### Considerations

Some typical considerations involved in this step:
- **Data cleaning**: Finds and corrects errors in the data. For example, eliminating duplicates, removing genomic reads that are too short, and filtering out uninformative data such as contaminating host sequences.
- **Quality control checks**: Should be conducted at each step to ensure that the data is suitable for the intended analysis.
- **Exclusion of low-quality samples**: Samples with low quality scores should be flagged and removed. In genomics studies, samples with missing values, low sequencing depth, or contamination might be removed.
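
As a rough illustration of the cleaning considerations above, the following minimal sketch filters FASTQ reads in plain Python. It is not part of any tool listed on this page: the function names (`parse_fastq`, `mean_phred`, `clean_reads`), the thresholds, and the example reads are all hypothetical, and it assumes four-line FASTQ records with Phred+33 quality encoding. Real pipelines would normally rely on dedicated tools instead.

```python
from statistics import mean

def parse_fastq(text):
    """Yield (read_id, sequence, quality) from FASTQ text with 4-line records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        # Line i is '@id', i+1 the sequence, i+2 the '+' separator, i+3 the qualities.
        yield lines[i][1:], lines[i + 1], lines[i + 3]

def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return mean(ord(ch) - offset for ch in quality)

def clean_reads(records, min_length=5, min_mean_quality=20):
    """Drop reads that are too short, of low mean quality, or exact duplicates."""
    seen = set()
    kept = []
    for read_id, seq, qual in records:
        if len(seq) < min_length:
            continue  # too short to be informative
        if mean_phred(qual) < min_mean_quality:
            continue  # low-quality read
        if seq in seen:
            continue  # exact duplicate of a read already kept
        seen.add(seq)
        kept.append((read_id, seq, qual))
    return kept

# Hypothetical example data: one good read, one short read,
# one duplicate, and one low-quality read.
EXAMPLE_FASTQ = """\
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACG
+
III
@read3
ACGTACGTAC
+
IIIIIIIIII
@read4
TTTTGGGGCC
+
##########
"""

kept_reads = clean_reads(parse_fastq(EXAMPLE_FASTQ))
print([read_id for read_id, _, _ in kept_reads])
```

The thresholds are illustrative defaults only and should be adjusted to the sequencing technology and pathogen being studied.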

### Existing approaches

Preprocessing steps may depend on the technology used and the pathogen being studied and thus should be adjusted accordingly. Some common approaches in genomics studies include:

- Raw sequences quality check: {% tool "fastqc" %}
- Trimming out adapters and low-quality sequences: {% tool "trimmomatic" %}
- Quality checks: further information can be found on the [Quality control - Pathogen characterisation](/quality-control/pathogen-characterisation) page.
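
The quality check and trimming steps above could be chained roughly as follows. This is a sketch under several assumptions: the helper `preprocessing_commands` and the file names are hypothetical, single-end reads are assumed, and Trimmomatic is assumed to be available through a `trimmomatic` wrapper command (as provided by some package managers) rather than invoked as a Java archive. The trimming settings shown are examples, not recommendations. The sketch only builds and prints the command lines; it does not execute them.

```python
import shlex
from pathlib import Path

def preprocessing_commands(reads, out_dir, min_length=36):
    """Build command lines for: QC on raw reads, trimming, QC on trimmed reads."""
    reads = Path(reads)
    out_dir = Path(out_dir)
    trimmed = out_dir / f"trimmed_{reads.name}"

    # FastQC report on the raw reads, written to out_dir.
    qc_raw = ["fastqc", "--outdir", str(out_dir), str(reads)]

    # Single-end trimming: sliding-window quality trimming, then a length filter.
    trim = [
        "trimmomatic", "SE", "-phred33",
        str(reads), str(trimmed),
        "SLIDINGWINDOW:4:20",
        f"MINLEN:{min_length}",
    ]

    # FastQC report on the trimmed reads, to confirm the effect of trimming.
    qc_trimmed = ["fastqc", "--outdir", str(out_dir), str(trimmed)]

    return [qc_raw, trim, qc_trimmed]

# Hypothetical input file and output directory.
commands = preprocessing_commands("sample.fastq.gz", "qc")
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)`, provided the tools are installed and the output directory exists. In practice, a workflow management system would usually take over this orchestration.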

## Analysis

The analysis of data to characterise a pathogen of interest can involve methodologies from different fields. While genomics approaches are of common interest, analysis of other data types, such as proteomics and metabolomics, and their combination can be of special importance.

### Considerations

- **The computational resources**: Verify that the appropriate computational resources are available. Depending on the volume and complexity of the data, you might need to make use of large computing clusters or cloud computing resources.
- **The location of your data**: Ensure that the chosen computing infrastructure and platforms have access to the data. It is important to consider the distance between the data storage and computing, as it can significantly impact transfer times and costs.
- **Document the steps**: Report every step of the data analysis process, including the software versions, the parameters, the computing environment, and the reference genome used, as well as any “manual” data curation steps. More information on recording provenance can be found on the [Provenance pages](/provenance/).
- **Collaborative analysis**: Ensure that partners have access to the data, tools, and workflows. Systems should be in place to track changes to the tools and workflows used, and the history of modifications should be accessible to all collaborators.
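
One possible way to document an analysis step is to write a small machine-readable record next to the results. The sketch below is illustrative only: the record layout, the function `build_provenance_record`, and all values (tool name, version, reference label, parameters) are hypothetical placeholders and do not follow any specific provenance standard.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def sha256_of_bytes(data):
    """Checksum used to identify the exact input data."""
    return hashlib.sha256(data).hexdigest()

def build_provenance_record(step_name, tool_versions, parameters,
                            reference_genome, input_checksums, manual_steps=()):
    """Collect the information needed to reproduce one analysis step."""
    return {
        "step": step_name,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "tool_versions": dict(tool_versions),
        "parameters": dict(parameters),
        "reference_genome": reference_genome,
        "input_checksums": dict(input_checksums),
        "manual_steps": list(manual_steps),
    }

# All values below are placeholders for illustration.
record = build_provenance_record(
    step_name="read_mapping",
    tool_versions={"example-aligner": "0.0.0"},
    parameters={"threads": 4, "min_mapping_quality": 20},
    reference_genome="EXAMPLE_REFERENCE_v1",
    input_checksums={"sample.fastq.gz": sha256_of_bytes(b"example input bytes")},
    manual_steps=["Removed sample X after visual inspection of the QC report"],
)

print(json.dumps(record, indent=2, sort_keys=True))
```

Storing such records under version control alongside the workflow definition keeps the history of changes visible to all collaborators. Workflow management systems and workflow platforms typically capture much of this information automatically.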

### Existing approaches

There are several types of analysis that can be performed on pathogen-related data, depending on the specific research question and type of data being analysed. Here are some solutions:
- Consider using the available computational infrastructure to scale up your analysis capabilities. This may include applying for access to large computing cluster resources with e.g. {% tool "eurohpc" %} or making use of public Galaxy servers such as {% tool "galaxy-europe" %}.
- **Genomic analysis**: Including whole genome sequencing (WGS), this analysis allows the interpretation of genetic information encoded along the genome (DNA or RNA). Genomic analysis can be used for a wide range of applications to characterise many aspects of pathogen variability, such as Variants of Concern (VOC) and antimicrobial resistance (AMR) profiles in bacteria. Examples of tools that take into account the genomic characteristics of pathogens (e.g. genomic structure and size, gene annotations, mobile genetic elements) are:
  - Sequence Alignment
    - {% tool "bowtie2" %}
    - {% tool "bwa" %}
    - {% tool "samtools" %}
  - Genome Assembly
    - {% tool "canu" %}
    - {% tool "velvet" %}
    - {% tool "spades" %}
  - Phylogenetic Analysis
    - {% tool "clustalw" %}
    - {% tool "muscle" %}
    - {% tool "mafft" %}
    - {% tool "raxml" %}
    - {% tool "iqtree" %}
  - Molecular Clock
    - {% tool "mrbayes" %}
    - {% tool "beast" %}
    - {% tool "beauti" %}
  - Variant calling
    - {% tool "dragen-gatk" %}
    - {% tool "freebayes" %}
    - {% tool "varscan" %}
  - Annotation
    - {% tool "annovar" %}
    - {% tool "snpeff" %}
    - {% tool "vep" %}
    - {% tool "dbnsfp" %}
  - All-in-one Bioinformatic Tools
    - {% tool "snippy" %}
## Concrete topic 1 <!---REPLACE THIS with the name of the topic. Example: Metadata harmonisation--->

Short explanation of what this topic is about and why it is important, with an emphasis on infectious diseases and the category that you selected e.g. pathogen characterisation.