Merge branch 'development'

exomiser · Feb 29, 2024 · bc26fd3
2 parents 48875d2 + d2771f7
commit bc26fd3

Showing 391 changed files with 110,807 additions and 25,677 deletions.
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -8,7 +8,7 @@ jobs:
     docker:
       # was circleci/openjdk:8-jdk but something changed and the tests failed on the fork
       #- image: circleci/openjdk@sha256:3640c4f42886e796e805c23af48b0d7348dc1d3fa8dae9a365e1f023f913c795
-      - image: circleci/openjdk:11.0.4-jdk-stretch
+      - image: cimg/openjdk:17.0.7
     steps:
       - checkout
       - run: chmod +x mvnw

diff --git a/.github/workflows/maven.yml b/.github/workflows/maven.yml
@@ -10,9 +10,9 @@ name: Java CI with Maven
 
 on:
   push:
-    branches: [ "master", "develop" ]
+    branches: [ "master", "development" ]
   pull_request:
-    branches: [ "master", "develop" ]
+    branches: [ "master", "development" ]
 
 jobs:
   build:
@@ -21,10 +21,10 @@ jobs:
 
     steps:
     - uses: actions/checkout@v3
-    - name: Set up JDK 11
+    - name: Set up JDK 17
       uses: actions/setup-java@v3
       with:
-        java-version: '11'
+        java-version: '17'
         distribution: 'temurin'
         cache: maven
     - name: Build with Maven

diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -0,0 +1,35 @@
+# Read the Docs configuration file for Sphinx projects
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+# Required
+version: 2
+
+# Set the OS, Python version and other tools you might need
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.12"
+    # You can also specify other tool versions:
+    # nodejs: "20"
+    # rust: "1.70"
+    # golang: "1.20"
+
+# Build documentation in the "docs/" directory with Sphinx
+sphinx:
+  configuration: docs/conf.py
+  # You can configure Sphinx to use a different builder, for instance use the dirhtml builder for simpler URLs
+  # builder: "dirhtml"
+  # Fail on all warnings to avoid broken references
+  # fail_on_warning: true
+
+# Optionally build your docs in additional formats such as PDF and ePub
+formats:
+ - pdf
+ - epub
+
+# Optional but recommended, declare the Python requirements required
+# to build your documentation
+# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
+python:
+ install:
+   - requirements: docs/requirements.txt
diff --git a/README.md b/README.md
@@ -56,7 +56,7 @@ For further instructions on installing and running please refer to the [README.m
 
 #### Running it
 
-Please refer to the [manual](http://exomiser.github.io/Exomiser/) for details on how to configure and run the Exomiser.
+Please refer to the [manual](https://exomiser.readthedocs.io/en/latest/) for details on how to configure and run the Exomiser.
 
 #### Demo site
 

diff --git a/docs/acmg_assignment.rst b/docs/acmg_assignment.rst
@@ -30,13 +30,24 @@ Computational and Predictive Data
 PVS1
 ----
 Variants must have a predicted loss of function effect, be in a gene with known disease associations and have a gene
-constraint LOF O/E < 0.7635 (gnomAD 2.1.1) to suggest that a gene is LoF intolerant. Variants not predicted to lead to
+constraint LOF O/E < 0.7635 (gnomAD 4.0) to suggest that a gene is LoF intolerant. Variants not predicted to lead to
 NMD (those located in the last exon) will have the modifier downgraded to Strong.
 
+PS1
+---
+Variants with the same amino acid change as previously reported P/LP missense or in-frame indel ClinVar variants will be
+assigned `PS1` with a strength of `Strong` for variants >= 2 stars, `Moderate` for variants with 1 star or `Supporting`
+for those without a ClinVar star rating.
+
 PM4
 ---
 Stop-loss and in-frame insertions or deletions, not previously assigned a `PVS1` criterion are assigned `PM4`.
 
+PM5
+---
+Variants having a novel missense change to an amino acid where a previously reported ClinVar P/LP variant has been seen
+will be assigned `PM5` with a strength of `Moderate` for those with >=2 stars or `Supporting` otherwise.
+
 PP3 / BP4
 ---------
 If REVEL is chosen as a pathogenicity predictor for missense variants, `PP3` and `BP4` are assigned using the modifiers
@@ -46,6 +57,16 @@ Note that this suggests the use of modifiers up to Strong in the case of pathoge
 Otherwise, an ensemble-based approach will be used for other pathogenicity predictors as per the original 2015 guidelines.
 It should be noted we found better performance using the REVEL-based approach when testing against the 100K genomes data.
 
+Functional Data
+===============
+PM1
+---
+Missense variants and in-frame indels are assigned `PM1` if the surrounding region of 25 nucleotides either side of the variant
+contains at least 4 reported P/LP variants in ClinVar and no B/LB variants. If the number of P/LP variants is greater
+than the number of VUS in the region the strength will be assigned `Moderate` but regions containing P/LP <= VUS
+(and no B/LB) will have the strength downgraded to `Supporting`.
+
+
 Segregation Data
 ================
 BS4
@@ -158,16 +179,16 @@ conjunction with a disorder, the assigned criteria with any modifiers and the fi
               ],
               "frequencyData": {
                 "rsId": "rs121918506",
-                "score": 1
+                "frequencyScore": 1
               },
               "pathogenicityData": {
                 "clinVarData": {
                   "alleleId": "28333",
                   "primaryInterpretation": "LIKELY_PATHOGENIC",
                   "reviewStatus": "criteria provided, single submitter"
                 },
-                "score": 0.965,
-                "predictedPathogenicityScores": [
+                "pathogenicityScore": 0.965,
+                "pathogenicityScores": [
                   {
                     "source": "REVEL",
                     "score": 0.965
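
Editor's sketch (not part of the commit): the strength rules that the docs/acmg_assignment.rst hunk above adds for `PS1`, `PM5` and `PM1` can be read as three small decision functions. The Python below is a minimal illustration of the documented rules only. The function names, the `Strength` enum and the integer star/count parameters are hypothetical and are not taken from the Exomiser code base, which is written in Java.

```python
from enum import Enum
from typing import Optional


class Strength(Enum):
    """Evidence strength modifiers named in the documentation (hypothetical enum)."""
    SUPPORTING = "Supporting"
    MODERATE = "Moderate"
    STRONG = "Strong"


def ps1_strength(clinvar_stars: int) -> Strength:
    """PS1: same amino acid change as a reported P/LP ClinVar variant."""
    if clinvar_stars >= 2:
        return Strength.STRONG
    if clinvar_stars == 1:
        return Strength.MODERATE
    return Strength.SUPPORTING  # no ClinVar star rating


def pm5_strength(clinvar_stars: int) -> Strength:
    """PM5: novel missense change at a residue with a reported P/LP ClinVar variant."""
    return Strength.MODERATE if clinvar_stars >= 2 else Strength.SUPPORTING


def pm1_strength(n_plp: int, n_vus: int, n_blb: int) -> Optional[Strength]:
    """PM1: ClinVar counts within 25 nucleotides either side of the variant.

    Returns None when the criterion is not met (fewer than 4 P/LP variants,
    or any B/LB variant in the region).
    """
    if n_plp < 4 or n_blb > 0:
        return None
    return Strength.MODERATE if n_plp > n_vus else Strength.SUPPORTING
```

For instance, under these assumptions a region with five P/LP variants, two VUS and no B/LB variants would give `PM1` at `Moderate`, whereas four P/LP and four VUS would give `Supporting`.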

diff --git a/docs/advanced_analysis.rst b/docs/advanced_analysis.rst
@@ -107,12 +107,7 @@ requires anything different, it is possible to manually define the data sources
         TOPMED,
         UK10K,
 
-        ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,
-
-        EXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,
-        EXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,
-        EXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,
-        EXAC_OTHER,
+        ESP_AA, ESP_EA, ESP_ALL,
 
         GNOMAD_E_AFR,
         GNOMAD_E_AMR,
@@ -208,25 +203,18 @@ Here you can specify which variant frequency databases you want to use. You can
 array format as the HPO IDs.
 
 The data sources used are from `1000 genomes <http://www.1000genomes.org>`_ (via DBSNP), `DBSNP <https://www.ncbi.nlm.nih.gov/projects/SNP/>`_,
-`ESP <https://evs.gs.washington.edu/EVS/>`_, `ExAC, gnomAD exomes and gnomAD genomes <https://gnomad.broadinstitute.org/about>`_,
-`UK10K <https://www.uk10k.org/>`_ (via DBSNP), `TOPMed <https://topmed.nhlbi.nih.gov/>`_ (via DBSNP).
+`ESP <https://evs.gs.washington.edu/EVS/>`_, `UK10K <https://www.uk10k.org/>`_ (via DBSNP), `TOPMed <https://topmed.nhlbi.nih.gov/>`_ (via DBSNP).
+
+As of the 2402 data release the ExAC source has been removed, as this data is part of gnomAD 2.1+. The
+`gnomAD exomes and gnomAD genomes <https://gnomad.broadinstitute.org/about>`_ sources remain available.
 
 DBSNP:
     ``THOUSAND_GENOMES``,
     ``UK10K``,
     ``TOPMED``
 
 ESP:
-    ``ESP_AFRICAN_AMERICAN``, ``ESP_EUROPEAN_AMERICAN``, ``ESP_ALL``
-
-ExAC:
-    ``EXAC_AFRICAN_INC_AFRICAN_AMERICAN``,
-    ``EXAC_AMERICAN``,
-    ``EXAC_SOUTH_ASIAN``,
-    ``EXAC_EAST_ASIAN``,
-    ``EXAC_FINNISH``,
-    ``EXAC_NON_FINNISH_EUROPEAN``,
-    ``EXAC_OTHER``
+    ``ESP_AA``, ``ESP_EA``, ``ESP_ALL``
 
 gnomAD exomes:
     ``GNOMAD_E_AFR``,
@@ -235,21 +223,26 @@ gnomAD exomes:
     ``GNOMAD_E_EAS``,
     ``GNOMAD_E_FIN``,
     ``GNOMAD_E_NFE``,
+    ``GNOMAD_E_MID``,
     ``GNOMAD_E_OTH``,
     ``GNOMAD_E_SAS``,
 
 gnomAD genomes:
     ``GNOMAD_G_AFR``,
     ``GNOMAD_G_AMR``,
+    ``GNOMAD_G_AMI``,
     ``GNOMAD_G_ASJ``,
     ``GNOMAD_G_EAS``,
     ``GNOMAD_G_FIN``,
     ``GNOMAD_G_NFE``,
+    ``GNOMAD_G_MID``,
     ``GNOMAD_G_OTH``,
     ``GNOMAD_G_SAS``
 
-We recommend using all databases if the proband population background is unknown, although removing the ``GNOMAD_E_ASJ``
-and ``GNOMAD_G_ASJ``, unless your proband is known to come from an Ashkenazi population e.g.
+We recommend using all databases if the proband population background is unknown, although removing the ``ASJ``, ``AMI``,
+``FIN``, ``MID`` and ``OTH`` populations is recommended as these are small/founder populations which are likely to have
+artificially high allele frequencies for some relevant variants. These populations will not be included when assessing
+the population frequency for the ACMG assignments, even if used in the filtering.
 
 .. code-block:: yaml
 
@@ -258,29 +251,24 @@ and ``GNOMAD_G_ASJ``, unless your proband is known to come from an Ashkenazi pop
       TOPMED,
       UK10K,
 
-      ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,
-
-      EXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,
-      EXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,
-      EXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,
-      EXAC_OTHER,
+      ESP_AA, ESP_EA, ESP_ALL,
 
       GNOMAD_E_AFR,
       GNOMAD_E_AMR,
-      #        GNOMAD_E_ASJ,
+      # GNOMAD_E_ASJ,
       GNOMAD_E_EAS,
-      GNOMAD_E_FIN,
+      # GNOMAD_E_FIN,
       GNOMAD_E_NFE,
-      GNOMAD_E_OTH,
+      # GNOMAD_E_OTH,
       GNOMAD_E_SAS,
 
       GNOMAD_G_AFR,
       GNOMAD_G_AMR,
-      #        GNOMAD_G_ASJ,
+      # GNOMAD_G_ASJ,
       GNOMAD_G_EAS,
-      GNOMAD_G_FIN,
+      # GNOMAD_G_FIN,
       GNOMAD_G_NFE,
-      GNOMAD_G_OTH,
+      # GNOMAD_G_OTH,
       GNOMAD_G_SAS
     ]
 
@@ -289,14 +277,27 @@ and ``GNOMAD_G_ASJ``, unless your proband is known to come from an Ashkenazi pop
 
 pathogenicitySources:
 ---------------------
-Possible pathogenicitySources: ``POLYPHEN``, ``MUTATION_TASTER``, ``SIFT``, ``REVEL``, ``MVP``, ``CADD``, ``REMM``. ``REMM`` is trained on
+Possible pathogenicitySources: ``POLYPHEN``, ``MUTATION_TASTER``, ``SIFT``, ``REVEL``, ``MVP``, ``ALPHA_MISSENSE``,
+``SPLICE_AI`` (derived from gnomAD 4.0, so only available for hg38),  ``CADD``, ``REMM``. ``REMM`` is trained on
 non-coding regulatory regions. **WARNING** if you enable ``CADD``, ensure that you have downloaded and installed the CADD
 tabix files and updated their location in the ``application.properties`` (see :ref:`cadd-install`). Exomiser will not run
 without this.
 
 We recommend using either  ``[REVEL, MVP]`` **OR** ``[POLYPHEN, MUTATION_TASTER, SIFT]`` as REVEL and MVP are newer
 predictors which have been shown to have better performance and are more nuanced. Mixing them with the Polyphen2,
-MutationTaster or SIFT will give worse performance.
+MutationTaster or SIFT will give worse performance. In testing on GEL solved cases, AlphaMissense slightly increased
+performance when combined with MVP. We advise testing on local cohorts to assess local performance.
+
+`REVEL scores are freely available for non-commercial use. For other uses, please contact Weiva Sieh.`
+
+`AlphaMissense Database Copyright (2023) DeepMind Technologies Limited. All predictions are provided for non-commercial
+research use only under CC BY-NC-SA license. Researchers interested in predictions not yet provided, and for
+non-commercial use, can send an expression of interest to alphamissense@google.com.`
+
+`SpliceAI source code is provided under the GPLv3 license. SpliceAI includes several third party packages provided under
+other open source licenses, please see NOTICE for additional details. The trained models used by SpliceAI (located in
+this package at spliceai/models) are provided under the CC BY NC 4.0 license for academic and non-commercial use; other
+use requires a commercial license from Illumina, Inc.`
 
 .. code-block:: yaml
 
@@ -319,7 +320,7 @@ Analysis steps are defined in terms of :ref:`variant filters<variantfilters>`, :
 operate on genes but also require the variants to have already been filtered. The optimiser will ensure that these are
 run at the correct time if they have been incorrectly placed.
 
-Using these it is possible to create artificial exomes, define gene panels or only examine specific regions, for example.
+Using these it is possible, for example, to create artificial exomes, define gene panels or only examine specific regions.
 
 .. _variantfilters:
 
@@ -413,14 +414,20 @@ frequencyFilter:
 Frequency cutoff of a variant **in percent**. Frequencies are derived from the databases defined in the :ref:`frequencySources<frequencysources>`
 section. We recommend a value below 5.0 % depending on the disease. Variants will be removed/failed if they have a
 frequency higher than the stated percentage in any database defined in the :ref:`frequencySources<frequencysources>` section.
-_n.b_ Not defining this filter will result in all variants having no frequency data, even if the :ref:`frequencySources<frequencysources>`
-contains values.
 
 .. code-block:: yaml
 
     frequencyFilter: {maxFrequency: 1.0}
 
 
+.. important::
+
+    Not defining this filter will result in all variants having no frequency data, even if the :ref:`frequencySources<frequencysources>`
+    are defined. Failing to include this will result in Exomiser assuming variants are all very rare and subsequently
+    assigning an artificially inflated score, especially for very common variants. If you want to score all variants and
+    write failed ones to the output, it is recommended to use `analysisMode: FULL`.
+
+
 .. _pathogenicityfilter:
 
 pathogenicityFilter:
@@ -435,6 +442,14 @@ This filter is meant to be quite permissive and we recommend it be set to true.
     pathogenicityFilter: {keepNonPathogenic: true}
 
 
+.. important::
+
+    Not defining this filter will result in all variants having no pathogenicity data or ClinVar annotations, even if the
+    :ref:`pathogenicitySources<pathogenicitysources>` are defined. Failing to include this will result in Exomiser
+    using default scores based on the assigned variant effect. If you want to score all variants and write failed ones
+    to the output, it is recommended to use `analysisMode: FULL`.
+
+
 .. _genefilters:
 
 Gene filters
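
Editor's sketch (not part of the commit): the docs/advanced_analysis.rst hunks above state two frequency rules. A variant fails the `frequencyFilter` if its frequency, in percent, is higher than `maxFrequency` in any configured source, and the `ASJ`, `AMI`, `FIN`, `MID` and `OTH` populations are left out when assessing population frequency for the ACMG assignments. The Python below illustrates those two statements under simple assumptions. The function names and the dictionary of per-source frequencies are hypothetical and do not reflect Exomiser's actual Java implementation.

```python
from typing import Mapping

# Population suffixes the documentation describes as small/founder populations.
FOUNDER_SUFFIXES = ("ASJ", "AMI", "FIN", "MID", "OTH")


def passes_frequency_filter(frequencies: Mapping[str, float], max_frequency: float) -> bool:
    """Pass only if no configured source reports a frequency (percent) above the cutoff."""
    return all(freq <= max_frequency for freq in frequencies.values())


def max_frequency_excluding_founders(frequencies: Mapping[str, float]) -> float:
    """Highest frequency (percent) after dropping small/founder populations.

    Returns 0.0 when no other source has data for the variant.
    """
    kept = [
        freq
        for source, freq in frequencies.items()
        if not source.endswith(FOUNDER_SUFFIXES)
    ]
    return max(kept, default=0.0)


# Hypothetical per-source frequencies, in percent, for a single variant.
example = {"GNOMAD_E_AFR": 0.02, "GNOMAD_E_FIN": 1.5, "TOPMED": 0.1}
```

With `maxFrequency: 1.0` the example variant would fail the filter if `GNOMAD_E_FIN` is among the configured sources, which is one reason the documentation suggests commenting such populations out of `frequencySources`.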

diff --git a/docs/conf.py b/docs/conf.py
@@ -55,11 +55,11 @@
 # -- Project information -----------------------------------------------------
 
 project = u'exomiser'
-copyright = u'2021, Jules Jacobsen, Damian Smedley, Peter Robinson'
+copyright = u'2024, Jules Jacobsen, Damian Smedley, Peter Robinson'
 author = u'Jules Jacobsen, Damian Smedley, Peter Robinson'
 
 # The short X.Y version
-version = u'13.1.0'
+version = u'14.0.0'
 # The full version, including alpha/beta/rc tags
 release = version
 
@@ -94,7 +94,7 @@
 #
 # This is also used if you do content translation via gettext catalogs.
 # Usually you set "language" from the command line for these cases.
-language = None
+language = 'en'
 
 # List of patterns, relative to source directory, that match files and
 # directories to ignore when looking for source files.

diff --git a/docs/input_files_and_options.rst b/docs/input_files_and_options.rst
@@ -78,6 +78,15 @@ genome assembly and is enabled in the ``application.properties`` see the :ref:`r
 Analysis
 ========
 
+.. important::
+
+    The exome and genome analyses found in the `test-analysis-exome.yml` and `test-analysis-genome.yml` files are
+    recommended for use in most situations, and removing steps from the analysis is likely to negatively impact
+    performance. It is *strongly* recommended to test any changes against the standard setup on the example samples and
+    your own solved cases to check the impact of any changes you might want to make. If you want to score all variants
+    and write failed ones to the output, it is recommended to use `analysisMode: FULL`.
+
+
 Analysis files contain all possible options for running an analysis including the ability to specify variant frequency
 and pathogenicity data sources and the ability to tweak the order that analysis steps are performed.