Merge pull request #65 from apriltuesday/EVA-3668

EVA-3668: Clarify support for compressed input files
EBIvariation · Oct 10, 2024 · 3d9939e · 3d9939e
2 parents 6dc0e3c + c1d9cb6
commit 3d9939e
Show file tree

Hide file tree

Showing 6 changed files with 82 additions and 35 deletions.
diff --git a/README.md b/README.md
@@ -48,33 +48,41 @@ Install each of these and ensure they are included in your PATH. Then install th
 
 The ["Getting Started" guide](Getting_Started_with_eva_sub_cli.md) serves as an introduction for users of the eva-sub-cli tool. It includes instructions on how to prepare your data and metadata, ensuring that users are equipped with the necessary information to successfully submit variant data. This guide is essential for new users, offering practical advice and tips for a smooth onboarding experience with the eva-sub-cli tool.
 
-## eva-sub-cli tool: Options and parameters guide
+## Options and parameters guide
 
-The eva-sub-cli tool provides several options/parameters that you can use to tailor its functionality to your needs. Understanding these parameters is crucial for configuring the tool correctly. Below is an overview of the key parameters and options:
+The eva-sub-cli tool provides several options and parameters that you can use to tailor its functionality to your needs.
+You can view all the available parameters with the command `eva-sub-cli.py -h` and view detailed explanations for the
+input file requirements in the ["Getting Started" guide](Getting_Started_with_eva_sub_cli.md).
+Below is an overview of the key parameters.
 
-| OPTIONS/PARAMETERS         | DESCRIPTION                                                                                                                                                                                                        |
-|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| --version                  | Shows version number of the program and exit                                                                                                                                                                       |
-| --metadata_xlsx            | Excel spreadsheet that describe the project, analysis, samples and files                                                                                                                                           |
-| --metadata_json            | Json file that describe the project, analysis, samples and files                                                                                                                                                   |
-| --vcf_files                | One or several vcf files to validate.This allows you to provide multiple VCF files to validate and a single associated reference genome file. The VCF files and the associated reference genome file must use the same chromosome naming convention                                                                                                                                                                              |
-| --reference_fasta          | The fasta file containing the reference genome from which the variants were derived                                                                                                                                |
-| --submission_dir           | Path to the directory where all processing will be done and submission data is/will be stored                                                                                                                      |
-| --tasks {validate,submit}  | Selecting VALIDATE will run the validation regardless of the outcome of previous runs. Selecting SUBMIT will run validate only if the validation was not performed successfully before and then run the submission |
-| --executor {docker,native} | Select an execution type for running validation (default native)                                                                                                                                                   |
-| --shallow                  | Set the validation to be performed on the first 10000 records of the VCF. Only applies if the number of record exceed 10000                                                                                        |
-| --username                 | Username used for connecting to the ENA webin account                                                                                                                                                              |
-| --password                 | Password used for connecting to the ENA webin account                                                                                                                                                              |
+### Submission directory
 
+This is the directory where all processing will take place, and where configuration and reports will be saved.
+Crucially, the eva-sub-cli tool requires that there be **only one submission per directory** and that the submission directory not be reused.
+Running multiple submissions from a single directory can result in data loss during validation and submission.
+
+### Metadata file
+
+Metadata can be provided in one of two files.
 
 #### The metadata spreadsheet
 
-The metadata template can be found within the [etc folder](eva_sub_cli/etc/EVA_Submission_template.xlsx). It should be populated following the instruction provided within the template.
+The metadata template can be found within the [etc folder](eva_sub_cli/etc/EVA_Submission_template.xlsx). It should be populated following the instructions provided within the template.
+This is passed using the option `--metadata_xlsx`.
 
 #### The metadata JSON
 
 The metadata can also be provided via a JSON file, which should conform to the schema located [here](eva_sub_cli/etc/eva_schema.json).
+This is passed using the option `--metadata_json`.
+
+### VCF files and Reference FASTA
+
+These can be provided either in the metadata file directly, or on the command line using the `--vcf_files` and `--reference_fata` options.
+Note that if you are using more than one reference FASTA, you **cannot** use the command line options; you must specify which VCF files use which FASTA files in the metadata.
 
+VCF files can be either uncompressed or compressed using bgzip.
+Other types of compression are not allowed and will result in errors during validation.
+FASTA files must be uncompressed.
 
 ## Execution
 

diff --git a/eva_sub_cli/exceptions/invalid_file_type_exception.py b/eva_sub_cli/exceptions/invalid_file_type_exception.py
@@ -0,0 +1,4 @@
+class InvalidFileTypeError(Exception):
+    def __init__(self, message):
+        self.message = message
+        super().__init__(self.message)
diff --git a/eva_sub_cli/executables/xlsx2json.py b/eva_sub_cli/executables/xlsx2json.py
@@ -47,6 +47,7 @@ def __init__(self, xlsx_filename, conf_filename):
         :param conf_filename: configuration file path
         :type conf_filename: basestring
         """
+        self.errors = []
         with open(conf_filename, 'r') as conf_file:
             self.xlsx_conf = yaml.safe_load(conf_file)
         try:
@@ -60,7 +61,6 @@ def __init__(self, xlsx_filename, conf_filename):
         self.row_offset = {}
         self.headers = {}
         self.file_loaded = True
-        self.errors = []
         self.valid_worksheets()
 
     @property

diff --git a/eva_sub_cli/orchestrator.py b/eva_sub_cli/orchestrator.py
@@ -10,6 +10,7 @@
 from openpyxl.reader.excel import load_workbook
 
 from eva_sub_cli import SUB_CLI_CONFIG_FILE, __version__
+from eva_sub_cli.exceptions.invalid_file_type_exception import InvalidFileTypeError
 from eva_sub_cli.exceptions.submission_not_found_exception import SubmissionNotFoundException
 from eva_sub_cli.exceptions.submission_status_exception import SubmissionStatusException
 from eva_sub_cli.submission_ws import SubmissionWSClient
@@ -36,7 +37,19 @@ def get_vcf_files(mapping_file):
     return vcf_files
 
 
-def get_project_title_and_create_vcf_files_mapping(submission_dir, vcf_files, reference_fasta, metadata_json, metadata_xlsx):
+def get_project_title_and_create_vcf_files_mapping(submission_dir, vcf_files, reference_fasta,
+                                                   metadata_json, metadata_xlsx):
+    """
+    Get project title and mapping between VCF files and reference FASTA files, from three sources: command line
+        arguments, metadata JSON file, or metadata XLSX file.
+
+    :param submission_dir: Directory where mapping file will be saved
+    :param vcf_files: VCF files from command line, if present
+    :param reference_fasta: Reference FASTA from command line, if present
+    :param metadata_json: Metadata JSON from command line, if present
+    :param metadata_xlsx: Metadata XLSX from command line, if present
+    :return: Project title and path to the mapping file
+    """
     mapping_file = os.path.join(submission_dir, 'vcf_mapping_file.csv')
     with open(mapping_file, 'w') as open_file:
         writer = csv.writer(open_file, delimiter=',')
@@ -45,7 +58,7 @@ def get_project_title_and_create_vcf_files_mapping(submission_dir, vcf_files, re
         vcf_files_mapping = []
         if vcf_files and reference_fasta:
             for vcf_file in vcf_files:
-                vcf_files_mapping.append([os.path.abspath(vcf_file), os.path.abspath(reference_fasta)])
+                vcf_files_mapping.append([os.path.abspath(vcf_file), os.path.abspath(reference_fasta), ''])
             if metadata_json:
                 project_title, _ = get_project_and_vcf_fasta_mapping_from_metadata_json(metadata_json, False)
             elif metadata_xlsx:
@@ -55,12 +68,32 @@ def get_project_title_and_create_vcf_files_mapping(submission_dir, vcf_files, re
         elif metadata_xlsx:
             project_title, vcf_files_mapping = get_project_and_vcf_fasta_mapping_from_metadata_xlsx(metadata_xlsx, True)
 
+        validate_vcf_mapping(vcf_files_mapping)
         for mapping in vcf_files_mapping:
             writer.writerow(mapping)
 
     return project_title, mapping_file
 
 
+def validate_vcf_mapping(vcf_mapping):
+    """
+    Validate that VCF files and FASTA files in the mapping are present and FASTA files are not compressed.
+
+    :param vcf_mapping: iterable of triples (VCF file path, reference FASTA path, optional assembly report path)
+    :return:
+    """
+    for vcf_file, fasta_file, report_file in vcf_mapping:
+        if not (vcf_file and os.path.isfile(vcf_file)):
+            raise FileNotFoundError(f'The variant file {vcf_file} does not exist, please check the file path.')
+        if not (fasta_file and os.path.isfile(fasta_file)):
+            raise FileNotFoundError(f'The reference fasta {fasta_file} does not exist, please check the file path.')
+        if fasta_file.lower().endswith('gz'):
+            raise InvalidFileTypeError(f'The reference fasta {fasta_file} is compressed, please uncompress the file.')
+        if report_file and not os.path.isfile(report_file):
+            raise FileNotFoundError(f'The assembly report file {report_file} does not exist, please check the file '
+                                    f'path.')
+
+
 def get_project_and_vcf_fasta_mapping_from_metadata_json(metadata_json, mapping_req=False):
     with open(metadata_json) as file:
         json_metadata = json.load(file)
@@ -118,10 +151,6 @@ def get_project_and_vcf_fasta_mapping_from_metadata_xlsx(metadata_xlsx, mapping_
             file_name = os.path.abspath(row[files_headers['File Name']])
             analysis_alias = row[files_headers['Analysis Alias']]
             reference_fasta = os.path.abspath(analysis_alias_dict[analysis_alias])
-            if not (file_name and os.path.isfile(file_name)):
-                raise FileNotFoundError(f'The variant file {file_name} provided in spreadsheet {metadata_xlsx} does not exist')
-            if not (reference_fasta and os.path.isfile(reference_fasta)):
-                raise FileNotFoundError(f'The reference fasta {reference_fasta} in spreadsheet {metadata_xlsx} does not exist')
             vcf_fasta_report_mapping.append([os.path.abspath(file_name), os.path.abspath(reference_fasta), ''])
 
     return project_title, vcf_fasta_report_mapping

diff --git a/eva_sub_cli/validators/docker_validator.py b/eva_sub_cli/validators/docker_validator.py
@@ -12,7 +12,7 @@
 logger = logging_config.get_logger(__name__)
 
 container_image = 'ebivariation/eva-sub-cli'
-container_tag = 'v0.0.1'
+container_tag = 'v0.0.2.dev0'
 container_validation_dir = '/opt/vcf_validation'
 container_validation_output_dir = 'vcf_validation_output'
 

diff --git a/tests/test_orchestrator.py b/tests/test_orchestrator.py
@@ -8,6 +8,7 @@
 from requests import HTTPError
 
 from eva_sub_cli import SUB_CLI_CONFIG_FILE
+from eva_sub_cli.exceptions.invalid_file_type_exception import InvalidFileTypeError
 from eva_sub_cli.exceptions.submission_not_found_exception import SubmissionNotFoundException
 from eva_sub_cli.exceptions.submission_status_exception import SubmissionStatusException
 from eva_sub_cli.orchestrator import orchestrate_process, VALIDATE, SUBMIT, DOCKER, check_validation_required
@@ -149,16 +150,17 @@ def test_orchestrate_submit_no_validate(self):
 
     def test_orchestrate_with_vcf_files(self):
         with patch('eva_sub_cli.orchestrator.WritableConfig') as m_config, \
-                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator:
-            orchestrate_process( self.test_sub_dir, self.vcf_files, self.reference_fasta, self.metadata_json,
-                                 self.metadata_xlsx, tasks=[VALIDATE], executor=DOCKER)
+                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator, \
+                patch('eva_sub_cli.orchestrator.os.path.isfile'):
+            orchestrate_process(self.test_sub_dir, self.vcf_files, self.reference_fasta, self.metadata_json,
+                                self.metadata_xlsx, tasks=[VALIDATE], executor=DOCKER)
             # Mapping file was created from the vcf and assembly files
             assert os.path.exists(self.mapping_file)
             with open(self.mapping_file) as open_file:
                 reader = csv.DictReader(open_file, delimiter=',')
                 for row in reader:
                     assert row['vcf'].__contains__('vcf_file')
-                    assert row['report'] == None
+                    assert row['report'] == ''
             m_docker_validator.assert_any_call(
                 self.mapping_file, self.test_sub_dir, self.project_title, self.metadata_json, self.metadata_xlsx,
                 submission_config=m_config.return_value, shallow_validation=False
@@ -187,7 +189,8 @@ def test_orchestrate_with_metadata_json_with_asm_report(self):
         shutil.copy(os.path.join(self.resource_dir, 'EVA_Submission_test_with_asm_report.json'), self.metadata_json)
 
         with patch('eva_sub_cli.orchestrator.WritableConfig') as m_config, \
-                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator:
+                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator, \
+                patch('eva_sub_cli.orchestrator.os.path.isfile'):
             orchestrate_process(self.test_sub_dir, None, None, self.metadata_json, None,
                 tasks=[VALIDATE], executor=DOCKER)
             # Mapping file was created from the metadata_json
@@ -207,7 +210,8 @@ def test_orchestrate_vcf_files_takes_precedence_over_metadata(self):
         shutil.copy(os.path.join(self.resource_dir, 'EVA_Submission_test_with_asm_report.json'), self.metadata_json)
 
         with patch('eva_sub_cli.orchestrator.WritableConfig') as m_config, \
-                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator:
+                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator, \
+                patch('eva_sub_cli.orchestrator.os.path.isfile'):
             orchestrate_process(self.test_sub_dir, self.vcf_files, self.reference_fasta, self.metadata_json,
                                 None, tasks=[VALIDATE], executor=DOCKER, resume=False)
             # Mapping file was created from the metadata_json
@@ -216,15 +220,13 @@ def test_orchestrate_vcf_files_takes_precedence_over_metadata(self):
                 reader = csv.DictReader(open_file, delimiter=',')
                 for row in reader:
                     assert row['vcf'].__contains__('vcf_file')
-                    assert row['report'] == None
+                    assert row['report'] == ''
             m_docker_validator.assert_any_call(
                 self.mapping_file, self.test_sub_dir, self.project_title, self.metadata_json, None,
                 submission_config=m_config.return_value, shallow_validation=False
             )
             m_docker_validator().validate_and_report.assert_called_once_with()
 
-
-
     def test_orchestrate_with_metadata_xlsx(self):
         with patch('eva_sub_cli.orchestrator.WritableConfig') as m_config, \
                 patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator:
@@ -244,12 +246,16 @@ def test_orchestrate_with_metadata_xlsx(self):
             m_docker_validator().validate_and_report.assert_called_once_with()
 
     def test_metadata_file_does_not_exist_error(self):
-        with self.assertRaises(Exception) as context:
+        with self.assertRaises(FileNotFoundError) as context:
             orchestrate_process(self.test_sub_dir, None, None, None, 'Non_existing_metadata.xlsx',
                                 tasks=[VALIDATE], executor=DOCKER)
         self.assertRegex(
             str(context.exception),
             r"The provided metadata file .*/resources/test_sub_dir/Non_existing_metadata.xlsx does not exist"
         )
 
-
+    def test_fasta_file_compressed(self):
+        with patch('eva_sub_cli.orchestrator.os.path.isfile'):
+            with self.assertRaises(InvalidFileTypeError):
+                orchestrate_process(self.test_sub_dir, self.vcf_files, self.reference_fasta + '.gz', self.metadata_json,
+                                    self.metadata_xlsx, tasks=[VALIDATE], executor=DOCKER)