Merge pull request #65 from apriltuesday/EVA-3668
EVA-3668: Clarify support for compressed input files
apriltuesday authored Oct 10, 2024
2 parents 6dc0e3c + c1d9cb6 commit 3d9939e
Showing 6 changed files with 82 additions and 35 deletions.
40 changes: 24 additions & 16 deletions README.md
@@ -48,33 +48,41 @@ Install each of these and ensure they are included in your PATH. Then install th

The ["Getting Started" guide](Getting_Started_with_eva_sub_cli.md) serves as an introduction for users of the eva-sub-cli tool. It includes instructions on how to prepare your data and metadata, ensuring that users are equipped with the necessary information to successfully submit variant data. This guide is essential for new users, offering practical advice and tips for a smooth onboarding experience with the eva-sub-cli tool.

## eva-sub-cli tool: Options and parameters guide
## Options and parameters guide

The eva-sub-cli tool provides several options/parameters that you can use to tailor its functionality to your needs. Understanding these parameters is crucial for configuring the tool correctly. Below is an overview of the key parameters and options:
The eva-sub-cli tool provides several options and parameters that you can use to tailor its functionality to your needs.
You can view all the available parameters with the command `eva-sub-cli.py -h` and view detailed explanations for the
input file requirements in the ["Getting Started" guide](Getting_Started_with_eva_sub_cli.md).
Below is an overview of the key parameters.

| OPTIONS/PARAMETERS | DESCRIPTION |
|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| --version | Shows version number of the program and exit |
| --metadata_xlsx | Excel spreadsheet that describe the project, analysis, samples and files |
| --metadata_json | Json file that describe the project, analysis, samples and files |
| --vcf_files | One or several vcf files to validate.This allows you to provide multiple VCF files to validate and a single associated reference genome file. The VCF files and the associated reference genome file must use the same chromosome naming convention |
| --reference_fasta | The fasta file containing the reference genome from which the variants were derived |
| --submission_dir | Path to the directory where all processing will be done and submission data is/will be stored |
| --tasks {validate,submit} | Selecting VALIDATE will run the validation regardless of the outcome of previous runs. Selecting SUBMIT will run validate only if the validation was not performed successfully before and then run the submission |
| --executor {docker,native} | Select an execution type for running validation (default native) |
| --shallow | Set the validation to be performed on the first 10000 records of the VCF. Only applies if the number of record exceed 10000 |
| --username | Username used for connecting to the ENA webin account |
| --password | Password used for connecting to the ENA webin account |
### Submission directory

This is the directory where all processing will take place, and where configuration and reports will be saved.
Crucially, the eva-sub-cli tool requires that there be **only one submission per directory** and that the submission directory not be reused.
Running multiple submissions from a single directory can result in data loss during validation and submission.

### Metadata file

Metadata can be provided in one of two files.

#### The metadata spreadsheet

The metadata template can be found within the [etc folder](eva_sub_cli/etc/EVA_Submission_template.xlsx). It should be populated following the instruction provided within the template.
The metadata template can be found within the [etc folder](eva_sub_cli/etc/EVA_Submission_template.xlsx). It should be populated following the instructions provided within the template.
This is passed using the option `--metadata_xlsx`.

#### The metadata JSON

The metadata can also be provided via a JSON file, which should conform to the schema located [here](eva_sub_cli/etc/eva_schema.json).
This is passed using the option `--metadata_json`.

### VCF files and Reference FASTA

These can be provided either in the metadata file directly, or on the command line using the `--vcf_files` and `--reference_fasta` options.
Note that if you are using more than one reference FASTA, you **cannot** use the command line options; you must specify which VCF files use which FASTA files in the metadata.

VCF files can be either uncompressed or compressed using bgzip.
Other types of compression are not allowed and will result in errors during validation.
FASTA files must be uncompressed.
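
As a rough illustration of the bgzip requirement, the sketch below tells plain gzip apart from bgzip by inspecting the gzip header for the BGZF `BC` extra subfield. The function name `classify_vcf_compression` and its return labels are hypothetical and are not part of eva-sub-cli; how the tool itself detects compression is not shown here. Files using other compression schemes (for example bzip2) are simply reported as `not_gzip`.

```python
import struct


def classify_vcf_compression(path):
    """Return 'bgzip', 'gzip' or 'not_gzip' based on the file's leading bytes.

    Hypothetical helper for illustration only, not part of eva-sub-cli.
    """
    with open(path, 'rb') as handle:
        header = handle.read(12)
        # Every gzip member starts with the magic bytes 0x1f 0x8b.
        if len(header) < 2 or header[:2] != b'\x1f\x8b':
            return 'not_gzip'
        # Bit 0x04 of the FLG byte (FEXTRA) signals an extra field,
        # which BGZF blocks always carry.
        if len(header) < 12 or not header[3] & 0x04:
            return 'gzip'
        # XLEN: little-endian length of the extra field.
        xlen = struct.unpack('<H', header[10:12])[0]
        extra = handle.read(xlen)

    # Walk the extra subfields looking for the BGZF marker 'B', 'C' with length 2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack('<BBH', extra[pos:pos + 4])
        if si1 == ord('B') and si2 == ord('C') and slen == 2:
            return 'bgzip'
        pos += 4 + slen
    return 'gzip'
```

A file reported as `gzip` here would need to be recompressed with bgzip (or decompressed) before being passed to the tool.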

## Execution

4 changes: 4 additions & 0 deletions eva_sub_cli/exceptions/invalid_file_type_exception.py
@@ -0,0 +1,4 @@
class InvalidFileTypeError(Exception):
    def __init__(self, message):
        self.message = message
        super().__init__(self.message)
2 changes: 1 addition & 1 deletion eva_sub_cli/executables/xlsx2json.py
@@ -47,6 +47,7 @@ def __init__(self, xlsx_filename, conf_filename):
        :param conf_filename: configuration file path
        :type conf_filename: basestring
        """
        self.errors = []
        with open(conf_filename, 'r') as conf_file:
            self.xlsx_conf = yaml.safe_load(conf_file)
        try:
@@ -60,7 +61,6 @@ def __init__(self, xlsx_filename, conf_filename):
        self.row_offset = {}
        self.headers = {}
        self.file_loaded = True
        self.errors = []
        self.valid_worksheets()

    @property
41 changes: 35 additions & 6 deletions eva_sub_cli/orchestrator.py
@@ -10,6 +10,7 @@
from openpyxl.reader.excel import load_workbook

from eva_sub_cli import SUB_CLI_CONFIG_FILE, __version__
from eva_sub_cli.exceptions.invalid_file_type_exception import InvalidFileTypeError
from eva_sub_cli.exceptions.submission_not_found_exception import SubmissionNotFoundException
from eva_sub_cli.exceptions.submission_status_exception import SubmissionStatusException
from eva_sub_cli.submission_ws import SubmissionWSClient
@@ -36,7 +37,19 @@ def get_vcf_files(mapping_file):
    return vcf_files


def get_project_title_and_create_vcf_files_mapping(submission_dir, vcf_files, reference_fasta, metadata_json, metadata_xlsx):
def get_project_title_and_create_vcf_files_mapping(submission_dir, vcf_files, reference_fasta,
                                                   metadata_json, metadata_xlsx):
    """
    Get project title and mapping between VCF files and reference FASTA files, from three sources: command line
    arguments, metadata JSON file, or metadata XLSX file.
    :param submission_dir: Directory where mapping file will be saved
    :param vcf_files: VCF files from command line, if present
    :param reference_fasta: Reference FASTA from command line, if present
    :param metadata_json: Metadata JSON from command line, if present
    :param metadata_xlsx: Metadata XLSX from command line, if present
    :return: Project title and path to the mapping file
    """
    mapping_file = os.path.join(submission_dir, 'vcf_mapping_file.csv')
    with open(mapping_file, 'w') as open_file:
        writer = csv.writer(open_file, delimiter=',')
@@ -45,7 +58,7 @@ def get_project_title_and_create_vcf_files_mapping(submission_dir, vcf_files, re
        vcf_files_mapping = []
        if vcf_files and reference_fasta:
            for vcf_file in vcf_files:
                vcf_files_mapping.append([os.path.abspath(vcf_file), os.path.abspath(reference_fasta)])
                vcf_files_mapping.append([os.path.abspath(vcf_file), os.path.abspath(reference_fasta), ''])
            if metadata_json:
                project_title, _ = get_project_and_vcf_fasta_mapping_from_metadata_json(metadata_json, False)
            elif metadata_xlsx:
@@ -55,12 +68,32 @@ def get_project_title_and_create_vcf_files_mapping(submission_dir, vcf_files, re
        elif metadata_xlsx:
            project_title, vcf_files_mapping = get_project_and_vcf_fasta_mapping_from_metadata_xlsx(metadata_xlsx, True)

        validate_vcf_mapping(vcf_files_mapping)
        for mapping in vcf_files_mapping:
            writer.writerow(mapping)

    return project_title, mapping_file


def validate_vcf_mapping(vcf_mapping):
    """
    Validate that VCF files and FASTA files in the mapping are present and FASTA files are not compressed.
    :param vcf_mapping: iterable of triples (VCF file path, reference FASTA path, optional assembly report path)
    :return:
    """
    for vcf_file, fasta_file, report_file in vcf_mapping:
        if not (vcf_file and os.path.isfile(vcf_file)):
            raise FileNotFoundError(f'The variant file {vcf_file} does not exist, please check the file path.')
        if not (fasta_file and os.path.isfile(fasta_file)):
            raise FileNotFoundError(f'The reference fasta {fasta_file} does not exist, please check the file path.')
        if fasta_file.lower().endswith('gz'):
            raise InvalidFileTypeError(f'The reference fasta {fasta_file} is compressed, please uncompress the file.')
        if report_file and not os.path.isfile(report_file):
            raise FileNotFoundError(f'The assembly report file {report_file} does not exist, please check the file '
                                    f'path.')


def get_project_and_vcf_fasta_mapping_from_metadata_json(metadata_json, mapping_req=False):
    with open(metadata_json) as file:
        json_metadata = json.load(file)
@@ -118,10 +151,6 @@ def get_project_and_vcf_fasta_mapping_from_metadata_xlsx(metadata_xlsx, mapping_
            file_name = os.path.abspath(row[files_headers['File Name']])
            analysis_alias = row[files_headers['Analysis Alias']]
            reference_fasta = os.path.abspath(analysis_alias_dict[analysis_alias])
            if not (file_name and os.path.isfile(file_name)):
                raise FileNotFoundError(f'The variant file {file_name} provided in spreadsheet {metadata_xlsx} does not exist')
            if not (reference_fasta and os.path.isfile(reference_fasta)):
                raise FileNotFoundError(f'The reference fasta {reference_fasta} in spreadsheet {metadata_xlsx} does not exist')
            vcf_fasta_report_mapping.append([os.path.abspath(file_name), os.path.abspath(reference_fasta), ''])

    return project_title, vcf_fasta_report_mapping
2 changes: 1 addition & 1 deletion eva_sub_cli/validators/docker_validator.py
@@ -12,7 +12,7 @@
logger = logging_config.get_logger(__name__)

container_image = 'ebivariation/eva-sub-cli'
container_tag = 'v0.0.1'
container_tag = 'v0.0.2.dev0'
container_validation_dir = '/opt/vcf_validation'
container_validation_output_dir = 'vcf_validation_output'

28 changes: 17 additions & 11 deletions tests/test_orchestrator.py
@@ -8,6 +8,7 @@
from requests import HTTPError

from eva_sub_cli import SUB_CLI_CONFIG_FILE
from eva_sub_cli.exceptions.invalid_file_type_exception import InvalidFileTypeError
from eva_sub_cli.exceptions.submission_not_found_exception import SubmissionNotFoundException
from eva_sub_cli.exceptions.submission_status_exception import SubmissionStatusException
from eva_sub_cli.orchestrator import orchestrate_process, VALIDATE, SUBMIT, DOCKER, check_validation_required
@@ -149,16 +150,17 @@ def test_orchestrate_submit_no_validate(self):

    def test_orchestrate_with_vcf_files(self):
        with patch('eva_sub_cli.orchestrator.WritableConfig') as m_config, \
                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator:
            orchestrate_process( self.test_sub_dir, self.vcf_files, self.reference_fasta, self.metadata_json,
                                 self.metadata_xlsx, tasks=[VALIDATE], executor=DOCKER)
                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator, \
                patch('eva_sub_cli.orchestrator.os.path.isfile'):
            orchestrate_process(self.test_sub_dir, self.vcf_files, self.reference_fasta, self.metadata_json,
                                self.metadata_xlsx, tasks=[VALIDATE], executor=DOCKER)
            # Mapping file was created from the vcf and assembly files
            assert os.path.exists(self.mapping_file)
            with open(self.mapping_file) as open_file:
                reader = csv.DictReader(open_file, delimiter=',')
                for row in reader:
                    assert row['vcf'].__contains__('vcf_file')
                    assert row['report'] == None
                    assert row['report'] == ''
            m_docker_validator.assert_any_call(
                self.mapping_file, self.test_sub_dir, self.project_title, self.metadata_json, self.metadata_xlsx,
                submission_config=m_config.return_value, shallow_validation=False
@@ -187,7 +189,8 @@ def test_orchestrate_with_metadata_json_with_asm_report(self):
        shutil.copy(os.path.join(self.resource_dir, 'EVA_Submission_test_with_asm_report.json'), self.metadata_json)

        with patch('eva_sub_cli.orchestrator.WritableConfig') as m_config, \
                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator:
                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator, \
                patch('eva_sub_cli.orchestrator.os.path.isfile'):
            orchestrate_process(self.test_sub_dir, None, None, self.metadata_json, None,
                                tasks=[VALIDATE], executor=DOCKER)
            # Mapping file was created from the metadata_json
@@ -207,7 +210,8 @@ def test_orchestrate_vcf_files_takes_precedence_over_metadata(self):
        shutil.copy(os.path.join(self.resource_dir, 'EVA_Submission_test_with_asm_report.json'), self.metadata_json)

        with patch('eva_sub_cli.orchestrator.WritableConfig') as m_config, \
                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator:
                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator, \
                patch('eva_sub_cli.orchestrator.os.path.isfile'):
            orchestrate_process(self.test_sub_dir, self.vcf_files, self.reference_fasta, self.metadata_json,
                                None, tasks=[VALIDATE], executor=DOCKER, resume=False)
            # Mapping file was created from the metadata_json
@@ -216,15 +220,13 @@ def test_orchestrate_vcf_files_takes_precedence_over_metadata(self):
                reader = csv.DictReader(open_file, delimiter=',')
                for row in reader:
                    assert row['vcf'].__contains__('vcf_file')
                    assert row['report'] == None
                    assert row['report'] == ''
            m_docker_validator.assert_any_call(
                self.mapping_file, self.test_sub_dir, self.project_title, self.metadata_json, None,
                submission_config=m_config.return_value, shallow_validation=False
            )
            m_docker_validator().validate_and_report.assert_called_once_with()



    def test_orchestrate_with_metadata_xlsx(self):
        with patch('eva_sub_cli.orchestrator.WritableConfig') as m_config, \
                patch('eva_sub_cli.orchestrator.DockerValidator') as m_docker_validator:
@@ -244,12 +246,16 @@ def test_orchestrate_with_metadata_xlsx(self):
            m_docker_validator().validate_and_report.assert_called_once_with()

    def test_metadata_file_does_not_exist_error(self):
        with self.assertRaises(Exception) as context:
        with self.assertRaises(FileNotFoundError) as context:
            orchestrate_process(self.test_sub_dir, None, None, None, 'Non_existing_metadata.xlsx',
                                tasks=[VALIDATE], executor=DOCKER)
        self.assertRegex(
            str(context.exception),
            r"The provided metadata file .*/resources/test_sub_dir/Non_existing_metadata.xlsx does not exist"
        )


    def test_fasta_file_compressed(self):
        with patch('eva_sub_cli.orchestrator.os.path.isfile'):
            with self.assertRaises(InvalidFileTypeError):
                orchestrate_process(self.test_sub_dir, self.vcf_files, self.reference_fasta + '.gz', self.metadata_json,
                                    self.metadata_xlsx, tasks=[VALIDATE], executor=DOCKER)
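
The behaviour of the new `validate_vcf_mapping` check can be tried outside the test suite with a sketch like the one below. It re-declares the exception and the function as they appear in this commit rather than importing them from the package, and runs them against temporary files. The `check` and `demo` helpers and the file names (`calls.vcf`, `genome.fa`) are made up for illustration.

```python
import os
import tempfile


class InvalidFileTypeError(Exception):
    """Copy of the exception added in this commit."""
    def __init__(self, message):
        self.message = message
        super().__init__(self.message)


def validate_vcf_mapping(vcf_mapping):
    """Copy of the validation added in this commit (messages shortened)."""
    for vcf_file, fasta_file, report_file in vcf_mapping:
        if not (vcf_file and os.path.isfile(vcf_file)):
            raise FileNotFoundError(f'The variant file {vcf_file} does not exist.')
        if not (fasta_file and os.path.isfile(fasta_file)):
            raise FileNotFoundError(f'The reference fasta {fasta_file} does not exist.')
        if fasta_file.lower().endswith('gz'):
            raise InvalidFileTypeError(f'The reference fasta {fasta_file} is compressed.')
        if report_file and not os.path.isfile(report_file):
            raise FileNotFoundError(f'The assembly report file {report_file} does not exist.')


def check(vcf_mapping):
    """Hypothetical helper: return 'ok' or the name of the exception raised."""
    try:
        validate_vcf_mapping(vcf_mapping)
        return 'ok'
    except (FileNotFoundError, InvalidFileTypeError) as error:
        return type(error).__name__


def demo():
    """Run the check on a valid mapping, a compressed FASTA and a missing VCF."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        vcf = os.path.join(tmp_dir, 'calls.vcf')
        fasta = os.path.join(tmp_dir, 'genome.fa')
        fasta_gz = os.path.join(tmp_dir, 'genome.fa.gz')
        # Empty placeholder files are enough: only existence and file name are checked.
        for path in (vcf, fasta, fasta_gz):
            open(path, 'w').close()
        missing_vcf = os.path.join(tmp_dir, 'missing.vcf')
        return [
            check([[vcf, fasta, '']]),
            check([[vcf, fasta_gz, '']]),
            check([[missing_vcf, fasta, '']]),
        ]


if __name__ == '__main__':
    print(demo())
```

Note that the compression check is based purely on the file name ending in `gz`, so a compressed FASTA with a different extension would not be caught at this stage.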
