Releases: EBIvariation/eva-pipeline
Better progress tracking, dropping studies and multiple versions of Ensembl annotation supported
Version 2.0 of the EVA pipeline has changed the workflow manager technology. Instead of tracking the progress of steps as a whole (done / not done), it now splits the work into chunks of configurable size. This way, if a step has processed millions of variants before failing, it resumes from that point instead of restarting from scratch.
This version replicates the functionality present in the previous version and adds the following:
- Better management of job parameters, reporting of user errors, etc.
- New job for dropping studies
- Multiple versions of functional annotation generated using Ensembl Variant Effect Predictor
- Better detection of when the Ensembl VEP process hangs or fails
- Displaying estimated progress percentage for several steps
Please note that the database schema has changed to support multiple versions of the annotation. The migration tool can be found in a module of the eva-tools repository.
Support for multiple versions of VEP annotation
The database schema has been heavily modified to support multiple versions of the Variant Effect Predictor (VEP) annotation. While loading a VCF file, its variants can be annotated using either an existing or a new VEP version. The job that annotates already loaded studies, or a whole database, has also been modified accordingly.
The introduced database changes are:
- The bulk of the 'annot' subdocument in the variants collection has been extracted to the 'annotations' collection
- The variants collection only stores those fields used as indexes for efficient filtering
- New 'annotationMetadata' collection introduced, listing the versions of VEP used to annotate the variants in the database
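As a rough illustration of the split described in the list above, the sketch below moves a variant's embedded annotation into its own document and leaves only a few filterable fields on the variant. Every field name used here (`ct`, `so`, `gn`, `xrefs`, `vepv`, `cachev`) and the ID format are hypothetical placeholders; the actual schema is defined by the pipeline and by the migration scripts in eva-tools.

```python
from copy import deepcopy


def split_annotation(variant, vep_version, cache_version):
    """Split an embedded 'annot' subdocument out of a variant document.

    Returns (slim_variant, annotation_doc, metadata_doc).
    All field names are illustrative assumptions, not the real schema.
    """
    slim = deepcopy(variant)  # leave the caller's document untouched
    annot = slim.pop("annot", {})
    consequences = annot.get("ct", [])

    # The full annotation goes to the 'annotations' collection,
    # keyed by variant ID plus the versions that produced it.
    annotation_doc = {
        "_id": f"{slim['_id']}_{vep_version}_{cache_version}",
        "vepv": vep_version,
        "cachev": cache_version,
        "ct": consequences,
    }

    # The variant keeps only what is needed for indexed filtering,
    # one entry per annotation version.
    slim["annot"] = [{
        "vepv": vep_version,
        "cachev": cache_version,
        "so": sorted({so for ct in consequences for so in ct.get("so", [])}),
        "xrefs": sorted({ct["gn"] for ct in consequences if "gn" in ct}),
    }]

    # One 'annotationMetadata' entry per version used in the database.
    metadata_doc = {
        "_id": f"{vep_version}_{cache_version}",
        "vepv": vep_version,
        "cachev": cache_version,
    }
    return slim, annotation_doc, metadata_doc
```

Keeping one slim entry per version on the variant is one way to let several annotation versions coexist while still filtering on indexed fields.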
If you have been using a previous version of this software, please check the eva-tools repository to obtain the database migration scripts.
Annotation job and more batch friendly VEP execution
This release includes a new job that only regenerates the annotation. The execution of the steps associated with VEP has also been improved: the pipeline now detects when the external process hangs, and communicates progress and errors better.
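To make the idea of detecting a hung external process concrete, here is a minimal sketch that uses a wall-clock timeout. This is only the simplest possible variant and an assumption on our part; the function name and the three outcome labels are hypothetical and do not describe how the pipeline itself monitors VEP.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run an external command and classify the outcome.

    Returns "ok", "failed" (non-zero exit code) or "hung"
    (the process did not exit within the timeout).
    """
    try:
        completed = subprocess.run(
            command, capture_output=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising.
        return "hung"
    return "ok" if completed.returncode == 0 else "failed"


if __name__ == "__main__":
    # A child process that sleeps far longer than the allowed time.
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(sleeper, timeout_seconds=1))
```

Distinguishing "hung" from "failed" lets the caller report a more useful error than a generic step failure.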
New job for dropping studies, improved progress tracking
This release includes a new job that allows users to drop a study. Users only need to provide the study ID, and the job will take care of the variants and the files where they were reported.
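One plausible reading of what dropping a study involves is sketched below on plain in-memory lists. The data layout (a `files` list on each variant, entries tagged with a `sid` study ID) is assumed for illustration, and the rule that a variant disappears once no study reports it is our assumption too, not a description of the actual job.

```python
def drop_study(variants, files, study_id):
    """Return (variants, files) with every trace of `study_id` removed.

    variants: list of dicts, each with a "files" list whose entries
              carry a study ID under "sid" (hypothetical layout)
    files:    list of dicts describing loaded files, each with "sid"
    """
    remaining_variants = []
    for variant in variants:
        entries = [e for e in variant["files"] if e["sid"] != study_id]
        if entries:
            # Still reported by another study, so keep the variant.
            remaining_variants.append({**variant, "files": entries})
    remaining_files = [f for f in files if f["sid"] != study_id]
    return remaining_variants, remaining_files
```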
Progress tracking has been improved in the step that loads variants into the system. The completion percentage is now displayed, along with the number of variants read from the VCF, written into the database, and skipped due to any kind of issue. This tracking will be added to other steps in future releases.
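The counters mentioned above could be tracked along the following lines. This is a sketch only: the class name, its fields, and the idea of knowing the expected number of variants up front are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LoadProgress:
    expected: int      # variants expected in total (assumed known or estimated)
    read: int = 0      # variants read from the VCF so far
    written: int = 0   # variants written into the database
    skipped: int = 0   # variants skipped due to any kind of issue

    def record(self, written, skipped):
        """Account for one processed batch of variants."""
        self.read += written + skipped
        self.written += written
        self.skipped += skipped

    def percentage(self):
        if self.expected <= 0:
            return 0.0
        # Cap at 100 in case the estimate was too low.
        return min(100.0, 100.0 * self.read / self.expected)

    def summary(self):
        return (f"{self.percentage():.1f}% (read {self.read}, "
                f"written {self.written}, skipped {self.skipped})")
```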
Usability and software architecture improvements
The initial goal of this beta release was to improve unit/integration tests and the architecture as a whole, to ensure it could be extended more easily in the future. As a side effect, we also managed to improve usability!
These are the outcomes of the last months of work:
- Tests are now completely independent from each other, using random test folders and Mongo databases
- Job parameters are fully validated using the Spring Batch API
- Job parameters can be conveniently read from a properties file, in addition to CLI arguments
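The combination of a properties file, CLI arguments, and validation can be pictured with the sketch below. It is written in Python rather than with the Spring Batch API, the parameter names (`input.vcf`, `input.study.id`, `chunk.size`) are hypothetical, and letting CLI arguments override the file is an assumption about precedence.

```python
REQUIRED = ("input.vcf", "input.study.id")  # hypothetical parameter names


def parse_properties(text):
    """Parse simple 'key=value' lines, ignoring blanks and comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got {line!r}")
        params[key.strip()] = value.strip()
    return params


def parse_cli(args):
    """Parse arguments given as 'key=value' or '--key=value'."""
    params = {}
    for arg in args:
        key, sep, value = arg.lstrip("-").partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got {arg!r}")
        params[key] = value
    return params


def resolve_parameters(properties_text, cli_args, required=REQUIRED):
    """Merge both sources (CLI wins, by assumption) and validate them."""
    params = {**parse_properties(properties_text), **parse_cli(cli_args)}
    missing = [name for name in required if not params.get(name)]
    if missing:
        raise ValueError("Missing required job parameters: " + ", ".join(missing))
    return params
```

Validating before the job starts means a missing parameter is reported as a user error immediately, rather than surfacing halfway through a long-running step.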
Technology migration for better restartability
Version 2.0 of the EVA pipeline will move from Luigi to Spring Batch. Instead of tracking the progress of steps as a whole (done / not done), Spring Batch splits the work into chunks of configurable size. This way, if a step has processed millions of variants before failing, it resumes from that point instead of restarting from scratch.
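The restart behaviour described above can be pictured with a small sketch. It is a Python stand-in for the concept, not Spring Batch code: the `checkpoint` dict plays the role of a persistent job repository, and all names are hypothetical.

```python
def run_step(items, chunk_size, checkpoint, process_chunk):
    """Process `items` in chunks, committing a restart offset per chunk.

    checkpoint:    dict standing in for persistent job state; its
                   "offset" entry survives a failed run
    process_chunk: callable that handles one chunk and may raise
    """
    start = checkpoint.get("offset", 0)
    for begin in range(start, len(items), chunk_size):
        chunk = items[begin:begin + chunk_size]
        process_chunk(chunk)  # on failure the offset is left untouched
        checkpoint["offset"] = begin + len(chunk)  # commit after success
    return checkpoint.get("offset", 0)
```

If a run fails partway through, calling `run_step` again with the same `checkpoint` skips every chunk that was already committed and repeats only the chunk that failed.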
The functionality implemented for this first beta includes:
- Normalization of variants reported in a VCF file
- Storage of variants in MongoDB
- Calculation of allele frequencies and other statistics for all the samples in a VCF file
- Annotation using Ensembl Variant Effect Predictor
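For readers unfamiliar with the statistics step, allele frequencies can be derived from the genotype (GT) strings of a VCF record roughly as follows. This is a generic sketch of the calculation, not the pipeline's own code; the function name is hypothetical.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Compute allele frequencies from VCF GT strings.

    genotypes: strings such as "0/1" or "1|1"; "." marks a missing allele.
    Returns a dict mapping allele index (0 = reference) to its frequency
    among the called alleles, or an empty dict if nothing was called.
    """
    counts = Counter()
    for gt in genotypes:
        # Phased ("|") and unphased ("/") genotypes are counted alike.
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[int(allele)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in sorted(counts.items())}
```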
Future beta releases will include support for population statistics via a PED file and improved usability.
Multi-step pipeline
Improved restartability by running the pipeline in multiple steps.
Added support for population statistics when combining VCF and PED files, and for loading annotations generated by VEP.
EVA automated pipeline using Luigi
First production version of the European Variation Archive pipeline, implemented using Luigi by Spotify.