List of files to be stored in database for webserver (front-/backend) consumption #85
Comments
@thomashopf : @b-schubert and I were wondering which files are necessary for the frontend. This is the list @b-schubert came up with. Anything to add or remove? The idea is to store these files somewhere after the pipeline is run (e.g. in a DB, see #84 ) and then have the backend fetch the data from there instead of fetching it from disk. @kpgbrock : @b-schubert was unsure about the files to store from the fold stage. Can you help us out there? Thanks :) |
I think the best way to list this is in terms of
items that will compose what is on the website rather than stages, and
keys in the output file
There will also be entries that are not files but just values that should be available for querying.

General job info (already available at submit time)
sequence_id
sequence_file
target_sequence_file

Sequence alignment (for alignment view)
alignment_file (for alignment viewer)
annotation_file (for displaying additional information for each sequence)
statistics_file (for job stats, and for displaying alignment coverage)
frequencies_file (for sequence logo)
The following are in the statistics.csv file, but I would store them anyway because they are relevant for the job overview:
num_sequences
segments
effective_sequences
num_sites

ECs (for EC contact map view)
If no comparison is available:
ec_file
ec_longrange_file
If a comparison is available:
ec_compared_file
ec_compared_longrange_file
And:
enrichment_file (for displaying the enrichment table / visualizing enrichment on the 3D structure)
evzoom_file (for EVzoom view)
I would be hesitant to put the model_file into the database since we are often talking about hundreds of megabytes per file (if we did store it, this would however allow the user to quickly predict arbitrary mutations by supplying a list of mutations after the pipeline has run, which would be very nice).

Mutation effects (for mutation matrix view - epistatic and independent models)
mutation_matrix_file

Experimental PDB structure information
pdb_structure_hits_file (for showing which structures were found)
monomer_contacts_file (for displaying structure contacts on the contact map without loading the full distance map)
multimer_contacts_file (for displaying structure contacts on the contact map without loading the full distance map)
remapped_pdb_files (for showing ECs/enrichment/mutation effects on experimental structures)

Structure prediction
sec_struct_file (for showing predicted secondary structure on contact maps / mutation matrices)
folded_structure_files (for showing predicted structures using NGL, and for visualizing mutation effects, enrichment and ECs)
folding_ranking_file (for selecting blindly in which order to display structures)
folding_comparison_file (for showing how good the models are, if there is an experimental 3D structure)

Archive download
archive_file ... this actually needs serious discussion: we want to allow the user to download the full output archive, but just plonking it into the database feels wrong, and so does storing all the result files in the database and generating the archive on the fly.

Probably no need to store pml files unless these can be easily fed into NGL (or the like) or transformed to be reused |
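A minimal sketch of what a per-job record along these lines could look like in a document database (MongoDB is only one of the options discussed in #84). The field names mirror the output keys listed above; the collection layout, job_id format and example values are illustrative assumptions, not part of the pipeline:

```python
# Hedged sketch: one document per job in a hypothetical "jobs" collection.
# Field names mirror the pipeline output keys listed above; job_id, status
# and the inline-vs-referenced file split are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
jobs = client["evcouplings"]["jobs"]

job_document = {
    "job_id": "abc123",          # assumed identifier format
    "status": "done",
    # general job info (available at submit time)
    "sequence_id": "RASH_HUMAN",
    # scalar values for querying, taken directly from the pipeline output
    "num_sequences": 12345,
    "effective_sequences": 4567.8,
    "num_sites": 166,
    "segments": ["A_1"],
    # small result files stored inline or referenced; large ones kept elsewhere
    "files": {
        "alignment_file": {"gridfs_id": None, "path": "align/test.a2m"},
        "ec_file": {"gridfs_id": None, "path": "couplings/test_CouplingScores.csv"},
        "frequencies_file": {"gridfs_id": None, "path": "align/test_frequencies.csv"},
    },
}

jobs.replace_one({"job_id": job_document["job_id"]}, job_document, upsert=True)
```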
Agree ( amateur speaking here ) |
Since this is inherently related to the REST API endpoints, I'll see if I get around to drafting a reasonable set of endpoints this weekend. |
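This is not the actual endpoint draft (that ended up in debbiemarkslab/EVcouplings-server-backend#4, linked further down), just a hedged sketch of the kind of routes the file list implies. All route names and the get_job_document helper are assumptions:

```python
# Hedged sketch of possible REST endpoints; route names and the
# get_job_document() lookup are illustrative, not the final API.
from flask import Flask, jsonify, abort

app = Flask(__name__)

def get_job_document(job_id):
    # placeholder for a database lookup (e.g. the job document sketched above)
    return None

@app.route("/jobs/<job_id>")
def job_overview(job_id):
    job = get_job_document(job_id)
    if job is None:
        abort(404)
    # scalar values only: status, num_sequences, num_sites, ...
    return jsonify({k: job[k] for k in ("status", "num_sequences", "num_sites")})

@app.route("/jobs/<job_id>/couplings")
def job_couplings(job_id):
    job = get_job_document(job_id)
    if job is None:
        abort(404)
    # would return the stored ec_file / ec_longrange_file contents
    return jsonify(job["files"]["ec_file"])
```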
@thomashopf thanks for the list. Just to re-iterate the idea of doing this in the first place: @b-schubert said storing the computations takes up too much space, plus you would like an easy way of changing the provider (where the data fed to the frontend is stored). I proposed storing only the files relevant for the frontend in a database (possibly Mongo; I fully agree with point 4 in #84 (comment) ) for runs issued via the Flask backend. Additionally, I would delete the big result file (the compressed archive) from the FS if the job is run from the web. This in turn means that the archive a user running the job via the web can download is much lighter than the one generated by running the computation locally. So, in this sense:
I think the model file (from what I have understood) is quite important. Can this really be removed completely from a possible download?
Again, here it boils down to design. As far as I understood (as you mentioned a couple of months ago), the webpage is a stripped-down version of the complete pipeline. The advanced user will use the pipeline directly. It therefore makes sense to give access to only the information that the webpage really offers to the user downloading the results.
I don't know how you thought about implementing this, but I would store the entire file. Reason: if we start chunking files, we create new "parsers" and logic around what gets computed and then stored in the DB and what does not. Simply "copy-pasting" results sounds like the way to go to me.
I will look into this. Here ( #85 (comment) ) you are talking about endpoints in https://github.com/debbiemarkslab/EVcouplings-server-backend , right? |
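On storing entire files rather than chunking or parsing them: if MongoDB were the broker, whole result files above the 16 MB document limit would typically go into GridFS, with only a reference kept in the job document. A hedged sketch, where the database/collection names and the file selection are assumptions:

```python
# Hedged sketch: copy a selected result file verbatim into GridFS and keep
# only a reference in the job document; names and paths are illustrative.
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["evcouplings"]
fs = gridfs.GridFS(db)

def store_result_file(job_id, key, path):
    """Store one pipeline output file as-is and record its GridFS id."""
    with open(path, "rb") as fh:
        file_id = fs.put(fh, filename=path, job_id=job_id, result_key=key)
    db.jobs.update_one(
        {"job_id": job_id},
        {"$set": {f"files.{key}.gridfs_id": file_id}},
        upsert=True,
    )
    return file_id
```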
Storage space, at least on Orchestra, is not much of an issue if one doesn't keep the results forever and actually enforces deletion (the current server setup doesn't, and no one bothered...) after e.g. 2 weeks with a cronjob. The pipeline also already has a delete setting in the management section that allows it to clean up after itself to minimize space usage; it is just turned off by default to allow reusing results during reruns. If one deleted raw_alignment_file and model_file, the biggest space offenders would already be gone.

Database space on Orchestra might be more of a limitation - the current server's relational database was at about 50 GB a while ago, and RITG was asking to bring down its size because the databases are on more expensive storage that is heavily backed up. (Side note: last time I checked there was no MongoDB or the like, at least not advertised openly, so I would check with RITG early on - I created a ticket in the backend repo for this.)

That being said, I think in terms of architecture it is much nicer to have a broker like a database, giving full freedom to decouple the server backend from the computational pipeline; this alone justifies it. Regarding the results archive:
So the trade-off is simply between options 1) storing the pre-built archive file as-is and 2) storing the individual result files and generating the archive on the fly.
If one implements the list of items to store in the database as a configurable list like management.archive, this decision is entirely moved into the server backend, which would be nice. If one chooses 2), the choice of whether to include the model file could be made dynamically when the user selects to download the archive.
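A hedged sketch of what such a configurable list could look like on the backend side, analogous to management.archive. The database_items whitelist and the shape of the pipeline output dictionary are assumptions, not an existing pipeline setting:

```python
# Hedged sketch: filter the pipeline's final output configuration down to a
# configurable whitelist before writing anything into the database.
# DATABASE_ITEMS is a hypothetical setting analogous to management.archive.
DATABASE_ITEMS = [
    "alignment_file", "statistics_file", "frequencies_file",
    "ec_file", "ec_longrange_file", "mutation_matrix_file",
    "num_sequences", "effective_sequences", "num_sites",
]

def select_database_items(outcfg, items=DATABASE_ITEMS):
    """Keep only the whitelisted result keys (files or plain values)."""
    return {key: outcfg[key] for key in items if key in outcfg}
```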
Yes, it is a core piece computationally, but it is not a relevant download for 95% of server users, who would otherwise get hassled with hundreds of MB of incompressible model (using it also requires programming knowledge, at which point people should be able to run the pipeline themselves). The main application of putting it in the database would be to allow users to dynamically predict mutation effects of their choice, if one fires off a Celery worker in the server backend for that computation.
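A hedged sketch of the Celery idea: the broker URL, task name and the predict_mutations helper (which would wrap the stored coupling model) are assumptions, not existing backend code.

```python
# Hedged sketch: an on-demand mutation-effect prediction task fired by the
# server backend; broker URL, task name and predict_mutations() are assumed.
from celery import Celery

celery_app = Celery("evcouplings_server", broker="redis://localhost:6379/0")

def predict_mutations(model_path, mutations):
    # placeholder: would load the stored coupling model and score each mutation
    return {m: None for m in mutations}

@celery_app.task
def predict_user_mutations(job_id, model_path, mutations):
    """Score a user-supplied list of mutations (e.g. ["A24G", "L55R"])."""
    return {"job_id": job_id, "effects": predict_mutations(model_path, mutations)}
```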
Simplified in what it takes to run the computation, plus intuitive visualization on the webserver. As I wrote above, the default selection of files in the archive is a relevant selection of outputs; on the whole, this shouldn't be dumbed down (but also not blown up any further). If an intermediate user wants more detail in the output, the archive is where to look. The two single biggest files will almost always be the alignment and the model. So I think the best strategy is the following (given that the things that end up in the DB will be flexible, it is not much of a decision anyway):
The alignment_statistics.csv file is a convenience output largely assembled from result items that are part of the output configuration by default (e.g. num_sites or num_effective_sequences). So there is no need to parse anything; these key-value pairs could go directly from the pipeline into the database without touching the csv file at all, which I find preferable. The file is more interesting for the output archive than for the webserver (the list was meant to show all available options).
Btw I would also verify very early on if NGL supports all necessary visualization features (I'll create a separate issue for this) and plays nicely with React, or if another viewer might be better suited.
Yes to come soon. |
Joining the discussion .... |
From the discussion with @cccsander : |
Yes fully agree, including it makes the settings 100% transparent and reproducible. Being able to upload a config file would be a very nice feature too, but then this becomes an absolute input validation nightmare and probably users should be using the pipeline locally. |
API endpoints using these files defined here: debbiemarkslab/EVcouplings-server-backend#4 |
I feel like this is very, very much done, except that it's in #166 |
subfoldbitscore/align/*_alignment_statistics.csv
*_job_statistics_summary.pdf --> might be missing

Align
align\*_alignment_statistics.csv
align\*.fa
align\*.a2m
align\*_frequencies.csv

Couplings
couplings\*.model
couplings\*_enrichment_sausage
couplings\*_enrichment_sphere
couplings\*_evzoom.json
if no compare:
couplings\*_CouplingScores.csv
else:
compare\*_CouplingScoresCompared_all.csv

Compare
compare\*_CouplingScoresCompared_longrange.csv
compare\*_structure_hits.csv

Mutate
mutate\*_mutate_matrix.csv

Fold
fold\*_secondary_structure.csv
fold\*_ranking.csv
if structures available:
fold\*_comparison.csv (has more cols)
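A hedged sketch of collecting the files in this checklist from a finished run directory by glob pattern. The stage-to-pattern mapping follows the list above (with forward slashes for portability); the directory layout and function name are assumptions:

```python
# Hedged sketch: gather checklist files from a run directory by glob pattern.
# Patterns follow the per-stage list above.
from pathlib import Path

STAGE_PATTERNS = {
    "align": ["align/*_alignment_statistics.csv", "align/*.fa",
              "align/*.a2m", "align/*_frequencies.csv"],
    "couplings": ["couplings/*.model", "couplings/*_evzoom.json",
                  "couplings/*_CouplingScores.csv"],
    "compare": ["compare/*_CouplingScoresCompared_all.csv",
                "compare/*_CouplingScoresCompared_longrange.csv",
                "compare/*_structure_hits.csv"],
    "mutate": ["mutate/*_mutate_matrix.csv"],
    "fold": ["fold/*_secondary_structure.csv", "fold/*_ranking.csv",
             "fold/*_comparison.csv"],
}

def collect_run_files(run_dir):
    """Return {stage: [matching file paths]} for one finished run."""
    root = Path(run_dir)
    return {stage: [str(p) for pattern in patterns for p in root.glob(pattern)]
            for stage, patterns in STAGE_PATTERNS.items()}
```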