
List of files to be stored in database for webserver (front-/backend) consumption #85

Closed
sacdallago opened this issue Nov 9, 2017 · 11 comments

@sacdallago
Member

sacdallago commented Nov 9, 2017

  • Master config file (job parameters)
  • subfoldbitscore/align/*_alignment_statistics.csv
  • *_job_statistics_summary.pdf --> Might be missing

Align

  • Master alignment statistics file (align\*_alignment_statistics.csv)
  • Sequences file (align\*.fa)
  • Alignment file (align\*.a2m)
  • Frequencies file (align\*_frequencies.csv)

Couplings

  • Couplings model file (couplings\*.model)
  • Enrichment file (couplings\*_enrichment_sausage)
  • Enrichment file (couplings\*_enrichment_sphere)
  • EV_Zoom file (couplings\*_evzoom.json)

if no compare:

  • Coupling scores file (couplings\*_CouplingScores.csv)

else:

  • Coupling scores file (compare\*_CouplingScoresCompared_all.csv)

Compare

  • Coupling scores file (compare\*_CouplingScoresCompared_longrange.csv)
  • Structure hits file (compare\*_structure_hits.csv)

Mutate

  • Mutation effect file (mutate\*_mutate_matrix.csv)

Fold

  • Sec structure file (fold\*_secondary_structure.csv)
  • Model ranking file (fold\*_ranking.csv)

if structures available:

  • Model ranking file (fold\*_comparison.csv) (has more cols)
@sacdallago
Member Author

@thomashopf : @b-schubert and I were wondering which files are necessary for the frontend. This is the list @b-schubert came up with. Anything to add or remove? The idea is to store these files somewhere after the pipeline is run (e.g. in a DB, see #84 ) and then have the backend fetch the data from there instead of fetching it from disk.
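To make the store-then-fetch idea a bit more concrete, here is a rough sketch of what it could look like with MongoDB/GridFS (pymongo assumed; job ids, keys, and paths below are purely illustrative):

```python
# Rough sketch only -- assumes pymongo/GridFS; job ids, keys and paths are illustrative.
import gridfs
from pymongo import MongoClient

db = MongoClient()["evcouplings"]
fs = gridfs.GridFS(db)

def store_result_file(job_id, key, path):
    """Put one pipeline output file into GridFS and index it under the job."""
    with open(path, "rb") as handle:
        file_id = fs.put(handle, filename=path, metadata={"job_id": job_id, "key": key})
    db.jobs.update_one(
        {"_id": job_id},
        {"$set": {"files." + key: file_id}},
        upsert=True,
    )

def fetch_result_file(job_id, key):
    """What the backend would do instead of reading from disk."""
    job = db.jobs.find_one({"_id": job_id})
    return fs.get(job["files"][key]).read()
```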

@kpgbrock : @b-schubert was unsure about the files to store from the fold stage. Can you help us out there?

Thanks :)

@thomashopf
Contributor

thomashopf commented Nov 9, 2017

I think the best way to list this is in terms of

  1. items that will compose what is on the website rather than stages, and
  2. keys in the output file

There will also be entries that are not files but just values that should be available for querying.

General job info (already available at submit time)

  • sequence_id
  • sequence_file
  • target_sequence_file

Sequence alignment (for alignment view)

  • alignment_file (for alignment viewer)
  • annotation_file (for displaying additional information for each sequence)
  • statistics_file (for job stats, and displaying alignment coverage)
  • frequencies_file (for sequence logo)

The following are in the statistics.csv file, but I would store them anyway because they are relevant for the job overview:

  • num_sequences
  • segments
  • effective_sequences
  • num_sites

ECs (for EC contact map view)

If no comparison available:

  • ec_file
  • ec_longrange_file

If comparison available:

  • ec_compared_file
  • ec_compared_longrange_file

And:

  • enrichment_file (for displaying enrichment table / visualizing enrichment on 3D structure)
  • evzoom_file (for EVzoom view)

I would be hesitant to put the model_file into the database, since we are often talking hundreds of megabytes per file (if we do store it, however, this would allow the user to quickly predict arbitrary mutations by supplying a list of mutations after the pipeline has run, which would be very nice)

Mutation effects (for mutation matrix view - epistatic and independent models)

  • mutation_matrix_file

Experimental PDB structure information

  • pdb_structure_hits_file (for showing which structures were found)
  • monomer_contacts_file (for displaying structure contacts on contact map without loading full distance map)
  • multimer_contacts_file (for displaying structure contacts on contact map without loading full distance map)
  • remapped_pdb_files (for showing ECs/enrichment/mutation effects on experimental structures)

Structure prediction

  • sec_struct_file (for showing predicted secondary structure on contact maps / mutation matrices)
  • folded_structure_files (for showing predicted structures using NGL, visualizing mutation effects, enrichment, ECs)
  • folding_ranking_file (for selecting blindly in which order to display structures)
  • folding_comparison_file (for showing how good models are, if there is experimental 3D structure)

Archive download

  • archive_file ... this actually needs serious discussion: we want to allow the user to download the full output archive, but just plonking it into the database feels wrong, and so does storing all the result files in the database and generating the archive on the fly

Probably no need to store pml files unless these can be easily fed into NGL (or the like) or transformed to be reused
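To illustrate (not a final schema), a single job entry in the database could roughly look like the following, with file-valued keys pointing at stored blobs/GridFS ids and plain values kept directly for querying:

```python
# Purely illustrative job document built from the keys listed above;
# "<file ref>" stands for a stored blob / GridFS id, other values are plain.
job_document = {
    # general job info (available at submit time)
    "sequence_id": "...",
    "sequence_file": "<file ref>",
    "target_sequence_file": "<file ref>",
    # sequence alignment
    "alignment_file": "<file ref>",
    "annotation_file": "<file ref>",
    "statistics_file": "<file ref>",
    "frequencies_file": "<file ref>",
    # plain values for the job overview
    "num_sequences": 0,
    "num_sites": 0,
    "effective_sequences": 0.0,
    "segments": [],
    # ECs (compared variants instead, if a comparison is available)
    "ec_file": "<file ref>",
    "ec_longrange_file": "<file ref>",
    "enrichment_file": "<file ref>",
    "evzoom_file": "<file ref>",
    # mutation effects
    "mutation_matrix_file": "<file ref>",
    # experimental PDB structure information
    "pdb_structure_hits_file": "<file ref>",
    "monomer_contacts_file": "<file ref>",
    "multimer_contacts_file": "<file ref>",
    "remapped_pdb_files": ["<file ref>"],
    # structure prediction
    "sec_struct_file": "<file ref>",
    "folded_structure_files": ["<file ref>"],
    "folding_ranking_file": "<file ref>",
    "folding_comparison_file": "<file ref>",
}
```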

@deboramarks

deboramarks commented Nov 9, 2017 via email

@thomashopf
Contributor

Since this is inherently related to the REST API endpoints, I'll see if I can get around to drafting a reasonable set of endpoints this weekend

@sacdallago
Member Author

sacdallago commented Nov 9, 2017

@thomashopf thanks for the list.

Just to reiterate the idea of doing this in the first place: @b-schubert said storing the computations takes up too much space + you would like to have some easy way of changing provider (where the data fed to the frontend is stored). I proposed storing only relevant files (for the frontend) in a database (possibly mongo, I totally agree with .4 in #84 (comment) ) for runs issued via the flask backend. Additionally, I would delete the big result file (the compressed archive) from the FS if the job is run from the web. This in turn means that the archive that a user running the job via the web will be able to download is much lighter than the one generated by running the computation locally.

So, in this sense:

I would be hesitant to put the model_file into the database, since we are often talking hundreds of megabytes per file (if we do store it, however, this would allow the user to quickly predict arbitrary mutations by supplying a list of mutations after the pipeline has run, which would be very nice)

I think the model file (from what I have understood) is quite important. Can this really be removed completely from a possible .zip download of a user running a job via the web?

archive_file ... this actually needs serious discussion: we want to allow the user to download the full output archive, but just plonking it into the database feels wrong, and so does storing all the result files in the database and generating the archive on the fly

Again, here it boils down to design. As far as I understood (as you mentioned a couple of months ago), the webpage is a stripped-down version of the complete pipeline. The advanced user will use the pipeline directly. It therefore makes sense to give the user downloading the .zip access only to the information that the webpage actually offers. This means: lower storage requirements for us + maybe more people interested in also testing, using, and advancing the complete pipeline. Alternatively, jobs issued via the web could optionally be marked as "complete", in which case the entire archive is stored and available for download. Opinions are needed here :) I won't make this call.


The following are in the statistics.csv file, but I would store them anyway because they are relevant for the job overview

I don't know how you thought about implementing this, but I would store the entire file. Reason: if we start chunking files, we create new "parsers" and logic around what gets computed and then stored in the DB and what does not. Simply "copy-pasting" results sounds like the way to go to me.

Probably no need to store pml files unless these can be easily fed into NGL (or the like) or transformed to be reused

I will research this.


Here ( #85 (comment) ) you talk about endpoints in https://github.com/debbiemarkslab/EVcouplings-server-backend , right?

@thomashopf
Contributor

thomashopf commented Nov 10, 2017

Just to reiterate the idea of doing this in the first place: @b-schubert said storing the computations takes up too much space + you would like to have some easy way of changing provider (where the data fed to the frontend is stored). I proposed storing only relevant files (for the frontend) in a database (possibly mongo, I totally agree with .4 in #84 (comment) ) for runs issued via the flask backend. Additionally, I would delete the big result file (the compressed archive) from the FS if the job is run from the web. This in turn means that the archive that a user running the job via the web will be able to download is much lighter than the one generated by running the computation locally.

Storage space, at least on Orchestra, is not much of an issue if one doesn't keep the results forever and actually enforces deletion (the current server setup doesn't, and no one bothered...) after e.g. 2 weeks with a cronjob. The pipeline also already has a delete setting in the management section that allows it to clean up after itself to minimize space usage; it is just turned off by default to allow reusing results during reruns. If one deleted raw_alignment_file and model_file, the biggest space offenders would already be gone.

Database space on Orchestra might be more of a limitation: the current server's relational database was at about 50 GB a while ago, and RITG was asking to bring down its size because the databases are on more expensive storage that is heavily backed up. (Side note: last time I checked there was no MongoDB or the like, at least not advertised openly, so I would check with RITG early on; I created a ticket in the backend repo for this.)

That being said, in terms of architecture I think it is much nicer to have a broker like a database, giving full freedom in decoupling the server backend from the computational pipeline; that alone justifies it.

Regarding the results archive:

  • What gets included in the result archive is a parameter in the configuration file for this very reason. So the server backend can request whatever it wants to be put into the archive; there is no need to delete things, and it could even request no archive at all.

  • The default selection is already a relevant selection of results based on years of experience in the lab with the server in mind (not a full dump of the output folder)

  • The archive will need to contain additional files that are not used by the server front end (e.g. PDF contact map plots)

So the trade-off is simply between options

  1. Just plonking the archive into the database and not worrying about it, wasting some space
  2. Generating the archive on the fly from result files stored in the database as single files, which would make it necessary to store a superset of the files needed for the server frontend (but not that many)

If one implements the list of items to store in the database as a configurable list like management.archive, this decision is moved entirely into the server backend, which would be nice. If one chooses 2), the choice of whether to include the model file could be made dynamically when the user selects to download the archive.
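Something like this is what I have in mind (a sketch only; the "database" list is a hypothetical counterpart to management.archive, and the outputs mapping stands for the pipeline's output keys and file paths):

```python
# Sketch only: a hypothetical "database" list next to management.archive,
# so the server backend fully controls which output items end up in the database.
config = {
    "management": {
        "archive": ["alignment_file", "statistics_file", "ec_file", "evzoom_file"],
        "database": ["alignment_file", "frequencies_file", "ec_file",
                     "evzoom_file", "mutation_matrix_file", "model_file"],
    }
}

def store_selected_outputs(job_id, outputs, config, store_file):
    """Store only the requested outputs; `outputs` maps output keys to file paths,
    and store_file is a function like the one sketched earlier in this thread."""
    for key in config["management"]["database"]:
        if key in outputs:
            store_file(job_id, key, outputs[key])
```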

I think the model file (from what I have understood) is quite important. Can this really be removed completely from a possible .zip download of a user running a job via the web?

Yes, it is a core piece computationally, but not a relevant download for 95% of server users, who would then get hassled with hundreds of MB of incompressible model (it also requires programming knowledge to use, at which point people should be able to run the pipeline). The main reason to put it in the database would be to allow users to dynamically predict mutation effects of their choice, if one fires off a Celery worker in the server backend for that computation.
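Roughly what I mean (sketch only; the broker URL, task name, and the prediction helper below are placeholders, and the exact evcouplings API for scoring a list of mutations would need to be checked):

```python
# Sketch of the "predict arbitrary mutations later" idea; everything below
# (broker URL, task name, run_mutation_prediction helper) is a placeholder.
import tempfile
from celery import Celery

app = Celery("evcouplings_server", broker="redis://localhost:6379/0")

@app.task
def predict_mutations(job_id, mutations):
    # pull the stored couplings model out of the database (see earlier sketch)
    model_bytes = fetch_result_file(job_id, "model_file")
    with tempfile.NamedTemporaryFile(suffix=".model") as tmp:
        tmp.write(model_bytes)
        tmp.flush()
        # placeholder for the actual pipeline call that scores the
        # user-supplied list of mutations against the model
        return run_mutation_prediction(tmp.name, mutations)
```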

Again, here it boils down to design. As far as I understood (as you mentioned a couple of months ago), the webpage is a stripped-down version of the complete pipeline. The advanced user will use the pipeline directly. It therefore makes sense to give the user downloading the .zip access only to the information that the webpage actually offers. This means: lower storage requirements for us + maybe more people interested in also testing, using, and advancing the complete pipeline. Alternatively, jobs issued via the web could optionally be marked as "complete", in which case the entire archive is stored and available for download. Opinions are needed here :) I won't make this call.

Simplified in what it takes to run the computation, plus intuitive visualization on the webserver. As I wrote above, the default selection of files in the archive is already a relevant selection of outputs; on the whole, this shouldn't be dumbed down (but also not blown up any further). If an intermediate user wants more detail in the output, the archive is where to look.

The two biggest files will almost always be the alignment and the model.

So I think the best strategy is the following (given that the set of things that ends up in the DB will be configurable, it is not much of a decision anyway):

  • Model file goes into the database. If we see that this blows up the database too much, it will have to go (which would kill the option to download it and predict mutations dynamically)
  • The default is that the model file is not in the result archive. If a user wants it, they have to request it explicitly. If we choose option 1) above, this means including model_file in management.archive in the configuration when submitting; if we choose 2), one could offer the choice using a radio box or two links (see the sketch below). Based on this consideration, I would lean towards 2) for flexibility and minimizing database size.
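A sketch of what 2) could look like on the backend side (db and fs as in the earlier GridFS sketch; whether to include the model file becomes a simple flag at download time):

```python
# Sketch of option 2): build the result archive on the fly from files stored
# in the database, with the model file included only on explicit request.
import io
import zipfile

def build_archive(job_id, include_model=False):
    job = db.jobs.find_one({"_id": job_id})  # db / fs as in the earlier sketch
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for key, file_id in job["files"].items():
            if key == "model_file" and not include_model:
                continue
            archive.writestr("{}_{}".format(job_id, key), fs.get(file_id).read())
    buffer.seek(0)
    return buffer
```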

I don't know how you thought about implementing this, but I would store the entire file. Reason: if we start chunking files, we create new "parsers" and logic around what gets computed and then stored in the DB and what does not. Simply "copy-pasting" results sounds like the way to go to me.

The alignment_statistics.csv file is a convenience output, largely assembled from result items that are part of the output configuration by default (e.g. num_sites or num_effective_sequences). So there is no need to parse anything: these key-value pairs could go directly from the pipeline into the database without touching the csv file at all, which I find preferable. The file is more interesting for the output archive than for the webserver (the list above was meant to cover all available options).

I will research this.

My guess is that, again, the pml files are only interesting for the result archive, and that the input to NGL et al. is better (and more cleanly) generated on the fly from the actual data.

Btw, I would also verify very early on whether NGL supports all necessary visualization features (I'll create a separate issue for this) and plays nicely with React, or whether another viewer might be better suited.

Here ( #85 (comment) ) you talk about endpoints in https://github.com/debbiemarkslab/EVcouplings-server-backend , right?

Yes, to come soon.

@cccsander

Joining the discussion ....

@sacdallago
Member Author

From the discussion with @cccsander :

  • Save the configuration file and make it available for download (i.e. the job description, the python config file which is fed to the pipeline in order to produce the results for that run). This way users can easily re-run the job in the future. [I think we were storing this already.] Maybe even make it easily downloadable somewhere on the results page. [e.g. a download zip button + a download job description button --> one can download only the job description/config file and re-run]

@thomashopf
Contributor

Yes, fully agree; including it makes the settings 100% transparent and reproducible.

Being able to upload a config file would be a very nice feature too, but then this becomes an absolute input-validation nightmare, and at that point users should probably be using the pipeline locally.

@thomashopf
Contributor

API endpoints using these files defined here: debbiemarkslab/EVcouplings-server-backend#4

@sacdallago changed the title from "List of files to be stored in database for frontend consumption" to "List of files to be stored in database for webserver (front-/backend) consumption" on Nov 21, 2017
@sacdallago
Member Author

I feel like this is very, very much done, except that it's in #166
