
List of files to be stored in database for webserver (front-/backend) consumption #85

Closed
sacdallago opened this issue Nov 9, 2017 · 11 comments

@sacdallago
Member

sacdallago commented Nov 9, 2017

  • Master config file (job parameters)
  • subfoldbitscore/align/*_alignment_statistics.csv
  • *_job_statistics_summary.pdf --> Might be missing

Align

  • Master alignment statistics file (align\*_alignment_statistics.csv)
  • Sequences file (align\*.fa)
  • Alignment file (align\*.a2m)
  • Frequencies file (align\*_frequencies.csv)

Couplings

  • Couplings model file (couplings\*.model)
  • Enrichment file (couplings\*_enrichment_sausage)
  • Enrichment file (couplings\*_enrichment_sphere)
  • EV_Zoom file (couplings\*_evzoom.json)

if no compare:

  • Coupling scores file (couplings\*_CouplingScores.csv)

else:

  • Coupling scores file (compare\*_CouplingScoresCompared_all.csv)

Compare

  • Coupling scores file (compare\*_CouplingScoresCompared_longrange.csv)
  • Structure hits file (compare\*_structure_hits.csv)

Mutate

  • Mutation effect file (mutate\*_mutate_matrix.csv)

Fold

  • Sec structure file (fold\*_secondary_structure.csv)
  • Model ranking file (fold\*_ranking.csv)

if structures available:

  • Model ranking file (fold\*_comparison.csv) (has more cols)
@sacdallago
Member Author

@thomashopf : @b-schubert and I were wondering which files are necessary for the frontend. This is the list @b-schubert came up with. Anything to add or remove? The idea is to store these files somewhere after the pipeline is run (e.g. in a DB, see #84 ) and then have the backend fetch the data from there instead of fetching it from disk.
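To make the store-then-fetch idea a bit more concrete, here is a rough sketch of what it could look like with MongoDB/GridFS (pymongo assumed; job ids, keys, and paths below are purely illustrative):

```python
# Rough sketch only -- assumes pymongo/GridFS; job ids, keys and paths are illustrative.
import gridfs
from pymongo import MongoClient

db = MongoClient()["evcouplings"]
fs = gridfs.GridFS(db)

def store_result_file(job_id, key, path):
    """Put one pipeline output file into GridFS and index it under the job."""
    with open(path, "rb") as handle:
        file_id = fs.put(handle, filename=path, metadata={"job_id": job_id, "key": key})
    db.jobs.update_one(
        {"_id": job_id},
        {"$set": {"files." + key: file_id}},
        upsert=True,
    )

def fetch_result_file(job_id, key):
    """What the backend would do instead of reading from disk."""
    job = db.jobs.find_one({"_id": job_id})
    return fs.get(job["files"][key]).read()
```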

@kpgbrock : @b-schubert was unsure about the files to store from the fold stage. Can you help us out there?

Thanks :)

@thomashopf
Contributor

thomashopf commented Nov 9, 2017

I think the best way to list this is in terms of

  1. items that will compose what is on the website rather than stages, and
  2. keys in the output file

There will also be entries that are not files but just values that should be available for querying.

General job info (already available at submit time)

  • sequence_id
  • sequence_file
  • target_sequence_file

Sequence alignment (for alignment view)

  • alignment_file (for alignment viewer)
  • annotation_file (for displaying additional information for each sequence)
  • statistics_file (for job stats, and displaying alignment coverage)
  • frequencies_file (for sequence logo)

The following are in the statistics.csv file, but I would store them anyway because they are relevant for the job overview:

  • num_sequences
  • segments
  • effective_sequences
  • num_sites

ECs (for EC contact map view)

If no comparison available:

  • ec_file
  • ec_longrange_file

If comparison available:

  • ec_compared_file
  • ec_compared_longrange_file

And:

  • enrichment_file (for displaying enrichment table / visualizing enrichment on 3D structure)
  • evzoom_file (for EVzoom view)

I would be hesitant to put the model_file into the database, since we are often talking hundreds of megabytes per file (if we do store it, however, this would allow the user to quickly predict arbitrary mutations by supplying a list of mutations after the pipeline has run, which would be very nice)

Mutation effects (for mutation matrix view - epistatic and independent models)

  • mutation_matrix_file

Experimental PDB structure information

  • pdb_structure_hits_file (for showing which structures were found)
  • monomer_contacts_file (for displaying structure contacts on contact map without loading full distance map)
  • multimer_contacts_file (for displaying structure contacts on contact map without loading full distance map)
  • remapped_pdb_files (for showing ECs/enrichment/mutation effects on experimental structures)

Structure prediction

  • sec_struct_file (for showing predicted secondary structure on contact maps / mutation matrices)
  • folded_structure_files (for showing predicted structures using NGL, visualizing mutation effects, enrichment, ECs)
  • folding_ranking_file (for selecting blindly in which order to display structures)
  • folding_comparison_file (for showing how good models are, if there is experimental 3D structure)

Archive download

  • archive_file ... this actually needs serious discussion: we want to allow the user to download the full output archive, but just plonking it into the database feels wrong, and so does storing all the result files in the database and generating the archive on the fly

Probably no need to store pml files unless these can be easily fed into NGL (or the like) or transformed to be reused
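To illustrate (not a final schema), a single job entry in the database could roughly look like the following, with file-valued keys pointing at stored blobs/GridFS ids and plain values kept directly for querying:

```python
# Purely illustrative job document built from the keys listed above;
# "<file ref>" stands for a stored blob / GridFS id, other values are plain.
job_document = {
    # general job info (available at submit time)
    "sequence_id": "...",
    "sequence_file": "<file ref>",
    "target_sequence_file": "<file ref>",
    # sequence alignment
    "alignment_file": "<file ref>",
    "annotation_file": "<file ref>",
    "statistics_file": "<file ref>",
    "frequencies_file": "<file ref>",
    # plain values for the job overview
    "num_sequences": 0,
    "num_sites": 0,
    "effective_sequences": 0.0,
    "segments": [],
    # ECs (compared variants instead, if a comparison is available)
    "ec_file": "<file ref>",
    "ec_longrange_file": "<file ref>",
    "enrichment_file": "<file ref>",
    "evzoom_file": "<file ref>",
    # mutation effects
    "mutation_matrix_file": "<file ref>",
    # experimental PDB structure information
    "pdb_structure_hits_file": "<file ref>",
    "monomer_contacts_file": "<file ref>",
    "multimer_contacts_file": "<file ref>",
    "remapped_pdb_files": ["<file ref>"],
    # structure prediction
    "sec_struct_file": "<file ref>",
    "folded_structure_files": ["<file ref>"],
    "folding_ranking_file": "<file ref>",
    "folding_comparison_file": "<file ref>",
}
```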

@deboramarks

deboramarks commented Nov 9, 2017 via email

@thomashopf
Contributor

Since this is inherently related to the REST API endpoints, I'll see if I can get around to drafting a reasonable set of endpoints this weekend

@sacdallago
Member Author

sacdallago commented Nov 9, 2017

@thomashopf thanks for the list.

Just to reiterate the idea of doing this in the first place: @b-schubert said storing the computations takes up too much space + you would like to have some easy way of changing provider (where the data fed to the frontend is stored). I proposed storing only relevant files (for the frontend) in a database (possibly mongo, I totally agree with .4 in #84 (comment) ) for runs issued via the flask backend. Additionally, I would delete the big result file (the compressed archive) from the FS if the job is run from the web. This in turn means that the archive that a user running the job via the web will be able to download is much lighter than the one generated by running the computation locally.

So, in this sense:

I would be hesitant to put the model_file into the database, since we are often talking hundreds of megabytes per file (if we do store it, however, this would allow the user to quickly predict arbitrary mutations by supplying a list of mutations after the pipeline has run, which would be very nice)

I think the model file (from what I have understood) is quite important. Can this really be removed completely from a possible .zip download of a user running a job via the web?

archive_file ... this actually needs serious discussion: we want to allow the user to download the full output archive, but just plonking it into the database feels wrong, and so does storing all the result files in the database and generating the archive on the fly

Again, here it boils down to design. As far as I understood (as you mentioned a couple of months ago), the webpage is a stripped-down version of the complete pipeline. The advanced user will use the pipeline directly. It therefore makes sense to give the user downloading the .zip access only to the information that the webpage actually offers. This means: lower storage requirements for us + maybe more people interested in also testing, using, and advancing the complete pipeline. Alternatively, jobs issued via the web could optionally be marked as "complete", in which case the entire archive is stored and available for download. Opinions are needed here :) I won't make this call.


The following are in the statistics.csv file, but I would store them anyway because they are relevant for the job overview

I don't know how you thought about implementing this, but I would store the entire file. Reason: if we start chunking files, we create new "parsers" and logic around what gets computed and then stored in the DB and what does not. Simply "copy-pasting" results sounds like the way to go to me.

Probably no need to store pml files unless these can be easily fed into NGL (or the like) or transformed to be reused

I will research this.


Here ( #85 (comment) ) you talk about endpoints in https://github.com/debbiemarkslab/EVcouplings-server-backend , right?

@thomashopf
Contributor

thomashopf commented Nov 10, 2017

Just to reiterate the idea of doing this in the first place: @b-schubert said storing the computations takes up too much space + you would like to have some easy way of changing provider (where the data fed to the frontend is stored). I proposed storing only relevant files (for the frontend) in a database (possibly mongo, I totally agree with .4 in #84 (comment) ) for runs issued via the flask backend. Additionally, I would delete the big result file (the compressed archive) from the FS if the job is run from the web. This in turn means that the archive that a user running the job via the web will be able to download is much lighter than the one generated by running the computation locally.

Storage space, at least on Orchestra, is not much of an issue if one doesn't keep the results forever and actually enforces deletion (the current server setup doesn't, and no one bothered...) after e.g. 2 weeks with a cronjob. The pipeline also already has a delete setting in the management section that allows it to clean up after itself to minimize space usage; it is just turned off by default to allow reusing results during reruns. If one deleted raw_alignment_file and model_file, the biggest space offenders would already be gone.

Database space on Orchestra might be more of a limitation: the current server's relational database was at about 50 GB a while ago, and RITG was asking to bring down its size because the databases are on more expensive storage that is heavily backed up. (Side note: last time I checked there was no MongoDB or the like, at least not advertised openly, so I would check with RITG early on; I created a ticket in the backend repo for this.)

That being said, in terms of architecture I think it is much nicer to have a broker like a database, giving full freedom in decoupling the server backend from the computational pipeline; that alone justifies it.

Regarding the results archive:

  • What gets included in the result archive is a parameter in the configuration file for this very reason. So the server backend can request whatever it wants to be put into the archive; there is no need to delete things, and it could even request no archive at all.

  • The default selection is already a relevant selection of results based on years of experience in the lab with the server in mind (not a full dump of the output folder)

  • The archive will need to contain additional files that are not used by the server front end (e.g. PDF contact map plots)

So the trade-off is simply between options

  1. Just plonking the archive into the database and not worrying about it, wasting some space
  2. Generating the archive on the fly from result files stored in the database as single files, which would make it necessary to store a superset of the files needed for the server frontend (but not that many)

If one implements the list of items to store in the database as a configurable list like management.archive, this decision is moved entirely into the server backend, which would be nice. If one chooses 2), the choice of whether to include the model file could be made dynamically when the user selects to download the archive.
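Something like this is what I have in mind (a sketch only; the "database" list is a hypothetical counterpart to management.archive, and the outputs mapping stands for the pipeline's output keys and file paths):

```python
# Sketch only: a hypothetical "database" list next to management.archive,
# so the server backend fully controls which output items end up in the database.
config = {
    "management": {
        "archive": ["alignment_file", "statistics_file", "ec_file", "evzoom_file"],
        "database": ["alignment_file", "frequencies_file", "ec_file",
                     "evzoom_file", "mutation_matrix_file", "model_file"],
    }
}

def store_selected_outputs(job_id, outputs, config, store_file):
    """Store only the requested outputs; `outputs` maps output keys to file paths,
    and store_file is a function like the one sketched earlier in this thread."""
    for key in config["management"]["database"]:
        if key in outputs:
            store_file(job_id, key, outputs[key])
```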

I think the model file (from what I have understood) is quite important. Can this really be removed completely from a possible .zip download of a user running a job via the web?

Yes, it is a core piece computationally, but not a relevant download for 95% of server users, who would then get hassled with hundreds of MB of incompressible model (it also requires programming knowledge to use, at which point people should be able to run the pipeline). The main reason to put it in the database would be to allow users to dynamically predict mutation effects of their choice, if one fires off a Celery worker in the server backend for that computation.
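Roughly what I mean (sketch only; the broker URL, task name, and the prediction helper below are placeholders, and the exact evcouplings API for scoring a list of mutations would need to be checked):

```python
# Sketch of the "predict arbitrary mutations later" idea; everything below
# (broker URL, task name, run_mutation_prediction helper) is a placeholder.
import tempfile
from celery import Celery

app = Celery("evcouplings_server", broker="redis://localhost:6379/0")

@app.task
def predict_mutations(job_id, mutations):
    # pull the stored couplings model out of the database (see earlier sketch)
    model_bytes = fetch_result_file(job_id, "model_file")
    with tempfile.NamedTemporaryFile(suffix=".model") as tmp:
        tmp.write(model_bytes)
        tmp.flush()
        # placeholder for the actual pipeline call that scores the
        # user-supplied list of mutations against the model
        return run_mutation_prediction(tmp.name, mutations)
```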

Again, here it boils down to design. As far as I understood (as you mentioned a couple of months ago), the webpage is a stripped-down version of the complete pipeline. The advanced user will use the pipeline directly. It therefore makes sense to give the user downloading the .zip access only to the information that the webpage actually offers. This means: lower storage requirements for us + maybe more people interested in also testing, using, and advancing the complete pipeline. Alternatively, jobs issued via the web could optionally be marked as "complete", in which case the entire archive is stored and available for download. Opinions are needed here :) I won't make this call.

Simplified in what it takes to run the computation, plus intuitive visualization on the webserver. As I wrote above, the default selection of files in the archive is already a relevant selection of outputs; on the whole, this shouldn't be dumbed down (but also not blown up any further). If an intermediate user wants more detail in the output, the archive is where to look.

The two biggest files will almost always be the alignment and the model.

So I think the best strategy is the following (given that the set of things that ends up in the DB will be configurable, it is not much of a decision anyway):

  • Model file goes into the database. If we see that this blows up the database too much, it will have to go (which would kill the option to download it and predict mutations dynamically)
  • The default is that the model file is not in the result archive. If a user wants it, they have to request it explicitly. If we choose option 1) above, this means including model_file in management.archive in the configuration when submitting; if we choose 2), one could offer the choice using a radio box or two links (see the sketch below). Based on this consideration, I would lean towards 2) for flexibility and minimizing database size.
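A sketch of what 2) could look like on the backend side (db and fs as in the earlier GridFS sketch; whether to include the model file becomes a simple flag at download time):

```python
# Sketch of option 2): build the result archive on the fly from files stored
# in the database, with the model file included only on explicit request.
import io
import zipfile

def build_archive(job_id, include_model=False):
    job = db.jobs.find_one({"_id": job_id})  # db / fs as in the earlier sketch
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for key, file_id in job["files"].items():
            if key == "model_file" and not include_model:
                continue
            archive.writestr("{}_{}".format(job_id, key), fs.get(file_id).read())
    buffer.seek(0)
    return buffer
```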

I don't know how you thought about implementing this, but I would store the entire file. Reason: if we start chunking files, we create new "parsers" and logic around what gets computed and then stored in the DB and what does not. Simply "copy-pasting" results sounds like the way to go to me.

The alignment_statistics.csv file is a convenience output, largely assembled from result items that are part of the output configuration by default (e.g. num_sites or num_effective_sequences). So there is no need to parse anything: these key-value pairs could go directly from the pipeline into the database without touching the csv file at all, which I find preferable. The file is more interesting for the output archive than for the webserver (the list above was meant to cover all available options).

I will research this.

My guess is that, again, the pml files are only interesting for the result archive, and that the input to NGL et al. is better (and more cleanly) generated on the fly from the actual data.

Btw, I would also verify very early on whether NGL supports all necessary visualization features (I'll create a separate issue for this) and plays nicely with React, or whether another viewer might be better suited.

Here ( #85 (comment) ) you talk about endpoints in https://github.com/debbiemarkslab/EVcouplings-server-backend , right?

Yes, to come soon.

@cccsander

Joining the discussion ....

@sacdallago
Member Author

From the discussion with @cccsander :

  • Save the configuration file and make it available for download (i.e. the job description, the python config file which is fed to the pipeline in order to produce the results for that run). This way users can easily re-run the job in the future. [I think we were storing this already.] Maybe even make it easily downloadable somewhere on the results page. [e.g. a download zip button + a download job description button --> one can download only the job description/config file and re-run]

@thomashopf
Contributor

Yes, fully agree; including it makes the settings 100% transparent and reproducible.

Being able to upload a config file would be a very nice feature too, but then this becomes an absolute input-validation nightmare, and at that point users should probably be using the pipeline locally.

@thomashopf
Contributor

API endpoints using these files defined here: debbiemarkslab/EVcouplings-server-backend#4

@sacdallago changed the title from "List of files to be stored in database for frontend consumption" to "List of files to be stored in database for webserver (front-/backend) consumption" on Nov 21, 2017
@sacdallago
Member Author

I feel like this is very, very much done, except that it's in #166
