First steps and decision on the BaMMformat v1 #2

Open
5 tasks
croth1 opened this issue Mar 28, 2017 · 0 comments

Problems with one-model-one-json

My first suggestion was to encode every BaMM in a single JSON file. That approach has several disadvantages, which together made me reconsider:

  • Binary data has to be stored in an ASCII encoding, which requires sophisticated parser classes in every language we use.
  • Most JSON modules read the whole file into memory, and implementing lazy loading is cumbersome.
  • Large amounts of binary data could make the model too big to edit by hand, which defeats the purpose of a human-readable format.
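To make the first two points concrete, here is a minimal sketch. It assumes base64 as the ASCII encoding (the proposal does not name one), and the key `model_binary` and both helper functions are hypothetical, not part of any existing BaMM code.

```python
import base64
import json


def embed_binary(payload: bytes) -> str:
    """Return a JSON document carrying `payload` as base64 text.

    Assumption: base64 is the ASCII encoding; "model_binary" is a made-up key.
    """
    encoded = base64.b64encode(payload).decode("ascii")
    return json.dumps({"model_binary": encoded})


def extract_binary(document: str) -> bytes:
    """Recover the payload; json.loads parses the whole document first."""
    return base64.b64decode(json.loads(document)["model_binary"])


payload = bytes(range(256)) * 12  # stand-in for real binary model data
document = embed_binary(payload)
# The encoded document is larger than the raw payload.
print(len(payload), len(document))
```

Every language that touches the model would need an equivalent encode/decode pair, and the reader has to parse the full document before it can reach any single field.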

New proposal of the BaMM format

One BaMM is one folder, distributed as a zip file. A database of BaMM models is thus a zip archive containing multiple BaMM folders:

BaMM_database/
>info.json (command line call, time/date of call)

>Bamm1/

>> general.json
>> metadata.json

>> data/
>>> bg-model.json 
>>> peng.json
>>> fdr_analysis.json
>>> model_visualization.json
>>> ...

>> attachments/
>>> attachment1.file
>>> attachment2.file
>>> ...

>Bamm2/
>> general.json
...
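A rough sketch of how this layout could be created, zipped, and read back one file at a time, using only the Python standard library. The function names and the `bammformat_version` key are assumptions for illustration; nothing here is decided yet.

```python
import json
import zipfile
from pathlib import Path


def create_bamm_dir(db_root: Path, name: str, format_version: str = "1") -> Path:
    """Create an empty BaMM folder skeleton below db_root (hypothetical helper)."""
    model_dir = db_root / name
    (model_dir / "data").mkdir(parents=True, exist_ok=True)
    (model_dir / "attachments").mkdir(exist_ok=True)
    general = {"bammformat_version": format_version}  # assumed key name
    (model_dir / "general.json").write_text(json.dumps(general, indent=2))
    (model_dir / "metadata.json").write_text(json.dumps({}, indent=2))
    return model_dir


def zip_database(db_root: Path, archive: Path) -> None:
    """Pack every file below db_root into one zip archive.

    Only files are added, so empty data/ or attachments/ folders are not stored.
    """
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(db_root.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(db_root.parent).as_posix())


def read_member(archive: Path, member: str) -> dict:
    """Load one JSON member without extracting or parsing the rest."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as fh:
            return json.load(fh)
```

With many small files, `read_member(archive, "BaMM_database/Bamm1/general.json")` only decompresses that one member, which sidesteps the lazy-loading problem of a single large JSON file.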

general.json

A file that contains technical meta information about the model directory, e.g. the BaMMformat version. It also holds checksums for all files, so that we can write a validation script that checks whether a given directory is a valid BaMM model.
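The checksum idea could look roughly like the sketch below. SHA-256, the `checksums` and `bammformat_version` keys, and the function names are all assumptions. general.json is left out of its own checksum table, since a file cannot contain its own hash.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large attachments need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_general(model_dir: Path, format_version: str = "1") -> None:
    """Write general.json with a checksum for every other file (assumed layout)."""
    general_path = model_dir / "general.json"
    checksums = {
        p.relative_to(model_dir).as_posix(): sha256_of(p)
        for p in sorted(model_dir.rglob("*"))
        if p.is_file() and p != general_path
    }
    general = {"bammformat_version": format_version, "checksums": checksums}
    general_path.write_text(json.dumps(general, indent=2))


def validate(model_dir: Path) -> list:
    """Return a list of problems; an empty list means all checksums match."""
    general = json.loads((model_dir / "general.json").read_text())
    problems = []
    for rel, expected in general["checksums"].items():
        path = model_dir / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(path) != expected:
            problems.append(f"checksum mismatch: {rel}")
    return problems
```

One open design question this raises: if users hand-edit metadata.json, its checksum goes stale, so either the validation script offers a refresh step or user-editable files are excluded from the table.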

metadata.json

A file for storing arbitrary metadata in key-value format, for example the sequence set (e.g. ENCODE345235 or a GEO id), transcription factor accession, cell line / species, conditions, sequencing platform, experimental procedures, author name, contact, download link, publication, etc. Users can read and edit this file with an editor of their choice and add information.
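Scripts could edit the same file programmatically. The helper below is a hypothetical sketch that assumes a flat key-value JSON object and keeps the keys sorted so that the file stays pleasant to edit by hand.

```python
import json
from pathlib import Path


def update_metadata(model_dir: Path, **entries) -> dict:
    """Merge new key-value pairs into metadata.json, keeping existing keys."""
    path = model_dir / "metadata.json"
    metadata = json.loads(path.read_text()) if path.exists() else {}
    metadata.update(entries)
    # Indented, sorted output keeps hand edits and diffs readable.
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return metadata
```

For example, `update_metadata(model_dir, species="human", contact="someone@example.org")` adds two entries without touching whatever a user typed in earlier; the keys and values here are made up.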

data/

A folder that contains one JSON file for each tool that processed the BaMM model. Each file stores the tool's inputs and outputs, plus meta information such as its version.
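As a sketch of what a tool could do when it finishes, assuming one file per tool named after the tool and the field names `tool`, `version`, `inputs` and `outputs` (all hypothetical):

```python
import json
from pathlib import Path


def record_tool_run(model_dir: Path, tool: str, version: str,
                    inputs: dict, outputs: dict) -> Path:
    """Write data/<tool>.json describing one tool run on this model."""
    data_dir = model_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "tool": tool,
        "version": version,
        "inputs": inputs,
        "outputs": outputs,
    }
    target = data_dir / f"{tool}.json"
    target.write_text(json.dumps(record, indent=2))
    return target
```

Keeping each tool in its own file means a tool only rewrites its own record and never has to parse or preserve the output of the others.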

attachments/

A folder that stores additional files, e.g. binary representations of models, plots, etc.


Steps towards implementation

  • Implement Python classes for creating and manipulating BaMM models and databases.
  • Change our scripts so that they can read and write BaMM models.
  • Provide at least a rudimentary C++ API for reading and writing BaMM databases, since some C++ classes will have to read models directly (e.g. for scanning sequences).
  • Write a JS visualization tool for models (it can be used directly in our webserver).
  • Write a JSON schema and validation scripts for BaMM models.

Motif_occurrence_db/
>info.json (command line call, time/date of call)

> Occurrences_of_Bamm1
>> summary.json (number of occurrences, fraction of occ., score distribution)
>> table_of_occurrences.json (sequence_id, position, score/prob)

> Occurrences_of_Bamm2
>> summary.json (number of occurrences, fraction of occ., score distribution)
>> table_of_occurrences.json (sequence_id, position, score/prob)
...
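A sketch of how summary.json might be derived from the occurrence table. Several things here are assumptions rather than decisions: that each table row is an object with `sequence_id`, `position` and `score`, that "fraction of occ." means the fraction of input sequences with at least one occurrence, and that min / median / max is an acceptable stand-in for the score distribution.

```python
import statistics


def summarize(occurrences: list, n_sequences: int) -> dict:
    """Build a summary dict from table rows (field names are assumed)."""
    scores = [row["score"] for row in occurrences]
    hit_sequences = {row["sequence_id"] for row in occurrences}
    distribution = None
    if scores:
        distribution = {
            "min": min(scores),
            "median": statistics.median(scores),
            "max": max(scores),
        }
    return {
        "number_of_occurrences": len(occurrences),
        # Assumed meaning: share of sequences with at least one occurrence.
        "fraction_of_sequences_with_occurrence":
            len(hit_sequences) / n_sequences if n_sequences else 0.0,
        "score_distribution": distribution,
    }
```

Because the summary can be recomputed from the table, it is effectively a cache; storing it separately lets a viewer show the headline numbers without loading the full table.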