First steps and decision on the BaMMformat v1 #2

croth1 · 2017-03-28T15:24:22Z

Problems with one-model-one-json

My first suggestion was to encode every BaMM in one JSON file. This has several disadvantages, however, that together made me reconsider.

Binary files have to be stored in an ASCII encoding, which requires sophisticated parser classes in every language we use.
Most JSON modules read the whole file into memory, and implementing lazy loading is cumbersome.
Large amounts of binary data could make the model too big to edit by hand, which defeats the purpose of having a human-readable format.

New proposal of the BaMM format

One BaMM is one folder, distributed as a zip file. A database of BaMM models is thus a zip archive containing multiple BaMM folders:

BaMM_database/
>info.json (command line call, time/date of call)

>Bamm1/

>> general.json
>> metadata.json

>> data/
>>> bg-model.json 
>>> peng.json
>>> fdr_analysis.json
>>> model_visualization.json
>>> ...

>> attachments/
>>> attachment1.file
>>> attachment2.file
>>> ...

>Bamm2/
>> general.json
...
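The layout above could be created and packed roughly as follows. This is only a minimal sketch using the Python standard library: the function names and the `format_version` key are assumptions for illustration and not part of the proposal. It also shows that a single JSON member can be read from the zip without unpacking the rest, which addresses the lazy-loading concern.

```python
import json
import zipfile
from pathlib import Path


def create_model_dir(db_root: Path, name: str) -> Path:
    """Create the skeleton of one BaMM model folder inside a database folder."""
    model = db_root / name
    (model / "data").mkdir(parents=True, exist_ok=True)
    (model / "attachments").mkdir(exist_ok=True)
    # "format_version" is a hypothetical key name, used here only as a placeholder.
    (model / "general.json").write_text(json.dumps({"format_version": 1}))
    (model / "metadata.json").write_text(json.dumps({}))
    return model


def pack_database(db_root: Path, archive: Path) -> None:
    """Zip all files of a database folder; member names start with the folder name."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(db_root.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(db_root.parent).as_posix())


def read_member(archive: Path, member: str) -> dict:
    """Load one JSON member without extracting the other files in the archive."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as fh:
            return json.load(fh)
```

The archive should be written outside the database folder so that it is not picked up while packing. Empty folders are not stored by this sketch, because only files are added to the archive.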

general.json

A file that contains technical meta information about the model directory, e.g. the BaMMformat version. It also holds checksums for all files, so that we can write a validation script that checks whether a given directory is a valid BaMM model.
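A validation script along these lines might look like the sketch below. The choice of SHA-256, the key names `format_version` and `checksums`, and the function names are assumptions; the proposal does not fix any of them. general.json is left out of the checksum table because it cannot contain its own checksum.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large attachments need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_general(model_dir: Path, format_version: int = 1) -> None:
    """Write general.json with checksums of every other file in the model folder."""
    general_path = model_dir / "general.json"
    checksums = {
        p.relative_to(model_dir).as_posix(): file_sha256(p)
        for p in sorted(model_dir.rglob("*"))
        if p.is_file() and p != general_path
    }
    # Key names are hypothetical placeholders.
    general = {"format_version": format_version, "checksums": checksums}
    general_path.write_text(json.dumps(general, indent=2))


def validate_model(model_dir: Path) -> list:
    """Return a list of problems; an empty list means all recorded files match."""
    general = json.loads((model_dir / "general.json").read_text())
    problems = []
    for rel, expected in general["checksums"].items():
        path = model_dir / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif file_sha256(path) != expected:
            problems.append(f"checksum mismatch: {rel}")
    return problems
```

One consequence of this design is that hand edits to metadata.json invalidate the stored checksum, so the checksums would have to be refreshed after editing, or user-editable files would have to be excluded from the table.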

metadata.json

A file that can be used for storing arbitrary metadata in key-value format, e.g. sequence set (e.g. ENCODE345235 or a GEO id), transcription factor accession, cell line / species, conditions, sequencing platform, experimental procedures, author name, contact, download link, publication, etc. Users can read and edit this file with an editor of their choice and add information.
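Besides editing by hand, the same file could be updated from a script. The helper below is a small sketch; its name and the key names in the usage example are hypothetical, since the proposal allows arbitrary keys.

```python
import json
from pathlib import Path


def update_metadata(model_dir: Path, **entries) -> dict:
    """Merge key-value pairs into metadata.json and return the resulting mapping."""
    path = model_dir / "metadata.json"
    metadata = json.loads(path.read_text()) if path.exists() else {}
    metadata.update(entries)
    # Indented, sorted output keeps the file easy to read and edit by hand.
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return metadata
```

For example, `update_metadata(model_dir, sequence_set="ENCODE345235")` would add or overwrite a `sequence_set` entry and leave all other keys untouched.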

data/

A folder that contains one JSON file for each tool that processed the BaMM model. Each file stores the tool's inputs and outputs and meta information such as the tool version.
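A tool could record its run with a helper like the one sketched here. The record keys (`tool`, `version`, `inputs`, `outputs`, `recorded_at`) and the rule of naming the file after the tool are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_tool_run(model_dir: Path, tool: str, version: str,
                    inputs: dict, outputs: dict) -> Path:
    """Write data/<tool>.json describing one tool run on this model."""
    record = {
        "tool": tool,
        "version": version,
        "inputs": inputs,
        "outputs": outputs,
        # Timestamp in UTC, ISO 8601 format.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    data_dir = model_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / f"{tool}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

With one file per tool, rerunning a tool overwrites its previous record; keeping a history would need a different naming scheme.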

attachments/

A folder that stores additional files, e.g. binary representations of models, plots, etc.

Steps towards implementing

Implement Python classes for creating and manipulating BaMM models/databases.
Change our scripts so that they can read and write BaMM models.
Some C++ classes will have to read models directly (e.g. for scanning sequences), so we need at least a rudimentary C++ API for reading and writing BaMM databases.
Write a JS visualization tool for models (can be used directly in our webserver).
Write a JSON schema and validation scripts for BaMM models.

Motif_occurrence_db/
>info.json (command line call, time/date of call)

> Occurrences_of_Bamm1
>> summary.json (number of occurrences, fraction of occ., score distribution)
>> table_of_occurrences.json (sequence_id, position, score/prob)

> Occurrences_of_Bamm2
>> summary.json (number of occurrences, fraction of occ., score distribution)
>> table_of_occurrences.json (sequence_id, position, score/prob)
...
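The relation between the two files could be sketched as follows: summary.json is derived from the rows of table_of_occurrences.json. The key names are hypothetical, and "fraction of occ." is interpreted here as the fraction of sequences with at least one occurrence, which is an assumption. The score distribution is reduced to minimum, mean and maximum for brevity.

```python
from statistics import mean


def summarize_occurrences(table: list, n_sequences: int) -> dict:
    """Summarize rows of the form {"sequence_id", "position", "score"}."""
    scores = [row["score"] for row in table]
    hit_sequences = {row["sequence_id"] for row in table}
    return {
        "n_occurrences": len(table),
        # Assumed meaning: share of sequences that carry at least one occurrence.
        "fraction_of_sequences": len(hit_sequences) / n_sequences if n_sequences else 0.0,
        "score_min": min(scores) if scores else None,
        "score_mean": mean(scores) if scores else None,
        "score_max": max(scores) if scores else None,
    }
```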
