Setting Up an Experiment
Each time you run an experiment in DaCapo, you have to tell it where to store the output, where to find your data (DataSplit), how to configure the neural network structure (Architecture), what you want the network to learn (Task), and how you want the network to learn (Trainer).
DaCapo has the following data storage components:
- Loss stats: The loss per training iteration is stored, along with a few other statistics such as how long that iteration took to compute. These will be stored in the MongoDB if available.
- Validation scores: The scores computed at each validation step. These will be stored in the MongoDB if available.
- Validation volumes: The results of validation (images with presumptive organelles labeled) are stored in zarr datasets so you can visually inspect the best predictions on your validation data according to the validation metric of your choice. This data will be stored on disk.
- Checkpoints: Copies of your model are stored at various intervals during training. This lets you retrieve the best performing model according to the validation metric of your choice. This data will be stored on disk.
- Training Snapshots: Every n iterations (where n corresponds to the snapshot_interval defined in the Training configuration) a snapshot that includes the inputs and outputs of the model at that iteration is stored along with some extra results that can be very helpful for debugging. The saved arrays include: Ground Truth, Target (Ground Truth transformed by Task), Raw, Prediction, Gradient, and Weights (for modifying the loss). This data will be stored on disk.
- Configurations: To make runs easily reproducible, the configuration files used to execute experiments are saved. This way other people can use the exact same configuration files or change single parameters and get comparable results. This data will be stored in the MongoDB if available.
To define where this data goes, use a text editor to create a dacapo.yaml configuration file. Here is a template for the file:
```yaml
mongodbhost: mongodb://dbuser:dbpass@dburl:dbport/
mongodbname: dacapo
runs_base_dir: /path/to/my/data/storage
```
**runs_base_dir** defines where your on-disk data will be stored. **mongodbhost** and **mongodbname** define the MongoDB host and database that will store your cloud data. If you want to store everything on disk, replace the mongodbhost and mongodbname entries with a single `type: files` entry and everything will be saved to disk.
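For example, a minimal dacapo.yaml that keeps everything on disk would look like the following sketch (key names as described above; check your installed version's documentation):

```yaml
# dacapo.yaml -- all data stored on disk, no MongoDB
type: files
runs_base_dir: /path/to/my/data/storage
```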
Configuration files for your experiments are created in Python, typically using a Jupyter notebook. The first thing you need to do is import the dacapo and logging modules by running the following commands:
```python
import dacapo
import logging
```
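It is usually helpful to also turn on informational logging so you can follow training progress; this uses Python's standard logging module:

```python
# Print INFO-level messages (training progress, validation scores) to the console
logging.basicConfig(level=logging.INFO)
```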
Now you can create your configurations.
The DataSplit configuration file indicates where your data is stored, what format it is in, whether it needs to be normalized, and what data you want to use for training, validation, and testing.
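As a sketch of what this looks like, here is a DataSplit built from zarr arrays. The class names and module paths below come from dacapo.experiments at the time of writing and may differ in your version; the file paths and dataset names are placeholders:

```python
from pathlib import Path

from dacapo.experiments.datasplits import TrainValidateDataSplitConfig
from dacapo.experiments.datasplits.datasets import RawGTDatasetConfig
from dacapo.experiments.datasplits.datasets.arrays import ZarrArrayConfig

# Raw intensities and ground-truth labels, each a dataset inside a zarr container
raw_config = ZarrArrayConfig(
    name="example_raw",
    file_name=Path("/path/to/data.zarr"),
    dataset="volumes/raw",
)
gt_config = ZarrArrayConfig(
    name="example_gt",
    file_name=Path("/path/to/data.zarr"),
    dataset="volumes/labels",
)

# Pair raw data with ground truth, and assign pairs to training and validation
datasplit_config = TrainValidateDataSplitConfig(
    name="example_datasplit",
    train_configs=[
        RawGTDatasetConfig(
            name="example_train", raw_config=raw_config, gt_config=gt_config
        )
    ],
    validate_configs=[
        RawGTDatasetConfig(
            name="example_validate", raw_config=raw_config, gt_config=gt_config
        )
    ],
)
```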
The neural network architecture used is usually a UNet. You will need to specify parameters such as how much you want to downsample your image and how many layers you want in the neural network.
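Here is a sketch of a UNet configuration using the CNNectomeUNetConfig class that ships with DaCapo; the parameter values are illustrative, not recommendations:

```python
from funlib.geometry import Coordinate

from dacapo.experiments.architectures import CNNectomeUNetConfig

architecture_config = CNNectomeUNetConfig(
    name="example_unet",
    input_shape=Coordinate(132, 132, 132),  # input size in voxels
    fmaps_in=1,  # input channels (e.g. single-channel raw data)
    num_fmaps=12,  # feature maps at the highest resolution
    fmaps_out=72,  # feature maps passed to the prediction head
    fmap_inc_factor=6,  # feature-map growth per downsampling level
    downsample_factors=[(2, 2, 2), (2, 2, 2), (2, 2, 2)],  # three 2x downsampling layers
)
```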
The Task configuration defines what you want the network to learn and how that learning target is represented (for example, as signed distances or affinities).
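For example, a task that trains the network to predict signed distances to object boundaries could be sketched like this (DistanceTaskConfig is one of the task types in dacapo.experiments.tasks; the channel name and distance values are placeholders):

```python
from dacapo.experiments.tasks import DistanceTaskConfig

task_config = DistanceTaskConfig(
    name="example_distances",
    channels=["mito"],  # one output channel per label class
    clip_distance=40.0,  # clip predicted distances beyond this value
    tol_distance=40.0,  # tolerance band used when evaluating predictions
    scale_factor=80.0,  # scaling applied to distances before the final nonlinearity
)
```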
The Trainer configuration defines the training loop, what sort of Augmentations to apply during training, what learning rate and optimizer to use, and what batch size to train with.
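To close with a sketch of this last piece, here is a Trainer configuration using DaCapo's gunpowder-based trainer; the augment classes and parameter names reflect the API at the time of writing, and the values are illustrative:

```python
from dacapo.experiments.trainers import GunpowderTrainerConfig
from dacapo.experiments.trainers.gp_augments import (
    IntensityAugmentConfig,
    SimpleAugmentConfig,
)

trainer_config = GunpowderTrainerConfig(
    name="example_trainer",
    batch_size=2,
    learning_rate=0.0001,
    augments=[
        SimpleAugmentConfig(),  # random mirrors and axis transposes
        IntensityAugmentConfig(scale=(0.25, 1.75), shift=(-0.5, 0.35), clip=False),
    ],
    snapshot_interval=10000,  # save a training snapshot every 10,000 iterations
    min_masked=0.15,  # require a minimum fraction of labeled voxels per batch
)
```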