CMD usage instructions
All commands below are shown with a leading $ (the Linux prompt for a normal user). Do not type the $ when entering a command in the console.
This system has the following requirements for external data:
- The annotations are in VOC format.
- The paths of all images (collectively called assets or medias in this system) are written to a single index.tsv file, and all annotation files are located in the same directory.
- The user running the command line has read access to index.tsv, all image files, and all annotation files.
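The read-access requirement can be checked before importing. The following is a minimal sketch, not part of the mir tool: `check_readable` is a hypothetical helper name, and it assumes index.tsv holds one image path per line.

```shell
# check_readable INDEX_FILE
# Hypothetical helper (not part of mir): prints every path listed in
# INDEX_FILE that the current user cannot read, and returns 1 if any
# such path was found. Assumes one image path per line.
check_readable() {
    local index_file="$1" status=0 path
    if [ ! -r "$index_file" ]; then
        echo "unreadable: $index_file"
        return 1
    fi
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue      # skip blank lines
        if [ ! -r "$path" ]; then
            echo "unreadable: $path"
            status=1
        fi
    done < "$index_file"
    return "$status"
}
```

Calling `check_readable index.tsv` prints nothing and returns 0 when every listed file is readable.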
The following describes how to prepare an external dataset, using the Pascal VOC 2012 test dataset as an example. Download the dataset VOC2012test.tar from the official website and unpack it with the following command.
$ tar -xvf VOC2012test.tar
After unpacking, you get the following directory structure (assuming VOCdevkit is located in the /data directory):
/data/VOCdevkit
`-- VOC2012
    |-- Annotations
    |-- ImageSets
    |   |-- Action
    |   |-- Layout
    |   |-- Main
    |   `-- Segmentation
    `-- JPEGImages
All annotations are located in the Annotations directory and all images in the JPEGImages directory. Use the following command to generate the index.tsv file.
$ find /data/VOCdevkit/VOC2012/JPEGImages -type f > index.tsv
You can see the following content in index.tsv.
/data/VOCdevkit/VOC2012/JPEGImages/2009_001200.jpg
/data/VOCdevkit/VOC2012/JPEGImages/2009_004006.jpg
/data/VOCdevkit/VOC2012/JPEGImages/2008_006022.jpg
/data/VOCdevkit/VOC2012/JPEGImages/2008_006931.jpg
/data/VOCdevkit/VOC2012/JPEGImages/2009_003016.jpg
...
This index.tsv can be used for the next data import step.
Also, in the Annotations folder, each annotation file has the same base file name as its image. The xxx attribute in it will be extracted as a predefined keyword to be used for data filtering in a later step.
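Because each annotation shares its base file name with the image, images without an annotation can be found by comparing names. The following is a minimal sketch, assuming annotation files use the .xml extension (the usual VOC convention, not stated above). `list_unannotated` is a hypothetical helper, not a mir command.

```shell
# list_unannotated INDEX_FILE ANNOTATION_DIR
# Hypothetical helper: prints each image path from INDEX_FILE that has no
# ANNOTATION_DIR/<base name>.xml file. Assumes one image path per line.
list_unannotated() {
    local index_file="$1" anno_dir="$2" image name
    while IFS= read -r image || [ -n "$image" ]; do
        [ -z "$image" ] && continue     # skip blank lines
        name="$(basename "$image")"     # e.g. 2009_001200.jpg
        name="${name%.*}"               # e.g. 2009_001200
        if [ ! -f "$anno_dir/$name.xml" ]; then
            echo "$image"
        fi
    done < "$index_file"
}
```

For the example layout, `list_unannotated index.tsv /data/VOCdevkit/VOC2012/Annotations` prints nothing when every image has a matching annotation file.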
The command line of this system manages users' resources in a way similar to git: users create their own mir repository (mir repo) and perform all subsequent tasks in it.
To create a mir repo, run the following commands.
$ mkdir ~/mir-demo-repo && cd ~/mir-demo-repo # Create the directory and enter
$ mir init # Initialize this directory as a mir repo
$ mkdir ~/ymir-assets ~/ymir-models # Create a resource and model storage directory, all image resources will be stored in this directory and only references to these resources will be kept in the mir repo
Labels in the mir repo are managed in a unified way through the label file. Open the label file ~/mir-demo-repo/.mir/labels.yaml and you will see something like this:
labels:
- create_time: 1646728410.570311
  id: 0
  update_time: 1646728410.570311
  name: frisbee
- create_time: 1646728410.570311
  id: 1
  update_time: 1646728410.570311
  name: car
You can add your own labels, for example:
labels:
- create_time: 1646728410.570311
  id: 0
  update_time: 1646728410.570311
  name: frisbee
- create_time: 1646728410.570311
  id: 1
  update_time: 1646728410.570311
  name: car
- create_time: 1646728410.570311
  id: 2
  update_time: 1646728410.570311
  name: tv
A category label can have one or more aliases. For example, if television and tv_monitor are specified as aliases for tv, the labels.yaml file becomes:
labels:
- create_time: 1646728410.570311
  id: 0
  update_time: 1646728410.570311
  name: frisbee
- create_time: 1646728410.570311
  id: 1
  update_time: 1646728410.570311
  name: car
- create_time: 1646728410.570311
  id: 2
  update_time: 1646728410.570311
  name: tv
  aliases:
  - television
  - tv_monitor
This file can be edited using vi, or other editing tools. Users can add aliases to categories, or add new categories, but it is not recommended to change or delete the primary name and id of existing categories.
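After editing, it can help to confirm that no id or primary name was duplicated by accident. The following is a rough text-level sketch, assuming each key sits on its own line as in the listings above. It does not parse YAML, and `dup_label_fields` is a hypothetical helper name.

```shell
# dup_label_fields LABELS_FILE KEY
# Hypothetical helper: prints each value of KEY (for example id or name)
# that occurs more than once in LABELS_FILE. Line-based, not a YAML parser.
dup_label_fields() {
    awk -v key="$2:" '
        { sub(/^[ \t]*(-[ \t]+)?/, "") }    # drop indentation and a leading list dash
        $1 == key { sub(/^[^:]*:[ \t]*/, ""); print }
    ' "$1" | sort | uniq -d
}
```

Both `dup_label_fields ~/mir-demo-repo/.mir/labels.yaml id` and the same call with `name` should print nothing for a consistent file. Aliases are not checked by this sketch.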
The labels.yaml file can be shared among multiple mir repos by creating symbolic (soft) links.
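Sharing the label file through a link could look like the sketch below. It rests on assumptions: `share_labels` is a hypothetical helper, both arguments are directories already initialized with mir init, and an existing regular labels.yaml in the second repo is kept as labels.yaml.bak rather than overwritten.

```shell
# share_labels SOURCE_REPO TARGET_REPO
# Hypothetical helper: makes TARGET_REPO/.mir/labels.yaml a symbolic link to
# SOURCE_REPO/.mir/labels.yaml. Pass absolute paths: a relative link target
# would be resolved relative to TARGET_REPO/.mir.
share_labels() {
    local src="$1/.mir/labels.yaml" dst="$2/.mir/labels.yaml"
    if [ ! -f "$src" ]; then
        echo "no label file at $src" >&2
        return 1
    fi
    # Keep a copy of a regular (non-link) label file before replacing it.
    if [ -f "$dst" ] && [ ! -L "$dst" ]; then
        mv "$dst" "$dst.bak"
    fi
    ln -sf "$src" "$dst"
}
```

For example, `share_labels ~/mir-demo-repo ~/another-mir-repo` would link the label file of a second repo to the one in the demo repo, where ~/another-mir-repo is a made-up path.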
The user needs to prepare three datasets in advance:
- The training set dataset-training, with annotations, for initial model training.
- The validation set dataset-val, with annotations, for model validation during training.
- The mining set dataset-mining, a relatively large dataset to be mined.
The user imports these three datasets with the following commands.
$ cd ~/mir-demo-repo
# --index-file: path to the dataset's index.tsv
# --gt-dir: annotation directory
# --gen-dir: resource storage path
# --unknown-types-strategy: how to handle unknown categories; one of stop, ignore, add
# --anno-type: annotation type; one of det-box, seg-poly, seg-mask
# --dst-rev: resulting branch and operation task name
$ mir import --index-file /path/to/training-dataset-index.tsv \
--gt-dir /path/to/training-dataset-annotation-dir \
--gen-dir ~/ymir-assets \
--unknown-types-strategy stop \
--anno-type det-box \
--dst-rev 'dataset-training@import'
$ mir checkout master
$ mir import --index-file /path/to/val-dataset-index.tsv \
--gt-dir /path/to/val-dataset-annotation-dir \
--gen-dir ~/ymir-assets \
--unknown-types-strategy stop \
--anno-type det-box \
--dst-rev 'dataset-val@import'
$ mir checkout master
$ mir import --index-file /path/to/mining-dataset-index.tsv \
--gt-dir /path/to/mining-dataset-annotation-dir \
--gen-dir ~/ymir-assets \
--unknown-types-strategy stop \
--anno-type det-box \
--dst-rev 'dataset-mining@import'
- Note: By pointing the optional parameter --gt-dir to the ground truth directory and the optional parameter --pred-dir to the prediction directory, both the prediction and the ground truth results can be imported into the same dataset.
After all tasks have completed successfully, list the branches of the repo with the following command.
$ git branch
The user should now see that the repo has four branches: master, dataset-training, dataset-val, dataset-mining, and that the current repo is on branch dataset-mining.
The user can also view summary information for any of the branches with the following command.
$ mir show --src-rev dataset-mining
Training a model requires a training set and a validation set. Merge dataset-training and dataset-val into one dataset with the following command.
# --src-revs: branches to be merged; --dst-rev: resulting branch and operation task name
# -s host: policy, resolve conflicts in favor of the host branch
$ mir merge --src-revs 'tr:dataset-training@import;va:dataset-val@import' \
--dst-rev tr-va@merged \
-s host
After the merge is complete, you can see that the current repo is under the tr-va branch, and you can check the status of the merged branch with the mir show command.
$ mir show --src-revs HEAD # HEAD refers to the current branch, or you can use the specific branch name tr-va instead
Suppose dataset-training and dataset-val contain 2000 and 1510 images respectively before the merge; the merged branch then has 2000 images as the training set and 1510 images as the validation set. Suppose we only want to train a model that recognizes people and cats. We first filter this large dataset for the assets in which people or cats appear, as follows.
$ mir filter --src-revs tr-va@merged \
--dst-rev tr-va@filtered \
-p 'person;cat'
First pull the training image and the mining image from dockerhub.
$ docker pull youdaoyzbx/ymir-executor:ymir1.3.0-yolov5-cu111-tmi
Then start the training process with the following command.
# --media-location: resource storage path used when importing
# --model-location: path where the model is stored after training
# --task-config-file: training parameter configuration file, obtained from the training image
# --executor: training image
$ mir train -w /tmp/ymir/training/train-0 \
--media-location ~/ymir-assets \
--model-location ~/ymir-models \
--task-config-file ~/training-config.yaml \
--src-revs tr-va@filtered \
--dst-rev training-0@trained \
--executor youdaoyzbx/ymir-executor:ymir1.3.0-yolov5-cu111-tmi
After the model training is finished, the system will output the model id, and the user can see the packaged files of this trained model in ~/ymir-models.
During training, the training image generates several intermediate models. Their names are displayed when training completes, and YMIR records the set of all intermediate models generated during this training.
The model above was trained on a small dataset. Mining selects, from a large dataset, the assets that are most effective for training the next model. The user completes the mining process with the following command.
# --src-revs: imported mining branch
# --dst-rev: result branch of the mining task
# -w: temporary working directory for this task
# --topk: number of images in the mining result
# --model-hash: id of the model trained in the previous step, and the name of the intermediate model to use for inference
# --asset-cache-dir: resource cache
# --task-config-file: mining parameter configuration file, obtained from the mining image
# --executor: mining image
$ mir mining --src-revs dataset-mining@import \
--dst-rev mining-0@mining \
-w /tmp/ymir/mining/mining-0 \
--topk 200 \
--model-location ~/ymir-models \
--media-location ~/ymir-assets \
--model-hash <hash>@<inter-model-name> \
--asset-cache-dir /tmp/ymir/cache \
--task-config-file ~/mining-config.yaml \
--executor youdaoyzbx/ymir-executor:ymir1.3.0-yolov5-cu111-tmi
- Note: If the --add-prediction parameter is added to mir mining, the inference results of the model are also generated in the result dataset.
The system has now mined the 200 images that are most effective for model training and saved them in branch mining-0. The next task is to export these assets and hand them to annotators for annotation. The user can complete the export with the following commands.
# --asset-dir: asset export directory
# --pred-dir: annotation export directory
# --media-location: resource storage directory
# --asset-format raw: export raw images
# --anno-format none: do not export annotations
$ mir export --asset-dir /tmp/ymir/export/export-0/assets \
--pred-dir /tmp/ymir/export/export-0/annotations \
--media-location ~/ymir-assets \
--src-revs mining-0@mining \
--asset-format raw \
--anno-format none
$ find /tmp/ymir/export/export-0/assets -type f > /tmp/ymir/export/export-0/index.tsv
After the export finishes, the exported images are in /tmp/ymir/export/export-0/assets. Send these images to the annotators. The annotations must be saved in VOC format, and this example assumes they are saved to /tmp/ymir/export/export-0/annotations. Once annotation is complete, import the data with an import command similar to the one in section 4.2.2 (Create a local repo and import data):
# --gt-dir: annotation path
# --gen-dir: resource storage path
# --dst-rev: resulting branch and operation task name
$ mir import --index-file /tmp/ymir/export/export-0/index.tsv \
--gt-dir /tmp/ymir/export/export-0/annotations \
--gen-dir ~/ymir-assets \
--unknown-types-strategy stop \
--anno-type det-box \
--dst-rev 'labeled-0@import'
The branch labeled-0 now contains the 200 newly mined and annotated training images, which can be merged with the original training set using the merge command described earlier:
# --src-revs: branches to be merged; --dst-rev: resulting branch and operation task name
# -s host: policy, resolve conflicts in favor of the host branch
$ mir merge --src-revs 'tr-va@filtered;tr:labeled-0@import' \
--dst-rev tr-va-1@merged \
-s host
## 8 Training the second model
The branch tr-va-1 now contains the training and validation sets used in the previous training, plus the 200 new training images that were obtained through mining and then manually annotated. A new model can be trained on this set with the following command.
# -w: use a different working directory for each training and mining task
# --task-config-file: training parameter configuration file, obtained from the training image
# --src-revs: the merged branch
$ mir train -w /tmp/ymir/training/train-1 \
--media-location ~/ymir-assets \
--model-location ~/ymir-models \
--task-config-file ~/training-config.yaml \
--src-revs tr-va-1@merged \
--dst-rev training-1@trained \
--executor youdaoyzbx/ymir-executor:ymir1.3.0-yolov5-cu111-tmi
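The advice to use a different working directory for each training and mining task can be automated. The following is a minimal sketch, assuming mktemp is available. `new_workdir` is a hypothetical helper, and the layout under /tmp/ymir simply follows the examples above.

```shell
# new_workdir BASE_DIR PREFIX
# Hypothetical helper: creates a new, uniquely named directory
# BASE_DIR/PREFIX-<random suffix> and prints its path.
new_workdir() {
    mkdir -p "$1" && mktemp -d "$1/$2-XXXXXX"
}
```

The printed path, for example from `new_workdir /tmp/ymir/training train`, can then be passed to the -w option of mir train or mir mining.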