This is my implementation of the experiments in the FixMatch paper. I only implemented the experiments on the CIFAR-10 dataset, without CTAugment.
My platform is:
- 2080ti gpu
- ubuntu-16.04
- python3.6.9
- pytorch-1.3.1 installed via conda
- cudatoolkit-10.1.243
- cudnn-7.6.3 in /usr/lib/x86_64-linux-gnu
Download the CIFAR-10 dataset:
$ mkdir -p dataset && cd dataset
$ wget -c http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
$ tar -xzvf cifar-10-python.tar.gz
To train the model with 40 labeled samples, run the training script:
$ python train.py --n-labeled 40
where 40 is the number of labeled samples used during training.
After training the model 5 times with 40 labeled samples, using the command:
$ python train.py --n-labeled 40
I observed the following top-1 accuracies:
| No. | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| acc (%) | 91.81 | 91.29 | 89.51 | 91.32 | 79.42 |
Notes:

- There is no need to add interleave, since interleave is only used to avoid biased BN statistics. MixMatch uses interleave because it runs three separate forward passes, one per data batch; if you concatenate the three batches and run a single forward pass over the combined batch, the results should be the same (see the combined-forward sketch after these notes). You may refer to my implementation of MixMatch here, which does not use interleave and still achieves similar results.
- There are two ways to handle the buffers in the EMA update: one is to copy them directly into the EMA model, and the other is to apply EMA to the buffer states as well (a sketch of both options follows these notes). Generally speaking, there should not be a large gap between the two methods: in a ResNet, the buffers are the `running_mean`/`running_var` of the `nn.BatchNorm` layers, and during training these BN buffers are already updated with a moving average, which is the same kind of smoothing the EMA operator performs. The EMA operator estimates the expectation of the associated parameters by smoothing a series of values; averaging the most recent values of the series yields a less noisy parameter value, which can roughly be treated as the expectation. Copying the buffers directly corresponds to first-order smoothing, while applying EMA to them corresponds to second-order smoothing. In general, first-order smoothing already gives a good enough estimate, though second-order smoothing should be even less noisy.
- The approach based on naive random augmentation has a relatively large variance. If you leave the random seed free and generate the labeled split randomly each run, you may observe that the validation accuracy fluctuates within a wide range (see the seeded-split sketch below). In the paper, the authors use CTAugment, which introduces feedback into the data augmentation strategy and reduces this variance.
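
For the interleave note above, here is a minimal PyTorch sketch of the single combined forward pass. The names `model`, `x_labeled`, `x_weak`, and `x_strong` are placeholders, not the exact code of this repo; the point is only that every BN layer updates its running statistics once over all three batches, so no interleaving is needed.

```python
# Sketch only: one combined forward pass instead of three interleaved ones.
import torch

def forward_combined(model, x_labeled, x_weak, x_strong):
    # Concatenate the three batches so every BatchNorm layer sees a single,
    # mixed batch; its running statistics are updated once over all samples,
    # which is what interleaving tries to approximate.
    n_l, n_w = x_labeled.size(0), x_weak.size(0)
    x_all = torch.cat([x_labeled, x_weak, x_strong], dim=0)
    logits_all = model(x_all)

    # Split the logits back into the three groups.
    logits_labeled = logits_all[:n_l]
    logits_weak = logits_all[n_l:n_l + n_w]
    logits_strong = logits_all[n_l + n_w:]
    return logits_labeled, logits_weak, logits_strong
```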
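For the EMA note, this is a minimal sketch (again with placeholder names, not necessarily matching this repo's EMA implementation) showing the two ways to treat the buffers: direct copy (first-order smoothing) versus applying EMA to them as well (second-order smoothing).

```python
# Sketch only: EMA update with two options for the BN buffers.
import torch

@torch.no_grad()
def ema_update(ema_model, model, alpha=0.999, ema_buffers=False):
    # Parameters are always smoothed: ema = alpha * ema + (1 - alpha) * new.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(alpha).add_(p, alpha=1 - alpha)

    # Buffers (e.g. running_mean/running_var of nn.BatchNorm2d):
    for b_ema, b in zip(ema_model.buffers(), model.buffers()):
        if ema_buffers and b.dtype.is_floating_point:
            # second-order smoothing: apply EMA on top of the BN running averages
            b_ema.mul_(alpha).add_(b, alpha=1 - alpha)
        else:
            # first-order smoothing: copy the buffer state directly
            # (also handles integer buffers such as num_batches_tracked)
            b_ema.copy_(b)
```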
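For the variance note, a small sketch of a seeded, class-balanced labeled split; `sample_labeled_indices` is a hypothetical helper, not part of this repo's API. Fixing the seed keeps the 40-label split identical across runs, so the remaining run-to-run variance comes from training noise rather than from which samples were picked.

```python
# Sketch only: draw a fixed, class-balanced labeled subset.
import numpy as np

def sample_labeled_indices(labels, n_labeled, n_classes=10, seed=123):
    rng = np.random.RandomState(seed)
    per_class = n_labeled // n_classes
    labels = np.asarray(labels)
    picked = []
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        picked.extend(rng.choice(idx, per_class, replace=False))
    return picked
```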