This repository implements the code from the paper. The architecture and parameters are the same as in the paper; the only difference is that I trained the model on 100 classes instead of 1000, because training on all 1000 classes would take too much time.
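How the 100 classes were chosen is not recorded here, so the snippet below is only a minimal sketch of one way to cut a 1000-class label set down to 100 classes. The function name `select_class_subset`, the random selection, and the `seed` parameter are all assumptions for illustration, not the code actually used for training.

```python
import random


def select_class_subset(labels, num_keep=100, seed=0):
    """Keep samples from `num_keep` randomly chosen classes.

    Hypothetical helper: returns the indices of the kept samples, their
    labels remapped to 0..num_keep-1, and the sorted list of kept classes.
    """
    classes = sorted(set(labels))
    if num_keep > len(classes):
        raise ValueError("num_keep exceeds the number of available classes")

    # Fixed seed so the same subset is selected on every run.
    rng = random.Random(seed)
    kept = sorted(rng.sample(classes, num_keep))

    # Map original class ids to contiguous ids, as a classifier head expects.
    remap = {c: i for i, c in enumerate(kept)}

    indices = [i for i, y in enumerate(labels) if y in remap]
    new_labels = [remap[labels[i]] for i in indices]
    return indices, new_labels, kept


# Toy label list standing in for a 1000-class dataset (3 samples per class).
labels = [c for c in range(1000) for _ in range(3)]
indices, new_labels, kept = select_class_subset(labels, num_keep=100)
```

The returned `indices` can be used to filter the image list, and the final classification layer then needs 100 outputs instead of 1000.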
The accuracy and loss are shown below:
The feature visualization results are shown below: