training doesn't proceed past epoch 1 on GPU workstation #755

panichem · 2022-05-18T23:53:48Z

panichem
May 18, 2022

I'm working through the SLEAP tutorial on a PC with a decent GPU. The initial training step is taking a long time, though, and never seems to get past epoch 1.

Here's the dump from the terminal. I can't really see anything weird in it. Any idea what I need to do differently?

(sleap) C:\Users\moorelab>sleap-label
Saving config: C:\Users\moorelab/.sleap/1.2.3/preferences.yaml
Restoring GUI state...

Software versions:
SLEAP: 1.2.3
TensorFlow: 2.6.3
Numpy: 1.19.5
Python: 3.7.12
OS: Windows-10-10.0.19041-SP0

Happy SLEAPing! :)
Resetting monitor window.
Polling: C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation\models\220516_200359.single_instance.n=26\viz\validation.*.png
Start training single_instance...
['sleap-train', 'C:\\Users\\moorelab\\AppData\\Local\\Temp\\tmpgdzdwt6i\\220516_200359_training_job.json', 'C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation/JoplinLick.v000.slp', '--zmq', '--save_viz']
INFO:sleap.nn.training:Versions:
SLEAP: 1.2.3
TensorFlow: 2.6.3
Numpy: 1.19.5
Python: 3.7.12
OS: Windows-10-10.0.19041-SP0
INFO:sleap.nn.training:Training labels file: C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation/JoplinLick.v000.slp
INFO:sleap.nn.training:Training profile: C:\Users\moorelab\AppData\Local\Temp\tmpgdzdwt6i\220516_200359_training_job.json
INFO:sleap.nn.training:
INFO:sleap.nn.training:Arguments:
INFO:sleap.nn.training:{
    "training_job_path": "C:\\Users\\moorelab\\AppData\\Local\\Temp\\tmpgdzdwt6i\\220516_200359_training_job.json",
    "labels_path": "C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation/JoplinLick.v000.slp",
    "video_paths": [
        ""
    ],
    "val_labels": null,
    "test_labels": null,
    "tensorboard": false,
    "save_viz": true,
    "zmq": true,
    "run_name": "",
    "prefix": "",
    "suffix": "",
    "cpu": false,
    "first_gpu": false,
    "last_gpu": false,
    "gpu": 0
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Training job:
INFO:sleap.nn.training:{
    "data": {
        "labels": {
            "training_labels": null,
            "validation_labels": null,
            "validation_fraction": 0.1,
            "test_labels": null,
            "split_by_inds": false,
            "training_inds": null,
            "validation_inds": null,
            "test_inds": null,
            "search_path_hints": [],
            "skeletons": []
        },
        "preprocessing": {
            "ensure_rgb": false,
            "ensure_grayscale": false,
            "imagenet_mode": null,
            "input_scaling": 1.0,
            "pad_to_stride": null,
            "resize_and_pad_to_target": true,
            "target_height": null,
            "target_width": null
        },
        "instance_cropping": {
            "center_on_part": null,
            "crop_size": null,
            "crop_size_detection_padding": 16
        }
    },
    "model": {
        "backbone": {
            "leap": null,
            "unet": {
                "stem_stride": null,
                "max_stride": 16,
                "output_stride": 2,
                "filters": 16,
                "filters_rate": 2.0,
                "middle_block": true,
                "up_interpolate": true,
                "stacks": 1
            },
            "hourglass": null,
            "resnet": null,
            "pretrained_encoder": null
        },
        "heads": {
            "single_instance": {
                "part_names": null,
                "sigma": 2.5,
                "output_stride": 2,
                "loss_weight": 1.0,
                "offset_refinement": false
            },
            "centroid": null,
            "centered_instance": null,
            "multi_instance": null,
            "multi_class_bottomup": null,
            "multi_class_topdown": null
        }
    },
    "optimization": {
        "preload_data": true,
        "augmentation_config": {
            "rotate": true,
            "rotation_min_angle": -15.0,
            "rotation_max_angle": 15.0,
            "translate": false,
            "translate_min": -5,
            "translate_max": 5,
            "scale": false,
            "scale_min": 0.9,
            "scale_max": 1.1,
            "uniform_noise": false,
            "uniform_noise_min_val": 0.0,
            "uniform_noise_max_val": 10.0,
            "gaussian_noise": false,
            "gaussian_noise_mean": 5.0,
            "gaussian_noise_stddev": 1.0,
            "contrast": false,
            "contrast_min_gamma": 0.5,
            "contrast_max_gamma": 2.0,
            "brightness": false,
            "brightness_min_val": 0.0,
            "brightness_max_val": 10.0,
            "random_crop": false,
            "random_crop_height": 256,
            "random_crop_width": 256,
            "random_flip": false,
            "flip_horizontal": true
        },
        "online_shuffling": true,
        "shuffle_buffer_size": 128,
        "prefetch": true,
        "batch_size": 4,
        "batches_per_epoch": null,
        "min_batches_per_epoch": 200,
        "val_batches_per_epoch": null,
        "min_val_batches_per_epoch": 10,
        "epochs": 200,
        "optimizer": "adam",
        "initial_learning_rate": 0.0001,
        "learning_rate_schedule": {
            "reduce_on_plateau": true,
            "reduction_factor": 0.5,
            "plateau_min_delta": 1e-06,
            "plateau_patience": 5,
            "plateau_cooldown": 3,
            "min_learning_rate": 1e-08
        },
        "hard_keypoint_mining": {
            "online_mining": false,
            "hard_to_easy_ratio": 2.0,
            "min_hard_keypoints": 2,
            "max_hard_keypoints": null,
            "loss_scale": 5.0
        },
        "early_stopping": {
            "stop_training_on_plateau": true,
            "plateau_min_delta": 1e-08,
            "plateau_patience": 10
        }
    },
    "outputs": {
        "save_outputs": true,
        "run_name": "220516_200359.single_instance.n=26",
        "run_name_prefix": "",
        "run_name_suffix": "",
        "runs_folder": "C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation\\models",
        "tags": [
            ""
        ],
        "save_visualizations": true,
        "delete_viz_images": true,
        "zip_outputs": false,
        "log_to_csv": true,
        "checkpointing": {
            "initial_model": false,
            "best_model": true,
            "every_epoch": false,
            "latest_model": false,
            "final_model": false
        },
        "tensorboard": {
            "write_logs": false,
            "loss_frequency": "epoch",
            "architecture_graph": false,
            "profile_graph": false,
            "visualizations": true
        },
        "zmq": {
            "subscribe_to_controller": true,
            "controller_address": "tcp://127.0.0.1:9000",
            "controller_polling_timeout": 10,
            "publish_updates": true,
            "publish_address": "tcp://127.0.0.1:9001"
        }
    },
    "name": "",
    "description": "",
    "sleap_version": "1.2.3",
    "filename": "C:\\Users\\moorelab\\AppData\\Local\\Temp\\tmpgdzdwt6i\\220516_200359_training_job.json"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Using GPU 0 for acceleration.
INFO:sleap.nn.training:Disabled GPU memory pre-allocation.
INFO:sleap.nn.training:System:
GPUs: 1/1 available
  Device: /physical_device:GPU:0
         Available: True
        Initalized: False
     Memory growth: True
INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
INFO:sleap.nn.training:Loading training labels from: C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation/JoplinLick.v000.slp
INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1
INFO:sleap.nn.training:  Splits: Training = 23 / Validation = 3.
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
2022-05-16 20:04:03.657263: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-16 20:04:04.078266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5979 MB memory:  -> device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5
2022-05-16 20:04:04.576879: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
INFO:sleap.nn.training:Loaded test example. [2.101s]
INFO:sleap.nn.training:  Input shape: (1088, 1920, 3)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training:  Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training:  Max stride: 16
INFO:sleap.nn.training:  Parameters: 1,953,624
INFO:sleap.nn.training:  Heads:
INFO:sleap.nn.training:    [0] = SingleInstanceConfmapsHead(part_names=['mouth', 'eyes', 'chest', 'l_hand', 'r_hand', 'l_foot', 'r_foot', 'tongue'], sigma=2.5, output_stride=2, loss_weight=1.0)
INFO:sleap.nn.training:  Outputs:
INFO:sleap.nn.training:    [0] = KerasTensor(type_spec=TensorSpec(shape=(None, 544, 960, 8), dtype=tf.float32, name=None), name='SingleInstanceConfmapsHead/BiasAdd:0', description="created by layer 'SingleInstanceConfmapsHead'")
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 23
INFO:sleap.nn.training:Validation set: n = 3
INFO:sleap.nn.training:Setting up optimization...
INFO:sleap.nn.training:  Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:sleap.nn.training:  Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-08, plateau_patience=10)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: )
INFO:sleap.nn.training:  ZMQ controller subcribed to: tcp://127.0.0.1:9000
INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set
INFO:sleap.nn.training:  ZMQ progress reporter publish on: tcp://127.0.0.1:9001
INFO:sleap.nn.training:Created run path: C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation\models\220516_200359.single_instance.n=26
INFO:sleap.nn.training:Setting up visualization...
INFO:sleap.nn.training:Finished trainer set up. [2.7s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [5.7s]
INFO:sleap.nn.training:Starting training loop...
Epoch 1/200
2022-05-16 20:04:14.712217: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8201
Saving config: C:\Users\moorelab/.sleap/1.2.3/preferences.yaml

Answered by panichem (May 19, 2022).

talmo · 2022-05-19T06:22:54Z

talmo
May 19, 2022
Maintainer

Hey @panichem!

Are you able to use the sample dataset from the tutorial?

If so, or if you just want to give it a quick try, does training a bottom-up multi-animal model work? (This works for single animals as well.)

Give those a spin and if neither works, do you mind sharing the video + .slp file with talmo@salk.edu?

Talmo


roomrys · 2022-05-19T17:16:34Z

roomrys
May 19, 2022
Maintainer

Hi @panichem,

I also came across this when I created a model whose receptive field (RF) was small relative to the overall frame size. You could try lowering the input scaling to ~0.5 (which increases the RF size relative to the image) and see how that affects the first-epoch training time. Please let us know if any of these solutions work.

Thanks,
Liezl
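
To make the suggestion concrete, here is a rough sketch of the shape arithmetic, using the numbers from the log above (input 1088 x 1920, max stride 16, output stride 2, 8 body parts). The helper `confmap_shape` is hypothetical, not part of SLEAP, and it assumes the frame is scaled first and then padded up to a multiple of the max stride; SLEAP's actual preprocessing may differ in detail.

```python
def confmap_shape(height, width, input_scaling, max_stride, output_stride, n_parts):
    """Estimate the confidence-map shape for one frame (hypothetical helper).

    Assumption: the frame is scaled by input_scaling, then padded up to a
    multiple of max_stride, and the head emits maps at 1/output_stride.
    """
    def scale_and_pad(size):
        scaled = int(round(size * input_scaling))
        # Ceiling division, then back up to a multiple of max_stride.
        return -(-scaled // max_stride) * max_stride

    h, w = scale_and_pad(height), scale_and_pad(width)
    return (h // output_stride, w // output_stride, n_parts)


full = confmap_shape(1088, 1920, 1.0, 16, 2, 8)
half = confmap_shape(1088, 1920, 0.5, 16, 2, 8)
print(full)  # (544, 960, 8), matching the logged output shape
print(half)  # (272, 480, 8)

# Ratio of confidence-map pixels per frame between the two settings.
ratio = (full[0] * full[1]) / (half[0] * half[1])
print(ratio)  # 4.0
```

At a scaling of 0.5 the network processes a quarter of the pixels per frame, and a receptive field of fixed size in network pixels covers twice as much of the original image along each side.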


panichem · 2022-05-19T18:58:06Z

panichem
May 19, 2022
Author

@talmo @roomrys - I switched to a bottom-up model and lowered the input scaling to 0.5, and now the first ~10 epochs are done in a few minutes. Thanks for your help!!
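
For anyone who wants to apply the same change outside the GUI, a minimal sketch follows. It assumes the training profile is a JSON file laid out like the "Training job" dump in the original post, where the setting lives at `data.preprocessing.input_scaling`. The function name `set_input_scaling` and the file names are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def set_input_scaling(profile_path, out_path, scale):
    """Write a copy of a training profile with a new input_scaling value."""
    cfg = json.loads(Path(profile_path).read_text())
    cfg["data"]["preprocessing"]["input_scaling"] = float(scale)
    Path(out_path).write_text(json.dumps(cfg, indent=4))
    return cfg


# Demo on a stripped-down stand-in profile (not a complete SLEAP config).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "profile.json"
    dst = Path(tmp) / "profile_half_scale.json"
    src.write_text(json.dumps({"data": {"preprocessing": {"input_scaling": 1.0}}}))

    set_input_scaling(src, dst, 0.5)
    reloaded = json.loads(dst.read_text())

print(reloaded["data"]["preprocessing"]["input_scaling"])  # 0.5
```

The edited file could then be passed to `sleap-train` together with the `.slp` labels file, in the same argument order as the command shown in the log.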


roomrys · 2022-05-20T16:55:47Z

roomrys
May 20, 2022
Maintainer

Marking this as a TODO: there is a workaround, but we still need to find the root cause (and prevent it from happening).


talmo · 2022-05-20T17:16:25Z

talmo
May 20, 2022
Maintainer

The fact that there weren't any errors and that training didn't even start makes me think it's a TensorFlow deadlock.

We've run into this in the past (see attempted fixes in 613c201 and 492b67b). I think it's related to how we use tf.py_function -- there's a thread about it over in tensorflow/tensorflow#32454, but no solution.

In the past I've had a hard time reliably reproducing this -- it seems to be stochastic and maybe system-dependent -- so maybe let's just close this for now and revisit it if more people are having the same problem.

Also moving this to Discussions so folks see it when asking questions.

Thanks for the report @panichem!
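
Since the hang produces no error, one way to at least detect it automatically is to watch the training process's output and give up if nothing new is printed for a while. The sketch below is only an illustration using the Python standard library; `run_with_stall_timeout` is a hypothetical helper, not a SLEAP or TensorFlow feature, and it does not address the underlying deadlock.

```python
import queue
import subprocess
import sys
import threading


def run_with_stall_timeout(cmd, stall_seconds):
    """Run cmd; kill it if it prints nothing for stall_seconds.

    Returns (status, lines), where status is "finished" or "stalled".
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = queue.Queue()

    def pump():
        # Forward each output line; None marks end of stream.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    output = []
    while True:
        try:
            line = lines.get(timeout=stall_seconds)
        except queue.Empty:
            # No output within the limit: treat the process as hung.
            proc.kill()
            proc.wait()
            return "stalled", output
        if line is None:
            proc.wait()
            return "finished", output
        output.append(line.rstrip("\n"))


# Stand-in for a hung training run: prints one line, then goes silent.
fake_hang = [
    sys.executable,
    "-c",
    "import time; print('Epoch 1/200', flush=True); time.sleep(60)",
]
status, seen = run_with_stall_timeout(fake_hang, stall_seconds=2)
print(status)  # stalled
```

In practice `cmd` would be the `sleap-train` command from the log, with a stall limit well above the time a normal epoch takes on the machine.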
