Training issues with custom dataset (Missing) and learning rate fluctuations #11804

Open
Warcry25 opened this issue Jun 20, 2024 · 12 comments
@Warcry25

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug

  1. I trained RetinaNet on my own custom dataset. I have 600 images in the train set, but during training only 300 are used and the other 300 are missing. I can't figure out the cause.

  2. Training in the early epochs looks off: mmengine - ERROR - /content/mmdetection/mmdet/evaluation/metrics/coco_metric.py - compute_metrics - 465 - The testing results of the whole dataset is empty. From the 5th epoch onwards it is fine. The learning rate also fluctuates. Everything can be seen in the log file below.
    20240619_183332.log

@Warcry25
Author

@RangiLyu @MiXaiLL76 I need help with this as soon as possible, please. Thank you.

@MiXaiLL76

MiXaiLL76 commented Jun 21, 2024

compute_metrics - 465 - The testing results of the whole dataset is empty.

This is normal, it means that your model did not predict a single bbox!)

learning rate fluctuations

In your setup:

param_scheduler = [
    dict(
        begin=0, by_epoch=False, end=500, start_factor=0.001, type='LinearLR'),
    dict(
        begin=0,
        by_epoch=True,
        end=12,
        gamma=0.1,
        milestones=[
            8,
            11,
        ],
        type='MultiStepLR'),
]

This means that the LR starts low (0.001 of the base value), ramps up linearly, and reaches its normal value once the 500 warmup iterations finish. It is then multiplied by 0.1 at each MultiStepLR milestone, epochs 8 and 11.
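To make the shape of this schedule concrete, here is a minimal sketch that combines the two entries of `param_scheduler` into one function. It is a simplified closed form, not MMEngine's actual implementation, and the exact values near the end of warmup may differ slightly. The function name `lr_at`, the base LR of 0.01, and the 300 iterations per epoch (600 images at batch size 2) are assumptions for illustration.

```python
from bisect import bisect_right


def lr_at(iteration, iters_per_epoch, base_lr,
          warmup_iters=500, start_factor=0.001,
          milestones=(8, 11), gamma=0.1):
    """Approximate LR at a global iteration (hypothetical helper)."""
    # LinearLR warmup: the factor ramps from start_factor to 1.0
    # over the first warmup_iters iterations, then stays at 1.0.
    progress = min(iteration, warmup_iters) / warmup_iters
    warmup = start_factor + (1.0 - start_factor) * progress
    # MultiStepLR: multiply by gamma once for every milestone epoch
    # that has already been reached (epochs counted from 0).
    epoch = iteration // iters_per_epoch
    decay = gamma ** bisect_right(milestones, epoch)
    return base_lr * warmup * decay


if __name__ == "__main__":
    BASE_LR = 0.01          # assumed value, not taken from the config above
    ITERS_PER_EPOCH = 300   # assumed: 600 images / batch size 2
    for epoch in (0, 1, 2, 7, 8, 11):
        it = epoch * ITERS_PER_EPOCH
        lr = lr_at(it, ITERS_PER_EPOCH, BASE_LR)
        print(f"epoch {epoch:2d} (iter {it:4d}): lr = {lr:.6f}")
```

Under these assumptions the warmup ends during the second epoch, so a changing LR in the first log lines is expected behaviour rather than a bug.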

why did you call me?) I am not the developer of this library)

@Warcry25
Author

Warcry25 commented Jun 21, 2024

@MiXaiLL76 I read in a previous issue [ #2942 ] that reducing the learning rate by 10x would resolve the issue, but it didn't. I'm also not sure why the training images are cut by half: I have 600 training images in the dataset, but only 300 are trained.

I need someone to help me as soon as possible, since I have no idea how to resolve this and the developers are busy. That's why I tagged you.

@MiXaiLL76

There may be a problem with the data. Send it to me by email and I can quickly train the model and check.

mike.milos@yandex.ru

@Warcry25
Author

@MiXaiLL76 I have emailed you. Did you receive it?

@MiXaiLL76

Hello! Yes, I received your message. I'll take a look today when I'm free from work.

@MiXaiLL76

pipeline.zip
Here is an example of my result using your data.

+----------+-------+--------+--------+-------+-------+-------+
| category | mAP   | mAP_50 | mAP_75 | mAP_s | mAP_m | mAP_l |
+----------+-------+--------+--------+-------+-------+-------+
| SMD      | 0.544 | 0.835  | 0.643  | 0.0   | 0.486 | 0.719 |
| THP      | 0.605 | 0.854  | 0.652  | nan   | 0.371 | 0.687 |
+----------+-------+--------+--------+-------+-------+-------+

@Warcry25
Author

What was the issue in my config file that caused half of the train set to go missing?

@MiXaiLL76

Your setup:

train_dataloader = dict(
    batch_sampler=dict(type='AspectRatioBatchSampler'),
    batch_size=2,

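total images = 600
batch size = 2

600 / 2 = 300 iterations per epoch. The 300 in the log counts batches, not images, so all 600 images are used.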
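The same arithmetic as a small sketch, in case the dataset size or batch size changes. The helper `iterations_per_epoch` is hypothetical (it is not part of MMDetection), it assumes a single GPU by default, and it ignores the aspect-ratio grouping done by `AspectRatioBatchSampler`, which can shift the count slightly when the groups do not divide evenly.

```python
import math


def iterations_per_epoch(num_images, batch_size, num_gpus=1, drop_last=False):
    """Number of training steps needed to see every image once (hypothetical helper)."""
    # Each step consumes batch_size images on every GPU.
    images_per_step = batch_size * num_gpus
    if drop_last:
        # An incomplete final batch is discarded.
        return num_images // images_per_step
    # An incomplete final batch still counts as one step.
    return math.ceil(num_images / images_per_step)


if __name__ == "__main__":
    # The case from this thread: 600 training images, batch_size=2.
    print(iterations_per_epoch(600, 2))
```

If the progress counter in the training log matches this number, no images are being dropped.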

@Warcry25
Author

Warcry25 commented Jun 27, 2024

@MiXaiLL76 I have a question. What does it mean when the bbox_mAP, bbox_mAP_50, and bbox_mAP_75 values are the same as AP at IoU=0.50:0.95, IoU=0.50, and IoU=0.75 respectively?
Model Train Log.txt

@MiXaiLL76

These are the same metrics under different names: bbox_mAP is AP at IoU=0.50:0.95, bbox_mAP_50 is AP at IoU=0.50, and bbox_mAP_75 is AP at IoU=0.75.

@MiXaiLL76

@Warcry25 I think it's time to close this issue
