Online leaderboard result and question on comparison-result settings for other methods #4
The reason I raised the question is that I noticed the dataset for MILE is much larger than those of TransFuser and LAV (roughly 10x). It is the same question I asked on InterFuser (opendilab/InterFuser#3): the large dataset size makes it unclear whether it is the model or the extra data that brings the performance boost.
I can answer your first question. Since the numbers come from three different benchmarks (MILE is evaluated on their new benchmark), they are not really comparable.
Thanks @Kait0, I see. Then only one question remains: how can we prove/analyze whether it is the method (MILE) or the large dataset that brings the performance boost?
Thanks @Kait0 for answering the first question! The updated version of the paper will be available tomorrow on arXiv. As for the discussion around data: if we compare the number of frames in the dataset, ours does look larger than the other methods'. When we look at the number of hours of driving data, however, we realise all the methods have roughly the same amount: LAV (28 hours, or 400k frames at 4Hz), TransFuser (31 hours, or 220k frames at 2Hz), TCP (60 hours, or 400k frames at 2Hz), and ours (32 hours, or 2.9M frames at 25Hz).
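To make the hours-vs-frames comparison concrete, here is a quick back-of-the-envelope check (the figures are copied from the comment above and are approximate; this is only a sketch):

```python
# Sanity check of the "hours vs. frames" numbers quoted above.
datasets = {
    "LAV":        {"hours": 28, "hz": 4},
    "TransFuser": {"hours": 31, "hz": 2},
    "TCP":        {"hours": 60, "hz": 2},
    "MILE":       {"hours": 32, "hz": 25},
}

for name, d in datasets.items():
    frames = d["hours"] * 3600 * d["hz"]
    print(f"{name}: ~{frames / 1e3:.0f}k frames at {d['hz']} Hz")

# MILE: 32 * 3600 * 25 ≈ 2.88M frames, matching the ~2.9M figure,
# while the wall-clock driving time is comparable across methods.
```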
I see, thanks for the reply. How about controlling the total number of frames rather than the hours? Since training typically shuffles the dataset, randomly selecting a fixed total number of frames should be enough to isolate the cause; would you also try controlling the total frames in an ablation study? Put differently: why not also collect at 2Hz, which would keep the frame count smaller and cover more distinct scenarios? A dataset of many near-duplicate frames may work against your point about generalization. And one more question: what is the online leaderboard result for MILE?
It is probably possible to train the same model with fewer frames by reducing the video frequency (25Hz -> 5Hz, for example). The online leaderboard submission is in preparation.
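A minimal sketch of what such a frequency reduction could look like, assuming a simple keep-every-Nth-frame scheme (the helper and the numbers are illustrative, not MILE's actual data pipeline):

```python
def subsample(frames, src_hz=25, dst_hz=5):
    """Keep every (src_hz // dst_hz)-th frame to emulate a lower recording rate."""
    stride = src_hz // dst_hz
    return frames[::stride]

# e.g. ~2.88M frames at 25 Hz -> ~576k frames at 5 Hz, same driving hours
frames = list(range(2_880_000))   # stand-in for the recorded frames
print(len(subsample(frames)))     # 576000
```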
I see. Please leave the issue open (thanks), and we can wait to see whether someone experiments with controlling the total frames. Besides, here is the question mentioned above: why not also collect at 2Hz? A dataset of more similar frames (25Hz collection) may work against your point about generalization. Is there any reason MILE uses a high 25Hz frequency, unlike the other methods? And how do you make sure such a high collection frequency does not overfit on a large dataset of similar frames (the generalization problem you mentioned)?
There is no particular reason for 25Hz, and the frequency can probably be set to something lower.
Their model uses 12 temporal frames during training (I think it has a recurrent component inside). With such a model, sub-sampling the training data (like all the single-frame models do) might not be a good strategy, as you would also need to increase the distance between the frames that the model sees at inference. I thought 25 Hz was a weird number, as most works set the CARLA simulator frequency to 20 Hz (the default in the leaderboard client); this might make it harder to compare to other work.
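To illustrate the point about sequence models, a hypothetical sketch of how the frame indices of a training clip would need to be strided so that the temporal spacing matches what the model sees at inference (the 12-frame count is from the comment above; the frequencies and the helper are assumptions):

```python
def clip_indices(start, n_frames=12, data_hz=25, model_hz=5):
    """Pick n_frames indices so consecutive frames are 1/model_hz seconds apart,
    even though the underlying data was recorded at data_hz."""
    stride = data_hz // model_hz
    return [start + i * stride for i in range(n_frames)]

print(clip_indices(0))  # [0, 5, 10, ..., 55]: 12 frames spanning ~2.2 s of driving
```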
Yes, it still seems odd to me, and I don't find the hours-of-data argument convincing, especially after the author's response about hours, frames, and frequency: a dataset of more similar frames (25Hz collection) may work against the point about generalization. It leaves me even more confused about how to make sure such a high collection frequency does not overfit on a large dataset of similar frames (the generalization problem mentioned above). Maybe we leave this question (issue) to future experiments; it can also serve as a reminder. Let's see. Anyway, thanks for releasing your code to the community; it makes it easy for us to experiment with the things we are unsure about.
Maybe one interesting detail to point out is that the paper trains for 50,000 iterations at batch size 64 (64 x 50,000 -> 3.2M samples). If true, this would be similar to training for roughly ten epochs on data stored at 2 FPS, except that instead of training on the same images multiple times, you train on slightly augmented versions of them (in the sense that the vehicle has moved a little). So training on more densely sampled data can perhaps be understood as a form of data augmentation.
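The arithmetic behind that observation, as a rough sketch (using only the numbers quoted above):

```python
iterations = 50_000
batch_size = 64
samples_seen = iterations * batch_size   # 3,200,000 training samples

frames_25hz = 2_900_000                  # ~2.9M frames stored at 25 Hz
frames_2hz = frames_25hz * 2 // 25       # ~232k frames if stored at 2 Hz instead

print(samples_seen / frames_25hz)        # ~1.1  -> roughly a single epoch
print(samples_seen / frames_2hz)         # ~13.8 -> on the order of ten epochs
```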
That's correct, the model was trained for a single epoch. And I agree with the subsequent analysis.
I see, that makes sense. Thanks @Kait0 and @anthonyhu. I (or maybe someone else who is interested) will attach more ablation/comparison result tables once I have time. (A flag that may never be reached, hahaha.)
Thanks for your work. I'm wondering whether you have submitted the agent to the online leaderboard; what score does it achieve?
For Table I in the paper, screenshot here:
Which versions of the LAV and TransFuser (etc.) weights did you use in the paper? Did you retrain them on your dataset or directly use the pre-trained weights provided by the authors, and which versions of the weight and config files were used?
I ask since I noticed you use the 2.9M-frame dataset for MILE, as stated in the paper.
Update: 2022/11/1, adding the discussion link here: Kin-Zhang/carla-expert#4