Thank you for putting together this great resource. I am struggling to understand the following conflict:
The authors recommend being able to run several jobs in parallel for effective exploration (source).
They also recommend fixing the batch size for all experiments, since changing it means re-tuning many of the other hyperparameters (source).
However, in all scenarios that I have encountered, the practitioner has a fixed multi-GPU hardware configuration that they want to fully leverage to train the final model. To explore good hyperparameters (compute-bound, round 1 style), they can either (1) reduce the effective batch size and run multiple trials in parallel or (2) keep the desired effective batch size and run only a single trial at a time. To me, it is not clear whether the results from (1) translate to the "full" effective batch size of the whole configuration, and (2) seems very slow.
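To make the two options concrete, here is a minimal sketch of the arithmetic. The numbers (8 GPUs, a per-GPU batch of 32) and the helper `plan_trials` are hypothetical and only illustrate the trade-off. They are not taken from the playbook, and the sketch assumes plain data parallelism without gradient accumulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    parallel_trials: int       # how many trials run at the same time
    gpus_per_trial: int        # GPUs assigned to each trial
    effective_batch_size: int  # examples per optimizer step within one trial


def plan_trials(total_gpus: int, per_gpu_batch: int, gpus_per_trial: int) -> Plan:
    """Split a fixed GPU pool into equally sized trials (hypothetical helper)."""
    if gpus_per_trial <= 0 or total_gpus % gpus_per_trial != 0:
        raise ValueError("gpus_per_trial must be a positive divisor of total_gpus")
    return Plan(
        parallel_trials=total_gpus // gpus_per_trial,
        gpus_per_trial=gpus_per_trial,
        # Data parallelism: each GPU processes per_gpu_batch examples per step.
        effective_batch_size=gpus_per_trial * per_gpu_batch,
    )


# Assumed hardware: 8 GPUs, each fitting a batch of 32 examples.
option_1 = plan_trials(total_gpus=8, per_gpu_batch=32, gpus_per_trial=1)
option_2 = plan_trials(total_gpus=8, per_gpu_batch=32, gpus_per_trial=8)

print(option_1)  # many trials at once, each with a small effective batch size
print(option_2)  # a single trial with the full effective batch size
```

Under these assumptions, option (1) runs 8 trials at an effective batch size of 32, while option (2) runs 1 trial at an effective batch size of 256. That is an 8x gap in batch size between exploration and the final run.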
Am I missing something? If not, which strategy is more efficient? Is it effective to explore with (1) and exploit with (2), i.e., do the results from (1) transfer well enough to the full batch size?