Am I correct in saying that the training code in customtokenizer only trains on one X/Y pair at a time instead of a whole batch at once?
Are there any plans to add batching to the training code so it can process a large batch at once? As it stands right now, when I combine the English, German, Polish, Japanese, and Portuguese datasets from Hugging Face, it takes about 1 hour per epoch and only 3 of 8 GB of VRAM are used.
(Obviously this doesn't run on Google Colab, since trying to load 32,000+ files crashes Google Drive AND runs out of instance RAM on Colab, but if it could, batching would be a very nice idea: one epoch could be done in maybe a hundred steps instead of tens of thousands. Right now it seems to fit one set of data at the expense of another; in plain English, it learns one feature and gets worse at another, then corrects that one and gets worse at the first, whereas if it were all batched I think it would "see the bigger picture" and treat all the patterns as related parts of a whole.)
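For reference, here's roughly what I mean by batching. This is just a minimal sketch assuming the data is a list of `(hubert_features, semantic_tokens)` tensor pairs of variable length; the model interface, shapes, and loss are placeholders, not customtokenizer's actual code:

```python
# Minimal sketch of mini-batch training with padded variable-length sequences.
# Assumes: dataset yields (features: FloatTensor[T, feat_dim], tokens: LongTensor[T]).
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate(batch):
    # Pad variable-length sequences so a whole batch fits in one tensor.
    xs, ys = zip(*batch)
    x = pad_sequence(xs, batch_first=True)                        # (B, T, feat_dim)
    y = pad_sequence(ys, batch_first=True, padding_value=-100)    # -100 = ignored by the loss
    return x, y

def train_epoch(model, dataset, optimizer, device="cuda", batch_size=64):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate)
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)        # skip padded positions
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)                                         # placeholder: (B, T, vocab)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

With something like this, each optimizer step averages the gradient over the whole batch instead of chasing one sample at a time, which is the "bigger picture" effect I'm talking about, and it should also actually use the spare VRAM.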
(But I dunno, maybe this architecture isn't enough for OMNI-LINGUAL and can at best only learn languages within a group like traditional linguists define them: Romance languages, Indo-European languages, and so on. I say that because just last night I tried using the English 23-epoch model as a pretraining starting point, and 8 hours later, at 8 epochs, it can sort of map an unsupported language like Vietnamese. It approximates a lot of words at the wrong "notes", but it did better than expected in THAT regard, so the theory isn't too far off. Where it screwed up is that some speakers' English words, said with an accent, suddenly turned into a Russian or Polish phoneme, which really makes me wonder if there's a limit to how "different" the languages can be. But still, that's way off topic here; I think batched training would really help with all this.)
(But the important thing here is that Bark DOES seem to have the ability to generate the correct phonemes for novel sounds if you can just tease out the right semantic tokens, which you can get really close to by having the quantizer hybridize the languages your target language is closest to... but ghyaaah, that's such a pain in the ass to do for every language.)
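For the hybridizing idea, this is purely illustrative: concatenate the closest-language datasets and fine-tune from the English checkpoint. `load_pairs` and `load_checkpoint` are hypothetical stand-ins for however the repo actually loads data and weights, and the paths are made up:

```python
# Illustrative only: mix the nearest related languages and fine-tune from English.
import torch
from torch.utils.data import ConcatDataset

related = [load_pairs(p) for p in ("data/pt", "data/es", "data/fr")]  # hypothetical loader
mixed = ConcatDataset(related)                                        # one combined dataset

model = load_checkpoint("english_epoch_23.pth")    # hypothetical: reuse the English weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(8):
    train_epoch(model, mixed, optimizer)           # reuses the batched loop sketched above
```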