Replies: 3 comments 1 reply
-
cc: @ashdtu
-
Hi @aaarrti. Thanks for raising this discussion. The motivation at the time was to keep the model head generic between the RoBERTa and BERT families for downstream tasks. The pooling layer, which is actually a linear layer on top of the [CLS] token embedding, is trained using a next-sentence-prediction (NSP) objective. RoBERTa, the improved model, was not trained with an NSP objective, unlike the original BERT. For benchmarks like GLUE and SQuAD it's actually helpful to use the [CLS] token embedding directly, without the pooling layer, because the tanh activation at the end may cause vanishing gradients for the downstream task. Overall, I think it's a great point, and we can leave this decision to the user, i.e. whether to use the pooling-layer output or the [CLS] hidden state directly, like the transformers API does. I raised an issue here and we will be adding this soon. Thanks a lot for pointing this out!
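For reference, here is a minimal sketch of what such a pooler does, written in burn-style Rust: a dense layer applied to the hidden state of the first ([CLS]) token, followed by tanh. This is not the actual bert-burn code; the module and method names are illustrative, and exact burn API details (e.g. whether `LinearConfig::init` takes a device) vary between versions.

```rust
// Illustrative sketch of a BERT-style pooler (not the actual bert-burn code).
use burn::module::Module;
use burn::nn::{Linear, LinearConfig};
use burn::tensor::{backend::Backend, Tensor};

#[derive(Module, Debug)]
pub struct Pooler<B: Backend> {
    dense: Linear<B>,
}

impl<B: Backend> Pooler<B> {
    pub fn new(device: &B::Device, hidden_size: usize) -> Self {
        Self {
            // Dense layer mapping hidden_size -> hidden_size, as in BERT.
            dense: LinearConfig::new(hidden_size, hidden_size).init(device),
        }
    }

    /// Takes the encoder output [batch, seq_len, hidden], selects the hidden
    /// state of the first ([CLS]) token, and applies dense + tanh.
    pub fn forward(&self, hidden_states: Tensor<B, 3>) -> Tensor<B, 2> {
        let [batch_size, _seq_len, hidden] = hidden_states.dims();
        let cls = hidden_states
            .slice([0..batch_size, 0..1, 0..hidden])
            .reshape([batch_size, hidden]);
        self.dense.forward(cls).tanh()
    }
}
```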
-
Hi @ashdtu, might I suggest something along the lines of the following?

```rust
// roberta.rs
// renamed from BertModel
#[derive(Module, Debug)]
pub struct RoBertaModel<B: Backend> {
    pub embeddings: BertEmbeddings<B>,
    pub encoder: TransformerEncoder<B>,
}

// bert.rs
#[derive(Module, Debug)]
pub struct BertModel<B: Backend> {
    pub embeddings: BertEmbeddings<B>,
    pub encoder: TransformerEncoder<B>,
    pub pooler: Linear<B>,
}
```

Btw, I think I should have some free time at my disposal over the next few weeks, so I'd be happy to open a PR with those changes once I get a green light from you.
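To make the "leave it to the user" part concrete, the forward pass of such a `BertModel` could return both the full sequence output and the pooled output, similar to what transformers does. The sketch below is hypothetical: `BertInferenceBatch`, `BertModelOutput`, and the `BertEmbeddings::forward` signature are assumptions for illustration, not the actual bert-burn API.

```rust
// Hypothetical sketch: input/output types and the embeddings forward
// signature are assumptions, not the real bert-burn API.
use burn::nn::transformer::TransformerEncoderInput;

/// Output carrying both views, so callers can pick either one.
pub struct BertModelOutput<B: Backend> {
    /// Per-token hidden states from the encoder: [batch, seq_len, hidden].
    pub hidden_states: Tensor<B, 3>,
    /// tanh(dense([CLS] hidden state)) from the pooler: [batch, hidden].
    pub pooled_output: Tensor<B, 2>,
}

impl<B: Backend> BertModel<B> {
    pub fn forward(&self, input: BertInferenceBatch<B>) -> BertModelOutput<B> {
        // Embed tokens, then run the transformer encoder (assumed signatures).
        let embedded = self.embeddings.forward(input);
        let hidden_states = self
            .encoder
            .forward(TransformerEncoderInput::new(embedded));

        // Pool: dense + tanh over the first ([CLS]) token, as in classic BERT.
        let [batch_size, _seq_len, hidden] = hidden_states.dims();
        let cls = hidden_states
            .clone()
            .slice([0..batch_size, 0..1, 0..hidden])
            .reshape([batch_size, hidden]);
        let pooled_output = self.pooler.forward(cls).tanh();

        BertModelOutput {
            hidden_states,
            pooled_output,
        }
    }
}
```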
-
Hi,
I've recently been experimenting with porting a BERT-based classifier to burn, and today I noticed that BERT has recently been added to the model zoo. However, I am a little confused about the implementation. This is the one from burn; in contrast, here is the official one from keras_nlp, and here is the one from 🤗 transformers. It looks to me like burn's implementation is missing a pooler layer after the encoder. Is there a reason you decided to omit it, or was it something you overlooked?