Replies: 3 comments 1 reply
-
cc: @ashdtu
-
Hi @aaarrti. Thanks for raising this discussion. The motivation at the time was to keep the model head generic between the RoBERTa and BERT families for downstream tasks. The pooling layer, which is actually a linear layer on top of the [CLS] token embedding, is trained using a next-sentence-prediction (NSP) objective. RoBERTa, the improved model, was not trained with an NSP objective, unlike the original BERT. For benchmarks like GLUE and SQuAD it's actually helpful to use the [CLS] token embedding directly, without the pooling layer, because the tanh activation at the end may cause vanishing gradients for the downstream task. Overall, I think it's a great point, and we can leave this decision to the user, i.e. whether to use the pooling-layer output or the [CLS] hidden state directly, like the transformers API does. I raised an issue here and we will be adding this soon. Thanks a lot for pointing this out!
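For reference, here is a minimal sketch of what such a pooler does, written in burn-style Rust: a dense layer applied to the hidden state of the first ([CLS]) token, followed by tanh. This is not the actual bert-burn code; the module and method names are illustrative, and exact burn API details (e.g. whether `LinearConfig::init` takes a device) vary between versions.

```rust
// Illustrative sketch of a BERT-style pooler (not the actual bert-burn code).
use burn::module::Module;
use burn::nn::{Linear, LinearConfig};
use burn::tensor::{backend::Backend, Tensor};

#[derive(Module, Debug)]
pub struct Pooler<B: Backend> {
    dense: Linear<B>,
}

impl<B: Backend> Pooler<B> {
    pub fn new(device: &B::Device, hidden_size: usize) -> Self {
        Self {
            // Dense layer mapping hidden_size -> hidden_size, as in BERT.
            dense: LinearConfig::new(hidden_size, hidden_size).init(device),
        }
    }

    /// Takes the encoder output [batch, seq_len, hidden], selects the hidden
    /// state of the first ([CLS]) token, and applies dense + tanh.
    pub fn forward(&self, hidden_states: Tensor<B, 3>) -> Tensor<B, 2> {
        let [batch_size, _seq_len, hidden] = hidden_states.dims();
        let cls = hidden_states
            .slice([0..batch_size, 0..1, 0..hidden])
            .reshape([batch_size, hidden]);
        self.dense.forward(cls).tanh()
    }
}
```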
-
Hi @ashdtu, might I suggest something along the lines of the following?

```rust
// roberta.rs
// renamed from BertModel
#[derive(Module, Debug)]
pub struct RoBertaModel<B: Backend> {
    pub embeddings: BertEmbeddings<B>,
    pub encoder: TransformerEncoder<B>,
}

// bert.rs
#[derive(Module, Debug)]
pub struct BertModel<B: Backend> {
    pub embeddings: BertEmbeddings<B>,
    pub encoder: TransformerEncoder<B>,
    pub pooler: Linear<B>,
}
```

Btw, I think I should have some free time at my disposal over the next few weeks, so I'd be happy to open a PR with those changes once I get a green light from you.
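To make the "leave it to the user" part concrete, the forward pass of such a `BertModel` could return both the full sequence output and the pooled output, similar to what transformers does. The sketch below is hypothetical: `BertInferenceBatch`, `BertModelOutput`, and the `BertEmbeddings::forward` signature are assumptions for illustration, not the actual bert-burn API.

```rust
// Hypothetical sketch: input/output types and the embeddings forward
// signature are assumptions, not the real bert-burn API.
use burn::nn::transformer::TransformerEncoderInput;

/// Output carrying both views, so callers can pick either one.
pub struct BertModelOutput<B: Backend> {
    /// Per-token hidden states from the encoder: [batch, seq_len, hidden].
    pub hidden_states: Tensor<B, 3>,
    /// tanh(dense([CLS] hidden state)) from the pooler: [batch, hidden].
    pub pooled_output: Tensor<B, 2>,
}

impl<B: Backend> BertModel<B> {
    pub fn forward(&self, input: BertInferenceBatch<B>) -> BertModelOutput<B> {
        // Embed tokens, then run the transformer encoder (assumed signatures).
        let embedded = self.embeddings.forward(input);
        let hidden_states = self
            .encoder
            .forward(TransformerEncoderInput::new(embedded));

        // Pool: dense + tanh over the first ([CLS]) token, as in classic BERT.
        let [batch_size, _seq_len, hidden] = hidden_states.dims();
        let cls = hidden_states
            .clone()
            .slice([0..batch_size, 0..1, 0..hidden])
            .reshape([batch_size, hidden]);
        let pooled_output = self.pooler.forward(cls).tanh();

        BertModelOutput {
            hidden_states,
            pooled_output,
        }
    }
}
```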
-
Hi,
I've recently been experimenting with porting a BERT-based classifier to burn, and today I noticed that BERT has recently been added to the model zoo. However, I am a little confused about the implementation. This is the one from burn; in contrast, here is the official one from keras_nlp, and here is the one from 🤗 transformers. It looks to me like burn's implementation is missing a pooler layer after the encoder. Is there a reason you decided to omit it, or was it something you overlooked?