What is the difference between AraBERTv0.1 and AraBERTv1? #14
-
Thank you for your contribution! But I'm still confused about the difference between the AraBERTv0.1 and AraBERTv1 models.
-
From their paper, I can say the difference is that v1 is trained on segmented data while v0.1 is not.
-
Yeah, it's exactly what @WaelMohammedAbed said. The v0.1 model is trained on regular Arabic text, while v1 is trained on data pre-segmented with Farasa. Thank you @WaelMohammedAbed for answering!
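For anyone wondering what this means in practice: v1 expects Farasa-segmented input at inference time too, while v0.1 works on raw text. A minimal sketch, assuming the current `arabert` package (with a Farasa backend such as farasapy installed) and the `aubmindlab` model names on the Hugging Face hub:

```python
# A minimal sketch (not from the AraBERT repo) of the practical difference:
# v1 expects Farasa-segmented input, v0.1 takes raw text. Assumes the
# `arabert` and `transformers` packages plus a Farasa backend (farasapy)
# are installed, and the aubmindlab model names on the Hugging Face hub.
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer

text = "اللغة العربية جميلة"

# AraBERTv0.1: tokenize the raw text directly, no segmentation step.
tok_v01 = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01")
print(tok_v01.tokenize(text))

# AraBERTv1: run the preprocessor first; for v1 model names it applies
# Farasa segmentation (splitting off prefixes/suffixes with "+").
prep_v1 = ArabertPreprocessor(model_name="aubmindlab/bert-base-arabert")
segmented = prep_v1.preprocess(text)
tok_v1 = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabert")
print(tok_v1.tokenize(segmented))
```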
"To avoid this issue, we first segment the words using Farasa (Abdelali et al., 2016) into stems, prefixes and suffixes. For instance, “اللغة - Alloga” becomes ال+لغ+ة -Al+ log +a”. Then, we trained a SentencePiece (an unsupervised text tokenizer and detokenizer (Kudo, 2018)), in unigram mode, on the segmented pre-training dataset to produce a subword vocabulary of ∼60K tokens. To evaluate the impact of the proposed tokenization, we also trained SentencePiece on non-segmented text to create a second version of ARABERT (AraBERTv0.1) that does not require any segmentation"