
What is the difference between AraBERTv0.1 and AraBERTv1? #14

Answered by WaelMohammedAbed
hischen asked this question in Q&A
From their paper, I can say the difference is that v1 is pre-trained on Farasa-segmented data while v0.1 is not.
"To avoid this issue, we first segment the words using Farasa (Abdelali et al., 2016) into stems, prefixes and suffixes. For instance, “اللغة - Alloga” becomes ال+لغ+ة -Al+ log +a”. Then, we trained a SentencePiece (an unsupervised text tokenizer and detokenizer (Kudo, 2018)), in unigram mode, on the segmented pre-training dataset to produce a subword vocabulary of ∼60K tokens. To evaluate the impact of the proposed tokenization, we also trained SentencePiece on non-segmented text to create a second version of ARABERT (AraBERTv0.1) that does not require any segmentation"

Answer selected by WissamAntoun
This discussion was converted from issue #14 on December 09, 2020 13:42.