What is the state of the project?
We will try to make a new release every 8-10 weeks if possible. As always, pull requests are welcome.
What is the difference between Zemberek2 and Zemberek-NLP?
Zemberek2, a morphological analysis and generation tool, is no longer maintained. Zemberek-NLP was developed almost from scratch and shares almost no code with Zemberek2. Zemberek2 had several shortcomings, such as overly strict parsing, incompatible formatting, a weak dictionary, complex code, slowness, and no disambiguation. Hopefully Zemberek-NLP will address all of these issues.
What is "Morphological Analysis"?
Morphological analysis finds the meaningful syntactic parts (morphemes) of a word, such as root words and suffixes. For example, "kalemlerimden" (from my pencils) is analyzed as follows in Zemberek:
[kalem:Noun] kalem:Noun+ler:A3pl+im:P1sg+den:Abl
[kalem:Noun] → Lemma and root POS.
kalem:Noun → Stem.
ler:A3pl → Plural suffix `A3pl` with form `ler`
im:P1sg → First person singular possessive suffix `P1sg` with form `im`
den:Abl → Ablative suffix `Abl` (from) with form `den`
Usually the actual letters of a morpheme, such as `den` in the example above, are called the surface form, and the suffix name that represents it, such as `Abl`, is called the lexical form.
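To make the surface/lexical distinction concrete, here is a minimal Python sketch that splits an analysis string in the format shown above into (surface form, lexical form) pairs. It is plain string handling for illustration, not Zemberek's API: the function name `parse_analysis` is hypothetical, and derivation markers such as `|Zero→Noun` are deliberately left unhandled.

```python
def parse_analysis(analysis):
    """Split an analysis string into lemma, root POS and (surface, lexical) pairs.

    Hypothetical helper for illustration only; it is not part of Zemberek.
    """
    # "[kalem:Noun] kalem:Noun+ler:A3pl" -> "[kalem:Noun" and "kalem:Noun+ler:A3pl"
    header, body = analysis.split("] ", 1)
    lemma, root_pos = header.lstrip("[").split(":", 1)

    morphemes = []
    for part in body.split("+"):
        # rpartition leaves the surface form empty when a morpheme has no
        # letters of its own, as in a bare "A3sg".
        surface, _, lexical = part.rpartition(":")
        morphemes.append((surface, lexical))
    return lemma, root_pos, morphemes


lemma, root_pos, morphemes = parse_analysis(
    "[kalem:Noun] kalem:Noun+ler:A3pl+im:P1sg+den:Abl"
)
for surface, lexical in morphemes:
    print(f"{surface or '-':6} {lexical}")

# Joining the surface forms gives back the analyzed word.
word = "".join(surface for surface, _ in morphemes)
```

Here `word` ends up as "kalemlerimden", which shows that the surface forms are the letters of the word itself, while the lexical forms are the abstract names attached to them.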
Finding the morphemes of Turkish words computationally is not easy. Such a system requires knowledge of complex phonetic rules and hand-crafted suffix sequence rules (morphotactics).
What is "Morphological Disambiguation"?
Many Turkish words are highly ambiguous. A single word can have 2 to 10 correct analyses with different stems and suffixes, depending on the context. For example, the word "yarın" can be interpreted as follows:
[yar:Noun] yar:Noun+A3sg+ın:Gen (cliff's)
[yar:Noun] yar:Noun+A3sg+ın:P2sg (your cliff)
[yarı:Noun] yarı:Noun+A3sg+n:P2sg (your half)
[yarı:Adj] yarı:Adj|Zero→Noun+A3sg+n:P2sg (your half, root is adjective)
[yarın:Adv] yarın:Adv (during tomorrow)
[yarın:Noun,Time] yarın:Noun+A3sg (tomorrow; this is the most common)
[yarmak:Verb] yar:Verb+Imp+ın:A2pl (split!)
To resolve ambiguity, a simple machine learning mechanism is trained with hand-tagged sentences. It uses context words and their analyses to determine the correct result.
Why is disambiguation not working well?
As of version 0.12.0, the morphological ambiguity resolution mechanism uses an Averaged Perceptron based algorithm. This mechanism must be trained with hand-tagged sentences. For now, we use a rule-based system to generate the training data, so the current system may not perform well. We expect to improve it quickly by adding semi-automatic training data.
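For readers unfamiliar with the technique, the following is a toy averaged perceptron in Python that picks one analysis of "yarın" based on the previous word. Everything in it is an assumption made for illustration: the two training examples are made up, the features (the candidate analysis alone, and the candidate paired with the previous word) are hypothetical, and Zemberek's actual features and training procedure are not shown here.

```python
from collections import defaultdict

NOUN = "yarın:Noun+A3sg"        # tomorrow
VERB = "yar:Verb+Imp+ın:A2pl"   # split!
CANDIDATES = [NOUN, VERB]

# Made-up training data: (previous word, candidate analyses, correct analysis).
EXAMPLES = [
    ("değil", CANDIDATES, NOUN),   # "Bugün değil yarın" (not today, tomorrow)
    ("odunu", CANDIDATES, VERB),   # "Odunu yarın" (split the wood!)
]


def features(prev_word, analysis):
    """Hypothetical features describing one candidate in its context."""
    return [f"analysis={analysis}", f"prev={prev_word}|analysis={analysis}"]


def predict(weights, prev_word, candidates):
    """Return the candidate whose features have the highest total weight."""
    return max(
        candidates,
        key=lambda a: sum(weights[f] for f in features(prev_word, a)),
    )


def train(examples, epochs=5):
    weights = defaultdict(float)
    totals = defaultdict(float)   # running sum of the weights after every step
    steps = 0
    for _ in range(epochs):
        for prev_word, candidates, gold in examples:
            guess = predict(weights, prev_word, candidates)
            if guess != gold:
                # Perceptron update: reward the correct analysis, punish the guess.
                for f in features(prev_word, gold):
                    weights[f] += 1.0
                for f in features(prev_word, guess):
                    weights[f] -= 1.0
            steps += 1
            for f, w in weights.items():
                totals[f] += w
    # Averaging over all steps is what makes this an *averaged* perceptron.
    return defaultdict(float, {f: total / steps for f, total in totals.items()})


model = train(EXAMPLES)
after_odunu = predict(model, "odunu", CANDIDATES)
after_degil = predict(model, "değil", CANDIDATES)
```

With this toy data the model prefers the verb reading after "odunu" and the noun reading after "değil". A real disambiguator needs far more hand-tagged sentences and richer features, which is why the quality of the training data matters so much.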
Can I use it as a stemmer-lemmatizer?
Yes, it is trivial to access the stem and lemmas from the parse result. However, correct stemming requires good disambiguation.
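As an illustration of why disambiguation matters for stemming, the sketch below pulls the lemma and stem out of the analysis strings listed earlier for "yarın". It works on the printed format only and does not call Zemberek; `lemma_of` and `stem_of` are hypothetical helper names.

```python
ANALYSES = [
    "[yar:Noun] yar:Noun+A3sg+ın:Gen",
    "[yar:Noun] yar:Noun+A3sg+ın:P2sg",
    "[yarı:Noun] yarı:Noun+A3sg+n:P2sg",
    "[yarı:Adj] yarı:Adj|Zero→Noun+A3sg+n:P2sg",
    "[yarın:Adv] yarın:Adv",
    "[yarın:Noun,Time] yarın:Noun+A3sg",
    "[yarmak:Verb] yar:Verb+Imp+ın:A2pl",
]


def lemma_of(analysis):
    """Lemma is the text between '[' and the first ':' (hypothetical helper)."""
    return analysis[1:analysis.index(":")]


def stem_of(analysis):
    """Stem is the surface form of the first morpheme (hypothetical helper)."""
    body = analysis.split("] ", 1)[1]
    first_morpheme = body.split("+")[0]
    return first_morpheme.split(":")[0]


# dict.fromkeys removes duplicates while keeping first-seen order.
lemmas = list(dict.fromkeys(lemma_of(a) for a in ANALYSES))
stems = list(dict.fromkeys(stem_of(a) for a in ANALYSES))
print(lemmas)  # ['yar', 'yarı', 'yarın', 'yarmak']
print(stems)   # ['yar', 'yarı', 'yarın']
```

A single surface word yields four candidate lemmas and three candidate stems here, so a stemmer has to choose one analysis before it can return one answer.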
Can I add a new dictionary item programmatically?
Can I generate words?
Where is word suggestion functionality?
After version 0.11.0, simple spelling functionality is available in the normalization module.
Where is deasciifier functionality?
Currently Zemberek-NLP does not offer deasciifier functionality. However, several applications available on the internet use Deniz Yuret's deasciifier algorithm.
Can I detect languages?
Yes. Use the lang-id module for this. There are also alternatives such as language-detector. Keep in mind that this module is meant for detecting the language of text with a reasonable character count (usually more than 20 characters). It is usually not suitable for detecting the language of individual words.
Why is the code in English?
Zemberek2 code was written completely in Turkish. This was one of the points that made it attractive to newcomers. However, we wanted Zemberek-NLP to be used in the global NLP community and academia, and therefore used English in the code. Not that it worked out that way, but we still stick to that decision.
What about LibreOffice or Lucene-Solr extensions?
We do not have extensions for external applications for now, but it is easy to write a Turkish stemmer or lemmatizer. (There is already a Lucene-Solr Turkish Analysis project available that uses different NLP tools.)
We have a plan to create a LibreOffice extension soon.
Can I use it on Android?
It is possible in theory, but we have not tried it. The library is more suitable for server or desktop usage in its current state.
Why don't you use an FST tool?
Most Turkish morphological parsing tools use an FST (finite-state transducer). Oflazer, Sak and Çöltekin use this approach. An FST greatly simplifies the parser and is very fast. However, we did not go that route because:
- Good FST tools were not available for Java.
- Some FST tools were too low level.
- You cannot modify the search graph at run-time if you use an FST tool.
Zemberek takes a different approach and uses a graph that is created programmatically. It is slower, but programming gives more flexibility.
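The idea of a programmatically built graph that can change at run-time can be sketched in a few lines of Python. This is a toy under loud assumptions, not Zemberek's implementation: the class `MorphGraph`, the state names, and the fixed suffix forms are all hypothetical, and phonetic rules such as vowel harmony are ignored entirely.

```python
class MorphGraph:
    """Toy morphotactics graph: states joined by (surface, label) edges."""

    def __init__(self):
        self.edges = {}          # state -> list of (surface, label, target state)
        self.accepting = set()   # states in which a word is allowed to end

    def add_edge(self, source, surface, label, target, accepting=True):
        if not surface:
            raise ValueError("this toy graph only supports non-empty surface forms")
        self.edges.setdefault(source, []).append((surface, label, target))
        if accepting:
            self.accepting.add(target)

    def analyze(self, word, state="ROOT"):
        """Return every edge path from `state` that consumes the whole word."""
        if not word:
            return [[]] if state in self.accepting else []
        analyses = []
        for surface, label, target in self.edges.get(state, []):
            if word.startswith(surface):
                for rest in self.analyze(word[len(surface):], target):
                    analyses.append([(surface, label)] + rest)
        return analyses


graph = MorphGraph()
graph.add_edge("ROOT", "kalem", "Noun", "NOUN")
graph.add_edge("NOUN", "ler", "A3pl", "PLURAL")
graph.add_edge("PLURAL", "im", "P1sg", "POSSESSIVE")
graph.add_edge("POSSESSIVE", "den", "Abl", "CASE")

known = graph.analyze("kalemlerimden")

# "defter" (notebook) is not in the graph yet, so nothing is found.
before = graph.analyze("defterlerimden")

# Adding a stem at run-time needs no recompilation of the graph.
graph.add_edge("ROOT", "defter", "Noun", "NOUN")
after = graph.analyze("defterlerimden")
```

After the extra `add_edge` call, `after` holds one analysis for "defterlerimden" while `before` is empty. That kind of run-time change is the flexibility the text refers to, at the cost of a slower search than a compiled FST.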
What are the alternatives?
There are many tools for Turkish NLP available. Some are:
- Kemal Oflazer's command line parser
- Haşim Sak's morphological parser and disambiguator.
- Çağrı Çöltekin's TRmorph.
- Ali Ok's trnltk-java
- ITU Turkish NLP pipeline
- TS Corpus provides a variety of Turkish linguistic corpora and online NLP tools.
- Harun Reşit Zafer's nuve
- Deniz Yuret's deasciifier and disambiguator.
- ODTÜ-Sabancı Treebank
- Weka, Open-NLP, NLTK, Stanford NLP and many recent Neural Network based tools can be trained for Turkish.
I want to know about Turkish Morphology
There are many books available on Turkish grammar. There is also slightly outdated documentation, written from the perspective of the Zemberek developers, available here.
Why the word "Zemberek"?
In Turkish, "zemberek" is the mainspring of a watch. Etymologically, it comes from the Persian word "zanbūrak زنبورك", meaning "little bee". Long ago @mdakin picked this word because it sounds funny/interesting.
Who are the one-eyed mouse and hamster in the avatars?
They are Danger Mouse and Penfold from the animated series Danger Mouse - Tehlikeli Fare.