Ahmet A. Akın edited this page Feb 16, 2016 · 36 revisions

Welcome to the zemberek-nlp wiki!

FAQ

What is the state of the project?

Although we (Mehmet and I) are not developing the project regularly, we will continue adding features and fixing bugs, without a fixed timeline. As always, pull requests are welcome.

What is the difference between Zemberek2 and Zemberek-nlp?

Zemberek2, a morphological analysis and generation tool, is no longer maintained. Zemberek-nlp was developed almost from scratch and shares almost no code with Zemberek2. Zemberek2 had several shortcomings: overly strict parsing, incompatible formatting, a weak dictionary, complex code, slowness, and no disambiguation. We aimed to fix all of these issues with Zemberek-nlp and succeeded with some of them. Unfortunately, the complexity of the code remains a major issue that has hindered development.

Why does disambiguation not work well?

The current disambiguation mechanism is an HMM system that uses two language models. However, the models are trained on a relatively small and noisy corpus, so disambiguation performance is sub-par. We tested a separate Perceptron system and it works much better, but we have not incorporated it into Zemberek. Even if we do, performance will probably not be satisfactory. We had some ideas about automatically generating training sets and targeting the most ambiguous words, but we have not been able to work on them.
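To make the idea concrete, here is a minimal sketch of HMM-style disambiguation with Viterbi decoding. It is not Zemberek's implementation: the class name, the `lemma+Tag` notation, and every score are made up for illustration, and the two small score tables only stand in for the two language models mentioned above. Each word is assumed to have at least one candidate analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ViterbiDisambiguationSketch {

    // Log-score used for anything the toy models have never seen (assumption).
    private static final double UNSEEN = -10.0;

    // Analyses are written "lemma+Tag"; the tag is the part after the last '+'.
    private static String tagOf(String analysis) {
        return analysis.substring(analysis.lastIndexOf('+') + 1);
    }

    /** Picks one analysis per word so that the summed log-scores are maximal. */
    public static List<String> decode(List<List<String>> candidates,
                                      Map<String, Double> analysisLogProb,
                                      Map<String, Double> tagBigramLogProb) {
        int n = candidates.size();
        if (n == 0) {
            return new ArrayList<>();
        }
        double[][] best = new double[n][]; // best score of a path ending in candidate j
        int[][] back = new int[n][];       // index of the previous candidate on that path
        for (int i = 0; i < n; i++) {
            List<String> current = candidates.get(i);
            best[i] = new double[current.size()];
            back[i] = new int[current.size()];
            for (int j = 0; j < current.size(); j++) {
                double emit = analysisLogProb.getOrDefault(current.get(j), UNSEEN);
                if (i == 0) {
                    best[i][j] = emit;
                    back[i][j] = -1;
                    continue;
                }
                List<String> previous = candidates.get(i - 1);
                double bestScore = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int k = 0; k < previous.size(); k++) {
                    String bigram = tagOf(previous.get(k)) + ">" + tagOf(current.get(j));
                    double score = best[i - 1][k]
                            + tagBigramLogProb.getOrDefault(bigram, UNSEEN);
                    if (score > bestScore) {
                        bestScore = score;
                        bestPrev = k;
                    }
                }
                best[i][j] = bestScore + emit;
                back[i][j] = bestPrev;
            }
        }
        // Find the best final candidate, then follow the back-pointers.
        int arg = 0;
        for (int j = 1; j < best[n - 1].length; j++) {
            if (best[n - 1][j] > best[n - 1][arg]) {
                arg = j;
            }
        }
        String[] path = new String[n];
        for (int i = n - 1; i >= 0; i--) {
            path[i] = candidates.get(i).get(arg);
            arg = back[i][arg];
        }
        return Arrays.asList(path);
    }

    public static void main(String[] args) {
        // "yaz geldi": "yaz" is either a noun (summer) or a verb (write).
        List<List<String>> candidates = Arrays.asList(
                Arrays.asList("yaz+Verb", "yaz+Noun"),
                Arrays.asList("gel+Verb"));
        Map<String, Double> analysisLogProb = new HashMap<>();
        analysisLogProb.put("yaz+Verb", -0.7); // toy values, not trained
        analysisLogProb.put("yaz+Noun", -0.9);
        analysisLogProb.put("gel+Verb", -0.1);
        Map<String, Double> tagBigramLogProb = new HashMap<>();
        tagBigramLogProb.put("Noun>Verb", -0.3);
        tagBigramLogProb.put("Verb>Verb", -2.0);
        // prints [yaz+Noun, gel+Verb]
        System.out.println(decode(candidates, analysisLogProb, tagBigramLogProb));
    }
}
```

In isolation the toy scores prefer `yaz+Verb`, but the context flips the choice to `yaz+Noun`. This is also why model quality matters so much: with noisy scores, the context can just as easily flip a correct choice to a wrong one.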

Can I use it as a stemmer or lemmatizer?

Yes, it is trivial to access stems and lemmas from the parse result. However, correct stemming requires good disambiguation.

Can I add a new dictionary item programmatically?

Yes.

Can I generate words?

[Yes.](https://github.com/ahmetaa/turkish-nlp-examples/blob/master/src/main/java/morphology/ChangeStem.java)

Why is the code in English?

The Zemberek2 code was written entirely in Turkish, which was one of the points that made it attractive to newcomers. However, we wanted Zemberek-nlp to be used in the global NLP community and academia, so we used English in the code. It did not quite work out that way, but we are sticking to that decision.

What about LibreOffice or Lucene-Solr extensions?

We do not currently plan to write extensions for external applications, but it is easy to write a Turkish stemmer or lemmatizer. There is already a [Lucene-Solr Turkish Analysis](https://github.com/iorixxx/lucene-solr-analysis-turkish) project that uses different NLP tools. For LibreOffice, updating tr-spell with a really large corpus is perhaps a better idea.

Is morphological parsing overrated?

For many NLP tasks, yes. When you have a lot of data, the importance of morphological parsing accuracy diminishes. Often, tools like Zemberek are used only for lemmatization and stemming, and sometimes no morphology tools are required at all. For example, recent work on NER for Turkish shows that systems can achieve excellent results without any advanced morphological parsing. Statistical unsupervised morpheme tools can work quite well for speech recognition systems. However, the data sparsity of Turkish is still a problem, and advanced morphology is still used in some tasks such as machine translation, dependency parsing and morphological language models. That said, some see the recent advances in neural networks as the dawn of unsupervised methods, where tools like deterministic morphological parsers have little importance.

Why don't you use an FST?

Most Turkish morphological parsing tools use an FST (finite state transducer). Oflazer, Sak and Çöltekin use this approach. An FST greatly simplifies the parser and is very fast. However, we did not go that route because:

  • Good FST tools were not available for Java.
  • Some FST tools were too low level.
  • You cannot modify the search graph at run-time if you use an FST tool.

Instead, we created a graph programmatically. However, our design turned out to be complex and inadequate for some exceptional cases.
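As a rough illustration of a search graph that is built in code and can be changed at run-time, consider the sketch below. It is not Zemberek's actual design: the class, state names and edge labels are hypothetical, only a noun root followed by an optional plural suffix is modeled, and vowel harmony is ignored, so the graph over-generates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MutableMorphGraphSketch {

    /** A node in the search graph; outgoing edges are labeled with surface strings. */
    static final class State {
        final String name;
        final boolean accepting;
        final Map<String, State> edges = new LinkedHashMap<>();

        State(String name, boolean accepting) {
            this.name = name;
            this.accepting = accepting;
        }
    }

    private final State start = new State("Start", false);
    private final State noun = new State("Noun", true);
    private final State plural = new State("Plural", true);

    public MutableMorphGraphSketch() {
        // Suffix edges are wired once; both plural forms are allowed after any noun.
        noun.edges.put("ler", plural);
        noun.edges.put("lar", plural);
    }

    /** Adds a dictionary item while the graph is in use; no recompilation step. */
    public void addNounRoot(String root) {
        start.edges.put(root, noun);
    }

    /** Returns every path through the graph that consumes the whole word. */
    public List<String> analyze(String word) {
        List<String> results = new ArrayList<>();
        walk(start, word, "", results);
        return results;
    }

    private void walk(State state, String rest, String path, List<String> results) {
        if (rest.isEmpty()) {
            if (state.accepting) {
                results.add(path);
            }
            return;
        }
        for (Map.Entry<String, State> edge : state.edges.entrySet()) {
            String label = edge.getKey();
            if (rest.startsWith(label)) {
                String step = label + ":" + edge.getValue().name;
                String extended = path.isEmpty() ? step : path + " + " + step;
                walk(edge.getValue(), rest.substring(label.length()), extended, results);
            }
        }
    }

    public static void main(String[] args) {
        MutableMorphGraphSketch graph = new MutableMorphGraphSketch();
        graph.addNounRoot("ev");
        System.out.println(graph.analyze("kitaplar")); // prints []
        graph.addNounRoot("kitap");                    // modify the graph at run-time
        System.out.println(graph.analyze("kitaplar")); // prints [kitap:Noun + lar:Plural]
    }
}
```

Adding a root is a single map insertion here. A realistic graph also has to handle vowel harmony, consonant changes and irregular roots, which is the kind of exceptional case where a hand-built design becomes complicated.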

What are the alternatives?

There are many tools for Turkish NLP available. Some are:

  • Kemal Oflazer's command-line parser.
  • Haşim Sak's morphological parser and disambiguator.
  • TRmorph project.
  • Ali Ok's parser.
  • ITU Turkish NLP pipeline.
  • Deniz Yüret's deasciifier and disambiguator.
  • ODTÜ-Sabancı Treebank.
  • Weka, OpenNLP, NLTK, Stanford NLP and many recent neural network based tools can be trained for Turkish.