Text Dictionary Rules
To use your own dictionary, write each entry with the syntax options described below. The general syntax is:
word [P:PRIMARY_POS, SECONDARY_POS; A:ATTRIBUTE1, ATTRIBUTE2; Pr:PRONUNCIATION]
P defines the Part of Speech (POS) tag of a dictionary item. It takes a primary value and an optional secondary value. Some primary POS values are:
Noun, Adj, Verb, Adv, Pron, Conj, Det, Postp, Interj, Dup
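The general syntax above can be read mechanically. The following is a minimal sketch of such a reader, not Zemberek's own loader: the function name `parse_entry` and the returned dictionary layout are assumptions made for illustration, and trailing `//` comments are not handled.

```python
import re

# Hypothetical helper (not part of Zemberek): one word, optionally followed
# by a single [...] block holding "KEY:value, value" groups separated by ';'.
LINE_RE = re.compile(r"^\s*(?P<word>[^\[\]]+?)\s*(?:\[(?P<props>[^\]]*)\])?\s*$")


def parse_entry(line):
    """Split a dictionary line into the word and its properties."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"not a dictionary line: {line!r}")
    entry = {"word": match.group("word")}
    props = match.group("props") or ""
    for part in props.split(";"):
        part = part.strip()
        if not part:
            continue
        # "A:InverseHarmony,NoVoicing" -> key "A", values ["InverseHarmony", "NoVoicing"]
        key, _, value = part.partition(":")
        entry[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return entry


print(parse_entry("saat [A:InverseHarmony,NoVoicing]"))
print(parse_entry("A101 [P:Abbrv; Pr:ayüzbir]"))
print(parse_entry("kalem"))
```

Every property is kept as a list of strings, so `P`, `A`, `Pr` and any other key are treated uniformly.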
For nouns, you usually do not need to add a P:Noun property. The system assumes that a word that does not end with mek or mak is a noun. For example:
kalem --> Assumes "kalem [P:Noun]"
If a word ends with mek or mak, the system assumes it is a verb.
okumak --> Assumes "okumak [P:Verb]"
For nouns that end with mak or mek, the P:Noun property must be written explicitly. For example:
çomak [P:Noun]
yumak [P:Noun]
If the first letter is capitalized, the word is assumed to be a proper noun.
Ankara --> Assumes "Ankara [P:Noun, Prop]"
For abbreviations, the Abbrv secondary POS value must be written.
Tdk [P:Abbrv] --> Assumes "Tdk [P:Noun, Abbrv]"
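The default assumptions described so far can be summarized in a few lines. This is only a sketch of the rules as stated in the text, not Zemberek's code: `default_pos` is a hypothetical name, and checking capitalization before the mek/mak ending is an assumption about rule order.

```python
def default_pos(word, declared=None):
    """Return (primary, secondary) POS for a dictionary word.

    `declared` is the list of values from an explicit P: property, if any.
    """
    if declared:
        # "[P:Abbrv]" alone is shorthand for "[P:Noun, Abbrv]".
        if declared == ["Abbrv"]:
            return ("Noun", "Abbrv")
        secondary = declared[1] if len(declared) > 1 else None
        return (declared[0], secondary)
    if word[0].isupper():
        return ("Noun", "Prop")        # capitalized: proper noun
    if word.endswith(("mek", "mak")):
        return ("Verb", None)          # infinitive ending: verb
    return ("Noun", None)              # everything else: noun


print(default_pos("kalem"))             # ('Noun', None)
print(default_pos("okumak"))            # ('Verb', None)
print(default_pos("çomak", ["Noun"]))   # ('Noun', None)
print(default_pos("Ankara"))            # ('Noun', 'Prop')
print(default_pos("Tdk", ["Abbrv"]))    # ('Noun', 'Abbrv')
```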
For all other types, the POS value must be written.
ekşi [P:Adj]
ve [P:Conj]
bu [P:Pron, Demons] --> Demonstrative Pronoun
However, defining pronouns in an external dictionary may not work well, because most pronoun suffix rules are handled in the code.
The available primary and secondary POS values are listed below.
Primary POS
Noun
Adjective (Adj)
Adverb (Adv)
Conjunction (Conj)
Interjection (Interj)
Verb
Pronoun (Pron)
Numeral (Num)
Determiner (Det)
PostPositive (Postp)
Question (Ques) // Only "mi, mu.."
Duplicator (Dup) // çıtır, şırıl etc.
Punctuation (Punc)
Secondary POS
DemonstrativePron (Demons)
Time
QuantitivePron (Quant)
QuestionPron (Ques)
ProperNoun (Prop)
Abbreviation (Abbrv)
RegularAbbreviation
PersonalPron (Pers)
ReflexivePron (Reflex)
None
Ordinal (Ord)
Cardinal (Card)
Percentage (Percent)
Ratio
Distribution (Dist)
Voicing and NoVoicing
In Turkish, if the last letter of a word or suffix is a stop consonant (tr: süreksiz sert sessiz) and a suffix that starts with a vowel is appended, the last letter changes. This is called voicing. The changes are p-b, ç-c, k-ğ, t-d, g-ğ. For example: kitap → kitab-a, pabuç → pabuc-u, çocuk → çocuğ-a, hasat → hasad-ı
Voicing also applies to some verbs: et → ed-ecek. For verb roots, however, only a final ‘t’ is voiced. It applies to most suffixes as well: elma-cık → elma-cığ-ı, yap-acak → yap-acağ-ım.
When a word ends with ‘nk’, the ‘k’ changes to ‘g’ instead of ‘ğ’. For example: cenk → ceng-e, çelenk → çeleng-i
In some loan words, a g-ğ change occurs: psikolog → psikoloğ-a
If the word has only one syllable, the rule usually does not apply: turp → turp-u, kat → kat-a, kek → kek-e, küp → küp-üm. There are exceptions, however: harp → harb-e
Some multi-syllable words do not obey this rule either: taksirat → taksirat-ı, kapat → kapat-ın
When making a dictionary entry, you do not need to add the Voicing attribute to nouns or adjectives that end with a voiceless consonant. For example, for the word kitap, the system automatically assumes it is a noun and, because it ends with p, that it has the Voicing attribute. For single-syllable words, the system automatically adds the NoVoicing attribute instead.
kitap --> assumes "kitap [P:Noun; A:Voicing]". Allows "kitaba"
top --> assumes "top [P:Noun; A:NoVoicing]". Prevents "toba"
If the default behavior is wrong for a word, the Voicing or NoVoicing attribute must be written explicitly. For example:
bulut [A:NoVoicing] --> to prevent `buludu`
harp [A:Voicing] --> to allow `harbe`
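The default Voicing/NoVoicing decision and the letter changes listed above can be sketched as follows. This is an illustration of the rules as described, not Zemberek's implementation; the function names are hypothetical, and the syllable count is approximated by counting vowels.

```python
VOWELS = "aeıioöuü"
VOICING = {"p": "b", "ç": "c", "t": "d", "k": "ğ", "g": "ğ"}


def syllable_count(word):
    # A Turkish syllable contains exactly one vowel, so count the vowels.
    return sum(1 for ch in word if ch in VOWELS)


def default_voicing(word):
    """Default attribute for a noun or adjective, or None if neither applies."""
    if word[-1] not in "pçtk":
        return None
    return "NoVoicing" if syllable_count(word) == 1 else "Voicing"


def voiced_stem(word):
    """Stem used before a vowel-initial suffix when Voicing applies."""
    last = word[-1]
    if last not in VOICING:
        return word
    if word.endswith("nk"):
        return word[:-1] + "g"          # cenk -> ceng, not cenğ
    return word[:-1] + VOICING[last]


print(default_voicing("kitap"), voiced_stem("kitap"))   # Voicing kitab
print(default_voicing("top"))                           # NoVoicing
print(voiced_stem("çelenk"))                            # çeleng
```

An explicit `A:Voicing` or `A:NoVoicing` in the entry would simply override the value returned by `default_voicing`.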
InverseHarmony
In some loan words, the suffix vowel harmony rules do not apply. For example: saat-ler and alkol-ü
saat [A:InverseHarmony,NoVoicing]
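To make the effect of InverseHarmony concrete, here is a small sketch. It assumes that the attribute makes a back last vowel behave like its front counterpart for harmony purposes, which matches the two examples above but is an assumption about the general rule. All names are hypothetical, and only consonant-final words are handled.

```python
VOWELS = "aeıioöuü"
FRONTED = {"a": "e", "ı": "i", "o": "ö", "u": "ü"}   # back vowel -> front counterpart
HIGH = {"a": "ı", "ı": "ı", "o": "u", "u": "u",
        "e": "i", "i": "i", "ö": "ü", "ü": "ü"}      # four-way harmony vowel


def harmony_vowel(word, inverse_harmony=False):
    """Vowel that drives suffix harmony: the last vowel, fronted if requested."""
    for ch in reversed(word):
        if ch in VOWELS:
            return FRONTED.get(ch, ch) if inverse_harmony else ch
    raise ValueError(f"no vowel in {word!r}")


def plural(word, inverse_harmony=False):
    suffix = "ler" if harmony_vowel(word, inverse_harmony) in "eiöü" else "lar"
    return f"{word}-{suffix}"


def accusative(word, inverse_harmony=False):
    return f"{word}-{HIGH[harmony_vowel(word, inverse_harmony)]}"


print(plural("saat"))                               # saat-lar (regular harmony, wrong here)
print(plural("saat", inverse_harmony=True))         # saat-ler
print(accusative("alkol", inverse_harmony=True))    # alkol-ü
```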
Doubling
When a suffix that starts with a vowel is added to some words, the last letter is doubled: hat → hat-tı. If the last letter is also changed by the appended suffix, the transformed letter is repeated: ret → red-di
hat [A:Doubling] // no need to add NoVoicing because it is single syllable.
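The interaction of Doubling with voicing can be sketched as below. The function name is hypothetical, and treating ret as carrying both Voicing and Doubling is an assumption made to reproduce the ret → red-di example.

```python
VOICING = {"p": "b", "ç": "c", "t": "d", "k": "ğ", "g": "ğ"}


def stem_before_vowel(word, voicing=False, doubling=False):
    """Stem form used when a vowel-initial suffix follows."""
    stem = word
    if voicing:
        # Change the last letter first ...
        stem = stem[:-1] + VOICING.get(stem[-1], stem[-1])
    if doubling:
        # ... then repeat whatever the last letter has become.
        stem = stem + stem[-1]
    return stem


print(stem_before_vowel("hat", doubling=True))                 # hatt  (hat-tı)
print(stem_before_vowel("ret", voicing=True, doubling=True))   # redd  (red-di)
```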
LastVowelDrop
In some words, the last vowel before the final consonant drops when a suffix starting with a vowel is appended: ağız → ağz-a, burun → burn-um, zehir → zehr-e.
For some words this is optional: both omuz → omuz-a and omz-a are valid. Sometimes the meaning of the word affects the outcome, as in oğul-u and oğl-u: in the first case "oğul" means "group of bees", in the second it means "son".
Some verbs obey this rule: kavur → kavr-ul. However, it happens only with the passive suffix, not with other suffixes: kavur → kavur-acak, not kavr-acak
When a vowel is dropped, the form of the appended suffix is determined by the original form of the word, not by the form after the vowel is dropped: nakit → nakd-e, lütuf → lütf-un.
If the vowel harmony rule were applied after the vowel is dropped, the results would be nakit → nakd-a and lütuf → lütf-ün, which are incorrect.
For making a dictionary entry:
ağız [A:LastVowelDrop]
kavurmak [A:LastVowelDrop]
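The point that harmony is decided by the original word, not by the shortened stem, can be illustrated with the dative suffix. This is a sketch with hypothetical names, limited to consonant-final words. The `harmony_from_original` switch exists only to show the incorrect alternative.

```python
VOWELS = "aeıioöuü"
FRONT = "eiöü"
VOICING = {"p": "b", "ç": "c", "t": "d", "k": "ğ", "g": "ğ"}


def last_vowel(word):
    for ch in reversed(word):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel in {word!r}")


def drop_last_vowel(word):
    """Remove the vowel that sits right before the final consonant."""
    if len(word) < 3 or word[-1] in VOWELS or word[-2] not in VOWELS:
        return word
    return word[:-2] + word[-1]


def dative(word, last_vowel_drop=False, voicing=False, harmony_from_original=True):
    stem = drop_last_vowel(word) if last_vowel_drop else word
    if voicing:
        stem = stem[:-1] + VOICING.get(stem[-1], stem[-1])
    # The suffix vowel is chosen from the ORIGINAL word unless told otherwise.
    source = word if harmony_from_original else stem
    suffix = "e" if last_vowel(source) in FRONT else "a"
    return f"{stem}-{suffix}"


print(dative("ağız", last_vowel_drop=True))                   # ağz-a
print(dative("nakit", last_vowel_drop=True, voicing=True))    # nakd-e
print(dative("nakit", last_vowel_drop=True, voicing=True,
             harmony_from_original=False))                    # nakd-a (incorrect form)
```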
CompoundP3sg
This attribute marks compound words that end with the third person possessive suffix P3sg [+sI], such as aşevi, balkabağı, zeytinyağı.
These compound words already contain a suffix, so they are handled differently from other words. For example, some suffixes change the form of the root: zeytinyağı → zeytinyağ-lar-ı, atkuyruğu → atkuyruklu
Dictionary entries are slightly different: you need to specify the root forms of the words that make up the compound.
aşevi [A:CompoundP3sg; Roots:aş-ev]
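One way to see why the Roots value is needed: the plural possessive form is built from the joined roots rather than from the dictionary form. The sketch below uses hypothetical names and covers only the plural + P3sg pattern shown above. It is not how Zemberek necessarily does it.

```python
VOWELS = "aeıioöuü"


def last_vowel(word):
    for ch in reversed(word):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel in {word!r}")


def compound_plural_p3sg(roots_value):
    """Build the plural + P3sg form from a Roots value such as "aş-ev"."""
    base = "".join(roots_value.split("-"))   # "aş-ev" -> "aşev"
    if last_vowel(base) in "eiöü":
        return f"{base}-ler-i"               # front vowels
    return f"{base}-lar-ı"                   # back vowels


print(compound_plural_p3sg("aş-ev"))         # aşev-ler-i
print(compound_plural_p3sg("zeytin-yağ"))    # zeytinyağ-lar-ı
```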
Aorist_I and Aorist_A
Generally, the aorist (simple present tense) suffix has the form [Ir], as in gel-ir, bul-ur, kapat-ır. But for most single-syllable verbs and compound verbs it takes the form [Ar], as in yap-ar, yet-er, hapsed-er. There are exceptions, such as "var-ır". The two attributes below resolve this ambiguity. They do not modify the root form.
The system automatically assumes Aorist_I for multi-syllable verbs and Aorist_A for single-syllable verb roots. For example:
kapatmak // "kapat-ır"
yapmak // "yap-ar"
If a verb does not obey the rule, the relevant attribute should be written.
affetmek [A:Voicing, Aorist_A] // "affeder", not "affedir"
bilmek [A:Aorist_I] // "bilir", not "biler"
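The default choice between the two aorist attributes, and the resulting suffix, can be sketched as follows. Names are hypothetical, the root is passed without mek/mak and already in the form it takes before the suffix (for example affed), and the vowel-final case is included only for completeness.

```python
VOWELS = "aeıioöuü"
HIGH = {"a": "ı", "ı": "ı", "o": "u", "u": "u",
        "e": "i", "i": "i", "ö": "ü", "ü": "ü"}


def last_vowel(root):
    for ch in reversed(root):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel in {root!r}")


def default_aorist(root):
    """Aorist_A for single-syllable roots, Aorist_I otherwise."""
    syllables = sum(1 for ch in root if ch in VOWELS)
    return "Aorist_A" if syllables == 1 else "Aorist_I"


def aorist(root, attribute=None):
    attribute = attribute or default_aorist(root)
    if root[-1] in VOWELS:
        return f"{root}-r"                       # oku -> oku-r
    vowel = last_vowel(root)
    if attribute == "Aorist_A":
        suffix_vowel = "e" if vowel in "eiöü" else "a"
    else:
        suffix_vowel = HIGH[vowel]
    return f"{root}-{suffix_vowel}r"


print(aorist("yap"))                   # yap-ar
print(aorist("kapat"))                 # kapat-ır
print(aorist("bil"))                   # bil-er (default, wrong for this verb)
print(aorist("bil", "Aorist_I"))       # bil-ir
print(aorist("affed", "Aorist_A"))     # affed-er
```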
NoQuote
This attribute is used for formatting a word. When it is set and a suffix is added to a proper noun, no single quote is used as a separator: "Türkçenin", not "Türkçe'nin".
Ext
This is mostly a Zemberek-internal attribute. It is used for items that are not in the official TDK dictionary.
ImplicitDative
Some words implicitly contain the dative suffix. For example, içeri and içeriye mean the same.
içeri [A:ImplicitDative]
Suffixes are added to foreign words and abbreviations according to their pronunciation, as in Google'a or Facebook'un. The system cannot determine phonetic rules from the surface form of such words, so a pronunciation value should be defined for them.
Google [Pr:gugıl]
A101 [P:Abbrv; Pr:ayüzbir]
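The role of the Pr value can be shown with the dative suffix: the suffix is chosen from the pronunciation but attached to the written form. This is a sketch with hypothetical names. The apostrophe and the buffer letter y follow ordinary Turkish spelling, not a documented Zemberek routine.

```python
VOWELS = "aeıioöuü"


def last_vowel(text):
    for ch in reversed(text):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel in {text!r}")


def dative_for(surface, pronunciation=None):
    """Attach the dative suffix to `surface`, choosing it from the pronunciation."""
    spoken = pronunciation or surface
    vowel = "e" if last_vowel(spoken) in "eiöü" else "a"
    buffer = "y" if spoken[-1] in VOWELS else ""   # buffer letter after a vowel
    return f"{surface}'{buffer}{vowel}"


print(dative_for("Google"))               # Google'ye (from the spelling alone)
print(dative_for("Google", "gugıl"))      # Google'a
print(dative_for("A101", "ayüzbir"))      # A101'e
```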
For multiple pronunciations, the following workaround may work for now:
Bmw [Pr:bemeve]
Bmw [Pr:biemdabılyu; Ref:Bmw; Index:2]
Here the second Bmw refers to the first Bmw, so only that root is actually valid. The Index value is needed for the system to decide which reference value to use.
TODO: This mechanism needs to be improved.