German Stemmer - improvements when possible #200
Comments
Related to #161, but this is about different rules
Removing […]
Most of the cases here seem to boil down to not removing […]. I suspect it wasn't added to the algorithm originally because it's hard to avoid removing them in cases where it's harmful (because some words end in […]).
It'd be good to investigate - if you have any useful insights how to distinguish […]
If it really isn't practical to remove these then it'd be good to document that. The older algorithm descriptions tend to just describe the mechanics of the algorithm without giving much (if any) background as to why choices were made.
Oh well, it's hard. My best idea was to consider character pairs like kt, lt, ßt instead of a single t or d. But that doesn't lead to safe general rules either. The German language probably needs some more reforms before stemming like in English could work. Probably never.
Restricting removal based on what's before the suffix is quite a common solution (and removing a suffix in a subset of cases can still be worthwhile). I'll take a deeper look, and document if it doesn't seem solvable.
I wouldn't say English is particularly easy to stem - it doesn't have as many inflected forms as some other languages, but it has a large vocabulary much of which has been taken from multiple other languages, so there's rather a lot of irregularity to deal with. There are definitely endings in English we don't try to deal with either (e.g. see #172). A stemmer can be useful without handling every possible word perfectly though, and overstemming tends to be more problematic because it can result in a search term matching an unrelated word.
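To make the "restrict removal based on what's before the suffix" idea concrete, here is a minimal Python sketch. It is not the Snowball German stemmer and not a proposed rule: the allowed letters are simply taken from the pairs kt, lt, ßt mentioned earlier in the thread, and the names `ALLOWED_BEFORE_T`, `MIN_STEM_LEN` and `strip_final_t` are hypothetical. It also works on the raw word, whereas the stems listed in this issue show ß written as ss.

```python
# Hypothetical illustration only: strip a final "t" when the letter before it
# is in a small allow-list. The allow-list comes from the pairs "kt", "lt",
# "ßt" mentioned in the thread; it is an assumption, not a tested rule.
ALLOWED_BEFORE_T = frozenset("klß")
MIN_STEM_LEN = 3  # assumed lower bound on what must remain after removal


def strip_final_t(word: str) -> str:
    """Return word without its final 't' if the preceding letter allows it."""
    if (
        word.endswith("t")
        and len(word) - 1 >= MIN_STEM_LEN
        and word[-2] in ALLOWED_BEFORE_T
    ):
        return word[:-1]
    return word


if __name__ == "__main__":
    for w in ("schenkt", "holt", "schließt", "vorbereitet", "kalt"):
        print(w, "->", strip_final_t(w))
```

With this allow-list, schenkt, holt and schließt lose their final t and vorbereitet is left alone, but kalt (an adjective, not a verb form) is also cut to kal. That is the kind of overstemming that makes a safe general rule hard to find.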
Oh my gosh, you are right. In my naive mind, English was still the English in which, 20 years ago or so, I could pluralize and singularize words with just a bunch of simple rules (and yes,
The following words should (ideally, if possible) produce the same "stemmed word":
schließen -> schliess
schließt -> schliesst
schließend -> schliessend
holen -> hol
holt -> holt
vorbereiten -> vorbereit
vorbereitet -> vorbereitet
vorbereitend -> vorbereit
schenken -> schenk
schenkt -> schenkt
schenkte -> schenkt
schenkten -> schenkt
schenkend -> schenkend
schenkender -> schenkend
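One way to read the list above is to group the words by verb and count how many distinct stems each group ends up with. The Python sketch below does that using only the word/stem pairs from the list. The grouping into four families (`FAMILIES`) is my assumption about which words are meant to conflate, and `stems_by_family` is a hypothetical helper name.

```python
# Word -> stem pairs copied from the list in this issue.
OBSERVED = {
    "schließen": "schliess",
    "schließt": "schliesst",
    "schließend": "schliessend",
    "holen": "hol",
    "holt": "holt",
    "vorbereiten": "vorbereit",
    "vorbereitet": "vorbereitet",
    "vorbereitend": "vorbereit",
    "schenken": "schenk",
    "schenkt": "schenkt",
    "schenkte": "schenkt",
    "schenkten": "schenkt",
    "schenkend": "schenkend",
    "schenkender": "schenkend",
}

# Assumed grouping: each family should ideally share a single stem.
FAMILIES = {
    "schließen": ["schließen", "schließt", "schließend"],
    "holen": ["holen", "holt"],
    "vorbereiten": ["vorbereiten", "vorbereitet", "vorbereitend"],
    "schenken": ["schenken", "schenkt", "schenkte", "schenkten",
                 "schenkend", "schenkender"],
}


def stems_by_family(observed, families):
    """Map each family head to the sorted list of distinct stems it produces."""
    return {
        head: sorted({observed[word] for word in words})
        for head, words in families.items()
    }


if __name__ == "__main__":
    for head, stems in stems_by_family(OBSERVED, FAMILIES).items():
        status = "ok" if len(stems) == 1 else "split"
        print(f"{head}: {status} ({len(stems)} stems) {stems}")
```

None of the four families collapses to a single stem with the pairs as listed, so a check like this could be rerun against any candidate rule change to see which groups it merges.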