weird difference between hfst and lttoolbox transducers #3

Open
jonorthwash opened this issue Dec 23, 2018 · 3 comments
Comments

@jonorthwash
Member

jonorthwash commented Dec 23, 2018

The analysis of "איז" is different in the hfst and lttoolbox transducers:

$ echo "איז" | hfst-proc yid.automorf.hfst 
^איז/זײַן<v><pres><p3><sg>$

$ echo "איז" | lt-proc yid.automorf.bin
^איז/ז<>ן<v><pres><p3><sg>$

Something similar happens with "אַ":

$ echo "אַ" | hfst-proc yid.automorf.hfst
^אַ/אַ<det><sg>$

$ echo "אַ" | lt-proc yid.automorf.bin 
^א/<><det><sg>$ַ

Any thoughts on what might be going on, @ftyers or @flammie?

@jonorthwash
Member Author

Another thing that's different is the following:

$ echo "זיי" | hfst-proc yid.automorf.hfst 
!! Warning: Transducer contains one or more multi-character symbols made up of
ASCII characters which are also available as single-character symbols. The
input stream will always be tokenised using the longest symbols available.
Use the -t option to view the tokenisation. The problematic symbol(s):
וו יי
^זיי/זײ<prn><pers><p3><pl><acc>/זײ<prn><pers><p3><pl><dat>/זײ<prn><pers><p3><pl><nom>/זײַן<v><imp><sg>$

$ echo "זיי" | lt-proc yid.automorf.bin
^זיי/*זיי$
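
The hfst-proc warning above says the input is always tokenised using the longest symbols available. A minimal sketch of that kind of greedy longest-match tokenisation follows. It is only an illustration, not hfst-proc's actual code: the function name `tokenise` is hypothetical, and the symbol set assumes the two problematic symbols are double vav (U+05D5 U+05D5) and double yod (U+05D9 U+05D9).

```python
def tokenise(text: str, multichar: set[str]) -> list[str]:
    """Greedy longest-match tokenisation (illustrative sketch only)."""
    # Longest multi-character symbol we might need to try at each position.
    longest = max((len(s) for s in multichar), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the widest candidate first, falling back to a single codepoint.
        for width in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + width]
            if width == 1 or candidate in multichar:
                tokens.append(candidate)
                i += width
                break
    return tokens


# Assumed symbol set: double vav and double yod, as listed in the warning.
MULTICHAR = {"\u05d5\u05d5", "\u05d9\u05d9"}

# Zayin followed by two yods: the two yods are consumed as one symbol.
print(tokenise("\u05d6\u05d9\u05d9", MULTICHAR))
```

Under these assumptions the input is split into zayin plus a single double-yod symbol rather than three separate letters, which may be relevant to why the two tools disagree on this word.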

@jonorthwash
Member Author

@unhammer, regarding the extra <>: it seems to be related to spellrelax, since the <> replaces letters that spellrelax allows in those contexts, e.g. here:

^האט/ה<>בן<v><imp><pl>/ה<>בן<vaux><pres><p2><pl>/ה<>בן<vaux><pres><p3><sg>/ה<>בן<v><pres><p3><sg>/ה<>בן<v><pres><p2><pl>$

@mr-martian
Contributor

[17:27:56] <popcorndude> oh, this is so dumb
[17:28:03] <firespeaker> oh?
[17:28:23] <popcorndude> lt-comp stores tags without the <>
[17:28:30] <firespeaker> wat?
[17:28:37] <popcorndude> for space reasons
[17:28:59] <popcorndude> but that leads to the assumption that any multichar symbol in .bin is a tag
[17:29:12] <popcorndude> lt-comp on a .att will take off the initial and final <>
[17:29:19] <popcorndude> and lt-proc will put them back on
[17:30:11] <popcorndude> so output of <> means you had a 2-codepoint symbol; lt-comp took off the < and > without checking that they actually were < and >, leaving a symbol of length 0
[17:30:25] <popcorndude> then when outputting that symbol, lt-proc added back the < and >
[17:30:29] <popcorndude> leaving <>
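
The round trip described in the log can be modelled in a few lines. This is a rough sketch under stated assumptions, not lttoolbox's actual code: `StoredSymbol`, `store`, `emit`, and `store_checked` are hypothetical names, and the test symbol assumes "אַ" is encoded as alef (U+05D0) plus patah (U+05B7).

```python
from typing import NamedTuple


class StoredSymbol(NamedTuple):
    is_tag: bool  # whether the compiled form is treated as a tag
    text: str     # what is actually stored


def store(symbol: str) -> StoredSymbol:
    """Model of the behaviour described: assume every multichar symbol is a tag."""
    if len(symbol) == 1:
        return StoredSymbol(False, symbol)
    # Unchecked assumption: the first and last codepoints are '<' and '>'.
    return StoredSymbol(True, symbol[1:-1])


def emit(stored: StoredSymbol) -> str:
    """Model of output: put the angle brackets back on anything stored as a tag."""
    return f"<{stored.text}>" if stored.is_tag else stored.text


def store_checked(symbol: str) -> StoredSymbol:
    """One possible guard (an assumption, not a known fix): strip only real tags."""
    if len(symbol) > 1 and symbol.startswith("<") and symbol.endswith(">"):
        return StoredSymbol(True, symbol[1:-1])
    return StoredSymbol(False, symbol)


ALEF_PATAH = "\u05d0\u05b7"  # assumed two-codepoint encoding of the symbol

print(emit(store("<det>")))                           # real tag survives
print(emit(store(ALEF_PATAH)))                        # two-codepoint symbol collapses
print(emit(store_checked(ALEF_PATAH)) == ALEF_PATAH)  # guarded version keeps it
```

In this model a genuine tag such as `<det>` survives the round trip, while any two-codepoint symbol is reduced to an empty string and comes back out as `<>`, matching the outputs reported earlier in the thread.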
