Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix m-070 and add tests #732

Merged
merged 5 commits into from
Jul 25, 2024
Merged

Fix m-070 and add tests #732

merged 5 commits into from
Jul 25, 2024

Conversation

apasel422
Copy link
Contributor

@apasel422 apasel422 commented Jul 13, 2024

I ran the updated lint against the corpus entries that contain a glossary, and it revealed a large number of unused entries. Some of these are typos (e.g. Ygdrasil as the glossary term vs Ygdrasill appearing in the text of Bulfinch's Mythology), while others are a singular/plural form that doesn't appear in that inflected form in the text (e.g. trilobite vs trilobites in The Origin of Species).

Fixes #731

- Treat glossary values as string literals, not patterns
- Search for glossary values surrounded by word boundaries
@apasel422 apasel422 marked this pull request as ready for review July 13, 2024 21:00
@apasel422
Copy link
Contributor Author

It's unclear to me whether the inflected forms that aren't present in the text should be present in the search-key-map, but presumably inflected forms that are present in the text should be present in the search-key-map.

@acabal
Copy link
Member

acabal commented Jul 15, 2024

I think there's still an error here somewhere, because in in your output for Morte d'Arthur it says unsiker is not found, but it does indeed exist in the text in chapter-17-23. It's directly followed by an endnote which maybe is why there's an issue. Likewise for Bulfinch, it says pariah not found, but that word does exist in the text.

It's unclear to me whether the inflected forms that aren't present in the text should be present in the search-key-map, but presumably inflected forms that are present in the text should be present in the search-key-map.

Correct on both counts.

@apasel422
Copy link
Contributor Author

It's directly followed by an endnote which maybe is why there's an issue.

Good find. The issue is that the call to inner_text ends up stripping the intervening <a>, resulting in text like unsiker245, which of course fails the (?!\w) portion of the regex. Thoughts on how to deal with this?

@acabal
Copy link
Member

acabal commented Jul 15, 2024

You could always clone the dom and strip out all endnotes before testing

@apasel422
Copy link
Contributor Author

I've fixed the noteref handling and published an updated gist of the corpus findings.

@acabal
Copy link
Member

acabal commented Jul 15, 2024

Where's the new gist?

@apasel422
Copy link
Contributor Author

Sorry, updated at the same URL: https://gist.github.com/apasel422/0fcaebed784fbc7fa45282cb1bbbe037.

@acabal
Copy link
Member

acabal commented Jul 15, 2024

Bulfich still matches pariah in that gist, which is a wrong match. Can you review this output and confirm that it's actually correct in all cases?

@apasel422
Copy link
Contributor Author

Bulfich still matches pariah in that gist, which is a wrong match. Can you review this output and confirm that it's actually correct in all cases?

https://github.com/search?q=repo%3Astandardebooks%2Fthomas-bulfinch_bulfinchs-mythology%20pariah&type=code shows that it only appears in the plural.

@acabal
Copy link
Member

acabal commented Jul 15, 2024

I suppose the confusion here is that while pariah doesn't appear as a lone word in the text, pariahs does, and if one selects just the pariah portion of pariahs then we should still get a glossary result.

Additionally, the glossary spec has this to say about matches with multiple <value>s:

When the match element has a search key defined in its value attribute, this search key also represents the canonical form (i.e. lemma) of the headword indexed by the match element. If present, all of the value children may be considered specific forms (e.g. inflected forms) of that canonical form. This information can be used by Reading Systems to improve storage or performance (e.g. by leveraging lemmatization rules).

So I understand the spec to mean that if a <match> has multiple <value>s that do occur in the text, then the <match> should have a logical stem @value even if the @value itself doesn't occur in the text. This groups related words into one stem (even if the stem per se doesn't exist in the text) and I suppose that's useful for reading systems. This occurs for example in Bulfinch with celt and perhaps others.

I haven't looked at the glossary spec in some time. I think it would be worth it to review it to make sure we're not removing useful information here, even if a bounded string doesn't strictly occur in the text.

@apasel422
Copy link
Contributor Author

  • Hudibras
    • measle only appears as measled and measles
  • A Yankee in the Trenches
    • entrenching tool only appears in the plural
    • funk hole only appears in the plural
    • identification disc only appears in the plural
    • medal is listed in glossary-search-key-map but only appears as medals and is actually labeled as military medal in glossary.xhtml
    • sandbag only appears in the plural

So I understand the spec to mean that if a <match> has multiple <value>s that do occur in the text, then the <match> should have a logical stem @value even if the @value itself doesn't occur in the text. This groups related words into one stem (even if the stem per se doesn't exist in the text) and I suppose that's useful for reading systems. This occurs for example in Bulfinch with celt and perhaps others.

This sounds reasonable, but since we don't have a way to determine whether a particular glossary value is the stem value, perhaps the right thing to do would be to ensure that at least one of the values for a term does appear, rather than require that all of its values do.

As is, we shouldn't be assuming that a stem is a prefix of all of its inflected terms, and the current behavior that causes food to match foo is clearly wrong in that regard.

@acabal
Copy link
Member

acabal commented Jul 15, 2024

Maybe it's enough to simply check child <value> of <match> and not <match @value> if it has children. We can simply assume that the producer got the stem right even if it doesn't appear in the text. If <match> does not have children, then @value must appear in the text otherwise that's clearly an error.

@apasel422
Copy link
Contributor Author

If <match> does not have children, then @value must appear in the text otherwise that's clearly an error.

With this, a few more entries are removed (see diff).

Some notes:

  • In Lavengro, chabe, which is in a <match>, does not appear in the text, only its plural. Presumably that inflected form should be added as a <value>.
  • In Origin of Species, dicotyledonous plant, which is in a <value>, does not appear in the text, only its plural. If I'm reading the glossary spec right, it might be more precise to have separate <match value="dicotyledonous"/> and <match value="dicotyledonous plant"><value value="dicotyledonous plants"/></match> under the same <search-key-group>, as I'd argue that dicotyledonous is not the canonical form of dicotyledonous plant.

@apasel422
Copy link
Contributor Author

apasel422 commented Jul 15, 2024

More notes:

  • Rhapsodist and cimmeria do not appear in Mythology at all
  • etna appears in the glossary XML, but it is spelled Aetna in Mythology's text
  • As previously stated, Ygdrasil has two Ls in Mythology's text
  • Aeaea appears as an adjective (with a trailing n) in the text, so it should almost certainly be present as a <value> in the glossary

@acabal
Copy link
Member

acabal commented Jul 16, 2024

This update fixes pariah but that type of error still appears, for example with plutonic rock in Darwin. If you look at that case, the key map seems valid to me - the stem is plutonic, and the values are plutonic rock (a possible user selection from the longer string plutonic rocks) and plutonic rocks. All of those three strings could be selected by the user and we would want to show a glossary entry.

So how can we fix this test so this valid case doesn't get caught?

@apasel422
Copy link
Contributor Author

apasel422 commented Jul 16, 2024

All of those three strings could be selected by the user and we would want to show a glossary entry.

Maybe I'm misunderstanding the glossary spec, but wouldn't such an entry be more precisely represented as:

<search-key-group href="text/glossary.xhtml#plutonic-rocks">
  <match value="plutonic"/>
  <match value="plutonic rock">
    <value value="plutonic rocks"/>
  </match>
</search-key-group>

With that formulation, I think the current PR would avoid emitting any errors for this case. For example, this seems similar to https://idpf.org/epub/dict/epub-dict-20140204.html#id.7fhcrrky2qjt.

However, as I suggested above, if you want to avoid having to change that <search-key-group> to satisfy the lint, then we could consider ensuring that at least one of the values appears, rather than all of them. To me this makes sense since we have no way of enumerating all possible derived terms of the stem, or even a way to determine if a given string is associated with a particular stem (c.f. foo and food). In other words, we could trust the producer that all the values are things the user might want to search for.

@apasel422
Copy link
Contributor Author

I suppose another option would be to continue requiring all values to appear, but to tolerate a trailing e?s instead of absolutely requiring a word boundary. That could still lead to some false positives, but is perhaps unlikely with the current corpus. It's a bit of a hack, but might be good enough. I can try to see what the results look like later today.

@acabal
Copy link
Member

acabal commented Jul 16, 2024

OK, yes, I agree with your proposed change to the search key map.

What we should do next is go through the remaining errors this update highlights, and confirm that we can fix them in a similar way that makes sense.

As we do that, we can submit PRs to fix them. If we encounter a surprise edge case then we can revisit the test.

If we're able to fix the entire corpus with no surprises then we can merge this in. Do you have time to do that?

@apasel422
Copy link
Contributor Author

What we should do next is go through the remaining errors this update highlights, and confirm that we can fix them in a similar way that makes sense.

As we do that, we can submit PRs to fix them. If we encounter a surprise edge case then we can revisit the test.

If we're able to fix the entire corpus with no surprises then we can merge this in. Do you have time to do that?

Will do. I'll post an updated list of speculative changes later, perhaps with example PRs against the affected repos.

@apasel422 apasel422 deleted the m-070 branch July 25, 2024 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

m-070 (glossary entry search) behaves unexpectedly
2 participants