Fix m-070 and add tests #732

apasel422 · 2024-07-13T20:53:08Z

I ran the updated lint against the corpus entries that contain a glossary, and it revealed a large number of unused entries. Some of these are typos (e.g. Ygdrasil as the glossary term vs Ygdrasill appearing in the text of Bulfinch's Mythology), while others are a singular/plural form that doesn't appear in that inflected form in the text (e.g. trilobite vs trilobites in The Origin of Species).

Fixes #731

- Treat glossary values as string literals, not patterns - Search for glossary values surrounded by word boundaries

apasel422 · 2024-07-14T17:37:10Z

It's unclear to me whether the inflected forms that aren't present in the text should be present in the search-key-map, but presumably inflected forms that are present in the text should be present in the search-key-map.

acabal · 2024-07-15T19:53:09Z

I think there's still an error here somewhere, because in in your output for Morte d'Arthur it says unsiker is not found, but it does indeed exist in the text in chapter-17-23. It's directly followed by an endnote which maybe is why there's an issue. Likewise for Bulfinch, it says pariah not found, but that word does exist in the text.

It's unclear to me whether the inflected forms that aren't present in the text should be present in the search-key-map, but presumably inflected forms that are present in the text should be present in the search-key-map.

Correct on both counts.

apasel422 · 2024-07-15T20:02:49Z

It's directly followed by an endnote which maybe is why there's an issue.

Good find. The issue is that the call to inner_text ends up stripping the intervening <a>, resulting in text like unsiker245, which of course fails the (?!\w) portion of the regex. Thoughts on how to deal with this?

acabal · 2024-07-15T20:04:09Z

You could always clone the dom and strip out all endnotes before testing

apasel422 · 2024-07-15T20:23:21Z

I've fixed the noteref handling and published an updated gist of the corpus findings.

acabal · 2024-07-15T20:32:26Z

Where's the new gist?

apasel422 · 2024-07-15T20:33:04Z

Sorry, updated at the same URL: https://gist.github.com/apasel422/0fcaebed784fbc7fa45282cb1bbbe037.

acabal · 2024-07-15T20:37:00Z

Bulfich still matches pariah in that gist, which is a wrong match. Can you review this output and confirm that it's actually correct in all cases?

apasel422 · 2024-07-15T20:38:24Z

Bulfich still matches pariah in that gist, which is a wrong match. Can you review this output and confirm that it's actually correct in all cases?

https://github.com/search?q=repo%3Astandardebooks%2Fthomas-bulfinch_bulfinchs-mythology%20pariah&type=code shows that it only appears in the plural.

acabal · 2024-07-15T20:46:17Z

I suppose the confusion here is that while pariah doesn't appear as a lone word in the text, pariahs does, and if one selects just the pariah portion of pariahs then we should still get a glossary result.

Additionally, the glossary spec has this to say about matches with multiple <value>s:

When the match element has a search key defined in its value attribute, this search key also represents the canonical form (i.e. lemma) of the headword indexed by the match element. If present, all of the value children may be considered specific forms (e.g. inflected forms) of that canonical form. This information can be used by Reading Systems to improve storage or performance (e.g. by leveraging lemmatization rules).

So I understand the spec to mean that if a <match> has multiple <value>s that do occur in the text, then the <match> should have a logical stem @value even if the @value itself doesn't occur in the text. This groups related words into one stem (even if the stem per se doesn't exist in the text) and I suppose that's useful for reading systems. This occurs for example in Bulfinch with celt and perhaps others.

I haven't looked at the glossary spec in some time. I think it would be worth it to review it to make sure we're not removing useful information here, even if a bounded string doesn't strictly occur in the text.

apasel422 · 2024-07-15T20:51:56Z

Hudibras
- measle only appears as measled and measles
A Yankee in the Trenches
- entrenching tool only appears in the plural
- funk hole only appears in the plural
- identification disc only appears in the plural
- medal is listed in glossary-search-key-map but only appears as medals and is actually labeled as military medal in glossary.xhtml
- sandbag only appears in the plural

So I understand the spec to mean that if a <match> has multiple <value>s that do occur in the text, then the <match> should have a logical stem @value even if the @value itself doesn't occur in the text. This groups related words into one stem (even if the stem per se doesn't exist in the text) and I suppose that's useful for reading systems. This occurs for example in Bulfinch with celt and perhaps others.

This sounds reasonable, but since we don't have a way to determine whether a particular glossary value is the stem value, perhaps the right thing to do would be to ensure that at least one of the values for a term does appear, rather than require that all of its values do.

As is, we shouldn't be assuming that a stem is a prefix of all of its inflected terms, and the current behavior that causes food to match foo is clearly wrong in that regard.

acabal · 2024-07-15T20:55:00Z

Maybe it's enough to simply check child <value> of <match> and not <match @value> if it has children. We can simply assume that the producer got the stem right even if it doesn't appear in the text. If <match> does not have children, then @value must appear in the text otherwise that's clearly an error.

apasel422 · 2024-07-15T21:21:54Z

If <match> does not have children, then @value must appear in the text otherwise that's clearly an error.

With this, a few more entries are removed (see diff).

Some notes:

In Lavengro, chabe, which is in a <match>, does not appear in the text, only its plural. Presumably that inflected form should be added as a <value>.
In Origin of Species, dicotyledonous plant, which is in a <value>, does not appear in the text, only its plural. If I'm reading the glossary spec right, it might be more precise to have separate <match value="dicotyledonous"/> and <match value="dicotyledonous plant"><value value="dicotyledonous plants"/></match> under the same <search-key-group>, as I'd argue that dicotyledonous is not the canonical form of dicotyledonous plant.

apasel422 · 2024-07-15T21:26:50Z

More notes:

~~Rhapsodist and cimmeria do not appear in Mythology at all~~
etna appears in the glossary XML, but it is spelled Aetna in Mythology's text
As previously stated, Ygdrasil has two Ls in Mythology's text
Aeaea appears as an adjective (with a trailing n) in the text, so it should almost certainly be present as a <value> in the glossary

acabal · 2024-07-16T17:16:16Z

This update fixes pariah but that type of error still appears, for example with plutonic rock in Darwin. If you look at that case, the key map seems valid to me - the stem is plutonic, and the values are plutonic rock (a possible user selection from the longer string plutonic rocks) and plutonic rocks. All of those three strings could be selected by the user and we would want to show a glossary entry.

So how can we fix this test so this valid case doesn't get caught?

apasel422 · 2024-07-16T17:28:05Z

All of those three strings could be selected by the user and we would want to show a glossary entry.

Maybe I'm misunderstanding the glossary spec, but wouldn't such an entry be more precisely represented as:

<search-key-group href="text/glossary.xhtml#plutonic-rocks">
  <match value="plutonic"/>
  <match value="plutonic rock">
    <value value="plutonic rocks"/>
  </match>
</search-key-group>

With that formulation, I think the current PR would avoid emitting any errors for this case. For example, this seems similar to https://idpf.org/epub/dict/epub-dict-20140204.html#id.7fhcrrky2qjt.

However, as I suggested above, if you want to avoid having to change that <search-key-group> to satisfy the lint, then we could consider ensuring that at least one of the values appears, rather than all of them. To me this makes sense since we have no way of enumerating all possible derived terms of the stem, or even a way to determine if a given string is associated with a particular stem (c.f. foo and food). In other words, we could trust the producer that all the values are things the user might want to search for.

apasel422 · 2024-07-16T17:46:16Z

I suppose another option would be to continue requiring all values to appear, but to tolerate a trailing e?s instead of absolutely requiring a word boundary. That could still lead to some false positives, but is perhaps unlikely with the current corpus. It's a bit of a hack, but might be good enough. I can try to see what the results look like later today.

acabal · 2024-07-16T18:48:08Z

OK, yes, I agree with your proposed change to the search key map.

What we should do next is go through the remaining errors this update highlights, and confirm that we can fix them in a similar way that makes sense.

As we do that, we can submit PRs to fix them. If we encounter a surprise edge case then we can revisit the test.

If we're able to fix the entire corpus with no surprises then we can merge this in. Do you have time to do that?

apasel422 · 2024-07-16T18:50:18Z

What we should do next is go through the remaining errors this update highlights, and confirm that we can fix them in a similar way that makes sense.

As we do that, we can submit PRs to fix them. If we encounter a surprise edge case then we can revisit the test.

If we're able to fix the entire corpus with no surprises then we can merge this in. Do you have time to do that?

Will do. I'll post an updated list of speculative changes later, perhaps with example PRs against the affected repos.

apasel422 added 2 commits July 13, 2024 16:43

Add test demonstrating erroneous behavior of m-070

21dd6d5

Fix m-070

ab2b45e

- Treat glossary values as string literals, not patterns - Search for glossary values surrounded by word boundaries

apasel422 marked this pull request as ready for review July 13, 2024 21:00

Replace word boundaries with negative assertions

1f78ebb

Properly handle noterefs in m-070

9545297

Vary m-070 searches based on whether <match> has children

b089973

acabal merged commit 90bd736 into standardebooks:master Jul 25, 2024
1 check passed

apasel422 deleted the m-070 branch July 25, 2024 20:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix m-070 and add tests #732

Fix m-070 and add tests #732

apasel422 commented Jul 13, 2024 •

edited

Loading

apasel422 commented Jul 14, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

apasel422 commented Jul 15, 2024 •

edited

Loading

acabal commented Jul 16, 2024 •

edited

Loading

apasel422 commented Jul 16, 2024 •

edited

Loading

apasel422 commented Jul 16, 2024

acabal commented Jul 16, 2024

apasel422 commented Jul 16, 2024

Fix m-070 and add tests #732

Fix m-070 and add tests #732

Conversation

apasel422 commented Jul 13, 2024 • edited Loading

apasel422 commented Jul 14, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

acabal commented Jul 15, 2024

apasel422 commented Jul 15, 2024

apasel422 commented Jul 15, 2024 • edited Loading

acabal commented Jul 16, 2024 • edited Loading

apasel422 commented Jul 16, 2024 • edited Loading

apasel422 commented Jul 16, 2024

acabal commented Jul 16, 2024

apasel422 commented Jul 16, 2024

apasel422 commented Jul 13, 2024 •

edited

Loading

apasel422 commented Jul 15, 2024 •

edited

Loading

acabal commented Jul 16, 2024 •

edited

Loading

apasel422 commented Jul 16, 2024 •

edited

Loading