Improve y-029/y-031/y-032, add y-031 thru y-033 tests #701

Merged
merged 1 commit into from
May 28, 2024

Conversation

Contributor

vr8hub commented May 26, 2024

With this, the tests for lint typos (y-xxx) are complete. At this rate we should be done with all of the lint tests around 2027. :) I think I'm going to take a little break from tests and try to actually read something, or at least work on something to read.

Changes to lint:
y-029: since y-032 already checks for text next to italics with epub:type, y-029 doesn't need to. I added an exclusion for italics with epub:type.
y-031 (all on second re:test)

  1. It was checking for whitespace followed by one of he/she/I, but there was a leading \b on each of the he/she/I. Since we're checking for whitespace immediately before, the \b isn't needed (the whitespace already guarantees it).
  2. The 'I' didn't have a \b after it, so it was matching anything that began with I, rather than I itself. I added the '\b'.
  3. The beginning of the regex excluded period, but not other sentence-terminating punctuation, e.g. ! and ?. I added them.

y-032: As mentioned in an issue, this tests for an italic with epub:type immediately followed by text, but it excludes (e|es|er). However, again, there is no \b following, so it actually excludes anything starting with any of those three. Although that may have been the intention, I thought it unlikely, so I went ahead and added the \b. If that's wrong, it's easy enough to remove.
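
To make the y-031 points concrete, here is a minimal Python sketch. The two patterns are hypothetical stand-ins, not the actual lint regexes (those aren't quoted in this thread); they model only the leading punctuation class and the he/she/I alternation.

```python
import re

# Hypothetical stand-ins for the old and new shape of the second y-031 regex.
# OLD: only "." is excluded up front, each word has a leading \b,
#      and "I" has no trailing \b.
# NEW: "!" and "?" are excluded too, the redundant leading \b is gone,
#      and a single trailing \b covers all three words.
OLD = re.compile(r"[^.]\s(?:\bhe\b|\bshe\b|\bI)")
NEW = re.compile(r"[^.!?]\s(?:he|she|I)\b")


def flagged(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


samples = [
    "and then I left",   # a real standalone "I"
    "and then It left",  # a word that merely begins with "I"
    "Stop! I mean it",   # "I" after sentence-ending "!"
    "said he would",     # leading \b makes no difference here
]

for text in samples:
    print(f"{text!r}: old={flagged(OLD, text)} new={flagged(NEW, text)}")
```

In this sketch the old shape flags "It" and the "I" after "!", while the new shape flags neither; both agree on the first and last samples, which is why dropping the leading \b is safe.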

I ran this lint on the corpus before submitting. There were no new false positives, and the changes to y-031 will eliminate six ignore entries.

Member

acabal commented May 27, 2024

Great work! Can we merge y-029 and y-032? They do very similar things, would it make sense to just make it one test?

Contributor Author

vr8hub commented May 27, 2024

I had an issue three-fourths written to suggest exactly that, and changed my mind, because I didn't know if it was going to be OK to be either more liberal in exclusions (like y-032) and possibly miss more, or more restrictive (like y-029) and possibly have more false positives.

So, I agree they can/should be combined. Let me do some testing and I'll get back with some data on which direction we might want to go.

Contributor Author

vr8hub commented May 28, 2024

OK, I remember now; I actually did a little bit of testing, and that was the real reason I aborted my proposal. Too many things have happened since then.

Option 1: combining both into the more restrictive y-032 (which looks for text both before and after, while y-029 only checks after, and only a single letter) doesn't work for <em>; there are numerous instances of emphasizing part of a compound word, e.g. chairman, or imperfect, etc. Far too many of those are valid to make it worth it.

Option 2: keeping y-029 as it is but only for em, and using y-032 for any i, not just epub:type i's. If we remove the @epub:type requirement from y-032, then we get another good-sized group of false positives, but a number of them are either <i>n</i>th or <i>something</i>ing and at least one <i>something</i>ed; it would be reasonable to add ed|ing to the e|es|er exclusion list, and we could either add th to the list generally, or exclude <i>n</i>th specifically. I did that (added ed|ing|th) and reran lint on the corpus.
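
For reference, a rough Python sketch of what widening the exclusion list does. The pattern is a hypothetical simplification, not the real y-032 regex: it only looks at the text right after a closing </i>, ignores the text before the italic and the epub:type question, and assumes the exclusion is written as a negative lookahead.

```python
import re


def build_check(suffixes):
    """Flag a lowercase letter directly after </i>, unless the trailing
    text is exactly one of the allowed suffixes (hence the \\b)."""
    return re.compile(rf"</i>(?!(?:{suffixes})\b)[a-z]")


# Hypothetical names: the list as merged, and the list tried in this experiment.
MERGED = build_check("e|es|er")
TRIAL = build_check("e|es|er|ed|ing|th")

samples = [
    "for the <i>n</i>th time",
    "<i>something</i>ing",
    "<i>something</i>ed",
    "<i>terre-à-terre</i>ishly",
    "<i>Daddy</i>ship",
]

for text in samples:
    merged_hit = MERGED.search(text) is not None
    trial_hit = TRIAL.search(text) is not None
    print(f"{text!r}: merged={merged_hit} trial={trial_hit}")
```

Under these assumptions the first three samples stop being flagged once ed|ing|th are added, while suffixes like "ishly" and "ship" are still flagged, which is the kind of residue listed below.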

New false positives (on a couple of them I don't know whether they're OK or not):
Those Barren Leaves has terre-à-terreishly.
Machen's Short Fiction has an inscription with a few instances of italics in the middle of words.
Poe's Short Fiction has an instance of Daddyship, where Daddy is a magazine.
The Middle Five has a couple of occurrences of n appearing in the middle of a word; I don't know if that's correct or not.
A Cycle of the West has <i>Sh–sh–</i>for men…; if the second em-dash was outside of the italics, it would be fine. (Not sure of the semantically correct thing there.)
Ten Days That Shook the World has a couple of instances of language tagged italics immediately followed by an endnote reference.
Connecticut Yankee has a number of intentional italics of single letters in the middle of words, to match scans.
The Kural has a number of instances of j in the middle of a word. I don't know if that's correct or not.

New errors caught:
In Search of the Castaways, Tristram Shandy, and M. R. James's Short Fiction have errors that should be corrected.

So, up to you if you think the tradeoff is worth it. If so, let me know and I'll make the change.

Member

acabal commented May 28, 2024

Meh, so many false positives just to catch three new errors isn't worth it. Let's leave it then. But can you fix the ones that did come up with PRs?

@acabal acabal merged commit 46acd2c into standardebooks:master May 28, 2024
1 check passed
Contributor Author

vr8hub commented May 28, 2024

Yep, I agree.
Yes, I'll do that. I need to get rid of some ignore entries left over from the t-074 improvement as well. Do you want PRs? I can just fix them myself and save you the hassle.

Member

acabal commented May 28, 2024

Yes you can just go ahead and fix them yourself, thanks!

Contributor Author

vr8hub commented May 28, 2024

Done.
