
Lint restructure #595

Closed
vr8hub opened this issue Jul 22, 2023 · 14 comments

@vr8hub
Contributor

vr8hub commented Jul 22, 2023

This is just to open the discussion, see what thoughts have already been made, etc. No hurry on discussion/answers, just as anyone has time.

I took a pass through the main lint function, and highlighted some (to me) notable chunks. The parentheses are line numbers. There are gaps for small tests, etc.; I was just trying to note the larger code blocks.

Questions/thoughts:

  1. Can we break out some of the big chunks into separate (lint) functions, e.g. the pre-loop CSS checks, the main chunk of metadata checks, etc.?
  2. Can we separate handling of the "special" files, which don't need all (maybe some of?) the checks the standard book text files do, e.g. colophon/imprint/[half]titlepage/etc.
  3. For the main chunk (1380-3121), can we "group" like tests and, depending on size, possibly break off the larger ones into separate (lint) functions? (Alex has already proposed breaking out the t-042's into their own se function, separate from lint.)

If we think the above is both feasible and desirable, then I'll start looking into more detail what it will take.

Alex, do you still want to break out the t-042's, and if so have you started on it yet? If not, do you want me to start looking at that first?

The main lint function spans lines 609–3328.

* Parse ignore file (648–686)
* Main CSS checks (704–807)
    * Read and parse local.css, store list of selectors
    * Do several CSS checks
* Metadata checks (808–1162)
    * Illegal/missing files (808–830)
    * Long description checks (841–902)
* Build section tree (1164–1216)
* Iterate over individual files (1218–3121)
    * Handle images specially
    * Read file
    * Handle .svg separately (1299–1365)
    * Handle .xml (glossary) separately (1367–1378)
    * .xhtml checks (1380–3121)
        * special files: colophon/imprint (1421–1530)
        * t-042 typos (1792–2001); there are a couple outside of this block
        * image alt attributes (2788–2836)
        * special file: endnote checks (2974–3004)
        * special file: imprint checks (3008–3047)
* Reiterate, checking that all IDs are used (3150–3184)
* File/directory not URL safe (3186–3221)
* Heading checks (3227–3243)
* Image accessibility (3251–3271)
* Handle ignores (3298–3324)
@acabal
Member

acabal commented Jul 23, 2023

I don't have strong opinions on how to reorganize lint, other than that it's clearly gotten far too big and would benefit from some kind of reorganization.

Maybe it would be helpful to look at the source of other lint-like tools like epubcheck to see how they organize their code. Like se lint, I imagine epubcheck is mostly just stepping through a long list of checks in a loop. How do they organize it?

I haven't started working on breaking out t-042 yet. That might be too much to do at the same time as restructuring lint, though. Not looking at "special" files may also be out of scope for reorganizing, because there are a ton of checks you'd have to carefully review to confirm they would never apply to those special files.

While it is more time-consuming to apply the same checks to SE template files, it's also a benefit, because we can sometimes catch rare mistakes, and if we update the templates we never have to go back to lint to make sure checks do or don't apply anymore.

@vr8hub
Contributor Author

vr8hub commented Jul 23, 2023

My thought on t-042 was to do it before reorganizing. That has a two-fold benefit: it gets 300+ lines out of lint that don't have to be reorganized, and it speeds up lint. And it should be relatively simple to do (famous last words).

We will need to update the Step by Step as well; would you suggest it be run before the first lint, or after the first lint?

@acabal
Member

acabal commented Jul 24, 2023

If we're pulling out t-042 then that will be a pretty big project and we should focus on it first.

I think we should try to plan out what a separate se find-typos tool might look like, and what its output might look like.

It might still make sense to have typo checks under se lint, but to break them out into their own error code category. The reasons are:

  1. Updates to a book in the future might introduce typos. After a book is released we shouldn't need to run multiple tools to make sure it's still good; and a find-typos tool would mean that we'd have to inspect each typo it found all over again, even if we cleared it in the past.
  2. The major problem with t-042 is that it combines a large number of checks into a single error code. When we add an se-lint-ignore entry, it obscures all of the typos t-042 finds, even new ones in the future. This is not desirable, because we really only want to ignore the exact typo we found.

So, maybe we don't want a separate typos tool. I'm still not sure.

But we certainly do need to reorganize the lint function!

@acabal
Member

acabal commented Jul 24, 2023

So maybe the answer is to create a new error code category like y-001 for typos and build that out, instead of a single t-042 code.

@vr8hub
Contributor Author

vr8hub commented Jul 24, 2023

I’m all for keeping them in lint and giving them different messages: I tried to get you to do that when I originally changed the messages to be different. :)

To me, doing something with the typos is a smaller project and has a higher priority than reorganizing the whole thing. So whichever choice is made re the typos, I think that should be done first.

That makes the most sense to me, for the reasons you state. If we wanted to at some point, we could put a flag on lint that included/excluded the typo checks (depending on what we wanted the default to be). It is true that, once corrected, typos are generally not an issue for the rest of a project, so having a way for lint to run faster by skipping them might be nice. But that's just a possible future enhancement, not a need right now.

@vr8hub
Contributor Author

vr8hub commented Jul 24, 2023

If you decide y-xxx is the way to go, I believe it's ready.

We have 99(!) ignore files that have t-042 in them, so they will have to be gone through and the correct y-xxx code put in. I can start on that if this is the way you want to go.

@vr8hub
Contributor Author

vr8hub commented Jul 26, 2023

In the meantime…

The "Possible typo: paragraph missing ending punctuation" check (y-003 in the possible new scheme) tests for a paragraph ending in a lowercase letter ([a-z]$) and that a bunch of conditions aren't true: it's too long to try to list, so I'll just show the code here so you don't have to go find it. (I put in some newlines to make it hopefully easier to read.)

not(following-sibling::*[1]/node()[1][contains(@epub:type, 'z3998:roman')] or
following-sibling::*[1][re:test(., '^([0-9]|first|second|third|fourth|fifth|sixth|seventh|eight|ninth|tenth)', 'i')])
and not(@class or contains(@epub:type, 'z3998:salutation'))
and not(following-sibling::*[1][name() = 'blockquote'
or name() = 'figure' or name() = 'table' or name() = 'footer' or name() = 'ul'
or name() = 'ol']
or ancestor::*[name() = 'blockquote' or name() = 'footer' or name() = 'header'
or name() = 'table' or name() = 'ul' or name() = 'ol' or name() = 'figure'
or name() = 'hgroup' or re:test(@epub:type, '(z3998:drama|dedication|halftitlepage)')])]

The "Possible typo: missing ending punctuation" check (y-021 in the possible new scheme) tests for either said or .*ed at the end of a paragraph and that a (smaller) bunch of conditions aren't true:

not(@epub:type or ancestor::*[name()='header' or name()='table'
or name()='blockquote' or name()='li' or contains(@epub:type, 'z3998:letter')]
or (following-sibling::*[1])[name()='blockquote' or name()='figure'
or name()='footer'])]

Since "said" and ".*ed" both end in a lowercase letter, the first part of the second test always matches the first part of the first test, and several of the conditions in the second part of the tests are the same as well. Thus, in many/most cases, a paragraph getting one of the errors gets the other as well, and the messages are pretty much identical.

Thus, it would seem (to me) that only one of these tests is needed. Since "any lowercase letter" is the more relaxed of the two first conditions, the first test would seem to be the one to use. For the second set of exclusions, I don't know whether you would prefer the longer set or the shorter set.
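As a quick sanity check of that subsumption claim, here is a minimal stdlib sketch (the sample endings are made up for illustration; this is not the actual lint code): any paragraph ending matched by the narrower (said|ed)$ condition necessarily ends in a lowercase letter, so it also matches [a-z]$.

```python
import re

# Hypothetical sample paragraph endings, for illustration only.
endings = [
    "he said",              # caught by both patterns
    "it ended",             # caught by both patterns
    "no punctuation here",  # caught only by the broader pattern
    "A full sentence.",     # caught by neither
]

narrow = [e for e in endings if re.search(r"(said|ed)$", e)]
broad = [e for e in endings if re.search(r"[a-z]$", e)]

# Everything the narrow test flags, the broad test flags too.
assert set(narrow) <= set(broad)
```

Of course, this only shows that the two first conditions are nested; the two exclusion lists still differ, which is the open question above.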

@acabal
Member

acabal commented Jul 26, 2023

Yes, it looks like those two could probably be combined but the xpaths are so complicated and there are no test cases, so I think it will be really hard to merge them in a way we can test. I guess the best we can do is merge them and run lint across the corpus, and if there are no changes to the output then we can say it's probably the same.

Related thought, lint as a whole could benefit from being run through an xpath pretty printer to make some of those more gnarly xpaths more legible. I assume something like that exists but I haven't done any actual looking.

Let's proceed with adding a new category for typos, y-***. I think that's the more conservative route and if we want to make a new tool in the future we can still do that.

It will be tough work reconciling all of the se-lint-ignore.xml entries though. t-042 is by far the most ignored code so it's going to be a lot of work to run through each one and decide which new code applies to an ebook.

@vr8hub
Contributor Author

vr8hub commented Jul 26, 2023

Re your last para, although there are 99 to run, it's mostly just busy work. Since t-042 doesn't exist in the new lint, running the new lint yields at least two errors: one complaining that the t-042 ignore wasn't used, and one with the new y- code. Update the ignore file from t-042 to y-xxx and it's done. For the (few) instances where multiple y-xxx's are reported, I just have to add the additional ones (this is how I caught the issue with y-003/y-021). Yes, there are 99 to do, but the work on each one is NBD.

For y-003/021, the question is what do you want to exclude when a paragraph ends in a lowercase letter: the big list or the small one or something in-between?

If you still want to exclude all of the things in the big list, then that's the list to use. A lint on the 99 with t-042's might then result in some t-042's that are no longer flagged, because what wasn't excluded before is excluded now. NBD; we just remove it from the ignore file. Thus, it will be helpful to make the change at the same time as the y-xxx, so that when I run lint on the 99 that have a t-042 exception, we know which one to replace it with (or whether to remove it entirely). We definitely don't want to update the files with both, and then turn around and have to update them again to clean up the dupes.

If you only want to exclude the smaller list, or something in-between the two, then we will need to run a corpus-wide lint and add any exceptions that weren't caught before. But that's not a big deal either, and our lints are far from clean today. :)

@acabal
Member

acabal commented Jul 26, 2023

I really don't know. Those xpaths were written a long time ago and designed to catch the most errors while minimizing false positives based on the corpus at the time. Since then, the corpus has been corrected so it's hard to say which conditions are essential because we can't test them anymore. If we're going to combine them then we should basically just combine them entirely.

@vr8hub
Contributor Author

vr8hub commented Jul 26, 2023

Very good. I think the first set wholly contains the second, but I'll make sure.

I poked around briefly for an xpath pretty-printer but didn't find anything, either.

@vr8hub
Contributor Author

vr8hub commented Aug 26, 2023

I have made a first pass. My thought was to get like errors together, and then possibly convert the big sections into internal functions, so that the main function was at least easier to follow. This is how most large programs I've encountered are structured, and how, e.g., epubcheck does its thing (lots of smaller functions).

To start with, I:

  • Where possible, in the "main" file processing loop (foreach .xhtml), organized the error checks by section and in order, i.e. CSS, metadata, syntax, typography, XHTML, typos.
  • Moved all of the infrastructure checks above the error checks, e.g. missing styles, glossary keys, etc.
  • Moved all of the "special file" processing inside the loop to be together, e.g. colophon, imprint, endnotes, LOI, and before the main error checks. (Ideally, I would like the error messages for these special files to be separated in some way from everything else, maybe make them X-9yy codes or something.)

If this is OK, then the obvious next step, I think, is to make each of those big sections its own function (e.g. _lint_typography), so that the main function is just calling the five functions to process the five types of errors. (Mostly; there are obviously errors handled outside the main loop, but we'll get to those later.)
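In rough outline, that decomposition might look like the following (a hypothetical sketch; only _lint_typography is a name actually proposed here, and the toy check body is purely illustrative, not the real lint logic):

```python
def _lint_css(filename, text):
    """CSS checks; each category returns a list of (code, message) tuples."""
    return []

def _lint_metadata(filename, text):
    return []

def _lint_syntax(filename, text):
    return []

def _lint_typography(filename, text):
    return []

def _lint_typos(filename, text):
    """Toy stand-in for the typo checks (the proposed y-xxx codes)."""
    messages = []
    stripped = text.rstrip()
    if stripped and stripped[-1].islower():
        messages.append(("y-003", f"Possible typo in {filename}: paragraph missing ending punctuation"))
    return messages

def lint_file(filename, text):
    """Main loop body: run each check category in order, collecting messages."""
    messages = []
    for check in (_lint_css, _lint_metadata, _lint_syntax,
                  _lint_typography, _lint_typos):
        messages.extend(check(filename, text))
    return messages
```

With this shape, the loop over files stays short, and each category could later move to its own module if desired.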

I have processed a few of my larger productions, and am not getting any false positives. I have not done any further testing because this is just a proof-of-concept, and I didn't want to waste any more time on it if this approach wasn't going to be accepted.

(There's plenty of opportunity for improvement in the first several hundred lines as well, but the 2K+ lines in the main file loop are the largest section, and therefore the first thing I attacked.)

The changes are so extensive (even though they're only "location" changes) that it doesn't make sense to post a PR, so I've attached it here (as .txt since GitHub doesn't accept .py).

It's only a POC, so don't stress over the additional comment lines I've added to begin each section; those were to make it easier for me to search while I was organizing the sections. Likewise, don't worry if I missed a message that's slightly out of order. (There are a couple of instances where that is unavoidable, because the same code block processes two different errors that are far apart in number, and in at least one case of two different types. In the former case, I would like to get those sequential, but I would have to check the existing ignore files to see if it would cause any problems.)
se_epub_lint.txt

@vr8hub
Contributor Author

vr8hub commented Aug 27, 2023

I needed a break from proofing Balzac, so did the next step, i.e. creating several internal functions, one each for the main checks, one for "special" files (imprint/colophon/endnotes/loi), and a couple of other minor ones. Next will probably be all of the metadata checks at the beginning. Again, only rudimentary testing so far. (It does produce expected output on the book I'm working on.)

I turned all of the local_css_* and ebook_has_* flags into dicts, so I could pass them to the functions more easily, i.e. all at once rather than five or six additional parameters.

Where these functions live (i.e., is it possible/desirable to move them out of the main se_epub_lint file?), etc., is up for discussion.
se_epub_lint.txt

@vr8hub
Contributor Author

vr8hub commented Nov 29, 2023

Closing this since the rest of the discussion is on PR #607.

@vr8hub vr8hub closed this as completed Nov 29, 2023