Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lint restructure #607

Merged
merged 9 commits into from
Dec 5, 2023
Merged

Lint restructure #607

merged 9 commits into from
Dec 5, 2023

Conversation

vr8hub
Copy link
Contributor

@vr8hub vr8hub commented Aug 29, 2023

This is by no means ready for prime-time, but it's progressed far enough that it might be easier to discuss here than in the issue comments. Since almost everything has moved, I believe a diff will be next to useless, so you'll probably have to review the file in its entirety.

The goal was to make it easier to read and understand the flow, but also make it easier to know where to "do" things, i.e. when a new error is needed, an existing one needs to be changed, etc.

  • Several chunks of code at the beginning of the "main" function moved to functions:

    • _lint_metadata_checks, for the initial metadata checks
    • _lint_css_checks, for the initial CSS checks
    • _update_missing_styles
    • _lint_image_checks (jpg/tif/png)
    • _lint_svg_checks
    • _lint_special_file_checks—colophon/imprint/endnotes/loi; maybe endnotes and loi should be two separate functions, but they were small enough that for a first pass I just put them all in the same function.
  • Most of the .xhtml error checks moved to (and were ordered in) functions, one for each of the "types" of errors.

    • _lint_xhtml_css_checks
    • _lint_xhtml_metadata_checks
    • _lint_xhtml_syntax_checks
    • _lint_xhtml_typography_checks
    • _lint_xhtml_xhtml_checks (I know, see the 2nd hardest thing in CS)
    • _lint_xhtml_typo_checks
  • A couple of significant chunks at the end were moved to functions as well.

    • _lint_image_metadata_checks
    • _lint_process_ignore_file—the parsing of the ignore file was at the beginning, the filtering at the end, but nothing that the parsing did was used anywhere but the filtering, so I put both in the function.

As mentioned in the issue comments, I consolidated a number of flags into two dicts, local_css and ebook_flags, to make it easier to pass information to the functions. That's about all I changed (I think), other than moving things around.

With these changes, the "main" lint function is now only 542 lines. The two biggest subfunctions, syntax and typography checks, are roughly the same, with the former a bit longer and the latter a bit shorter.

As also mentioned in the comments, it processes the book I'm currently working on and produces the expected output, but there's obviously LOTS of testing ahead, with plenty of discussion and other changes (or undoing changes or starting over) even before that. Ideally we would have a separate test as part of the tools for each of the errors, but that can be someone else's project. :)

Hopefully this is enough to start the discussion.

@acabal
Copy link
Member

acabal commented Sep 5, 2023

This is on my radar, I'm just extremely busy this month!

@vr8hub
Copy link
Contributor Author

vr8hub commented Sep 5, 2023

"This month"? :)

Thanks for the update, but no worries, I know life is hectic these days.

@vr8hub
Copy link
Contributor Author

vr8hub commented Sep 5, 2023

I rebased master to pull in your new y-033 check. Since the whole section is "check typos," I updated the comment to be a little more descriptive.

@vr8hub
Copy link
Contributor Author

vr8hub commented Oct 26, 2023

I rebased from master again to pick up the last couple of changes you made to lint (sorting c-006 and adding the dedication/epigraph style checks).

As an aside, I've been using this version since I pulled in the y-033 check, and haven't run into any problems. It did flag some incorrectly named images that master lint didn't, but I actually think we want to do that. I still need to check why it happened, though.

@vr8hub
Copy link
Contributor Author

vr8hub commented Oct 26, 2023

Ah, I see why. The master version just checks stem for ".source"; I combined two checks, so that the thing that sets the "has_cover_source" flag (which checks for stem of "cover.source") is one path, and all other files are checked on the other path. So "cover.source" is valid, but anything else that isn't URL-safe isn't. And I believe that is correct? We don't want to allow "x.source.ext" in general, just in the specific case of cover.source.jpg, right?

@acabal
Copy link
Member

acabal commented Oct 27, 2023

Correct. Also note that Github build checks are failing, can you take a look at that?

@vr8hub vr8hub force-pushed the lint-restructure branch 2 times, most recently from 4a4621c to 18d6c6f Compare October 27, 2023 04:56
@vr8hub
Copy link
Contributor Author

vr8hub commented Oct 27, 2023

All right, I fixed several variable definitions and a few other things, and a local mypy now runs clean (well, as clean as it does on the master one). The build is queued, but it's bedtime, so I'll check on it tomorrow if there are any further issues.

@vr8hub
Copy link
Contributor Author

vr8hub commented Oct 27, 2023

It finished before I made it to bed. mypy and pylint are now passing, but six of the lint tests are failing. However, they also fail locally on master as well, so that would not appear to be a regression. (I'm current; that's how I found the lint changes that started this evening's escapades.) I did not look at the details of the failures as I really am going to bed now. :)

@vr8hub
Copy link
Contributor Author

vr8hub commented Oct 27, 2023

It appears from the output that each of the six tests has one additional error (m-063, cover image not built) and one missing error (c-006, missing style z3998:verse) compared to the current "golden files". Again, this is on current master (locally).

I see that conftest.py creates a tmp directory for its working directory, where the book files are combined for the lint test. Since it's a tmp directory, I assume it's automatically deleted when the test completes.

pytest has an INI variable that controls the retention policy for paths created with the pytest tmp_path built-in. However, it doesn't appear to work on our tests, maybe because we have a work_dir function that generates a tmp directory for the working directory; it does use a tmp_path function, but it's not saved after the run even with the pytest.ini file present.

I see the argument that saves the output files, but I want to save the actual book files, so I can look at them to see if the errors are valid and we just need new golden files, or whether something is wrong somewhere.

Do you know off the top of your head an easy way to do that, i.e. save the assembled book files after the run?

EDIT: I found the --basetemp variable, and that appears to work. Not sure why it isn't working with the standard tmp directory.

@vr8hub
Copy link
Contributor Author

vr8hub commented Oct 31, 2023

All right, I finally figured out what was going on with the failing tests. I had a logic error, but that was easy to fix once I could see it fail. And I couldn't see it fail because the directory tree I have books in has /src/ in the name outside the se book directory, and the

if f"{os.sep}src{os.sep}" not in root:

was failing all of the time and thus not doing the tests that were inside that if.

Although that is probably a rare occurrence, it causes "hidden" failures that are almost impossible to figure out. I therefore changed the test to reduce the chances of it happening:

if f"{os.sep}src{os.sep}images{os.sep}" not in root:

@acabal
Copy link
Member

acabal commented Nov 27, 2023

Hi Vince, I finally had some time to work on this.

I'm trying to test it on the corpus but it hangs on Gibbon. On my machine old lint finishes Gibbon in about 18 mins. I left this branch running for an hour before killing it, so either it's taking far too long or it hangs somewhere.

Can you see what's going on?

Gibbon is of course an edge case, but if the new branch is exponentially slower than this will filter down to all ebooks which could become a problem when building on the server.

@vr8hub
Copy link
Contributor Author

vr8hub commented Nov 27, 2023

Will do. I had to a lot of instrumentation to figure out what was going on with the old lint when I did Gibbon (it was originally taking over an hour then, too), but maybe it will be easier this time, since everything is in a function.

@acabal
Copy link
Member

acabal commented Nov 27, 2023

I haven't inspected the new branch too closely, but a big reason why the old branch was such a mess was because everything was done in a single loop, and that loop naturally just got huge and disorganized.

One loop is good for speed because for the most part we're running all of the checks only once per file. But if this refactor changes that (I don't know if it does) and we wind up opening files multiple times, or having big loops within big loops, then the timing could easily become exponential at the expense of code organization.

@acabal
Copy link
Member

acabal commented Nov 27, 2023

Also I want to note that my computer has 16GB of RAM and 12 cores. The server is much, much, much weaker. If lint is just very slow and not actually hanging, and my machine takes > 1 hour to lint Gibbon, then it would take a whole day on the server.

@vr8hub
Copy link
Contributor Author

vr8hub commented Nov 27, 2023

It does not (should not) change that—there is still only one loop, but instead of everything being inline within that one loop, now things are arranged into functions, so that the loop calls function a, b, c, etc. With Gibbon, the problem is almost certainly on the endnotes, but I'll see what shakes out with the instrumentation. I'll also try it on some of the smaller big ones (Don Quixote) to see if there's a difference there, as well.

Hopefully it's a logic error that's causing it to get stuck on something…

@vr8hub
Copy link
Contributor Author

vr8hub commented Nov 28, 2023

I found something stupid I did (ran something on every file instead of once before all), and started a retest. And immediately ran into something else unrelated to the restructure. I had to track it down before I started another test, because it wasn't even getting to processing the first XHTML file. All comments below are on master; I was current before starting this.

In the section of the code that builds the section_tree, line 1288 is

has_header = bool(dom_section.xpath("./*[re:test(name(), '^h[1-6]$')] | .//*[not(re:test(name(), '^(section|article|nav)$'))]//*[re:test(name(), '^h[1-6]$')]"))

That one line, when processing endnotes.xhtml in the build tree section, was taking several minutes (could be [much] longer; I never let it go more than five minutes before killing it).

It consists of two xpath statements separated by an or |. Apparently xpath doesn't short-circuit its ifs, because when I changed it to these (hopefully) equivalent statements:

			has_header = bool(dom_section.xpath("./*[re:test(name(), '^h[1-6]$')]"))
			if not has_header:
				has_header = bool(dom_section.xpath("//*[not(re:test(name(), '^(section|article|nav)$'))]//*[re:test(name(), '^h[1-6]$')]"))

it finished immediately. Whatever that second portion is doing, it does not like doing it on an endnotes file of that size.

Now, why hasn't that caused problems with Gibbon before? The change was apparently made before I submitted it last year. I don't know, I just know it doesn't like it now.

With the above change (and the fix of my stuff I mentioned at the top), the lint of Gibbon takes about the same amount of time as using master.

Let me know if:

  1. The above statements are equivalent.
  2. Any of the above makes sense to you (I'm still not very conversant in xpath, so I'm not sure what the second half of the or is doing).

If the answer to #1 is yes, let me know if you want me to submit a PR to make that change in master, which I'll then rebase into the restructure, make my fix, and then I'll retest and then you can retest.

I did not try this level of testing before now because I really thought you would have lots to say about the code itself before we got to a point where we would be ready for testing. :)

@acabal
Copy link
Member

acabal commented Nov 28, 2023

OK, but what concerns me is that we don't know why this is happening.

I'm running on master and Gibbon takes ~18 mins. If I switch to this branch it hangs. But the xpath is the same. This suggests to me that we've entered an exponential loop somewhere accidentally. My guess, without looking closely, is that previously this long-running xpath (or some other long-running xpath) was being called in O(n) time, and now it's being called in O(n^2) or worse. That would explain why master still is OK but switching branches kills it.

@acabal
Copy link
Member

acabal commented Nov 28, 2023

We can still accept a fix for this xpath, that's a good discovery, but I think that's a bandaid for a more fundamental issue in this branch.

@vr8hub
Copy link
Contributor Author

vr8hub commented Nov 28, 2023

I gave you all the information needed to demonstrate it's not a bandaid. First:

I found something stupid I did (ran something on every file instead of once before all), and started a retest.

Second:

With the above change (and the fix of my stuff I mentioned at the top), the lint of Gibbon takes about the same amount of time as using master.

Thus, it's not a bandaid. The thing I was running every file was building the tree, so fixing my problem and fixing the xpath issue in building the tree fixed everything.

I'll submit the PR to master in just a second, and I'll retest everything after rebasing master on the restructure branch.

@acabal
Copy link
Member

acabal commented Nov 28, 2023

The other thing I would suggest is that in the new functions, the function definitions have the self parameter in various positions. But traditionally in Python self is always the first parameter. That should be an easy refactor.

@vr8hub
Copy link
Contributor Author

vr8hub commented Nov 28, 2023

Absolutely. I actually noticed that (I mean, it stuck out) last night when I was adding the parameter to the syntax function. I'll change that today.

With the xpath change and the fix of my issue, linting Gibbon with the restructure now takes 14 minutes on my Mac; linting it using master takes 21 minutes. So the "bandaid" cut a third off the time. :)

With things now in functions, I'll look further at those 14 minutes (most of which are the XHTML checks on endnotes) and see if there's something else that is taking up most of the time.

@vr8hub
Copy link
Contributor Author

vr8hub commented Nov 29, 2023

I rebased the updated master into the branch, made the self changes, and retested, and Gibbon still takes 14 minutes and some change. Ready for you again, I believe.

@vr8hub vr8hub mentioned this pull request Nov 29, 2023
@acabal
Copy link
Member

acabal commented Dec 1, 2023

It looks like f-008 is causing problems in this branch, it's raising false positives on books like The Man in the Brown Suit and others. Can you take a look?

@vr8hub
Copy link
Contributor Author

vr8hub commented Dec 1, 2023

I asked you about that above, and you replied "Correct" to my question about whether only the cover should have ".source" in the name. Something with two periods is not URL-safe, and therefore the message is accurate.

@acabal acabal merged commit 6a06d6c into standardebooks:master Dec 5, 2023
1 check passed
@acabal
Copy link
Member

acabal commented Dec 5, 2023

OK, I looked it over and I think things basically look good. Thanks! We'll let this sit on master for a while to let it shake out and then I'll do a release maybe late in the month.

@vr8hub
Copy link
Contributor Author

vr8hub commented Dec 5, 2023

Sounds good!

@vr8hub vr8hub deleted the lint-restructure branch December 5, 2023 21:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants