DresdenCodak: Fix and improve scraper #230

toonn · 2022-09-21T21:05:38Z

The scraper was broken due to the site layout changing. The structure is lacking so there are many unique cases to deal with.

As the comic is separated into two storylines, one finished and one ongoing, and a series of one-offs that don't fit in either storyline I've put each of the series into its own subdirectory.

To make this possible I've changed how the comicdir is guaranteed to exist, taking into account path components in the filename and I've added path separators in the allowed characters for filenames.

I've also added space as an allowed character in filenames because I want them and there's not really a reason to disallow them. This together with allowing both / and \ could be a breaking change, since scrapers that generate file names containing them will no longer have them filtered out.

If that's not viable then maybe we need a new hierarchy of objects that do allow filenames to specify subdirectories and then scrapers can opt into it one at a time.

codecov · 2022-09-21T21:11:13Z

Codecov Report

Attention: 31 lines in your changes are missing coverage. Please review.

Comparison is base (a23e070) 82.10% compared to head (fb0ea00) 81.75%.
Report is 23 commits behind head on master.

Files	Patch %	Lines
dosagelib/plugins/d.py	27.90%	31 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #230      +/-   ##
==========================================
- Coverage   82.10%   81.75%   -0.35%     
==========================================
  Files          79       79              
  Lines        6649     6687      +38     
  Branches      525      536      +11     
==========================================
+ Hits         5459     5467       +8     
- Misses       1071     1101      +30     
  Partials      119      119

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

toonn · 2022-09-22T09:38:15Z

Is code coverage of scrapers a requirement or just a helpful hint for maintainers to keep an eye on things?

TobiX · 2022-09-22T10:35:57Z

The coverage is just advisory. It would be nice to add some tests for more complicated scrapers (I broke some scrapers by refactoring and only noticed month later), but not required. If you want to add tests, you can find some inspiration in https://github.com/webcomics/dosage/blob/master/tests/test_modules.py

TobiX · 2022-10-01T17:22:12Z

I'm really not sure if I want to allow modules to mess with sub-directories... (I'm not totally opposed, but also not really convinced that the feature is that useful...) I'm pretty sure some modules actually rely on Dosage stripping the path from the "target" file name, so that's one reason I'm a bit uncomfortable with this change...

toonn · 2022-10-03T16:24:21Z

I understand, but I do really want subdirectories (and spaces in names), hence the suggestion of having a bit of a double hierarchy of objects. So we can leave everything as is but things that want to make use of the feature can use _ParserScraperNG to inherit from instead. (Or this could also be a configuration set by scrapers if you would rather have a boolean flag enabling the behavior somewhere.)

The path separators are necessary if we want to be able to organize comics into books or chapters or categories. The spaces are nice to have. I assume they haven't been allowed because they require escaping in a terminal but any modern terminal provides assistance for that during tab completion. Disallowing spaces here means they can never be used. Instead a method could be added to an appropriate class to sanitize filenames and replace spaces by default if that is actually desired.

If the filename is a path we need to ensure all its components exist to avoid errors later when we try to create the file at that path.

The scraper was broken due to the site layout changing. The structure is lacking so there are many unique cases to deal with. As the comic is separated into two storylines, one finished and one ongoing, and a series of one-offs that don't fit in either storyline I've put each of the series into its own subdirectory.

toonn added 4 commits December 7, 2023 15:39

comic: Allow filename to be a path

de2c599

If the filename is a path we need to ensure all its components exist to avoid errors later when we try to create the file at that path.

DresdenCodak: New pages can have the resolution in the name

fb0ea00

toonn force-pushed the dresden-codak branch from 82c4945 to fb0ea00 Compare December 7, 2023 14:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DresdenCodak: Fix and improve scraper #230

DresdenCodak: Fix and improve scraper #230

toonn commented Sep 21, 2022

codecov bot commented Sep 21, 2022 •

edited

Loading

toonn commented Sep 22, 2022

TobiX commented Sep 22, 2022

TobiX commented Oct 1, 2022

toonn commented Oct 3, 2022

DresdenCodak: Fix and improve scraper #230

Are you sure you want to change the base?

DresdenCodak: Fix and improve scraper #230

Conversation

toonn commented Sep 21, 2022

codecov bot commented Sep 21, 2022 • edited Loading

Codecov Report

toonn commented Sep 22, 2022

TobiX commented Sep 22, 2022

TobiX commented Oct 1, 2022

toonn commented Oct 3, 2022

codecov bot commented Sep 21, 2022 •

edited

Loading