[crawler] Scroll down in websites that load images on scroll #699

npelov · 2024-11-27T10:19:55Z

Describe the feature you'd like

Some websites (WordPress, e.g. https://www.wundertech.net/installing-jellyfin-on-proxmox/) use data-X attributes to store image URLs, and a JavaScript snippet then populates the actual image URL. Hoarder fails to download these images when saving a full page archive. It would be nice to have a global setting (the easiest way) that delays processing after the page has loaded, to give the JavaScript a chance to do its job. 1 or 2 seconds (configurable) should be enough.
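To illustrate the pattern described above, here is a minimal sketch of what such a page script effectively does: it copies the URL from a data attribute into `src`. The attribute name `data-src` and the helper `promoteLazySrc` are assumptions for illustration only; the linked site may use a different attribute, and this is not code from Hoarder.

```typescript
// Illustration only: "data-src" is an assumed attribute name, and this
// helper is hypothetical (it is not part of Hoarder or monolith).
function promoteLazySrc(html: string, lazyAttr: string = "data-src"): string {
  const imgTag = /<img\b[^>]*>/gi;
  // Matches e.g. ` data-src="..."` and captures the quoted value.
  const lazyValue = new RegExp(`\\s${lazyAttr}=(["'])(.*?)\\1`, "i");

  return html.replace(imgTag, (tag) => {
    const match = lazyValue.exec(tag);
    if (match === null) {
      return tag; // Not a lazy-loaded image: leave it untouched.
    }
    const url = match[2];
    // Drop any placeholder src, then add the real URL as src.
    const withoutSrc = tag.replace(/\ssrc=(["']).*?\1/i, "");
    return withoutSrc.replace(/^<img\b/i, () => `<img src="${url}"`);
  });
}
```

Until the script has run for a given image, the archived HTML only contains the placeholder (or no `src` at all), which is why the image is missing from the archive.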

Describe the benefits this would bring to existing Hoarder users

Many WordPress sites use this approach, which prevents their images from being crawled correctly.

Can the goal of this request already be achieved via other means?

Not that I know. Setting

Have you searched for an existing open/closed issue?

I have searched for existing issues and none cover my fundamental request

Additional context

Try this URL: https://www.wundertech.net/installing-jellyfin-on-proxmox/

Website screenshot: (image not included)

Crawled result: (image not included)

kamtschatka · 2024-11-27T11:06:39Z

This is not possible at the moment. We are using https://github.com/Y2Z/monolith, which does not execute JavaScript; it simply stores the HTML and its assets in a single HTML archive.

MohamedBassem · 2024-11-27T11:08:42Z

@kamtschatka we actually do execute the JavaScript first in the browser before taking the HTML and passing it to monolith.

We already wait until network activity on the page is stable before taking the HTML dump. I wonder if this wait is not long enough.

kamtschatka · 2024-11-27T11:41:58Z

Indeed. It looks like each image is only loaded when it comes into the visible area, so waiting longer does not help. Scrolling down slowly would help, but I'm not sure that is something we want to do. I'm also not sure why the JavaScript in the monolith archive no longer behaves the same way.

MohamedBassem · 2024-11-27T11:44:01Z

@kamtschatka If I recall correctly, I explicitly don't capture JS in monolith. I think scrolling down is something that we can do actually. I think it makes sense.

npelov · 2024-11-27T20:20:29Z

@MohamedBassem be careful with pages that load more content as you scroll down. Maybe make it an option. Or even better, make the options customizable per domain. I'm sure there are other sites with unnecessarily complicated scripts. Why do people have to complicate things so much...

MohamedBassem · 2024-11-30T14:37:27Z

@npelov yeah, we probably won't scroll down infinitely, but will have some small limit.
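A bounded scroll of this kind could be sketched roughly as follows. This is only an illustration under stated assumptions: `planScrollOffsets` and its parameters are hypothetical names, not Hoarder's implementation, and the step size (one viewport per step) is an arbitrary choice.

```typescript
// Hypothetical helper, not taken from Hoarder's code base.
// Returns the vertical offsets to scroll to, one viewport at a time,
// stopping at the bottom of the page or after maxSteps offsets,
// whichever comes first.
function planScrollOffsets(
  pageHeight: number,
  viewportHeight: number,
  maxSteps: number,
): number[] {
  if (viewportHeight <= 0 || maxSteps <= 0) {
    return [];
  }
  // Furthest useful scroll position: the top of the last viewport-sized window.
  const maxOffset = Math.max(0, pageHeight - viewportHeight);
  const offsets: number[] = [];
  for (let y = viewportHeight; offsets.length < maxSteps; y += viewportHeight) {
    if (y >= maxOffset) {
      // Clamp the final step to the bottom of the page and stop.
      if (maxOffset > 0) {
        offsets.push(maxOffset);
      }
      break;
    }
    offsets.push(y);
  }
  return offsets;
}
```

Each offset could then be applied in the browser page with `window.scrollTo(0, y)`, pausing briefly between steps so that images entering the visible area have time to load. Because the plan is computed once from the initial page height and capped by `maxSteps`, a page that keeps loading more content as you scroll cannot make the crawler scroll forever.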

MohamedBassem added the "feature request" label Nov 30, 2024

MohamedBassem changed the title ~~Images, populated by javascript~~ [crawler] Scroll down in websites that load images on scroll Nov 30, 2024

MohamedBassem added this to Hoarder's Roadmap Nov 30, 2024

MohamedBassem moved this to Backlog in Hoarder's Roadmap Nov 30, 2024
