[crawler] Scroll down in websites that load images on scroll #699
Comments
This is not possible at the moment. We are using https://github.com/Y2Z/monolith, which does not execute JavaScript; it simply stores the HTML and its assets in an HTML archive.
@kamtschatka we actually do execute the JavaScript first in the browser before taking the HTML and passing it to monolith. We already wait until network activity on the page is stable before taking the HTML dump. I wonder if this wait is not long enough.
Indeed. It looks like an image is only loaded when it comes into the visible area, so waiting longer does not help. Scrolling down slowly would help, but I am not sure that is something we want to do. I am also not sure why the JavaScript in the monolith archive no longer behaves the same way.
@kamtschatka If I recall correctly, I explicitly don't capture JS in monolith. I think scrolling down is something that we can do actually. I think it makes sense.
@MohamedBassem be careful with pages that load more content as you scroll down. Maybe make it an option. Or even better, customizable options per domain. I'm sure there are other sites with unnecessarily complicated scripts. Why do people have to complicate things so much...
@npelov yeah, we probably won't scroll down infinitely, but will have some small limit.
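
To make the "small limit" idea concrete, here is a minimal sketch of how a bounded scroll could be planned. It is not Hoarder's implementation: the function name `planScrollOffsets` and its parameters are hypothetical, and it assumes the crawler scrolls one viewport at a time and stops at the bottom of the page or after `maxSteps`, whichever comes first.

```typescript
// Hypothetical helper (not part of Hoarder): compute the vertical scroll
// offsets, in pixels, that a crawler would visit one viewport at a time.
// The list ends at the bottom of the page or after maxSteps entries,
// so a page that keeps growing cannot make the crawl scroll forever.
function planScrollOffsets(
  pageHeight: number,
  viewportHeight: number,
  maxSteps: number,
): number[] {
  if (viewportHeight <= 0 || maxSteps <= 0) return [];

  // Largest offset that still shows a full viewport of content.
  const lastOffset = Math.max(0, pageHeight - viewportHeight);
  if (lastOffset === 0) return []; // page already fits in the viewport

  const offsets: number[] = [];
  for (let y = viewportHeight; offsets.length < maxSteps; y += viewportHeight) {
    const clamped = Math.min(y, lastOffset);
    offsets.push(clamped);
    if (clamped >= lastOffset) break; // reached the bottom
  }
  return offsets;
}
```

Each offset could then be applied inside the page with `window.scrollTo(0, y)`, followed by a short pause so lazy loaders have time to react. On a page that loads more content while scrolling, the height would need to be measured again after each step; the `maxSteps` cap is what keeps the crawl bounded in that case.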
Describe the feature you'd like
Some websites (WordPress, e.g. https://www.wundertech.net/installing-jellyfin-on-proxmox/) use data-X attributes to store image URLs. A script then populates the image URL. Hoarder fails to download these images when downloading a full page archive. It would be nice to have a global setting (the easiest way) that delays the data processing after the page has loaded, to give JavaScript a chance to do its job. 1 or 2 seconds (configurable) should be enough.
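
For readers unfamiliar with the pattern, the following sketch illustrates roughly what such a lazy-load script does when it populates an image. It is an illustration only: the attribute names in `LAZY_ATTRS` are assumptions (the actual data-X name depends on the site or plugin), and `promoteLazySrc` is a hypothetical helper, not something Hoarder or the linked site is known to use.

```typescript
// Minimal structural type so the helper accepts real DOM <img> elements
// as well as plain objects.
interface ImgLike {
  getAttribute(name: string): string | null;
  setAttribute(name: string, value: string): void;
}

// Assumed attribute names for illustration; real sites vary and this
// list is not exhaustive.
const LAZY_ATTRS: string[] = ["data-src", "data-lazy-src", "data-original"];

// Copy the first non-empty lazy attribute into `src`.
// Returns true if the image was populated, false if nothing was found.
function promoteLazySrc(img: ImgLike, lazyAttrs: string[] = LAZY_ATTRS): boolean {
  for (const attr of lazyAttrs) {
    const url = img.getAttribute(attr);
    if (url) {
      img.setAttribute("src", url);
      return true;
    }
  }
  return false;
}
```

Until a step like this has run for an image, its `src` does not point at the real file, which would explain why an archive taken too early is missing those images.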
Describe the benefits this would bring to existing Hoarder users
This approach is used by a lot of WordPress sites, and it prevents their images from being crawled correctly.
Can the goal of this request already be achieved via other means?
Not that I know. Setting
Have you searched for an existing open/closed issue?
Additional context
try url:
https://www.wundertech.net/installing-jellyfin-on-proxmox/
website screenshot:
crawled: