[crawler] Scroll down in websites that load images on scroll #699

Open
1 task done
npelov opened this issue Nov 27, 2024 · 6 comments
Labels
feature request New feature or request

Comments

@npelov

npelov commented Nov 27, 2024

Describe the feature you'd like

Some websites (WordPress sites, e.g. https://www.wundertech.net/installing-jellyfin-on-proxmox/) use data-X attributes to store image URLs; a script then populates the image URL. Hoarder fails to download these images when downloading a full page archive. It would be nice to have a global setting (the easiest way) that delays data processing after the page has loaded, to give the JavaScript a chance to do its job. 1 or 2 seconds (configurable) should be enough.

Describe the benefits this would bring to existing Hoarder users

This approach is used on a lot of WordPress sites, and it prevents their images from being crawled correctly.

Can the goal of this request already be achieved via other means?

Not that I know. Setting

Have you searched for an existing open/closed issue?

  • I have searched for existing issues and none cover my fundamental request

Additional context

Try this URL:
https://www.wundertech.net/installing-jellyfin-on-proxmox/
Website screenshot:
(image)

Crawled:
(image)

@kamtschatka
Collaborator

This is not possible at the moment. We are using https://github.com/Y2Z/monolith, which does not execute JavaScript; it simply stores the HTML and its assets in an HTML archive.

@MohamedBassem
Collaborator

@kamtschatka we actually do execute the JavaScript first in the browser before taking the HTML and passing it to monolith.

We already wait until network activity on the page is stable before taking the HTML dump. I wonder if this wait is not long enough.

@kamtschatka
Collaborator

Indeed. It looks like an image is only loaded when it comes into the visible area, so waiting longer does not help. Scrolling down slowly would help, but I am not sure that is something we want to do. I am also not sure why the JavaScript in the monolith archive no longer behaves the same way.

@MohamedBassem
Collaborator

@kamtschatka If I recall correctly, I explicitly don't capture JS in monolith. I think scrolling down is something that we can do actually. I think it makes sense.
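
To make the scrolling idea concrete, here is a minimal sketch (not Hoarder's actual crawler code) of how the scroll positions could be planned. It assumes the page height and viewport height are already known; `scrollOffsets` is a hypothetical helper name introduced only for illustration.

```typescript
// Hypothetical helper, not part of Hoarder: lists the scroll positions needed
// so that every part of a page of known height passes through the viewport.
function scrollOffsets(pageHeight: number, viewportHeight: number): number[] {
  if (viewportHeight <= 0) {
    throw new Error("viewportHeight must be positive");
  }
  // The furthest position the page can be scrolled to (0 if it fits on screen).
  const lastOffset = Math.max(0, pageHeight - viewportHeight);
  const offsets: number[] = [];
  // Advance one full viewport at a time so no region is skipped.
  for (let y = 0; y < lastOffset; y += viewportHeight) {
    offsets.push(y);
  }
  // Always finish at the bottom of the page.
  offsets.push(lastOffset);
  return offsets;
}
```

A crawler could visit each offset in turn and pause briefly at each one, so that images loaded on visibility get a chance to load before the HTML is captured.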

@npelov
Author

npelov commented Nov 27, 2024

@MohamedBassem be careful with pages that load more content as you scroll down. Maybe make it an option. Or even better, make the options customizable per domain. I'm sure there are other sites with unnecessarily complicated scripts. Why do people have to complicate things so much...

@MohamedBassem
Collaborator

@npelov yeah, we probably won't scroll down infinitely, but will have some small limit.
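
A rough sketch of what a bounded scroll could look like, under stated assumptions: the `ScrollablePage` interface, `scrollWithLimit`, and `maxSteps` are hypothetical names for illustration and are not Hoarder's API. A real crawler would back the interface with its browser automation library.

```typescript
// Minimal page abstraction (an assumption for this sketch).
interface ScrollablePage {
  getScrollHeight(): Promise<number>;
  getViewportHeight(): Promise<number>;
  // Expected to resolve once the page has settled after the scroll.
  scrollTo(y: number): Promise<void>;
}

// Scrolls one viewport at a time until the bottom is reached or maxSteps
// scrolls have been made, whichever comes first. Returns the scroll count.
async function scrollWithLimit(
  page: ScrollablePage,
  maxSteps: number,
): Promise<number> {
  const viewport = await page.getViewportHeight();
  let y = 0;
  let steps = 0;
  while (steps < maxSteps) {
    // Re-read the height on every pass: pages that load more content
    // as you scroll keep growing, so the bottom may never be reached.
    const height = await page.getScrollHeight();
    if (y + viewport >= height) {
      break; // Already showing the bottom of the page.
    }
    y = Math.min(y + viewport, height - viewport);
    await page.scrollTo(y);
    steps++;
  }
  return steps;
}
```

With a cap like this, a page that keeps appending content stops after `maxSteps` scrolls, while an ordinary page stops as soon as its bottom is visible.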

MohamedBassem added the "feature request" (New feature or request) label Nov 30, 2024
MohamedBassem changed the title from "Images, populated by javascript" to "[crawler] Scroll down in websites that load images on scroll" Nov 30, 2024
MohamedBassem moved this to Backlog in Hoarder's Roadmap Nov 30, 2024
Projects
Status: Backlog
Development

No branches or pull requests

3 participants