Only pulled one page #12

Closed
tripleo1 opened this issue Jul 19, 2020 · 7 comments

Comments

@tripleo1
Contributor

  • How do I pull an entire website with this?
  • How do I see what it is doing internally?
@peterk
Owner

peterk commented Jul 19, 2020

  1. Warcworker currently archives single pages only, typically individual posts on SPA websites (social media). It has no crawler or indexer. If you want to archive a regular website, including crawling it, other tools are better suited.
  2. To monitor the logs, run docker-compose logs --tail=100 -t -f

@tripleo1
Contributor Author

tripleo1 commented Jul 19, 2020 via email

@peterk
Owner

peterk commented Jul 19, 2020

:-) Well, I use it for mass archiving of URLs collected by a custom crawler from JS-heavy websites. It does its job. For regular archiving, see Heritrix. Sorry if it doesn't match your use case. I have updated the README to clarify this for other potential users.

@peterk
Owner

peterk commented Jul 19, 2020

You could check out Squidwarc, the archiving component of warcworker. It has settings that may help you archive more of a website's links (see the Page + Same Domain Links setting).

@tripleo1
Contributor Author

tripleo1 commented Jul 20, 2020 via email

@peterk
Owner

peterk commented Jul 20, 2020

If I remember correctly, it only captures the current page and all the links from that page, so it will not capture an entire website. If the website you are archiving does not depend on scripts being run by the archiving tool, you could check out HTTrack as well.
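
To make the "current page plus its links" scope concrete, here is a rough Python sketch of a one-hop, same-domain link list. It is only an illustration of the idea, not Squidwarc's actual code: the names same_domain_links and _LinkCollector are made up for this example, the example.org URLs are placeholders, and a real archiving tool would read links from the rendered page in a browser rather than from static HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collects raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Return the page URL followed by its same-host links, one hop deep.

    Links found on the linked pages are NOT followed, which is why this
    scope never grows into a crawl of the whole site.
    """
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    urls = [page_url]
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc == host and absolute not in urls:
            urls.append(absolute)
    return urls


if __name__ == "__main__":
    sample = (
        '<a href="/about">About</a> '
        '<a href="https://other.example/x">Elsewhere</a> '
        '<a href="post/1#c2">Post</a>'
    )
    for url in same_domain_links("https://example.org/blog/", sample):
        print(url)
```

Anything reachable only from /about or post/1 is left out, so covering a whole site would mean repeating this step for every newly found URL, which is what a crawler does.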

@peterk
Owner

peterk commented Jul 9, 2024

Closing

@peterk peterk closed this as completed Jul 9, 2024