Only pulled one page #12
That's an awful lot of work for just one page.
…On Sun, Jul 19, 2020 at 9:25 AM Peter Krantz wrote:
> 1. Warcworker is for single-page archiving only right now. There is no crawler or indexer.
> 2. If you want to monitor logs, run `docker-compose logs --tail=100 -t -f`
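For anyone skimming this thread later, that log command looks like this in practice (the `web` service name in the second line is an assumption about the compose file, not something from the repo):

```sh
# Follow the last 100 log lines from every service, with timestamps
docker-compose logs --tail=100 -t -f

# Follow a single service instead (service name is an assumption)
docker-compose logs --tail=100 -t -f web
```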
:-) Well, I use it for mass archiving of URLs collected by a custom crawler from JS-heavy websites. It does its job. For regular crawl-based archiving, see Heritrix. Sorry if it doesn't match your use case. I have updated the README to clarify this for other potential users.
You could check out the archiving component of warcworker, Squidwarc: it has settings that may help you archive more links of a website (see the Page + Same Domain Links setting).
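To make that concrete, here is a sketch of a crawl config selecting that mode; the key names and the runner script follow the Squidwarc README as I remember it, so treat them as assumptions and check them against the version you actually run:

```sh
# Hypothetical Squidwarc crawl config; "page-same-domain" is (as I recall)
# the config-file name for the "Page + Same Domain Links" mode.
cat > crawl-config.json <<'EOF'
{
  "use": "chrome",
  "headless": true,
  "mode": "page-same-domain",
  "depth": 1,
  "seeds": ["https://example.com"],
  "warc": { "naming": "url" }
}
EOF

# Runner script name per the README; may differ between versions
./run-crawler.sh -c crawl-config.json
```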
Thanks. Was just looking at that.
…On Sun, Jul 19, 2020 at 4:47 PM Peter Krantz wrote:
> You could check out the archiving component of warcworker, Squidwarc <https://github.com/N0taN3rd/Squidwarc>: it has settings that may help you archive more links of a website (see the Page + Same Domain Links setting).
If I remember correctly, it only captures the current page and all the links from that page, so it will not capture an entire website. If the website you are archiving is not dependent on running scripts in the archiving tool, you could check out HTTrack as well.
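For reference, a minimal HTTrack invocation for mirroring a script-free site might look like the following; the URL and domain filter are placeholders:

```sh
# Mirror the site into ./mirror, keeping the crawl on the same domain.
# example.com is a placeholder; "+*.example.com/*" uses HTTrack's filter syntax.
httrack "https://example.com/" -O ./mirror "+*.example.com/*"
```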
Closing