Only pulled one page #12

tripleo1 · 2020-07-19T13:20:12Z

How do I pull an entire website with this?
How do I see what it is doing internally?

peterk · 2020-07-19T13:25:39Z

Warcworker is for single-page archiving only right now - typically for single posts on SPA websites (social media). There is no crawler or indexer. There are better tools if you want to archive a regular website, including crawling.
If you want to monitor the logs, run `docker-compose logs --tail=100 -t -f`

tripleo1 · 2020-07-19T19:28:56Z

That's an awful lot of work for just one page.


peterk · 2020-07-19T20:30:32Z

:-) Well, I use it for mass archiving of URLs collected by a custom crawler from JavaScript-heavy websites. It does its job. For regular archiving, see Heritrix. Sorry if it doesn't match your use case. I have updated the README to clarify this for other potential users.

peterk · 2020-07-19T20:47:08Z

You could check out the archiving component of warcworker - Squidwarc (https://github.com/N0taN3rd/Squidwarc) - it has settings that may help you archive more links of a website (see the "Page + Same Domain Links" setting).

tripleo1 · 2020-07-20T03:19:29Z

Thanks. Was just looking at that.


peterk · 2020-07-20T15:10:19Z

If I remember correctly, it only captures the current page and all the links from that page, so it will not capture an entire website. If the website you are archiving is not dependent on running scripts in the archiving tool, you could check out HTTrack as well.
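
To make the "current page plus the links from that page" scope concrete, here is a minimal Python sketch of a one-hop, same-domain link selection. It is not Squidwarc's actual code; `LinkCollector` and `same_domain_links` are hypothetical names, and the assumption is that "same domain" means an identical host in the URL. It only works on HTML you already have and does no fetching or script execution.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Return the page URL plus the same-host links found on it (one hop only)."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc  # assumption: same domain == same host
    targets = [page_url]
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == host and absolute not in targets:
            targets.append(absolute)
    return targets
```

Links found on the linked pages are never followed, which is why this kind of scope stops short of a full-site crawl.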

peterk · 2024-07-09T08:00:39Z

Closing

peterk closed this as completed Jul 9, 2024
