Puppeteering and integration with ArchiveBox #2

Open

pirate opened this issue Jul 11, 2024 · 3 comments

pirate commented Jul 11, 2024

Hi! I'm the ArchiveBox maintainer and I just found your project.

It looks pretty sweet. I've been dreaming about in-browser archiving for a while now and actually implemented my own Puppeteer/CDP extension to do something very similar to yours (it records live pages from within the browser extension context and saves them into ArchiveBox).

I have a ton of asset-extraction and browser-automation-detection-avoidance snippets (10k+ LOC) to share if you're interested; maybe it could save you a lot of time with your work.

ArchiveBox's core is still focused on saving on a separate machine, but I'm happy to share my side-project work on in-browser archiving with other projects so it doesn't go to waste.

Would love to have a call/chat sometime if you're interested:
https://calendly.com/nicksweeting/choose-a-time or https://sweeting.me/#contact (click for email addr)

Also, you should go to DWeb Camp (https://dwebcamp.org/); it's the best archiving conference imo. It's not marketed very heavily, but lots of great people attend, including the Webrecorder team and Archive.org.

oxij commented Jul 13, 2024

Hi! Thanks for the compliments!

Reading your post I have a feeling you have a slightly different use case in mind compared to what I'm trying to do here.

It seems to me, you want this:

  • you give ArchiveBox a URL,
  • it opens a browser window/tab running pWebArc (the extension) with that URL (or it gives that URL to an already running pWebArc via WebSockets or some such and asks pWebArc to open a new tab with that URL),
  • pWebArc captures everything in that tab, and
  • sends the dump back to ArchiveBox.

Which is a valid use case, and I'm willing to help support it where it does not hurt pwebarc (the project), but it is not what this project is designed to do.

Creation of self-sufficient website archives that can be shared with others is a very low priority for pwebarc, which mainly exists to

  • passively collect everything I see in the most efficient way possible so that I could refer back to it later, even after it vanishes from the web (which happens all the time to me), and to

  • actively scrape pages hidden behind complex authentication (that requires a web browser), so that I could immediately feed the captured HTML to a custom script to, e.g., convert bank statements to ledger-cli files, read blog posts and web novels with a proper e-book reader app instead of the browser, where typography, the text layout engine, and TTS really suck, etc.

So, as noted in the FAQ, I feel like adding anti-bot-blocking and anti-browser-automation-detection to pWebArc itself would not help my archiving goals here.
Especially since I want pwebarc to avoid "success at all costs" and to follow the KISS principle as much as possible (which, objectively speaking, is not that much for pWebArc, given the messy web tech it has to use, but there's no need to make it worse).

So, in your use case, to make integration with a smart archiving server (like ArchiveBox) simpler, it could be useful to add a REST call to the dumping protocol so that pWebArc could notify the archiving server when the whole page has finished loading.
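
Something as minimal as this sketch would do on the extension side; the /pwebarc/done endpoint name and the JSON payload here are purely hypothetical, nothing like it exists in the protocol yet:

```typescript
// Hypothetical sketch only: tell the archiving server that a page has
// finished loading. The endpoint path and payload shape are made up here;
// they are not part of the current dumping protocol.
async function notifyPageDone(serverUrl: string, tabUrl: string): Promise<void> {
    await fetch(`${serverUrl}/pwebarc/done`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            url: tabUrl,            // the page that finished loading
            finishedAt: Date.now(), // when it finished, ms since epoch
        }),
    });
}
```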

Alternatively, pWebArc will get in-browser local storage persistence in the next version (implementing it turned out to be much more annoying than I expected, even with all my pre-planning in the sources), and navigation tracking will be coming after that.
So, pWebArc could, in principle, soon gain the ability to efficiently produce and submit (to an archiving server) batch-dumps pertaining to a single navigation in a single tab (by doing a large streaming POST request with a concatenation of all the collected dumps), instead of doing a bunch of separate per-request dumps as it does now.
Though, I'm not sure how useful that would be.
IMHO, the current way + a "done" message is way simpler.
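
For illustration, the batch submission described above could boil down to something like this sketch (the endpoint name and the "just concatenate opaque dump blobs" framing are assumptions, not the actual dumping protocol):

```typescript
// Rough sketch of the batch-dump idea: submit everything collected for one
// navigation in one tab as a single POST whose body is the concatenation of
// the individual per-request dumps. The endpoint name and the treatment of
// dumps as opaque blobs are illustrative assumptions.
async function submitBatch(serverUrl: string, dumps: Uint8Array[]): Promise<void> {
    const body = new Blob(dumps); // concatenates the collected dumps in order
    await fetch(`${serverUrl}/pwebarc/batch`, {
        method: "POST",
        headers: { "Content-Type": "application/octet-stream" },
        body,
    });
}
```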

Also, I'm toying with the idea of pWebArc storing its config on the archiving server, which could also help a bit.

Anyway, if I were you and I wanted to use a near-future version of pWebArc with ArchiveBox within my interpretation of your use case, I would put those 10k+ LOC of autopilot and anti-bot-blocking code into a separate extension (or a set of UserScripts, or some such) and then add two public in-browser sendMessage-based API calls to pWebArc (roughly sketched right after this list):

  • "please DOM-snapshot tab #N after everything finishes loading there", and
  • "please submit everything pertaining to that tab to the archiving server".

So, the integration with ArchiveBox then becomes this:

  • ArchiveBox opens a new browser tab with the given URL,
  • the autopilot extension does its thing while pWebArc captures everything,
  • once the autopilot decides it's done, it asks pWebArc to snapshot the tab and submit the whole capture to ArchiveBox,
  • pWebArc does that,
  • ArchiveBox archives the result.

Ta-da!
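
For concreteness, the ArchiveBox side of that flow might look roughly like this sketch, assuming ArchiveBox drives a non-headless Chromium via Puppeteer with builds of both extensions pre-loaded (none of this exists today):

```typescript
import puppeteer from "puppeteer";

// Sketch only: open the URL in a browser that has pWebArc and the autopilot
// extension loaded, then let the extensions do the rest. Paths and flags are
// illustrative; loading extensions requires a non-headless Chromium.
async function archiveWithExtensions(url: string): Promise<void> {
    const extensions = ["/path/to/pwebarc", "/path/to/autopilot"].join(",");
    const browser = await puppeteer.launch({
        headless: false,
        args: [
            `--disable-extensions-except=${extensions}`,
            `--load-extension=${extensions}`,
        ],
    });
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    // From here the autopilot decides when the page is "done", asks pWebArc
    // to snapshot and submit it, and ArchiveBox receives the capture on its
    // dump endpoint; the browser can then be closed.
}
```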

Thoughts?

As to DWeb Camp: it's on the other side of the planet, it would also require a US visa, going through TSA, and then spending days in a forest talking to random people instead of doing productive stuff. No, thanks. :]

As to chatting: I consider synchronous communications like IM or voice/video calls to be poison to productive deep work, so I'm trying really hard to minimize the amount of those too.
Instead, I love e-mail.
If you want to send me something private, my e-mail can be seen both in commit messages and in plain text at https://oxij.org/#contact (SMTP gray-listing drops 80% of spam immediately, honeypot addresses + a very simple compression-based classifier eat 99.9% of the rest, so, IMHO, hiding e-mails behind JavaScript makes no sense, just saying :]).

But I don't see why we can't discuss this topic publicly here.

oxij changed the title from "This project looks awesome! Nice Work" to "Puppeteering and integration with ArchiveBox" on Jul 13, 2024

pirate commented Jul 13, 2024

Oh, I wasn't intending to propose integrating with ArchiveBox at all! I just meant I have some code I can share that you might find useful for your own project (the in-browser page saving logic to create WARCs with more than just HTML+images, e.g. handling iframes, shadow DOMs, WebRTC, etc.).
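
(For example, a naive document.documentElement.outerHTML snapshot misses open shadow roots entirely; you end up walking them yourself, roughly like the sketch below, and closed shadow roots and cross-origin iframes need extra tricks on top of that.)

```typescript
// Rough sketch: collect the serialized contents of open shadow roots, which a
// plain document.documentElement.outerHTML snapshot would otherwise miss.
// Closed shadow roots and cross-origin iframes are not handled here.
function collectShadowRoots(root: ParentNode, out: string[] = []): string[] {
    for (const el of Array.from(root.querySelectorAll("*"))) {
        if (el.shadowRoot) {
            out.push(el.shadowRoot.innerHTML);      // serialized shadow content
            collectShadowRoots(el.shadowRoot, out); // shadow roots can nest
        }
    }
    return out;
}
```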

And no worries if DWeb Camp is not your thing, just figured I'd share it in case you were nearby.

Here is fine for discussion too, no need to call.

oxij commented Jul 22, 2024 via email
