Puppeteering and integration with ArchiveBox #2

Open

pirate opened this issue Jul 11, 2024 · 3 comments

pirate commented Jul 11, 2024

Hi! I'm the ArchiveBox maintainer and I just found your project.

It looks pretty sweet. I've been dreaming about in-browser archiving for a while now and actually implemented my own Puppeteer/CDP extension to do something very similar to yours (it records live pages from within the browser extension context and saves them into ArchiveBox).

I have a ton of asset-extraction and browser-automation-detection-avoidance snippets (10k+ LOC) to share if you're interested; maybe it could save you a lot of time with your work.

ArchiveBox's core is still focused on saving on a separate machine, but I'm happy to share my side-project work on in-browser archiving with other projects so it doesn't go to waste.

Would love to have a call/chat sometime if you're interested:
https://calendly.com/nicksweeting/choose-a-time or https://sweeting.me/#contact (click for email addr)

Also, you should go to DWeb Camp (https://dwebcamp.org/); it's the best archiving conference imo. It's not marketed very heavily, but lots of great people attend, including the Webrecorder team and Archive.org.

oxij commented Jul 13, 2024

Hi! Thanks for the compliments!

Reading your post I have a feeling you have a slightly different use case in mind compared to what I'm trying to do here.

It seems to me, you want this:

  • you give ArchiveBox a URL,
  • it opens a browser window/tab running pWebArc (the extension) with that URL (or it gives that URL to an already running pWebArc via WebSockets or some such and asks pWebArc to open a new tab with that URL),
  • pWebArc captures everything in that tab, and
  • sends the dump back to ArchiveBox.

Which is a valid use case, and I'm willing to help support it where it does not hurt pwebarc (the project), but it is not what this project is designed to do.

Creation of self-sufficient website archives that can be shared with others is a very low priority for pwebarc, which mainly exists to

  • passively collect everything I see in the most efficient way possible so that I could refer back to it later, even after it vanishes from the web (which happens all the time to me), and to

  • actively scrape pages hidden behind complex authentication (that requires a web browser), so that I could immediately feed the captured HTML to a custom script to, e.g., convert bank statements to ledger-cli files, read blog posts and web novels with a proper e-book reader app instead of the browser, where typography, the text layout engine, and TTS really suck, etc.

So, as noted in the FAQ, I feel like adding anti-bot-blocking and anti-browser-automation-detection to pWebArc itself would not help my archiving goals here.
Especially since I want pwebarc to avoid "success at all costs" and to follow the KISS principle as much as possible (which, objectively speaking, is not that much for pWebArc, given the messy web tech it has to use, but there's no need to make it worse).

So, in your use case, to make integration with a smart archiving server (like ArchiveBox) simpler, it could be useful to add a REST call to the dumping protocol so that pWebArc could notify the archiving server when the whole page has finished loading.
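
Something as minimal as this sketch would do on the extension side; the /pwebarc/done endpoint name and the JSON payload here are purely hypothetical, nothing like it exists in the protocol yet:

```typescript
// Hypothetical sketch only: tell the archiving server that a page has
// finished loading. The endpoint path and payload shape are made up here;
// they are not part of the current dumping protocol.
async function notifyPageDone(serverUrl: string, tabUrl: string): Promise<void> {
    await fetch(`${serverUrl}/pwebarc/done`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            url: tabUrl,            // the page that finished loading
            finishedAt: Date.now(), // when it finished, ms since epoch
        }),
    });
}
```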

Alternatively, pWebArc will get in-browser local storage persistence in the next version (implementing it turned out to be much more annoying than I expected, even with all my pre-planning in the sources), and navigation tracking will be coming after that.
So, pWebArc could, in principle, soon gain the ability to efficiently produce and submit (to an archiving server) batch-dumps pertaining to a single navigation in a single tab (by doing a large streaming POST request with a concatenation of all the collected dumps), instead of doing a bunch of separate per-request dumps as it does now.
Though, I'm not sure how useful that would be.
IMHO, the current way + a "done" message is way simpler.
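
For illustration, the batch submission described above could boil down to something like this sketch (the endpoint name and the "just concatenate opaque dump blobs" framing are assumptions, not the actual dumping protocol):

```typescript
// Rough sketch of the batch-dump idea: submit everything collected for one
// navigation in one tab as a single POST whose body is the concatenation of
// the individual per-request dumps. The endpoint name and the treatment of
// dumps as opaque blobs are illustrative assumptions.
async function submitBatch(serverUrl: string, dumps: Uint8Array[]): Promise<void> {
    const body = new Blob(dumps); // concatenates the collected dumps in order
    await fetch(`${serverUrl}/pwebarc/batch`, {
        method: "POST",
        headers: { "Content-Type": "application/octet-stream" },
        body,
    });
}
```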

Also, I'm toying with the idea of pWebArc storing its config on the archiving server, which could also help a bit.

Anyway, if I were you and I wanted to use a near-future version of pWebArc with ArchiveBox within my interpretation of your use case, I would put those 10k+ LOC of autopilot and anti-bot-blocking code into a separate extension (or a set of UserScripts, or some such) and then add two public in-browser sendMessage-based API calls to pWebArc (roughly sketched right after this list):

  • "please DOM-snapshot tab #N after everything finishes loading there", and
  • "please submit everything pertaining to that tab to the archiving server".

So, the integration with ArchiveBox then becomes this:

  • ArchiveBox opens a new browser tab with the given URL,
  • the autopilot extension does its thing while pWebArc captures everything,
  • once the autopilot decides it's done, it asks pWebArc to snapshot the tab and submit the whole capture to ArchiveBox,
  • pWebArc does that,
  • ArchiveBox archives the result.

Ta-da!
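
For concreteness, the ArchiveBox side of that flow might look roughly like this sketch, assuming ArchiveBox drives a non-headless Chromium via Puppeteer with builds of both extensions pre-loaded (none of this exists today):

```typescript
import puppeteer from "puppeteer";

// Sketch only: open the URL in a browser that has pWebArc and the autopilot
// extension loaded, then let the extensions do the rest. Paths and flags are
// illustrative; loading extensions requires a non-headless Chromium.
async function archiveWithExtensions(url: string): Promise<void> {
    const extensions = ["/path/to/pwebarc", "/path/to/autopilot"].join(",");
    const browser = await puppeteer.launch({
        headless: false,
        args: [
            `--disable-extensions-except=${extensions}`,
            `--load-extension=${extensions}`,
        ],
    });
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    // From here the autopilot decides when the page is "done", asks pWebArc
    // to snapshot and submit it, and ArchiveBox receives the capture on its
    // dump endpoint; the browser can then be closed.
}
```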

Thoughts?

As to DWeb Camp: it's on the other side of the planet, it would also require a US visa, going through TSA, and then spending days in a forest talking to random people instead of doing productive stuff. No, thanks. :]

As to chatting: I consider synchronous communications like IM or voice/video calls to be poison to productive deep work, so I'm trying really hard to minimize the amount of those too.
Instead, I love e-mail.
If you want to send me something private, my e-mail can be seen both in commit messages and in plain text at https://oxij.org/#contact (SMTP gray-listing drops 80% of spam immediately, honeypot addresses + a very simple compression-based classifier eat 99.9% of the rest, so, IMHO, hiding e-mails behind JavaScript makes no sense, just saying :]).

But I don't see why we can't discuss this topic publicly here.

oxij changed the title from "This project looks awesome! Nice Work" to "Puppeteering and integration with ArchiveBox" on Jul 13, 2024

pirate commented Jul 13, 2024

Oh, I wasn't intending to propose integrating with ArchiveBox at all! I just meant I have some code I can share that you might find useful for your own project (the in-browser page saving logic to create WARCs with more than just HTML+images, e.g. handling iframes, shadow DOMs, WebRTC, etc.).
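
(For example, a naive document.documentElement.outerHTML snapshot misses open shadow roots entirely; you end up walking them yourself, roughly like the sketch below, and closed shadow roots and cross-origin iframes need extra tricks on top of that.)

```typescript
// Rough sketch: collect the serialized contents of open shadow roots, which a
// plain document.documentElement.outerHTML snapshot would otherwise miss.
// Closed shadow roots and cross-origin iframes are not handled here.
function collectShadowRoots(root: ParentNode, out: string[] = []): string[] {
    for (const el of Array.from(root.querySelectorAll("*"))) {
        if (el.shadowRoot) {
            out.push(el.shadowRoot.innerHTML);      // serialized shadow content
            collectShadowRoots(el.shadowRoot, out); // shadow roots can nest
        }
    }
    return out;
}
```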

And no worries if DWeb Camp is not your thing, just figured I'd share it in case you were nearby.

Here is fine for discussion too, no need to call.

oxij commented Jul 22, 2024 via email
