
Web UI for searching #6

Open
aidanholm opened this issue Dec 30, 2024 · 12 comments

@aidanholm

Hi! I am loving hoardy-web so far :)

I have got hoardy-web serve up and running successfully and serving archived websites, but one feature that would be wonderful to have is a search UI. Are there any plans to implement one, or would a PR with such a feature, tastefully implemented of course, be likely to be accepted?

It also looks like the finding and filtering functionality in map_wrr_paths is unindexed, which would definitely affect the speed of such a search interface; AFAICT these functions read all files on disk for each query. I am thinking of throwing some indices into a sqlite db in the root of each data store, but I'm not sure if you've already got plans in this area?
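
For concreteness, a rough sketch of what such a per-store index might look like (all table, column, and function names here are hypothetical, not anything hoardy-web currently defines):

```python
import sqlite3

# A minimal sketch of a per-store index; all names here are hypothetical.
def open_index(store_root: str) -> sqlite3.Connection:
    db = sqlite3.connect(f"{store_root}/index.sqlite")
    db.execute(
        """
        CREATE TABLE IF NOT EXISTS visits (
            path   TEXT NOT NULL,  -- WRR file path relative to the store root
            url    TEXT NOT NULL,  -- request URL
            ts     REAL NOT NULL,  -- visit timestamp (seconds since epoch)
            status INTEGER,        -- HTTP response status
            mime   TEXT            -- response content-type
        )
        """
    )
    db.execute("CREATE INDEX IF NOT EXISTS visits_url_ts ON visits (url, ts)")
    return db

# Example lookup: latest visit for a URL without touching any WRR files on disk.
def latest_visit(db: sqlite3.Connection, url: str):
    return db.execute(
        "SELECT path, ts FROM visits WHERE url = ? ORDER BY ts DESC LIMIT 1",
        (url,),
    ).fetchone()
```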

@oxij
Member

oxij commented Dec 30, 2024 via email

@oxij
Member

oxij commented Dec 31, 2024 via email

@aidanholm
Author

Thanks for the quick and detailed reply (and sorry for the slow response)

> Also, realistically, a good search page needs to be completely asynchronous

I'm not sure what you mean by this, could you explain?

I haven't fully fleshed out the whole search HTTP query architecture, but one possible way is to have requests return a fixed number of results (e.g. 100) along with a continuation token -- if sqlite can cough up 100 results at a time (and this depends on the indexing structure used as well, of course, but 10ms--100ms should be doable), then long-running queries would be broken up into many individual requests, which would prevent one long query from stalling the entire server for everyone.
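
As a rough sketch of that pagination contract only (the `visits` table is the hypothetical one from my earlier comment, and the continuation token is simply the rowid of the last result already returned; the LIKE match is just a stand-in for a real indexed lookup):

```python
import json
import sqlite3

PAGE_SIZE = 100  # fixed number of results per request

def search_page(db: sqlite3.Connection, query: str, after: int = 0) -> str:
    # Fetch the next page of matches after the given continuation token.
    rows = db.execute(
        "SELECT rowid, url, ts FROM visits WHERE url LIKE ? AND rowid > ? "
        "ORDER BY rowid LIMIT ?",
        (f"%{query}%", after, PAGE_SIZE),
    ).fetchall()
    return json.dumps({
        "results": [{"url": url, "ts": ts} for _, url, ts in rows],
        # The client passes this back as the next `after` value;
        # null means the result set is exhausted.
        "next": rows[-1][0] if len(rows) == PAGE_SIZE else None,
    })
```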

I am likely also missing background context on how the hoardy-web web server is deployed; e.g. I would assume that it's intended for a small number of occasional users, and that a bit of tail latency from concurrent search requests is no big deal, but I could be totally wrong there

> But, I suppose, if this is super-important to you, I would not be completely opposed to accepting a hacky implementation with a simple sqlite index, with an understanding that this will be re-implemented in the future, and the DB format will not be compatible with the future version and everything will need to be re-indexed.

Sounds reasonable to me :) I'll see what I can come up with

> Though, if you plan to do this, then please wait a couple of days before starting, because I have a huge change re-formatting everything with black ...
> This bit is done now.

nice :)

@oxij
Member

oxij commented Jan 17, 2025 via email

@aidanholm
Author

> while the search is generating its 100 results, the archiving will stop working

Ah, yes, this is true, but I am less concerned for now, since:

  • archiving requests are not latency sensitive; as long as throughput isn't unduly affected, a tail latency increase is not ideal but also likely unnoticeable in practice
  • there are workarounds like running multiple workers with uwsgi / gunicorn / etc (if, as I believe, the server is "stateless"), or running separate servers, one for archival requests and one for search requests
  • it's not yet clear what query latencies would be like in practice, so maybe they'd be low enough to not have to handle specially, and
  • if you're planning a sync -> async change, that's something that can likely be done without an extra search endpoint making it more difficult

So I will focus for now on getting something that works, and reconsider this once I have a clearer idea of actual performance dynamics

> You can put the search into a separate OS thread, and query its state periodically instead, I suppose, but then it's hard to know when that thread should stop if the user closes the relevant page.

Possible! That thread could be limited to precomputing the next N results, with an eventual timeout -- but the first request must either be done synchronously or have special handling, which would be nice to avoid if possible
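
Something like the following is what I have in mind for that variant (standard library only; where the batches come from is left hypothetical). The worker exits on its own after an idle timeout, which partially addresses the "when should the thread stop" problem:

```python
import queue
import threading

IDLE_TIMEOUT = 60.0  # seconds without a consumer before the worker gives up

class SearchSession:
    """Precomputes result batches in a background thread (sketch only)."""

    def __init__(self, batches):
        # `batches` is any iterator yielding lists of results (hypothetical).
        self._queue = queue.Queue(maxsize=2)  # keep at most 2 batches ahead
        self._thread = threading.Thread(target=self._run, args=(batches,), daemon=True)
        self._thread.start()

    def _run(self, batches):
        for batch in batches:
            try:
                self._queue.put(batch, timeout=IDLE_TIMEOUT)
            except queue.Full:
                return  # nobody asked for results for a while; stop working
        self._queue.put(None)  # sentinel: no more results

    def next_batch(self, timeout=30.0):
        # The first call still blocks until the first batch exists, which is
        # exactly the synchronous-first-request problem mentioned above.
        return self._queue.get(timeout=timeout)
```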

> A good async implementation would use WebSockets to return search results, solving both issues.

I'm not familiar with how WebSockets would work with a Flask server -- as I understand it, WebSockets are incompatible with the WSGI protocol, so they'd presumably need some extra handling in the actual HTTP server bit, which might complicate your plan to change to async. But if you are strongly in favour of a WebSocket implementation, it should also be not too much work to change a standard Flask route-based implementation to use WebSockets, once that standard implementation exists :)
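
For comparison, here is a minimal sketch of what the WebSocket variant might look like using the third-party `websockets` package, run as its own asyncio server next to the WSGI app (`run_search` is a hypothetical generator of result batches; older versions of the library also pass a path argument to the handler):

```python
import asyncio
import json
import websockets

async def handler(ws, path=None):
    query = await ws.recv()               # client sends the search query
    for batch in run_search(query):       # hypothetical generator of result batches
        await ws.send(json.dumps(batch))  # stream batches as they are produced

async def main():
    async with websockets.serve(handler, "127.0.0.1", 8765):
        await asyncio.Future()            # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```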

@oxij
Member

oxij commented Jan 18, 2025 via email

@aidanholm
Author

> The server is stateful since archiving -> dump parsing -> indexing is stateful.

This is only stateful because the serve path's index is maintained in memory (in the SortedIndex), right?

> Meanwhile, I'm actively working on cleaning up and publishing my file indexer.

Of course you may be planning to go in a completely different direction~ but just for reference, I have some prototype code implementing that interface via sqlite on disk in aidanholm@33b8b9d9d08a; this is still hacked together and assumes wrr only, only one archival dir, etc, but seems to be working nicely given those constraints

With a very low-effort schema, I got around a 1.4 MB database for 275 MB of (hopefully representative) wrr files, so around 0.5% overhead; serving startup is now "instant", and IIUC this would make the server stateless.

> hoardy-web uses Bottle, not Flask, Flask is too complex for me.

I can only agree :) I've used Flask a fair bit at $WORK and found it deceptively simple (haven't tried Bottle yet), so this stance makes complete sense to me.

@oxij
Member

oxij commented Jan 19, 2025 via email

@aidanholm
Author

> replays have to be synchronous with dumping anyway, otherwise my plans for replay buttons in the extension will break.

I'm not sure if I understand correctly; do you mean that if there is an ongoing dumping request for a URL, then any replay request for that same URL should block until the dump completes?

Wouldn't this also require the replay server and the archival/dumping server to share the index (or share at least some state)?

> this won't work for WRR-bundles and etc, as those need to be sub-file.

IIUC a wrr bundle is basically individual wrr files directly byte-concatenated? An offset column could be added (and a size column, if it cannot be inferred by the wrrb loader).
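
For illustration, assuming the index grows hypothetical `offset` and `size` columns, reading a record back out of a bundle would be just a seek and a bounded read:

```python
def read_record(path: str, offset: int, size: int) -> bytes:
    # Read one record out of a concatenated .wrrb bundle given the byte range
    # recorded in the (hypothetical) index; a plain .wrr file is just the
    # degenerate case of offset=0 and size=the whole file.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```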

> Also, full-text indexing will need indirections to deduplicate indexing of same-content data.

I am currently playing around a bit with sqlite's full-text search -- it is possible to index text without storing a copy of the indexed content, and indexing response bodies only for textual response content-types results in reasonably small indexes even without doing any content deduplication; I got index sizes of about 4% of the data store.
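
Concretely, the sqlite feature in question is a "contentless" FTS5 table (`content=''`), which keeps only the inverted index and no copy of the indexed text; a rough sketch, with the rowid mapping back to the hypothetical visits table from my earlier comment:

```python
import sqlite3

db = sqlite3.connect("index.sqlite")
# content='' makes this a contentless FTS5 table: sqlite keeps only the
# full-text index, not a copy of the indexed text itself.
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS body_ft USING fts5(body, content='')")

# Index a response body under an explicit rowid that maps back to the
# corresponding row in the (hypothetical) visits table.
def index_body(rowid: int, text: str) -> None:
    db.execute("INSERT INTO body_ft (rowid, body) VALUES (?, ?)", (rowid, text))

# Queries return matching rowids only; the actual content has to be fetched
# from the archived files, since the FTS table stores none of it.
def search_bodies(query: str):
    return [r[0] for r in db.execute("SELECT rowid FROM body_ft WHERE body_ft MATCH ?", (query,))]
```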

@oxij
Member

oxij commented Jan 19, 2025 via email

@aidanholm
Author

> But, as a client, if you dump a new visit for a URL, the server says 200 OK, and hence you immediately ask for a replay of this same URL, the server should replay the latest version, not some version from before.

Ah I see, makes sense

> Yes, but that's kind of the point of having them in the same process.

I am currently running hoardy_web_sas.py separately from hoardy_web serve; I'm not sure how supported this configuration is in general, but I'd guess the extension's replay button feature would only work / be available when connected to a server with both archival and serving enabled?

@oxij
Member

oxij commented Jan 19, 2025 via email
