Web UI for searching #6
Hi! Thanks for the kind words!
Yes, currently `serve` has no full-text search and all reqres filtering with `find` and such is done without indexes.
And yes, indexes are planned, eventually.
To elaborate a little, I actually have another, yet unpublished, bunch of scripts which I plan to rework into a single-command app I will probably simply call `hoardy`.
Those scripts mainly do file de-duplication (a-la `fdupes`, but with an index); set operations on directories (e.g. "get me paths to all unique files in this directory (ignoring duplicates)", "list files common to these three directories", "list files present in one of these two directories but missing from the third", etc); and syncing of those sets across disks and hosts ("copy all files present in this directory and matching this filter to another host if they are not present there and also not present on yet another host") efficiently (not linearly, like `rsync`, but with Merkle-trees over indexes).
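(For illustration, a rough sketch of the Merkle-tree idea, with all names hypothetical rather than taken from the unpublished scripts: hash files into leaves and directories into nodes over their sorted children, so two trees can be compared root-first and identical subtrees skipped entirely.)

```python
# Hypothetical sketch of "Merkle-trees over indexes" (not the actual
# unpublished code): leaf hashes come from file contents (in practice,
# from a persistent index), directory hashes from the sorted
# (name, child hash) pairs, so equal hashes mean equal subtrees and a
# sync can skip them without walking them, unlike rsync's linear scan.
import hashlib
import os

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_hash(root):
    entries = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        child = tree_hash(path) if os.path.isdir(path) else file_hash(path)
        entries.append(name + "\0" + child + "\n")
    return hashlib.sha256("".join(entries).encode()).hexdigest()

# If tree_hash(a) == tree_hash(b), the two directories hold identical
# content and a sync can skip both subtrees wholesale.
```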
My current plan is to rework and publish that first, then split out my indexing code from there into a separate library (or put it into `kisstdlib`, maybe), which I would then reuse in `hoardy-web` to add indexing here.
Also, realistically, a good search page needs to be completely asynchronous and `hoardy-web serve` is completely synchronous at the moment, so I also need to clean up and publish my KISS `asyncio` modules to `kisstdlib` (I hate the standard `asyncio`, sorry, not sorry) and either find a compatible `HTTP` protocol parser or write my own first...
So, it will probably take a while to do this properly.
But, I suppose, if this is super-important to you, I would not be completely opposed to accepting a hacky implementation with a simple `sqlite` index, with the understanding that this will be re-implemented in the future, and the DB format will not be compatible with the future version and everything will need to be re-indexed.
(Though, if you plan to do this, then please wait a couple of days before starting, because I have a huge change re-formatting everything with `black` and then a bunch of whole-repo edits fixing many `pylint` warnings. I'm currently debugging these changes, because they unexpectedly broke tests, and I'm trying to figure out why as we speak.)

---

> ... I have a huge change re-formatting everything with `black` and then a bunch of whole-repo edits fixing many `pylint` warnings. I'm currently debugging these changes, because they unexpectedly broke tests, and I'm trying to figure out why as we speak. ...

This bit is done now.

---

Thanks for the quick and detailed reply (and sorry for the slow response).
I'm not sure what you mean by this, could you explain? I haven't fully fleshed out the whole search HTTP query architecture, but one possible way is to have requests return a fixed number of results (e.g. 100) along with a continuation token. If sqlite can cough up 100 results at a time quickly enough (this depends on the indexing structure used as well, of course, but 10ms--100ms should be doable), then long-running queries would be broken up into many individual requests, which would prevent one long query from stalling the entire server for everyone.

I am likely also missing background context on how the hoardy-web web server is deployed; e.g. I would assume that it's intended for a small number of occasional users, and that a bit of tail latency from concurrent search requests is no big deal, but I could be totally wrong there.
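For illustration, a minimal sketch of the continuation-token idea (hypothetical `reqres` table and column names, not hoardy-web's actual schema): the token is just the last rowid the client saw, so each page is a cheap indexed query and the server keeps no cursor state.

```python
# Hypothetical sketch of keyset pagination with a continuation token:
# the token is the last rowid the client saw, so each page is an
# indexed O(page size) query and the server holds no cursor state.
import sqlite3

PAGE = 100

def search_page(db, url_like, token=0):
    rows = db.execute(
        "SELECT id, url FROM reqres"
        " WHERE id > ? AND url LIKE ?"
        " ORDER BY id LIMIT ?",
        (token, url_like, PAGE)).fetchall()
    # when a full page came back, hand the client the last id as the token
    next_token = rows[-1][0] if len(rows) == PAGE else None
    return rows, next_token
```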
Sounds reasonable to me :) I'll see what I can come up with
nice :)

---

> > Also, realistically, a good search page needs to be completely asynchronous
>
> I'm not sure what you mean by this, could you explain? I haven't fully fleshed out the whole search HTTP query architecture, but ...

I mean, the problem with any broken-up-with-HTTP-continuations synchronous design is that HTTP requests themselves will still be processed synchronously, so while the search is generating its 100 results, the archiving will stop working.
You can put the search into a separate OS thread, and query its state periodically instead, I suppose, but then it's hard to know when that thread should stop if the user closes the relevant page.
A good async implementation would use WebSockets to return search results, solving both issues.

---

Ah yes this is true, but I am less concerned for now, since:

- archiving requests are not latency sensitive; as long as _throughput_ isn't unduly affected, a tail latency increase is not ideal but also likely unnoticeable in practice
- there are workarounds like running multiple workers with uwsgi / gunicorn / etc (if, as I believe, the server is "stateless"), or running separate servers, one for archival requests and one for search requests
- if you're planning a sync -> async change, that's something that can likely be done without an extra search endpoint making it more difficult

So I will focus for now on getting something that works, and reconsider this once I have a clearer idea of actual performance dynamics.
Possible! That thread could be limited to precomputing the next N results, with an eventual timeout -- but the first request must either be done synchronously or have special handling, which would be nice to avoid if possible.
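A possible shape for that (a hypothetical sketch; `SearchJob` and its limits are made up, not a proposed API): a worker thread precomputes at most N results into a bounded queue and exits on its own if nobody polls it for a while.

```python
# Hypothetical sketch of the bounded-precompute idea: the worker fills
# a queue of at most n upcoming results; if no client polls for
# IDLE_TIMEOUT seconds, it assumes the page was closed and exits.
import queue
import threading
import time

IDLE_TIMEOUT = 60.0

class SearchJob:
    def __init__(self, results_iter, n=100):
        self.results = queue.Queue(maxsize=n)  # at most n precomputed results
        self.last_poll = time.monotonic()
        threading.Thread(target=self._run, args=(results_iter,),
                         daemon=True).start()

    def _run(self, results_iter):
        for result in results_iter:
            while True:  # block on the full queue, but keep checking liveness
                if time.monotonic() - self.last_poll > IDLE_TIMEOUT:
                    return  # client seems gone; stop the search
                try:
                    self.results.put(result, timeout=1.0)
                    break
                except queue.Full:
                    continue

    def next_batch(self, k=100):
        self.last_poll = time.monotonic()
        batch = []
        for _ in range(k):
            try:
                batch.append(self.results.get(timeout=0.1))
            except queue.Empty:
                break  # fewer than k ready right now; sketch keeps it simple
        return batch
```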
I'm not familiar with how websockets would work with a Flask server -- as I understand it, websockets are incompatible with the WSGI protocol, so they'd presumably need some extra handling with the actual HTTP server bit, which might complicate your plan to change to async. But if you are strongly in favour of a websocket implementation, it should also be not too much work to change a standard Flask-route-based implementation to use websockets, once that standard implementation exists :)

---

> > while the search is generating its 100 results, the archiving will stop working
>
> - archiving requests are not latency sensitive; as long as _throughput_ isn't unduly affected, a tail latency increase is not ideal but also likely unnoticeable in practice

Depends on the search speed, I suppose.
Having thousands of reqres waiting in extension memory would be annoying.

> - there are workarounds like running multiple workers with uwsgi / gunicorn / etc (if, as I believe, the server is "stateless"), or running separate servers, one for archival requests and one for search requests

The server is stateful since archiving -> dump parsing -> indexing is stateful.

> - if you're planning a sync -> async change, that's something that can likely be done without an extra search endpoint making it more difficult

Yes, which is why I put it off for later. :)

> So I will focus for now on getting something that works, and reconsider this once I have a clearer idea of actual performance dynamics

Meanwhile, I'm actively working on cleaning up and publishing my file indexer.

> I'm not familiar with how websockets would work with a Flask server

`hoardy-web` uses Bottle, not Flask; Flask is too complex for me.
(Bottle is too, a bit.
I would prefer a bare-HTTP "framework" with request dispatch instead of wrappers over WSGI/CGI/FCGI.
But it is the simplest thing I know of, ATM, so `hoardy-web` uses it.
Yes, I'm very opinionated.)

> -- as I understand it, websockets are incompatible with the WSGI protocol, so they'd presumably need some extra handling with the actual HTTP server bit, which might complicate your plan to change to async.

WebSockets: you make an HTTP request, it ends with "101 Switching Protocols", and the rest of the connection is now a WebSockets connection.
The WebSockets protocol is, basically, message-based TCP: guaranteed order and delivery, but with separate typed messages instead of a plain byte stream.
But, since it's bidirectional, both sides can notice when the other disconnects or just stops working (there's a `PING` message type).
So, as to your statement: not really, search would simply spawn a separate thread (OS or async, does not matter) and quietly work away, talking to its own WebSocket.
And immediately stop if that socket dies.
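A minimal sketch of that design (using the third-party `websockets` library, recent versions of which accept single-argument handlers, and a made-up blocking `search_index` generator standing in for the real index lookup):

```python
# Hypothetical sketch: each search talks to its own WebSocket and dies
# with it. The library PINGs the peer periodically by default, so a
# vanished client eventually surfaces as ConnectionClosed.
import asyncio
import json
import websockets

def search_index(query):
    # Stand-in for the real index lookup; yields results lazily.
    for i in range(10_000):
        yield {"query": query, "rank": i}

async def search_handler(websocket):
    query = await websocket.recv()  # first message carries the query
    try:
        for result in search_index(query):
            await websocket.send(json.dumps(result))
            await asyncio.sleep(0)  # let other connections make progress
    except websockets.exceptions.ConnectionClosed:
        pass  # the socket died: stop the search immediately

async def main():
    async with websockets.serve(search_handler, "127.0.0.1", 8765):
        await asyncio.Future()  # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```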

---

> The server is stateful since archiving -> dump parsing -> indexing is stateful.

This is only stateful because the serve path's index is maintained in memory (in the `SortedIndex`), right?

Of course you may be planning to go in a completely different direction~ but just for reference, I have some prototype code implementing that interface via sqlite on disk in aidanholm@33b8b9d9d08a; this is still hacked together and assumes wrr only, only one archival dir, etc, but seems to be working nicely given those constraints.

With a very low-effort schema, I got around a 1.4 MB database for 275 MB of (hopefully representative) wrr files, so around 1.5% overhead; serving startup is now "instant", and IIUC this would make the server stateless.
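(The linked commit is the authoritative version; purely as a strawman with hypothetical names, a low-effort schema along those lines might look like this: one metadata row per archived reqres, indexed by URL and timestamp, so lookups need not re-read the wrr files.)

```python
# Strawman sketch, not the prototype's actual schema: one row of
# metadata per reqres, pointing back at its wrr file on disk.
import sqlite3

def open_index(path):
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS reqres (
            id     INTEGER PRIMARY KEY,
            path   TEXT NOT NULL,   -- wrr file on disk
            url    TEXT NOT NULL,
            ts     REAL NOT NULL,   -- time of the visit
            status INTEGER
        );
        CREATE INDEX IF NOT EXISTS reqres_url_ts ON reqres (url, ts);
    """)
    return db
```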
I can only agree :) I've used Flask a fair bit at $WORK and found it deceptively simple (haven't tried Bottle yet), so this stance makes complete sense to me.

---

> > The server is stateful since archiving -> dump parsing -> indexing is stateful.
>
> This is only stateful because the serve path's index is maintained in memory (in the `SortedIndex`), right?

Yes, but even if it were not, replays have to be synchronous with dumping anyway, otherwise my plans for replay buttons in the extension will break.
I want those buttons to work even if the tab in question is not yet fully fetched: they would wait for everything to fetch and get archived, and then immediately switch to the replay.
Which needs the replay to be synchronous with archival.

> I have some prototype code implementing that interface via sqlite on disk in aidanholm@33b8b9d9d08a; this is still hacked together and assumes wrr only, only one archival dir, etc, but seems to be working nicely given those constraints

Yes, this is basically what I expect it would look like.
(Also, `SortedIndex` clearly needs a generic interface.)

> With a very low-effort schema, I got around a 1.4 MB database for 275 MB of (hopefully representative) wrr files, so around 1.5% overhead; serving startup is now "instant", and IIUC this would make the server stateless

Your implementation is cute, but this won't work for `WRR` bundles etc., as those need to be addressed at sub-file granularity.
Also, full-text indexing will need indirections to deduplicate indexing of same-content data.
The complete version won't be as cute, unfortunately.
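One possible shape for that indirection (a hypothetical sketch, assuming sqlite built with the FTS5 extension; none of these names come from hoardy-web): bodies are keyed by content hash and FTS-indexed once, and any number of reqres rows can point at the same body.

```python
# Hypothetical sketch of deduplicated full-text indexing: a body is
# indexed once per content hash; reqres rows share it by reference.
# The contentless FTS5 table indexes text without storing a copy of it.
import hashlib
import sqlite3

def open_fts(path):
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS body (
            id     INTEGER PRIMARY KEY,
            sha256 TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS reqres (
            id      INTEGER PRIMARY KEY,
            url     TEXT NOT NULL,
            body_id INTEGER REFERENCES body(id)
        );
        CREATE VIRTUAL TABLE IF NOT EXISTS body_fts
            USING fts5(text, content='');
    """)
    return db

def index_body(db, text):
    digest = hashlib.sha256(text.encode()).hexdigest()
    row = db.execute("SELECT id FROM body WHERE sha256 = ?",
                     (digest,)).fetchone()
    if row is not None:
        return row[0]  # same content seen before: already indexed
    body_id = db.execute("INSERT INTO body (sha256) VALUES (?)",
                         (digest,)).lastrowid
    db.execute("INSERT INTO body_fts (rowid, text) VALUES (?, ?)",
               (body_id, text))
    return body_id
```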

---

> replays have to be synchronous with dumping anyway, otherwise my plans for replay buttons in the extension will break.

I'm not sure if I understand correctly; do you mean that if there is an ongoing dumping request for a URL, then any replay request for that same URL should block until the dump completes? Wouldn't this also require the replay server and the archival/dumping server to share the index (or share at least _some_ state)?
IIUC a wrr bundle is basically several wrr files directly byte-concatenated? An offset column could be added (and a size column, if it cannot be inferred by the wrrb loader).
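(For illustration, a hypothetical sketch of what that would buy: with `offset` and `size` columns, a member of a byte-concatenated bundle can be read with a single seek, no bundle-wide scan needed.)

```python
# Hypothetical sketch of sub-file addressing via offset/size columns
# (made-up schema): read one bundle member directly with a seek.
import sqlite3

def read_dump(db, reqres_id):
    path, offset, size = db.execute(
        "SELECT path, offset, size FROM reqres WHERE id = ?",
        (reqres_id,)).fetchone()
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```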
I am currently playing around a bit with sqlite's full text search -- it is possible to index text without storing a copy of the indexed content, and indexing response bodies only for ...

---

> > replays have to be synchronous with dumping anyway, otherwise my plans for replay buttons in the extension will break.
>
> I'm not sure if I understand correctly; do you mean that if there is an ongoing dumping request for a URL, then any replay request for that same URL should block until the dump completes?

No, that's a bit too strict. (And I'm not sure how one could make `hoardy-web serve` ever guarantee that.)
But, as a client, if you dump a new visit for a URL, the server says `200 OK`, and hence you immediately ask for a replay of this same URL, the server should replay the latest version, not some version from before.

> Wouldn't this also require the replay server and the archival/dumping server to share the index (or share at least _some_ state)?

Yes, but that's kind of the point of having them in the same process.
Which is why it needs to be properly async.
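(A toy sketch of the resulting read-your-writes property, with hypothetical names: because dump and replay share one in-process index and the dump handler updates it before replying `200 OK`, any replay issued after that reply necessarily sees the new visit.)

```python
# Toy sketch (hypothetical names): dump inserts into the shared
# in-process index *before* the handler replies 200 OK, so a replay
# issued after that reply is guaranteed to find the new visit.
import threading

class SharedIndex:
    def __init__(self):
        self._lock = threading.Lock()
        self._by_url = {}  # url -> list of archived visits, oldest first

    def dump(self, url, data):
        with self._lock:
            self._by_url.setdefault(url, []).append(data)
        # only after this returns does the HTTP handler send "200 OK"

    def replay_latest(self, url):
        with self._lock:
            visits = self._by_url.get(url)
            return visits[-1] if visits else None
```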

---

Ah I see, makes sense.

I am currently running `hoardy_web_sas.py` separately to `hoardy_web serve`; not sure how supported this configuration is in general, but I'd guess the extension's replay button feature would only work / be available when connected to a server with both archival and serving enabled?

---

> I am currently running `hoardy_web_sas.py` separately to `hoardy_web serve`; not sure how supported this configuration is in general,

It would work fine if you disable capture before going to a replay URL, otherwise replays would get archived too.

> but I'd guess the extension's replay button feature would only work / be available when connected to a server with both archival and serving enabled?

Correct.

---

Hi! I am loving hoardy-web so far :)
I have got `hoardy-web serve` up and running successfully and serving archived websites, but one feature that would be wonderful to have is a search UI. Are there any plans to implement one, or would a PR with such a feature, tastefully implemented of course, be likely to be accepted?

It also looks like the finding and filtering functionality in `map_wrr_paths` is unindexed, which would definitely affect the speed of such a search interface; AFAICT these read all files on disk for each query. I am thinking of throwing some indices into a sqlite db in the root of each data store, but not sure if you've already got plans in this area?