Add ability to find events by slide text & captions in search #1189
base: master
Conversation
Looked through the code and tested a bunch but didn't find any obvious issues.
I'll do a final round of testing today and then this can be merged. My comments are of no real concern and definitely not blockers.
These tables will hold the texts of events, extracted from subtitles and slide texts, which will become searchable later. The queue is used for fetching all those text assets from Opencast.
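For illustration only, here is a rough Rust sketch of what one row per table could carry — the actual column names and types in the migration are not spelled out here and may differ:

```rust
// Purely hypothetical row shapes; not the real schema.
struct EventTextRow {
    event_id: i64,
    span_start_ms: u64,   // where this text appears in the video
    span_end_ms: u64,
    content: String,      // the extracted text itself
    kind: TextKind,
}

enum TextKind {
    Subtitle,
    SlideText,
}

struct EventTextQueueEntry {
    event_id: i64,
    fetch_after: std::time::SystemTime, // earliest time the worker should (re)try
    retry_count: u32,
}
```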
This is useful for specifying additional trusted hosts that Tobira may send the sync login data to.
This will be part of the worker and is able to deal with a variety of error cases. Figuring all this out took quite some time. I decided that ignoring assets for which Opencast returns something unexpected is fine most of the time; admins will be able to easily requeue these failed events. It can also deal with network errors and similar indications that Opencast is not available at the moment, using exponential backoff in that case.
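As a rough sketch of the retry/backoff part described here (not the actual worker code; `fetch_with_backoff`, the initial delay, and the cap are made up for illustration):

```rust
use std::time::Duration;
use tokio::time::sleep;

/// Minimal sketch: retry a download with exponential backoff. The real worker
/// additionally distinguishes "asset looks broken, skip it" from "Opencast
/// seems to be down, try again later"; this only shows the backoff idea.
async fn fetch_with_backoff<T, E, F, Fut>(mut fetch: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_secs(1);
    let mut attempt: u32 = 1;
    loop {
        match fetch().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay).await;
                // Double the delay each time, capped at 10 minutes (arbitrary cap).
                delay = (delay * 2).min(Duration::from_secs(600));
                attempt += 1;
            }
        }
    }
}
```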
What... like the explicit color choice is there to override terminal detection. So why...?!
This was forgotten before: some assets may no longer exist after an event was updated, and those entries shouldn't persist in the `event_texts` table. By deleting all entries beforehand, we can also easily use a bulk insert now (since we don't require `on conflict`). I extracted some logic into a helper function to deduplicate code, and I tested the users upsert function after this change.
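A sketch of that "delete, then bulk insert" pattern — the table and column names as well as the `unnest` trick are assumptions for illustration, not necessarily how the helper is actually implemented:

```rust
use tokio_postgres::Client;

/// Hypothetical: replace all texts of one event by deleting the old rows and
/// bulk-inserting the new ones, so no `on conflict` clause is needed.
async fn replace_event_texts(
    db: &Client,
    event_id: i64,
    texts: &[(String, i64, i64)], // (content, span_start, span_end) — assumed shape
) -> Result<(), tokio_postgres::Error> {
    // Drop stale entries first; assets that no longer exist simply don't come back.
    db.execute("DELETE FROM event_texts WHERE event_id = $1", &[&event_id]).await?;

    // One bulk insert via `unnest` instead of one statement per row.
    let contents: Vec<&str> = texts.iter().map(|(c, _, _)| c.as_str()).collect();
    let starts: Vec<i64> = texts.iter().map(|&(_, s, _)| s).collect();
    let ends: Vec<i64> = texts.iter().map(|&(_, _, e)| e).collect();
    db.execute(
        "INSERT INTO event_texts (event_id, content, span_start, span_end)
            SELECT $1::bigint, * FROM unnest($2::text[], $3::bigint[], $4::bigint[])",
        &[&event_id, &contents, &starts, &ends],
    ).await?;

    Ok(())
}
```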
This allows you to queue or dequeue a specific set of events. In particular, `queue --missing` is very relevant, as Tobira sometimes gives up on events after too many failures.
I just tested this with our 12-core test Opencast (where Java serves the files):

- 2: ~125% CPU, ~1.1 MiB/s down => 3m 2s
- 4: ~230% CPU, ~2.0 MiB/s down => 1m 42s
- 8: ~380% CPU, ~3.5 MiB/s down => 1m
- 16: ~600% CPU, ~5.5 MiB/s down => 42s
- 32: CPU and downlink wildly varying => 37s
The main change is that texts with the same span are concatenated into a single entry in the index. This doesn't reduce the size of the `texts` field in Meili, but it does reduce the size of the timespan index. This optimization is mostly there for slide texts, not for captions. This commit also moves the build process into `FromSql` to avoid a bunch of useless allocations. Ideally one would also avoid all the intermediate `String` allocations, but that's not easily possible right now.
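Roughly what the merging step does, as a standalone sketch (names are made up; in the actual code this happens inside the `FromSql` impl):

```rust
/// Hypothetical sketch: consecutive texts sharing the exact same timespan are
/// concatenated into one entry, which shrinks the timespan index while the
/// total text payload stays the same.
struct TimespanText {
    start: u64,
    end: u64,
    text: String,
}

fn merge_same_span(entries: Vec<TimespanText>) -> Vec<TimespanText> {
    let mut out: Vec<TimespanText> = Vec::new();
    for entry in entries {
        match out.last_mut() {
            // Same span as the previous entry: append to it instead of adding a new one.
            Some(prev) if prev.start == entry.start && prev.end == entry.end => {
                prev.text.push(' ');
                prev.text.push_str(&entry.text);
            }
            _ => out.push(entry),
        }
    }
    out
}
```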
It was only as wide as the metadata made the container, which is not great.
This is still not very aggressive... I first wanted to use 2 as the threshold, but looking at all the characters encodable in 2-byte UTF-8, I cannot be sure that it doesn't make sense to search for one of them individually. Pi came to mind. We can always make this more aggressive later.
Mostly ignoring broken ranges
This rewrites the logic that creates the `textMatches` array for the search API. Before, one Meili match was emitted as one text match, but this had several problems. Most importantly, with two words in the query, if those words appeared in a text right next to one another, Meili would still generate two matches. They would have the same timespan, and Tobira would just show two divs on top of each other, only one of which would be visible. Now, for each individual text, we join all matches (with a limit) and return only one `TextMatch`, but potentially with multiple highlight ranges.
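In sketch form (type and field names are assumptions, not the actual API types):

```rust
use std::ops::Range;

/// Hypothetical sketch: all Meili matches inside one text are folded into a
/// single `TextMatch` with several highlight ranges, instead of emitting one
/// (visually overlapping) match per Meili hit.
struct MeiliMatch { start: usize, len: usize }

struct TextMatch {
    start_ms: u64,
    duration_ms: u64,
    text: String,
    highlights: Vec<Range<usize>>,
}

const MAX_HIGHLIGHTS: usize = 8; // assumed limit

fn build_text_match(start_ms: u64, duration_ms: u64, text: &str, matches: &[MeiliMatch]) -> TextMatch {
    TextMatch {
        start_ms,
        duration_ms,
        text: text.to_owned(),
        // One highlight range per Meili match, up to the limit.
        highlights: matches.iter()
            .take(MAX_HIGHLIGHTS)
            .map(|m| m.start..m.start + m.len)
            .collect(),
    }
}
```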
This pull request has conflicts ☹
First of all, it's good to see this in action, thanks. Some initial observations:
Do you mean "slide text" vs "captions"? Yes, currently both are treated as one thing. Is that different in your current portal?
Correct, and it was already like that before. There will be some improvements there in an upcoming PR, like combining a series with the page listing only that series, as having these as two separate results is fairly useless.
That is also something I'm planning to do in the upcoming PR. I am not sure if I will succeed, as it requires clever design, but yeah: my goal is that it's clear at a glance whether I'm looking at a video (which should be most results), a series, or something else.
Filters are of course planned already, and in fact some basic ones are already implemented. That feature is still hidden though, and will be re-enabled with, you guessed it, my upcoming PR. Apart from that, I would expect most users to just specify more query terms. I can't imagine a scenario where someone wants to find a video that they only remember had "internet" in it. And thanks to the clever ranking, users can just add a bunch of query words that they think are relevant, and the result containing most of these words will be shown first. Not to say we don't want filters -- we do -- but these search engines make filters less necessary, as just adding more search terms usually works out.
Mh, I'm not sure I agree. That's typo tolerance in action. All videos by you (with an exact "schulte" match) are sorted before all other videos, so in my book that's exactly as it should be. And as a last resort, you can always search by
Mh, fair, the image seems useful. We don't always have an image though, and especially for search results in captions, it might not be clear what to show. And: would you not show the extracted text at all then? I think it's useful.
So more like the design in your current video portal?
Not sure I understand.
Weird coincidence, but it's similar to what I just said for Paella: looks like five different fonts for five different text elements.
For reference, here's how Kaltura organises search results in the UZH video portal.
I agree that the UZH results are too open. Olaf's example, where he was looking for "Schulte" and "Schule" also showed up in a video further down, does not bother me.
Me too.
I agree. Additionally:
Fixes #677
For testers
This PR does not contain any changes to the search page except adding this timeline. This is planned for later. This PR is already big enough.
Also note that the usefulness and the UX of this feature depend a lot on the available data! On our test instance, roughly 2500 events have OCR'ed slide texts, while only very few have subtitles. I will try to upload more videos soon to simply have more videos with subtitles available. Subtitle timespans are usually shorter (on the order of seconds or tens of seconds), while the timespans associated with slide text can last many minutes.
Questions/discussions
Search terms to get started
While testing myself I found a few good queries to get started. Of course, do try your own ones and also try prefixes of these to see how well it works. Also try multiple query words.
- `open`: big mixed bag
- `meilisearch`: finds two Tobira videos talking about Meilisearch (never mentioned in metadata)
- `tycho`: finds the "Tycho crater" in the NASA moon video subtitles AND text detection
- `crater`: lots of usages in the subtitles of the NASA moon video
- `pyroxene`: finds the mineral in NASA moon subtitles
- `elasticsearch`: lots of matches in Opencast-related videos
- `postgres`: shows some videos with "postgres" in the title first (makes sense) and only then ones that only mention postgres
- `videoportal`: obvious Tobira videos, but also one unrelated video briefly screen-sharing the old ETH video portal and one mentioning "videoportal" in its slides
- `feynman`: further down, lots of videos just mentioning Feynman

Technical info
This PR has these main parts:
- `event_texts` (for storing all texts belonging to an event)
- `event_texts_queue` and a process to automatically download text assets from Opencast (this runs as part of the worker)

This can mostly be reviewed commit by commit. There are two places where I move a big chunk of code around that was added in a previous commit, but it should be fairly clear what and where.
Performance is kind of important for this one, since we are dealing with potentially a lot of data. So far it seems like Meili responds within 25ms in all cases I tested. That's fine, but still a big increase from before. We should make sure that we don't accidentally introduce some slowness. Though right now I also have no idea how we would optimize further...
Something I want to improve in a follow-up PR: replace the busy polling in the "download assets" and "update search index" workers with `LISTEN`/`NOTIFY` events from Postgres. Right now, both default to 30s or so, which means that adding an event has quite a round trip (sync + 30s + 30s) before its text assets are searchable. That can be vastly reduced. But again, this PR is already big enough.
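For the record, a minimal sketch of what that could look like with `tokio_postgres` — the channel name, connection string, and wake-up handling are placeholders, not the planned implementation:

```rust
use futures_util::stream::{self, StreamExt};
use tokio_postgres::{AsyncMessage, NoTls};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (client, mut connection) =
        tokio_postgres::connect("host=localhost user=tobira dbname=tobira", NoTls).await?;

    // Drive the connection and forward async messages (notifications) into a channel.
    let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel();
    tokio::spawn(async move {
        let mut messages = stream::poll_fn(|cx| connection.poll_message(cx));
        while let Some(msg) = messages.next().await {
            if let Ok(AsyncMessage::Notification(n)) = msg {
                let _ = tx.send(n);
            }
        }
    });

    // Subscribe to the (assumed) channel that the sync process would NOTIFY on.
    client.batch_execute("LISTEN event_texts_queue").await?;

    while let Some(notification) = rx.recv().await {
        // Here the worker would immediately check the queue instead of sleeping 30s.
        println!("queue changed: {}", notification.payload());
    }
    Ok(())
}
```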