Add ability to find events by slide text & captions in search #1189
base: master
Conversation
Looked through the code and tested a bunch but didn't find any obvious issues.
I'll do a final round of testing today and then this can be merged. My comments are of no real concern and definitely not blockers.
These tables will hold the texts of events, extracted from subtitles and slide texts, which will become searchable later. The queue is used for fetching all those text assets from Opencast.
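For illustration only, here is a rough Rust sketch of what one row per table could carry — the actual column names and types in the migration are not spelled out here and may differ:

```rust
// Purely hypothetical row shapes; not the real schema.
struct EventTextRow {
    event_id: i64,
    span_start_ms: u64,   // where this text appears in the video
    span_end_ms: u64,
    content: String,      // the extracted text itself
    kind: TextKind,
}

enum TextKind {
    Subtitle,
    SlideText,
}

struct EventTextQueueEntry {
    event_id: i64,
    fetch_after: std::time::SystemTime, // earliest time the worker should (re)try
    retry_count: u32,
}
```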
This is useful for specifying additional trusted hosts that Tobira may send the sync login data to.
This will be part of the worker and is able to deal with a variety of error cases. Figuring all this out took quite some time. I decided that ignoring assets for which Opencast returns something unexpected is fine most of the time; admins will be able to easily requeue these failed events. It can also deal with network errors and similar indications that Opencast is not available at the moment, using exponential backoff in that case.
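As a rough sketch of the retry/backoff part described here (not the actual worker code; `fetch_with_backoff`, the initial delay, and the cap are made up for illustration):

```rust
use std::time::Duration;
use tokio::time::sleep;

/// Minimal sketch: retry a download with exponential backoff. The real worker
/// additionally distinguishes "asset looks broken, skip it" from "Opencast
/// seems to be down, try again later"; this only shows the backoff idea.
async fn fetch_with_backoff<T, E, F, Fut>(mut fetch: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_secs(1);
    let mut attempt: u32 = 1;
    loop {
        match fetch().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay).await;
                // Double the delay each time, capped at 10 minutes (arbitrary cap).
                delay = (delay * 2).min(Duration::from_secs(600));
                attempt += 1;
            }
        }
    }
}
```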
What... like the explicit color choice is there to override terminal detection. So why...?!
This was forgotten before: some assets may no longer exist after an event was updated, and those entries shouldn't persist in the `event_texts` table. By deleting all entries beforehand, we can also easily use a bulk insert now (since we don't require `on conflict`). I extracted some logic into a helper function to deduplicate code, and I tested the users upsert function after this change.
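A sketch of that "delete, then bulk insert" pattern — the table and column names as well as the `unnest` trick are assumptions for illustration, not necessarily how the helper is actually implemented:

```rust
use tokio_postgres::Client;

/// Hypothetical: replace all texts of one event by deleting the old rows and
/// bulk-inserting the new ones, so no `on conflict` clause is needed.
async fn replace_event_texts(
    db: &Client,
    event_id: i64,
    texts: &[(String, i64, i64)], // (content, span_start, span_end) — assumed shape
) -> Result<(), tokio_postgres::Error> {
    // Drop stale entries first; assets that no longer exist simply don't come back.
    db.execute("DELETE FROM event_texts WHERE event_id = $1", &[&event_id]).await?;

    // One bulk insert via `unnest` instead of one statement per row.
    let contents: Vec<&str> = texts.iter().map(|(c, _, _)| c.as_str()).collect();
    let starts: Vec<i64> = texts.iter().map(|&(_, s, _)| s).collect();
    let ends: Vec<i64> = texts.iter().map(|&(_, _, e)| e).collect();
    db.execute(
        "INSERT INTO event_texts (event_id, content, span_start, span_end)
            SELECT $1::bigint, * FROM unnest($2::text[], $3::bigint[], $4::bigint[])",
        &[&event_id, &contents, &starts, &ends],
    ).await?;

    Ok(())
}
```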
This allows you to queue or dequeue a specific set of events. In particular, `queue --missing` is very relevant, as Tobira sometimes gives up on events after too many failures.
I just tested this with our 12-core test Opencast (where Java serves the files):

- 2: ~125% CPU, ~1.1 MiB/s down => 3m 2s
- 4: ~230% CPU, ~2.0 MiB/s down => 1m 42s
- 8: ~380% CPU, ~3.5 MiB/s down => 1m
- 16: ~600% CPU, ~5.5 MiB/s down => 42s
- 32: CPU and downlink wildly varying => 37s
The main change is that texts with the same span are concatenated into a single entry in the index. This doesn't reduce the size of the `texts` field in Meili, but it does reduce the size of the timespan index. This optimization is mostly there for slide texts, not for captions. This commit also moves the build process into `FromSql` to avoid a bunch of useless allocations. Ideally one would also avoid all the intermediate `String` allocations, but that's not easily possible right now.
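Roughly what the merging step does, as a standalone sketch (names are made up; in the actual code this happens inside the `FromSql` impl):

```rust
/// Hypothetical sketch: consecutive texts sharing the exact same timespan are
/// concatenated into one entry, which shrinks the timespan index while the
/// total text payload stays the same.
struct TimespanText {
    start: u64,
    end: u64,
    text: String,
}

fn merge_same_span(entries: Vec<TimespanText>) -> Vec<TimespanText> {
    let mut out: Vec<TimespanText> = Vec::new();
    for entry in entries {
        match out.last_mut() {
            // Same span as the previous entry: append to it instead of adding a new one.
            Some(prev) if prev.start == entry.start && prev.end == entry.end => {
                prev.text.push(' ');
                prev.text.push_str(&entry.text);
            }
            _ => out.push(entry),
        }
    }
    out
}
```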
It was only as wide as the metadata made the container, which is not great.
This is still not very aggressive... I first wanted to use 2 as the threshold, but looking at all the characters encodable in 2-byte UTF-8, I cannot be sure that it doesn't make sense to search for one of them individually. Pi came to mind. We can always make this more aggressive later.
Mostly ignoring broken ranges
This rewrites the logic that creates the `textMatches` array for the search API. Before, one Meili match was emitted as one text match, but this had several problems. Most importantly, with two words in the query, if those words appeared in a text right next to one another, Meili would still generate two matches. They would have the same timespan, and Tobira would just show two divs on top of each other, only one of which would be visible. Now, for each individual text, we join all matches (with a limit) and return only one `TextMatch`, but potentially with multiple highlight ranges.
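In sketch form (type and field names are assumptions, not the actual API types):

```rust
use std::ops::Range;

/// Hypothetical sketch: all Meili matches inside one text are folded into a
/// single `TextMatch` with several highlight ranges, instead of emitting one
/// (visually overlapping) match per Meili hit.
struct MeiliMatch { start: usize, len: usize }

struct TextMatch {
    start_ms: u64,
    duration_ms: u64,
    text: String,
    highlights: Vec<Range<usize>>,
}

const MAX_HIGHLIGHTS: usize = 8; // assumed limit

fn build_text_match(start_ms: u64, duration_ms: u64, text: &str, matches: &[MeiliMatch]) -> TextMatch {
    TextMatch {
        start_ms,
        duration_ms,
        text: text.to_owned(),
        // One highlight range per Meili match, up to the limit.
        highlights: matches.iter()
            .take(MAX_HIGHLIGHTS)
            .map(|m| m.start..m.start + m.len)
            .collect(),
    }
}
```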
This pull request has conflicts ☹
First of all, it's good to see this in action, thanks. Some initial observations:
Do you mean "slide text" vs "captions"? Yes, currently both are treated as one thing. Is that different in your current portal?
Correct, and it was already like that before. There will be some improvements there in an upcoming PR, like combining a series with the page listing only that series, as having these as two separate results is fairly useless.
That is also something I'm planning to do in the upcoming PR. I am not sure if I will succeed, as it requires clever design, but yeah: my goal is that it's clear at a glance whether I'm looking at a video (which should be most results), a series, or something else.
Filters are of course planned already, and in fact some basic ones are already implemented. That feature is still hidden though, and will be re-enabled with, you guessed it, my upcoming PR. Apart from that, I would expect most users to just specify more query terms. I can't imagine a scenario where someone wants to find a video that they only remember had "internet" in it. And thanks to the clever ranking, users can just add a bunch of query words that they think are relevant, and the result containing most of these words will be shown first. Not to say we don't want filters -- we do -- but these search engines make filters less necessary, as just adding more search terms usually works out.
Mh, I'm not sure I agree. That's typo tolerance in action. All videos by you (with an exact "schulte" match) are sorted before all other videos, so in my book that's exactly as it should be. And as a last resort, you can always search by
Mh, fair, the image seems useful. We don't always have an image though, and especially for search results in captions, it might not be clear what to show. And: would you not show the extracted text at all then? I think it's useful.
So more like the design in your current video portal?
Not sure I understand.
Weird coincidence, but it's similar to what I just said for Paella: looks like five different fonts for five different text elements.
For reference, here's how Kaltura organises search results in the UZH video portal.
I agree that the UZH results are too open. Olaf's example, where he was looking for "Schulte" and "Schule" also showed up in a video further down, does not bother me.
Me too.
I agree. Additionally:
Fixes #677
For testers
This PR does not contain any changes to the search page except adding this timeline. This is planned for later. This PR is already big enough.
Also note that the usefulness and the UX of this feature depend a lot on the available data! On our test instance, roughly 2500 events have OCR'ed slide texts, while only very few have subtitles. I will try to upload more videos soon to simply have more videos with subtitles available. Subtitle timespans are usually shorter (on the order of seconds or tens of seconds), while the timespans associated with slide text can last many minutes.
Questions/discussions
Search terms to get started
While testing myself I found a few good queries to get started. Of course, do try your own ones and also try prefixes of these to see how well it works. Also try multiple query words.
- `open`: big mixed bag
- `meilisearch`: finds two Tobira videos talking about Meilisearch (never mentioned in metadata)
- `tycho`: finds the "Tycho crater" in the NASA moon video subtitles AND text detection
- `crater`: lots of usages in the subtitles of the NASA moon video
- `pyroxene`: finds the mineral in NASA moon subtitles
- `elasticsearch`: lots of matches in Opencast-related videos
- `postgres`: shows some videos with "postgres" in the title first (makes sense) and only then ones that only mention postgres
- `videoportal`: obvious Tobira videos, but also one unrelated video briefly screen-sharing the old ETH video portal and one mentioning "videoportal" in its slides
- `feynman`: further down, lots of videos just mentioning Feynman

Technical info
This PR has these main parts:
- `event_texts` (for storing all texts belonging to an event)
- `event_texts_queue` and a process to automatically download text assets from Opencast (this runs as part of the worker)

This can mostly be reviewed commit by commit. There are two places where I move a big chunk of code around that was added in a previous commit, but it should be fairly clear what and where.
Performance is kind of important for this one, since we are dealing with potentially a lot of data. So far it seems like Meili responds within 25ms in all cases I tested. That's fine, but still a big increase from before. We should make sure that we don't accidentally introduce some slowness. Though right now I also have no idea how we would optimize further...
Something I want to improve in a follow-up PR: replace the busy polling in the "download assets" and "update search index" workers with `LISTEN`/`NOTIFY` events from Postgres. Right now, both default to 30s or so, which means that adding an event has quite a round trip (sync + 30s + 30s) before its text assets are searchable. That can be vastly reduced. But again, this PR is already big enough.
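For the record, a minimal sketch of what that could look like with `tokio_postgres` — the channel name, connection string, and wake-up handling are placeholders, not the planned implementation:

```rust
use futures_util::stream::{self, StreamExt};
use tokio_postgres::{AsyncMessage, NoTls};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (client, mut connection) =
        tokio_postgres::connect("host=localhost user=tobira dbname=tobira", NoTls).await?;

    // Drive the connection and forward async messages (notifications) into a channel.
    let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel();
    tokio::spawn(async move {
        let mut messages = stream::poll_fn(|cx| connection.poll_message(cx));
        while let Some(msg) = messages.next().await {
            if let Ok(AsyncMessage::Notification(n)) = msg {
                let _ = tx.send(n);
            }
        }
    });

    // Subscribe to the (assumed) channel that the sync process would NOTIFY on.
    client.batch_execute("LISTEN event_texts_queue").await?;

    while let Some(notification) = rx.recv().await {
        // Here the worker would immediately check the queue instead of sleeping 30s.
        println!("queue changed: {}", notification.payload());
    }
    Ok(())
}
```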