Add ability to find events by slide text & captions in search #1189

Open · LukasKalbertodt wants to merge 25 commits into master from searchable-text
Conversation

@LukasKalbertodt (Member) commented Jun 24, 2024

Fixes #677

For testers

This PR does not contain any changes to the search page except for adding this timeline. The rest is planned for later; this PR is already big enough.

Also note that the usefulness and the UX of this feature depend a lot on the available data! On our test instance, roughly 2500 events have OCR'ed slide texts, while only very few have subtitles. I will try to upload more videos soon, simply to have more videos with subtitles available. Subtitle timespans are usually short (on the order of seconds or tens of seconds), while the timespans associated with slide texts can last many minutes.

Questions/discussions

  • Unlike the old video portal, this shows the actual text that was matched (with a little context). I find this cool, but of course it somewhat exposes how bad the OCR'ed slide texts and automatic subtitles sometimes are.
  • What do you think about the timeline design and how the matched text is highlighted?
  • Please report any query that leads to an "internal server error". That should obviously never happen.

Search terms to get started

While testing myself, I found a few good queries to get started. Of course, also try your own, try prefixes of these to see how well that works, and try queries with multiple words.

  • open: big mixed bag
  • meilisearch: finds two Tobira videos talking about Meilisearch (never mentioned in metadata)
  • tycho: finds the "Tycho crater" in the NASA moon video subtitles AND text detection
  • crater: lots of usages in the subtitles of the NASA moon video
  • pyroxene: finds the mineral in NASA moon subtitles
  • elasticsearch: lots of matches in Opencast-related videos
  • postgres: shows some videos with "postgres" in the title first (makes sense) and only then ones that merely mention postgres
  • videoportal: obvious Tobira videos, but also one unrelated video screen-sharing the old ETH video portal briefly and one mentioning "videoportal" in its slides
  • feynman: further down, lots of videos that just mention Feynman

Technical info

This PR has these main parts:

  • Add DB table event_texts (for storing all texts belonging to an event)
  • Add event_texts_queue and a process to automatically download text assets from Opencast (this runs as part of the worker)
    • This was actually the trickiest part: making it robust against random errors (from Opencast or otherwise) so that it works well in most cases, without ever running into a super busy loop or anything like that.
  • Various helper subcommands to manage fetching assets
  • VTT and MPEG-7 parsers to parse the text assets (see the sketch after this list)
  • Add texts to MeiliSearch in a specially encoded form, to optimize Meili's search performance while still allowing us to figure out the timespans of a match
  • Make the frontend use this data and show a timeline with matches for events
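To give a rough idea of the parsing side, here is a minimal sketch of WebVTT cue extraction. This is not the parser from this PR; it assumes well-formed input, uses made-up types, and ignores the header, cue settings, notes, and styling.

```rust
/// One cue: its timespan in milliseconds plus its text.
#[derive(Debug)]
struct Cue {
    start_ms: u64,
    end_ms: u64,
    text: String,
}

/// Parses "HH:MM:SS.mmm" or "MM:SS.mmm" into milliseconds.
fn parse_timestamp(s: &str) -> Option<u64> {
    let (clock, millis) = s.split_once('.')?;
    let millis: u64 = millis.parse().ok()?;
    let mut seconds = 0;
    for part in clock.split(':') {
        seconds = seconds * 60 + part.parse::<u64>().ok()?;
    }
    Some(seconds * 1000 + millis)
}

fn parse_vtt(input: &str) -> Vec<Cue> {
    let mut cues = Vec::new();
    let mut lines = input.lines().peekable();
    while let Some(line) = lines.next() {
        // A timing line looks like "00:01:02.000 --> 00:01:05.500".
        let Some((start, end)) = line.split_once(" --> ") else {
            continue;
        };
        // Cue settings may follow the end timestamp; cut them off.
        let end = end.split_whitespace().next().unwrap_or(end);
        let (Some(start_ms), Some(end_ms)) = (parse_timestamp(start.trim()), parse_timestamp(end))
        else {
            continue;
        };

        // All following non-empty lines make up this cue's text.
        let mut text = String::new();
        while let Some(l) = lines.next_if(|l| !l.trim().is_empty()) {
            if !text.is_empty() {
                text.push(' ');
            }
            text.push_str(l.trim());
        }
        cues.push(Cue { start_ms, end_ms, text });
    }
    cues
}
```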

This can mostly be reviewed commit by commit. There are two commits where I move a big chunk of code around that was added in a previous commit, but it should be fairly clear what moved and where.

Performance is kind of important for this one, since we are dealing with potentially lots of data. So far, Meili seems to respond within 25ms in all cases I tested. That's fine, but still a big increase from before. We should make sure that we don't accidentally introduce some slowness, though right now I have no idea how we would optimize further.

Something I want to improve in a follow-up PR: replace the busy polling in the "download assets" and "update search index" workers with LISTEN/NOTIFY events from Postgres. Right now, both default to roughly 30s, which means that a newly added event has quite a round trip (sync + 30s + 30s) before its text assets are searchable. That can be vastly reduced. But again, this PR is already big enough.
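For the curious: the LISTEN side of that could look roughly like the following with tokio-postgres. This is only a sketch of the idea, not code from this PR; the channel name event_texts_queue is made up, and a trigger on the queue table would have to issue the corresponding NOTIFY.

```rust
use futures_util::{stream, StreamExt};
use tokio::sync::mpsc;
use tokio_postgres::{AsyncMessage, NoTls};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (client, mut connection) =
        tokio_postgres::connect("host=localhost user=tobira dbname=tobira", NoTls).await?;

    // Notifications arrive on the connection object, so drive it in a
    // task and forward them through a channel.
    let (tx, mut rx) = mpsc::unbounded_channel();
    tokio::spawn(async move {
        let mut messages = stream::poll_fn(|cx| connection.poll_message(cx));
        while let Some(msg) = messages.next().await {
            if let Ok(AsyncMessage::Notification(n)) = msg {
                let _ = tx.send(n);
            }
        }
    });

    // Made-up channel name: an INSERT trigger on the queue table would NOTIFY it.
    client.batch_execute("LISTEN event_texts_queue").await?;

    while let Some(n) = rx.recv().await {
        println!("new queue entry: {}", n.payload());
        // ... wake the worker immediately instead of waiting for the next poll ...
    }
    Ok(())
}
```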

@LukasKalbertodt added the changelog:user (user-facing changes) label Jun 24, 2024
@owi92 (Member) left a comment

Looked through the code and tested a bunch but didn't find any obvious issues.
I'll do a final round of testing today and then this can be merged. My comments are of no real concern and definitely no blockers.

Resolved review threads: backend/src/sync/text/mpeg7.rs (outdated), backend/src/search/event.rs (outdated), frontend/src/routes/Search.tsx (two threads, one outdated).
@LukasKalbertodt (Member, Author) commented

@oas777 @dagraf Maybe the two of you want to take a short look at this before our Wednesday meeting. See the top comment for more information. But if you don't have the time until then, no worries; there will be plenty more time to discuss search-related stuff before this is released.


Commit messages:

These tables will hold texts of events, extracted from subtitles and slide texts, which will be searchable later. The queue is used for fetching all those text assets from Opencast.

This is useful for specifying other trusted hosts to which Tobira may send the sync login data.

This will be part of the worker and is able to deal with a variety of error cases. Figuring all this out took quite some time. I decided that ignoring assets for which Opencast returns something unexpected is fine most of the time; admins will be able to easily requeue these failed events.

This can also deal with network errors or similar indications that OC is not available at the moment, using an exponential backoff.
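As a rough illustration of that backoff (shape and constants are made up here, not the values from this commit):

```rust
use std::time::Duration;

/// Wait exponentially longer after each consecutive failure, with a cap
/// so we never sleep absurdly long. Base and cap are made-up values.
fn backoff_delay(consecutive_failures: u32) -> Duration {
    const BASE: Duration = Duration::from_secs(30);
    const CAP: Duration = Duration::from_secs(30 * 60);
    BASE.saturating_mul(2u32.saturating_pow(consecutive_failures)).min(CAP)
}

// In the download loop, a network error would then be handled roughly as:
//
//     failures += 1;
//     tokio::time::sleep(backoff_delay(failures)).await;
```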
What... like the explicit color choice is there to override terminal detection. So why...?!

This was forgotten before: some assets may no longer exist after an event was updated, and those entries shouldn't persist in the event_texts table. By deleting all entries beforehand, we can also easily use a bulk insert (since we no longer require `on conflict`). I extracted some logic into a helper function to deduplicate code, and I tested the users upsert function after this change.
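The pattern is roughly the following (a sketch using tokio-postgres; the table name is from this PR, but the column names are guesses for illustration, not necessarily Tobira's actual schema):

```rust
use tokio_postgres::{Client, Error};

/// Sketch of "delete everything, then bulk insert" for one event.
/// Column names (start_ms, end_ms, text) are guesses for illustration.
async fn replace_event_texts(
    client: &mut Client,
    event_id: i64,
    texts: &[(i64, i64, String)],
) -> Result<(), Error> {
    let tx = client.transaction().await?;

    // Entries for assets that no longer exist must not survive, and after
    // deleting, a plain INSERT suffices: no `ON CONFLICT` needed.
    tx.execute("DELETE FROM event_texts WHERE event_id = $1", &[&event_id]).await?;

    // Naive row-by-row insert; the real code can use a single multi-row
    // INSERT or COPY for better performance.
    let stmt = tx.prepare(
        "INSERT INTO event_texts (event_id, start_ms, end_ms, text) VALUES ($1, $2, $3, $4)",
    ).await?;
    for (start, end, text) in texts {
        tx.execute(&stmt, &[&event_id, start, end, text]).await?;
    }

    tx.commit().await
}
```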
This allows you to queue or dequeue a specific set of events. In particular, `queue --missing` is very relevant, as Tobira sometimes gives up on events after too many failures.

I just tested this with our 12-core test Opencast (where Java serves the files); the columns are number of parallel downloads, CPU usage, downlink, and total time:

 2: ~125% CPU, ~1.1 MiB/s down      => 3m 2s
 4: ~230% CPU, ~2.0 MiB/s down      => 1m 42s
 8: ~380% CPU, ~3.5 MiB/s down      => 1m
16: ~600% CPU, ~5.5 MiB/s down      => 42s
32: CPU and downlink wildly varying => 37s

The main change is that texts with the same timespan are concatenated into a single entry in the index. This doesn't reduce the size of the `texts` field in Meili, but it does shrink the timespan index. This optimization mostly matters for slide texts, not for captions.

This commit also moves the build process into `FromSql` to avoid a bunch of useless allocations. Ideally one would also avoid all the intermediate `String` allocations, but that's not easily possible right now.
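The grouping itself is simple; roughly like this (illustrative types, and the real code additionally builds the encoded string inside `FromSql`):

```rust
use std::collections::BTreeMap;

/// A text together with the timespan (in ms) it is shown or spoken in.
struct TimespanText {
    span: (u64, u64),
    text: String,
}

/// Concatenates all texts sharing the same timespan into one entry, so
/// each span is stored only once in the timespan index.
fn merge_same_span(texts: Vec<TimespanText>) -> Vec<TimespanText> {
    let mut by_span: BTreeMap<(u64, u64), String> = BTreeMap::new();
    for t in texts {
        let entry = by_span.entry(t.span).or_default();
        if !entry.is_empty() {
            entry.push(' ');
        }
        entry.push_str(&t.text);
    }
    by_span.into_iter()
        .map(|(span, text)| TimespanText { span, text })
        .collect()
}
```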
It was only as wide as the metadata made the container, which is not great.

This is still not very aggressive: I first wanted to use 2 as the threshold, but looking at all characters encodable in 2-byte UTF-8, I cannot be sure that it never makes sense to search for one of them individually; "π" (Pi) came to mind. We can always make this more aggressive later.

Mostly ignoring broken ranges.

This rewrites the logic that creates the `textMatches` array for the search API. Before, one Meili match was emitted as one text match, but this had several problems. Most importantly, with two words in the query, if those words appeared in a text right next to one another, Meili would still generate two matches. They would have the same timespan, and Tobira would just show two divs on top of each other, only one of which would be visible.

Now, for each individual text, we join all matches (with a limit) and return only one `TextMatch`, potentially with multiple highlight ranges.
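The joining idea, roughly (types and the exact merge rule are illustrative, not the code from this commit):

```rust
/// Byte range of one highlighted match within a text.
#[derive(Clone, Copy)]
struct Range {
    start: usize,
    end: usize,
}

/// One result snippet: the text plus all highlight ranges within it.
struct TextMatch {
    text: String,
    highlights: Vec<Range>,
}

/// Collect all Meili matches of a single text into one `TextMatch`,
/// merging ranges that touch or overlap (e.g. two adjacent query words)
/// and capping the number of highlights.
fn join_matches(text: String, mut ranges: Vec<Range>, limit: usize) -> TextMatch {
    ranges.sort_by_key(|r| r.start);
    let mut highlights: Vec<Range> = Vec::new();
    for r in ranges {
        match highlights.last_mut() {
            Some(last) if r.start <= last.end => last.end = last.end.max(r.end),
            _ if highlights.len() < limit => highlights.push(r),
            _ => break,
        }
    }
    TextMatch { text, highlights }
}
```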
github-actions bot commented Jul 2, 2024

This pull request has conflicts ☹
Please resolve those so we can review the pull request.
Thanks.

@oas777 (Collaborator) commented Jul 3, 2024

First of all, it's good to see this in action, thanks. Some initial observations:

  • https://pr1189.tobira.opencast.org/~search?q=tyco tells me we're currently not distinguishing where the results come from, right?
  • https://pr1189.tobira.opencast.org/~search?q=opencast tells me we're looking for pages, series, and videos, right?
  • In conjunction, I think we might have to make these distinctions clearer for users to understand the search results they are looking at (and their order in the sense of importance also).
  • Also, with the number of results for https://pr1189.tobira.opencast.org/~search?q=internet we probably have to think about filters.
  • https://pr1189.tobira.opencast.org/~search?q=schulte (blush): providing results for "schule" indicates to me that the search terms are too open.
  • I prefer a preview of the slides to a preview of the extracted text. The image is also an additional help for remembering the part of the lecture you are looking for.
  • Design: The timeline looks odd, mainly because the highlighted segments hover over the timeline. I would prefer having them "strung" on the actual timeline.
  • Font: The line with "Part of series" looks odd, even though it's probably supposed to be related to the line with the creator's name.

@LukasKalbertodt (Member, Author) commented

> https://pr1189.tobira.opencast.org/~search?q=tyco tells me we're currently not distinguishing where the results come from, right?

Do you mean "slide text" vs "captions"? Yes, currently both are treated as one thing. Is that different in your current portal?

> https://pr1189.tobira.opencast.org/~search?q=opencast tells me we're looking for pages, series, and videos, right?

Correct, and that was already the case before. There will be some improvements in an upcoming PR, like combining a series with the page listing only that series, since having these as two separate results is fairly useless.

> In conjunction, I think we might have to make these distinctions clearer for users to understand the search results they are looking at (and their order in the sense of importance also).

That is also something I'm planning for the upcoming PR. I'm not sure I will succeed, as it requires clever design, but yes: my goal is that it's clear at a glance whether you're looking at a video (which should be most results), a series, or something else.

> Also, with the number of results for https://pr1189.tobira.opencast.org/~search?q=internet we probably have to think about filters.

Filters are of course planned already, and in fact some basic ones are already implemented. That feature is still hidden though, and will be re-enabled with, you guessed it, my upcoming PR.

Apart from that, I would expect most users to just specify more query terms. I can't imagine a scenario where someone wants to find a video and only remembers that it had "internet" in it. And thanks to the clever ranking, users can just add a bunch of query words they think are relevant, and the result containing most of these words will be shown first. Not to say we don't want filters -- we do -- but these search engines make filters less necessary, as just adding more search terms usually works out.

> https://pr1189.tobira.opencast.org/~search?q=schulte (blush): providing results for "schule" indicates to me that the search terms are too open.

Hm, I'm not sure I agree. That's typo tolerance in action. All videos by you (with an exact "schulte" match) are sorted before all other videos, so in my book that's exactly as it should be. And as a last resort, you can always search for "schulte" (with quotes), which works exactly like in Google and most other search engines: it looks for exactly that term.

> I prefer a preview of the slides to a preview of the extracted text. The image is also an additional help for remembering the part of the lecture you are looking for.

Fair enough, the image does seem useful. We don't always have an image though; especially for matches in captions, it might not be clear what to show. And would you then not show the extracted text at all? I think it's useful.

> Design: The timeline looks odd, mainly because the highlighted segments hover over the timeline. I would prefer having them "strung" on the actual timeline.

So more like the design in your current video portal?

> Font: The line with "Part of series" looks odd, even though it's probably supposed to be related to the line with the creator's name.

Not sure I understand.

@oas777 (Collaborator) commented Jul 3, 2024

> So more like the design in your current video portal?
>
> Font: The line with "Part of series" looks odd, even though it's probably supposed to be related to the line with the creator's name.
>
> Not sure I understand.

Weird coincidence, but it's similar to what I just said for Paella:

[screenshot]

It looks like five different fonts for five different text elements.

@oas777 (Collaborator) commented Jul 6, 2024

For reference, here's how Kaltura organises search results in the UZH video portal.

  • Their search seems "too open" as well: providing results for "Stein" might be OK, but "Einstellungen" and "Kleinstaaten" blur the results; not sure why "Die" is also listed.
  • I like the fact that you can filter this a) by source and b) by date. They call it relevance, but I think "date ascending/descending" and "semester" would be perfect.
  • Clear distinction between "Channels" and videos, which we also need, especially if we add "pages".

@dagraf (Collaborator) commented Jul 9, 2024

> For reference, here's how Kaltura organises search results in the UZH video portal.
>
> • Their search seems "too open" as well: providing results for "Stein" might be OK, but "Einstellungen" and "Kleinstaaten" blur the results; not sure why "Die" is also listed.

I agree that the UZH results are too open. Olaf's example, where he searched for "Schulte" and "Schule" also showed up in a video further down, does not bother me.

> • I like the fact that you can filter this a) by source and b) by date. They call it relevance, but I think "date ascending/descending" and "semester" would be perfect.

Me too.

> • Clear distinction between "Channels" and videos, which we also need, especially if we add "pages".

I agree.

Additionally:

  • I like that the expressions or words responsible for a video showing up in the search results are highlighted.
  • Timeline: I would prefer the highlighted blocks to sit on the timeline instead of hovering over it, more like in the actual ETH video portal. And for me, an icon just before the timeline (e.g., a "play" icon) would help to understand immediately what this strange line with the blocks is all about. But maybe if we someday have our thumbnails to the left, this will not be necessary anymore.

@LukasKalbertodt modified the milestones: 2.11, 2.12 Jul 22, 2024
Successfully merging this pull request may close these issues.

Make events findable by transcript (ideally jump to timestamp)