Document scanner's v5 workflow #678

Open · wants to merge 1 commit into master
Conversation

@zoriya zoriya commented Nov 14, 2024

Part of #597

For v5, we will rework a lot of things in the scanning process. This PR creates a document that details the scanning workflow.

Big changes compared to current version include:

  • The api knows & registers even episodes we don't have videos for (or that have not aired yet).
    • Episodes missing a video will be displayed in the app accordingly.
    • This lets the user know an episode is missing instead of inadvertently skipping it.
    • We could use this information in admin dashboards.
    • Fetching all episodes was already needed in some cases; this makes those cases easier.
  • Videos are saved in database before we register their series/movie.
    • We can display them in the interface as loading or fetching-data.
    • If metadata matching fails, we can still play them.
    • We can manually correct the guess & continue the registration flow.
  • Metadata guesses are saved in the database (just guesses based on the file name, not external db data).
    • This allows a correction to also apply to other episodes of the same series.
    • It gives the user better insight into why something was guessed this way (or at least it improves the debugging experience for us).

@zoriya zoriya added this to the v5.0.0 milestone Nov 14, 2024
- Scanner pushes everything to the api in a single post `/videos` call
- Api registers every video in the database
- For each video without an associated entry, the guess data + the video's id is sent to the Matcher via a queue.
- Matcher retrieves metadata from the movie/serie + ALL episodes/seasons (from an external provider)
Owner Author
How should we dedup metadata retrieval here? If we have 100 videos of One Piece, we do not want to fetch One Piece's metadata 100 times.
Should we group videos by the guessed name & year fields before sending them as a bulk message in the matcher's queue?
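A hypothetical sketch of that grouping idea: before enqueueing matcher jobs, bucket unmatched videos by their guessed (title, year) pair so each external lookup happens once per show. Field names here are assumptions, not the real guess schema.

```python
from collections import defaultdict


def group_guesses(videos: list[dict]) -> dict[tuple, list[dict]]:
    """Bucket videos by their guessed title & year."""
    groups = defaultdict(list)
    for video in videos:
        key = (video["guess"]["title"].lower(), video["guess"].get("year"))
        groups[key].append(video)
    return dict(groups)


videos = [
    {"id": 1, "guess": {"title": "One Piece", "year": 1999, "episode": 1}},
    {"id": 2, "guess": {"title": "One Piece", "year": 1999, "episode": 2}},
    {"id": 3, "guess": {"title": "Bleach", "year": 2004, "episode": 1}},
]
groups = group_guesses(videos)
# 2 groups -> 2 matcher jobs instead of 3 (or 100 for 100 One Piece files)
```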

Contributor

Sending a big event on the queue is bad. It is also stateful, because you need to wait for all files to be processed before sending the big bulk event. If an error happens during processing, all of that state may be lost.

Keep things stateless, just use a cache in matcher.

Owner Author

The api would receive a bulk add of videos, and would then either:

  • create an event per video
  • or create an event per video group

This is not stateful; it only affects items sent in the bulk create request.

Adding a cache in the matcher would make things stateful & I feel it might be more complex. With a cache we would need to:

  • 1 check if the item already exists on the api
    • if it does, get its id and link the video to it via an api call
  • 2 check if an item of the same show is currently being processed:
    • if yes, wait for it to finish, retrieve its id and go to step 1
  • 3 download metadata for the item
  • 4 push metadata + link the video with the metadata

vs just:

  • download metadata for the item
  • push metadata + link videoS with the metadata
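The two-step flow argued for here can be sketched with stub functions standing in for the provider and the api (all names hypothetical):

```python
def download_metadata(guess: dict) -> dict:
    # stand-in for a single external-provider lookup per group
    return {"title": guess["title"], "episodes": [{"episode": 1}, {"episode": 2}]}


def push_and_link(metadata: dict, video_ids: list[int]) -> dict:
    # stand-in for pushing the metadata + linking every video of the group
    return {"series": metadata["title"], "linked": video_ids}


def process_group_event(event: dict) -> dict:
    metadata = download_metadata(event["guess"])
    return push_and_link(metadata, event["video_ids"])


result = process_group_event({"guess": {"title": "One Piece"}, "video_ids": [1, 2, 3]})
# result == {"series": "One Piece", "linked": [1, 2, 3]}
```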

Contributor

I'm not sure I understand, wasn't the problem just getting the metadata multiple times?

So, for 100 videos:

    1. Get metadata -> One Piece
    2. Get metadata -> One Piece (cached)
    ... and so on

- Api registers every video in the database
- For each video without an associated entry, the guess data + the video's id is sent to the Matcher via a queue.
- Matcher retrieves metadata from the movie/serie + ALL episodes/seasons (from an external provider)
- Matcher pushes every metadata to the api (if there are 1000 episodes but only 1 video, still push the 1000 episodes)
Owner Author

I'm still undecided whether this should be done

  • via an api call to post /series that would contain all episodes/seasons/...
  • or via a queue of the same data, with the api pulling from this queue to add items to the database.

v4 uses the api but requires a call per episode + per season + per series, so scanning DDoSes the api.

v5 could benefit from this being an api, since external services may want to create movies/series in Kyoo now that we can display shows that do not have a video file available.
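To make the first option concrete, a single post /series body might nest everything about the show. Field names and ids below are guesses for illustration, not the real v5 schema.

```python
# hypothetical shape of one POST /series request carrying the whole show
series_payload = {
    "name": "One Piece",
    "startAir": "1999-10-20",
    "externalIds": {"tmdb": "37854"},
    "staff": [{"name": "Eiichiro Oda", "role": "Original Creator"}],
    "seasons": [{"seasonNumber": 1, "name": "East Blue"}],
    "episodes": [
        {"seasonNumber": 1, "episodeNumber": 1, "name": "I'm Luffy!"},
    ],
}
# one request replaces v4's call-per-episode + call-per-season + call-per-series
```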

Contributor

I prefer the concept of adding multiple items rather than having a separate endpoint just for series.

You could either use a queue or create a batch endpoint for multiple items.

To help you decide: where is the bottleneck? If it’s on the API side, a queue might make sense.

But if the API response is fast, why use a queue?

Owner Author

Not sure I understood your first point. What do you mean by adding multiple items?
My idea was to have a post /movies & a post /series. A single post would contain all metadata about the item (staff, studio, episodes, seasons...).

I think I'll create an API first because it's the easiest to write & test rn, and it would be cool to have it publicly available for 3rd party apps.
If the API becomes a bottleneck, we could add a queue for the matcher to use.

Contributor

I mean a single endpoint that can handle multiple kinds:

POST /items { "type": "movie|serie|episode" ... other attributes... }

If you decide to have a batch endpoint you can just generalize this:

POST /items/batch

[
  { "type": "movie" ... movie schema... },
  { "type": "serie" ... serie schema... },
  { "type": "episode" ... episode schema... }, # this will depend on serieId maybe
  { "type": "special" ... schema ... },
]

I don't know what the database schema looks like, so this could be different, but I hope you get the idea.

Owner Author

The idea would be to NOT have an episode/special creation method. When creating the series, all episodes & seasons would be included in the same request body.
We might need a create method for extras though, since I don't think every extra is available on database sites.

For those, a batch method might be a good option. I'll keep it in mind!

@zoriya zoriya added the scanner label Nov 14, 2024
@zoriya zoriya self-assigned this Nov 14, 2024
In order of action:

- Scanner gets `/videos` & scans the file system to list all new videos
- Scanner guesses as much as possible from the filename/path ALONE (no external database query).
Contributor

So guessit will run on the scanner instead of the matcher?

Owner Author

Yes. The scanner would walk/monitor the file tree and run guessit on each file before pushing items.
The matcher would just download/process metadata from online databases & push it to the api.

Contributor

@felipemarinho97 felipemarinho97 Nov 19, 2024

Cool! I like this approach!
This also makes implementing a .nfo parser simpler: you can usually find the externalIds (tmdb, tvdb, imdb) and other metadata directly in these files.
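A hedged sketch of that .nfo idea: Kodi-style .nfo files are XML, and external ids conventionally live in `<uniqueid>` elements. Tag names follow the Kodi convention; Kyoo's eventual parser may differ, and the ids shown are illustrative.

```python
import xml.etree.ElementTree as ET

NFO = """<tvshow>
  <title>One Piece</title>
  <uniqueid type="tmdb">37854</uniqueid>
  <uniqueid type="tvdb">81797</uniqueid>
</tvshow>"""


def parse_nfo(text: str) -> dict:
    root = ET.fromstring(text)
    return {
        "title": root.findtext("title"),
        "external_ids": {el.get("type"): el.text for el in root.findall("uniqueid")},
    }
```

parse_nfo(NFO) yields {"title": "One Piece", "external_ids": {"tmdb": "37854", "tvdb": "81797"}}, which the matcher could use to skip the name-based search entirely.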

