Document scanner's v5 workflow #678

Open · wants to merge 1 commit into master
Conversation

@zoriya zoriya commented Nov 14, 2024

Part of #597

For v5, we will rework a lot of things in the scanning process. This PR creates a document that details the scanning workflow.

Big changes compared to current version include:

  • The api knows & registers even episodes we don't have videos for (or that have not aired yet).
    • Episodes missing a video will be displayed in the app accordingly.
    • This lets the user know an episode is missing instead of inadvertently skipping it.
    • We could use this information in admin dashboards.
    • Fetching all episodes was already needed in some cases; this makes those cases easier.
  • Videos are saved in database before we register their series/movie.
    • We can display them in the interface as loading or fetching-data.
    • If metadata matching fails, we can still play them.
    • We can manually correct the guess & continue the registration flow.
  • Metadata guesses are saved in the database (just guesses based on the file name, not external db data).
    • This allows a correction to also apply to other episodes of the same series.
    • It gives the user better insight into why something was guessed this way (or at least it improves the debugging experience for us).

@zoriya zoriya added this to the v5.0.0 milestone Nov 14, 2024
- Scanner pushes everything to the api in a single post `/videos` call
- Api registers every video in the database
- For each video without an associated entry, the guess data + the video's id is sent to the Matcher via a queue.
- Matcher retrieves metadata from the movie/serie + ALL episodes/seasons (from an external provider)
Owner Author
How should we dedup metadata retrieval here? If we have 100 videos of One Piece, we do not want to fetch One Piece's metadata 100 times.
Should we group videos by the guessed name & year fields before sending them as a bulk message in the matcher's queue?
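A hypothetical sketch of that grouping idea: before enqueueing matcher jobs, bucket unmatched videos by their guessed (title, year) pair so each external lookup happens once per show. Field names here are assumptions, not the real guess schema.

```python
from collections import defaultdict


def group_guesses(videos: list[dict]) -> dict[tuple, list[dict]]:
    """Bucket videos by their guessed title & year."""
    groups = defaultdict(list)
    for video in videos:
        key = (video["guess"]["title"].lower(), video["guess"].get("year"))
        groups[key].append(video)
    return dict(groups)


videos = [
    {"id": 1, "guess": {"title": "One Piece", "year": 1999, "episode": 1}},
    {"id": 2, "guess": {"title": "One Piece", "year": 1999, "episode": 2}},
    {"id": 3, "guess": {"title": "Bleach", "year": 2004, "episode": 1}},
]
groups = group_guesses(videos)
# 2 groups -> 2 matcher jobs instead of 3 (or 100 for 100 One Piece files)
```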

Contributor

Sending a big event on the queue is bad. It is also stateful, because you need to wait for all files to be processed before sending the big bulk event. If an error happens during processing, all of that state may be lost.

Keep things stateless, just use a cache in matcher.

Owner Author

The api would receive a bulk add of videos, and would then either:

  • create an event per video
  • or create an event per video group

This is not stateful; it only affects items sent in the bulk create request.

Adding a cache in the matcher would make things stateful & I feel it might be more complex. With a cache we would need to:

  • 1 check if the item already exists on the api
    • if it does, get its id and link the video to it via an api call
  • 2 check if an item of the same show is currently being processed:
    • if yes, wait for it to finish, retrieve its id and go to step 1
  • 3 download metadata for the item
  • 4 push metadata + link the video with the metadata

vs just:

  • download metadata for the item
  • push metadata + link videoS with the metadata
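The two-step flow argued for here can be sketched with stub functions standing in for the provider and the api (all names hypothetical):

```python
def download_metadata(guess: dict) -> dict:
    # stand-in for a single external-provider lookup per group
    return {"title": guess["title"], "episodes": [{"episode": 1}, {"episode": 2}]}


def push_and_link(metadata: dict, video_ids: list[int]) -> dict:
    # stand-in for pushing the metadata + linking every video of the group
    return {"series": metadata["title"], "linked": video_ids}


def process_group_event(event: dict) -> dict:
    metadata = download_metadata(event["guess"])
    return push_and_link(metadata, event["video_ids"])


result = process_group_event({"guess": {"title": "One Piece"}, "video_ids": [1, 2, 3]})
# result == {"series": "One Piece", "linked": [1, 2, 3]}
```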

Contributor

I'm not sure I understand, wasn't the problem just getting the metadata multiple times?

So, for 100 videos:

    1. Get metadata -> One Piece
    2. Get metadata -> One Piece (cached)
    ... and so on

- Api registers every video in the database
- For each video without an associated entry, the guess data + the video's id is sent to the Matcher via a queue.
- Matcher retrieves metadata from the movie/serie + ALL episodes/seasons (from an external provider)
- Matcher pushes every metadata to the api (if there are 1000 episodes but only 1 video, still push the 1000 episodes)
Owner Author

I'm still undecided whether this should be done

  • via an api call to post /series that would contain all episodes/seasons/...
  • or via a queue of the same data, with the api pulling from this queue to add items to the database.

v4 uses the api but requires a call per episode + per season + per series, so scanning DDoSes the api.

v5 could benefit from this being an api, since external services may want to create movies/series in Kyoo now that we can display shows that do not have a video file available.
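To make the first option concrete, a single post /series body might nest everything about the show. Field names and ids below are guesses for illustration, not the real v5 schema.

```python
# hypothetical shape of one POST /series request carrying the whole show
series_payload = {
    "name": "One Piece",
    "startAir": "1999-10-20",
    "externalIds": {"tmdb": "37854"},
    "staff": [{"name": "Eiichiro Oda", "role": "Original Creator"}],
    "seasons": [{"seasonNumber": 1, "name": "East Blue"}],
    "episodes": [
        {"seasonNumber": 1, "episodeNumber": 1, "name": "I'm Luffy!"},
    ],
}
# one request replaces v4's call-per-episode + call-per-season + call-per-series
```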

Contributor

I prefer the concept of adding multiple items rather than having a separate endpoint just for series.

You could either use a queue or create a batch endpoint for multiple items.

To help you decide: where is the bottleneck? If it’s on the API side, a queue might make sense.

But if the API response is fast, why use a queue?

Owner Author

Not sure I understood your first point. What do you mean by adding multiple items?
My idea was to have a post /movies & a post /series. A single post would contain all metadata about the item (staff, studio, episodes, seasons...).

I think I'll create an API first because it's the easiest to write & test rn, and it would be cool to have it publicly available for 3rd party apps.
If the API becomes a bottleneck, we could add a queue for the matcher to use.

Contributor

I mean a single endpoint that can handle multiple kinds:

POST /items { "type": "movie|serie|episode" ... other attributes... }

If you decide to have a batch endpoint you can just generalize this:

POST /items/batch

[
  { "type": "movie" ... movie schema... },
  { "type": "serie" ... serie schema... },
  { "type": "episode" ... episode schema... }, # this will depend on serieId maybe
  { "type": "special" ... schema ... },
]

I don't know what the database schema looks like, so this could be different, but I hope you get the idea.

Owner Author

The idea would be to NOT have an episode/special creation method. When creating the series, all episodes & seasons would be included in the same request body.
We might need a create method for extras though, since I don't think every extra is available on database sites.

For those, a batch method might be a good option. I'll keep it in mind!

@zoriya zoriya added the scanner label Nov 14, 2024
@zoriya zoriya self-assigned this Nov 14, 2024
In order of action:

- Scanner gets `/videos` & scans the file system to list all new videos
- Scanner guesses as much as possible from the filename/path ALONE (no external database query).
Contributor

So guessit will run on the scanner instead of the matcher?

Owner Author

Yes. The scanner would walk/monitor the file tree and run guessit on each file before pushing items.
The matcher would just download/process metadata from online databases & push it to the api.

Contributor

@felipemarinho97 felipemarinho97 Nov 19, 2024

Cool! I like this approach!
This also makes implementing a .nfo parser simpler: you can usually find the externalIds (tmdb, tvdb, imdb) and other metadata directly in these files.
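A hedged sketch of that .nfo idea: Kodi-style .nfo files are XML, and external ids conventionally live in `<uniqueid>` elements. Tag names follow the Kodi convention; Kyoo's eventual parser may differ, and the ids shown are illustrative.

```python
import xml.etree.ElementTree as ET

NFO = """<tvshow>
  <title>One Piece</title>
  <uniqueid type="tmdb">37854</uniqueid>
  <uniqueid type="tvdb">81797</uniqueid>
</tvshow>"""


def parse_nfo(text: str) -> dict:
    root = ET.fromstring(text)
    return {
        "title": root.findtext("title"),
        "external_ids": {el.get("type"): el.text for el in root.findall("uniqueid")},
    }
```

parse_nfo(NFO) yields {"title": "One Piece", "external_ids": {"tmdb": "37854", "tvdb": "81797"}}, which the matcher could use to skip the name-based search entirely.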

