Document scanner's v5 workflow #678
Conversation
- Scanner pushes everything to the api in a single post `/videos` call
- Api registers every video in the database
- For each video without an associated entry, the guess data + the video's id is sent to the Matcher via a queue.
- Matcher retrieves metadata from the movie/serie + ALL episodes/seasons (from an external provider)
How should we dedup metadata retrieval here? If we have 100 videos of One Piece, we do not want to fetch One Piece's metadata 100 times.
Should we group videos based on the guessed `name` & `year` fields before sending them as a bulk in the matcher's queue? (See the sketch below.)
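A minimal sketch of that grouping idea, assuming the scanner ends up with a list of guess dicts; `enqueue_matcher_job` and the field names are hypothetical placeholders, not Kyoo's real queue API:

```python
from itertools import groupby
from operator import itemgetter

def enqueue_matcher_job(message: dict) -> None:
    # Hypothetical stand-in for pushing a message onto the matcher's queue.
    print("enqueue:", message)

guesses = [
    {"title": "One Piece", "year": 1999, "video_id": 1},
    {"title": "One Piece", "year": 1999, "video_id": 2},
    {"title": "Perfect Blue", "year": 1997, "video_id": 3},
]

key = itemgetter("title", "year")
for (title, year), group in groupby(sorted(guesses, key=key), key=key):
    # One queue message per (title, year) pair instead of one per video.
    video_ids = [g["video_id"] for g in group]
    enqueue_matcher_job({"title": title, "year": year, "video_ids": video_ids})
```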
Sending a big event on the queue is bad. It is also stateful, because you need to wait for all files to be processed before sending the big bulk event. If an error happens during processing, all of that state may be lost.
Keep things stateless, just use a cache in the matcher.
The api would receive a bulk add of videos, and would then either:
- create an event per video
- or create an event per video group

This is not stateful, it only affects the items sent in the bulk create request.
Adding a cache in the matcher would make things stateful & I feel it might be more complex. With a cache we would need to:
1. check if the item already exists on the api
   - if it does, get its id and link the video to it via an api call
2. check if an item of the same show is currently being processed:
   - if yes, wait for it to finish, retrieve its id and go to step 1
3. download metadata for the item
4. push the metadata + link the video with the metadata

vs just (sketched below):
- download metadata for the item
- push the metadata + link the videos with the metadata
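A rough sketch of that simpler per-group flow; `fetch_show_metadata`, the endpoint shape, and the address are hypothetical placeholders rather than Kyoo's real client:

```python
import requests

API = "http://localhost:4568"  # hypothetical api address

def fetch_show_metadata(title: str, year: int) -> dict:
    # Placeholder for the external-provider lookup (tmdb, tvdb, ...).
    return {"title": title, "year": year, "seasons": [], "episodes": []}

def handle_group_event(event: dict) -> None:
    # One metadata download for the whole group of videos...
    metadata = fetch_show_metadata(event["title"], event["year"])
    # ...then a single push that creates the show and links every video at once.
    requests.post(f"{API}/series", json={**metadata, "videos": event["video_ids"]})
```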
I'm not sure I understand, wasn't the problem just getting the metadata multiple times?
So, for 100 videos:
- Video 1: Get metadata -> One Piece
- Video 2: Get metadata -> One Piece (cached)
- and so on..
- Video 100: Get metadata -> One Piece (cached)
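For what it's worth, a minimal sketch of that cache in the matcher, with `fetch_from_provider` as a hypothetical stand-in for the external-provider call:

```python
from functools import lru_cache

def fetch_from_provider(title: str, year: int) -> dict:
    # Hypothetical external-provider call (tmdb, tvdb, ...).
    return {"title": title, "year": year, "episodes": []}

@lru_cache(maxsize=1024)
def get_metadata(title: str, year: int) -> dict:
    # The first call for a (title, year) pair hits the provider; later calls
    # for the same pair are served from the in-process cache.
    return fetch_from_provider(title, year)
```

Note that this only dedups sequential lookups; concurrent workers fetching the same show at the same time would still need the "wait for the in-flight item" handling described a few comments up.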
- Api registers every video in the database
- For each video without an associated entry, the guess data + the video's id is sent to the Matcher via a queue.
- Matcher retrieves metadata from the movie/serie + ALL episodes/seasons (from an external provider)
- Matcher pushes every metadata to the api (if there are 1000 episodes but only 1 video, still push the 1000 episodes)
I'm still undecided if this should be done:
- via an api call to post `/series` that would contain all episodes/seasons/...
- or via a queue of the same data, having the api pull this queue to add items to the database.

v4 uses the api but requires a call per episode + per season + per series, so scanning DDoSes the api.
v5 could benefit from this being an api, since external services could want to create movies/series in Kyoo now that we can display shows that do not have a video file available.
I prefer the concept of adding multiple items rather than having a separate endpoint just for series.
You could either use a queue or create a batch endpoint for multiple items.
To help you decide: where is the bottleneck? If it’s on the API side, a queue might make sense.
But if the API response is fast, why use a queue?
Not sure I understood your first point. What do you mean by adding multiple items?
My idea was to have a post `/movies` & a post `/series`. A single post would contain every metadata about the item (staff, studio, episodes, seasons...), as sketched below.
I think I'll create an API first because it's the easiest to write & test right now, and it would be cool to have it publicly available for 3rd-party apps.
If the API becomes a bottleneck, we could add a queue for the matcher to use.
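For illustration, such a nested post body could look roughly like this; every field name here is a guess, not Kyoo's actual schema:

```python
import requests

# Every piece of metadata about the show travels in one request body.
payload = {
    "title": "One Piece",
    "year": 1999,
    "studio": "Toei Animation",
    "staff": [{"name": "Eiichiro Oda", "role": "original creator"}],
    "seasons": [
        {"number": 1, "episodes": [{"number": 1, "title": "Episode 1"}]},
    ],
}
requests.post("http://localhost:4568/series", json=payload)  # hypothetical address
```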
I mean a single endpoint that can handle multiple kinds:

```
POST /items { "type": "movie|serie|episode", ...other attributes... }
```

If you decide to have a batch endpoint, you can just generalize this:

```
POST /items/batch
[
  { "type": "movie", ...movie schema... },
  { "type": "serie", ...serie schema... },
  { "type": "episode", ...episode schema... },  # this will depend on serieId maybe
  { "type": "special", ...schema... },
]
```

I don't know what the database schema looks like, so this could be different, but I hope you get the idea.
The idea would be to NOT have an episode/special creation method. When creating the series, all episodes & seasons would be included in the same request body.
We might need a create method for extras, though, since I don't think every extra is available on database sites.
For those, a batch method might be a good option. I'll keep it in mind!
In order of action:

- Scanner gets `/videos` & scans the file system to list all new videos
- Scanner guesses as much as possible from the filename/path ALONE (no external database query).
So guessit will run on the scanner instead of the matcher?
Yes. The scanner would do the tree walk + monitoring, and run guessit on each file before pushing items.
The matcher would just download/process metadata from online databases & push it to the api.
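As a concrete example, the guessit library works from the file name alone, so this scanner step stays fully offline:

```python
from guessit import guessit

# Parses structured info out of the path/file name; no external database
# or network call is involved.
guess = guessit("One.Piece.S01E02.1080p.x264.mkv")
print(guess["title"], guess.get("season"), guess.get("episode"))
# -> One Piece 1 2
```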
Cool, I like this approach!
This also makes implementing a .nfo parser simpler. You can usually find the `externalIds` (tmdb, tvdb, imdb) and other metadata directly in these files.
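A minimal sketch of pulling those ids out of a Kodi-style .nfo file, assuming the common `<uniqueid type="...">` layout (real files vary):

```python
import xml.etree.ElementTree as ET

def read_external_ids(path: str) -> dict[str, str]:
    # An .nfo is XML with a <tvshow>/<movie> root in the Kodi convention.
    root = ET.parse(path).getroot()
    return {
        uid.get("type"): uid.text
        for uid in root.findall("uniqueid")
        if uid.get("type") and uid.text
    }

# e.g. read_external_ids("One Piece/tvshow.nfo")
# -> {"tmdb": "...", "tvdb": "...", "imdb": "..."}
```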
Part of #597
For v5, we will rework a lot of things in the scanning process. This PR creates a document that details the scanning workflow.
Big changes compared to the current version include: `loading` or `fetching-data`.