This project introduces an efficient approach to video segmentation that summarizes videos into highly representative keyframes. The core technique analyzes semantic embeddings extracted from video frames, comparing them in successive pairs. Each frame is assigned a successor score: the Euclidean distance between its embedding and the embedding of the frame that follows it. A large distance indicates that the two frames are dissimilar and marks a potential transition, or "seam", between segments. A seam is identified when the successor score exceeds a threshold, which is either adjusted dynamically or set to a predetermined value.
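As a rough illustration of the successor score with a fixed threshold, the sketch below computes the Euclidean distance between each frame embedding and its successor and flags the positions that exceed a cutoff. The function names, the toy embeddings, and the threshold value are assumptions made for this example and are not taken from the project's code.

```python
import numpy as np

def successor_scores(embeddings: np.ndarray) -> np.ndarray:
    """Euclidean distance from each frame embedding to the next one.

    `embeddings` has shape (n_frames, dim); the result has n_frames - 1 entries.
    """
    diffs = embeddings[1:] - embeddings[:-1]
    return np.linalg.norm(diffs, axis=1)

def find_seams(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Indices i where the distance between frame i and frame i + 1 exceeds the threshold."""
    return np.flatnonzero(scores > threshold)

# Toy 2-D "embeddings" (hypothetical data): frames 0-2 are close, frames 3-4 are close.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [3.0, 4.1], [3.0, 4.0]])
scores = successor_scores(toy)
seams = find_seams(scores, threshold=1.0)  # threshold chosen arbitrarily for the toy data
print(seams)
```

Here the only large jump is between frames 2 and 3, so that is the single position reported as a seam.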
The Deadly Portuguese Man O' War | 4KUHD | Blue Planet II | BBC Earth
Link to full keyframe grid for video 1
- Download YouTube Video and Extract Keyframes Using video2dataset
  - video2dataset employs yt-dl for video downloading and ffmpeg for keyframe extraction.
  - This step yields: original videos, keyframes, and the YouTube metadata JSON.
- Convert Original Frames and Keyframes to Numpy Array
  - The original frames and keyframes are processed via clip-video-encode to generate their embedding vectors.
  - These vectors are created using a pre-trained laion2b_s34b_b79k ViT-B-32 model.
  - The embeddings are a compact representation of the frames and keyframes, encapsulating essential visual features.
- Analysis, Visualization, and Fine-tuning
  - A sliding window method and k-NN are used to identify segments where the successor value crosses a specific threshold or shows a unique pattern.
  - After determining optimal values, the script is configured to produce image and video keyframes.
  - The primary output consists of 2-15-second clips, each containing 2-4 keyframes most representative of the clip's content.
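One way to picture how detected seams could become clips in the 2 to 15 second range is sketched below. This is an assumed procedure, not the project's implementation: segments shorter than the minimum are absorbed into the following segment, segments longer than the maximum are split into fixed-length chunks, and a trailing remainder shorter than the minimum is dropped. The function name and parameters are hypothetical.

```python
def seams_to_clips(seams, n_frames, fps, min_s=2.0, max_s=15.0):
    """Turn seam indices into (start_s, end_s) clips bounded by min_s and max_s.

    A seam at index i is taken to mean a cut between frame i and frame i + 1.
    """
    boundaries = [0.0] + [(i + 1) / fps for i in sorted(seams)] + [n_frames / fps]
    clips = []
    start = boundaries[0]
    for end in boundaries[1:]:
        if end - start < min_s:
            continue  # too short: keep `start` so this span merges into the next segment
        # Split overly long spans into chunks of exactly max_s seconds.
        while end - start > max_s:
            clips.append((start, start + max_s))
            start += max_s
        # Emit the remainder only if it is long enough; otherwise it carries over.
        if end - start >= min_s:
            clips.append((start, end))
            start = end
    return clips

# Hypothetical example: a 40-frame video at 1 fps with seams after frames 0, 9 and 29.
print(seams_to_clips([0, 9, 29], n_frames=40, fps=1))
```

In the example, the 1-second opening span is merged into the first clip and the 20-second span is split into a 15-second clip and a 5-second clip.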
Key Terms
- Distance Metrics: Uses Euclidean distance to measure the dissimilarity between embeddings of adjacent frames.
- Successor Values: The Euclidean distance from the current frame to its successor frame, used to qualify new segments.
- Dynamic Thresholds: Adapt to varying video content by using a rolling average and standard deviation to adjust the threshold.
- Embedding Semantics: Semantic embeddings give a rich representation of content, enabling the system to better identify segments where a significant change occurs.
- Successor Score: The primary heuristic in keyframe detection, defined as the Euclidean distance between successive frame embeddings.
- Embedding Surveyor: Leverages K-Nearest Neighbors (KNN) to fine-tune the dynamic thresholds, providing a second layer of adaptability and increasing segmentation accuracy.
- Seam Detection: Leveraging the successor score and KNN for detecting "trending" seams presents a novel way of identifying key moments without needing explicit object recognition or manual labeling.
- Adaptive System: The combination of dynamic thresholds and successor scores allows the system to adapt to different videos and to changes within a single video.
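The dynamic threshold described above can be sketched as a rolling mean plus a multiple of the rolling standard deviation. The window size, the multiplier `k`, and the `min_periods` warm-up are assumptions chosen for illustration; the KNN-based fine-tuning mentioned under Embedding Surveyor is not reproduced here.

```python
import numpy as np

def dynamic_threshold(scores, window=30, k=2.0, min_periods=3):
    """Per-position threshold: mean + k * std of the preceding `window` scores.

    Only earlier scores are used, so a spike cannot raise its own threshold.
    Positions with fewer than `min_periods` earlier scores get an infinite
    threshold (no seam can be declared there).
    """
    scores = np.asarray(scores, dtype=float)
    thresholds = np.full(len(scores), np.inf)
    for i in range(min_periods, len(scores)):
        past = scores[max(0, i - window):i]
        thresholds[i] = past.mean() + k * past.std()
    return thresholds

def adaptive_seams(scores, window=30, k=2.0, min_periods=3):
    """Indices whose successor score exceeds the rolling threshold."""
    scores = np.asarray(scores, dtype=float)
    return np.flatnonzero(scores > dynamic_threshold(scores, window, k, min_periods))

# Hypothetical successor scores with one clear spike at index 4.
example_scores = [0.10, 0.12, 0.11, 0.10, 0.90, 0.10, 0.11]
print(adaptive_seams(example_scores))
```

Because the threshold follows the local statistics of the scores, a quiet passage gets a low cutoff while a busy passage needs a larger jump before a seam is declared.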
- evaluations/: Embedding summary statistics by video
- keyframeembeddings/: Keyframe embeddings
- keyframes/: Initial video keyframes, reduced FPS version of original video
- originalembeddings/: Full video embeddings
- originalvideos/: Full videos used to create keyframes
- evaluations/: Embedding statistics by video
- EDA.ipynb: Initial EDA and methods visualized
- Example1.ipynb: Embedding surveyor with video 1
- Example2.ipynb: Embedding surveyor with video 7
- Results EDA.ipynb: Successor segmentation video 1
- Generative Summarization EDA.ipynb: Final summarization process (work in progress)
- Successor-Segmentation-Pipeline.ipynb: Google Colab notebook for full video segmentation pipeline.
- pipeline.py: Setup script for running the pipeline.
- clipvideoencode.py: Script for extracting embeddings from video frames; uses the clip-video-encode library.
- video2dataset.py: Script for downloading YouTube videos with metadata and extracting keyframes; uses the video2dataset library.
- segment_averaging.py: Script for calculating average clip embedding for each cut segment.
- move_and_group.py & rename_and_move_files.py: Utility scripts for organizing files.
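The averaging step attributed to segment_averaging.py can be pictured as taking the mean of the frame embeddings inside each cut segment. The sketch below is a minimal version under that assumption; the function name and the seam convention (a seam at index i separates frame i from frame i + 1) are hypothetical and not taken from the script itself.

```python
import numpy as np

def segment_mean_embeddings(embeddings: np.ndarray, seams) -> np.ndarray:
    """Average the frame embeddings inside each segment delimited by seams.

    Returns an array of shape (n_segments, dim).
    """
    # np.split cuts *before* each index, so a seam between frames i and i + 1
    # corresponds to split index i + 1.
    split_points = [i + 1 for i in sorted(seams)]
    segments = np.split(embeddings, split_points)
    return np.stack([seg.mean(axis=0) for seg in segments if len(seg)])

# Hypothetical 2-D embeddings for four frames, with one seam after frame 1.
frames = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [6.0, 4.0]])
print(segment_mean_embeddings(frames, seams=[1]))
```

Each row of the result is a single vector summarizing one segment, which can then be compared against individual keyframe embeddings, for example to rank keyframes by distance to the segment centroid.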
- 1.png through 8.png: Grid plots of the keyframes for each video.
- original_video_scatter_1.png & key_video_scatter_1.png: Scatter plots for video 1 for original and keyframe embeddings; shows the latent space for each video.
- original_video_embeddings_1 & original_video_embeddings_7: Visualize the relationship between video frames and embedding values over video duration. Video frames are plotted with associated histograms of average embedding values and successor similarity scores.
- keyframe_embeddings_1 & keyframe_embeddings_7: Visualize the relationship between keyframes and keyframe embedding values over video duration. Video frames are plotted with associated histograms of average embedding values and successor similarity scores.
- Example outputs
- segment_processing.py: Provides utility functions for video segmentation and keyframe filtering based on metrics like perceptual hashing and Euclidean distances between embeddings. It reads configurable thresholds and uses them to detect new segments in a video, filter out similar keyframes, and calculate distances to centroids.
- embedding_surveyor.py: Contains the SlidingWindowAnalyzer class, designed for video segmentation and keyframe analysis. It uses a sliding window approach to analyze video embeddings, employing algorithms like K-Nearest Neighbors (KNN) and t-SNE for clustering and visualization. The class also dynamically updates distance thresholds and leverages various utility functions for plotting and threshold management tasks.
- successor_segmentation.py: Contains the SegmentSuccessorAnalyzer class, designed for video keyframe analysis and segmentation. It operates on pre-computed video embeddings and uses configurable thresholds for segment identification. The class also incorporates optional maximum segment duration constraints and saves keyframes and their metadata for further analysis. It is part of the pipeline that can be run on multiple videos, and it leverages utility functions for tasks like annotation and plotting.
- fold_seams.py: The primary function in this file is segment_video_using_keyframes_and_embeddings, which segments a video based on keyframe timestamps (obtained from successor_segmentation.py). It uses FFmpeg for the actual video cutting. The function also incorporates a tolerance level that can be adjusted to fine-tune each segment's start and end times. This tolerance is mainly used when the segmentation is based on keyframes, so that each segment's start and end times are not too close to the keyframe timestamps.
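To make the tolerance idea concrete, the sketch below builds FFmpeg commands that cut a video between consecutive keyframe timestamps, moving each segment's start later and its end earlier by the tolerance. This is an assumed interpretation, not the body of segment_video_using_keyframes_and_embeddings; the helper names and output file pattern are hypothetical. Only standard FFmpeg options are used (`-ss`, `-i`, `-t`, `-c copy`, `-y`).

```python
import subprocess

def build_cut_command(src, dst, start_s, end_s, tolerance_s=0.0):
    """Build an ffmpeg command that copies [start_s + tol, end_s - tol] into dst."""
    start = start_s + tolerance_s
    end = end_s - tolerance_s
    if end <= start:
        raise ValueError("tolerance leaves an empty segment")
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",        # seek to the adjusted start
        "-i", src,
        "-t", f"{end - start:.3f}",   # duration of the adjusted segment
        "-c", "copy",                 # stream copy: fast, no re-encoding
        dst,
    ]

def cut_segments(src, keyframe_times, tolerance_s=0.0, run=False):
    """Create one command per pair of consecutive keyframe timestamps.

    With run=False (the default) the commands are only returned, not executed.
    """
    commands = []
    for n, (a, b) in enumerate(zip(keyframe_times, keyframe_times[1:])):
        cmd = build_cut_command(src, f"segment_{n:03d}.mp4", a, b, tolerance_s)
        commands.append(cmd)
        if run:
            subprocess.run(cmd, check=True)
    return commands

# Hypothetical timestamps in seconds; prints the commands without running ffmpeg.
for cmd in cut_segments("video.mp4", [0.0, 4.0, 10.0], tolerance_s=0.5):
    print(" ".join(cmd))
```

Stream copy avoids re-encoding, but cuts then land on the container's own keyframes, so the actual boundaries may differ slightly from the requested times; re-encoding would be needed for frame-accurate cuts.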
- Su, J., Yin, R., Zhang, S., & Luo, J. (2023). Motion-state Alignment for Video Semantic Segmentation.
- Cho, S., Kim, W. J., Cho, M., Lee, S., Lee, M., Park, C., & Lee, S. (2022). Pixel-Level Equalized Matching for Video Object Segmentation.
- Han, Z., He, X., Tang, M., & Lv, Y. (2021). Video Similarity and Alignment Learning on Partial Video Copy Detection.
- Cho, D., Hong, S., Kang, S., & Kim, J. (2019). Key Instance Selection for Unsupervised Video Object Segmentation.
- Foster, D. (n.d.). Generative Deep Learning, 2nd Edition. O'Reilly Online Learning.
@misc{kilian-2023-video2dataset,
author = {Maciej Kilian, Romain Beaumont, Daniel Mendelevitch, Sumith Kulal, Andreas Blattmann},
title = {video2dataset: Easily turn large sets of video urls to a video dataset},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/iejMac/video2dataset}}
}