Combines videos with matching audio files (e.g. audio descriptions). Works by aligning parts of the audio file to matching parts of the video's sound.
Create a copy of a video file with the sound replaced by an audio description:
Select a video file and a corresponding audio description using the file browsers, then click "Combine":
The combined media is saved in the folder "videos_with_ad" placed in the directory describealign was run in. The directory that combined media files are saved in can be changed in "Settings":
Note: media longer than an hour should only be processed on computers with at least 16 GB of RAM.
This script is compatible with Python versions 3.8 and up. Versions before that won't work (it relies on a recent update to Scipy's linprog). On Intel-CPU Macs specifically, Python must be one of 3.8, 3.9, or 3.10, as 3.11 and up are not supported by PySide2, which is needed for the Mac GUI. Non-Intel Macs will need to use the binary.
describealign is available with pip:
pip install describealign
Note: You may need to add the folder Python's pip.exe is in to your system path. It might be something like: "C:/Users/User/AppData/Local/Programs/Python/Python310/Scripts" Don't forget to restart command prompt after updating the PATH!
The GUI can then be opened from console/command prompt in any directory with:
describealign
Note: You may need to add the folder Python's compiled scripts (e.g. describealign.exe) are kept in to your system path. It may be a different folder than pip's, like AppData/Roaming instead of AppData/Local.
Alternatively, the python script (describealign.py) can be downloaded from here and run directly (with Python 3.8+ on Windows/Linux or 3.8-3.10 on Intel-CPU Mac) after installing the dependencies manually (requirements.txt):
pip install -r requirements.txt
python3 describealign.py
The binary methods don't require installing python or messing about with PATH.
Windows and Mac users can instead download and unzip the latest release, then double click on describealign.exe to open the GUI.
Note for Mac binary users: To open the binary, you'll need to ctrl+click (or right click) on the binary, then click "Open" and then click "Open" again in the window that pops up. This minor annoyance is a result of my unwillingness to pay Apple $100 a year.
When new versions are released, the pip package can be updated with:
pip install describealign --upgrade
Note: users with multiple python versions may need to use pip3 rather than pip.
The installation can be tested on a clip from the 1929 comedy short Ask Dad, with the first part of an audio description provided by Valerie H. in her podcast Movies For the Blind. Download the trimmed versions from the test_media folder in this repository, then select them in the GUI:
This produces two outputs, a new video file "videos_with_ad/ad_ask_dad_trimmed.mp4" and a plot in alignment_plots:
The plot shows the audio description was already aligned with the video apart from a fixed offset of 199 seconds, which means Valerie starts describing Ask Dad 199 seconds into the episode. The y-scale is so zoomed in that the sub-second dithering of tokens (used to fine-tune alignment) is visible as blue streaks.
If the full video (22 minutes) and audio description (27 minutes) are used instead, describealign runs in about 90 seconds, using up about 3 GB of RAM, and we get the following plot:
This plot shows a number of small pauses in the audio description starting around 10 minutes in, which add up to a total offset of 30 seconds by the end of the video. The jump discontinuities have been smoothed out by stretching the video. The plot also shows which segments of audio would be replaced if --stretch_audio were used. All of the audio would be replaced except for a segment around the 9 minute mark in which the video's original audio would be kept, as the replacement audio would have been too noticably distorted (i.e. more than 10% stretched).
A text version of each plot is saved alongside each image:
Main changes needed to video to align it to audio input:
Start Offset: 199.06 seconds
Median Rate Change: 0.00%
Rate change of 0.0% from 0:00:00.00 to 0:09:08.64 aligning with audio from 0:03:19.06 to 0:12:27.67
Rate change of 11.0% from 0:09:08.64 to 0:10:07.34 aligning with audio from 0:12:27.67 to 0:13:20.56
Rate change of 0.4% from 0:10:07.34 to 0:10:51.53 aligning with audio from 0:13:20.56 to 0:14:04.60
Rate change of 4.7% from 0:10:51.53 to 0:13:48.45 aligning with audio from 0:14:04.60 to 0:16:53.65
Rate change of 0.2% from 0:13:48.45 to 0:14:48.80 aligning with audio from 0:16:53.65 to 0:17:53.89
Rate change of 6.4% from 0:14:48.80 to 0:17:42.84 aligning with audio from 0:17:53.89 to 0:20:37.44
Rate change of 0.4% from 0:17:42.84 to 0:18:52.56 aligning with audio from 0:20:37.44 to 0:21:46.88
Rate change of 4.3% from 0:18:52.56 to 0:20:02.88 aligning with audio from 0:21:46.88 to 0:22:54.31
Rate change of -0.0% from 0:20:02.88 to 0:20:52.45 aligning with audio from 0:22:54.31 to 0:23:43.91
Rate change of -2.9% from 0:20:52.45 to 0:21:38.09 aligning with audio from 0:23:43.91 to 0:24:30.92
Rate change of -0.1% from 0:21:38.09 to 0:22:15.77 aligning with audio from 0:24:30.92 to 0:25:08.63
describealign can be given a directory of videos and a directory of audio files rather than individual files. describealign assumes files from the two directories correspond based on their lexicographic order.
Alternatively, multiple files can be selected simultaneously in the GUI's file selectors. The selected Video and Audio files are first sorted in lexicographic order, then corresponded and aligned.
By default describealign stretches video to fit audio descriptions, but the inverse is also possible: stretching the audio description to fit the video with the "--stretch_audio" argument. This argument also keeps the video's original audio tracks, adding the audio description as the first track and setting it as the default audio.
Whereas describealign is designed to align video-to-audio, it can also align an audio file to another audio file.
When using the --stretch_audio argument, describealign also has a few other experimental capabilities, like boosting the volume of audio descriptions relative to the video's sound with the "--boost x" argument, where x is in decibels. "--boost 3" approximately doubles the audio description volume, while "--boost -3" approximately halves it.
The default behavior of --stretch_audio is to replace all or almost all of a video's audio with the audio description file's audio. But the "--keep_non_ad" argument tells describealign to try to only replace audio when the describer is speaking. This can be useful when the audio description has significantly worse sound quality than the video.
If an alignment isn't working perfectly, the ambitious user can try adjusting a few parameters with arguments described in "--help" and the GUI's Settings tooltips.
describealign can be run without the GUI by specifying input media as positional arguments:
describealign video.mp4 audio_desc.mp3
Note for Mac binary users: The executable is inside the .app and can be run from Terminal with:
describealign.app/Contents/MacOS/describealign video.mp4 audio_desc.mp3
describealign can also be used as a python module:
import describealign as dal
dal.combine("ask_dad_trimmed.mp4", "ask_dad_moviesfortheblind_ep_01_trimmed.mp3")
describealign is robust enough to align media with completely different dialogue, meaning it can align audio dubbed in a different language to the original video.
With default settings (i.e. --stretch_audio set to False), describealign doesn't re-encode either the video or audio streams. It aligns them by modifying the timestamps that video frames are shown at, which means no loss in quality. Basic video editing can be done by deleting or stretching segments of a video's sound in Audacity, then running describealign on the original video and the modified audio with --smoothness set low (e.g. 1). The video will be edited losslessly, but the audio can also be edited losslessly by exporting from Audacity as FLAC, then setting --extension to mkv or another container that supports FLAC.
By using a very high boost value (e.g. --boost 100), the output audio will only contain the audio descriptions and all other sounds will be silenced. Passing the output into Whisper will then create a transcript of just the audio descriptions.