Custom video dataset encoding/serialize uses all memory, process killed. How to fix? #5499

cleong110 · 2024-07-03T15:49:37Z

What I need help with / What I was wondering

I want to load a dataset containing these

without this happening (Colab notebook for replicating)

...How can I edit my dataset loader to use less memory when encoding videos?

Background:
I am trying to load a custom dataset with a Video feature.
When I try to tfds.load() it, or even just download_and_prepare, RAM usage goes up very high and then the process gets killed.
For example this notebook will crash if allowed to run, though with a High-RAM instance it may not.
It seems it is using over 30GB of memory to encode one or two 10 MB videos.
I would like to know how to edit/update this custom dataset so that it will not use so much memory.

What I've tried so far

I did a bunch of debugging and tracing of the problem with memray, etc. See this notebook and this issue for detailed analysis including a copy of the memray report.

Tried various different ideas in the notebook, including loading just a slice, editing buffer size, and switching from .load() to download_and_prepare()

Finally I traced the problem to serializing and encoding steps under the
See this comment, which was allocating many GiB of memory to encode even one 10MB video.

I discovered that even one 10MB video was extracted to over 13k video frames, taking up nearly 5GiB of space. And then the
serializing would take up 14-15 GiB, and the encoding would take another 14-15, and so the process would be killed.

Relevant items:

The data loader in question, dgs_corpus.py
The full memray report: memray_output_file.tar.gz
Encoding path: The dataset uses a custom VideoFeature as well, defined here. The memray report shows that encode_example here ends up allocating 14.5 GiB.
Serialization: The memray report shows that the other path that uses memory is serialization: split_builder.py here, which calls writer.py's serialization.

It would be nice if...

...there were more examples of how to efficiently load video datasets, and explanations of why they are more efficient.
...there were a way to do this in some sort of streaming fashion that used less memory, e.g. loading in a batch of frames, using a sliding window, etc.
...there were some way to set a memory limit, and just have it process more slowly within that limit.
...there were a way to separate the download and prepare processes. A download_only option, like --download_only in the CLI
...there were a warning that the dataset was using a lot of memory in processing, before the OS kills the process.
...for saving disk space, a way to encode and serialize videos without extracting thousands of individual frames, ballooning the size from 10MB to multiple GiB. Maybe there is and I just don't know.
...it was possible to download only part of a dataset. It's possible to load a slice, but only after download_and_prepare does its whole thing.
...more explanation of what serialization and encoding are for, maybe? What are they?

Environment information
I've tested it on Colab and a few other Ubuntu workstations. High-Ram Colab Instances seem to have enough memory to get past this.


tomvdw · 2024-07-04T09:55:15Z

Hey,

Thanks for your question. Those are some cool datasets! I'm very sorry to hear that you're running into these problems.

We brainstormed a bit and came up with a couple of ideas:

14-15GB for 13k frames means that each frame takes up ~1MB. IIUC ffmpeg extracts frames as PNG files. Switching to JPG could maybe bring ~5x savings. However, you'd still end up with ~3GB for a 10MB video. Not great.
Store the encoded video in the dataset. This means that the video will stay 10MB, but that the decoding needs to happen when you use the data. I'm not sure if using ffmpeg to decode when training would be a good solution (i.e. running a separate tool that writes 14-15 GB to disk, then read those 14-15 GB from disk). Alternatively, there seem to be Python libraries that can read videos, e.g. OpenCV.
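As a rough check on the first idea, here is the arithmetic written out. It is only a sketch: the frame count and memory figure are the ones reported earlier in this thread, the figure is treated as GiB (as memray reports it), and the 5x JPG saving is a guess, not a measurement.

```python
# Back-of-the-envelope numbers taken from this thread; nothing here is measured.
GIB = 1024 ** 3

frames = 13_000                # "over 13k video frames" from one 10MB video
in_memory_bytes = 14.5 * GIB   # midpoint of the reported 14-15 GiB

# Average size of one extracted frame.
bytes_per_frame = in_memory_bytes / frames
print(f"per frame: {bytes_per_frame / 1e6:.2f} MB")

# Assumed ~5x saving from switching PNG to JPG; the real ratio depends on content.
jpeg_savings = 5
jpeg_total_bytes = in_memory_bytes / jpeg_savings
print(f"with JPG: {jpeg_total_bytes / GIB:.1f} GiB")
```

This lands at roughly 1.2 MB per frame and about 2.9 GiB after the assumed JPG saving, which matches the ~1MB and ~3GB ballpark above.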

Even if we make storing encoded videos work, I'm worried that the problem would just be moved to when the dataset is used. Namely, reading a single example would still require 14-15 GB of memory.

After the dataset has been prepared, how are you expecting that it will be used? Would it make sense to lower the FPS (it's 50 now right)? Will users only use chunks of the video? If so, perhaps you can store the chunks instead of the entire video.

Kind regards,
Tom

cleong110 · 2024-07-05T14:06:51Z

Tom,

Thank you very much for your reply, and those ideas!

How will they be used:

I'm just getting into Sign Language Processing research, so I'm still not quite sure how I want to use these, but potentially for training translation models for signed language videos to spoken-language text, or for pretraining a vision transformer, or a bunch of other things. A few use cases follow:

test out models on real data

I figured I'd start learning by at least running some inference pipelines with already-trained models, and got stuck on this step. I expected running a model to take significant memory, but didn't expect that loading the video would be the issue. I guess I'm successfully learning things! Specifically I'd like to load in some videos and run this demo of segmentation+recognition pipeline.

replicate other research on github

I went looking for examples of people using these, and it seems that not many use the video option, perhaps for this very reason, that loading them is too cumbersome.

This project on sign language translation loads actual videos in a number of places including for prediction here and here and here. And for training in this script.

replicate WMT results, or at least re-run their models

One thing I wanted to do was replicate results for the WMT Sign Language Translation contests, which provide data in a number of formats including video, and a number of the submissions do use video as inputs instead of poses.

WMT 22 data
WMT 23 data
According to the "Findings" papers that came from these, a good number of the submissions took videos as inputs instead of poses. I'd like to be able to tinker with those pipelines.

At least load the videos and then run pose estimation on them

Another thing I wanted to do was to be able to load the videos, run a pose estimator on them, and then use that, in order to potentially improve that part of the pipeline. A number of sign language translation models take pose keypoints as inputs, and I'd like to try those out.

At the very least I'd like to be able to do this! And then the pose methods may take less compute from there.

cleong110 · 2024-07-05T14:29:11Z

Regarding the suggestions:

Switching to JPG: seems pretty easy to test, worth a shot!
Storing the encoded video: I admit I'm pretty ignorant about this. What is the encoding/decoding even doing, exactly? What would it mean to store the encoded video, decode later, etc.? I read about it a bit, and I think I understand that encoding compresses the frames into a video format, and decoding expands the video back out into frames...? If so, is there a way to load in only some limited number of frames at a time? And why does the dataset need to encode when it's already encoded as an .mp4?

I guess I'd like to be able to, and I don't know if any of this is feasible, but:

If I have plenty of time but not memory or hard drive space, have a way to just slowly decode as needed.
If I have plenty of time AND hard drive space, expand it out to frames on the hard drive, but then only load into memory what I need when I need it.
If I have enough memory to load half the video, only load half. Stream the rest in, like a buffer.
and so forth, but basically have it do its best with the available resources but not crash.
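To make the "only load what I need" idea concrete, here is a minimal sketch. It assumes OpenCV (`cv2`) is installed and can decode the file. `iter_frames` and `frame_batches` are hypothetical helper names, not part of tfds or of the dataset loader, and `video.mp4` is a placeholder path.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_frames(path: str) -> Iterator[object]:
    """Yield decoded frames one at a time instead of building a full list."""
    import cv2  # imported lazily so frame_batches works without OpenCV

    capture = cv2.VideoCapture(path)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of stream or decode failure
                break
            yield frame
    finally:
        capture.release()


def frame_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group any iterable into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical usage: only one batch of frames is held in memory at a time.
# for batch in frame_batches(iter_frames("video.mp4"), 64):
#     process(batch)
```

Peak memory then depends on the batch size rather than on the length of the video, at the cost of decoding again each time the video is read.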

Did some further Googling, and I found a few things:

Memory issues when loading videos into frames: the suggestion is to use the pims library, which lets you index/slice videos and only loads frames when they are used.
How to read part of a video and load into RAM without loading the entire video on RAM?: suggests using the ffmpeg "trim" method.
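A rough sketch of cutting out just one segment with the ffmpeg command line. Note this uses the plain `-ss`/`-t` input options rather than the "trim" method from that answer, so it is a swapped-in technique. `ffmpeg_segment_command` is a hypothetical helper and the file names are placeholders.

```python
from typing import List


def ffmpeg_segment_command(
    src: str, dst: str, start_s: float, duration_s: float
) -> List[str]:
    """Build an ffmpeg command that writes only one time window of src to dst."""
    if start_s < 0 or duration_s <= 0:
        raise ValueError("start must be >= 0 and duration must be > 0")
    return [
        "ffmpeg",
        "-ss", f"{start_s:.3f}",    # seek to the start time before reading input
        "-t", f"{duration_s:.3f}",  # limit how much of the input is read
        "-i", src,
        dst,
    ]


cmd = ffmpeg_segment_command("in.mp4", "clip.mp4", start_s=10, duration_s=5)
print(" ".join(cmd))

# To actually run it (requires ffmpeg on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Each clip can then be decoded on its own, so only a few seconds of frames need to be in memory at once.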

cleong110 · 2024-07-05T14:31:21Z

FPS lowering: that's another good idea, I think there might be a method in there to set that already. Maybe tweaking that would reduce memory usage, I can try.
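To get a feel for what lowering the FPS would buy, here is a small sketch that picks which frame indices to keep. `frames_to_keep` is a hypothetical helper, the 50 fps source rate is the figure mentioned above, and the 10 fps target is just an example.

```python
from typing import List


def frames_to_keep(num_frames: int, source_fps: float, target_fps: float) -> List[int]:
    """Return evenly spaced frame indices that approximate target_fps."""
    if not 0 < target_fps <= source_fps:
        raise ValueError("target_fps must be in (0, source_fps]")
    step = source_fps / target_fps                   # source frames per kept frame
    count = int(num_frames * target_fps / source_fps)
    return [int(i * step) for i in range(count)]


kept = frames_to_keep(num_frames=13_000, source_fps=50, target_fps=10)
print(len(kept), kept[:3], kept[-1])
```

Going from 50 to 10 fps keeps one frame in five, so if every frame costs about the same, memory use should drop by roughly the same factor.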

cleong110 added the help label Jul 3, 2024

cleong110 changed the title ~~Custom video dataset encoding step uses all memory, process killed. How to fix?~~ Custom video dataset encoding/serialize uses all memory, process killed. How to fix? Jul 3, 2024
