Distributed batch processing #377
dolfim-ibm
announced in
Roadmap
This thread defines the roadmap for batch processing within the Docling package.
In general, the first priority is to optimize resource usage when converting each document. See issue #306, which outlines benchmarks and optimization strategies.
Distributed batch processing within Docling
We are currently not planning to add support for distributed batch processing within Docling. This would add complexity and increase the base dependencies, which is not what we are looking for.
The advantage of embedding distributed batch processing would be sharing and reusing model instances. However, the models used by Docling are designed to minimize the required resources, so we don't expect this would bring any significant benefit.
Distributed batch processing around Docling
We encourage the use of Docling within distributed pipelines that parallelize across the documents being processed. We will also do our best to support the requirements of these frameworks, in order to simplify the adoption of Docling.
In particular, we are planning to provide a few out-of-the-box examples of scheduling Docling via Ray.
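To illustrate the pattern of parallelizing "around" Docling, here is a minimal local sketch using the standard library's `ThreadPoolExecutor` as a stand-in for a distributed scheduler; a Ray-based version would wrap the same worker function in `@ray.remote` and dispatch it to remote workers. The `convert_document` function is a hypothetical placeholder, not Docling's actual API.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_document(path: str) -> str:
    # Placeholder for a real per-document conversion call
    # (e.g. constructing a Docling converter and converting `path`).
    # Here we just echo the path to keep the sketch self-contained.
    return f"converted:{path}"


def convert_batch(paths, max_workers=4):
    # Each worker converts documents independently; there is no shared
    # state between conversions, which is exactly what makes Docling
    # easy to fan out across processes, containers, or Ray workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert_document, paths))


if __name__ == "__main__":
    results = convert_batch([f"doc_{i}.pdf" for i in range(8)])
    print(results)
```

In production you would typically use processes or remote workers rather than threads, since document conversion is CPU-bound; the point of the sketch is that the per-document function is the only unit the scheduler needs to know about.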
Data-prep-kit
Our colleagues are actively working on a simple toolkit for scaling out data workflows, called data-prep-kit (DPK).
Docling is available as a transform in DPK, which makes it possible to launch the conversion of large numbers of documents on a distributed infrastructure. This is currently being used to process over 1 billion documents.