Distributed batch processing #377
dolfim-ibm
announced in
Roadmap
This thread defines the roadmap for batch processing within the Docling package.
In general, the first priority is to optimize resource usage when converting each document. See issue #306, which outlines benchmarks and optimization strategies.
Distributed batch processing within Docling
We are currently not planning to add support for distributed batch processing within Docling. This would add complexity and increase the base dependencies, which is not what we are looking for.
The advantage of embedding distributed batch processing would be sharing and reusing model instances. However, the models used by Docling are designed to minimize the required resources, so we don't expect this would bring any significant benefit.
Distributed batch processing around Docling
We encourage the use of Docling within distributed pipelines that parallelize across the documents being processed. We will also do our best to support the requirements of these frameworks, in order to simplify the adoption of Docling.
In particular, we are planning to provide a few out-of-the-box examples of scheduling Docling via Ray.
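To illustrate the pattern of parallelizing "around" Docling, here is a minimal local sketch using the standard library's `ThreadPoolExecutor` as a stand-in for a distributed scheduler; a Ray-based version would wrap the same worker function in `@ray.remote` and dispatch it to remote workers. The `convert_document` function is a hypothetical placeholder, not Docling's actual API.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_document(path: str) -> str:
    # Placeholder for a real per-document conversion call
    # (e.g. constructing a Docling converter and converting `path`).
    # Here we just echo the path to keep the sketch self-contained.
    return f"converted:{path}"


def convert_batch(paths, max_workers=4):
    # Each worker converts documents independently; there is no shared
    # state between conversions, which is exactly what makes Docling
    # easy to fan out across processes, containers, or Ray workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert_document, paths))


if __name__ == "__main__":
    results = convert_batch([f"doc_{i}.pdf" for i in range(8)])
    print(results)
```

In production you would typically use processes or remote workers rather than threads, since document conversion is CPU-bound; the point of the sketch is that the per-document function is the only unit the scheduler needs to know about.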
Data-prep-kit
Our colleagues are actively working on a simple toolkit for scaling out data workflows, called data-prep-kit (DPK).
Docling is available as a transform in DPK, which makes it possible to launch the conversion of large numbers of documents on a distributed infrastructure. This is currently being used to process over 1 billion documents.