This ticket opens a discussion on potential improvements to performance and memory management in our data processing workflows. The current approach, although functional, incurs significant I/O overhead because outputs must be written to disk and read back in subsequent stages.
Current Challenge: Each stage of our workflow currently writes its output to disk, which the next stage then reads back. This is not only I/O intensive but also a performance bottleneck, primarily because of the serialization required to pass data between Prefect flows.
Proposed Solutions for Discussion:
Custom Serialization:
Pros: Allows tighter control over how data is managed and passed between stages.
Cons: Requires additional development effort and maintenance.
Implementation: Develop a custom serializer class (if required) that handles intermediate data efficiently; a minimal sketch follows below.
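As a starting point for discussion, here is a minimal sketch of what such a class could look like, modeled on the dumps/loads shape of Prefect 2's serializer interface. The class name and the choice of pickle as the underlying format are illustrative assumptions, not a proposal for the final implementation:

```python
import pickle
from typing import Any


class IntermediateDataSerializer:
    """Illustrative serializer for intermediate stage outputs.

    Mirrors the dumps/loads shape of Prefect's Serializer
    interface; pickle is a placeholder format, not a commitment.
    """

    def dumps(self, obj: Any) -> bytes:
        # Highest protocol avoids the slower legacy pickle formats.
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

    def loads(self, blob: bytes) -> Any:
        return pickle.loads(blob)
```

Keeping to a plain dumps/loads contract means the same class can later be swapped for a columnar or schema-aware format without touching the stages that call it.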
Compression:
Pros: Reduces the physical size of serialized files, potentially decreasing I/O time.
Cons: Introduces a trade-off between the time saved on I/O operations and the additional time required for compressing and decompressing data.
Implementation: Implement compression algorithms suited to our data types and processing needs; see the sketch after this item.
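To make the trade-off concrete, here is a hedged sketch that wraps any dumps/loads serializer (such as the one above) with stdlib zlib compression. Note that Prefect 2 also ships a CompressedSerializer in prefect.serializers that may already cover this; the class and the level=3 default below are illustrative only:

```python
import zlib
from typing import Any


class ZlibSerializer:
    """Wraps another serializer and compresses its byte output.

    zlib level 3 is an illustrative balance between compression
    ratio and CPU time; it should be tuned on real payloads.
    """

    def __init__(self, inner: Any, level: int = 3):
        self.inner = inner
        self.level = level

    def dumps(self, obj: Any) -> bytes:
        return zlib.compress(self.inner.dumps(obj), self.level)

    def loads(self, blob: bytes) -> Any:
        return self.inner.loads(zlib.decompress(blob))
```

Benchmarking compress/decompress time against the raw I/O saved on representative payloads would tell us whether this trade-off actually pays off for our data.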
Enhanced Caching Strategy:
Pros: Minimizes disk I/O by keeping frequently used data in faster-access storage tiers.
Cons: Complex implementation, especially when deciding what data remains in cache and what gets written to disk.
Implementation: Utilize a hybrid caching system using Redis for in-memory caching and disk-based storage for less frequently accessed data. Consider writing a custom caching strategy tailored to predict and pre-load data required for upcoming stages; a rough sketch follows.
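A rough sketch of the hybrid lookup path, assuming the redis-py client and a reachable Redis server; the host, TTL, and cache directory are hypothetical defaults for illustration:

```python
import hashlib
from pathlib import Path

import redis  # assumes redis-py and a running Redis server


class HybridCache:
    """Checks Redis first, then falls back to a disk directory.

    Host, TTL, and cache directory are hypothetical defaults.
    """

    def __init__(self, cache_dir: str = "/tmp/stage_cache", ttl: int = 3600):
        self.redis = redis.Redis(host="localhost", port=6379)
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl

    def _path(self, key: str) -> Path:
        # Hash the key so arbitrary strings map to safe filenames.
        return self.dir / hashlib.sha256(key.encode()).hexdigest()

    def get(self, key: str) -> bytes | None:
        blob = self.redis.get(key)
        if blob is not None:
            return blob
        path = self._path(key)
        if path.exists():
            blob = path.read_bytes()
            # Promote disk hits back into Redis for later stages.
            self.redis.set(key, blob, ex=self.ttl)
            return blob
        return None

    def put(self, key: str, blob: bytes) -> None:
        # Hot copy in Redis, durable copy on disk.
        self.redis.set(key, blob, ex=self.ttl)
        self._path(key).write_bytes(blob)
```

The predictive pre-loading idea would sit on top of something like put/get, e.g. a step at the end of each stage that warms the keys the next stage is known to read.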
Any insights, comments, or criticism are welcome.