This ticket opens a discussion on potential improvements to performance and memory management in our data processing workflows. The current approach, although functional, incurs significant I/O overhead because outputs must be written to disk and read back in subsequent stages.
Current Challenge: Each stage of our workflow currently writes its output to disk, which the next stage then reads back. This is not only I/O intensive but also a performance bottleneck, primarily because of the serialization required to pass data between Prefect flows.
Proposed Solutions for Discussion:
Custom Serialization:
Pros: Allows tighter control over how data is managed and passed between stages.
Cons: Requires additional development effort and maintenance.
Implementation: Develop a custom serializer class (if required) that handles intermediate data efficiently; a minimal sketch follows below.
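As a starting point for discussion, here is a minimal sketch of what such a class could look like, modeled on the dumps/loads shape of Prefect 2's serializer interface. The class name and the choice of pickle as the underlying format are illustrative assumptions, not a proposal for the final implementation:

```python
import pickle
from typing import Any


class IntermediateDataSerializer:
    """Illustrative serializer for intermediate stage outputs.

    Mirrors the dumps/loads shape of Prefect's Serializer
    interface; pickle is a placeholder format, not a commitment.
    """

    def dumps(self, obj: Any) -> bytes:
        # Highest protocol avoids the slower legacy pickle formats.
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

    def loads(self, blob: bytes) -> Any:
        return pickle.loads(blob)
```

Keeping to a plain dumps/loads contract means the same class can later be swapped for a columnar or schema-aware format without touching the stages that call it.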
Compression:
Pros: Reduces the physical size of serialized files, potentially decreasing I/O time.
Cons: Introduces a trade-off between the time saved on I/O operations and the additional time required for compressing and decompressing data.
Implementation: Implement compression algorithms suited to our data types and processing needs; see the sketch after this item.
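To make the trade-off concrete, here is a hedged sketch that wraps any dumps/loads serializer (such as the one above) with stdlib zlib compression. Note that Prefect 2 also ships a CompressedSerializer in prefect.serializers that may already cover this; the class and the level=3 default below are illustrative only:

```python
import zlib
from typing import Any


class ZlibSerializer:
    """Wraps another serializer and compresses its byte output.

    zlib level 3 is an illustrative balance between compression
    ratio and CPU time; it should be tuned on real payloads.
    """

    def __init__(self, inner: Any, level: int = 3):
        self.inner = inner
        self.level = level

    def dumps(self, obj: Any) -> bytes:
        return zlib.compress(self.inner.dumps(obj), self.level)

    def loads(self, blob: bytes) -> Any:
        return self.inner.loads(zlib.decompress(blob))
```

Benchmarking compress/decompress time against the raw I/O saved on representative payloads would tell us whether this trade-off actually pays off for our data.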
Enhanced Caching Strategy:
Pros: Minimizes disk I/O by keeping frequently used data in faster-access storage tiers.
Cons: Complex implementation, especially when deciding what data remains in cache and what gets written to disk.
Implementation: Utilize a hybrid caching system using Redis for in-memory caching and disk-based storage for less frequently accessed data. Consider writing a custom caching strategy tailored to predict and pre-load data required for upcoming stages; a rough sketch follows.
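A rough sketch of the hybrid lookup path, assuming the redis-py client and a reachable Redis server; the host, TTL, and cache directory are hypothetical defaults for illustration:

```python
import hashlib
from pathlib import Path

import redis  # assumes redis-py and a running Redis server


class HybridCache:
    """Checks Redis first, then falls back to a disk directory.

    Host, TTL, and cache directory are hypothetical defaults.
    """

    def __init__(self, cache_dir: str = "/tmp/stage_cache", ttl: int = 3600):
        self.redis = redis.Redis(host="localhost", port=6379)
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl

    def _path(self, key: str) -> Path:
        # Hash the key so arbitrary strings map to safe filenames.
        return self.dir / hashlib.sha256(key.encode()).hexdigest()

    def get(self, key: str) -> bytes | None:
        blob = self.redis.get(key)
        if blob is not None:
            return blob
        path = self._path(key)
        if path.exists():
            blob = path.read_bytes()
            # Promote disk hits back into Redis for later stages.
            self.redis.set(key, blob, ex=self.ttl)
            return blob
        return None

    def put(self, key: str, blob: bytes) -> None:
        # Hot copy in Redis, durable copy on disk.
        self.redis.set(key, blob, ex=self.ttl)
        self._path(key).write_bytes(blob)
```

The predictive pre-loading idea would sit on top of something like put/get, e.g. a step at the end of each stage that warms the keys the next stage is known to read.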
Any insights, comments, or criticism are welcome.