Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Topic/fault tolerance #34

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Topic/fault tolerance #34

wants to merge 7 commits into from

Conversation

pgzmnk
Copy link
Contributor

@pgzmnk pgzmnk commented Aug 6, 2021

Fault Tolerance mechanism is now functional.

  • On Task retry, stores upstream outputs to GCS.
  • Attempts to recover objects from Ray then from GCS. Fails the Task if objects DNE.
  • Tests create scenarios and make assertions based on the expected presence/absence of Ray objects (the counter was deprecated).

To-do:

  • S3 support.
  • Pass GCS/S3 credentials via Airflow connection (currently env variable).
  • Flag to enable/disable Fault Tolerance modes, such as: off, always on retry, and checkpointing.
  • Verify functionality when tasks execute on different clusters.

Comment on lines 82 to 90
class RayPythonOperator(PythonOperator):

def __init__(self, *,
python_callable: Callable,
op_args: Optional[List] = None,
op_kwargs: Optional[Dict] = None,
templates_dict: Optional[Dict] = None,
templates_exts: Optional[List[str]] = None,
**kwargs) -> None:

This comment was marked as resolved.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is basically the implementation of the task?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you leave a docstring here?

Copy link
Contributor Author

@pgzmnk pgzmnk Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, the ray decorator used the PythonOperator.

Assigning the recovered objects as Task attributes requires modifying to the pre_execute method of the PythonOperator.

RayPythonOperator subclasses the PythonOperator; it sets a custom pre_execute method and implements logic to enable attribute assignment.

  • I will add docstrings explaining the implementation.

@pgzmnk pgzmnk marked this pull request as ready for review August 31, 2021 22:08
@pgzmnk pgzmnk changed the title Topic/fault tolerance failure poc v4.2 Topic/fault tolerance Aug 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants