
[core] Performance improvement for runtime env serialization #48749

Open · wants to merge 11 commits into master

Conversation


@dentiny dentiny commented Nov 14, 2024

Addresses issue #48591

The problem:

  • If anything is specified in runtime_env in the remote decorator, parsing and serialization happen on every function invocation.
    • Parsing calls parse_runtime_env, which transforms a dictionary into a class instance.
    • Serialization calls get_runtime_env_info, which serializes that class into JSON.

As discussed with @rynewang, the proposed solution is to cache the pre-calculated serialized runtime env info, so parsing and serialization happen only once, at initialization.

Benchmarked with the test @jjyao mentioned on the ticket; I confirmed we reach similar performance with and without env vars.

Alternatives considered:

  • Use functools.cache for get_runtime_env_info, which is a stateless function.
    • The caveat is that we would need an acceptable way to decide whether the serialization options and runtime env info are the same; simply traversing all fields is not a good plan.
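The caching idea can be sketched as follows. This is a toy model, not Ray's actual RemoteFunction: `CachedRemoteFunction` is a hypothetical class, and `json.dumps` stands in for the real parse_runtime_env / get_runtime_env_info pipeline.

```python
import json


# Hypothetical sketch: parse and serialize the runtime env once at
# construction (decoration) time, then reuse the cached string on every call.
class CachedRemoteFunction:
    def __init__(self, func, runtime_env=None):
        self._func = func
        # "Parsing" and "serialization" happen exactly once, here.
        self._runtime_env = dict(runtime_env or {})
        self._serialized_runtime_env_info = json.dumps(
            self._runtime_env, sort_keys=True
        )

    def remote(self, *args, **kwargs):
        # Each invocation reuses the cached serialized form instead of
        # re-parsing and re-serializing the runtime env.
        info = self._serialized_runtime_env_info
        return self._func(*args, **kwargs), info


f = CachedRemoteFunction(lambda x: x + 1, runtime_env={"env_vars": {"A": "1"}})
result, info = f.remote(41)
```

Every call to `f.remote(...)` returns the same cached string, so the per-invocation cost drops to an attribute lookup.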

Signed-off-by: dentiny <dentinyhao@gmail.com>
@dentiny dentiny added the go add ONLY when ready to merge, run all tests label Nov 14, 2024
# runtime env will be merged and re-serialized.
#
# Caveat: for `func.option().remote()`, we have to recalculate serialized
# runtime env info upon every call. But it's acceptable since pre-calculation
Contributor
To be more clear:

To support dynamic runtime envs in `func.option(runtime_env={...}).remote()`, we recalculate the serialized runtime env info in the `option` call. If the same options are used across many calls, the calculation can be saved with `opt_f = func.option(runtime_env={...}); [opt_f.remote() for i in range(many)]`.
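The trade-off described here can be illustrated with a small sketch. `SketchRemoteFunction` and its shallow merge are hypothetical simplifications, not Ray's real API: the point is only that `options()` re-merges and re-serializes, so hoisting it out of a hot loop pays for the serialization once.

```python
import json


# Hypothetical sketch of the options() trade-off: options() merges and
# re-serializes the runtime env, so reusing its return value amortizes
# that cost across many .remote() calls.
class SketchRemoteFunction:
    def __init__(self, func, runtime_env=None):
        self._func = func
        self._runtime_env = dict(runtime_env or {})
        # Serialized once per construction.
        self._serialized = json.dumps(self._runtime_env, sort_keys=True)

    def options(self, runtime_env=None):
        # Shallow-merge and re-serialize: this is the per-call cost the
        # comment suggests hoisting out of hot loops.
        merged = {**self._runtime_env, **(runtime_env or {})}
        return SketchRemoteFunction(self._func, merged)

    def remote(self, *args, **kwargs):
        return self._func(*args, **kwargs)


f = SketchRemoteFunction(lambda x: x * 2, runtime_env={"pip": ["requests"]})
# Merge + serialize once, then reuse across calls:
opt_f = f.options(runtime_env={"env_vars": {"B": "2"}})
results = [opt_f.remote(i) for i in range(3)]
```

Calling `f.options(...).remote(i)` inside the loop instead would repeat the merge and serialization on every iteration.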

Contributor Author

I'm not sure I follow the "If there are multiple calls to the same option" part, since we don't do any caching for option calls.

Contributor Author

Addressed the other review comments.

python/ray/remote_function.py (outdated, resolved)
python/ray/remote_function.py (outdated, resolved)
python/ray/remote_function.py (resolved)
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Nov 15, 2024
@rynewang
Contributor

stress_test_many_tasks for this PR:

{
        "perf_metrics": [
            {
                "perf_metric_name": "stage_0_time",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 10.17169737815857
            },
            {
                "perf_metric_name": "stage_1_avg_iteration_time",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 26.369622302055358
            },
            {
                "perf_metric_name": "stage_2_avg_iteration_time",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 69.62744202613831
            },
            {
                "perf_metric_name": "stage_3_creation_time",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 2.026155710220337
            },
            {
                "perf_metric_name": "stage_3_time",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 3455.4008374214172
            },
            {
                "perf_metric_name": "stage_4_spread",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 0.344243728831649
            }
        ],
        "stage_0_time": 10.17169737815857,
        "stage_1_avg_iteration_time": 26.369622302055358,
        "stage_1_max_iteration_time": 27.35737895965576,
        "stage_1_min_iteration_time": 23.87061834335327,
        "stage_1_time": 263.69632601737976,
        "stage_2_avg_iteration_time": 69.62744202613831,
        "stage_2_max_iteration_time": 71.21925210952759,
        "stage_2_min_iteration_time": 67.01057362556458,
        "stage_2_time": 348.1383538246155,
        "stage_3_creation_time": 2.026155710220337,
        "stage_3_time": 3455.4008374214172,
        "stage_4_spread": 0.344243728831649,
        "success": 1
}

Compared to 2.39:

"stage_0_time": 9.837347507476807,
"stage_1_avg_iteration_time": 25.959900307655335,
"stage_1_max_iteration_time": 26.58783769607544,
"stage_1_min_iteration_time": 25.21652603149414,
"stage_1_time": 259.59910821914673,
"stage_2_avg_iteration_time": 80.91288795471192,
"stage_2_max_iteration_time": 130.10139632225037,
"stage_2_min_iteration_time": 67.79082655906677,
"stage_2_time": 404.5656635761261,
"stage_3_creation_time": 1.8206799030303955,
"stage_3_time": 3413.631863594055,
"stage_4_spread": 0.32116315011886526,

It seems stage_2 is better, while the other stages are slightly worse. I'm not sure whether this is expected.

Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
@rynewang
Contributor

@jjyao
Collaborator

jjyao commented Nov 18, 2024

Many tests failed.

Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
@dentiny
Contributor Author

dentiny commented Nov 18, 2024

Many tests failed.

I made a typo; it should be fixed now.

@dentiny dentiny requested a review from jjyao November 18, 2024 21:16
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
Labels
core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests