[core] Performance improvement for runtime env serialization #48749

dentiny · 2024-11-14T22:31:55Z

Addresses issue #48591

The problem:

If anything is specified in runtime_env in the remote decorator, parsing and serialization happen on every function invocation:
- Parsing calls parse_runtime_env, which transforms a dictionary into a class
- Serialization calls get_runtime_env_info, which serializes a class into JSON format

After discussing with @rynewang, the proposed solution is to cache the pre-calculated serialized runtime env info, so that parsing and serialization happen only once, at initialization.
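
To illustrate the idea, here is a minimal sketch of "serialize once at initialization, reuse on every call". It does not use Ray's code: `parse_env`, `serialize_env`, and `CachedRemoteFunction` are hypothetical stand-ins that only mirror the two-step parse/serialize shape described above.

```python
import json
from typing import Any, Callable, Dict, Optional

# Counts how often the (stand-in) serialization step actually runs.
SERIALIZE_CALLS = 0


def parse_env(env: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the dictionary-to-class parsing step."""
    return dict(env)


def serialize_env(parsed: Dict[str, Any]) -> str:
    """Hypothetical stand-in for the class-to-JSON serialization step."""
    global SERIALIZE_CALLS
    SERIALIZE_CALLS += 1
    return json.dumps(parsed, sort_keys=True)


class CachedRemoteFunction:
    """Toy wrapper (not Ray's RemoteFunction) that caches the serialized env."""

    def __init__(self, fn: Callable, runtime_env: Optional[Dict[str, Any]] = None):
        self._fn = fn
        # Parse and serialize exactly once, when the wrapper is created.
        self._serialized_env = (
            serialize_env(parse_env(runtime_env)) if runtime_env else None
        )

    def remote(self, *args, **kwargs):
        # Every invocation reuses the cached string instead of recomputing it.
        return self._fn(*args, **kwargs), self._serialized_env


f = CachedRemoteFunction(lambda x: x + 1, {"env_vars": {"A": "1"}})
results = [f.remote(i) for i in range(1000)]
```

In this sketch the serialization counter stays at one no matter how many times `remote` is called, which is the behavior the PR aims for.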

Benchmarked with the test @jjyao mentioned on the ticket; I confirmed that performance with env vars is now similar to performance without them.

Alternatives considered:

Use functools.cache for get_runtime_env_info, which is a stateless function
- The caveat is that we would have to find an acceptable way to decide whether the serialization options and runtime env info are the same; simply traversing all fields is not a good plan

Signed-off-by: dentiny <dentinyhao@gmail.com>

rynewang · 2024-11-14T23:24:44Z

python/ray/remote_function.py

+        # runtime env will be merged and re-serialized.
+        #
+        # Caveat: for `func.option().remote()`, we have to recalculate serialized
+        # runtime env info upon every call. But it's acceptable since pre-calculation


To be more clear:

To support dynamic runtime envs in `func.options(runtime_env={...}).remote()`, we recalculate the serialized runtime env info in the `options` call. If there are multiple calls with the same options, one can save the calculation with `opt_f = func.options(runtime_env={...}); [opt_f.remote() for i in range(many)]`.

I'm not sure I follow the "If there are multiple calls with the same options" part, since we don't do any caching for `options` calls.

Adopted other comments.
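
The reuse pattern from the suggested comment can be sketched with a toy class. `ToyRemoteFunction` is hypothetical and is not Ray's implementation; it only assumes that each `options` call builds a new handle and serializes the env once at that point, as the review comment describes.

```python
import json
from typing import Any, Callable, Dict, Optional


class ToyRemoteFunction:
    """Toy stand-in: serializes the env once per handle, at construction."""

    serialize_count = 0  # total serializations across all handles

    def __init__(self, fn: Callable, runtime_env: Optional[Dict[str, Any]] = None):
        self._fn = fn
        self._serialized_env = self._serialize(runtime_env)

    @classmethod
    def _serialize(cls, env: Optional[Dict[str, Any]]) -> Optional[str]:
        if env is None:
            return None
        cls.serialize_count += 1
        return json.dumps(env, sort_keys=True)

    def options(self, runtime_env: Optional[Dict[str, Any]] = None) -> "ToyRemoteFunction":
        # No caching here: every options() call creates a new handle
        # and therefore serializes again.
        return ToyRemoteFunction(self._fn, runtime_env)

    def remote(self, *args):
        return self._fn(*args)


f = ToyRemoteFunction(lambda x: x * 2)
env = {"env_vars": {"A": "1"}}

# Pattern A: options() inside the loop serializes on every iteration.
a = [f.options(runtime_env=env).remote(i) for i in range(3)]
count_a = ToyRemoteFunction.serialize_count

# Pattern B: keep the handle and reuse it, so serialization runs once.
opt_f = f.options(runtime_env=env)
b = [opt_f.remote(i) for i in range(3)]
count_b = ToyRemoteFunction.serialize_count - count_a
```

Nothing is cached across `options` calls in this sketch; the saving in pattern B comes purely from the caller holding on to the returned handle.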


rynewang · 2024-11-15T23:25:33Z

stress_test_many_tasks for this PR:

{
        "perf_metrics": [
            {
                "perf_metric_name": "stage_0_time",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 10.17169737815857
            },
            {
                "perf_metric_name": "stage_1_avg_iteration_time",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 26.369622302055358
            },
            {
                "perf_metric_name": "stage_2_avg_iteration_time",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 69.62744202613831
            },
            {
                "perf_metric_name": "stage_3_creation_time",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 2.026155710220337
            },
            {
                "perf_metric_name": "stage_3_time",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 3455.4008374214172
            },
            {
                "perf_metric_name": "stage_4_spread",
                "perf_metric_type": "LATENCY",
                "perf_metric_value": 0.344243728831649
            }
        ],
        "stage_0_time": 10.17169737815857,
        "stage_1_avg_iteration_time": 26.369622302055358,
        "stage_1_max_iteration_time": 27.35737895965576,
        "stage_1_min_iteration_time": 23.87061834335327,
        "stage_1_time": 263.69632601737976,
        "stage_2_avg_iteration_time": 69.62744202613831,
        "stage_2_max_iteration_time": 71.21925210952759,
        "stage_2_min_iteration_time": 67.01057362556458,
        "stage_2_time": 348.1383538246155,
        "stage_3_creation_time": 2.026155710220337,
        "stage_3_time": 3455.4008374214172,
        "stage_4_spread": 0.344243728831649,
        "success": 1
}

As compared to 2.39:

ray/release/perf_metrics/stress_tests/stress_test_many_tasks.json

Lines 34 to 45 in 3435c25

    
           "stage_0_time": 9.837347507476807, 
        
           "stage_1_avg_iteration_time": 25.959900307655335, 
        
           "stage_1_max_iteration_time": 26.58783769607544, 
        
           "stage_1_min_iteration_time": 25.21652603149414, 
        
           "stage_1_time": 259.59910821914673, 
        
           "stage_2_avg_iteration_time": 80.91288795471192, 
        
           "stage_2_max_iteration_time": 130.10139632225037, 
        
           "stage_2_min_iteration_time": 67.79082655906677, 
        
           "stage_2_time": 404.5656635761261, 
        
           "stage_3_creation_time": 1.8206799030303955, 
        
           "stage_3_time": 3413.631863594055, 
        
           "stage_4_spread": 0.32116315011886526,

It seems stage_2 is better, while other stages are worse. idk if this is expected?
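
For a quick side-by-side, the relative change per stage can be computed from the two result sets above. The values are copied from this thread; the helper name `relative_change_pct` is only illustrative, and a single run on each side says nothing about run-to-run noise.

```python
# Values copied from the two result sets quoted in this thread.
pr_run = {
    "stage_0_time": 10.17169737815857,
    "stage_1_avg_iteration_time": 26.369622302055358,
    "stage_2_avg_iteration_time": 69.62744202613831,
    "stage_3_creation_time": 2.026155710220337,
    "stage_3_time": 3455.4008374214172,
    "stage_4_spread": 0.344243728831649,
}
baseline = {
    "stage_0_time": 9.837347507476807,
    "stage_1_avg_iteration_time": 25.959900307655335,
    "stage_2_avg_iteration_time": 80.91288795471192,
    "stage_3_creation_time": 1.8206799030303955,
    "stage_3_time": 3413.631863594055,
    "stage_4_spread": 0.32116315011886526,
}


def relative_change_pct(new: float, old: float) -> float:
    """Percent change from old to new; negative means the PR run is lower."""
    return (new - old) / old * 100.0


changes = {
    name: relative_change_pct(pr_run[name], baseline[name]) for name in baseline
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```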


rynewang · 2024-11-18T18:26:28Z

the updated release test link: https://buildkite.com/ray-project/release/builds/26062#01932d21-694c-4baa-bbc7-e624bdeeb712


jjyao · 2024-11-18T20:40:25Z

Many tests failed.


dentiny · 2024-11-18T21:16:56Z

Many tests failed.

I made a typo, should be fixed now.


performance improvement for runtime env

3e312e0

Signed-off-by: dentiny <dentinyhao@gmail.com>

dentiny requested review from jjyao and rynewang November 14, 2024 22:31

dentiny added the `go` label (add ONLY when ready to merge, run all tests) Nov 14, 2024

rynewang reviewed Nov 14, 2024


dentiny added 3 commits November 15, 2024 00:08

update comment

c5dd0e7

Signed-off-by: dentiny <dentinyhao@gmail.com>

avoid default value

2a2ee84

Signed-off-by: dentiny <dentinyhao@gmail.com>

fix is_job_runtime_env

a224041

Signed-off-by: dentiny <dentinyhao@gmail.com>

dentiny requested a review from rynewang November 15, 2024 00:15

Merge branch 'master' into hjiang/improve-runtime-env-remote

1f2121c

jcotant1 added the `core` label (Issues that should be addressed in Ray Core) Nov 15, 2024

dentiny added 2 commits November 16, 2024 08:13

revert experimental change

89850e2

Signed-off-by: dentiny <dentinyhao@gmail.com>

use named argument

401c7a5

Signed-off-by: dentiny <dentinyhao@gmail.com>

jjyao reviewed Nov 18, 2024


dentiny added 2 commits November 18, 2024 21:15

fix typo

2975d8b

Signed-off-by: dentiny <dentinyhao@gmail.com>

fix comment

122dd18

Signed-off-by: dentiny <dentinyhao@gmail.com>

dentiny requested a review from jjyao November 18, 2024 21:16

dentiny added 2 commits November 19, 2024 00:27

fix hard-coded unit test

a93d126

Signed-off-by: dentiny <dentinyhao@gmail.com>

fix another hard coded test

6e9065c

Signed-off-by: dentiny <dentinyhao@gmail.com>
