[Dashboard | Core?] Py-Spy and Python 3.12 - Lack of support #47388

mtralka · 2024-08-28T14:36:04Z

What happened + What you expected to happen

Py-Spy does not yet support Python 3.12, nor does there seem to be much momentum toward adding it (benfred/py-spy#670, benfred/py-spy#661, benfred/py-spy#633). On Python 3.12 with Ray 2.32, the Ray Dashboard's CPU flame graph and stack trace features are non-functional because upstream py-spy does not support 3.12.

Does Ray have plans to transition to another statistical profiler (e.g., https://github.com/P403n1x87/austin) to support the 3.12 releases? What level of effort would this take? Is the Ray team open to PRs here?

I understand Ray's Python 3.12 support is in alpha per #45904. Is resolving this issue a prerequisite for a non-alpha release?

Pyroscope seems to have the same issue:
grafana/pyroscope-rs#168
grafana/pyroscope#3287

Versions / Dependencies

Python 3.12
Ray 2.32

Reproduction script

Following the documentation for Py-Spy in KubeRay - https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/pyspy.html#kuberay-pyspy-integration.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Superskyyy · 2024-08-28T21:00:01Z

We have tried the Pyroscope continuous profiler and it works pretty well, but in the long run I imagine the new Elastic eBPF profiler would be the way to go.

Superskyyy · 2024-08-28T21:00:45Z

https://github.com/open-telemetry/opentelemetry-ebpf-profiler

anyscalesam · 2024-09-03T16:37:07Z

TY for raising this; we'll review it internally this week and see what solution we can offer. IIRC there is a non-py-spy path for CPU flame graphs.

mtralka · 2024-09-11T17:31:27Z

Thank you @anyscalesam. Any update on this path internally?

From an external view, it seems prudent to create another CpuProfilingManager for a non-py-spy target. We would have to ensure it supports flame graph and/or speedscope output, and then plumb profiler selection into the dashboard reporter agent. Thoughts? I'd be happy to implement something like this if the project is interested in that path.
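A rough sketch of what a profiler-agnostic manager interface could look like follows. `ProfilerAdapter` and `EchoProfiler` are hypothetical names invented for illustration; this is not Ray's actual `CpuProfilingManager` API, and the stub adapter returns canned text instead of calling a real profiler. The `trace_dump(pid, native=...)` returning `(success, output)` shape mirrors the call used later in this thread.

```python
import abc
import asyncio


class ProfilerAdapter(abc.ABC):
    """Hypothetical profiler-agnostic interface (not Ray's actual CpuProfilingManager)."""

    PROFILER_NAME: str = ""  # key used to select this adapter from configuration

    def __init__(self, log_dir: str) -> None:
        self.log_dir = log_dir  # where adapters may write intermediate profile files

    @abc.abstractmethod
    async def trace_dump(self, pid: int, native: bool = False) -> tuple[bool, str]:
        """Return (success, stack-trace text) for the target process."""

    @abc.abstractmethod
    async def cpu_profile(
        self, pid: int, fmt: str = "flamegraph", duration: float = 5.0, native: bool = False
    ) -> tuple[bool, str]:
        """Return (success, profile output or error message)."""


class EchoProfiler(ProfilerAdapter):
    """Placeholder non-py-spy adapter returning canned output, for illustration only."""

    PROFILER_NAME = "echo"

    async def trace_dump(self, pid: int, native: bool = False) -> tuple[bool, str]:
        return True, f"stack dump for pid {pid} (native={native})"

    async def cpu_profile(
        self, pid: int, fmt: str = "flamegraph", duration: float = 5.0, native: bool = False
    ) -> tuple[bool, str]:
        # Only the formats the dashboard is assumed to render are accepted
        if fmt not in ("flamegraph", "speedscope"):
            return False, f"unsupported format: {fmt}"
        return True, f"<{fmt} profile of pid {pid} over {duration}s>"


if __name__ == "__main__":
    print(asyncio.run(EchoProfiler("/tmp/ray-logs").trace_dump(1234)))
```

Using an abstract base class means a new backend fails loudly at construction time if it forgets to implement one of the capabilities the dashboard calls.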

anyscalesam · 2024-09-11T17:33:33Z

cc @alanwguo here: we're looking at Scalene and PyInstrument as possible substitutes.

mtralka · 2024-09-11T17:48:32Z

Thanks for the quick reply. Both of those look promising. The AI features of Scalene are slightly concerning from a naive perspective (I assume phoning home is opt-in?). Selecting the profiler does seem to be the hardest decision here.

A "bring your own" profiler approach seems potentially attractive. Should the Ray library choose which profiler ships with it, or should we let users provide their own and register an adapter class (basically a ProfilingManager) as needed?
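One way to read "bring your own" is an adapter that wraps any profiler CLI the user already has installed. The sketch below is a hypothetical `CommandProfiler` (not an existing Ray class) that takes a user-supplied argument list, substitutes a `{pid}` placeholder (an assumed convention), runs it with `asyncio.create_subprocess_exec`, and maps the exit status to the `(success, output)` pair used elsewhere in this thread. No real profiler flags are assumed; the demo uses a Python one-liner as a stand-in.

```python
import asyncio
import sys


class CommandProfiler:
    """Hypothetical 'bring your own' adapter wrapping an arbitrary profiler CLI.

    `command` is a user-supplied argument list; every "{pid}" placeholder is
    replaced with the target process ID before execution.
    """

    def __init__(self, command: list[str], timeout: float = 30.0) -> None:
        self.command = command
        self.timeout = timeout

    async def trace_dump(self, pid: int) -> tuple[bool, str]:
        args = [arg.replace("{pid}", str(pid)) for arg in self.command]
        proc = await asyncio.create_subprocess_exec(
            *args,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            stdout, stderr = await asyncio.wait_for(proc.communicate(), self.timeout)
        except asyncio.TimeoutError:
            # Don't leave a hung profiler attached to the worker process
            proc.kill()
            await proc.wait()
            return False, f"profiler timed out after {self.timeout}s"
        if proc.returncode != 0:
            return False, stderr.decode(errors="replace")
        return True, stdout.decode(errors="replace")


if __name__ == "__main__":
    # Stand-in "profiler": a Python one-liner that echoes the PID it was given
    fake = CommandProfiler([sys.executable, "-c", "import sys; print('dump', sys.argv[1])", "{pid}"])
    print(asyncio.run(fake.trace_dump(42)))
```

The trade-off is that Ray would only see opaque output, so the adapter (or the user) stays responsible for producing a format the dashboard can display.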

alanwguo · 2024-09-11T17:56:15Z

Hi @mtralka , thanks for offering to help out here. Your outline of what changes would be needed makes sense to me. I agree that the selection of the profiler is the hardest decision here.

I do like the idea of a "bring your own" profiler, although we'd probably still replace py-spy as the default in the future. How would you imagine a user registering an adapter class?

mtralka · 2024-09-11T19:31:48Z

Of course, happy to help where possible. I agree pyspy shouldn't be the default if py3.12 support is mainstreamed.

Regarding implementation - this could certainly be a rabbit hole! IMO a relatively simple / straightforward approach here seems prudent since there doesn't seem to be much user desire for new features here other than fixing py3.12 (please correct me if I'm wrong!).

Some questions / assumptions:

- Input from Ray is always constant across profiler adapters (pid, native, etc.).
- Output should be delivered in a format compatible with existing Ray Dashboard features.
  - This format varies across profiling tasks:
    - CPU profile: likely the same as currently supported (flamegraph, "raw", speedscope).
      - I assume "raw" is a py-spy-specific format. This might be an issue if "raw" is interpreted downstream in Ray; if it isn't, then no worries.
    - Trace dump: I'm unaware of a "standard" format here, so we would need to find or establish one. I think it is safe to make users adapt to that standard rather than try to handle every format variation internally. Do you have any domain expertise here?
    - Memory profiling: likely the same as currently supported (flamegraph, table).
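The per-task output formats listed above could be captured as a small lookup table that an adapter's output is checked against before it reaches the dashboard. This is a hypothetical sketch: `ProfilingTask`, `SUPPORTED_FORMATS`, and `validate_format` are illustrative names, and the `"text"` entry for trace dumps is only a placeholder, since no standard trace-dump format has been chosen.

```python
from enum import Enum


class ProfilingTask(Enum):
    CPU_PROFILE = "cpu_profile"
    TRACE_DUMP = "trace_dump"
    MEMORY_PROFILE = "memory_profile"


# Formats the dashboard is assumed to render for each task (taken from the list above).
SUPPORTED_FORMATS: dict[ProfilingTask, frozenset[str]] = {
    ProfilingTask.CPU_PROFILE: frozenset({"flamegraph", "raw", "speedscope"}),
    ProfilingTask.TRACE_DUMP: frozenset({"text"}),  # placeholder until a standard is chosen
    ProfilingTask.MEMORY_PROFILE: frozenset({"flamegraph", "table"}),
}


def validate_format(task: ProfilingTask, fmt: str) -> str:
    """Return fmt if the dashboard can render it for this task, else raise ValueError."""
    allowed = SUPPORTED_FORMATS[task]
    if fmt not in allowed:
        raise ValueError(
            f"{task.value} does not support {fmt!r}; expected one of {sorted(allowed)}"
        )
    return fmt
```

Keeping the table in one place also makes it easy to drop `"raw"` later if it turns out to be py-spy specific.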

For profilers with "unique" features like Scalene's Web UI we'd have to consider how to present these features.

In its simplest form, we can keep the existing class-based approach. Pseudocode:

_PROFILERS: dict[str, type[CpuProfilingManager]] = {}
_DEFAULT_PROFILER: str | None = None

def register_profiler(profiler_manager: type[CpuProfilingManager], as_default: bool = False) -> None:
    """Register a CPU profiling manager.

    as_default: bool
        Whether to set the given profiler as the default one.
    """
    global _DEFAULT_PROFILER
    if not (isinstance(profiler_manager, type) and issubclass(profiler_manager, CpuProfilingManager)):
        raise TypeError(f"{profiler_manager!r} is not a CpuProfilingManager subclass")

    _PROFILERS[profiler_manager.PROFILER_NAME] = profiler_manager
    if as_default:
        _DEFAULT_PROFILER = profiler_manager.PROFILER_NAME

def get_profiler(name: str | None = None) -> type[CpuProfilingManager]:
    # Look up by name; fall back to the registered default when no name is given
    return _PROFILERS[name or _DEFAULT_PROFILER]

async def GetTraceback(self, request, context):
    pid = request.pid
    native = request.native
    desired_profiler_name = ...  # pull from ENV / config singleton / however Ray handles config
    profiler_type = get_profiler(desired_profiler_name)
    p = profiler_type(self._log_dir)
    success, output = await p.trace_dump(pid, native=native)
    return reporter_pb2.GetTracebackReply(output=output, success=success)

Here we register subclasses of CpuProfilingManager. The current CpuProfilingManager is py-spy specific, but its implementation seems refactorable without changing the API.

Note that a class-based profiler system split into CPU and memory categories is somewhat limiting for combined profilers such as Scalene, which handle CPU, GPU, and memory profiling all in one.

Thoughts? Totally acceptable if I'm off base here

alanwguo · 2024-09-23T17:19:52Z

@mtralka, registering subclasses or functions can make sense.

We should see whether the output of the profiling functions can be left fairly generic. With py-spy, the trace dump is raw text and the flame graph is an image file (renderable by the browser).

Memory profiling is actually handled separately using Memray instead of py-spy. We should consider making that customizable as well. I believe its output is also raw text.

If the generic output format is any HTML-renderable document, that may be a simple interface to conform to.
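As a sketch of the "any HTML-renderable document" contract, a small normalizer could turn each adapter's output into a single HTML string. The function name and the `content_type` tags (`"text"`, `"svg"`, `"html"`) are hypothetical; the point is that raw text such as a trace dump must be escaped and wrapped in `<pre>`, while an SVG flame graph (the kind py-spy produces) can be inlined directly.

```python
import html


def to_html_document(output: str, content_type: str, title: str = "Profile") -> str:
    """Normalize profiler output into an HTML string the dashboard could render as-is.

    content_type is a hypothetical tag supplied by the adapter: "text" for raw
    text (e.g. a trace dump), "svg" for an SVG flame graph, "html" for a full page.
    """
    if content_type == "html":
        return output  # already a complete document
    if content_type == "svg":
        body = output  # inline SVG is valid HTML5 content
    elif content_type == "text":
        # Escape so characters like '<' in frame names are not parsed as markup
        body = f"<pre>{html.escape(output)}</pre>"
    else:
        raise ValueError(f"unknown content type: {content_type!r}")
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>'
        f"<body>{body}</body></html>"
    )
```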

As for class-based overrides: I don't think we necessarily need to follow the current pattern of using classes. We could override individual functions per profiling capability, since I don't think there is much code reuse between the functions within the same class.
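The per-function alternative might look like the sketch below: one async hook per capability, registered by name, with no class hierarchy. `register_handler`, `run_capability`, and the capability strings are hypothetical names, not existing Ray APIs, and `echo_traceback` is a placeholder for a real profiler call.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical hook signature: takes a pid plus keyword options and returns
# (success, output), where output is any HTML-renderable text.
ProfileFn = Callable[..., Awaitable[tuple[bool, str]]]

_HANDLERS: dict[str, ProfileFn] = {}


def register_handler(capability: str, fn: ProfileFn) -> None:
    """Override a single capability, e.g. "traceback", "cpu_profile", or "memory_profile"."""
    _HANDLERS[capability] = fn


async def run_capability(capability: str, pid: int, **options) -> tuple[bool, str]:
    handler = _HANDLERS.get(capability)
    if handler is None:
        return False, f"no handler registered for {capability!r}"
    return await handler(pid, **options)


async def echo_traceback(pid: int, native: bool = False) -> tuple[bool, str]:
    # Placeholder standing in for a real profiler invocation
    return True, f"traceback for {pid} (native={native})"


register_handler("traceback", echo_traceback)

if __name__ == "__main__":
    print(asyncio.run(run_capability("traceback", 7, native=True)))
```

A side benefit for combined profilers like Scalene: the same function can be registered under several capabilities without forcing it into separate CPU and memory classes.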

Do you want to give that a shot?

mtralka added the "bug" and "triage" labels on Aug 28, 2024.

anyscalesam added the "P1" and "core" labels and removed the "triage" label on Sep 3, 2024.

anyscalesam added the "@external-author-action-required" label on Sep 24, 2024.
