Flexibility in the deps #165
Conversation
```diff
@@ -19,9 +19,9 @@ name: RayDP CI
 on:
   push:
-    branches: [ master ]
+    branches: [ main, master ]
```
It seems we don't have the main branch.
right, i added it b/c main is now the default branch rather than master. that way the workflow runs: https://github.com/mjschock/raydp/actions/runs/1007895832
python/setup.py (outdated)
"psutil < 6.0.0", | ||
"pyarrow >= 0.10, < 5.0.0", | ||
"ray >= 1.4.0, < 2.0.0", | ||
"pyspark >= 3.0.0, < 4.0.0", |
Hi @kira-lin, does RayDP support Spark 3.1.0?
it works fine when building with the older version in the pom.xml: #121 (comment). that issue needs to be addressed separately, but this unblocks usage of 3.1.0, which is the minimum required for, e.g., delta-spark (https://pypi.org/project/delta-spark/), which i'd like to integrate
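For illustration, here is a minimal sketch of what that integration could look like once the pyspark cap is raised. It assumes `raydp.init_spark` accepts a `configs` dict of Spark properties and that the matching Delta Lake jar is available to the executors; the two config keys are the standard Delta Lake session settings, and the app name, sizing, and output path are placeholders:

```python
# Sketch only: run Delta Lake on a RayDP-created Spark session, assuming
# pyspark >= 3.1.0 and delta-spark are installed and the matching delta-core
# jar is on the Spark classpath.
import raydp

spark = raydp.init_spark(
    app_name="raydp-delta-example",  # placeholder
    num_executors=2,
    executor_cores=2,
    executor_memory="4GB",
    configs={
        # standard Delta Lake session settings
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    },
)

# write a toy Delta table to a placeholder path
spark.range(100).write.format("delta").save("/tmp/delta-table")
```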
Well, if there is demand for Spark 3.1, we should give it a try. @mjschock can you please also modify the dependencies in raydp.yaml to see if it passes CI?
By the way, I actually have a patch to support Spark 3.1, but we are still discussing whether we should add a shim layer for different Spark versions.
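A minimal sketch of the shim-layer idea, assuming the Python side dispatches on the installed pyspark version; the helpers below are hypothetical and not part of RayDP:

```python
# Sketch only: detect the installed pyspark version and let version-specific
# code paths branch on it. These helpers are hypothetical, not RayDP's API.
import pyspark


def pyspark_version() -> tuple:
    """Return (major, minor) of the installed pyspark, e.g. (3, 1)."""
    return tuple(int(part) for part in pyspark.__version__.split(".")[:2])


def supports_spark_31() -> bool:
    """True when Spark 3.1-specific internals should be used."""
    return pyspark_version() >= (3, 1)
```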
@kira-lin i updated raydp.yml to use both pyspark 3.0.0 and 3.1.2 and they both pass
python/setup.py (outdated)
"ray == 1.4.0", | ||
"pyspark >= 3.0.0, < 3.1.0", | ||
"netifaces" | ||
"numpy < 2.0.0", |
Hi @mjschock, I think the changes for pandas make sense. Why did you change the others? They should be flexible already.
Actually, Ray has some internal changes between versions, so we have to specify the exact version.
it's a preventative measure so i don't run into things like the pyspark versioning issue in the future. assuming other library authors play nice and abide by semver, then patch and minor updates shouldn't break anything, only major. i can reduce the changes just to pyspark if desired but i wanted to be consistent
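To make that pinning style concrete, an illustrative setup.py fragment (not RayDP's actual one) that caps each dependency at the next major release, so patch and minor updates stay available while a breaking major bump is excluded, assuming upstream packages follow semver:

```python
# Illustrative only, not RayDP's actual setup.py: upper-bound each dependency
# at the next major release so patch/minor updates are allowed but a breaking
# major bump is not (assuming upstream follows semver).
from setuptools import setup

setup(
    name="example-package",  # placeholder name
    version="0.0.1",
    install_requires=[
        "pyspark >= 3.0.0, < 4.0.0",  # any Spark 3.x release
        "pandas >= 1.0.0, < 2.0.0",   # any pandas 1.x release
    ],
)
```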
@mjschock, based on our previous experience, we usually need to make changes to support new Ray releases like 1.5. So we should probably keep ray == 1.4.0, otherwise things may break. For most other dependencies, my understanding is that setting a minimal version, or not setting anything at all, should be fine. Can we just change pyspark and pandas for now to unblock your needs?
"psutil", | ||
"pyarrow >= 0.10", | ||
"ray == 1.4.0", | ||
"pyspark >= 3.0.0, < 3.1.0", | ||
"ray >= 1.4.0, < 1.5.0", |
@carsonwang - would this be alright, allowing patch changes?
Yes, looks good. Thanks!
It seems like this PR has gone beyond its title. Can you please explain why you merged PR #161? Using the code search path is not ready yet, because it requires all Python functions/actors to be in the code search path as well. This might break our current users' applications.
@kira-lin sorry about that, i should have created a separate branch. i've done that now and reverted the changes. i'm now running into stack overflow errors related to ZipArchive.scala and, after a bit of research, it seemed this could be related to an old version of scala, and possibly the usage of py4j, so i pulled in the changes to see if they helped get around that issue. this is what i'm seeing:
Oh, I see. Can you please be more specific about what causes this error? You are using Spark 3.0.0 when building the Scala part, and 3.1.2 for pyspark, right? And when you call raydp.init_spark, you see this error message? Besides, what Java version are you using? It seems the Scala, Spark, and py4j versions are not modified, so it should be working. Again, thanks for contributing @mjschock
yes, Spark 3.0.0 in the pom.xml when i run ./build.sh. and yes, pyspark 3.1.2 when initializing. java is 1.8, the openjdk version. the example is roughly the one from the anyscale post here: https://www.anyscale.com/blog/data-processing-support-in-ray

```python
import ray
import raydp

ray.util.connect("ray-head:10001")


@ray.remote
class PySparkDriver:
    def __init__(self):
        self.spark = raydp.init_spark(
            app_name='RayDP example',
            num_executors=2,
            executor_cores=2,
            executor_memory='4GB')

    def foo(self):
        return self.spark.range(1000).repartition(10).count()


driver = PySparkDriver.remote()
print(ray.get(driver.foo.remote()))
```
anyway, i think this is unrelated. i'll investigate further when i get a chance
sorry I cannot help here since I've never seen errors like this, but I guess it's due to the environment. Did you always have this problem? Is it OK for you to install our package with pip? If so, I can upgrade the Spark version now, since it'll only affect the nightly release.
By the way, we are curious about the scenarios in which you plan to use RayDP, and what features would be useful. If you have any other issues, feel free to ask us.
Thanks @mjschock. Merged this.
This assumes the external library authors are using semver.
There's still #121 to address, but this unblocks those of us who want to use a newer version of pyspark.