Flexibility in the deps #165
Conversation
```diff
@@ -19,9 +19,9 @@ name: RayDP CI
 on:
   push:
-    branches: [ master ]
+    branches: [ main, master ]
```
It seems we don't have the main branch.
right, i added it b/c main is now the default branch rather than master. that way the workflow runs: https://github.com/mjschock/raydp/actions/runs/1007895832
python/setup.py (outdated)
"psutil < 6.0.0", | ||
"pyarrow >= 0.10, < 5.0.0", | ||
"ray >= 1.4.0, < 2.0.0", | ||
"pyspark >= 3.0.0, < 4.0.0", |
Hi @kira-lin, does RayDP support Spark 3.1.0?
it works fine when building with the older version in the pom.xml: #121 (comment). that issue needs to be addressed separately, but this unblocks usage of 3.1.0, which is the minimum required for, e.g., delta-spark (https://pypi.org/project/delta-spark/), which i'd like to integrate
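For illustration, here is a minimal sketch of what that integration could look like once the pyspark cap is raised. It assumes `raydp.init_spark` accepts a `configs` dict of Spark properties and that the matching Delta Lake jar is available to the executors; the two config keys are the standard Delta Lake session settings, and the app name, sizing, and output path are placeholders:

```python
# Sketch only: run Delta Lake on a RayDP-created Spark session, assuming
# pyspark >= 3.1.0 and delta-spark are installed and the matching delta-core
# jar is on the Spark classpath.
import raydp

spark = raydp.init_spark(
    app_name="raydp-delta-example",  # placeholder
    num_executors=2,
    executor_cores=2,
    executor_memory="4GB",
    configs={
        # standard Delta Lake session settings
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    },
)

# write a toy Delta table to a placeholder path
spark.range(100).write.format("delta").save("/tmp/delta-table")
```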
Well, if there is demand for Spark 3.1, we should give it a try. @mjschock can you please also modify the dependencies in raydp.yaml to see if it passes CI?
By the way, I actually have a patch to support Spark 3.1, but we are still discussing whether we should add a shim layer for different Spark versions.
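A minimal sketch of the shim-layer idea, assuming the Python side dispatches on the installed pyspark version; the helpers below are hypothetical and not part of RayDP:

```python
# Sketch only: detect the installed pyspark version and let version-specific
# code paths branch on it. These helpers are hypothetical, not RayDP's API.
import pyspark


def pyspark_version() -> tuple:
    """Return (major, minor) of the installed pyspark, e.g. (3, 1)."""
    return tuple(int(part) for part in pyspark.__version__.split(".")[:2])


def supports_spark_31() -> bool:
    """True when Spark 3.1-specific internals should be used."""
    return pyspark_version() >= (3, 1)
```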
@kira-lin i updated raydp.yml to use both pyspark 3.0.0 and 3.1.2 and they both pass
python/setup.py (outdated)
"ray == 1.4.0", | ||
"pyspark >= 3.0.0, < 3.1.0", | ||
"netifaces" | ||
"numpy < 2.0.0", |
Hi @mjschock, I think the changes for pandas make sense. Why did you change the others? They should be flexible already.
Actually, Ray has some internal changes between versions, so we have to specify the exact version.
it's a preventative measure so i don't run into things like the pyspark versioning issue in the future. assuming other library authors play nice and abide by semver, then patch and minor updates shouldn't break anything, only major. i can reduce the changes just to pyspark if desired but i wanted to be consistent
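To make that pinning style concrete, an illustrative setup.py fragment (not RayDP's actual one) that caps each dependency at the next major release, so patch and minor updates stay available while a breaking major bump is excluded, assuming upstream packages follow semver:

```python
# Illustrative only, not RayDP's actual setup.py: upper-bound each dependency
# at the next major release so patch/minor updates are allowed but a breaking
# major bump is not (assuming upstream follows semver).
from setuptools import setup

setup(
    name="example-package",  # placeholder name
    version="0.0.1",
    install_requires=[
        "pyspark >= 3.0.0, < 4.0.0",  # any Spark 3.x release
        "pandas >= 1.0.0, < 2.0.0",   # any pandas 1.x release
    ],
)
```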
@mjschock, based on our previous experience, we usually need to make changes to support new Ray releases like 1.5. So we should probably keep ray == 1.4.0, otherwise things may break. For most other dependencies, my understanding is that setting a minimal version, or not setting anything at all, should be fine. Can we just change pyspark and pandas for now to unblock your needs?
"psutil", | ||
"pyarrow >= 0.10", | ||
"ray == 1.4.0", | ||
"pyspark >= 3.0.0, < 3.1.0", | ||
"ray >= 1.4.0, < 1.5.0", |
@carsonwang - would this be alright, allowing patch changes?
Yes, looks good. Thanks!
It seems like this PR has gone beyond its title. Can you please explain why you merged PR #161? Using the code search path is not ready yet, because it requires all Python functions/actors to be in the code search path as well. This might break our current users' applications.
@kira-lin sorry about that, i should have created a separate branch. i've done that now and reverted the changes. i'm now running into stack overflow errors related to ZipArchive.scala and, after a bit of research, it seemed this could be related to an old version of scala, and possibly the usage of py4j, so i pulled in the changes to see if they helped get around that issue. this is what i'm seeing:
Oh, I see. Can you please be more specific about what causes this error? You are using Spark 3.0.0 when building the Scala part, and 3.1.2 for pyspark, right? And when you call raydp.init_spark, you see this error message? Besides, what Java version are you using? It seems the Scala, Spark, and py4j versions are not modified, so it should be working. Again, thanks for contributing @mjschock
yes, Spark 3.0.0 in the pom.xml when i run ./build.sh. and yes, pyspark 3.1.2 when initializing. java is 1.8, the openjdk version. the example is roughly the one from the anyscale post here: https://www.anyscale.com/blog/data-processing-support-in-ray

```python
import ray
import raydp

ray.util.connect("ray-head:10001")


@ray.remote
class PySparkDriver:
    def __init__(self):
        self.spark = raydp.init_spark(
            app_name='RayDP example',
            num_executors=2,
            executor_cores=2,
            executor_memory='4GB')

    def foo(self):
        return self.spark.range(1000).repartition(10).count()


driver = PySparkDriver.remote()
print(ray.get(driver.foo.remote()))
```
anyway, i think this is unrelated. i'll investigate further when i get a chance
sorry I cannot help here since I've never seen errors like this, but I guess it's due to the environment. Did you always have this problem? Is it OK for you to install our package with pip? If so, I can upgrade the Spark version now, since it'll only affect the nightly release.
By the way, we are curious about the scenarios in which you plan to use RayDP, and what features would be useful. If you have any other issues, feel free to ask us.
Thanks @mjschock. Merged this.
This assumes the external library authors are using semver.
There's still #121 to address, but this unblocks those of us who want to use a newer version of pyspark.