[BUG][Spark] Unable to run Spark Connect 3.5.1 with Delta 3.2.0 #3332
Comments
I am having the same issue. Using:
This does NOT work using Delta 3.2.0 (I have also tried 3.1.0):

```python
from pyspark.sql import SparkSession

url = "spark://${SPARK_MASTER_IP}:7077"
spark = (
    SparkSession.builder.master(url)
    .appName("myapp")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:3.2.0,io.delta:delta-contribs_2.12:3.2.0")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```

Here is the error log:
But this works (using Delta 2.4.0):

```python
from pyspark.sql import SparkSession

url = "spark://${SPARK_MASTER_IP}:7077"
spark = (
    SparkSession.builder.master(url)
    .appName("myapp")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0,io.delta:delta-contribs_2.12:2.4.0")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```

I am not sure if I can use Delta 2.4.0 with Spark 3.5.1 though. Can I?
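One detail worth noting about the coordinates in the snippets above: Delta renamed its main artifact from `delta-core` (1.x/2.x) to `delta-spark` (3.x and later), so a coordinate like `delta-core_2.12:3.2.0` will not resolve from Maven. A small illustrative helper (the function name is mine, not part of any Delta API):

```python
def delta_maven_coordinate(delta_version: str, scala_version: str = "2.12") -> str:
    """Return the Maven coordinate for Delta Lake's main artifact.

    Delta renamed the artifact from delta-core (1.x/2.x) to delta-spark
    (3.x and later), so requesting a 3.x version under the delta-core
    name fails to resolve.
    """
    major = int(delta_version.split(".", 1)[0])
    artifact = "delta-core" if major < 3 else "delta-spark"
    return f"io.delta:{artifact}_{scala_version}:{delta_version}"

print(delta_maven_coordinate("2.4.0"))  # io.delta:delta-core_2.12:2.4.0
print(delta_maven_coordinate("3.2.0"))  # io.delta:delta-spark_2.12:3.2.0
```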
Thanks for reporting this. Spark Connect support will be added in Delta 4 (see the linked issue). Feel free to chime in on #3240 if you have any suggestions!
Hey @matt-gorman, you said that you followed the steps of https://docs.delta.io/latest/delta-spark-connect.html and got an error. Could you please confirm what error you are seeing?

@rhazegh Delta over Spark Connect will only be fully available in Delta 4.0; however, you can already try the Delta 4.0 Preview. Could you give that a try?
@longvu-db, attempting to write a DataFrame:

Attempting to read a DataFrame from HDFS:

The tracebacks similarly pointed back to core.py, and this appeared to be a missing package or something similar when running delta-core with the new versions instead of delta-spark with the previous versions.
@matt-gorman Why are you using delta-core? AFAIK, this only works with delta-spark. Are you able to start two processes, a server and a client, on the same local machine and have them work with each other?
@longvu-db Sorry, I mixed these up: delta-core is what we're using now with Spark 3.4.3; running Connect with delta-spark is where we're running into issues. Using these packages: org.apache.spark:spark-connect_2.12:3.4.3. Connect starts and works fine with this command:

When trying to run Spark 3.5.1, delta-spark was the package. Running Connect with these package versions (org.apache.spark:spark-connect_2.12:3.5.1) and this command gives the errors above:

I can run PySpark locally on the master with those same packages without problems:

However, if I run a PySpark client using Connect started with those packages (the second start-connect-server.sh command above, with delta-spark_2.12:3.2.0) from the master, I get errors.
@matt-gorman I understand the problem now: some fixes needed to go into Spark for Delta over Spark Connect, and those fixes landed after 3.5.1. Could you please use the Spark Connect 4.0.0 preview1 package, as in the guide here?
So for Delta over Spark Connect to work, you need to use the Spark Connect 4.0 preview version as well. |
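The takeaway above can be captured as a tiny version guard. This is a hypothetical helper (not part of any Spark or Delta API) encoding what the thread concludes: Delta over Spark Connect needs both sides at 4.0 (preview) or later.

```python
def delta_connect_supported(spark_version: str, delta_version: str) -> bool:
    """Per this thread: Delta over Spark Connect requires the Spark 4.0
    preview AND the Delta 4.0 preview; earlier pairings fail because the
    necessary Spark-side fixes landed only after Spark 3.5.1."""
    spark_major = int(spark_version.split(".", 1)[0])
    delta_major = int(delta_version.split(".", 1)[0])
    return spark_major >= 4 and delta_major >= 4

print(delta_connect_supported("3.5.1", "3.2.0"))  # False: the reported failing combination
print(delta_connect_supported("4.0.0", "4.0.0"))  # True: the recommended preview pairing
```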
Bug
Which Delta project/connector is this regarding?
Describe the problem
Running Spark and reading/writing Delta from a client connected to a Spark Connect server (start-connect-server.sh) with the Delta packages fails: it is either unable to convert or reports a missing storage class. I was unable to find specific documentation about running Delta 3.2.0 on Spark 3.5.1 and whether any additional packages or configuration are needed. This works using a PySpark shell; however, the same packages with Spark Connect give different results.
Steps to reproduce
```shell
sbin/start-connect-server.sh
```
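The reporter's exact launch options are not preserved in this thread; the sketch below is a hypothetical reconstruction of such a command, using the package versions discussed above and the Delta catalog/extension settings from the Delta quickstart docs:

```shell
# Hypothetical sketch only -- actual paths, versions, and extra options
# come from the reporter's environment, not from this thread.
sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.1,io.delta:delta-spark_2.12:3.2.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```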
Observed results
PySpark (Works)
Using the Delta documentation, this DOES work using a PySpark shell:
This both successfully reads and writes to a Hadoop cluster in Delta format.
Spark Connect (Doesn't Work)
Running Spark Connect with the same options does not have the same effect:
Connect to Spark Connect according to the documentation:
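For reference, the documented client-side pattern is `SparkSession.builder.remote("sc://host:port")`. A minimal sketch of building that connection string (the host name is a placeholder, and the pyspark call itself is shown only in comments since it needs a running server):

```python
def connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect remote URL of the form sc://host:port.
    15002 is Spark Connect's default server port."""
    return f"sc://{host}:{port}"

# On the client you would then do (requires pyspark with Connect support):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(connect_url("spark-master")).getOrCreate()
print(connect_url("spark-master"))  # sc://spark-master:15002
```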
Comparing environment settings, the only differences were the additional JARs on Spark Connect (this was expected):
And a few other Spark configurations set with the PySpark App versus Spark Connect:
Attempted to re-run Spark Connect with these settings; however, the result was the same:
Expected results
Expected Spark Connect to behave the same as a PySpark session on the master node.
Further details
Behavior is the same when connecting from a JupyterHub server:
This produces the same result from Spark Connect when running the same commands. Ideally, this would work from JupyterHub using a Spark Connect remote client. Previously, the following configuration with Spark Connect worked without any of the issues above, with Spark 3.4.0 and Delta (delta-core) 2.4.0.
Spark/Delta 4.0.0 preview
I additionally set this up using the documentation here. Spark 4.0.0 also appears to use delta-spark, and the result was the same error messages as above.
Spark 3.4.3/Delta 2.4.0
These versions are close to what was originally being run without issues (3.4.0/2.4.0), and setting this up proved successful. This uses delta-core instead of delta-spark.
Environment information
Connect Python Deps:
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?