
BackendNotFoundError on databricks/pyspark cluster #1673

Closed
TobiRoby opened this issue Jun 5, 2024 · 7 comments · Fixed by #1775
Labels
bug Something isn't working

Comments

@TobiRoby

TobiRoby commented Jun 5, 2024

Hi,

I am trying to get pandera up and running on Databricks.
However, I receive the following BackendNotFoundError and do not know its cause:

BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pyspark.container.DataFrameSchema'>, <class 'pyspark.sql.connect.dataframe.DataFrame'>). Looked up the following base classes: (<class 'pyspark.sql.connect.dataframe.DataFrame'>, <class 'object'>)

Code example

```python
import pandera.pyspark as pa
import pyspark.sql.types as T

class TestSchema(pa.DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    name: T.StringType() = pa.Field(str_startswith="B")

df = spark.createDataFrame([(5, "Bread"), (15, "Butter")], ["id", "name"])

TestSchema.validate(check_obj=df)
```

Complete error

```
BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pyspark.container.DataFrameSchema'>, <class 'pyspark.sql.connect.dataframe.DataFrame'>). Looked up the following base classes: (<class 'pyspark.sql.connect.dataframe.DataFrame'>, <class 'object'>)
File <command-2461794647677534>, line 10
      6     name: T.StringType() = pa.Field(str_startswith="B")
      8 df = spark.createDataFrame([(5, "Bread"), (15, "Butter")], ["id", "name"])
---> 10 TestSchema.validate(check_obj=df)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/model.py:289, in DataFrameModel.validate(cls, check_obj, head, tail, sample, random_state, lazy, inplace)
    274 @classmethod
    275 @docstring_substitution(validate_doc=DataFrameSchema.validate.__doc__)
    276 def validate(
   (...)
    284     inplace: bool = False,
    285 ) -> Optional[DataFrameBase[TDataFrameModel]]:
    286     """%(validate_doc)s"""
    287     return cast(
    288         DataFrameBase[TDataFrameModel],
--> 289         cls.to_schema().validate(
    290             check_obj, head, tail, sample, random_state, lazy, inplace
    291         ),
    292     )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/container.py:333, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    330     return check_obj
    331 error_handler = ErrorHandler(lazy)
--> 333 return self._validate(
    334     check_obj=check_obj,
    335     head=head,
    336     tail=tail,
    337     sample=sample,
    338     random_state=random_state,
    339     lazy=lazy,
    340     inplace=inplace,
    341     error_handler=error_handler,
    342 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/container.py:364, in DataFrameSchema._validate(self, check_obj, head, tail, sample, random_state, lazy, inplace, error_handler)
    355 if self._is_inferred:
    356     warnings.warn(
    357         f"This {type(self)} is an inferred schema that hasn't been "
    358         "modified. It's recommended that you refine the schema "
   (...)
    361         UserWarning,
    362     )
--> 364 return self.get_backend(check_obj).validate(
    365     check_obj=check_obj,
    366     schema=self,
    367     head=head,
    368     tail=tail,
    369     sample=sample,
    370     random_state=random_state,
    371     lazy=lazy,
    372     inplace=inplace,
    373     error_handler=error_handler,
    374 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/base/schema.py:96, in BaseSchema.get_backend(cls, check_obj, check_type)
     94     except KeyError:
     95         pass
---> 96 raise BackendNotFoundError(
     97     f"Backend not found for backend, class: {(cls, check_obj_cls)}. "
     98     f"Looked up the following base classes: {classes}"
     99 )
```

Runtime environment:

  • databricks-runtime 14.3 LTS
  • python 3.10.12
  • pyspark 3.5.0
  • pandera 0.19.3
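
The lookup that fails above (BaseSchema.get_backend) walks the check object's MRO against a registry keyed by class. The following is a minimal, hypothetical sketch of that mechanism (stand-in classes, not pandera's actual source) showing why a Spark Connect DataFrame misses a backend registered only for the classic class:

```python
# Hypothetical sketch of an MRO-based backend registry, illustrating the error
# above. Stand-in classes are used; this is not pandera's actual code.

class BackendNotFoundError(Exception):
    pass

BACKENDS = {}  # maps a dataframe class to its validation backend

def register_backend(df_cls, backend):
    BACKENDS[df_cls] = backend

def get_backend(check_obj):
    # Walk the method resolution order of the object being validated.
    classes = type(check_obj).__mro__
    for cls in classes:
        if cls in BACKENDS:
            return BACKENDS[cls]
    raise BackendNotFoundError(
        f"Backend not found for class {type(check_obj)}. "
        f"Looked up the following base classes: {classes}"
    )

class ClassicDataFrame:  # stands in for pyspark.sql.DataFrame
    pass

class ConnectDataFrame:  # stands in for pyspark.sql.connect.dataframe.DataFrame
    pass                 # note: it does NOT subclass ClassicDataFrame

register_backend(ClassicDataFrame, "pyspark-backend")

print(get_backend(ClassicDataFrame()))  # found via the registry
try:
    get_backend(ConnectDataFrame())     # never registered -> error, as reported
except BackendNotFoundError as exc:
    print(f"lookup failed: {exc}")
```

Because the Connect class's MRO is just (ConnectDataFrame, object), a registration keyed on the classic DataFrame is never reached.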
@TobiRoby TobiRoby added the bug Something isn't working label Jun 5, 2024
@gidiLuke

Hi Tobi!

I want to add the following observation:
The same error also occurs when running the example in a Databricks notebook in the web UI on a shared all-purpose cluster on runtime 14.3 (the reported class is likewise <class 'pyspark.sql.connect.dataframe.DataFrame'>, even though Databricks Connect is not in use).

However, the code example works on both 14.3 and 15.3 personal compute in the web UI. Yet it fails with Databricks Connect from VS Code with the same error.

Running the file from VS Code via "upload and run file on Databricks" with the Databricks extension likewise works only on personal compute, not on the shared cluster.

@cosmicBboy
Collaborator

So the place where the backends are registered is defined here:
https://github.com/unionai-oss/pandera/blob/main/pandera/backends/pyspark/register.py

Is there a difference between pyspark.sql.DataFrame and pyspark.sql.connect.dataframe.DataFrame? This may be the issue.

@cosmicBboy
Collaborator

@TobiRoby @gidiLuke I'm trying to reproduce this error. How can I do so with a local setup (i.e., without a Databricks or pyspark cluster)?

@filipeo2-mck
Contributor

These *.connect.* modules are probably related to the new Spark Connect decoupled architecture. This link has some info on how to set it up.

@gidiLuke

Spark Connect is most probably the centerpiece of the issue, in my view as well.
Local reproduction should therefore be possible with the info provided by @filipeo2-mck - thanks!

@cosmicBboy
The APIs of pyspark.sql.DataFrame and pyspark.sql.connect.dataframe.DataFrame should be the same.
So how can pyspark.sql.connect.dataframe.DataFrame be added as a backend?
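
Since the Spark Connect DataFrame class does not inherit from the classic one, one plausible direction is to register the same backend under both classes. This is a hedged sketch only: the registry shape and names are illustrative, not the actual code in pandera/backends/pyspark/register.py.

```python
# Illustrative sketch: register one backend under both DataFrame classes.
# The registry shape here is hypothetical, not pandera's actual registration code.

registry = {}

def register(df_cls, backend):
    registry[df_cls] = backend

try:
    from pyspark.sql import DataFrame as ClassicDF
except ImportError:  # pyspark not installed: use a stub for illustration
    class ClassicDF:
        pass

try:
    # The Spark Connect DataFrame (pyspark >= 3.4) lives in a separate module
    # and does not subclass the classic DataFrame, so it needs its own entry.
    from pyspark.sql.connect.dataframe import DataFrame as ConnectDF
except ImportError:  # Connect extras missing: stub for illustration
    class ConnectDF:
        pass

register(ClassicDF, "pyspark-backend")
register(ConnectDF, "pyspark-backend")  # the registration missing in this issue
```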

@filipeo2-mck
Contributor

Hello!
As I'm also interested in this fix, I'll take this issue and check if I'm able to raise a fix PR for it.

@filipeo2-mck
Contributor

filipeo2-mck commented Aug 2, 2024

I just opened PR #1775.
The changes worked locally: I set up a Spark Connect server and ran some pandera validations successfully on pyspark 3.2 through 3.5.
I'd be happy if someone could test this branch on Databricks :)
