
BackendNotFoundError on databricks/pyspark cluster #1673

Closed
TobiRoby opened this issue Jun 5, 2024 · 7 comments · Fixed by #1775
Labels
bug Something isn't working

Comments

@TobiRoby

TobiRoby commented Jun 5, 2024

Hi,

I am trying to get pandera up and running on Databricks.
However, I receive the following BackendNotFoundError and do not know its cause:

BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pyspark.container.DataFrameSchema'>, <class 'pyspark.sql.connect.dataframe.DataFrame'>). Looked up the following base classes: (<class 'pyspark.sql.connect.dataframe.DataFrame'>, <class 'object'>)

Code example

```python
import pandera.pyspark as pa
import pyspark.sql.types as T

class TestSchema(pa.DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    name: T.StringType() = pa.Field(str_startswith="B")

df = spark.createDataFrame([(5, "Bread"), (15, "Butter")], ["id", "name"])

TestSchema.validate(check_obj=df)
```

Complete error

```
BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pyspark.container.DataFrameSchema'>, <class 'pyspark.sql.connect.dataframe.DataFrame'>). Looked up the following base classes: (<class 'pyspark.sql.connect.dataframe.DataFrame'>, <class 'object'>)
File <command-2461794647677534>, line 10
      6     name: T.StringType() = pa.Field(str_startswith="B")
      8 df = spark.createDataFrame([(5, "Bread"), (15, "Butter")], ["id", "name"])
---> 10 TestSchema.validate(check_obj=df)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/model.py:289, in DataFrameModel.validate(cls, check_obj, head, tail, sample, random_state, lazy, inplace)
    274 @classmethod
    275 @docstring_substitution(validate_doc=DataFrameSchema.validate.__doc__)
    276 def validate(
   (...)
    284     inplace: bool = False,
    285 ) -> Optional[DataFrameBase[TDataFrameModel]]:
    286     """%(validate_doc)s"""
    287     return cast(
    288         DataFrameBase[TDataFrameModel],
--> 289         cls.to_schema().validate(
    290             check_obj, head, tail, sample, random_state, lazy, inplace
    291         ),
    292     )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/container.py:333, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    330     return check_obj
    331 error_handler = ErrorHandler(lazy)
--> 333 return self._validate(
    334     check_obj=check_obj,
    335     head=head,
    336     tail=tail,
    337     sample=sample,
    338     random_state=random_state,
    339     lazy=lazy,
    340     inplace=inplace,
    341     error_handler=error_handler,
    342 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/container.py:364, in DataFrameSchema._validate(self, check_obj, head, tail, sample, random_state, lazy, inplace, error_handler)
    355 if self._is_inferred:
    356     warnings.warn(
    357         f"This {type(self)} is an inferred schema that hasn't been "
    358         "modified. It's recommended that you refine the schema "
   (...)
    361         UserWarning,
    362     )
--> 364 return self.get_backend(check_obj).validate(
    365     check_obj=check_obj,
    366     schema=self,
    367     head=head,
    368     tail=tail,
    369     sample=sample,
    370     random_state=random_state,
    371     lazy=lazy,
    372     inplace=inplace,
    373     error_handler=error_handler,
    374 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/base/schema.py:96, in BaseSchema.get_backend(cls, check_obj, check_type)
     94     except KeyError:
     95         pass
---> 96 raise BackendNotFoundError(
     97     f"Backend not found for backend, class: {(cls, check_obj_cls)}. "
     98     f"Looked up the following base classes: {classes}"
     99 )
```

Runtime environment:

  • databricks-runtime 14.3 LTS
  • python 3.10.12
  • pyspark 3.5.0
  • pandera 0.19.3
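
The lookup that fails above (BaseSchema.get_backend) walks the check object's MRO against a registry keyed by class. The following is a minimal, hypothetical sketch of that mechanism (stand-in classes, not pandera's actual source) showing why a Spark Connect DataFrame misses a backend registered only for the classic class:

```python
# Hypothetical sketch of an MRO-based backend registry, illustrating the error
# above. Stand-in classes are used; this is not pandera's actual code.

class BackendNotFoundError(Exception):
    pass

BACKENDS = {}  # maps a dataframe class to its validation backend

def register_backend(df_cls, backend):
    BACKENDS[df_cls] = backend

def get_backend(check_obj):
    # Walk the method resolution order of the object being validated.
    classes = type(check_obj).__mro__
    for cls in classes:
        if cls in BACKENDS:
            return BACKENDS[cls]
    raise BackendNotFoundError(
        f"Backend not found for class {type(check_obj)}. "
        f"Looked up the following base classes: {classes}"
    )

class ClassicDataFrame:  # stands in for pyspark.sql.DataFrame
    pass

class ConnectDataFrame:  # stands in for pyspark.sql.connect.dataframe.DataFrame
    pass                 # note: it does NOT subclass ClassicDataFrame

register_backend(ClassicDataFrame, "pyspark-backend")

print(get_backend(ClassicDataFrame()))  # found via the registry
try:
    get_backend(ConnectDataFrame())     # never registered -> error, as reported
except BackendNotFoundError as exc:
    print(f"lookup failed: {exc}")
```

Because the Connect class's MRO is just (ConnectDataFrame, object), a registration keyed on the classic DataFrame is never reached.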
@TobiRoby TobiRoby added the bug Something isn't working label Jun 5, 2024
@gidiLuke

Hi Tobi!

I want to add the following observation:
The same error also occurs when running the example in a Databricks notebook in the web UI on a shared all-purpose cluster on runtime 14.3 (the reported class is likewise <class 'pyspark.sql.connect.dataframe.DataFrame'>, even though Databricks Connect is not in use).

However, the code example works on both 14.3 and 15.3 personal compute in the web UI. Yet it fails with Databricks Connect from VS Code with the same error.

Running the file from VS Code via "upload and run file on Databricks" with the Databricks extension likewise works only on personal compute, not on the shared cluster.

@cosmicBboy
Collaborator

So the place where the backends are registered is defined here:
https://github.com/unionai-oss/pandera/blob/main/pandera/backends/pyspark/register.py

Is there a difference between pyspark.sql.DataFrame and pyspark.sql.connect.dataframe.DataFrame? This may be the issue.

@cosmicBboy
Collaborator

@TobiRoby @gidiLuke I'm trying to reproduce this error. How can I do so with a local setup (i.e., without a Databricks or pyspark cluster)?

@filipeo2-mck
Contributor

These *.connect.* modules are probably related to the new Spark Connect decoupled architecture. This link has some info on how to set it up.

@gidiLuke

Spark Connect is most probably the centerpiece of the issue, in my view as well.
Local reproduction should therefore be possible with the info provided by @filipeo2-mck - thanks!

@cosmicBboy
The APIs of pyspark.sql.DataFrame and pyspark.sql.connect.dataframe.DataFrame should be the same.
So how can pyspark.sql.connect.dataframe.DataFrame be added as a backend?
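
Since the Spark Connect DataFrame class does not inherit from the classic one, one plausible direction is to register the same backend under both classes. This is a hedged sketch only: the registry shape and names are illustrative, not the actual code in pandera/backends/pyspark/register.py.

```python
# Illustrative sketch: register one backend under both DataFrame classes.
# The registry shape here is hypothetical, not pandera's actual registration code.

registry = {}

def register(df_cls, backend):
    registry[df_cls] = backend

try:
    from pyspark.sql import DataFrame as ClassicDF
except ImportError:  # pyspark not installed: use a stub for illustration
    class ClassicDF:
        pass

try:
    # The Spark Connect DataFrame (pyspark >= 3.4) lives in a separate module
    # and does not subclass the classic DataFrame, so it needs its own entry.
    from pyspark.sql.connect.dataframe import DataFrame as ConnectDF
except ImportError:  # Connect extras missing: stub for illustration
    class ConnectDF:
        pass

register(ClassicDF, "pyspark-backend")
register(ConnectDF, "pyspark-backend")  # the registration missing in this issue
```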

@filipeo2-mck
Contributor

Hello!
As I'm also interested in this fix, I'll take this issue and check if I'm able to raise a fix PR for it.

@filipeo2-mck
Contributor

filipeo2-mck commented Aug 2, 2024

I just opened PR #1775.
The changes worked locally: I set up a Spark Connect server and ran some pandera validations successfully on pyspark 3.2 through 3.5.
I'd be happy if someone could test this branch on Databricks :)
