Release v0.13.0 #315

Merged
merged 9 commits into from
Jun 21, 2024

Conversation

fdosani
Member

@fdosani fdosani commented Jun 20, 2024

Release v0.13.0

New

Bug fix

General cleanup and housekeeping

github-actions bot and others added 8 commits May 27, 2024 09:53
Co-authored-by: fdosani <fdosani@users.noreply.github.com>
* spark clean up

* fixing spark session weirdness with parameters
Co-authored-by: fdosani <fdosani@users.noreply.github.com>
* adding in benchmark docs

* Update docs/source/benchmark.rst

Co-authored-by: Jacob Dawang <jdawang@users.noreply.github.com>

---------

Co-authored-by: Jacob Dawang <jdawang@users.noreply.github.com>
Co-authored-by: fdosani <fdosani@users.noreply.github.com>
Co-authored-by: fdosani <fdosani@users.noreply.github.com>
* [WIP] vanilla spark

* [WIP] fixing tests and logic

* [WIP] __index cleanup

* updating pyspark.sql logic and fixing tests

* restructuring spark logic into submodule and typing

* remove pandas 2 restriction for spark sql

* fix for sql call

* updating docs

* updating benchmarks with pyspark dataframe

* relative imports and linting

* relative imports and linting

* feedback from review, switch to monotonic and simplify checks

* allow pyspark.sql.connect.dataframe.DataFrame

* checking version for spark connect

* typo fix

* adding import

* adding connect extras

* adding connect extras
@satniks

satniks commented Jun 21, 2024

Hi @fdosani, does SparkSQLCompare write to the DB while processing the data? When I tried it on Databricks serverless compute for a job, I got the following error:

[NOT_SUPPORTED_WITH_SERVERLESS] PERSIST TABLE is not supported on serverless compute. SQLSTATE: 0A000

DataFrame.persist() is shown in the call stack.

It looks like Databricks serverless compute has lots of limitations.

Anyway, SparkSQLCompare works fine on a normal Databricks compute cluster, so this is a non-issue for us. We may not use Databricks serverless in the near future because of such limitations.

@fdosani
Member Author

fdosani commented Jun 21, 2024

There are a couple of calls to cache, which may be the cause. I'm not sure I want to support serverless, as it seems it would restrict most of the code we have.

@satniks

satniks commented Jun 21, 2024

Right, serverless has lots of restrictions and is not a common use case. The current support is great. Thanks again.

Contributor

@gladysteh99 gladysteh99 left a comment


LGTM!

@fdosani fdosani merged commit d02ac44 into main Jun 21, 2024
58 checks passed
rhaffar pushed a commit to rhaffar/datacompy that referenced this pull request Sep 12, 2024
@mattiazenidb

Hey, I think you should support Databricks Serverless :) The error above occurs because .cache() is not supported, and it is not supported because it is not needed. The goal of Serverless is to provide the best performance out of the box without the user tuning anything.

I tested it and it works; I'll create a pull request to support it.

PS: I work for Databricks

@fdosani
Member Author

fdosani commented Sep 14, 2024

Hey @mattiazenidb I'll happily accept a PR for that. Thanks for helping out here!
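For anyone following along, one possible shape for such a change is sketched below. This is only an illustration, not datacompy's actual code or the eventual PR: `cache_if_supported` is a hypothetical helper name, and it assumes the runtime raises at the `.cache()` call with the `NOT_SUPPORTED_WITH_SERVERLESS` marker in the message, as in the error reported above.

```python
from typing import Any


def cache_if_supported(df: Any) -> Any:
    """Hypothetical helper: cache `df`, or return it unchanged if caching is rejected.

    Assumption: the serverless runtime raises when `.cache()` is called and the
    message contains the NOT_SUPPORTED_WITH_SERVERLESS marker from the thread.
    """
    try:
        # DataFrame.cache() returns the (now cached) DataFrame.
        return df.cache()
    except Exception as exc:  # broad on purpose: the exact exception class is not known here
        if "NOT_SUPPORTED_WITH_SERVERLESS" in str(exc):
            # Caching is an optimization only, so continue with the uncached DataFrame.
            return df
        # Any other failure is unexpected and should surface to the caller.
        raise
```

Existing `df.cache()` calls could then be swapped for `cache_if_supported(df)`, so regular clusters keep the caching behaviour and serverless simply skips it.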

4 participants