Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EWT-1250] Sqlalchemy/Superset expectations from python-DBAPI #17

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

prantogg
Copy link
Contributor

@prantogg prantogg commented Aug 1, 2024

This PR introduces the following requirements that have risen from the Wherobots x Superset integration -

  • SQLAlchemy expects the result set to be a List of Tuples or List of Lists. SQLAlchemy wraps this result set in it's own Row object.
  • Superset requires rollback() and commit() to be implemented. Other OLAP databases such as pyhive simple "pass" the not implemented rollback() and commit() methods. For context - Superset's background processes often bypass the SQLAlchemy dialect and directly interacts with DBAPI. This is why overriding the rollback and commit methods in the Dialect doesn't suffice.
  • Adds Web socket URL parameter, ws_url, to connection. This helps maintain static connection pool configuration in Superset.

@prantogg prantogg marked this pull request as ready for review August 1, 2024 18:44
@prantogg prantogg changed the title Sqlalchemy/Superset expectations from python-DBAPI [EWT-1250] Sqlalchemy/Superset expectations from python-DBAPI Aug 1, 2024
Copy link

wherobots/db/connection.py Show resolved Hide resolved
schema = reader.schema
columns = schema.names
column_types = [field.type for field in schema]
rows = reader.read_all().to_pandas().values.tolist()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need both read_all().to_pandas() or can you read_pandas() directly? (to be fair according to the docs it does the same thing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Using read_pandas() now for better readability.

wherobots/db/cursor.py Show resolved Hide resolved
},
headers=headers,
if ws_url:
session_uri = ws_url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of things here:

  1. This can be done earlier, so if we feel confident in using the ws_url we can do that right after checking the token/api-keys.
  2. This needs a little bit different logging, since we're not requesting a new runtime (see line 70).
  3. This should early return so that we don't have to indent everything afterwards. Makes it more readable.

@prantogg prantogg requested a review from peterfoldes August 1, 2024 23:09
@@ -51,6 +50,7 @@ def connect(
results_format: Union[ResultsFormat, None] = None,
data_compression: Union[DataCompression, None] = None,
geometry_representation: Union[GeometryRepresentation, None] = None,
ws_url: str = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this required, instead of calling connect_direct() directly?

The reason this parameter wasn't included in connect() is that it creates ambiguity between the rest of the parameters (like runtime/region) and the runtime you'd actually connect to when providing a ws_url, which may not match those choices.

query.handler(json.loads(result_bytes.decode("utf-8")))
data = json.loads(result_bytes.decode("utf-8"))
columns = data["columns"]
column_types = data.get("column_types")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is column_types optional? If so, then it's good that you're using data.get() here, but then in Cursor.__get_results you expect column_types to be non-None. You either need to ensure column_types is always provided, or change __get_results to be more defensive.

None, # precision
None, # scale
True, # null_ok; Assuming all columns can accept NULL values
)
for col_name in result.columns
for i, col_name in enumerate(columns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use https://docs.python.org/3/library/functions.html#zip to avoid jumping hoops with an index (it's much nicer to read, and also more efficient):

self.__description = [
  (
    col_name,
    _TYPE_MAP.get(col_type, 'STRING'),
    ...
  )
  for (col_name, col_type) in zip(columns, column_types)
]

@prantogg prantogg marked this pull request as draft August 13, 2024 17:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants