Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STRUCT getitem operator creates unnecessary parens when chained with other types #1148

Open
r1b opened this issue Dec 11, 2024 · 1 comment
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API.

Comments

@r1b
Copy link

r1b commented Dec 11, 2024

Summary

This is illustrated in test__struct.py. Here's a sample failing test that captures how I would expect this to behave:

import sqlalchemy

def test_struct_parens_demo(faux_conn):
    from sqlalchemy_bigquery import ARRAY, STRUCT
    table = sqlalchemy.Table(
        "t",
        sqlalchemy.MetaData(),
        sqlalchemy.Column("person", STRUCT(children=ARRAY(STRUCT(name=sqlalchemy.String)))),
    )
    expr = table.c.person.children[0].name
    expected_sql = f"SELECT `t`.`person`.children[OFFSET(%(param_1:INT64)s)].name AS `anon_1` \nFROM `t`"
    actual_sql = str(sqlalchemy.select(expr).compile(faux_conn.engine))
    assert actual_sql == expected_sql
AssertionError: assert 'SELECT ((`t`...1` \nFROM `t`' == 'SELECT `t`.`...1` \nFROM `t`'
  
  - SELECT `t`.`person`.children[OFFSET(%(param_1:INT64)s)].name AS `anon_1` 
  + SELECT ((`t`.`person`.children)[OFFSET(%(param_1:INT64)s)]).name AS `anon_1` 
  ?        ++                     +                           +
    FROM `t`

I believe this is happening because the STRUCT type is taking on the lowest possible operator precedence in SQLAlchemy, when it actually has the highest operator precedence in BigQuery.

This doesn't create any functional issue, but it does make it hard to read the rendered SQL. For example, I work on a system where we run snapshot tests on the SQL generated by the bigquery dialect. We also often consult the generated SQL in the BigQuery console when debugging.


Investigation

In _struct.py we make SQLAlchemy aware of struct_getitem_op by patching the default comparator to alias json_getitem_op:

sqlalchemy.sql.default_comparator.operator_lookup[
    struct_getitem_op.__name__
] = sqlalchemy.sql.default_comparator.operator_lookup["json_getitem_op"]

But it seems this is not the only mapping for operators. We also have _PRECEDENCE, which determines when parens are necessary.

If the target operator does not exist in _PRECEDENCE, it takes on the lowest possible precedence.

So this would explain why we get unnecessary parens. Note that I'm calling these parens "unnecessary" because I see that all indexed types in BigQuery have the same precedence (the highest).

Proposal

I guess we could also patch _PRECEDENCE, but I don't think that its necessary to patch SQLAlchemy internals to accomplish this. Here's an alternative: route both ARRAY and STRUCT indexing operations to getitem.

I think the intention for the builtin getitem operator is that it can handle any indexing operation, not necessarily just for the builtin ARRAY type. It seems possible to tweak visit_getitem_binary to render based on the type of the LHS of the indexing operation:

     def visit_getitem_binary(self, binary, operator_, **kw):
         left = self.process(binary.left, **kw)
         if binary.left.type._type_affinity is _struct.STRUCT:
             return f"{left}.{binary.right.value}"
         right = self.process(binary.right, **kw)
         return f"{left}[OFFSET({right})]"

Routing indexing operations to getitem also ensures that they have the same precedence, so the parens go away.

I poked around and found a few examples of other dialects checking _type_affinity in visit hooks (just to be sure we're not exchanging one internals reference for another) - here's a sample from mysql.

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. label Dec 11, 2024
@shollyman shollyman assigned Linchin and unassigned shollyman Dec 12, 2024
@r1b
Copy link
Author

r1b commented Dec 12, 2024

This also appears to fix #1117 - this test fails on main with sqlalchemy.exc.UnsupportedCompilationError but passes if I apply the refactor:

diff --git a/tests/unit/test__struct.py b/tests/unit/test__struct.py
index 6e7c7a3..ae98b40 100644
--- a/tests/unit/test__struct.py
+++ b/tests/unit/test__struct.py
@@ -141,6 +141,16 @@ def test_struct_insert_type_info(faux_conn, metadata):
     )
 
 
+def test_struct_group_by(faux_conn):
+    expr = sqlalchemy.select(_col().children[0].name.label("foo")).group_by("foo")
+    got = expr.compile(faux_conn.engine).string
+
+    assert got == (
+        "SELECT `t`.`person`.children[OFFSET(%(param_1:INT64)s)].name AS `foo` \n"
+        "FROM `t` GROUP BY `foo`"
+    )
+
+
 def test_struct_non_string_field_access(faux_conn):
     with pytest.raises(
         TypeError,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API.
Projects
None yet
Development

No branches or pull requests

3 participants