STRUCT getitem operator creates unnecessary parens when chained with other types #1148

r1b · 2024-12-11T23:48:13Z

Summary

This is illustrated in test__struct.py. Here's a sample failing test that captures how I would expect this to behave:

import sqlalchemy

def test_struct_parens_demo(faux_conn):
    from sqlalchemy_bigquery import ARRAY, STRUCT
    table = sqlalchemy.Table(
        "t",
        sqlalchemy.MetaData(),
        sqlalchemy.Column("person", STRUCT(children=ARRAY(STRUCT(name=sqlalchemy.String)))),
    )
    expr = table.c.person.children[0].name
    expected_sql = f"SELECT `t`.`person`.children[OFFSET(%(param_1:INT64)s)].name AS `anon_1` \nFROM `t`"
    actual_sql = str(sqlalchemy.select(expr).compile(faux_conn.engine))
    assert actual_sql == expected_sql

AssertionError: assert 'SELECT ((`t`...1` \nFROM `t`' == 'SELECT `t`.`...1` \nFROM `t`'
  
  - SELECT `t`.`person`.children[OFFSET(%(param_1:INT64)s)].name AS `anon_1` 
  + SELECT ((`t`.`person`.children)[OFFSET(%(param_1:INT64)s)]).name AS `anon_1` 
  ?        ++                     +                           +
    FROM `t`

I believe this is happening because the STRUCT type is taking on the lowest possible operator precedence in SQLAlchemy, when it actually has the highest operator precedence in BigQuery.

This doesn't create any functional issue, but it does make it hard to read the rendered SQL. For example, I work on a system where we run snapshot tests on the SQL generated by the bigquery dialect. We also often consult the generated SQL in the BigQuery console when debugging.

Investigation

In _struct.py we make SQLAlchemy aware of struct_getitem_op by patching the default comparator to alias json_getitem_op:

sqlalchemy.sql.default_comparator.operator_lookup[
    struct_getitem_op.__name__
] = sqlalchemy.sql.default_comparator.operator_lookup["json_getitem_op"]

But it seems this is not the only mapping for operators. We also have _PRECEDENCE, which determines when parens are necessary.

If the target operator does not exist in _PRECEDENCE, it takes on the lowest possible precedence.

So this would explain why we get unnecessary parens. Note that I'm calling these parens "unnecessary" because I see that all indexed types in BigQuery have the same precedence (the highest).

Proposal

I guess we could also patch _PRECEDENCE, but I don't think that its necessary to patch SQLAlchemy internals to accomplish this. Here's an alternative: route both ARRAY and STRUCT indexing operations to getitem.

I think the intention for the builtin getitem operator is that it can handle any indexing operation, not necessarily just for the builtin ARRAY type. It seems possible to tweak visit_getitem_binary to render based on the type of the LHS of the indexing operation:

     def visit_getitem_binary(self, binary, operator_, **kw):
         left = self.process(binary.left, **kw)
         if binary.left.type._type_affinity is _struct.STRUCT:
             return f"{left}.{binary.right.value}"
         right = self.process(binary.right, **kw)
         return f"{left}[OFFSET({right})]"

Routing indexing operations to getitem also ensures that they have the same precedence, so the parens go away.

I poked around and found a few examples of other dialects checking _type_affinity in visit hooks (just to be sure we're not exchanging one internals reference for another) - here's a sample from mysql.

The text was updated successfully, but these errors were encountered:

r1b · 2024-12-12T23:14:06Z

This also appears to fix #1117 - this test fails on main with sqlalchemy.exc.UnsupportedCompilationError but passes if I apply the refactor:

diff --git a/tests/unit/test__struct.py b/tests/unit/test__struct.py
index 6e7c7a3..ae98b40 100644
--- a/tests/unit/test__struct.py
+++ b/tests/unit/test__struct.py
@@ -141,6 +141,16 @@ def test_struct_insert_type_info(faux_conn, metadata):
     )
 
 
+def test_struct_group_by(faux_conn):
+    expr = sqlalchemy.select(_col().children[0].name.label("foo")).group_by("foo")
+    got = expr.compile(faux_conn.engine).string
+
+    assert got == (
+        "SELECT `t`.`person`.children[OFFSET(%(param_1:INT64)s)].name AS `foo` \n"
+        "FROM `t` GROUP BY `foo`"
+    )
+
+
 def test_struct_non_string_field_access(faux_conn):
     with pytest.raises(
         TypeError,

product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. label Dec 11, 2024

blunderbuss-gcf bot assigned shollyman Dec 11, 2024

shollyman assigned Linchin and unassigned shollyman Dec 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

STRUCT getitem operator creates unnecessary parens when chained with other types #1148

STRUCT getitem operator creates unnecessary parens when chained with other types #1148

r1b commented Dec 11, 2024

r1b commented Dec 12, 2024

STRUCT getitem operator creates unnecessary parens when chained with other types #1148

STRUCT getitem operator creates unnecessary parens when chained with other types #1148

Comments

r1b commented Dec 11, 2024

r1b commented Dec 12, 2024