[BUG][Spark] A merge executed with a generated column requires the source to have the generated column #3318

tigerhawkvok · 2024-06-27T19:53:16Z

Consider this code merging data with a generated column:

# Databricks notebook source
tableName = "TARGET_SCHEMA.generatedTableTest"

# COMMAND ----------

spark.sql(f"DROP TABLE IF EXISTS {tableName}")

# COMMAND ----------

import warnings
from pyspark import pandas as ps
from pyspark.pandas.utils import PandasAPIOnSparkAdviceWarning
warnings.simplefilter("ignore", category= PandasAPIOnSparkAdviceWarning)

# COMMAND ----------

df = ps.DataFrame({"foo": [1,2,3,4,5], "bar":[6,7,8,9,0]})
df.display()

# COMMAND ----------

from delta.tables import DeltaTable
from pyspark.sql.types import LongType

# Build the target table: the source schema (foo, bar) plus a generated column baz.
dTableBuilder = DeltaTable.create(spark).tableName(tableName)
dTableBuilder.addColumns(df.to_spark().schema)
dTableBuilder.addColumn("baz", LongType(), generatedAlwaysAs="foo + bar")
dTable = dTableBuilder.execute()


# COMMAND ----------

mergeBuilder = dTable.merge(df.to_spark(), condition= "1 = 1").whenMatchedUpdateAll().whenNotMatchedInsertAll()

# COMMAND ----------

try:
    mergeBuilder.execute()
except Exception as e:
    print("***We raised an error! As of 20240627 this will say 'baz' is missing***\n\n")
    print(e)

Observed results

The merge fails because it cannot resolve the generated column in the source. The error will be the following (or close to it):

[DELTA_MERGE_UNRESOLVED_EXPRESSION] Cannot resolve baz in UPDATE clause given columns foo, bar.

Expected results

The generated column is, well, generated from the inputs and as such is unnecessary to specify.

Environment information

Delta Lake version: DBR 14.3LTS
Spark version: 3.5.0
Scala version: 2.12

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

Yes. I can contribute a fix for this bug independently.
Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
No. I cannot contribute a bug fix at this time.


tigerhawkvok · 2024-06-27T21:51:04Z

You can work around this by replacing the "All" functions with their explicit counterparts and enumerating every non-generated column, but that kind of misses the point of both those functions and generated columns IMO. If a column is missing from the source and the corresponding target column is generated, it should be skipped during validation.
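A minimal sketch of that workaround, assuming the table from the repro above (columns foo and bar, generated column baz). The helper name `non_generated_assignments` and the aliases `source` and `target` are made up for illustration; `whenMatchedUpdate(set=...)` and `whenNotMatchedInsert(values=...)` are the explicit counterparts of the "All" functions in the Delta Python API. The merge call itself is left commented out since it needs a live Spark session.

```python
def non_generated_assignments(target_columns, generated_columns, source_alias="source"):
    """Map every non-generated target column to the same-named source column.

    Hypothetical helper: it only builds the dict that the explicit merge
    clauses expect, leaving generated columns out entirely.
    """
    skipped = set(generated_columns)
    return {
        column: f"{source_alias}.{column}"
        for column in target_columns
        if column not in skipped
    }


# Columns of the target table in the repro; baz is the generated one.
assignments = non_generated_assignments(["foo", "bar", "baz"], ["baz"])
print(assignments)

# With a live session, the merge from the repro would then become:
# (dTable.alias("target")
#     .merge(df.to_spark().alias("source"), condition="1 = 1")
#     .whenMatchedUpdate(set=assignments)
#     .whenNotMatchedInsert(values=assignments)
#     .execute())
```

This keeps the column list in one place, but it still has to be maintained by hand (or derived from the table schema) whenever the table changes, which is the overhead the "All" functions are meant to avoid.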

I think (not actually knowing Scala) that

https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/deltaMerge.scala#L145

and

line 124 of delta/spark/src/main/scala/org/apache/spark/sql/delta/ResolveDeltaMergeInto.scala (at commit b7da7f4):

val resolvedActions: Seq[DeltaMergeAction] = clause.actions.flatMap { action =>

can escape the for-each assertion in those cases.
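To make the requested behaviour concrete, here is a rough sketch in plain Python of the rule being proposed. It is not Delta's implementation and does not mirror the Scala code linked above; the function name `resolve_all_columns` and the `source.` prefix are assumptions made purely for illustration.

```python
def resolve_all_columns(target_columns, source_columns, generated_columns):
    """Illustrative model of the proposed rule for the "All" clauses.

    - A target column present in the source is assigned from the source.
    - A target column missing from the source is skipped if it is generated.
    - Any other missing column is still reported as unresolved.
    """
    source = set(source_columns)
    generated = set(generated_columns)
    assignments = {}
    unresolved = []
    for column in target_columns:
        if column in source:
            assignments[column] = f"source.{column}"
        elif column in generated:
            continue  # left for the table to compute from its expression
        else:
            unresolved.append(column)
    if unresolved:
        raise ValueError(
            f"Cannot resolve {', '.join(unresolved)} given columns "
            f"{', '.join(source_columns)}"
        )
    return assignments


# The repro case: baz is generated and absent from the source, so it is skipped.
resolved = resolve_all_columns(["foo", "bar", "baz"], ["foo", "bar"], ["baz"])
print(resolved)
```

Under this rule a source that does supply the generated column is still assigned as usual, and a missing non-generated column keeps failing, so only the case described in this issue changes.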

tigerhawkvok added the "bug" label on Jun 27, 2024
