
OSS Delta > Schema Evolution Tutorial Update #336

Open
vivianzhengdatabricks opened this issue Sep 1, 2023 · 1 comment

Comments

@vivianzhengdatabricks

In this issue, we will perform two updates:
1. Add supported and unsupported use cases to the schema evolution tutorial for OSS Delta.
2. Merge the individual code blocks so readers can replicate the schema evolution example with a single copy and paste.

1. Adding supported and unsupported use cases to the schema evolution tutorial for OSS Delta

If you scroll down to just before the "How is Schema Evolution Useful" section of https://www.databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html, you will see the supported and unsupported use cases. Of course, that post is written for Databricks, so it may be worth checking with engineers whether the behavior is the same for OSS Delta, just to err on the side of caution.

"The following types of schema changes are eligible for schema evolution during table appends or overwrites:
Adding new columns (this is the most common scenario)
Changing of data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType
Other changes, which are not eligible for schema evolution, require that the schema and data are overwritten by adding .option("overwriteSchema", "true"). For example, in the case where the column "Foo" was originally an integer data type and the new schema would be a string data type, then all of the Parquet (data) files would need to be re-written. Those changes include:
Dropping a column
Changing an existing column's data type (in place)
Renaming column names that differ only by case (e.g. "Foo" and "foo")
Finally, with the upcoming release of Spark 3.0, explicit DDL (using ALTER TABLE) will be fully supported, allowing users to perform the following actions on table schemas:
Adding columns
Changing column comments
Setting table properties that define the behavior of the table, such as setting the retention duration of the transaction log
"

2. A personal stylistic preference: when illustrating a single example, merge all the code blocks into one and turn the surrounding prose into inline comments, so that users only need to copy and paste once to replicate the example. For instance, I would put all of the following into one block, interlaced with your comments:

df = spark.createDataFrame([("bob", 47), ("li", 23), ("leonard", 51)]).toDF(
    "first_name", "age"
)  # create the initial table with first_name and age columns
df.write.format("delta").save("tmp/fun_people")

df = spark.createDataFrame([("frank", 68, "usa"), ("jordana", 26, "brasil")]).toDF(
    "first_name", "age", "country"
)  # this DataFrame has an extra country column
df.write.format("delta").mode("append").save("tmp/fun_people")  # fails schema enforcement with an AnalysisException

df.write.option("mergeSchema", "true").mode("append").format("delta").save(
    "tmp/fun_people"
)  # mergeSchema evolves the table schema, so this append succeeds

df = spark.createDataFrame([("bob", 47), ("li", 23), ("leonard", 51)]).toDF(
    "first_name", "age"
)  # a DataFrame that is missing the new country column
df.write.format("delta").mode("append").save("tmp/fun_people")  # works; country is filled with null
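
For completeness, a final read could confirm the end state (rows appended without the country column should show null):

spark.read.format("delta").load("tmp/fun_people").show()  # country is null for rows written without it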
@MrPowers (Collaborator)

Thanks for the recommendations!

I agree that we should make the blog post more detailed. Open to PRs here.

I don't think we should merge the code snippets into a large code block. Readers generally don't like that because it makes the post harder to parse.
