
OSS Delta > Schema Evolution Tutorial Update #336

Open
vivianzhengdatabricks opened this issue Sep 1, 2023 · 1 comment

Comments

@vivianzhengdatabricks

In this issue, we will perform two updates:
1. Add supported and unsupported use cases to the schema evolution tutorial for OSS Delta.
2. Merge the individual code blocks so readers can replicate the schema evolution example with a single copy and paste.

1. Adding supported and unsupported use cases to the schema evolution tutorial for OSS Delta

If you scroll down to just before the "How is Schema Evolution Useful" section of https://www.databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html, you will see the supported and unsupported use cases. Of course, that post is written for Databricks, so it may be worth checking with engineers whether the behavior is the same for OSS Delta, just to err on the side of caution.

"The following types of schema changes are eligible for schema evolution during table appends or overwrites:
Adding new columns (this is the most common scenario)
Changing of data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType
Other changes, which are not eligible for schema evolution, require that the schema and data are overwritten by adding .option("overwriteSchema", "true"). For example, in the case where the column "Foo" was originally an integer data type and the new schema would be a string data type, then all of the Parquet (data) files would need to be re-written. Those changes include:
Dropping a column
Changing an existing column's data type (in place)
Renaming column names that differ only by case (e.g. "Foo" and "foo")
Finally, with the upcoming release of Spark 3.0, explicit DDL (using ALTER TABLE) will be fully supported, allowing users to perform the following actions on table schemas:
Adding columns
Changing column comments
Setting table properties that define the behavior of the table, such as setting the retention duration of the transaction log
"

2. A personal stylistic preference: when illustrating a single example, merge all the code blocks into one and turn the surrounding prose into inline comments, so that users only need to copy and paste once to replicate the example. For instance, I would put all of the following into one block, interlaced with your comments:

df = spark.createDataFrame([("bob", 47), ("li", 23), ("leonard", 51)]).toDF(
    "first_name", "age"
)  # create the initial table with first_name and age columns
df.write.format("delta").save("tmp/fun_people")

df = spark.createDataFrame([("frank", 68, "usa"), ("jordana", 26, "brasil")]).toDF(
    "first_name", "age", "country"
)  # this DataFrame has an extra country column
df.write.format("delta").mode("append").save("tmp/fun_people")  # fails schema enforcement with an AnalysisException

df.write.option("mergeSchema", "true").mode("append").format("delta").save(
    "tmp/fun_people"
)  # mergeSchema evolves the table schema, so this append succeeds

df = spark.createDataFrame([("bob", 47), ("li", 23), ("leonard", 51)]).toDF(
    "first_name", "age"
)  # a DataFrame that is missing the new country column
df.write.format("delta").mode("append").save("tmp/fun_people")  # works; country is filled with null
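
For completeness, a final read could confirm the end state (rows appended without the country column should show null):

spark.read.format("delta").load("tmp/fun_people").show()  # country is null for rows written without it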
@MrPowers (Collaborator)

Thanks for the recommendations!

I agree that we should make the blog post more detailed. Open to PRs here.

I don't think we should merge the code snippets into a large code block. Readers generally don't like that because it makes the post harder to parse.
