docs: add drop columns usage page #2561

Closed
wants to merge 2 commits into from
4 changes: 4 additions & 0 deletions docs/usage/deleting-rows-from-delta-lake-table.md
@@ -32,3 +32,7 @@ Here are the contents of the Delta table after the delete operation has been per
| 2 | b |
+-------+----------+
```

`dt.delete()` accepts a predicate written as a SQL `WHERE` condition. If no predicate is provided, all rows are deleted.

Read more in the [API docs](https://delta-io.github.io/delta-rs/api/delta_table/#deltalake.DeltaTable.delete).
89 changes: 89 additions & 0 deletions docs/usage/droppings-columns.md
@@ -0,0 +1,89 @@
# Dropping columns from a Delta Lake table

This section explains how to drop columns from an existing Delta table.

## Delta Lake drop columns

You can drop columns from a Delta Lake table using your query engine and an `overwrite` operation.

Let's look at an example with `pandas`.

Suppose you have a dataset with two columns:

```
> import pandas as pd

> # create a toy dataframe
> data = {
>     'first_name': ['bob', 'li', 'leah'],
>     'age': [47, 23, 51],
> }
> df = pd.DataFrame.from_dict(data)

> df

  first_name  age
0        bob   47
1         li   23
2       leah   51
```

You then store this data as a Delta table:

```
from deltalake import write_deltalake

# write to deltalake
write_deltalake("tmp/delta-table", df)
```

Now you want to remove one of the columns.

You can do this by reading the data back in with pandas and using the pandas `.drop()` method to drop the column:

```
from deltalake import DeltaTable

# read delta table with pandas
df = DeltaTable("tmp/delta-table").to_pandas()

# drop column
df = df.drop(columns=["age"])
```

Next, perform an `overwrite` operation to drop the column from your Delta table.

You will need to specify `mode="overwrite"` as well as `schema_mode="overwrite"` because you will be changing the table schema:

```
write_deltalake(
    "tmp/delta-table",
    df,
    mode="overwrite",
    schema_mode="overwrite",
)
```

Read the data back in to confirm that the column has been dropped:

```
> DeltaTable("tmp/delta-table").to_pandas()

  first_name
0        bob
1         li
2       leah
```

You can still time travel back to earlier versions, which retain the dropped column, using the `version` option:

```
> DeltaTable("tmp/delta-table", version=0).to_pandas()

  first_name  age
0        bob   47
1         li   23
2       leah   51
```

## Logical vs Physical Operations

Dropping columns is a _logical operation_. You are not physically deleting the columns.

When you drop a column this way, Delta records a new commit in the transaction log, and queries against the latest version no longer see the dropped column. The data files written by earlier versions remain on disk, so you can still time travel back to them.

If you need to physically delete the data from storage, you will have to run a [vacuum](https://delta-io.github.io/delta-rs/usage/managing-tables/#vacuuming-tables) operation. Vacuuming removes files that the current table version no longer references, so time travel to versions that depend on those files may stop working.
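
The distinction between logical and physical removal can be illustrated with a toy model. The sketch below is not the Delta Lake protocol or the delta-rs implementation: `ToyTable` and its methods are hypothetical names, it uses only the Python standard library, and its `vacuum` ignores the retention period that a real vacuum applies. It only shows why old versions stay readable until unreferenced files are deleted.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy versioned table (hypothetical, for illustration only)."""

    files: dict = field(default_factory=dict)  # file name -> rows "on disk"
    log: list = field(default_factory=list)    # one entry per version: live file names

    def overwrite(self, rows):
        # Write a new file and commit a version that references only that file.
        # Older files are left on disk untouched.
        name = f"part-{len(self.log)}.data"
        self.files[name] = rows
        self.log.append([name])

    def read(self, version=None):
        # Read the files referenced by the requested (default: latest) version.
        version = len(self.log) - 1 if version is None else version
        rows = []
        for name in self.log[version]:
            if name not in self.files:
                raise FileNotFoundError(name)
            rows.extend(self.files[name])
        return rows

    def vacuum(self):
        # Physically delete every file the latest version does not reference.
        live = set(self.log[-1])
        for name in list(self.files):
            if name not in live:
                del self.files[name]


table = ToyTable()
table.overwrite([{"first_name": "bob", "age": 47}, {"first_name": "li", "age": 23}])
table.overwrite([{"first_name": "bob"}, {"first_name": "li"}])  # "drop" the age column

latest_rows = table.read()             # rows without the age column
old_rows = table.read(version=0)       # time travel still works before vacuum

table.vacuum()
try:
    table.read(version=0)
    time_travel_works = True
except FileNotFoundError:
    time_travel_works = False          # the old file is physically gone
```

In this model the overwrite only changes which files the latest version points to, while `vacuum` is the step that actually frees storage and makes version 0 unreadable.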