Feat: Implement data lifecycle policies #536

Closed
bmtcril opened this issue Dec 4, 2023 · 12 comments · Fixed by openedx/openedx-aspects#216
Assignees
Labels
enhancement Relates to new features or improvements to existing features epic Large unit of work, consisting of multiple tasks

Comments

@bmtcril
Contributor

bmtcril commented Dec 4, 2023

We would like to make data lifecycle management a default part of Aspects. Currently we are unable to gracefully age off old data, which creates potential compliance issues and makes it difficult to follow data best practices. This is primarily because time-series tables are not partitioned, so ClickHouse deletes force a complete rewrite of the table. To enable lifecycle management we need to do the following:

  • Partition all time-series tables by year-month
  • Create a configurable list of tables that are managed in the lifecycle, defaulting to all of our time-series tables
    • These tables must all be partitioned by year-month
  • Create a command to drop old partitions with a configurable number of months to keep
  • Document this work, including an ADR on this topic and a guide to how to run the command
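As a rough sketch of what the "drop old partitions" command could compute, the snippet below picks the year-month partitions that fall outside a retention window and builds the matching ClickHouse `ALTER TABLE ... DROP PARTITION` statements. It assumes the tables are partitioned by `toYYYYMM(...)`, so partition values look like `202403`, and that "months to keep" counts the current month. The function names and the `xapi_events` table name are hypothetical, not part of Aspects.

```python
from datetime import date


def partitions_to_drop(existing, months_to_keep, today):
    """Return YYYYMM partition values older than the retention window.

    `existing` is an iterable of ints such as 202403. The current month
    counts as one of the kept months (an assumption of this sketch).
    """
    # Index months linearly so the subtraction handles year boundaries.
    current = today.year * 12 + (today.month - 1)
    oldest_kept = current - months_to_keep + 1
    cutoff_id = (oldest_kept // 12) * 100 + (oldest_kept % 12) + 1
    return sorted(p for p in existing if p < cutoff_id)


def drop_statements(table, partitions):
    """Build one DROP PARTITION statement per expired partition."""
    return [f"ALTER TABLE {table} DROP PARTITION {p}" for p in partitions]


if __name__ == "__main__":
    existing = [202310, 202311, 202312, 202401, 202402, 202403]
    expired = partitions_to_drop(existing, 3, date(2024, 3, 15))
    for statement in drop_statements("xapi_events", expired):  # hypothetical table
        print(statement)
```

Dropping a whole partition only removes that partition's parts, which avoids the full-table rewrite that row-level deletes cause on an unpartitioned table.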
@bmtcril bmtcril added epic Large unit of work, consisting of multiple tasks aspects v1 labels Dec 4, 2023
@bmtcril bmtcril added the enhancement Relates to new features or improvements to existing features label Jan 12, 2024
@bmtcril bmtcril changed the title Implement data lifecycle policies Feat: Implement data lifecycle policies Jan 12, 2024
@Ian2012 Ian2012 self-assigned this Jan 26, 2024
@Ian2012
Contributor

Ian2012 commented Jan 29, 2024

This handles the first part: openedx/aspects-dbt#42. However, I'm not sure whether the main xapi table should be partitioned too.

@bmtcril

@bmtcril
Contributor Author

bmtcril commented Jan 29, 2024

Yeah, especially that table should be on the list. It'll have to be another Alembic migration on that table, sadly.

@Ian2012
Contributor

Ian2012 commented Jan 29, 2024

Should we also implement some clean-up workflow on the event_sink tables?

@bmtcril
Contributor Author

bmtcril commented Jan 29, 2024

Hmm, I suppose so, though I think it would make sense to have that be separate from the user data. For example, if you want to keep user data for 2 years and course data for 5, I think we should support that. Maybe a setting per table, or some kind of table grouping. We should remember the vector tables as well.
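One way the "setting per table, or some kind of table grouping" idea could look is sketched below: groups carry a retention period, and an optional per-table override wins over the group. Everything here is hypothetical (the group names, table names, and the `retention_months` helper); it only illustrates the lookup order, not an actual Aspects setting.

```python
# Hypothetical retention config: months to keep, per group of tables.
RETENTION_GROUPS = {
    "user_data": {"months": 24, "tables": ["xapi_events", "vector_logs"]},
    "course_data": {"months": 60, "tables": ["course_overviews"]},
}

# Hypothetical per-table overrides that take precedence over the group.
TABLE_OVERRIDES = {"vector_logs": 6}


def retention_months(table, groups, overrides=None, default=None):
    """Resolve how many months of data to keep for `table`.

    Lookup order: per-table override, then group membership, then `default`.
    A result of None means the table is not lifecycle-managed.
    """
    overrides = overrides or {}
    if table in overrides:
        return overrides[table]
    for group in groups.values():
        if table in group["tables"]:
            return group["months"]
    return default


if __name__ == "__main__":
    for name in ["xapi_events", "vector_logs", "course_overviews", "unmanaged"]:
        print(name, retention_months(name, RETENTION_GROUPS, TABLE_OVERRIDES))
```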

@pomegranited
Contributor

@bmtcril @Ian2012

> I'm not sure if the main xapi table should be partitioned too?

> Yeah, especially that table should be on the list. It'll have to be another Alembic migration on that table, sadly.

I said in the ADR that we'd be putting partitions into dbt instead of using Alembic. Was I incorrect?

@bmtcril
Contributor Author

bmtcril commented Feb 1, 2024

@pomegranited it needs to be both since some tables aren't able to be managed in dbt (yet, we're working on it).

@Ian2012
Contributor

Ian2012 commented Feb 29, 2024

The approach recommended by Altinity is to use TTL with the setting ttl_only_drop_parts to improve performance. We will continue with this approach.
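For illustration, the TTL approach could translate into ClickHouse statements like the ones generated below. With `ttl_only_drop_parts = 1`, ClickHouse drops a data part only once every row in it has expired instead of rewriting parts to remove expired rows, which pairs well with year-month partitions. This is a sketch under assumptions: the helper name, the `xapi_events` table, and the `emission_time` column are hypothetical, and the real change in Aspects may be applied differently (for example through dbt model configuration).

```python
def ttl_statements(table, time_column, months_to_keep):
    """Build ALTER TABLE statements that enable whole-part TTL expiry.

    Returns the statements in the order they would be run: first the
    MergeTree setting, then the table-level TTL expression.
    """
    if months_to_keep < 1:
        raise ValueError("months_to_keep must be at least 1")
    return [
        # Only drop parts whose rows have all expired; never rewrite parts.
        f"ALTER TABLE {table} MODIFY SETTING ttl_only_drop_parts = 1",
        # Rows expire once the time column is older than the retention window.
        f"ALTER TABLE {table} MODIFY TTL "
        f"toDateTime({time_column}) + INTERVAL {months_to_keep} MONTH",
    ]


if __name__ == "__main__":
    # Hypothetical table and column names.
    for statement in ttl_statements("xapi_events", "emission_time", 24):
        print(statement)
```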

@Ian2012 Ian2012 moved this to Backlog in Data Working Group Mar 4, 2024
@Ian2012 Ian2012 moved this from Backlog to Blocked in Data Working Group Mar 20, 2024
@bmtcril
Contributor Author

bmtcril commented Mar 26, 2024

@Ian2012 is pushing this forward in dbt-clickhouse here: ClickHouse/dbt-clickhouse#254

@bmtcril
Contributor Author

bmtcril commented Apr 18, 2024

@Ian2012 can we consider this done now?

@Ian2012
Contributor

Ian2012 commented Apr 18, 2024

Yes, it is.

@Ian2012 Ian2012 closed this as completed Apr 18, 2024
@github-project-automation github-project-automation bot moved this from Blocked to Done in Data Working Group Apr 18, 2024
@bmtcril
Contributor Author

bmtcril commented Apr 19, 2024

Actually I need to reopen this since we didn't get to the documentation. @Ian2012 do you think you can write up a "concepts" doc in openedx-aspects describing the TTL, how it works, and the setting options?

@bmtcril bmtcril reopened this Apr 19, 2024
@Ian2012
Contributor

Ian2012 commented Apr 19, 2024

Sure


3 participants