Feature/spark expectations fixes #117
base: main
Conversation
…olumn to stats_detailed table for improved observability. Fix: Addressed duplicate check issues in row_dq for improved data quality. Feature: Added new column to stats_detailed table for enhanced observability.
…d table Improves visibility and aids in quick resolution of data quality issues.
@@ -439,6 +421,7 @@ def _prep_detailed_stats(
"table_name",
"rule_type",
"rule",
"column_name",
Are we making this column mandatory, and is this going to be a breaking change for existing pipelines? I think it would be better to make it optional for the next release and then make it mandatory later, as folks adopt it.
Yes. For now I'm reverting the additional column changes and going forward with only the duplicate check fixes.
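The reviewer's suggestion of an optional column could look roughly like the sketch below. This is not the code in writer.py: `build_stats_row` and the `include_column_name` flag are hypothetical names, and only the field names `table_name`, `rule_type`, `rule`, and `column_name` come from the diff above.

```python
from typing import Dict, Optional


def build_stats_row(
    table_name: str,
    rule_type: str,
    rule: str,
    column_name: Optional[str] = None,
    include_column_name: bool = False,  # hypothetical opt-in flag
) -> Dict[str, Optional[str]]:
    """Build one detailed-stats record as a plain dict.

    The new field is only emitted when the caller opts in, so existing
    consumers keep seeing the old layout until they are ready to migrate.
    """
    row: Dict[str, Optional[str]] = {
        "table_name": table_name,
        "rule_type": rule_type,
        "rule": rule,
    }
    if include_column_name:
        row["column_name"] = column_name
    return row


# Old behaviour: the layout is unchanged for existing pipelines.
legacy_row = build_stats_row("orders", "row_dq", "order_id is not null")

# Opt-in behaviour: the extra field is present.
new_row = build_stats_row(
    "orders",
    "row_dq",
    "order_id is not null",
    column_name="order_id",
    include_column_name=True,
)
```

Defaulting the flag to off keeps the change non-breaking; the default could be flipped in a later release once pipelines have adopted the new column.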
README.md
Outdated
@@ -60,6 +60,18 @@ Please find the spark-expectations flow and feature diagrams below
<img src=https://github.com/Nike-Inc/spark-expectations/blob/main/docs/se_diagrams/features.png?raw=true width=1000></p>

# Observability Enhancement: Additional Column for Data Quality Metrics
Please move this documentation into the docs folder so that it is reflected in the SE documentation.
Keeping the README file as it is, since the column name changes are being reverted.
I will update the documentation in a future release.
This fix addresses the duplicate check logic in row_dq, which is not functioning as intended. Even when the expectations are implemented as outlined in the Spark Expectations Nike Info Page, the output does not match the anticipated results, which indicates an inconsistency in the validation process. We had also added a new column to stats_detailed for observability, so that error records can be retrieved per column.
Description
We added the column name in writer.py and action.py so that it appears in the stats_detailed table as well as in the query_dq_output table. We also changed writer.py so that the duplicate check is corrected and the correct error records are written to the stats_detailed table.
Related Issue
Link to the issue below
https://github.com/Nike-Inc/spark-expectations/issues/116
Motivation and Context
This fixes the duplicate (uniqueness) check for DQ rules. Adding a column also provides column-level error records, which we can build on for observability.
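As a rough illustration of what a row-level uniqueness check is expected to report, here is a minimal plain-Python sketch. It is not the Spark Expectations implementation: `flag_duplicates` is a hypothetical helper, and treating every copy of a repeated key as an error record (rather than only the extra copies) is an assumption made for the example.

```python
from collections import Counter
from typing import Dict, Hashable, List


def flag_duplicates(rows: List[Dict[str, Hashable]], key: str) -> List[bool]:
    """Return one flag per row: True if the row's key value occurs more than once."""
    counts = Counter(row[key] for row in rows)
    return [counts[row[key]] > 1 for row in rows]


rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}]
flags = flag_duplicates(rows, "id")

# Number of rows that fail the uniqueness rule; a per-rule count like this
# is the kind of figure a detailed stats table would need to record.
error_count = sum(flags)
```

With this input, the two rows sharing `id` 2 are flagged and the other two pass, so the error count for the rule is 2.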
How Has This Been Tested?
This has been tested rigorously. This version of Spark Expectations is currently running in various environments in Nike PSSP pipelines integrated with observability features such as alerts and dashboards.
We have also tested it locally with all the possible combinations of rules and datasets.
It also passed all 400 unit test cases using [make cov and make test], which indicates that these changes do not break anything else in the code.
Screenshots (if appropriate):
Screenshot of the fixed duplicate check result described in issue 116
Screenshots of all the test cases passing [make cov]
Screenshots of [make test]
Types of changes
Checklist: