Feature/spark expectations fixes #117

Open
wants to merge 16 commits into base: main

@sudeep7978 sudeep7978 commented Nov 18, 2024

This fix addresses the duplicate check logic in row_dq, which is not functioning as intended. Even when the expectations are implemented as outlined in the Spark Expectations Nike Info Page, the output does not match the anticipated results, which indicates an inconsistency in the validation process. We have also added a new column to stats_detailed for observability, so that error records can be reported per column.

Description

We added the column name in writer.py and action.py so that it appears in the stats_detailed table as well as in the query_dq_output table. We also changed writer.py so that it corrects the duplicate check and writes the correct error records to the stats_detailed table.
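For readers unfamiliar with what a row-level uniqueness check counts, here is a minimal sketch in plain Python. It is not the code from writer.py; the function name `flag_duplicates` and the sample rows are hypothetical, and it assumes one common convention, namely that every row sharing a key with another row is counted as an error record.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def flag_duplicates(
    rows: Sequence[Dict[str, Hashable]], key_columns: Sequence[str]
) -> List[bool]:
    """Return one flag per row: True when the row's key occurs more than once.

    Hypothetical helper for illustration only; not part of spark-expectations.
    """
    # Build the uniqueness key for each row from the chosen columns.
    keys = [tuple(row[col] for col in key_columns) for row in rows]
    # Count how often each key appears across the whole dataset.
    counts = Counter(keys)
    # A row is an error record when its key is shared with another row.
    return [counts[key] > 1 for key in keys]


# Made-up sample data: two rows share order_id 1.
rows = [
    {"order_id": 1, "sku": "A"},
    {"order_id": 2, "sku": "B"},
    {"order_id": 1, "sku": "C"},
]
flags = flag_duplicates(rows, ["order_id"])
error_count = sum(flags)
```

Under this convention the error count is the number of rows involved in a duplicate, not the number of distinct duplicated keys; which of the two a given pipeline expects is worth stating explicitly in the rule definition.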

Related Issue

Link to the issue below
https://github.com/Nike-Inc/spark-expectations/issues/116

Motivation and Context

This fixes the duplicate check (the uniqueness issue) in the DQ rules. Adding a column also makes column-level error records available, which we can build on for observability.

How Has This Been Tested?

This has been tested rigorously. This version of Spark Expectations is currently running in various environments in Nike PSSP pipelines that are integrated with observability features such as alerts and dashboards.
We also tested it locally with all the possible combinations of rules and datasets.
It passed all 400 unit test cases using [make cov and make test], which indicates that these changes do not break anything else in the code.

Screenshots (if appropriate):

Screenshot of the fixed duplicate check result described in issue 116
Screenshot 2024-11-19 at 12 28 28 AM
Screenshots for all the test cases passing [make cov]
Screenshot 2024-11-19 at 12 37 58 AM
[make test] screenshots
Screenshot 2024-11-19 at 12 44 27 AM

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

…olumn to stats_detailed table for improved observability.

Fix: Addressed duplicate check issues in row_dq for improved data quality.
Feature: Added new column to stats_detailed table for enhanced observability.
…d table

Improves visibility and aids in quick resolution of data quality issues.
@@ -439,6 +421,7 @@ def _prep_detailed_stats(
"table_name",
"rule_type",
"rule",
"column_name",
Collaborator

Are we making this column mandatory, and is this going to be a breaking change for existing pipelines? I think it would be better to make this optional for the next release and then make it mandatory later as folks adopt it.

Author

Yes, for now I'm reverting the additional column changes and going forward with the duplicate check fixes.
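If the column is reintroduced later, the opt-in approach suggested in this thread could look roughly like the sketch below. This is only an illustration: the function `detailed_stats_columns` and the flag `include_column_name` are hypothetical names, not part of spark-expectations, and the base column list contains only the three names visible in the diff hunk above.

```python
from typing import List

# Column names visible in the diff hunk; the real list in writer.py is longer.
BASE_COLUMNS = ["table_name", "rule_type", "rule"]


def detailed_stats_columns(include_column_name: bool = False) -> List[str]:
    """Return the detailed-stats column list, adding column_name only on request.

    Hypothetical helper: defaulting the flag to False keeps the existing
    schema unchanged for pipelines that have not opted in.
    """
    columns = list(BASE_COLUMNS)  # copy so the module-level list is never mutated
    if include_column_name:
        columns.append("column_name")
    return columns


default_columns = detailed_stats_columns()
opted_in_columns = detailed_stats_columns(include_column_name=True)
```

Keeping the default off means existing stats_detailed tables would see no schema change until a pipeline opts in, which is the gradual adoption path the reviewer describes.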

README.md Outdated
@@ -60,6 +60,18 @@ Please find the spark-expectations flow and feature diagrams below
<img src=https://github.com/Nike-Inc/spark-expectations/blob/main/docs/se_diagrams/features.png?raw=true width=1000></p>


# Observability Enhancement: Additional Column for Data Quality Metrics
Collaborator

Please move this documentation into the docs folder so that it is reflected in the SE documentation.

Author

Keeping the README file as it is, since I am reverting the column name changes.
I will update the documentation in a future release.
