Feature/spark expectations fixes #117

sudeep7978 · 2024-11-18T19:20:50Z

This PR fixes the duplicate-check logic in row_dq, which was not functioning as intended: even with the expectations configured as described on the Spark Expectations Nike Info Page, the output did not match the expected results. It also adds a new column to stats_detailed for observability, so that error records can be reported per column.

Description

We added the column name in writer.py and action.py so that it appears in both the stats_detailed table and the query_dq_output table. We also changed writer.py so that the duplicate check is corrected and the right error records are written to the stats_detailed table.
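
For readers unfamiliar with the uniqueness rule under discussion, the following is a minimal sketch of what a row-level duplicate check is generally expected to flag. It is plain Python rather than Spark, and it is not the logic in writer.py: the function name `flag_duplicates`, the sample records, and the choice to flag every row that shares a key (rather than only the later occurrences) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def flag_duplicates(
    rows: List[Dict[str, Any]], key_columns: Sequence[str]
) -> List[Dict[str, Any]]:
    """Return every row whose key (hypothetical helper) occurs more than once."""
    # Count how often each key tuple appears across the whole dataset.
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    # Keep the rows whose key is shared with at least one other row.
    return [
        row for row in rows
        if counts[tuple(row[c] for c in key_columns)] > 1
    ]


# Illustrative data only; not taken from the PR or the linked issue.
rows = [
    {"order_id": 1, "sku": "A"},
    {"order_id": 2, "sku": "B"},
    {"order_id": 1, "sku": "C"},
]
duplicates = flag_duplicates(rows, ["order_id"])
```

Here `duplicates` holds the two rows with `order_id` 1. The number of rows flagged this way is the kind of error-record count a detailed stats table would be expected to report for the rule.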

Related Issue

Link to the issue below
https://github.com/Nike-Inc/spark-expectations/issues/116

Motivation and Context

This resolves the duplicate-check (uniqueness) issue with the DQ rules. Adding the column also provides column-level error records, which we can build on for observability.

How Has This Been Tested?

This has been tested rigorously. This version of Spark Expectations is currently running in various environments in Nike PSSP pipelines integrated with observability features such as alerts and dashboards.
We also tested it locally with all the possible combinations of rules and datasets.
It passed all 400 unit test cases using [make cov and make test], which indicates that these changes do not break anything else in the code.

Screenshots (if appropriate):

Screenshot of the corrected duplicate-check result described in issue 116

Screenshots of all the test cases passing [make cov]

[make test] screenshots

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.

asingamaneni · 2024-11-22T06:10:11Z

spark_expectations/sinks/utils/writer.py

@@ -439,6 +421,7 @@ def _prep_detailed_stats(
                "table_name",
                "rule_type",
                "rule",
+                "column_name",


Are we making this column mandatory, and is this going to be a breaking change for existing pipelines? I think it would be better to make this optional for the next release and then make it mandatory later as folks adopt it.

Yes. For now I'm reverting the additional column changes and going forward with the duplicate check fixes.
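
One way the optional-first rollout suggested in this thread could look is sketched below. This is an assumption-laden illustration, not code from the PR: the helper `detailed_stats_columns` and the flag `include_column_name` are hypothetical, and `BASE_COLUMNS` lists only the three names visible in the diff hunk above, while the real list in `_prep_detailed_stats` has more entries.

```python
from typing import List

# Only the column names visible in the diff hunk; the real list is longer.
BASE_COLUMNS: List[str] = ["table_name", "rule_type", "rule"]


def detailed_stats_columns(include_column_name: bool = False) -> List[str]:
    """Build the column list, adding column_name only when opted in.

    Hypothetical helper: defaulting the flag to False keeps the existing
    schema unchanged for pipelines that have not adopted the new column.
    """
    columns = list(BASE_COLUMNS)  # copy so the base list is never mutated
    if include_column_name:
        columns.append("column_name")
    return columns


legacy_schema = detailed_stats_columns()
opted_in_schema = detailed_stats_columns(include_column_name=True)
```

With a default of `False`, existing pipelines keep writing the current schema, and the default can be flipped in a later release once the column is widely adopted.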

asingamaneni · 2024-11-22T06:10:54Z

README.md

@@ -60,6 +60,18 @@ Please find the spark-expectations flow and feature diagrams below
 <img src=https://github.com/Nike-Inc/spark-expectations/blob/main/docs/se_diagrams/features.png?raw=true width=1000></p>


+# Observability Enhancement: Additional Column for Data Quality Metrics


Please move this documentation under the docs folder so that it is reflected in the SE documentation.

Keeping the README file as it is, since the column name changes are being reverted.
Will update the documentation in a future release.

sudeep7978 added 4 commits November 18, 2024 01:45

Enhancement: Fixed duplicate check issues in row_dq and added a new c…

9ef5692

…olumn to stats_detailed table for improved observability. Fix: Addressed duplicate check issues in row_dq for improved data quality. Feature: Added new column to stats_detailed table for enhanced observability.

Added new column to stats_detailed table for enhanced observability.

bdb5a47

Added new column to stats_detailed table for enhanced observability.

191bbcf

Feature: Added new column to stats_detailed table for enhanced observ…

70070d0

…ability.

sudeep7978 requested review from asingamaneni and Umeshsp22 as code owners November 18, 2024 19:20

sudeep7978 added 2 commits November 19, 2024 21:56

Update CONTRIBUTORS.md

7aff41c

Add observability enhancement with additional column in stats_detaile…

e1a71c4

…d table Improves visibility and aids in quick resolution of data quality issues.

asingamaneni reviewed Nov 22, 2024

sudeep7978 and others added 10 commits November 22, 2024 12:49

reverting back additional column name changes

d64b1e6

reverting back additional column name changes

77abe81

reverting back additional column name changes

6c219ef

reverting back additional column name changes

54d4edb

Update README.md

2ae087f

Delete README.md

ddc3477

Adding the Spark Expectation Documentation under Docs Folder

08ad2b4

Delete docs/README.md

7c759c4

keeping the read_me file as it is

keeping the readme files as it is.

5b1dae6

Will update the documentation with the further release.

updating the list.

1b5e316
