Should a table be profiled first before adding data quality tests? #5456
Why can't we run tests for a table without profiling it first? The DAG with the tests runs successfully, but the results don't show up in OpenMetadata if the table doesn't have a profiling report.
Replies: 1 comment
OpenMetadata has two types of workflows: Ingestion and Profiler.

Ingestion is the fast one: it extracts metadata and makes entities available. It is capped at one workflow per service. The profiler, on the other hand, lets you schedule as many profiler workflows as you need, each with its own schedule and its own batch of entities. You can disable profiling during metadata ingestion and rely on the profiler workflows directly, deploying several of them with different filter patterns.

The profiler workflow is heavier because it computes metrics for the table and its columns. Data quality tests are evaluated against those metrics, so for now a fresh profile run is required to compute the tests.

When we started development, there was only the ingestion pipeline, so the profiling happened there. We've retained the profiling capabilities in the ingestion pipeline for convenience. We'll remove them in Release 0.11 in favor of the Profiler Workflow, which is the only one that runs the tests.

Having the profiler in a separate pipeline lets users choose which tables they want to review and which they don't, so they can run it at a more relaxed cadence than the metadata ingestion.

After separating the pipelines, tests can be run without profiling a table first, and it'll also be much quicker to work with. For each table, we can compute the metrics and, if there are any tests, execute them too.
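To make that last step concrete, here is a minimal Python sketch of the "compute the metrics, then run any tests for that table" loop. It is only a toy model of the idea: `TableTest`, `compute_metrics`, and `profile_and_test` are hypothetical names invented for illustration, not part of the OpenMetadata API, and the tables are plain in-memory lists of rows.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical in-memory stand-in for a table: a list of rows.
Row = Dict[str, Optional[int]]
Metrics = Dict[str, float]


@dataclass
class TableTest:
    """Hypothetical test definition: a name plus a check over profiler metrics."""
    name: str
    check: Callable[[Metrics], bool]


def compute_metrics(rows: List[Row]) -> Metrics:
    """Profile a table: row count plus a null count per column."""
    metrics: Metrics = {"row_count": float(len(rows))}
    columns = sorted({col for row in rows for col in row})
    for col in columns:
        nulls = sum(1 for row in rows if row.get(col) is None)
        metrics[f"{col}.null_count"] = float(nulls)
    return metrics


def profile_and_test(
    tables: Dict[str, List[Row]],
    tests: Dict[str, List[TableTest]],
) -> Dict[str, Dict[str, bool]]:
    """For each table, compute metrics, then run that table's tests against them."""
    results: Dict[str, Dict[str, bool]] = {}
    for table_name, rows in tables.items():
        metrics = compute_metrics(rows)
        # Tables without any tests are still profiled; they just get no results.
        table_tests = tests.get(table_name, [])
        results[table_name] = {t.name: t.check(metrics) for t in table_tests}
    return results


tables = {
    "orders": [{"id": 1, "amount": 10}, {"id": 2, "amount": None}],
    "users": [{"id": 1}],
}
tests = {
    "orders": [
        TableTest("has_rows", lambda m: m["row_count"] > 0),
        TableTest("amount_not_null", lambda m: m["amount.null_count"] == 0),
    ],
}

results = profile_and_test(tables, tests)
print(results)
```

The point of the sketch is the dependency: each test reads from the metrics dictionary, which is why a test has nothing to evaluate until the table has been profiled.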