Replies: 1 comment
Thanks @renepajta for the summary! Event-based triggers are the best approach to model dependencies between pipelines that run in different workspaces. As mentioned, you can use storage events or custom events to trigger pipelines. There is also the option to publish events using Web Activity and MSI authentication: we have a sample of this running in our environment and are currently evaluating how best to integrate it into the landing zones, as there are service limitations that do not make it an easy decision.
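For illustration only (not the sample referenced above), here is a minimal sketch of the publish call such a setup makes: a managed identity acquires a token for Event Grid and POSTs a custom event to a topic endpoint. It is written in Python so it can also run from an Azure Function; the topic name, region, event type, and payload fields are placeholders.

```python
# Sketch: publish a custom event to an Event Grid topic with a managed identity,
# i.e. the same token + POST that a Web Activity with MSI authentication performs.
# Topic name, region and event payload are illustrative placeholders.
import uuid
from datetime import datetime, timezone

import requests
from azure.identity import ManagedIdentityCredential

TOPIC_ENDPOINT = "https://data-product-events.westeurope-1.eventgrid.azure.net/api/events"

# Acquire an AAD token for Event Grid with the managed identity
# (the identity needs the "EventGrid Data Sender" role on the topic).
token = ManagedIdentityCredential().get_token("https://eventgrid.azure.net/.default").token

event = {
    "id": str(uuid.uuid4()),
    "subject": "dataproduct/sales/enriched/orders",    # consumers can filter on this
    "eventType": "DataProduct.EntityLoaded",            # custom event type (placeholder)
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "data": {"entity": "orders", "runId": "<pipeline-run-id>"},  # forwarded to the trigger
    "dataVersion": "1.0",
}

response = requests.post(
    TOPIC_ENDPOINT,
    json=[event],  # Event Grid expects an array of events
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()
```

In a Web Activity the same call is expressed declaratively: URL set to the topic endpoint, method POST, the event array as the body, and the authentication property set to MSI with resource https://eventgrid.azure.net.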
Scenario: a data product consumes data from another data product when it is ready
Data Integrations are responsible only for bringing the data to the Enriched layer, from where others can start consuming it. Additionally, data products may expose their products for consumption.
There are different approaches for how a data product could consume data, and the chosen approach should be defined in the data contract:
I especially like the last approach, as it allows scaling across the organisation. Data consumers can listen to the event and consume newly arrived data as soon as the load is done. Also, in some cases, a data consumer wants to wait until the whole dataset (multiple entities) is loaded before consuming (e.g. loading into a star schema).
Synapse Pipelines / Data Factory offer Storage Account events and Custom events for configuring event triggers, which makes it possible to go in this direction.
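As a rough sketch of the consuming side, and assuming the azure-mgmt-datafactory SDK (Synapse exposes an equivalent custom event trigger through its own tooling), a custom event trigger could be wired to a topic like this; topic, pipeline, and event names are hypothetical:

```python
# Sketch: create a custom event trigger that starts a pipeline when a matching
# event arrives on an Event Grid topic. Names and the event type are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CustomEventsTrigger,
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = TriggerResource(
    properties=CustomEventsTrigger(
        # Resource ID of the custom Event Grid topic the producing data product publishes to.
        scope=(
            "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
            "/providers/Microsoft.EventGrid/topics/<topic-name>"
        ),
        events=["DataProduct.EntityLoaded"],        # custom eventType values to match
        subject_begins_with="dataproduct/sales/",   # optional subject filter
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="ConsumeSalesOrders"
                ),
                # Values from the event payload can be mapped to pipeline parameters.
                parameters={"entity": "@triggerBody().event.data.entity"},
            )
        ],
    )
)

client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "EntityLoadedTrigger", trigger
)
```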
However, there is little to no built-in integration for publishing events from a pipeline directly. Therefore, I would be interested in concepts / your experience with how we can publish events (I am exploring Web Activity or using Azure Functions).
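One possible way to close that gap, sketched under assumptions: a small Azure Function (or any service the pipeline can call as its last step) publishes a "load finished" event with the azure-eventgrid SDK and a managed identity. The topic endpoint, event type, and payload below are placeholders.

```python
# Sketch: publish a custom "load finished" event from Python, e.g. inside an
# Azure Function invoked by the pipeline. The identity running this needs the
# "EventGrid Data Sender" role on the topic; all names are illustrative.
from azure.eventgrid import EventGridEvent, EventGridPublisherClient
from azure.identity import DefaultAzureCredential

TOPIC_ENDPOINT = "https://<topic-name>.<region>-1.eventgrid.azure.net/api/events"

client = EventGridPublisherClient(TOPIC_ENDPOINT, DefaultAzureCredential())

client.send(
    EventGridEvent(
        subject="dataproduct/sales/enriched/orders",
        event_type="DataProduct.EntityLoaded",
        data={"entity": "orders", "runId": "<pipeline-run-id>"},
        data_version="1.0",
    )
)
```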