Data Modeling - Kimball, 3NF, Star Schemas, Activity Schema, Whats Best? #1245

sisu-callum · 2022-03-17T23:04:21Z

What is the best modeling style in the era of dbt and modern cloud data warehouses? The tradeoffs frequently mentioned in the data modeling texts of the 90s don't necessarily apply anymore. For example, Aaron Ormiston explained this well by saying:

One other view is that 3NF works really hard to avoid duplicating anything, anywhere. But in avoiding duplication, fully normalized tables spider out in terms of the schema and are therefore a bit harder to work with in an analytical context.
A benefit and trade-off of Kimball modeling is that you trade that storage savings for schema simplicity. You essentially accept that you're going to flatten some of those tables that would have spidered off during normalization back into a single dim with duplicated values, for the purpose of having everything be one join away from a fact table.
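
To make that trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module so it runs without a warehouse. All table and column names (customer, city, country, orders, dim_customer) are made up for illustration and are not from the quoted discussion. It flattens a normalized customer → city → country chain into one dimension, so the fact table is a single join away from every attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 3NF-style source: each attribute lives in exactly one place
    CREATE TABLE country  (country_id INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE city     (city_id INTEGER PRIMARY KEY, city_name TEXT, country_id INTEGER);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT, city_id INTEGER);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);

    INSERT INTO country  VALUES (1, 'NZ'), (2, 'US');
    INSERT INTO city     VALUES (10, 'Auckland', 1), (11, 'Wellington', 1), (20, 'Denver', 2);
    INSERT INTO customer VALUES (100, 'Ada', 10), (101, 'Bob', 11), (102, 'Cy', 20);
    INSERT INTO orders   VALUES (1, 100, 50.0), (2, 101, 25.0), (3, 102, 10.0), (4, 100, 5.0);

    -- Kimball-style dimension: city and country are flattened into the customer
    -- row, so country_name is repeated for every customer in that country
    CREATE TABLE dim_customer AS
    SELECT cu.customer_id, cu.customer_name, ci.city_name, co.country_name
    FROM customer cu
    JOIN city ci    ON ci.city_id = cu.city_id
    JOIN country co ON co.country_id = ci.country_id;
""")

# Analytical query: one join from the fact table, instead of three
revenue_by_country = dict(conn.execute("""
    SELECT d.country_name, SUM(f.amount)
    FROM orders f
    JOIN dim_customer d ON d.customer_id = f.customer_id
    GROUP BY d.country_name
""").fetchall())
print(revenue_by_country)
```

The duplicated `country_name` values are the storage cost being traded away; the single join in the final query is the simplicity gained.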

So what is the best modeling style? Obviously there are many implementations, and different styles fit different use cases, but an opinionated starting point might be a great lead-in for junior people in the community who aren't familiar with data modeling, or for seasoned veterans who are coming into the cloud data warehouse space from the on-prem world and are looking for guidance on modeling in a different paradigm!

My example that I like to give is:

Maybe I'm the odd duck out here, but I actually do both star schema modeling AND the big-wide-table modeling that is becoming more popular. My star schema is a version that's not constrained by all of the Kimball rules. Then I build my big wide tables off of those objects, and it's way easier for end consumption. But I've found that going through the entity-building process of a star schema is helpful in framing how things relate and interact.
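
As a rough illustration of building a wide table on top of star schema objects, here is a small sketch, again with sqlite3. The names fct_orders, dim_customer, dim_product and obt_orders are hypothetical. The wide table keeps the grain of the fact table and simply carries the dimension attributes along.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT, segment TEXT);
    CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE fct_orders   (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                               product_id INTEGER, amount REAL);

    INSERT INTO dim_customer VALUES (1, 'Ada', 'enterprise'), (2, 'Bob', 'self-serve');
    INSERT INTO dim_product  VALUES (10, 'Widget', 'hardware'), (11, 'Plan', 'subscription');
    INSERT INTO fct_orders   VALUES (100, 1, 10, 40.0), (101, 1, 11, 15.0), (102, 2, 11, 15.0);

    -- The wide table keeps the fact grain (one row per order) and carries
    -- every dimension attribute along, so consumers never write a join
    CREATE VIEW obt_orders AS
    SELECT f.order_id, f.amount,
           c.customer_name, c.segment,
           p.product_name, p.category
    FROM fct_orders f
    LEFT JOIN dim_customer c ON c.customer_id = f.customer_id
    LEFT JOIN dim_product  p ON p.product_id  = f.product_id;
""")

# End consumers query the wide table directly
rows = conn.execute(
    "SELECT segment, category, SUM(amount) FROM obt_orders "
    "GROUP BY segment, category ORDER BY segment, category"
).fetchall()
print(rows)
```

The star schema stays the place where entities and relationships are defined; the wide view is only a convenience layer on top of it.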

Based on this Slack thread started by Noel Gomez.

sisu-callum · 2022-03-17T23:18:26Z

I've actually had a good degree of success recently using a few of the concepts from Narrator's Activity Schema framework, especially as a method of modeling on top of the star schema. I fell into this design pattern when working with a particularly messy and unreliable front-end event tracking system. My stakeholders were interested in seeing usage patterns, and the implementation of our FE tracking made the data difficult to trust.

I realized that 80% of the events that the product team wanted to track were actually events that impacted database objects: changing this configuration, deleting object A, and so on. So instead of trying to wrangle an engineering team focused on launching new features, I took those objects (already represented as dimensions in my star schema) and created unpivoted changelogs with the code below.

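    {# Unpivot every non-excluded column into (changed_field, changed_value) rows #}
    {{ dbt_utils.unpivot(
        relation=ref('your model'),
        cast_to='string',
        {# exclude: identifying columns kept as-is on every output row #}
        exclude=[
            'whatever unique fields identify person and object'
        ],
        field_name='changed_field',
        value_name='changed_value'
    ) }}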

I was then able to find the specific field-level changes that the stakeholders were interested in and represent each of them as an activity in a combined activity_schema. I then layered in the fact tables, reformatting them all based on a common spec file created in our repo.
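
For anyone who wants to see the shape of that transformation outside dbt, here is a minimal pure-Python sketch of the idea: unpivot wide rows into field/value rows, then keep only the rows where a field's value actually changed. The snapshot data, the column names (object_id, user_id, updated_at, status, theme) and the helper functions are all hypothetical; this is not the author's actual implementation.

```python
from itertools import groupby

# Hypothetical snapshots of one dimension row over time
snapshots = [
    {"object_id": 1, "user_id": 7, "updated_at": "2022-03-01", "status": "draft",  "theme": "light"},
    {"object_id": 1, "user_id": 7, "updated_at": "2022-03-02", "status": "active", "theme": "light"},
    {"object_id": 1, "user_id": 7, "updated_at": "2022-03-05", "status": "active", "theme": "dark"},
]
KEYS = ("object_id", "user_id", "updated_at")  # plays the role of `exclude`


def unpivot(rows, keys):
    """Turn each wide row into one (keys..., changed_field, changed_value) row per column."""
    for row in rows:
        for field, value in row.items():
            if field not in keys:
                yield {**{k: row[k] for k in keys},
                       "changed_field": field, "changed_value": str(value)}


def changes_only(long_rows):
    """Keep a row only when its value differs from the previous one for that object and field."""
    def sort_key(r):
        return (r["object_id"], r["changed_field"], r["updated_at"])

    def group_key(r):
        return (r["object_id"], r["changed_field"])

    for _, group in groupby(sorted(long_rows, key=sort_key), key=group_key):
        previous = None
        for r in group:
            # The first observed value is treated as a baseline, not a change
            if previous is not None and r["changed_value"] != previous:
                yield r
            previous = r["changed_value"]


changelog = list(changes_only(unpivot(snapshots, KEYS)))
for entry in changelog:
    print(entry["updated_at"], entry["changed_field"], "->", entry["changed_value"])
```

Each surviving row is a candidate activity (for example, "status changed to active") that can be renamed and filtered to the changes stakeholders care about.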

This resulted in a single table that contained all user actions in the same format. It was at this point that dbt-labs released their dbt_metrics package and my life got even better. I was able to add a field within the spec called activity_metric and define the aggregation logic within the associated metric config. Suddenly I've got most of my metrics pulling from a centralized, tested, and pruned table of pre-defined user actions, all without needing to hound my engineering pals to update front-end tracking.
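
Here is a rough sketch of how a spec-driven activity table can feed metrics. It is a plain-Python stand-in, not the dbt_metrics API, and the spec layout, activity names and the `agg` key are assumptions made up for the example; only the `activity_metric` field name comes from the description above.

```python
from collections import defaultdict

# Hypothetical spec: one entry per activity, naming the metric it feeds
# and how that metric is aggregated
SPEC = {
    "changed_status": {"activity_metric": "status_changes", "agg": "count"},
    "placed_order":   {"activity_metric": "revenue",        "agg": "sum"},
}

# Every source (changelogs, fact tables) is reshaped into the same columns
activity_stream = [
    {"ts": "2022-03-02", "user_id": 7, "activity": "changed_status", "value": None},
    {"ts": "2022-03-03", "user_id": 7, "activity": "placed_order",   "value": 40.0},
    {"ts": "2022-03-04", "user_id": 9, "activity": "placed_order",   "value": 15.0},
    {"ts": "2022-03-05", "user_id": 9, "activity": "changed_status", "value": None},
    {"ts": "2022-03-05", "user_id": 9, "activity": "viewed_page",    "value": None},  # not in spec
]


def compute_metrics(stream, spec):
    """Aggregate the activity stream into metrics according to the spec."""
    totals = defaultdict(float)
    for row in stream:
        rule = spec.get(row["activity"])
        if rule is None:
            continue  # activities without a spec entry feed no metric
        if rule["agg"] == "count":
            totals[rule["activity_metric"]] += 1
        elif rule["agg"] == "sum":
            totals[rule["activity_metric"]] += row["value"]
        else:
            raise ValueError(f"unsupported aggregation: {rule['agg']}")
    return dict(totals)


metrics = compute_metrics(activity_stream, SPEC)
print(metrics)
```

The point of the pattern is that adding a metric means adding a spec entry, not touching front-end tracking.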

I have to imagine this only helps in a paradigm where you have defined users, but hopefully it's useful!

noel · 2022-03-18T11:49:33Z

Thanks for getting this discussion going @sisu-callum. Aside from the options listed in the title, it would be good to write down our perspective on things like Unified Star Schema (USS), Data Vault (DV), and One Big Table (OBT). I believe each method has pros and cons, and in certain contexts one technique may well be preferable.

For example, in a large enterprise it is possible that DV is preferred in the source / staging area while OBT is preferred in the data consumption layer for end user simplicity, but USS might also be an option.

One other angle I have thought about is long-term maintenance and support. While one technique, like DV, might be technically preferable, the extra training and understanding it requires might lead us to a less optimal but more understandable recommendation.

JC-Lightfold · 2022-03-19T03:16:52Z

Likewise, great to see the conversation deepening @sisu-callum. I'll repost my initial comment from Slack here:

Very interested in this conversation myself. At Lightfold we’re incubating a view that, with the modern stack, we might be able to transcend the traditional Kimball v Inmon debate now that it's no longer about hardware. What emerges instead is a functional understanding of the “visible spectrum” of data according to the permissible structures of set theory and relational algebra.

If we conceptualise a continuum from truly unstructured data (think block text or sound files) through semi-structured data (think JSON or key-value stores) and then into the 1st, 2nd, 3rd/3.5 and 4th to 6th normal forms, we start to see the form of data as simply a variable of its nature, like wavelength and amplitude. The benefit is that we can map the body of knowledge to each layer: Hadoop-style string processing at the bottom, then Pig, then Graph and OWL, then SQL. SQL then begins the journey up the levels of increasing fractal definition with the normal forms, adding temporality to reduce atomically all the way to the sixth.

Likewise, we can map architectures: Apache/Hadoop/Cassandra, Graph DBs and compression formats based on key-value pairs (like Vertipaq) or other magic (like .hyper), then Kimball schemas, then ODS 3/3.5 schemas, and then Data Vault. If we do this, we start to see that you could validly select any and all of the above if the right conditions are met. Similarly, none of them should remain inherently sacred or superior: they are patterns sitting on top of technologies that are implications of the maths.

What if we could pick our schematic architectures as easily as we pick our graph front ends? And what if we did so on the basis of architectural and business fit, instead of our personal data religions? ;-)

JC-Lightfold · 2022-03-19T03:22:50Z

@dangalavan You may be interested in this discussion given your DEDAG initiative.

aormiston · 2022-03-25T22:30:19Z

One important thing to keep in mind with modeling styles is that it's not just a technical decision. There are maintenance and scaling costs associated with doing something that is too innovative.

Dan McKinley has a great article on "choosing boring technology" where he makes the case that, while companies tend to be optimistic about innovating endlessly, they actually only have a finite number of 'innovation tokens' to spend. Doing something cutting edge with data modeling is actually spending one of those innovation tokens. If it doesn't provide the upside of a significant competitive advantage for your business' core product offering, in my experience, it's not worth one of your team's innovation tokens.

I've been on both sides of this. I've built a DW using a more custom/innovative paradigm. I've built multiple using Kimball.

At this point, Kimball is more or less a "boring" technology. Lots of shops have used it to great effect. It's not particularly innovative or exciting. It's a little outdated and can be clunky.

But it's well documented. And proven. And there's great value in that.

If your business evaluates Kimball and decides you have a significant reason to spend an 'innovation token' on something other than Kimball or one of the other well-trodden methodologies, I also think that's perfectly valid. As long as you're viewing it as a cost and a risk with upside, because that's truly what it is.

2 replies

aormiston Mar 25, 2022

What do you think @danielschoro?

danielschoro Mar 28, 2022

I actually love this discussion!

I agree with @aormiston in the sense that if it's working and it works well then stick with it.

I have to admit, I am biased towards Kimball because I've had success in delivering real business value very quickly with it. I feel like that is where its strength is. It is old, but with some adaptation towards modern data warehouse database engines, I still think it's extremely relevant. It follows a few basic principles that are easy to grasp and that scale, and in my opinion that's what makes Kimball the best approach to data modeling. Of course that opinion can change. But it's what I'd suggest to anyone new to the space right now. I've seen junior data engineers come in and produce a valuable data mart within as little as a few days after learning the method and getting sign-off from senior engineers on the design docs (ERD). Super powerful stuff.

The thought came to me the other day that, with some work, one could take the Kimball method and use config files to fully map from a source system (or several) to a Kimball dimension or fact table, or even a full star schema. Automating this could open the door to even more value-producing layers, like the metrics layer. 😁
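
A very rough sketch of what that could look like: a config describing one dimension, and a small function that renders the SQL for it. The config keys, the source name `crm.customers` and the function `render_dimension_sql` are all hypothetical, and the `ROW_NUMBER()` surrogate key is only a placeholder; a real project would more likely want a stable hashed key.

```python
# Hypothetical mapping config; in practice this might live in a YAML file
DIM_CUSTOMER = {
    "target": "dim_customer",
    "source": "crm.customers",
    "surrogate_key": "customer_key",
    "natural_key": "id",
    "columns": {  # target column -> source expression
        "customer_name": "full_name",
        "country": "UPPER(country_code)",
    },
}


def render_dimension_sql(cfg):
    """Render a CREATE TABLE ... AS SELECT statement for one dimension from its config."""
    select_lines = [
        # Placeholder surrogate key; not stable across rebuilds
        f"    ROW_NUMBER() OVER (ORDER BY {cfg['natural_key']}) AS {cfg['surrogate_key']}",
        f"    {cfg['natural_key']} AS {cfg['target']}_natural_key",
    ]
    select_lines += [f"    {expr} AS {name}" for name, expr in cfg["columns"].items()]
    return (
        f"CREATE TABLE {cfg['target']} AS\nSELECT\n"
        + ",\n".join(select_lines)
        + f"\nFROM {cfg['source']}"
    )


sql = render_dimension_sql(DIM_CUSTOMER)
print(sql)
```

The same approach would extend to fact tables by adding foreign-key lookups against the generated dimensions, though that part is left out here.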

Would love to hear more of what others think.
