-
Notifications
You must be signed in to change notification settings - Fork 6
First Tutorial Part 4: Pivot View Definition
Previous | Main | Next |
---|---|---|
<-- Part 3 | Main | Part 5 --> |
Once the base-line tables are defined, the next step is to define the view that will do the pivoting. For our needs, it would look like this:
{ "name": "Form_SAT01_PivotView"
,"description": "A pivoted view of SAT_01 forms"
,"columns":[
{ "sameas": "Form.refnum" , "name":"formRefnum" }
,{ "sameas": "Form.fillDate" , "name":"formFillDate" }
,{ "sameas": "User.refnum" , "name":"formUserRefnum" }
,{ "sameas": "User.email" , "name":"formUserEmail" }
,{ "sameas": "FormAnswer.value", "aggregate":"MAX" }
]
,"subWhereX":{
"clause":[" Form.type = 'SAT_01'"
,"and Form.deleted is null"
,"and FormAnswer.deleted is null"
]
,"description":["Only look at SAT_01 forms and answers that have not been deleted."
]
}
,"pivots":[
{ "on": "FormAnswer.field"
,"aggregates":[
{ "name":"value", "prefix":"" }
]
,"values":[ { "value":"Q1", "expression":"Tilda.toInt(?,null)", "type":"INTEGER", "description":"Q1"}
,{ "value":"Q2", "expression":"Tilda.toInt(?,null)", "type":"INTEGER", "description":"Q2"}
,{ "value":"Q3", "expression":"Tilda.toInt(?,null)", "type":"INTEGER", "description":"Q3" }
,{ "value":"Q4" , "description":"Q4" }
]
}
]
}
To create a pivoted view, you must:
- Specify a meaningful name and description for the pivot view. The string "PivotView" in any case format you wish, must end the view name.
- List first all the columns that will form your anchor (here, "formRefnum", "formFillDate", "formUserRefnum" and "formUserEmail").
- Add the aggregates.
- As per the View Columns Syntax, aggregates must be defined last and Tilda will assume a group by for the first non-aggregate columns defined (here, the first 4 forming our anchor).
- In this example, we only have one aggregate, but multiple ones could be defined as we'll see when we expand this example further below.
- A trick for single name/value pair pivoting is to simply use MAX or MIN as an aggregate since we know there is only one answer to a given question. Technically, FIRST or LAST (the other 2 aggregates that yield a single value of the same type as the source column) could also be used but may not perform as well. As is always the case with performance, measure what you get.
- Specify a where-clause if applicable. Here, we need to since the Form and FormAnswer tables could technically contain many different types of forms: we have to at least specify the form type we are targeting, in this case 'SAT_01'. Other form types may contain completely different question/field sets. We are also filtering out deleted contents using the auto-generated life-cycle column "deleted" on both Form and FormAnswer.
- Specify joins if applicable. Here, we did not have to specify any explicit joins as Tilda can figure it all out automatically here. You can check out the Joins definition for more details.
- Check the Pivots Syntax for a definition of all the elements that define pivots.
🎈 NOTE: Pivoted column names are generated by taking the individual pivot value names cross-producted with the list of aggregates specified with prefix/suffix if provided.
- Let's assume that we define aggregates=[{name="abc",prefix="a_"}, {name="xyz", prefix="x_"}] respectively and the values "Q1" through "Q4", the system will generate columns "a_Q1", ... , "a_Q4", "x_Q1", .., "x_Q4" if interleave is false (default), or columns "a_Q1", "x_Q1", ... , "a_Q4", "x_Q4" if interleave is true.
🎈 NOTE: If more than one aggregate is specified, at least one of prefix or suffix must be specified to differentiate the generated columns for each aggregate.
🎈 NOTE: Pivoting can specify type transformations to make the final data easier to use in many scenarios, including analytics. In most cases, a flat table of question/answers would use a generic VARCHAR/STRING data type to hold answer values. But when pivoting, you may know that answer X is a date, answer Y is text, and answer Z is an integer.
The result should simply look like this:
formRefnum | formFillDate | formUserRefnum | formUserEmail | Q1 | Q2 | Q3 | Q4 |
---|---|---|---|---|---|---|---|
111 | 2019-09-03 | 543567 | a@a.com | 10 | 8 | 8 | Blah... |
222 | 2019-09-03 | 634568 | b@b.com | 9 | 7 | 7 | Bleh... |
The SQL generated under the covers should be somewhat self-evident (T in Tilda stands for Transparent but writing a pivoting query is more complex than a straight view) and look like:
create or replace view TILDATEST.Form_SAT01_PivotView as
with T as (
-- 'A pivoted view of SAT_01 forms'
select TILDATEST.Form."refnum" as "formRefnum" -- The primary key for this record
, trim(TILDATEST.Form."fillDateTZ") as "formFillDateTZ" -- Generated helper column to hold the time zone ID for 'fillDate'.
, TILDATEST.Form."fillDate" as "formFillDate" -- The date the form was filled
, TILDATEST.User."refnum" as "formUserRefnum" -- The primary key for this record
, TILDATEST.User."email" as "formUserEmail" -- The user's email
, TILDATEST.FormAnswer."field" as "field" -- question/field id
, max(TILDATEST.FormAnswer."value") as "value" -- answer value
from TILDATEST.Form
inner join TILDATEST.User on TILDATEST.Form."userRefnum" = TILDATEST.User."refnum"
left join TILDATEST.FormAnswer on TILDATEST.FormAnswer."formRefnum" = TILDATEST.Form."refnum"
where ( TILDATEST.Form."type" = 'SAT_01'
and TILDATEST.Form."deleted" is null
and TILDATEST.FormAnswer."deleted" is null
)
and ( TILDATEST.FormAnswer."field" in ('Q1', 'Q2', 'Q3', 'Q4')
)
group by 1, 2, 3, 4, 5, 6
) select "formRefnum"
, "formFillDateTZ"
, "formFillDate"
, "formUserRefnum"
, "formUserEmail"
, MAX("value") as "value"
, (Tilda.toInt(MAX("value") filter (where "field"= 'Q1') ,null))::integer as "Q1"
, (Tilda.toInt(MAX("value") filter (where "field"= 'Q2') ,null))::integer as "Q2"
, (Tilda.toInt(MAX("value") filter (where "field"= 'Q3') ,null))::integer as "Q3"
, MAX("value") filter (where "field"= 'Q4') as "Q4"
from T
group by 1, 2, 3, 4, 5
;
COMMENT ON VIEW TILDATEST.Form_SAT01_PivotView IS E'-- DDL META DATA VERSION 2019-01-09\ncreate or replace view TILDATEST.Form_SAT01_PivotView as \nwith T as...';
COMMENT ON COLUMN TILDATEST.Form_SAT01_PivotView."formRefnum" IS E'The primary key for this record';
COMMENT ON COLUMN TILDATEST.Form_SAT01_PivotView."formFillDateTZ" IS E'Generated helper column to hold the time zone ID for ''fillDate''.';
COMMENT ON COLUMN TILDATEST.Form_SAT01_PivotView."formFillDate" IS E'The date the form was filled';
COMMENT ON COLUMN TILDATEST.Form_SAT01_PivotView."formUserRefnum" IS E'The primary key for this record';
COMMENT ON COLUMN TILDATEST.Form_SAT01_PivotView."formUserEmail" IS E'The user''s email';
COMMENT ON COLUMN TILDATEST.Form_SAT01_PivotView."Q1" IS E'Q1 (pivot of MAX(TILDATEST.FormAnswer.value) on TILDATEST.FormAnswer.field=''Q1'')';
COMMENT ON COLUMN TILDATEST.Form_SAT01_PivotView."Q2" IS E'Q2 (pivot of MAX(TILDATEST.FormAnswer.value) on TILDATEST.FormAnswer.field=''Q2'')';
COMMENT ON COLUMN TILDATEST.Form_SAT01_PivotView."Q3" IS E'Q3 (pivot of MAX(TILDATEST.FormAnswer.value) on TILDATEST.FormAnswer.field=''Q3'')';
COMMENT ON COLUMN TILDATEST.Form_SAT01_PivotView."Q4" IS E'Q4 (pivot of MAX(TILDATEST.FormAnswer.value) on TILDATEST.FormAnswer.field=''Q4'')';
As you can see, the generated SQL query is not trivial and a lot of performance-related work went into it (CTE, field filtering in the where-clause and filtered aggregates). The "native" pivoting mechanism provided by extensions, and other mechanisms that are query-based have been considered when arriving at the current form. The separate article Scalable Pivoting details this work in much more depth. That being said, there are some obvious things to consider regarding performance given this current implementation:
- We are filtering the forms data on "type", so having an index is fairly critical if you are handling large data sets. We have defined it as expected when defining the Form table. This is an application-level decision here.
- The FormAnswer table is likely much larger. If we had 1M SAT_01 forms entered, with 5 questions, the answer table would contain 5M rows. Since it's joined on "formRefnum" and pivoted on "field", an index covering those 2 columns is essential for the view to perform. This is due primarily to the structure of the generated view along with the application's table design.
Structurally, the view is made out of:
- a CTE defining a row-based structure of all the anchor columns, the columns pivoted on, and the aggregates.
- the column order is significant and creates a datamart-like structure for all required columns (akin to dimensions) and their aggregates (akin to measures).
- the join is automatically determined as columns extracted are from tables that have foreign keys defined between them. In particular, the framework knows when to do an inner join, and when to do a left join.
- the where-clause is assembled from whatever was specified for the view, ANDed with a filter for the pivoted values requested.
- the group-by clause is automatically determined
- an outer query that takes the output of the CTE and does the actual pivoting
- the final columns are automatically generated as specified
- Postgres' very convenient and well performing "aggregate filter" feature is used. For other databases, the classic max(case when "field"='Q1' then Tilda.toInt("value") else null end) pattern would be used.
- the final group-by clause is automatically determined
🎈 NOTE: The "max" trick in the CTE is pretty common when the use case calls for a single name/value pair structure of the question/answer form that we have defined here. The advantage is that performance-wise, it doesn't add much overhead in this case, while remaining general for any other types of aggregates when considering having a one-to-many relationship between the column pivoted on, and multiple values coming from multiple records. For example, you may have a table of hospital visits that have a type and a line of service, and you want to create a pivot with a simple count of the visits across specific types (e.g., ER, Inpatient, Outpatient...) and multiple lines of service (e.g., Obstetrics, ICU...)
As an example of what types of data sets we have looked at in our company, we have dealt with various width (number of columns) and length (number of rows) sizes that although they do not fall in the big-data space, they are nevertheless non trivial.
- Created a pivot table of OASIS assessments covering multiple versions of the spec.
- 500K assessments
- 140M question/answer rows
- 590 distinct questions across multiple versions of the specs
- Created a pivot table of HIS assessments covering multiple versions of the spec.
- 30K assessments
- 3M question/answer rows
- 150 distinct questions across multiple versions of the specs
- Created a pivot table of ICC/BWAT wound assessments.
- 350K assessments
- 54 distinct questions
- 6M question/answer rows
- We processed general physical assessments across an entire health system across multiple years into pivot tables.
- 3M assessments
- 2000 distinct questions
- 170M question/answer rows
🎈 NOTE: By the way, Tilda's markup represents about 80 lines of JSON definition and just here in terms of SQL generated and managed, we have about 120 lines of code. That SQL code contains full documentation at the column and table/view level, as well as some fairly complex logic in the pivot view itself. If someone enhanced the form to add a Q5 for example, a single line of JSON would be added to the Tilda pivot definition and the view and docs would be updated accordingly. For larger real-life example where a pivot could include dozens of columns, the payoff increases dramatically.
- If you then add browsable/searchable HTML documentation and Java code, the power of automation through Tilda is even more dramatic in terms of code automatically generated and managed throughout iterative development cycles.
- Tilda understands how the view is composed and can intelligently generate more detailed lineage information for pivoted columns.
- You now have a living, breathing, model. After it's initially deployed, adding that Q5 column will be handled through automated migration during iterative deployment cycles.
- Finally, with Tilda's Realization feature, such a view can efficiently be written out automatically to a table and maintained as new data is loaded. Automated migration will ensure any updates to the model will be seamless throughout the realization process.
Previous | Main | Next |
---|---|---|
<-- Part 3 | Main | Part 5 --> |