How to Use and Update the SIDF

Overviews and instructions on how to install and run the SIDF are found in the SIDF README.

This page contains the notes for the How to Workshops that SIA runs.

SIDF Workshop Notes

Overview

Social Investment Data Foundation – is a set of scripts designed to create an analysis-ready dataset. By specifying a group of people to analyse and a set of arguments to specify what variables are created, building datasets ready to analyse can be accomplished in under a day.

[Put on whiteboard]

What does an analysis dataset contain?

Characteristics (gender, age, ethnicity, iwi and so on)
Service metrics such as cost and duration (this is where the SIAL comes in)
Outcome variables that involve complex logic that don’t fit into the above two categories
- Highest qualification
- Household composition
- Mother smokes
- Potentially avoidable hospitalisations
- Te reo speaker
- Age a first birth
- Contact records
- Diabetes
- Disability needs
- ECE participation
- Frequency of address changes
- Household composition
- Mental health and addiction indicator
- Social housing indicator
- Before school check scores
- And potentially others

What is the output from the data foundation

Based on parameters you specify in a control script a series of tables will be created. These tables will have names such as:

SI_PD_COHORT_CHAR – characteristics as per the data foundation

SI_PD_COHORT_CHAR_EXT – any extensions you made to the standard SI_PD_COHORT_CHAR table

<sial_table>_RLPW – service metric tables in a wide table format (long tables have _RLPL suffix)

Outcomes tables don’t have standard names. If it is not one of the above tables or the control file or cohort table then it is an outcomes table

Note: A single analysis-ready table is not created as the control script has the potential to create thousands of variables which is not useful for analysis. It was decided that the user should determine what variables they want so the final joins to create a single analysis-ready dataset are left to the used.

What are some examples of column names?

Characteristics have intuitive names such as snz_sex_code, as_at_age and so on

Service metrics use a code based system of naming. This is best illustrated with an example.

A sample of a SIAL table is shown below

snz_uid	department	datamart	subject_area	start_date	end_date	cost	event_type
1234	MSD	BEN	T1	01/01/2013	31/12/2013	987.65	JSS
1234	MSD	BEN	T1	30/06/2014	31/12/2014	123.45	JSS

The SIAL tables are summarised to create a series of service metrics that use the following notation

W = the window. This is a time window for the analysis you are interested in [draw a picture to illustrate history, profile, forecast and analysis windows]
XXX = the department
YYY = the datamart
ZZZ = subject area
AAA = one of the following metrics: count (cnt), duration (dur), cost (cst), days since the first event (dsf), days since the last event (del).

For the benefit example some service metrics could be

P_MSD_BEN_T1_JSS_dur
F_MSD_BEN_T1_JSS_dur
P_MSD_BEN_T1_JSS_cnt
F_MSD_BEN_T1_JSS_cnt

If you are interested in benefits in general then you can also create P_MSD_BEN_T1_dur and so on.

Full details can be found in Appendix B of the Social Housing Technical Report.

Outcome names can vary but tend to be more intuitive than the service metric names e.g disability_needs_flag.

How can the data foundation be used?

The SIA has used the data foundation in the following examples:

Social housing test case
Mental health test case
Vulnerable mothers work
Outcomes framework

It is recommended that you look at these to see the data foundation in action.

[Note do this in the sial_repo_template_workshop Show in Github the submodules for some of these repositories and explain how we need submodules so we don’t have multiple copies of the data foundation that all need to be updated manually.]

Each of these is tweaked slightly to suit the analysis. For instance, the vulnerable mothers work has two main scripts – two create two analysis tables via the data foundation; one for the children and one for the mothers.

Running the SIDF (full version on README)

If you do not have a copy of the SIDF in your IDI project folder refer to the README for installation instructions Break into three groups and go to the readme and follow the example

Warning: Do not have more than 3 groups running the SIDF. We have encountered severe performance issues when we run such workshops with multiple instances.

Comments

si_get_cohort_example.sas

Note the datetime format of the as_at_date.

si_control_example_pd.sas

Step through each of the variables in the control file.

L specifies all our costs are lump sum – if you have inconsistent tables you would need a manual work around.

Draw a diagram of the as at date and the periods before and after and explain the concept of a year using the social housing propensity matching as an example.

Quickly go over the price adjustment columns and refer them to social housing if they wish to see inflation or discounting.

Note that the flags for the agencies run all the tables e.g. si_use_moh will build all the moh variables for all the tables.

Then talk about the measures we have in particular the two types of counts and days since variables.

The rest of the script creates these as sas macro variables that can be used by other parts of the SAS script. It starts off as a long file because it is easier for the viewer to read but needs to be transposed to a wide table so that it is easier to do the macro variable creation.

si_main_example_pd.sas Explain that si_get_characteristics is the slow changing characteristics. If you want to add anything to the characteristics like filter our those who are attached to the spine, do that in si_get_characteristics_extension

Explain why we write to the database – in database processing is more efficient when it comes to joining to the SIAL tables.

The macro %si_wrapper_sial_rollup is where the service metrics are constructed.

The macro %si_get_outcomes_ext is where all the outcome variables are constructed

Open the SAS server and have a look in the work library. You will see a series of tables that have been created. The ones that end in _RLPW (rollup wide) create a wide table with the service metrics. To create such a table you need to ensure your column names do not exceed 28 characters (SAS limit is 32 but we suffix _cst _dur etc to the variables. The easiest way to ensure this happens is to not have any rollups more granular than the subject area.

If you need to rollup at the event level and the event_type name is long then use the si_rollup_output_type variable to Long. This creates a long table with the suffix _RLPL. It would be up to the user to rename these and transpose to a wide table.

Open up the table and have a look at the column names. They match the naming convention explained earlier. Note (P1 is one year before the intervention date, P2 is two years before, F0 is the intervention date, F1 is one year after the date and so on.

SIAL_qualifcations is an outcome table.

It is up to you to decide what variables you want and to join all the tables together to create a final dataset.