-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
use spc python package to prepare synthetic pop
- Loading branch information
1 parent
f7e2b9b
commit 520176f
Showing
4 changed files
with
2,021 additions
and
24 deletions.
There are no files selected for viewing
This file was deleted.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,347 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"#import polars as pl" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We will use the spc package for our synthetic population. To add it as a dependancy in this virtual environment, I ran `poetry add git+https://github.com/alan-turing-institute/uatk-spc.git@55-output-formats-python#subdirectory=python`. The branch may change if the python package is merged into the main spc branch. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"#https://github.com/alan-turing-institute/uatk-spc/blob/55-output-formats-python/python/examples/spc_builder_example.ipynb\n", | ||
"from uatk_spc.builder import Builder" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Loading in the SPC synthetic population\n", | ||
"\n", | ||
"I use the code in the `Quickstart` [here](https://github.com/alan-turing-institute/uatk-spc/blob/55-output-formats-python/python/README.md) to get a parquet file and convert it to JSON. \n", | ||
"\n", | ||
"You have two options:\n", | ||
"\n", | ||
"\n", | ||
"1- Slow and memory-hungry: Download the pbf file directly from [here](https://alan-turing-institute.github.io/uatk-spc/using_england_outputs.html) and load in the pbf file with the python package\n", | ||
"\n", | ||
"2- Faster: Covert the pbf file to parquet, and then load it using the python package. To convert to parquet, you need to:\n", | ||
"\n", | ||
"a. clone the [uatk-spc](https://github.com/alan-turing-institute/uatk-spc/tree/main/docs) \n", | ||
" \n", | ||
"b. Run `cargo run --release -- --rng-seed 0 --flat-output config/England/west-yorkshire.txt --year 2020` and replace `west-yorkshire` and `2020` with your preferred option\n", | ||
" " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Pick a region with SPC output saved\n", | ||
"path = \"../data/spc_output/raw/\"\n", | ||
"region = \"west-yorkshire\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### People and household data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 41, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<div><style>\n", | ||
".dataframe > thead > tr,\n", | ||
".dataframe > tbody > tr {\n", | ||
" text-align: right;\n", | ||
" white-space: pre-wrap;\n", | ||
"}\n", | ||
"</style>\n", | ||
"<small>shape: (5, 36)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>id</th><th>household</th><th>workplace</th><th>location</th><th>orig_pid</th><th>id_tus_hh</th><th>id_tus_p</th><th>pid_hs</th><th>demographics</th><th>sic1d2007</th><th>sic2d2007</th><th>soc2010</th><th>pwkstat</th><th>salary_yearly</th><th>salary_hourly</th><th>bmi</th><th>has_cardiovascular_disease</th><th>has_diabetes</th><th>has_high_blood_pressure</th><th>number_medications</th><th>self_assessed_health</th><th>life_satisfaction</th><th>events</th><th>weekday_diaries</th><th>weekend_diaries</th><th>msoa</th><th>oa</th><th>members</th><th>hid</th><th>nssec8</th><th>accommodation_type</th><th>communal_type</th><th>num_rooms</th><th>central_heat</th><th>tenure</th><th>num_cars</th></tr><tr><td>u64</td><td>u64</td><td>u64</td><td>struct[2]</td><td>str</td><td>i64</td><td>i64</td><td>i64</td><td>struct[4]</td><td>str</td><td>u64</td><td>u64</td><td>i32</td><td>f32</td><td>f32</td><td>f32</td><td>bool</td><td>bool</td><td>bool</td><td>u64</td><td>i32</td><td>i32</td><td>struct[7]</td><td>list[u64]</td><td>list[u64]</td><td>str</td><td>str</td><td>list[u64]</td><td>str</td><td>i32</td><td>i32</td><td>i32</td><td>u64</td><td>bool</td><td>i32</td><td>u64</td></tr></thead><tbody><tr><td>0</td><td>0</td><td>null</td><td>{-1.789218,53.919151}</td><td>"E02002183_0001…</td><td>11291218</td><td>1</td><td>2905399</td><td>{1,86,1,1}</td><td>"J"</td><td>58</td><td>1115</td><td>6</td><td>null</td><td>null</td><td>24.879356</td><td>false</td><td>false</td><td>false</td><td>null</td><td>3</td><td>2</td><td>{0.09,0.1134,2.9846e-31,1.2791e-31,0.000881,0.000377,0.10494}</td><td>[1583, 13161]</td><td>[1582, 13160]</td><td>"E02002183"</td><td>"E00053954"</td><td>[0]</td><td>"E02002183_0001…</td><td>1</td><td>1</td><td>null</td><td>2</td><td>true</td><td>2</td><td>2</td></tr><tr><td>1</td><td>1</td><td>null</td><td>{-1.826238,53.92028}</td><td>"E02002183_0002…</td><td>17291219</td><td>1</td><td>2905308</td><td>{1,74,3,1}</td><td>"C"</td><td>25</td><td>1121</td><td>6</td><td>null</td><td>null</td><td>27.491207</td><td>false</td><td>false</td><td>true</td><td>null</td><td>3</td><td>null</td><td>{0.239,0.30114,2.2734e-20,9.7432e-21,0.051032,0.021871,0.13662}</td><td>[2900, 4948, … 15793]</td><td>[2901, 4949, … 15792]</td><td>"E02002183"</td><td>"E00053953"</td><td>[1, 2]</td><td>"E02002183_0002…</td><td>1</td><td>3</td><td>null</td><td>6</td><td>true</td><td>2</td><td>2</td></tr><tr><td>2</td><td>1</td><td>null</td><td>{-1.826238,53.92028}</td><td>"E02002183_0002…</td><td>17070713</td><td>2</td><td>2907681</td><td>{2,68,1,2}</td><td>"P"</td><td>85</td><td>2311</td><td>6</td><td>null</td><td>null</td><td>17.310829</td><td>false</td><td>true</td><td>true</td><td>null</td><td>2</td><td>4</td><td>{0.239,0.17686,3.6288e-16,8.4672e-16,0.098134,0.228979,0.15741}</td><td>[3010, 6389, … 11598]</td><td>[3011, 6388, … 11599]</td><td>"E02002183"</td><td>"E00053953"</td><td>[1, 2]</td><td>"E02002183_0002…</td><td>1</td><td>3</td><td>null</td><td>6</td><td>true</td><td>2</td><td>2</td></tr><tr><td>3</td><td>2</td><td>56126</td><td>{-1.874994,53.942989}</td><td>"E02002183_0003…</td><td>20310313</td><td>1</td><td>2902817</td><td>{1,27,1,4}</td><td>"C"</td><td>31</td><td>3422</td><td>1</td><td>32857.859375</td><td>14.360952</td><td>20.852091</td><td>false</td><td>false</td><td>false</td><td>null</td><td>2</td><td>1</td><td>{0.233,0.14679,4.397019,1.884437,0.522664,0.223999,0.15741}</td><td>[366, 867, … 14534]</td><td>[365, 868, … 14533]</td><td>"E02002183"</td><td>"E00053689"</td><td>[3, 4]</td><td>"E02002183_0003…</td><td>4</td><td>3</td><td>null</td><td>6</td><td>true</td><td>2</td><td>1</td></tr><tr><td>4</td><td>2</td><td>null</td><td>{-1.874994,53.942989}</td><td>"E02002183_0003…</td><td>13010909</td><td>3</td><td>2900884</td><td>{2,26,1,6}</td><td>"J"</td><td>62</td><td>7214</td><td>1</td><td>18162.451172</td><td>9.439944</td><td>20.032526</td><td>false</td><td>false</td><td>false</td><td>1</td><td>2</td><td>3</td><td>{0.233,0.08621,2.090329,4.877435,0.18608,0.434187,0.15741}</td><td>[1289, 12528, 12870]</td><td>[1288, 12529, 12871]</td><td>"E02002183"</td><td>"E00053689"</td><td>[3, 4]</td><td>"E02002183_0003…</td><td>4</td><td>3</td><td>null</td><td>6</td><td>true</td><td>2</td><td>1</td></tr></tbody></table></div>" | ||
], | ||
"text/plain": [ | ||
"shape: (5, 36)\n", | ||
"┌─────┬───────────┬───────────┬─────────────────┬───┬───────────┬──────────────┬────────┬──────────┐\n", | ||
"│ id ┆ household ┆ workplace ┆ location ┆ … ┆ num_rooms ┆ central_heat ┆ tenure ┆ num_cars │\n", | ||
"│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", | ||
"│ u64 ┆ u64 ┆ u64 ┆ struct[2] ┆ ┆ u64 ┆ bool ┆ i32 ┆ u64 │\n", | ||
"╞═════╪═══════════╪═══════════╪═════════════════╪═══╪═══════════╪══════════════╪════════╪══════════╡\n", | ||
"│ 0 ┆ 0 ┆ null ┆ {-1.789218,53.9 ┆ … ┆ 2 ┆ true ┆ 2 ┆ 2 │\n", | ||
"│ ┆ ┆ ┆ 19151} ┆ ┆ ┆ ┆ ┆ │\n", | ||
"│ 1 ┆ 1 ┆ null ┆ {-1.826238,53.9 ┆ … ┆ 6 ┆ true ┆ 2 ┆ 2 │\n", | ||
"│ ┆ ┆ ┆ 2028} ┆ ┆ ┆ ┆ ┆ │\n", | ||
"│ 2 ┆ 1 ┆ null ┆ {-1.826238,53.9 ┆ … ┆ 6 ┆ true ┆ 2 ┆ 2 │\n", | ||
"│ ┆ ┆ ┆ 2028} ┆ ┆ ┆ ┆ ┆ │\n", | ||
"│ 3 ┆ 2 ┆ 56126 ┆ {-1.874994,53.9 ┆ … ┆ 6 ┆ true ┆ 2 ┆ 1 │\n", | ||
"│ ┆ ┆ ┆ 42989} ┆ ┆ ┆ ┆ ┆ │\n", | ||
"│ 4 ┆ 2 ┆ null ┆ {-1.874994,53.9 ┆ … ┆ 6 ┆ true ┆ 2 ┆ 1 │\n", | ||
"│ ┆ ┆ ┆ 42989} ┆ ┆ ┆ ┆ ┆ │\n", | ||
"└─────┴───────────┴───────────┴─────────────────┴───┴───────────┴──────────────┴────────┴──────────┘" | ||
] | ||
}, | ||
"execution_count": 41, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"# add people and households\n", | ||
"spc_people_hh = (\n", | ||
" Builder(path, region, backend=\"polars\", input_type=\"parquet\")\n", | ||
" .add_households()\n", | ||
" .unnest([\"health\", \"employment\", \"details\"])\n", | ||
" .build()\n", | ||
")\n", | ||
"\n", | ||
"spc_people_hh.head()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 42, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# save the output\n", | ||
"spc_people_hh.write_parquet('../data/spc_output/' + region + '_people_hh.parquet')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### People and time-use data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 14, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"ename": "", | ||
"evalue": "", | ||
"output_type": "error", | ||
"traceback": [ | ||
"\u001b[1;31mThe Kernel crashed while executing code in the current cell or a previous cell. \n", | ||
"\u001b[1;31mPlease review the code in the cell(s) to identify a possible cause of the failure. \n", | ||
"\u001b[1;31mClick <a href='https://aka.ms/vscodeJupyterKernelCrash'>here</a> for more info. \n", | ||
"\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details." | ||
] | ||
} | ||
], | ||
"source": [ | ||
"\n", | ||
"# Subset of (non-time-use) features to include and unnest\n", | ||
"\n", | ||
"# The features can be found here: https://github.com/alan-turing-institute/uatk-spc/blob/main/synthpop.proto\n", | ||
"features = {\n", | ||
" \"health\": [\n", | ||
" \"bmi\",\n", | ||
" \"has_cardiovascular_disease\",\n", | ||
" \"has_diabetes\",\n", | ||
" \"has_high_blood_pressure\",\n", | ||
" \"self_assessed_health\",\n", | ||
" \"life_satisfaction\",\n", | ||
" ],\n", | ||
" \"demographics\": [\"age_years\",\n", | ||
" \"ethnicity\",\n", | ||
" \"sex\",\n", | ||
" \"nssec8\"\n", | ||
" ],\n", | ||
" \"employment\": [\"sic1d2007\",\n", | ||
" \"sic2d2007\",\n", | ||
" \"pwkstat\",\n", | ||
" \"salary_yearly\"\n", | ||
" ]\n", | ||
"\n", | ||
"}\n", | ||
"\n", | ||
"# build the table\n", | ||
"spc_people_tu = (\n", | ||
" Builder(path, region, backend=\"polars\", input_type=\"parquet\")\n", | ||
" .add_households()\n", | ||
" .add_time_use_diaries(features, diary_type=\"weekday_diaries\")\n", | ||
" .build()\n", | ||
")\n", | ||
"spc_people_tu.head()\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 13, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# save the output\n", | ||
"spc_people_tu.write_parquet('../data/spc_output/' + region + '_people_tu.parquet')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 19, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"['id',\n", | ||
" 'household',\n", | ||
" 'bmi',\n", | ||
" 'has_cardiovascular_disease',\n", | ||
" 'has_diabetes',\n", | ||
" 'has_high_blood_pressure',\n", | ||
" 'self_assessed_health',\n", | ||
" 'life_satisfaction',\n", | ||
" 'age_years',\n", | ||
" 'sex',\n", | ||
" 'nssec8',\n", | ||
" 'pwkstat',\n", | ||
" 'salary_yearly',\n", | ||
" 'weekday_diaries',\n", | ||
" 'uid',\n", | ||
" 'weekday',\n", | ||
" 'day_type',\n", | ||
" 'month',\n", | ||
" 'pworkhome',\n", | ||
" 'phomeother',\n", | ||
" 'pwork',\n", | ||
" 'pschool',\n", | ||
" 'pshop',\n", | ||
" 'pservices',\n", | ||
" 'pleisure',\n", | ||
" 'pescort',\n", | ||
" 'ptransport',\n", | ||
" 'phome_total',\n", | ||
" 'pnothome_total',\n", | ||
" 'punknown_total',\n", | ||
" 'pmwalk',\n", | ||
" 'pmcycle',\n", | ||
" 'pmprivate',\n", | ||
" 'pmpublic',\n", | ||
" 'pmunknown',\n", | ||
" 'age35g']" | ||
] | ||
}, | ||
"execution_count": 19, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"spc_people_tu.columns\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 22, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<div><style>\n", | ||
".dataframe > thead > tr,\n", | ||
".dataframe > tbody > tr {\n", | ||
" text-align: right;\n", | ||
" white-space: pre-wrap;\n", | ||
"}\n", | ||
"</style>\n", | ||
"<small>shape: (2_339_931,)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>health</th></tr><tr><td>struct[7]</td></tr></thead><tbody><tr><td>{24.879356,false,false,false,null,3,2}</td></tr><tr><td>{27.491207,false,false,true,null,3,null}</td></tr><tr><td>{17.310829,false,true,true,null,2,4}</td></tr><tr><td>{20.852091,false,false,false,null,2,1}</td></tr><tr><td>{20.032526,false,false,false,1,2,3}</td></tr><tr><td>{29.106817,false,false,true,null,1,3}</td></tr><tr><td>{25.621599,false,false,false,3,3,3}</td></tr><tr><td>{33.893459,true,false,true,3,1,3}</td></tr><tr><td>{null,false,false,false,null,1,null}</td></tr><tr><td>{24.492905,false,false,false,null,4,2}</td></tr><tr><td>{31.561234,true,false,true,4,2,4}</td></tr><tr><td>{28.171663,false,true,true,null,3,3}</td></tr><tr><td>…</td></tr><tr><td>{22.046501,false,false,false,2,1,3}</td></tr><tr><td>{14.627893,false,false,false,1,1,1}</td></tr><tr><td>{25.986469,false,false,false,0,1,null}</td></tr><tr><td>{23.44569,false,false,false,1,3,1}</td></tr><tr><td>{26.506229,false,false,true,null,3,3}</td></tr><tr><td>{25.481789,false,false,false,null,3,2}</td></tr><tr><td>{14.997225,false,false,false,2,3,2}</td></tr><tr><td>{22.199043,false,false,false,0,2,2}</td></tr><tr><td>{23.534786,false,false,false,null,3,2}</td></tr><tr><td>{18.523956,true,false,true,7,4,4}</td></tr><tr><td>{28.988529,false,false,false,null,1,3}</td></tr><tr><td>{18.38345,false,false,false,1,1,3}</td></tr></tbody></table></div>" | ||
], | ||
"text/plain": [ | ||
"shape: (2_339_931,)\n", | ||
"Series: 'health' [struct[7]]\n", | ||
"[\n", | ||
"\t{24.879356,false,false,false,null,3,2}\n", | ||
"\t{27.491207,false,false,true,null,3,null}\n", | ||
"\t{17.310829,false,true,true,null,2,4}\n", | ||
"\t{20.852091,false,false,false,null,2,1}\n", | ||
"\t{20.032526,false,false,false,1,2,3}\n", | ||
"\t{29.106817,false,false,true,null,1,3}\n", | ||
"\t{25.621599,false,false,false,3,3,3}\n", | ||
"\t{33.893459,true,false,true,3,1,3}\n", | ||
"\t{null,false,false,false,null,1,null}\n", | ||
"\t{24.492905,false,false,false,null,4,2}\n", | ||
"\t{31.561234,true,false,true,4,2,4}\n", | ||
"\t{28.171663,false,true,true,null,3,3}\n", | ||
"\t…\n", | ||
"\t{27.222017,false,false,false,null,2,4}\n", | ||
"\t{22.046501,false,false,false,2,1,3}\n", | ||
"\t{14.627893,false,false,false,1,1,1}\n", | ||
"\t{25.986469,false,false,false,0,1,null}\n", | ||
"\t{23.44569,false,false,false,1,3,1}\n", | ||
"\t{26.506229,false,false,true,null,3,3}\n", | ||
"\t{25.481789,false,false,false,null,3,2}\n", | ||
"\t{14.997225,false,false,false,2,3,2}\n", | ||
"\t{22.199043,false,false,false,0,2,2}\n", | ||
"\t{23.534786,false,false,false,null,3,2}\n", | ||
"\t{18.523956,true,false,true,7,4,4}\n", | ||
"\t{28.988529,false,false,false,null,1,3}\n", | ||
"\t{18.38345,false,false,false,1,1,3}\n", | ||
"]" | ||
] | ||
}, | ||
"execution_count": 22, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"spc_people_hh['health']" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "acbm-7iKwKWLy-py3.10", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
Oops, something went wrong.