use spc python package to prepare synthetic pop
Hussein-Mahfouz committed Feb 15, 2024
1 parent f7e2b9b commit 520176f
Showing 4 changed files with 2,021 additions and 24 deletions.
18 changes: 0 additions & 18 deletions notebooks/dummy_notebook.ipynb

This file was deleted.

347 changes: 347 additions & 0 deletions notebooks/synthpop.ipynb
@@ -0,0 +1,347 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"#import polars as pl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the spc package for our synthetic population. To add it as a dependancy in this virtual environment, I ran `poetry add git+https://github.com/alan-turing-institute/uatk-spc.git@55-output-formats-python#subdirectory=python`. The branch may change if the python package is merged into the main spc branch. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"#https://github.com/alan-turing-institute/uatk-spc/blob/55-output-formats-python/python/examples/spc_builder_example.ipynb\n",
"from uatk_spc.builder import Builder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading in the SPC synthetic population\n",
"\n",
"I use the code in the `Quickstart` [here](https://github.com/alan-turing-institute/uatk-spc/blob/55-output-formats-python/python/README.md) to get a parquet file and convert it to JSON. \n",
"\n",
"You have two options:\n",
"\n",
"\n",
"1- Slow and memory-hungry: Download the pbf file directly from [here](https://alan-turing-institute.github.io/uatk-spc/using_england_outputs.html) and load in the pbf file with the python package\n",
"\n",
"2- Faster: Covert the pbf file to parquet, and then load it using the python package. To convert to parquet, you need to:\n",
"\n",
"a. clone the [uatk-spc](https://github.com/alan-turing-institute/uatk-spc/tree/main/docs) \n",
" \n",
"b. Run `cargo run --release -- --rng-seed 0 --flat-output config/England/west-yorkshire.txt --year 2020` and replace `west-yorkshire` and `2020` with your preferred option\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Pick a region with SPC output saved\n",
"path = \"../data/spc_output/raw/\"\n",
"region = \"west-yorkshire\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### People and household data"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div><style>\n",
".dataframe > thead > tr,\n",
".dataframe > tbody > tr {\n",
" text-align: right;\n",
" white-space: pre-wrap;\n",
"}\n",
"</style>\n",
"<small>shape: (5, 36)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>id</th><th>household</th><th>workplace</th><th>location</th><th>orig_pid</th><th>id_tus_hh</th><th>id_tus_p</th><th>pid_hs</th><th>demographics</th><th>sic1d2007</th><th>sic2d2007</th><th>soc2010</th><th>pwkstat</th><th>salary_yearly</th><th>salary_hourly</th><th>bmi</th><th>has_cardiovascular_disease</th><th>has_diabetes</th><th>has_high_blood_pressure</th><th>number_medications</th><th>self_assessed_health</th><th>life_satisfaction</th><th>events</th><th>weekday_diaries</th><th>weekend_diaries</th><th>msoa</th><th>oa</th><th>members</th><th>hid</th><th>nssec8</th><th>accommodation_type</th><th>communal_type</th><th>num_rooms</th><th>central_heat</th><th>tenure</th><th>num_cars</th></tr><tr><td>u64</td><td>u64</td><td>u64</td><td>struct[2]</td><td>str</td><td>i64</td><td>i64</td><td>i64</td><td>struct[4]</td><td>str</td><td>u64</td><td>u64</td><td>i32</td><td>f32</td><td>f32</td><td>f32</td><td>bool</td><td>bool</td><td>bool</td><td>u64</td><td>i32</td><td>i32</td><td>struct[7]</td><td>list[u64]</td><td>list[u64]</td><td>str</td><td>str</td><td>list[u64]</td><td>str</td><td>i32</td><td>i32</td><td>i32</td><td>u64</td><td>bool</td><td>i32</td><td>u64</td></tr></thead><tbody><tr><td>0</td><td>0</td><td>null</td><td>{-1.789218,53.919151}</td><td>&quot;E02002183_0001…</td><td>11291218</td><td>1</td><td>2905399</td><td>{1,86,1,1}</td><td>&quot;J&quot;</td><td>58</td><td>1115</td><td>6</td><td>null</td><td>null</td><td>24.879356</td><td>false</td><td>false</td><td>false</td><td>null</td><td>3</td><td>2</td><td>{0.09,0.1134,2.9846e-31,1.2791e-31,0.000881,0.000377,0.10494}</td><td>[1583, 13161]</td><td>[1582, 
13160]</td><td>&quot;E02002183&quot;</td><td>&quot;E00053954&quot;</td><td>[0]</td><td>&quot;E02002183_0001…</td><td>1</td><td>1</td><td>null</td><td>2</td><td>true</td><td>2</td><td>2</td></tr><tr><td>1</td><td>1</td><td>null</td><td>{-1.826238,53.92028}</td><td>&quot;E02002183_0002…</td><td>17291219</td><td>1</td><td>2905308</td><td>{1,74,3,1}</td><td>&quot;C&quot;</td><td>25</td><td>1121</td><td>6</td><td>null</td><td>null</td><td>27.491207</td><td>false</td><td>false</td><td>true</td><td>null</td><td>3</td><td>null</td><td>{0.239,0.30114,2.2734e-20,9.7432e-21,0.051032,0.021871,0.13662}</td><td>[2900, 4948, … 15793]</td><td>[2901, 4949, … 15792]</td><td>&quot;E02002183&quot;</td><td>&quot;E00053953&quot;</td><td>[1, 2]</td><td>&quot;E02002183_0002…</td><td>1</td><td>3</td><td>null</td><td>6</td><td>true</td><td>2</td><td>2</td></tr><tr><td>2</td><td>1</td><td>null</td><td>{-1.826238,53.92028}</td><td>&quot;E02002183_0002…</td><td>17070713</td><td>2</td><td>2907681</td><td>{2,68,1,2}</td><td>&quot;P&quot;</td><td>85</td><td>2311</td><td>6</td><td>null</td><td>null</td><td>17.310829</td><td>false</td><td>true</td><td>true</td><td>null</td><td>2</td><td>4</td><td>{0.239,0.17686,3.6288e-16,8.4672e-16,0.098134,0.228979,0.15741}</td><td>[3010, 6389, … 11598]</td><td>[3011, 6388, … 11599]</td><td>&quot;E02002183&quot;</td><td>&quot;E00053953&quot;</td><td>[1, 2]</td><td>&quot;E02002183_0002…</td><td>1</td><td>3</td><td>null</td><td>6</td><td>true</td><td>2</td><td>2</td></tr><tr><td>3</td><td>2</td><td>56126</td><td>{-1.874994,53.942989}</td><td>&quot;E02002183_0003…</td><td>20310313</td><td>1</td><td>2902817</td><td>{1,27,1,4}</td><td>&quot;C&quot;</td><td>31</td><td>3422</td><td>1</td><td>32857.859375</td><td>14.360952</td><td>20.852091</td><td>false</td><td>false</td><td>false</td><td>null</td><td>2</td><td>1</td><td>{0.233,0.14679,4.397019,1.884437,0.522664,0.223999,0.15741}</td><td>[366, 867, … 14534]</td><td>[365, 868, … 
14533]</td><td>&quot;E02002183&quot;</td><td>&quot;E00053689&quot;</td><td>[3, 4]</td><td>&quot;E02002183_0003…</td><td>4</td><td>3</td><td>null</td><td>6</td><td>true</td><td>2</td><td>1</td></tr><tr><td>4</td><td>2</td><td>null</td><td>{-1.874994,53.942989}</td><td>&quot;E02002183_0003…</td><td>13010909</td><td>3</td><td>2900884</td><td>{2,26,1,6}</td><td>&quot;J&quot;</td><td>62</td><td>7214</td><td>1</td><td>18162.451172</td><td>9.439944</td><td>20.032526</td><td>false</td><td>false</td><td>false</td><td>1</td><td>2</td><td>3</td><td>{0.233,0.08621,2.090329,4.877435,0.18608,0.434187,0.15741}</td><td>[1289, 12528, 12870]</td><td>[1288, 12529, 12871]</td><td>&quot;E02002183&quot;</td><td>&quot;E00053689&quot;</td><td>[3, 4]</td><td>&quot;E02002183_0003…</td><td>4</td><td>3</td><td>null</td><td>6</td><td>true</td><td>2</td><td>1</td></tr></tbody></table></div>"
],
"text/plain": [
"shape: (5, 36)\n",
"┌─────┬───────────┬───────────┬─────────────────┬───┬───────────┬──────────────┬────────┬──────────┐\n",
"│ id ┆ household ┆ workplace ┆ location ┆ … ┆ num_rooms ┆ central_heat ┆ tenure ┆ num_cars │\n",
"│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n",
"│ u64 ┆ u64 ┆ u64 ┆ struct[2] ┆ ┆ u64 ┆ bool ┆ i32 ┆ u64 │\n",
"╞═════╪═══════════╪═══════════╪═════════════════╪═══╪═══════════╪══════════════╪════════╪══════════╡\n",
"│ 0 ┆ 0 ┆ null ┆ {-1.789218,53.9 ┆ … ┆ 2 ┆ true ┆ 2 ┆ 2 │\n",
"│ ┆ ┆ ┆ 19151} ┆ ┆ ┆ ┆ ┆ │\n",
"│ 1 ┆ 1 ┆ null ┆ {-1.826238,53.9 ┆ … ┆ 6 ┆ true ┆ 2 ┆ 2 │\n",
"│ ┆ ┆ ┆ 2028} ┆ ┆ ┆ ┆ ┆ │\n",
"│ 2 ┆ 1 ┆ null ┆ {-1.826238,53.9 ┆ … ┆ 6 ┆ true ┆ 2 ┆ 2 │\n",
"│ ┆ ┆ ┆ 2028} ┆ ┆ ┆ ┆ ┆ │\n",
"│ 3 ┆ 2 ┆ 56126 ┆ {-1.874994,53.9 ┆ … ┆ 6 ┆ true ┆ 2 ┆ 1 │\n",
"│ ┆ ┆ ┆ 42989} ┆ ┆ ┆ ┆ ┆ │\n",
"│ 4 ┆ 2 ┆ null ┆ {-1.874994,53.9 ┆ … ┆ 6 ┆ true ┆ 2 ┆ 1 │\n",
"│ ┆ ┆ ┆ 42989} ┆ ┆ ┆ ┆ ┆ │\n",
"└─────┴───────────┴───────────┴─────────────────┴───┴───────────┴──────────────┴────────┴──────────┘"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# add people and households\n",
"spc_people_hh = (\n",
" Builder(path, region, backend=\"polars\", input_type=\"parquet\")\n",
" .add_households()\n",
" .unnest([\"health\", \"employment\", \"details\"])\n",
" .build()\n",
")\n",
"\n",
"spc_people_hh.head()"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# save the output\n",
"spc_people_hh.write_parquet('../data/spc_output/' + region + '_people_hh.parquet')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### People and time-use data"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"ename": "",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[1;31mThe Kernel crashed while executing code in the current cell or a previous cell. \n",
"\u001b[1;31mPlease review the code in the cell(s) to identify a possible cause of the failure. \n",
"\u001b[1;31mClick <a href='https://aka.ms/vscodeJupyterKernelCrash'>here</a> for more info. \n",
"\u001b[1;31mView Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
]
}
],
"source": [
"\n",
"# Subset of (non-time-use) features to include and unnest\n",
"\n",
"# The features can be found here: https://github.com/alan-turing-institute/uatk-spc/blob/main/synthpop.proto\n",
"features = {\n",
" \"health\": [\n",
" \"bmi\",\n",
" \"has_cardiovascular_disease\",\n",
" \"has_diabetes\",\n",
" \"has_high_blood_pressure\",\n",
" \"self_assessed_health\",\n",
" \"life_satisfaction\",\n",
" ],\n",
" \"demographics\": [\"age_years\",\n",
" \"ethnicity\",\n",
" \"sex\",\n",
" \"nssec8\"\n",
" ],\n",
" \"employment\": [\"sic1d2007\",\n",
" \"sic2d2007\",\n",
" \"pwkstat\",\n",
" \"salary_yearly\"\n",
" ]\n",
"\n",
"}\n",
"\n",
"# build the table\n",
"spc_people_tu = (\n",
" Builder(path, region, backend=\"polars\", input_type=\"parquet\")\n",
" .add_households()\n",
" .add_time_use_diaries(features, diary_type=\"weekday_diaries\")\n",
" .build()\n",
")\n",
"spc_people_tu.head()\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# save the output\n",
"spc_people_tu.write_parquet('../data/spc_output/' + region + '_people_tu.parquet')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['id',\n",
" 'household',\n",
" 'bmi',\n",
" 'has_cardiovascular_disease',\n",
" 'has_diabetes',\n",
" 'has_high_blood_pressure',\n",
" 'self_assessed_health',\n",
" 'life_satisfaction',\n",
" 'age_years',\n",
" 'sex',\n",
" 'nssec8',\n",
" 'pwkstat',\n",
" 'salary_yearly',\n",
" 'weekday_diaries',\n",
" 'uid',\n",
" 'weekday',\n",
" 'day_type',\n",
" 'month',\n",
" 'pworkhome',\n",
" 'phomeother',\n",
" 'pwork',\n",
" 'pschool',\n",
" 'pshop',\n",
" 'pservices',\n",
" 'pleisure',\n",
" 'pescort',\n",
" 'ptransport',\n",
" 'phome_total',\n",
" 'pnothome_total',\n",
" 'punknown_total',\n",
" 'pmwalk',\n",
" 'pmcycle',\n",
" 'pmprivate',\n",
" 'pmpublic',\n",
" 'pmunknown',\n",
" 'age35g']"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spc_people_tu.columns\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div><style>\n",
".dataframe > thead > tr,\n",
".dataframe > tbody > tr {\n",
" text-align: right;\n",
" white-space: pre-wrap;\n",
"}\n",
"</style>\n",
"<small>shape: (2_339_931,)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>health</th></tr><tr><td>struct[7]</td></tr></thead><tbody><tr><td>{24.879356,false,false,false,null,3,2}</td></tr><tr><td>{27.491207,false,false,true,null,3,null}</td></tr><tr><td>{17.310829,false,true,true,null,2,4}</td></tr><tr><td>{20.852091,false,false,false,null,2,1}</td></tr><tr><td>{20.032526,false,false,false,1,2,3}</td></tr><tr><td>{29.106817,false,false,true,null,1,3}</td></tr><tr><td>{25.621599,false,false,false,3,3,3}</td></tr><tr><td>{33.893459,true,false,true,3,1,3}</td></tr><tr><td>{null,false,false,false,null,1,null}</td></tr><tr><td>{24.492905,false,false,false,null,4,2}</td></tr><tr><td>{31.561234,true,false,true,4,2,4}</td></tr><tr><td>{28.171663,false,true,true,null,3,3}</td></tr><tr><td>&hellip;</td></tr><tr><td>{22.046501,false,false,false,2,1,3}</td></tr><tr><td>{14.627893,false,false,false,1,1,1}</td></tr><tr><td>{25.986469,false,false,false,0,1,null}</td></tr><tr><td>{23.44569,false,false,false,1,3,1}</td></tr><tr><td>{26.506229,false,false,true,null,3,3}</td></tr><tr><td>{25.481789,false,false,false,null,3,2}</td></tr><tr><td>{14.997225,false,false,false,2,3,2}</td></tr><tr><td>{22.199043,false,false,false,0,2,2}</td></tr><tr><td>{23.534786,false,false,false,null,3,2}</td></tr><tr><td>{18.523956,true,false,true,7,4,4}</td></tr><tr><td>{28.988529,false,false,false,null,1,3}</td></tr><tr><td>{18.38345,false,false,false,1,1,3}</td></tr></tbody></table></div>"
],
"text/plain": [
"shape: (2_339_931,)\n",
"Series: 'health' [struct[7]]\n",
"[\n",
"\t{24.879356,false,false,false,null,3,2}\n",
"\t{27.491207,false,false,true,null,3,null}\n",
"\t{17.310829,false,true,true,null,2,4}\n",
"\t{20.852091,false,false,false,null,2,1}\n",
"\t{20.032526,false,false,false,1,2,3}\n",
"\t{29.106817,false,false,true,null,1,3}\n",
"\t{25.621599,false,false,false,3,3,3}\n",
"\t{33.893459,true,false,true,3,1,3}\n",
"\t{null,false,false,false,null,1,null}\n",
"\t{24.492905,false,false,false,null,4,2}\n",
"\t{31.561234,true,false,true,4,2,4}\n",
"\t{28.171663,false,true,true,null,3,3}\n",
"\t\n",
"\t{27.222017,false,false,false,null,2,4}\n",
"\t{22.046501,false,false,false,2,1,3}\n",
"\t{14.627893,false,false,false,1,1,1}\n",
"\t{25.986469,false,false,false,0,1,null}\n",
"\t{23.44569,false,false,false,1,3,1}\n",
"\t{26.506229,false,false,true,null,3,3}\n",
"\t{25.481789,false,false,false,null,3,2}\n",
"\t{14.997225,false,false,false,2,3,2}\n",
"\t{22.199043,false,false,false,0,2,2}\n",
"\t{23.534786,false,false,false,null,3,2}\n",
"\t{18.523956,true,false,true,7,4,4}\n",
"\t{28.988529,false,false,false,null,1,3}\n",
"\t{18.38345,false,false,false,1,1,3}\n",
"]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spc_people_hh['health']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "acbm-7iKwKWLy-py3.10",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
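
The two save cells in `notebooks/synthpop.ipynb` build their output paths by concatenating strings with `region`. A minimal sketch of the same naming scheme using `pathlib` is given below; the helper name `output_path` and its parameters are hypothetical and are not part of the notebook or of the `uatk_spc` package.

```python
from pathlib import Path


# Hypothetical helper (not in the notebook or in uatk_spc): builds
# <base>/<region>_<suffix>.parquet, mirroring the notebook's
# '../data/spc_output/' + region + '_people_hh.parquet' concatenation.
def output_path(region: str, suffix: str, base: str = "../data/spc_output") -> Path:
    """Return the parquet output path for a region and table suffix."""
    return Path(base) / f"{region}_{suffix}.parquet"


# Same region and suffixes as used in the notebook's save cells.
people_hh_path = output_path("west-yorkshire", "people_hh")
people_tu_path = output_path("west-yorkshire", "people_tu")
```

Assuming the polars backend used in the notebook, the resulting `Path` could then be passed to `write_parquet` in place of the concatenated string.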