Synthetic Data Generator (SDG)

The Synthetic Data Generator (SDG) creates process-based data. These data model the treatment of breast cancer patients following the distribution of values in a real breast cancer patient population.

Parameters

number of patients - the number of patients that will be modeled
mutation probability - how likely it is for each datum to deviate from the treatment guidelines

Note

A mutation probability of 0.0 creates 'clean' data which complies with the treatment guidelines.
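To make the effect of the mutation probability concrete, here is a minimal sketch in Python. It is not taken from SDG.py: the function name deviation_flags, the seed parameter, and the assumption that each datum is drawn independently and that valid values lie between 0.0 and 1.0 are all illustrative.

```python
import random


def deviation_flags(n_values: int, mutation_prob: float, seed: int = 0) -> list[bool]:
    """Return one flag per datum; True marks a datum that deviates.

    Hypothetical helper for illustration only, not the SDG implementation.
    Assumes every datum is drawn independently with the same probability.
    """
    if not 0.0 <= mutation_prob <= 1.0:
        raise ValueError("mutation_prob must be between 0.0 and 1.0")
    rng = random.Random(seed)
    # rng.random() lies in [0.0, 1.0), so a probability of 0.0 never deviates
    # and a probability of 1.0 always deviates.
    return [rng.random() < mutation_prob for _ in range(n_values)]


flags = deviation_flags(10_000, 0.1)
print(sum(flags) / len(flags))  # share of deviating values, near mutation_prob
```

Under these assumptions, a mutation probability of 0.0 yields no deviating values at all, which matches the 'clean' data described in the note.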

Output Data Formats

On execution, SDG creates the same data set in three different formats.

CSV
RDF
SQL (MySQL 8.1 dump)

Data Generation

Requirements:

Docker

There are two options for generating the data: with or without docker-compose. With either option, the generated data set is written to ./data.

Option 1: With docker-compose

If you want to use the docker-compose option, run the following commands:

docker-compose up -d --build
docker exec -it SDG bash -c "SDG -n {patients} -p {mutation_prob}"
docker-compose down -v

where

{patients} is a placeholder for the number of patients
{mutation_prob} is a placeholder for the mutation probability

This option is recommended if several data sets will be generated. The SDG writes the resulting files under the same names on every execution, so move your generated data before creating another data set!
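When several data sets are generated in one docker-compose session, only the docker exec step is repeated with different parameters. The following Python sketch builds that command for a list of settings. The function name sdg_exec_command and the input checks are assumptions, and the -it flags from the command above are left out on the assumption that the script runs without an interactive terminal.

```python
import shlex


def sdg_exec_command(patients: int, mutation_prob: float) -> list[str]:
    """Build the docker exec call for one SDG run (hypothetical helper)."""
    # Assumed sanity checks; the SDG itself may accept or reject other values.
    if patients < 1:
        raise ValueError("patients must be a positive integer")
    if not 0.0 <= mutation_prob <= 1.0:
        raise ValueError("mutation_prob must be between 0.0 and 1.0")
    return [
        "docker", "exec", "SDG",
        "bash", "-c", f"SDG -n {patients} -p {mutation_prob}",
    ]


# Example settings, chosen only for illustration.
for patients, prob in [(100, 0.0), (100, 0.1)]:
    print(shlex.join(sdg_exec_command(patients, prob)))
```

Each printed line can be run between docker-compose up and docker-compose down, or the list can be passed to subprocess.run. Move the contents of ./data after each run.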

Option 2: Without docker-compose

If you do not want to use docker-compose, you can also execute:

./generate.sh {patients} {mutation_prob}

where

{patients} is a placeholder for the number of patients
{mutation_prob} is a placeholder for the mutation probability

Similar to option 1, generate.sh will build the Docker image, start a Docker container, execute the SDG, and stop and remove the Docker container again. You can use this option if you do not have docker-compose installed or if you want to generate only one data set.

As in option 1, the SDG writes the resulting files under the same names on every execution, so move your generated data before creating another data set!
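One way to avoid overwriting an earlier data set is to move the contents of ./data into a separate directory after each run. The sketch below is an assumption-laden example: the function name archive_data, the archive directory, and the timestamp label are not part of the SDG.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_data(data_dir="data", archive_root="archive", label=None) -> Path:
    """Move everything in data_dir to archive_root/label (hypothetical helper)."""
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"{src} does not exist")
    # Default label is a timestamp so that repeated runs do not collide.
    label = label or datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(archive_root) / label
    # exist_ok=False refuses to reuse a label, so nothing is overwritten.
    dest.mkdir(parents=True, exist_ok=False)
    for item in src.iterdir():
        shutil.move(str(item), str(dest / item.name))
    return dest
```

Calling archive_data(label="p100_m0.1") after a run would leave ./data empty for the next execution, assuming the current user has permission to move the generated files.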

Output Data Description

The SDG generates data that could be collected during the treatment process of breast cancer patients, including demographic, gynecologic, diagnostic, tumor-related, treatment, comorbidity, and family history data. To illustrate the output data, the following figure (er-diagram-generated-data.jpg in the repository) shows the Entity-Relationship diagram of the data generated when choosing the relational database as the output format. For readability, only the key attributes are included; the remaining attributes are described in the data dictionary below. The other output formats contain equivalent data in their respective formats.

Data dictionary:

patient
- ehr: INTEGER
- birth_date: DATE
- diagnosis_date: DATE
- age_at_diagnosis: INTEGER
- first_treatment_date: DATE
- surgery_date: DATE
- death_date: DATE / NULL (if the patient has not died)
- age_at_death: INTEGER / NULL (if the patient has not died)
- recurrence_year: INTEGER / NULL (if the patient has not relapsed)
- neoadjuvant: yes / no
- er_positive: 1 / 0
- pr_positive: 1 / 0
- her2_overall_positive: 1 / 0
- ki67_percent_max_simp: INTEGER (ranging from 0 to 100)
- menarche_age: INTEGER
- menopause_age: INTEGER
- pregnancy: INTEGER
- abort: INTEGER
- birth: INTEGER
- caesarean: INTEGER
tumor_tnm
- ehr: INTEGER
- n_tumor: INTEGER
- t_prefix_y: 0
- t_prefix: C / P
- t_category: IS / 0 / 1 / 2 / 3 / 4
- n_prefix_y: 0
- n_prefix: C / P
- n_category: 0 / 1 / 2 / 3
- n_subcategory: MI / NULL
- m_category: 0 / 1
- t_prefix_y_after_neoadj: 1
- t_prefix_after_neoadj: C / P / NULL (if not neoadjuvant)
- t_category_after_neoadj: IS / 0 / 1 / 2 / 3 / 4 / NULL (if not neoadjuvant)
- n_prefix_y_after_neoadj: 1
- n_prefix_after_neoadj: C / P / NULL (if not neoadjuvant)
- n_category_after_neoadj: 0 / 1 / 2 / 3 / NULL (if not neoadjuvant)
- n_subcategory_after_neoadj: MI / NULL
- m_category_after_neoadj: 0 / 1 / NULL (if not neoadjuvant)
- n_tumor_type: INTEGER
- n_tumor_grade: INTEGER
- stage_diagnosis: 0 / IA / IB / IIA / IIB / IIIA / IIIB / IIIC / IV
- stage_after_neo: 0 / IA / IB / IIA / IIB / IIIA / IIIB / IIIC / IV
tumor_type
- ehr: INTEGER
- n_tumor_type: INTEGER
- ductal: 1 / 0
- lobular: 1 / 0
- in_situ: 1 / 0
- invasive: 1 / 0
- associated_in_situ: 1 / 0
tumor_grade
- ehr: INTEGER
- n_tumor_grade: INTEGER
- grade: 1 / 2 / 3
drug
- id_drug: INTEGER
- name: STRING
chemoterapy_schema
- id_schema: INTEGER
- name: STRING
drug_chemoterapy_schema
- id_schema: INTEGER
- id_drug: INTEGER
chemoterapy_cycle
- ehr: INTEGER
- id_schema: INTEGER
- date: DATE
- cycle_number: INTEGER
surgery
- ehr: INTEGER
- surgery: STRING
- n_surgery: INTEGER
- date_year: INTEGER
- date_month: INTEGER
- date_day: INTEGER
radiotherapy
- ehr: INTEGER
- date_start: DATE
- date_end: DATE
- n_radiotherapy: INTEGER
- dose_gy: FLOAT
comorbidity
- id: INTEGER
- ehr: INTEGER
- comorbidity: STRING
- negated: 0 / 1
oral_drug
- ehr: INTEGER
- drug: STRING
oral_drug_type
- drug: STRING
- drug_type: STRING
family_history
- ehr: INTEGER
- cancer_cui: STRING
cui_description
- cui: STRING
- description: STRING
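
The column names in the data dictionary suggest how the tables link together. As a minimal sketch, the following Python code rebuilds three of the tables in an in-memory SQLite database and lists the drugs of one chemotherapy schema. The rows are placeholders, not SDG output, and reading drug_chemoterapy_schema as a link table between schemas and drugs is an assumption based on its column names. The generated SQL dump targets MySQL; SQLite is used here only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table and column names follow the data dictionary; the rows are made up.
conn.executescript("""
CREATE TABLE drug (id_drug INTEGER, name TEXT);
CREATE TABLE chemoterapy_schema (id_schema INTEGER, name TEXT);
CREATE TABLE drug_chemoterapy_schema (id_schema INTEGER, id_drug INTEGER);
INSERT INTO drug VALUES (1, 'drug A'), (2, 'drug B');
INSERT INTO chemoterapy_schema VALUES (10, 'schema X');
INSERT INTO drug_chemoterapy_schema VALUES (10, 1), (10, 2);
""")


def drugs_in_schema(conn: sqlite3.Connection, id_schema: int) -> list[str]:
    """Return the drug names linked to one schema (hypothetical helper)."""
    rows = conn.execute(
        """
        SELECT d.name
        FROM drug_chemoterapy_schema AS link
        JOIN drug AS d ON d.id_drug = link.id_drug
        WHERE link.id_schema = ?
        ORDER BY d.id_drug
        """,
        (id_schema,),
    ).fetchall()
    return [name for (name,) in rows]


print(drugs_in_schema(conn, 10))
```

The same pattern should carry over to the patient-level tables, which share the ehr column, if ehr identifies the patient as the dictionary implies.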
