MII Projectathon - Data Selection, Extraction, Pseudonymisation and Bundling

This repository combines data selection and extraction, pseudonymisation, bundling of the extracted FHIR resources, and the creation of a DocumentReference plus either a Binary resource built from the bundle or the bundle itself, to be sent via the DSF.

Execute the Data Extraction Process

Step 1 - Get some test data

Test data is provided as part of this repository in the testdata folder.

Step 2 - Spin up a FHIR Server and load it with data

To spin up a FHIR server, clone the following repository: https://github.com/medizininformatik-initiative/fhir-server-examples. Navigate to the server/blaze folder, set the BASE_URL in the docker-compose file to "http://blaze:8080" and execute docker-compose -p mii-projectathon up -d. This will spin up a Blaze FHIR server and expose it on localhost on port 8082. You can access your FHIR server under http://localhost:8082/fhir/my-endpoint-here. To see all your currently loaded Patients, go to: http://localhost:8082/fhir/Patient

If you are using a HAPI FHIR server, the following environment variable must be set in the docker-compose file so that transaction bundles can be uploaded without being executed: HAPI_FHIR_ALLOWED_BUNDLE_TYPES: COLLECTION,DOCUMENT,MESSAGE,TRANSACTION,TRANSACTIONRESPONSE,BATCH,BATCHRESPONSE,HISTORY,SEARCHSET

Once the server is available (this can take a couple of seconds) and you get a response from the Patient URL above you can load your testdata.
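Waiting for the server can be automated. The following is a minimal sketch, not part of this repository: it polls the Patient URL from Step 2 until it answers. The function names, the retry count and the delay are assumptions chosen for illustration.

```python
import time
import urllib.request


def server_is_up(url, timeout=5.0):
    """Return True if a GET request on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError,
        # so refused connections and error statuses both land here.
        return False


def wait_for_server(url, probe=server_is_up, attempts=30, delay=2.0):
    """Poll `url` until `probe` succeeds or `attempts` tries are used up."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

Calling wait_for_server("http://localhost:8082/fhir/Patient") returns True as soon as the server responds. The probe parameter exists so the loop can be exercised without a running server.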

To load your test data, execute upload-testdata.sh, which will upload all the test data from the testdata folder to your FHIR server.

Once the script is finished access http://localhost:8082/fhir/Patient again to see if your resources have been loaded.
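Instead of inspecting the Patient page by eye, the number of loaded resources can be queried. This sketch relies on the standard FHIR search parameter _summary=count, which makes the server return a Bundle carrying only a total. The helper names are illustrative and not part of this repository.

```python
import json
import urllib.request


def count_url(base_url, resource_type):
    """Build a search URL that asks the server only for the match count."""
    return f"{base_url.rstrip('/')}/{resource_type}?_summary=count"


def total_from_bundle(bundle):
    """Read the `total` element from a FHIR searchset Bundle."""
    if bundle.get("resourceType") != "Bundle":
        raise ValueError("expected a FHIR Bundle")
    return int(bundle.get("total", 0))


def count_resources(base_url, resource_type):
    """Ask the FHIR server how many resources of a type it holds."""
    with urllib.request.urlopen(count_url(base_url, resource_type)) as response:
        return total_from_bundle(json.load(response))
```

With the Blaze setup from above, count_resources("http://localhost:8082/fhir", "Patient") should return a number greater than zero once the upload has finished.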

Step 3 - Create local version of config files

Create your own config files by running initialise-env-files.sh.

Step 4 - Configure the env files

Before you can run the data extraction you will need to change the .env file according to your requirements. For all configuration options see "Configuration Options" below.

If you are using the standard blaze server as described in Step 2 above you should set the env var: MII_DATA_EXTRACTION_FHIR_BASE_URL=http://blaze:8080/fhir

If you are using a self-signed certificate for your FHIR server, add the CA certificate to the "certs" folder and name it custom-ca-bundle.crt.

Optionally, you can also adjust the data extraction, pseudonymisation and bundling by editing data_extraction_config.json and psd_config.json. For more information see "Running each script individually" below.

Step 5 - Set rights

Execute set_rights.sh to set the folder permissions correctly.

Step 6 - Run the data extraction, pseudonymisation and bundling

In this repository, execute docker-compose -p mii-projectathon up. This runs the three Python scripts of this repository in sequence:

data-selection-and-extraction.py (selects and extracts the data according to the data_extraction_config.json - see section data-selection-and-extraction.py below)
pseudonymisation.py (pseudonymises selected resources according to the psd_config.json - see section pseudonymisation.py below)
build-transaction-bundle.py (bundles all extracted resources into one single transaction bundle, base64-encodes it and creates another transaction bundle which contains a DocumentReference FHIR resource and, depending on the configuration, EITHER a Binary FHIR resource which holds the base64-encoded bundle created before OR a bundle of all extracted resources)

Each script writes its output to a folder: (1) data-selection-and-extraction.py saves all extracted resources in extracted_resources, (2) pseudonymisation.py saves all pseudonymised resources in pseudonymised_resources, (3) build-transaction-bundle.py saves the extracted resources bundle, the base64 encoding of that bundle and the bundle that wraps the base64-encoded bundle in to_send.
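The bundling step can be pictured roughly as follows. This is a sketch under assumptions, not the code of build-transaction-bundle.py: the function names, the choice of PUT and POST entries, the content type and the use of masterIdentifier for the project ident are all illustrative.

```python
import base64
import json
import uuid


def to_transaction_bundle(resources):
    """Wrap resources in one transaction Bundle, one PUT entry per resource."""
    return {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [
            {
                "resource": res,
                "request": {"method": "PUT", "url": f"{res['resourceType']}/{res['id']}"},
            }
            for res in resources
        ],
    }


def encode_bundle(bundle):
    """Serialise a bundle to JSON and base64-encode the result."""
    return base64.b64encode(json.dumps(bundle).encode("utf-8")).decode("ascii")


def to_transfer_bundle(bundle, project_ident):
    """Build a second transaction Bundle holding a Binary and a DocumentReference."""
    binary_url = f"urn:uuid:{uuid.uuid4()}"
    binary = {
        "resourceType": "Binary",
        "contentType": "application/fhir+json",
        "data": encode_bundle(bundle),
    }
    doc_ref = {
        "resourceType": "DocumentReference",
        "status": "current",
        # Assumption: where the real script stores the project ident is not documented here.
        "masterIdentifier": {"value": project_ident},
        "content": [{"attachment": {"contentType": "application/fhir+json", "url": binary_url}}],
    }
    return {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [
            {"fullUrl": binary_url, "resource": binary, "request": {"method": "POST", "url": "Binary"}},
            {"resource": doc_ref, "request": {"method": "POST", "url": "DocumentReference"}},
        ],
    }
```

The DocumentReference points at the Binary through a urn:uuid placeholder, which a FHIR server resolves when it processes the transaction. Decoding Binary.data with base64.b64decode yields the inner bundle again.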

Running each script individually

data-selection-and-extraction.py

To see the available script arguments execute python3 data-selection-and-extraction.py -h

This script selects and extracts data from a FHIR server according to the data_extraction_config.json

field	description	possible values	default value
name	name of the data extraction - used to build the filename for this data extraction
query	query used to extract the data	any FHIR search string which returns only one resource type
output_file_path	path where the extracted resources are saved	any path to a folder - best left as extracted_resources
request_type	whether the data is extracted with a POST or a GET request	POST, GET	GET
cohort_extraction	specifies that a cohort should be extracted from this FHIR search extraction - contains two subfields cohort_id_field and cohort_id_prefix, both of which are required if cohort_extraction is used
cohort_id_field	the JSON path to the field which contains the patient IDs of this resource
cohort_id_prefix	the ID prefix which has to be removed to get the actual IDs
cohort_dependence	specifies that an extraction is cohort dependent - contains two subfields chunk_size and cohort_id_selection_field, both of which are required if cohort_dependence is used
chunk_size	size of the chunks into which the cohort IDs are split when extracting the resources for the cohort
cohort_id_selection_field	the FHIR search parameter used to select the cohort
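How the cohort fields in the table interact can be sketched as follows. This is an illustration, not the code of data-selection-and-extraction.py. It assumes that cohort_id_field is a dot-separated path such as "subject.reference" and that chunked IDs are appended to the query as a comma-separated list, which FHIR search treats as OR.

```python
def get_path(resource, dotted_path):
    """Follow a dot-separated path (assumed format) through nested dicts."""
    value = resource
    for key in dotted_path.split("."):
        value = value[key]
    return value


def cohort_ids(resources, cohort_id_field, cohort_id_prefix):
    """Collect the distinct patient IDs, with the prefix stripped."""
    ids = set()
    for resource in resources:
        ref = get_path(resource, cohort_id_field)
        if ref.startswith(cohort_id_prefix):
            ref = ref[len(cohort_id_prefix):]
        ids.add(ref)
    return sorted(ids)


def cohort_queries(query, ids, chunk_size, cohort_id_selection_field):
    """Build one search string per chunk of at most `chunk_size` IDs."""
    separator = "&" if "?" in query else "?"
    queries = []
    for start in range(0, len(ids), chunk_size):
        chunk = ",".join(ids[start:start + chunk_size])
        queries.append(f"{query}{separator}{cohort_id_selection_field}={chunk}")
    return queries
```

For example, three cohort IDs with chunk_size 2 and the selection field subject result in two requests, the first carrying two IDs and the second carrying one. Chunking keeps each request URL or body at a manageable size.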

pseudonymisation.py

To see the available script arguments execute python3 pseudonymisation.py -h

This script pseudonymises FHIR resources according to the psd_config.json

field	description	possible values
psd_name	name of the pseudonymisation - used to build the filename under which the pseudonymised data is saved and the filename from which the resources to be pseudonymised are loaded
input_file_path	path from which to load the resources to be pseudonymised	any path to a folder - best left as extracted_resources
psd_file_path	path where pseudonymised resources are to be saved - best left as pseudonymised_resources
select	List of paths to parts of a FHIR resource to be selected into the new pseudonymised resources e.g. "id" - path logic see below this table
change_id	List of ids to replace - contains two subfields id_pool and path_to_id both of which are required if change_id is used
id_pool	the pool of pseudonyms the pseudonym belongs to - this ensures that the same ID is always replaced with the same pseudonymised ID
path_to_id	path to the field where to replace the id - path logic see below this table
apply_function	applies a pseudonymisation function to a field; a function has to be implemented in pseudonymisation.py to be available here - contains two subfields function_to_apply and path_to_field, both of which are required if apply_function is used
function_to_apply	specifies the function to apply to a field
path_to_field	path to the field which the function is to be applied to - path logic see below this table
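The id_pool behaviour described in the table can be pictured with a small sketch. It is not the implementation in pseudonymisation.py: the class name is hypothetical, and generating pseudonyms as random UUIDs is an assumption made for illustration.

```python
import uuid


class PseudonymPools:
    """Keep one original-ID to pseudonym mapping per named pool."""

    def __init__(self):
        self._pools = {}

    def pseudonym(self, id_pool, original_id):
        """Return the pseudonym for `original_id`, creating it on first use."""
        pool = self._pools.setdefault(id_pool, {})
        if original_id not in pool:
            # Assumption: a random UUID serves as the pseudonym.
            pool[original_id] = str(uuid.uuid4())
        return pool[original_id]
```

Because the mapping is kept per pool, a patient ID that appears in Patient.id and in the subject reference of another resource maps to the same pseudonym as long as both use the same id_pool, so references between pseudonymised resources stay intact.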

Path logic for pseudonymisation: <field_name or array>.<field_name or array>.<field_name or array>. An array can be addressed with an index, e.g. [0], or with [*] to apply to all entries in the array. Examples: "id", "resourceType", "diagnosis.[*].use", "serviceType", "diagnosis.[*].condition.reference", "subject.reference", "period"

Further, array entries can be selected based on an attribute value of the array entry objects (currently only string values are supported), according to this logic: [@<attributeName>='<stringValueOfAttribute>']. Example: "code.coding.[@system='http://fhir.de/CodeSystem/bfarm/icd-10-gm']"

Note - more complex paths are currently not supported
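To make the path logic concrete, here is a minimal sketch of a resolver that handles field names, an index, all entries of an array and the attribute filter. It is an illustration written for this README, not the code used by pseudonymisation.py, and the name select_values is hypothetical.

```python
import re

# A token is either a bracket expression or a plain field name.
TOKEN = re.compile(r"\[[^\]]*\]|[^.\[\]]+")
FILTER = re.compile(r"\[@(\w+)='(.*)'\]")


def select_values(resource, path):
    """Return all values in `resource` that `path` points to."""
    current = [resource]
    for token in TOKEN.findall(path):
        selected = []
        for node in current:
            if token in ("[*]", "[]"):
                # All entries of an array.
                if isinstance(node, list):
                    selected.extend(node)
            elif token.startswith("[@"):
                # Entries whose attribute equals the given string value.
                match = FILTER.fullmatch(token)
                if match is None:
                    raise ValueError(f"unsupported filter: {token}")
                attr, wanted = match.groups()
                if isinstance(node, list):
                    selected.extend(
                        e for e in node if isinstance(e, dict) and e.get(attr) == wanted
                    )
            elif token.startswith("["):
                # A single entry addressed by index, e.g. [0].
                index = int(token[1:-1])
                if isinstance(node, list) and index < len(node):
                    selected.append(node[index])
            elif isinstance(node, dict) and token in node:
                selected.append(node[token])
        current = selected
    return current
```

Tokenising with a regular expression rather than splitting on "." matters here, because the value inside an attribute filter (a code system URL, for instance) contains dots itself.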

available pseudonymisation functions: obfuscate_date_to_year, obfuscate_date_to_day
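The two function names above come from this repository, but their behaviour is not documented here. The following sketch shows one plausible reading, in which a FHIR date or dateTime string is truncated to its year or to its day. The function bodies are assumptions and may differ from the actual implementation in pseudonymisation.py.

```python
def obfuscate_date_to_year(value):
    """Keep only the year of a FHIR date/dateTime string (assumed behaviour)."""
    # FHIR dates start with a four-digit year: YYYY[-MM[-DD[Thh:mm:ss...]]]
    return value[:4]


def obfuscate_date_to_day(value):
    """Drop the time part and keep at most YYYY-MM-DD (assumed behaviour)."""
    return value[:10]
```

Under this reading, "2020-03-15T10:00:00+01:00" becomes "2020" or "2020-03-15" respectively, and a value that is already coarser than a day, such as "2020-03", passes through obfuscate_date_to_day unchanged.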

build-transaction-bundle.py

To see the available script arguments execute python3 build-transaction-bundle.py -h

Configuration Options

env var	description	default value
MII_DATA_EXTRACTION_FHIR_BASE_URL	Base URL of the local FHIR server (see the default value for an example)	http://fhir-server:8080/fhir
MII_DATA_EXTRACTION_FHIR_USER	Basic auth user for local FHIR server
MII_DATA_EXTRACTION_FHIR_PW	Basic auth password for local FHIR server
MII_DATA_EXTRACTION_FHIR_TOKEN	auth token for local FHIR server
MII_DATA_EXTRACTION_FHIR_PROXY_HTTP	HTTP url for proxy if used for local FHIR server
MII_DATA_EXTRACTION_FHIR_PROXY_HTTPS	HTTPS url for proxy if used for local FHIR server
MII_DATA_EXTRACTION_SHARE_FHIR_BASE_URL	Base URL of the local share FHIR server to which the transfer bundle is uploaded and to which the DSF connects (see the default value for an example)	http://fhir-server:8080/fhir
MII_DATA_EXTRACTION_SHARE_FHIR_USER	Basic auth user for share FHIR server
MII_DATA_EXTRACTION_SHARE_FHIR_PW	Basic auth password for share FHIR server
MII_DATA_EXTRACTION_SHARE_FHIR_TOKEN	auth token for share FHIR server
MII_DATA_EXTRACTION_SHARE_FHIR_PROXY_HTTP	HTTP url for proxy if used for share FHIR server
MII_DATA_EXTRACTION_SHARE_FHIR_PROXY_HTTPS	HTTPS url for proxy if used for share FHIR server
MII_DATA_EXTRACTION_PROJ_IDENT	Ident of the project you are executing these scripts for
MII_DATA_EXTRACTION_ORG_IDENT	DSF ident of your organization
MII_DATA_EXTRACTION_PSD_NAMES	prefix names of the files to be packaged into a bundle - should match the psd_name entries of the psd_config.json file for the pseudonymised resources to be bundled
MII_DATA_EXTRACTION_STORE_BUNDLE	whether to store the bundle directly on the fhir server, activate by setting env to "--storebundle"	None
MII_DATA_EXTRACTION_ENCB64	whether to encode the bundle as base64, activate by setting env to "--encb64"	None
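A script can pick up these settings roughly as shown below. This is a sketch, not the repository's code: the dataclass, the function names and the use of the token as a Bearer token are assumptions. The environment variable names and the default base URL are taken from the table above.

```python
import base64
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class FhirServerSettings:
    base_url: str
    user: Optional[str]
    password: Optional[str]
    token: Optional[str]


def load_settings(prefix, env=os.environ):
    """Read the settings of one FHIR server.

    `prefix` is MII_DATA_EXTRACTION_FHIR or MII_DATA_EXTRACTION_SHARE_FHIR.
    """
    return FhirServerSettings(
        base_url=env.get(f"{prefix}_BASE_URL", "http://fhir-server:8080/fhir"),
        user=env.get(f"{prefix}_USER") or None,
        password=env.get(f"{prefix}_PW") or None,
        token=env.get(f"{prefix}_TOKEN") or None,
    )


def auth_headers(settings):
    """Build an Authorization header from a token or from basic auth."""
    if settings.token:
        # Assumption: the token is sent as a Bearer token.
        return {"Authorization": f"Bearer {settings.token}"}
    if settings.user and settings.password:
        raw = f"{settings.user}:{settings.password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    return {}
```

Passing the environment as a parameter keeps the function easy to try out with a plain dictionary instead of real environment variables.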

Self-Signed Certificates

If you are using a self-signed certificate for your FHIR server, add the CA certificate to the "certs" folder and name it custom-ca-bundle.crt.
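When talking to such a server from your own Python code, the same certificate file can be added to the trusted CAs. The sketch below uses only the standard library; how the scripts in this repository load the file is not shown here, so treat the function name and the fallback behaviour as assumptions.

```python
import os
import ssl


def build_ssl_context(ca_file="certs/custom-ca-bundle.crt"):
    """Create a verifying TLS context that also trusts `ca_file` if present."""
    context = ssl.create_default_context()
    if os.path.isfile(ca_file):
        # Trust the custom CA in addition to the system defaults.
        context.load_verify_locations(cafile=ca_file)
    return context
```

The resulting context can be passed to urllib.request.urlopen through its context parameter. Certificate verification stays enabled in both cases, so a missing file leads to a verification error rather than to an unverified connection.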
