generated from databricks-industry-solutions/industry-solutions-blueprints
-
Notifications
You must be signed in to change notification settings - Fork 12
/
6-drug-analysis.py
377 lines (333 loc) · 15.2 KB
/
6-drug-analysis.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
# Databricks notebook source
# MAGIC %md
# MAGIC You may find this series of notebooks at https://github.com/databricks-industry-solutions/omop-cdm. For more information about this solution accelerator, visit https://www.databricks.com/blog/2021/07/19/unlocking-the-power-of-health-data-with-a-modern-data-lakehouse.html.
# COMMAND ----------
# MAGIC %md
# MAGIC # Drug Usage Trends
# MAGIC <img src="https://upload.wikimedia.org/wikipedia/commons/7/76/Medicine_Drugs.svg" width=300>
# MAGIC
# MAGIC In this notebook we provide examples for exploring trends in drug usage within our data. For example, how different segments of the population use a particular drug,
# MAGIC or what are the drugs that are most commonly prescribed together. We also take a deeper look into drug combinations: For example, pairwise correlation between prescription drugs and most frequent drug prescription patterns.
# COMMAND ----------
# MAGIC %sql
# MAGIC USE omop531;
# COMMAND ----------
# MAGIC %md
# MAGIC ## Example 1. Drug usage by different population segments
# MAGIC In the following example we are looking at trends in usage of a given drug over time. Specifically how different population segments
# MAGIC are using a given drug. In this example we stratify results by age and gender.
# COMMAND ----------
# DBTITLE 1,Usage of Oxycodone stratified by gender
# MAGIC %sql
# MAGIC SELECT
# MAGIC tt.drug_concept_name,
# MAGIC tt.drug_concept_id,
# MAGIC count(1) as s_count,
# MAGIC tt.age_band,
# MAGIC tt.year_of_Era,
# MAGIC tt.gender
# MAGIC from
# MAGIC (
# MAGIC SELECT
# MAGIC floor(
# MAGIC (
# MAGIC extract(
# MAGIC year
# MAGIC from
# MAGIC t.drug_era_start_date
# MAGIC ) - p.year_of_birth
# MAGIC ) / 10
# MAGIC ) as age_band,
# MAGIC extract(
# MAGIC year
# MAGIC from
# MAGIC t.drug_era_start_date
# MAGIC ) as year_of_era,
# MAGIC p.gender_concept_id,
# MAGIC t.drug_concept_id,
# MAGIC c.concept_name as drug_concept_name,
# MAGIC c1.concept_name as gender
# MAGIC FROM
# MAGIC drug_era t,
# MAGIC person p,
# MAGIC concept c,
# MAGIC concept c1
# MAGIC where
# MAGIC t.person_id = p.person_id
# MAGIC and c.concept_id = t.drug_concept_id
# MAGIC and c1.concept_id = p.gender_concept_id
# MAGIC and c.concept_id = 1124957 -- Oxycodone
# MAGIC ) tt
# MAGIC where tt.year_of_era > 2000
# MAGIC group by
# MAGIC tt.age_band,
# MAGIC tt.year_of_Era,
# MAGIC tt.gender,
# MAGIC tt.drug_concept_name,
# MAGIC tt.drug_concept_id
# MAGIC order by
# MAGIC tt.age_band,
# MAGIC tt.year_of_Era,
# MAGIC tt.drug_concept_id;
# COMMAND ----------
# MAGIC %md
# MAGIC ## Example 2. top drugs prescribed in the past 5 years
# MAGIC Now let's take a look at the top drugs prescribed in the past five years.
# COMMAND ----------
# MAGIC %sql
# MAGIC with tt as
# MAGIC (
# MAGIC SELECT
# MAGIC extract(year from t.drug_era_start_date) as year_of_era,
# MAGIC t.drug_concept_id,
# MAGIC c.concept_name as drug_concept_name
# MAGIC FROM
# MAGIC drug_era t,
# MAGIC concept c
# MAGIC where
# MAGIC t.drug_concept_id = c.concept_id
# MAGIC )
# MAGIC SELECT
# MAGIC tt.drug_concept_name,
# MAGIC tt.drug_concept_id,
# MAGIC tt.year_of_Era,
# MAGIC count(1) as drug_count
# MAGIC from
# MAGIC tt
# MAGIC where year(current_date) - tt.year_of_Era <= 5
# MAGIC group by
# MAGIC tt.drug_concept_name,
# MAGIC tt.drug_concept_name,
# MAGIC tt.year_of_Era,
# MAGIC tt.drug_concept_id
# MAGIC order by
# MAGIC tt.year_of_Era, count(1) DESC
# MAGIC
# COMMAND ----------
# MAGIC %md
# MAGIC for example we notice a consistent trend in reduction of the number of instances where Oxycodone is being prescribed.
# COMMAND ----------
# MAGIC %md
# MAGIC ## Example 3. Comparing age distribution between two groups
# MAGIC Now let's take a look at the distribution of ages between two drugs of interest, in this case Hydrocodone and Oxycodone.
# COMMAND ----------
# MAGIC %python
# MAGIC d1_concept_id=1124957
# MAGIC d2_concept_id=1174888
# MAGIC import plotly.express as px
# MAGIC query=f"""
# MAGIC SELECT
# MAGIC extract(year from (min(t.drug_era_start_date) over(partition by t.person_id, t.drug_concept_id) )) - p.year_of_birth as age,
# MAGIC c.concept_name d_concept_name
# MAGIC FROM drug_era t, person p, concept c
# MAGIC WHERE t.person_id = p.person_id
# MAGIC AND t.drug_concept_id = c.concept_id
# MAGIC AND t.drug_concept_id in ({d1_concept_id},{d2_concept_id})
# MAGIC """
# MAGIC
# MAGIC df = sql(query).groupBy('age','d_concept_name').count().toPandas()
# MAGIC fig = px.histogram(df, x="age", y="count", color="d_concept_name",
# MAGIC marginal="box", # or violin, rug
# MAGIC hover_data=df.columns)
# MAGIC fig.update_layout(
# MAGIC autosize=False,
# MAGIC width=700,
# MAGIC height=700,
# MAGIC margin=dict(
# MAGIC l=50,
# MAGIC r=50,
# MAGIC b=100,
# MAGIC t=100,
# MAGIC pad=4
# MAGIC ),
# MAGIC paper_bgcolor="LightSteelBlue",
# MAGIC )
# MAGIC fig.show()
# COMMAND ----------
# MAGIC %md
# MAGIC ## Example 4. Finding drugs that are commonly prescribed together
# MAGIC In this example, let's take a look at the association between drugs. For example, in an study of prescription drugs, we might be interested in learning which drugs are more likely prescribed together compared to what would be expected at random. For this analysis, we use two approaches: first we apply a simple probabilistic approach, which uses the [cosine measure](https://michael.hahsler.net/research/association_rules/measures.html#support) to quantify the degree of association between a pair of events (in this case, `drug1` and `drug2` being used during the same drug era). Next, we use the FP-Growth algorithm from [spark mlLib](https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html#frequent-pattern-mining) to identify frequent patterns for prescription drugs. This is a similar approach to market basket analysis (or affinity analytics).
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC ### 4.1 Pairwise association
# MAGIC First we need to define a criteria for considering two drugs being prescribed together.
# MAGIC For simplicity, let's first define a time interval \\(\Delta\\) and consider two drugs being prescribed together if:
# MAGIC $$ |t_1-t_2| < \Delta$$ and $$|s_1-s_2|<\Delta$$
# MAGIC in which \\(t_i,s_i\\) are the start and end times for taking drug \\(i\\) by a patient respectively. Using this criteria, we can then compile a list of drug pairs that have been associated with each other based on being prescribed within the same time period.
# MAGIC First let's create a dataset of drug pairs, paired according to the above criteria, and relevant summary statistics such as [support](https://michael.hahsler.net/research/association_rules/measures.html#support) that we then will use to calculate the cosine metrics for association.
# COMMAND ----------
# MAGIC %sql
# MAGIC CREATE OR REPLACE TEMPORARY VIEW COPRESCRIBED AS (
# MAGIC with drugs_and_names as (
# MAGIC select PERSON_ID,c.concept_name as drug_concept_name, DRUG_CONCEPT_ID,DRUG_ERA_START_DATE,DRUG_ERA_END_DATE
# MAGIC from drug_era
# MAGIC join concept c
# MAGIC where DRUG_CONCEPT_ID=c.concept_id
# MAGIC ),
# MAGIC drugs as (
# MAGIC select d1.PERSON_ID,
# MAGIC d1.DRUG_CONCEPT_ID as drug_id1,
# MAGIC d2.DRUG_CONCEPT_ID as drug_id2,
# MAGIC d1.drug_concept_name as drug1,
# MAGIC d2.drug_concept_name as drug2,
# MAGIC d1.DRUG_ERA_START_DATE as d1_start,
# MAGIC d1.DRUG_ERA_END_DATE as d1_end,
# MAGIC d2.DRUG_ERA_START_DATE as d2_start,
# MAGIC d2.DRUG_ERA_END_DATE as d2_end
# MAGIC from drugs_and_names d1
# MAGIC join drugs_and_names d2
# MAGIC where
# MAGIC d1.PERSON_ID=d2.PERSON_ID
# MAGIC AND
# MAGIC ABS(datediff(d1.DRUG_ERA_START_DATE,d2.DRUG_ERA_START_DATE)) <=5 -- drugs that have been taken during the same time period +/-5 days
# MAGIC AND
# MAGIC ABS(datediff(d1.DRUG_ERA_END_DATE, d2.DRUG_ERA_END_DATE)) <=5
# MAGIC ),
# MAGIC counts as (
# MAGIC select drug1,drug2, count(*) as count_d1d2
# MAGIC from drugs
# MAGIC group by 1,2
# MAGIC order by 3 desc
# MAGIC ),
# MAGIC pairwise_sums as (
# MAGIC select drug1, drug2, count_d1d2,
# MAGIC sum(count_d1d2) over (partition by drug1) as sum_d1,
# MAGIC sum(count_d1d2) over (partition by drug2) as sum_d2,
# MAGIC sum(count_d1d2) over () as sum_all
# MAGIC from counts
# MAGIC )
# MAGIC select drug1,drug2, count_d1d2,
# MAGIC count_d1d2/sum_d1 as supp_d1d2,
# MAGIC count_d1d2/sum_d2 as supp_d2d1,
# MAGIC sum_d1/sum_all as supp_d1,
# MAGIC sum_d2/sum_all as supp_d2
# MAGIC from pairwise_sums
# MAGIC )
# MAGIC -- limit 100
# COMMAND ----------
# MAGIC %md
# MAGIC let's take a look at this dataset:
# COMMAND ----------
# MAGIC %sql
# MAGIC select * from COPRESCRIBED
# COMMAND ----------
# MAGIC %python
# MAGIC import pandas as pd
# MAGIC import plotly.express as px
# MAGIC
# MAGIC df = sql("""
# MAGIC select drug1,drug2, supp_d1d2, log(supp_d1d2/sqrt(supp_d1*supp_d2)) as lod_cosine_d1d2, log(supp_d1d2)-log(supp_d1) as lod_d1d2
# MAGIC from COPRESCRIBED order by lod_cosine_d1d2 desc
# MAGIC """)
# MAGIC
# MAGIC pdf=df.toPandas()
# MAGIC pdf_matx=pd.crosstab(pdf.drug1,pdf.drug2,values=pdf.lod_cosine_d1d2,aggfunc='sum')
# MAGIC fig = px.imshow(pdf_matx)
# MAGIC fig.update_layout(
# MAGIC autosize=False,
# MAGIC width=700,
# MAGIC height=700,
# MAGIC margin=dict(
# MAGIC l=50,
# MAGIC r=50,
# MAGIC b=100,
# MAGIC t=100,
# MAGIC pad=4
# MAGIC ),
# MAGIC paper_bgcolor="LightSteelBlue",
# MAGIC )
# MAGIC
# MAGIC fig.show()
# COMMAND ----------
# MAGIC %sql
# MAGIC with drugs_and_names as (
# MAGIC select PERSON_ID,c.concept_name as drug_concept_name, DRUG_CONCEPT_ID,DRUG_ERA_START_DATE,DRUG_ERA_END_DATE
# MAGIC from drug_era
# MAGIC join concept c
# MAGIC where DRUG_CONCEPT_ID=c.concept_id
# MAGIC ),
# MAGIC drugs as (
# MAGIC select de1.PERSON_ID,
# MAGIC de1.DRUG_CONCEPT_ID as drug_id1,
# MAGIC de2.DRUG_CONCEPT_ID as drug_id2,
# MAGIC de1.drug_concept_name as drug1,
# MAGIC de2.drug_concept_name as drug2,
# MAGIC de1.DRUG_ERA_START_DATE as d1_start,
# MAGIC de1.DRUG_ERA_END_DATE as d1_end,
# MAGIC de2.DRUG_ERA_START_DATE as d2_start,
# MAGIC de2.DRUG_ERA_END_DATE as d2_end
# MAGIC from drugs_and_names de1
# MAGIC join drugs_and_names de2
# MAGIC where
# MAGIC de1.PERSON_ID=de2.PERSON_ID
# MAGIC AND
# MAGIC ABS(datediff(de1.DRUG_ERA_START_DATE,de2.DRUG_ERA_START_DATE)) <=5 -- drugs that have been taken during the same time period +/-5 days
# MAGIC AND
# MAGIC ABS(datediff(de1.DRUG_ERA_END_DATE, de2.DRUG_ERA_END_DATE)) <=5
# MAGIC ),
# MAGIC counts as (
# MAGIC select drug1,drug2, count(*) as count
# MAGIC from drugs
# MAGIC group by 1,2
# MAGIC order by 3 desc
# MAGIC )
# MAGIC select *, unix_timestamp('DRUG_ERA_START_DATE') from drugs_and_names
# COMMAND ----------
# MAGIC %md
# MAGIC ### 4.2 Frequent Pattern Mining
# MAGIC Now let's take a deeper look into frequent patterns. We use the FP-Growth algorithm from [spark mlLib](https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html#frequent-pattern-mining) to identify frequent patterns for prescription drugs. This is a similar approach to market basket analysis (or affinity analytics), in which items linked to a given transaction are analyzed to identify affinity patterns. In our case, a transaction corresponds to a given drug era record and we consider two items (drugs in this case) to have co-occurred together based on our definition provided in the first part of this example.
# MAGIC In the following cell, we switch to python to create a dataset which each row of the dataset corresponds to drugs that have co-occurred within the same time interval
# MAGIC \\( t+\Delta; \Delta=\pm 5 \\) days.
# COMMAND ----------
# MAGIC %py
# MAGIC from pyspark.sql.functions import *
# MAGIC drugs_and_names=sql("""
# MAGIC
# MAGIC select PERSON_ID,c.concept_name as drug_concept_name, DRUG_CONCEPT_ID,DRUG_ERA_START_DATE,DRUG_ERA_END_DATE
# MAGIC from drug_era
# MAGIC join concept c
# MAGIC where DRUG_CONCEPT_ID=c.concept_id
# MAGIC """)
# MAGIC
# MAGIC def get_date_key(d,delta_days=5):
# MAGIC days_to_sec=delta_days*24*60*60
# MAGIC t1=days_to_sec*((d+days_to_sec)//days_to_sec)
# MAGIC t2=days_to_sec*((d-days_to_sec)//days_to_sec)
# MAGIC return(f"{t1}:{t2}")
# MAGIC
# MAGIC get_key_udf = udf(lambda d:get_date_key(d))
# MAGIC
# MAGIC df=drugs_and_names\
# MAGIC .withColumn('start_key',get_key_udf(unix_timestamp('DRUG_ERA_START_DATE')))\
# MAGIC .withColumn('end_key',get_key_udf(unix_timestamp('DRUG_ERA_END_DATE')))\
# MAGIC .orderBy('DRUG_ERA_START_DATE')\
# MAGIC .groupBy(['PERSON_ID','start_key','end_key'])\
# MAGIC .agg(collect_set('drug_concept_name').alias('drug_items'))\
# MAGIC .select(udf(lambda x: ",".join(x))(col('drug_items')).alias('drug_items'))
# COMMAND ----------
# MAGIC %md
# MAGIC now we use FPGrowth algorithm from `spark.mllib` to find the most frequent patterns. Note that we are using `rdd` api to feed data to the algorithm. In addition to the dataset
# MAGIC we also need to specify a minimum threshold for a particular pattern to be considered significant.
# COMMAND ----------
# MAGIC %py
# MAGIC from pyspark.mllib.fpm import FPGrowth
# MAGIC
# MAGIC transactions = df.rdd.map(lambda row: row['drug_items'].strip().split(','))
# MAGIC model = FPGrowth.train(transactions, minSupport=0.02, numPartitions=10)
# MAGIC result = model.freqItemsets().collect()
# COMMAND ----------
# MAGIC %py
# MAGIC display(spark.createDataFrame(result).orderBy(col('freq').desc()))
# COMMAND ----------
# MAGIC %md
# MAGIC Copyright / License info of the notebook. Copyright Databricks, Inc. [2021]. The source in this notebook is provided subject to the [Databricks License](https://databricks.com/db-license-source). All included or referenced third party libraries are subject to the licenses set forth below.
# MAGIC
# MAGIC |Library Name|Library License|Library License URL|Library Source URL|
# MAGIC | :-: | :-:| :-: | :-:|
# MAGIC |Smolder |Apache-2.0 License| https://github.com/databrickslabs/smolder | https://github.com/databrickslabs/smolder/blob/master/LICENSE|
# MAGIC |Synthea|Apache License 2.0|https://github.com/synthetichealth/synthea/blob/master/LICENSE| https://github.com/synthetichealth/synthea|
# MAGIC | OHDSI/CommonDataModel| Apache License 2.0 | https://github.com/OHDSI/CommonDataModel/blob/master/LICENSE | https://github.com/OHDSI/CommonDataModel |
# MAGIC | OHDSI/ETL-Synthea| Apache License 2.0 | https://github.com/OHDSI/ETL-Synthea/blob/master/LICENSE | https://github.com/OHDSI/ETL-Synthea |
# MAGIC |OHDSI/OMOP-Queries|||https://github.com/OHDSI/OMOP-Queries|
# MAGIC |The Book of OHDSI | Creative Commons Zero v1.0 Universal license.|https://ohdsi.github.io/TheBookOfOhdsi/index.html#license|https://ohdsi.github.io/TheBookOfOhdsi/|