hiv_indicators_prediction.py
# -*- coding: utf-8 -*-
"""HIV Indicators Prediction.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1z-y1SWQlNOvCg3BJL20OLc2I02k739-Z
# CIS 545 Project
By: Zihao Deng, Anjaly Nagarajan, Dung Than
December 15th, 2022
# Motivation and Project Overview
![image](https://drive.google.com/uc?export=view&id=1-WOa0GHEO2SDhtoPJyHmxAjoFTUx-Gs0)
HIV has long been a devastating and deadly disease that has taken the lives of millions of people all over the world. According to [WHO](https://www.who.int/news-room/fact-sheets/detail/hiv-aids#:~:text=HIV%20continues%20to%20be%20a,no%20cure%20for%20HIV%20infection.), by the end of 2021, around 38.4 million people in the world were living with HIV, two thirds of whom are in the WHO African Region. Even though there is no cure available for HIV at the moment, individuals and communities can take preventive action against the spread of HIV in a variety of ways. For this project, our group set out to determine the correlation between certain social indicators and the prevalence of HIV. Through our analysis, we hope to shed light on the effectiveness of certain measures in the fight against HIV across the world.
For this project, we analyze a dataset of over 89,000 entries covering almost 200 countries and regions from the 1960s to 2015, describing their performance on various socioeconomic, education, health, and poverty indices. We will focus on exploring the relationship between HIV prevalence and the aforementioned indicators in these countries using different regression and classification models such as linear regression, logistic regression, decision trees, and neural networks.
After performing EDA, we decided to focus on using a country's adolescent fertility rate, health expenditure, sanitation rate, urban population ratio, and unemployment rate to predict each country's HIV outcome. Based on our models, we would like to see whether we can predict future HIV prevalence in a given region based on the information regarding the indicators above.
Through our analysis, we want to investigate which factors are most likely to correlate with the HIV infection rate and answer the question of whether certain investments are helpful in reducing HIV prevalence.
This project will be particularly helpful for governments to recognize the correlation between HIV infection rates in their country and certain socioeconomic determinants, so that they can make conscious decisions about allocating budget to the most effective measures to stop the spread of HIV in their communities. In addition, this analysis can help international organizations make timely interventions in countries at high risk of HIV spreading. Our results will help contribute to the 95-95-95 target set by [UNAIDS](https://www.google.com/search?q=95+95+95+hiv&oq=95+95+95+&aqs=chrome.1.69i57j0i512l4j69i60j69i61j69i60.2780j0j7&sourceid=chrome&ie=UTF-8), which aims for HIV testing, treatment, and viral suppression rates to be 95%-95%-95% by 2025, by giving healthcare providers a better idea of the current progress, required future efforts, and predictive results for the upcoming years.
# **1.** Preparation
## **1.1** Imports
"""
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import datetime as dt
import matplotlib.pyplot as plt
import plotly.express as px
import statsmodels.formula.api as smf
import sklearn
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
"""## **1.2** Download Dataset
Run this section once to download the dataset from Kaggle and convert it into a dataframe. If you run it again, the unzip step will ask whether to replace data.csv, so you will have to manually type y into the fifth cell.
"""
! pip install -q kaggle
from google.colab import drive
drive.mount('/content/drive')
# Create the kaggle directory and read the uploaded kaggle.json file
# (NOTE: Do NOT run this cell more than once unless restarting kernel)
!mkdir ~/.kaggle
# Read the uploaded kaggle.json file
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
# Download datasets
! kaggle datasets download -d theworldbank/health-nutrition-and-population-statistics
# Unzip folder in Colab content folder
!unzip /content/health-nutrition-and-population-statistics.zip
# Read the csv file and save it to a dataframe
df_health_nutrition = pd.read_csv("data.csv")
df_health_nutrition
"""# **2.** EDA
## **2.1** Look at Data and convert tolist()
Looking at the first five rows of the data, we can see that for each country there is a set of health-related indicators with data from 1960 to 2015. However, even among the first five rows there appears to be a significant number of null values, which we will explore and cut down below. Additionally, the current format of the dataframe makes it difficult to do in-depth analysis comparing indicators across countries for such a wide range of years, which indicates that we should reformat the dataframe to be easier to work with.
"""
df_health_nutrition.head(5)
"""In order to see the indicators for every country, we converted the column Indicator Name using tolist() in order to guide our choice for indicators that we wanted to focus on."""
set(df_health_nutrition['Indicator Name'].tolist())
"""To see how many values are missing, we ran .info() on the dataset, which reports the non-null count of each column."""
df_health_nutrition.info()
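"""The non-null counts from .info() can also be turned into a share of missing values per year column, which makes the sparsity of the early years easier to compare. The sketch below is only an illustration on a made-up miniature table (`toy_wide` and its values are hypothetical, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the wide layout: one column per year.
toy_wide = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Indicator Name": ["x", "y", "x", "y"],
    "1960": [np.nan, np.nan, np.nan, 1.0],
    "2015": [2.0, np.nan, 3.0, 4.0],
})

# Fraction of missing values in each year column.
missing_share = toy_wide[["1960", "2015"]].isna().mean()
print(missing_share)
```

On the real dataframe, the same call over all year columns would show which years are too sparse to be useful.
"""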
"""## **2.2** Select Indicators
Below, we select our indicators of interest using our objective to guide us. Essentially, we want to focus on economic and environmental indicators of HIV to determine what areas of improvement governments can focus on in order to improve HIV prevalence rates. For this process, we iteratively selected different indicators to see which ones had enough data to work with (not too many null values).
"""
df_health_nutrition.columns
"""**Note regarding the indicators:**
The Health and Nutrition dataset contains hundreds of indicators that could potentially be used for our analysis and modeling. However, many of the indicators are repetitive. For instance, different indicators might give the same information about GDP, but as either percentage values or absolute values. Indicators may also give information about the population of all ages as well as smaller age ranges. The selection of indicators can therefore be considered a first step toward removing multicollinearity through manual inspection.
After selecting the relevant features, we create an *indicator_rename_dict* to map each indicator to a new, more concise name that will be used later in the DataFrame.
"""
# Select indicators of interest
health_nutrition_indicators = [
#HIV (6/6)
'% of females ages 15-49 having comprehensive correct knowledge about HIV (2 prevent ways and reject 3 misconceptions)',
'% of males ages 15-49 having comprehensive correct knowledge about HIV (2 prevent ways and reject 3 misconceptions)',
'Adults (ages 15+) and children (0-14 years) living with HIV',
'Antiretroviral therapy coverage (% of people living with HIV)',
'Prevalence of HIV, total (% of population ages 15-49)',
'Incidence of HIV (% of uninfected population ages 15-49)',
#economic and poverty (7/7)
'Health expenditure per capita, PPP',
'Health expenditure, public (% of total health expenditure)',
'Health expenditure, total (% of GDP)',
'Out-of-pocket health expenditure (% of total expenditure on health)',
'Unemployment, total (% of total labor force)',
'Urban population (% of total)',
'Urban poverty headcount ratio at national poverty lines (% of urban population)',
# #Health In General (sanitary, water, undernourishment)
# 'Prevalence of undernourishment (% of population)',
'Improved sanitation facilities (% of population with access)',
'Improved water source (% of population with access)',
'Adolescent fertility rate (births per 1,000 women ages 15-19)',
# 'Smoking prevalence, females (% of adults)',
# 'Smoking prevalence, males (% of adults)',
# 'Community health workers (per 1,000 people)',
#education (primary net, secondary net, tertiary gross)
'School enrollment, primary (% net)',
'School enrollment, secondary (% net)',
'School enrollment, tertiary (% gross)',
# 'Literacy rate, adult female (% of females ages 15 and above)',
# 'Literacy rate, adult male (% of males ages 15 and above)',
# 'Literacy rate, adult total (% of people ages 15 and above)',
# 'Literacy rate, youth male (% of males ages 15-24)',
# 'Literacy rate, youth total (% of people ages 15-24)',
]
health_nutrition_education_indicators = []
indicator_rename_dict = {'% of females ages 15-49 having comprehensive correct knowledge about HIV (2 prevent ways and reject 3 misconceptions)' : '% females 15-49 comprehensive HIV knowledge',
'% of males ages 15-49 having comprehensive correct knowledge about HIV (2 prevent ways and reject 3 misconceptions)' : '% males 15-49 comprehensive HIV knowledge',
'Adolescent fertility rate (births per 1,000 women ages 15-19)' : 'adolescent fertility rate',
'Adults (ages 15+) and children (0-14 years) living with HIV' : 'total population living with HIV',
'Adults (ages 15+) and children (ages 0-14) newly infected with HIV' : 'total population newly infected with HIV',
'Adults (ages 15+) living with HIV' : 'adults living with HIV',
'Adults (ages 15+) newly infected with HIV' : 'adults newly infected with HIV',
'Antiretroviral therapy coverage (% of people living with HIV)' : 'HIV population with antiretroviral therapy coverage',
'Antiretroviral therapy coverage for PMTCT (% of pregnant women living with HIV)' : 'pregnant women with HIV with antiretroviral therapy coverage for PMTCT',
'Children (0-14) living with HIV' : 'children living with HIV',
'Children (ages 0-14) newly infected with HIV' : 'children newly infected with HIV',
'Comprehensive correct knowledge of HIV/AIDS, ages 15-24, female (2 prevent ways and reject 3 misconceptions)' : '% females 15-24 comprehensive HIV knowledge',
'Comprehensive correct knowledge of HIV/AIDS, ages 15-24, male (2 prevent ways and reject 3 misconceptions)' : '% males 15-24 comprehensive HIV knowledge',
'Condom use with non regular partner, % adults(15-49), female' : '% 15-49 female adult condom',
'Condom use with non regular partner, % adults(15-49), male' : '% 15-49 male adult condom',
'Condom use, population ages 15-24, female (% of females ages 15-24)' : '% 15-24 female adult condom',
'Condom use, population ages 15-24, male (% of males ages 15-24)' : '% 15-24 male adult condom',
'Contraceptive prevalence, any methods (% of women ages 15-49)' : '% 15-49 female any contraceptive',
'Contraceptive prevalence, modern methods (% of women ages 15-49)' : '% 15-49 female modern contraceptive',
'Health expenditure, private (% of total health expenditure)' : 'private % health expenditure',
'Incidence of HIV (% of uninfected population ages 15-49)' : 'HIV uninfected incidence percentage 15-49',
'Prevalence of HIV, female (% ages 15-24)' : '% 15-24 female HIV prevalence',
'Prevalence of HIV, male (% ages 15-24)' : '% 15-24 male HIV prevalence',
'Prevalence of HIV, total (% of population ages 15-49)' : '% 15-49 total HIV prevalence',
'Unemployment, total (% of total labor force)' : 'total unemployment ratio',
'Urban population (% of total)' : 'urban population ratio',
'Urban poverty headcount ratio at national poverty lines (% of urban population)' : 'urban poverty ratio',
'Teenage mothers (% of women ages 15-19 who have had children or are currently pregnant)' : 'teenage mothers ratio',
'Improved sanitation facilities (% of population with access)' : 'improved sanitation facilities rate',
'Improved water source (% of population with access)' : 'improved water source rate'}
new_cols = ['country_name', 'year'] + health_nutrition_indicators
new_cols
"""## **2.3** Change Dataframe Format & Rename Columns
The most important transformation we need to apply to the dataset before it is ready to use is to restructure the way it presents the indicator/feature values. We changed the format of our dataframe to make the indicators of choice our columns and to convert the country name and corresponding year into single columns. This makes the data easier to work with because we can isolate certain columns when we want to study specific indicators and their relation to the HIV indicators.
In particular, the original DataFrame has every year between 1960 and 2015 as its own column, and each row contains the values of one indicator for one country across all years. We use the **melt** and **pivot_table** functions provided by the Pandas library to move years into rows and indicators into columns. Each row of the new DataFrame now contains all indicator values for one country in one year.
We also rename the columns/indicators after changing the format using the *indicator_rename_dict* we defined in section **2.2**
"""
df_tmp = df_health_nutrition[df_health_nutrition['Indicator Name'].isin(health_nutrition_indicators)].sort_values(['Country Name', 'Indicator Name'])
df_tmp = df_tmp.drop(columns=['Country Code', 'Indicator Code'])
# Change the DataFrame format and rename columns
df_tmp = df_tmp.melt(id_vars=["Country Name", "Indicator Name"], var_name="year", value_name="value")
df_tmp = df_tmp.pivot_table('value', ['Country Name', 'year'], 'Indicator Name').reset_index().rename(columns=indicator_rename_dict)
df_health_nutrition_tmp = df_tmp.rename_axis(None, axis=1)
df_tmp
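"""The reshaping above can be hard to picture on the full table, so here is a minimal sketch of the same melt and pivot_table steps on a made-up four-row frame. The names `toy_wide`, `toy_long`, and `toy_tidy` and all values are hypothetical; only the two pandas calls mirror the cell above.

```python
import pandas as pd

# Hypothetical wide table: one row per (country, indicator), one column per year.
toy_wide = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Indicator Name": ["x", "y", "x", "y"],
    "2014": [1.0, 10.0, 2.0, 20.0],
    "2015": [3.0, 30.0, 4.0, 40.0],
})

# Wide to long: one row per (country, indicator, year).
toy_long = toy_wide.melt(id_vars=["Country Name", "Indicator Name"],
                         var_name="year", value_name="value")

# Long to tidy: one row per (country, year), one column per indicator.
toy_tidy = (toy_long
            .pivot_table(values="value", index=["Country Name", "year"],
                         columns="Indicator Name")
            .reset_index()
            .rename_axis(None, axis=1))
print(toy_tidy)
```

Each (country, year) pair ends up as a single row with one column per indicator, which is the layout the rest of the analysis relies on.
"""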
"""## **2.4** Clean Dataframes
Since the availability of data varies greatly from country to country, we decided to fill in the missing values in the table. To determine the values to fill in for each feature, we imported a dataset (linked below) that gives the continent of each country, grouped the data by continent, and computed the continental mean of each indicator. We then fill the missing values in the dataset with the corresponding continental average. Filling missing data with continental means rather than 0s gives a more plausible imputation, since a value of 0 would be an arbitrary and unrealistic health rate for most countries.
In doing so, we first extract the list of continents from the country-continent dataset we found online (linked below).
"""
continents_df = pd.read_csv('https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv',na_filter = False)
continents_df = continents_df[['official_name_en', 'Continent']]
unique_continent = list(continents_df['Continent'].unique())
unique_continent
"""Then we joined this dataset with our reformatted dataframe on "Country Name" to assign a Continent value to each row. Since some countries did not match a valid continent in the joined dataset, we extract a list of these values so that we can manually assign a continent to them."""
df_joined = df_health_nutrition_tmp.join(continents_df.set_index('official_name_en'), on = ['Country Name'], how = 'left')
null_continent = df_joined[df_joined['Continent'].isna()]
country_list = list(null_continent['Country Name'].unique())
country_list
"""Now, we will assign a Continent value to some of the countries in the list above. We omit entries that are ambiguous or would cause double counting, such as 'Early-demographic dividend', 'Caribbean small states', 'East Asia & Pacific (IDA & IBRD countries)', 'East Asia & Pacific (excluding high income)',..."""
df_joined.loc[df_joined['Country Name'] == 'Bahamas, The', 'Continent'] = 'NA'
df_joined.loc[df_joined['Country Name'] == 'Bolivia','Continent'] = 'SA'
df_joined.loc[df_joined['Country Name'] == 'Channel Islands','Continent'] = 'EU'
df_joined.loc[df_joined['Country Name'] == 'Congo, Dem. Rep.','Continent'] = 'AF'
df_joined.loc[df_joined['Country Name'] == "Cote d'Ivoire",'Continent'] = 'AF'
df_joined.loc[df_joined['Country Name'] == 'Curacao','Continent'] = 'SA'
df_joined.loc[df_joined['Country Name'] == 'Czech Republic','Continent'] = 'EU'
df_joined.loc[df_joined['Country Name'] == 'Egypt, Arab Rep.','Continent'] = 'AF'
df_joined.loc[df_joined['Country Name'] == 'Gambia, The','Continent'] = 'AF'
df_joined.loc[df_joined['Country Name'] == 'Hong Kong SAR, China','Continent'] = 'AS'
df_joined.loc[df_joined['Country Name'] == 'Iran, Islamic Rep.','Continent'] = 'AS'
df_joined.loc[df_joined['Country Name'] == 'Korea, Dem. People’s Rep.','Continent'] = 'AS'
df_joined.loc[df_joined['Country Name'] == 'Korea, Rep.','Continent'] = 'AS'
df_joined.loc[df_joined['Country Name'] == 'Kosovo','Continent'] = 'EU'
df_joined.loc[df_joined['Country Name'] == 'Kyrgyz Republic','Continent'] = 'AS'
df_joined.loc[df_joined['Country Name'] == 'Lao PDR','Continent'] = 'AS'
df_joined.loc[df_joined['Country Name'] == 'Macao SAR, China','Continent'] = 'AS'
df_joined.loc[df_joined['Country Name'] == 'Macedonia, FYR','Continent'] = 'EU'
df_joined.loc[df_joined['Country Name'] == 'Micronesia, Fed. Sts.','Continent'] = 'OC'
df_joined.loc[df_joined['Country Name'] == 'Moldova','Continent'] = 'EU'
df_joined.loc[df_joined['Country Name'] == 'Slovak Republic','Continent'] = 'EU'
df_joined.loc[df_joined['Country Name'] == 'St. Kitts and Nevis','Continent'] = 'NA'
df_joined.loc[df_joined['Country Name'] == 'St. Lucia','Continent'] = 'NA'
df_joined.loc[df_joined['Country Name'] == 'St. Vincent and the Grenadines','Continent'] = 'NA'
df_joined.loc[df_joined['Country Name'] == 'Swaziland','Continent'] = 'AF'
df_joined.loc[df_joined['Country Name'] == 'Tanzania','Continent'] = 'AF'
df_joined.loc[df_joined['Country Name'] == 'United Kingdom','Continent'] = 'EU'
df_joined.loc[df_joined['Country Name'] == 'United States','Continent'] = 'NA'
df_joined.loc[df_joined['Country Name'] == 'Venezuela, RB','Continent'] = 'SA'
df_joined.loc[df_joined['Country Name'] == 'Vietnam','Continent'] = 'AS'
df_joined.loc[df_joined['Country Name'] == 'Virgin Islands (U.S.)','Continent'] = 'NA'
df_joined.loc[df_joined['Country Name'] == 'West Bank and Gaza','Continent'] = 'AS'
df_joined.loc[df_joined['Country Name'] == 'Yemen, Rep.','Continent'] = 'AS'
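"""The join, the list of unmatched names, and the manual assignments above can also be sketched as one compact step driven by a dictionary. This is an illustrative variant on made-up miniature tables, not the code used for our results: `toy_data`, `toy_lookup`, and `manual_continent` are hypothetical names, and unlike the `.loc` assignments above, `fillna` only touches rows whose continent is still missing.

```python
import pandas as pd

# Hypothetical miniature tables; the names and values are made up.
toy_data = pd.DataFrame({"Country Name": ["Kenya", "Bolivia", "Vietnam", "World"]})
toy_lookup = pd.DataFrame({"official_name_en": ["Kenya", "Viet Nam"],
                           "Continent": ["AF", "AS"]})

# Left join; rows without a match in the lookup get a missing continent.
toy_joined = toy_data.merge(toy_lookup, left_on="Country Name",
                            right_on="official_name_en", how="left")
unmatched = sorted(toy_joined.loc[toy_joined["Continent"].isna(), "Country Name"])

# Manual assignments kept in one place; aggregates such as "World" are left out.
manual_continent = {"Bolivia": "SA", "Vietnam": "AS"}
toy_joined["Continent"] = toy_joined["Continent"].fillna(
    toy_joined["Country Name"].map(manual_continent))
still_missing = sorted(toy_joined.loc[toy_joined["Continent"].isna(), "Country Name"])
print(unmatched, still_missing)
```

Keeping the assignments in a dictionary makes it easy to review which names were resolved by hand and which aggregate entries were deliberately left without a continent.
"""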
"""Since there are a good amount of missing data for our dataset, especially for developing countries and data before the year 2000, we will be filling in missing values of this dataset with the average value of each indicator in its continent. We are doing this so that we can keep at least 10k rows and to meet the lower bound number of rows of this assignment. """
df_joined = df_joined.set_index(['Continent'])
column_headers = list(df_joined.columns.values)
column_headers.remove('Country Name')
column_headers.remove('year')
for column in column_headers:
    means = df_joined.groupby('Continent')[column].mean()
    df_joined[column] = df_joined[column].fillna(means)
df_joined = df_joined.reset_index()
df_joined = df_joined.rename(columns={'Country Name':"country_name"})
df_health_nutrition_tmp = df_joined
df_health_nutrition_tmp
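"""As an illustration of the continental-mean imputation, the sketch below applies the same idea to a made-up five-row frame. It uses `groupby(...).transform("mean")` instead of the index-aligned `fillna` above, which is an alternative formulation rather than the code we ran, and it adds a hypothetical `was_imputed` flag so that filled values can be told apart from observed ones.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one indicator with gaps, two continents.
toy = pd.DataFrame({
    "continent": ["AF", "AF", "AF", "EU", "EU"],
    "country_name": ["A", "B", "C", "D", "E"],
    "indicator": [2.0, np.nan, 4.0, 10.0, np.nan],
})

# Per-row continental mean, aligned with the original rows.
group_mean = toy.groupby("continent")["indicator"].transform("mean")

# Remember which rows were missing before filling them.
toy["was_imputed"] = toy["indicator"].isna()
toy["indicator_filled"] = toy["indicator"].fillna(group_mean)
print(toy)
```

A flag like this would make it possible to check later how much of a plot or a model fit is driven by imputed values.
"""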
"""After conducting value imputation, cleaning, and reformatting our dataframe, we dropped all the null values that still remained and were left with our final dataframe that we used for our indicator analysis and modeling."""
#cleaning df_health_nutrition_tmp
df_health_nutrition_tmp.columns = df_health_nutrition_tmp.columns.str.replace(' ', '_').str.lower()
#drop duplicates
df_health_nutrition_tmp = df_health_nutrition_tmp.drop_duplicates()
#change dtype of year column
df_health_nutrition_tmp = df_health_nutrition_tmp.astype({'year':'int64'})
df_health_nutrition_tmp
# df_final = df_health_nutrition_tmp.dropna(thresh=df_health_nutrition_tmp.shape[0]*0.2,how='all',axis=1)
df_final = df_health_nutrition_tmp.dropna()
df_final
# We use this to ensure we are using the properly formatted column names in our analysis
df_final.columns
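"""The column cleaning above only replaces spaces and lowercases the names, so characters such as `%`, commas, and parentheses survive. A small sketch on two of the indicator names shows the resulting labels (the variable names `raw_cols` and `cleaned` are ours, for illustration only).

```python
import pandas as pd

raw_cols = pd.Index(["% 15-49 total HIV prevalence",
                     "Health expenditure per capita, PPP"])

# Same cleaning as above: spaces to underscores, then lowercase.
cleaned = raw_cols.str.replace(" ", "_").str.lower()
print(list(cleaned))
```

Because these labels are not valid Python identifiers, the columns have to be accessed with bracket notation, as in `df_final['%_15-49_total_hiv_prevalence']`.
"""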
"""# 3 EDA: HIV and Indicator Analysis
To begin our analysis, we plotted a heatmap showing the correlations between each of our indicators. We want to acknowledge the risk of multicollinearity, which occurs when multiple features are correlated with each other. For the most part, there is no extremely high correlation between our different categories of features (economics vs. health vs. education), but there is relatively strong correlation within categories. This is why we plan on focusing on one indicator at a time within each category. For instance, since public health expenditure and out-of-pocket health expenditure % have a strong negative correlation of -0.91, we tried to avoid any analysis with both variables at once. We will address this in our modeling section by running PCA to reduce multicollinearity and perform dimensionality reduction.
As we can see, looking specifically at the % 15-49 Total HIV Prevalence Rate, there appears to be a negative correlation with sanitation facilities, water source improvement, health expenditure per capita, public health expenditure, urban population ratio, and primary, secondary, and tertiary education percentages.
On the other hand, there appears to be a positive correlation with adolescent fertility rate, male comprehensive HIV knowledge, unemployment, and urban poverty ratio. There is little to no correlation with female comprehensive HIV knowledge, HIV antiretroviral therapy, out-of-pocket health expenditures, and total health expenditure. We will explore the higher correlations, which include secondary school enrollment, adolescent fertility rate, improved sanitation facilities rate, and improved water source rate.
"""
fig, ax = plt.subplots(figsize=(15,15))
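# Restrict to numeric columns; country_name and continent are text and cannot be correlated
correlation_matrix = df_final.select_dtypes(include='number').corr()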
ax = sns.heatmap(correlation_matrix, vmax=1, vmin=-1, cmap='RdBu', annot = True, fmt = '.2f')
ax.set_title('Correlation Matrix Heatmap of all Indicators')
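"""Reading strongly correlated pairs off a large annotated heatmap is error-prone, so a short sketch of how such pairs could be listed programmatically may help. The data below is synthetic (the column names only echo our indicators, and the 0.8 threshold is an arbitrary choice for illustration).

```python
import numpy as np
import pandas as pd

# Synthetic data: two strongly anti-correlated columns and one unrelated column.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "public_share": base,
    "out_of_pocket_share": -base + rng.normal(scale=0.1, size=200),
    "unrelated": rng.normal(size=200),
})

corr = toy.corr()

# Keep the upper triangle only so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs whose absolute correlation exceeds the chosen threshold.
strong_pairs = pairs[pairs.abs() > 0.8]
print(strong_pairs)
```

Applied to the numeric columns of `df_final`, the same steps would list pairs such as the public and out-of-pocket expenditure shares without scanning the figure by eye.
"""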
"""On the map below, we can see that for a majority of the countries, there is insufficient data on HIV prevalence, which is why after our value imputation for filling in null values, especially in North America, Europe, and Australia. Thus, this includes many developed nations such as the US, Canada, Europe, and Australia. The United States and Canada appear to have slightly higher levels of HIV because Mexico and Central American countries/islands were the only ones with data.
However, there is relatively thorough data for South America, south/southeast Asia, central Asia, and Africa. We can clearly see just from this map alone that Africa is a hotspot for HIV especially among the rest of the world. Zooming into Africa, it is specifically the Southern region that has the highest rates, almost double the rest of Africa and almost 5-6x the rest of the world. We will try to determine what is propelling their higher than average HIV rates as well as common indicators among each of the major regions that have lower HIV rates.
"""
geohiv = df_final[['country_name','%_15-49_total_hiv_prevalence']]
geohiv = geohiv.groupby('country_name').mean()
geohiv.reset_index(inplace=True)
fig = px.choropleth(data_frame = geohiv, locations="country_name", locationmode = 'country names',
color="%_15-49_total_hiv_prevalence", hover_name='%_15-49_total_hiv_prevalence',
color_continuous_scale=px.colors.sequential.Sunset,
title = 'Average HIV Prevalence Rate by Nation (After Imputation)')
fig.show()
"""## **3.1** Economic Indicators
First, we created a smaller heatmap to look into the correlation between HIV indicators and health expenditures. We thought it would be easier to read than the large correlation matrix with all of our indicators.
At first glance, there is a moderate negative correlation of around -0.32 between health_expenditure_per_capita,_ppp and total HIV prevalence. This means that as health_expenditure_per_capita,_ppp increases, HIV prevalence tends to decrease, which is shown in more detail in our dot plot below. Because the correlation is only moderate, this could indicate either that governments are not spending their health expenditures on HIV protection/prevention or that they are not spending in a way that effectively brings down HIV rates.
"""
fig, ax = plt.subplots(figsize=(10,10))
df_econ = df_final.drop(columns = ['adolescent_fertility_rate',
'improved_sanitation_facilities_rate', 'improved_water_source_rate',
'total_unemployment_ratio',
'urban_population_ratio',
'school_enrollment,_primary_(%_net)',
'school_enrollment,_secondary_(%_net)',
'school_enrollment,_tertiary_(%_gross)'])
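# Restrict to numeric columns; country_name and continent are text and cannot be correlated
correlation_matrix = df_econ.select_dtypes(include='number').corr()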
ax = sns.heatmap(correlation_matrix, vmax=1, vmin=-1, cmap='RdBu', annot = True, fmt = '.2f')
ax.set_title('Correlation Matrix Heatmap of Economic and HIV Indicators')
"""As we can see clearly, the United States, Canada, and areas of Europe have the highest health expenditure per capita by a wide margin. Next, Asia and South America spend about half of what those developed nations spend. Finally, Africa spends less than a sixth of what the highest-spending countries do. We hypothesize that lower public health expenditure per capita reduces the number of people in Africa who visit hospitals, get treated, or even get tested for HIV in the first place, and that it can lead to overall reduced immunity levels, which increases susceptibility to HIV. However, we will explore the relationship between HIV prevalence and health expenditure per capita in more depth below."""
econhiv_prim = df_final[['country_name','health_expenditure_per_capita,_ppp']]
econhiv_prim = econhiv_prim.groupby('country_name').mean()
econhiv_prim.reset_index(inplace=True)
fig = px.choropleth(data_frame = econhiv_prim, locations="country_name", locationmode = 'country names',
color='health_expenditure_per_capita,_ppp',
hover_name='health_expenditure_per_capita,_ppp',
color_continuous_scale=px.colors.sequential.Sunset,
title = 'Average Health Expenditure Per Capita')
fig.show()
"""In the dot plot below, we plotted health expenditure per capita against HIV prevalence (%) among ages 15-49, with the hue indicating the country. We can see a weak negative relationship between the variables: countries with high total HIV prevalence rates are concentrated at much lower health expenditure per capita than countries with high health expenditure per capita. At first, we were surprised to see an almost vertical line of countries centered around the lowest health expenditure per capita values, with a wide range of HIV prevalence rates. However, this led us to the overall conclusion that even increasing health expenditure per capita from almost nothing to around $1,000 per person is associated with drastically lower HIV rates and should be seen as an investment. The connection is not one to one, though: some dots (nations) have almost no health expenditure per capita but low HIV prevalence rates. One possible shortcoming of this graph is that our value imputation method of averaging placed a larger number of dots near the $0 end of the health expenditure per capita range. """
hiv_econ_countries= sns.relplot(x='health_expenditure_per_capita,_ppp',
y='%_15-49_total_hiv_prevalence', palette = 'winter',
data=df_final, hue = 'country_name', height=8, aspect=1.5);
plt.title("Dot Plot Health Expenditure Per Capita Each Country")
"""Looking at the highest per capita health expenditure and lowest HIV prevalence %, there is no overlap between the two lists. This could be because the countries with the lowest HIV prevalence rates have nearly identical values, while European countries dominate the list of highest per capita health expenditure.
As for countries with the lowest per capita health expenditure and highest HIV prevalence, there is some overlap, including Central African Republic and Mozambique. There is also a high concentration of African countries on both lists, and the limited overlap could be due to our value imputation using continental averages rather than country-specific numbers. This supports our hypothesis that lower per capita health expenditure could be a reason for higher HIV prevalence rates.
"""
econ = df_final.groupby('country_name')['health_expenditure_per_capita,_ppp'].mean()
econcat_sorted = econ.sort_values()
hiv_econ = df_final.groupby('country_name')['%_15-49_total_hiv_prevalence'].mean()
hiv_econ_sorted = hiv_econ.sort_values()
print("Highest Per Capita Health Expenditure and Lowest HIV Prevalence")
print(econcat_sorted.tail(15));
print()
print(hiv_econ_sorted.head(15));
print()
print()
print()
print("Lowest Per Capita Health Expenditure and Highest HIV Prevalence")
print(econcat_sorted.head(15));
print()
print(hiv_econ_sorted.tail(15));
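"""The overlap between two ranked lists is read off by eye above. As a minimal sketch, the same comparison can be computed with a set intersection. The helper name `rank_overlap`, its parameters, and the toy values are hypothetical and only illustrate the idea; they are not part of the original analysis.

```python
import pandas as pd


def rank_overlap(df, group_col, col_a, col_b, n=15, a_highest=False, b_highest=True):
    # Average each indicator per group, mirroring the groupby(...).mean() cells above.
    means = df.groupby(group_col)[[col_a, col_b]].mean()
    pick_a = means[col_a].nlargest(n) if a_highest else means[col_a].nsmallest(n)
    pick_b = means[col_b].nlargest(n) if b_highest else means[col_b].nsmallest(n)
    # Groups that appear in both selections.
    return sorted(set(pick_a.index) & set(pick_b.index))


# Toy data with made-up values, for illustration only.
toy = pd.DataFrame({
    'country_name': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'spend': [10.0, 20.0, 500.0, 700.0, 30.0, 50.0, 900.0, 1100.0],
    'hiv': [12.0, 14.0, 0.5, 0.7, 9.0, 11.0, 0.1, 0.3],
})
overlap = rank_overlap(toy, 'country_name', 'spend', 'hiv', n=2)
print(overlap)
```

With the real data, the call would presumably use `df_final`, `'country_name'`, and the two indicator columns printed above with `n=15`.
"""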
"""Looking at a continent aggregated level, there is a heightened value for Europe, a much lower value for Africa, and the other continents are ina much closer range of of values between. However, contrasting this to the % 15-49 Total HIV Prevalence bar chart, Africa has the highest prevalence rate significantly followed by North America.
We should note that Asia actually hast the lowest HIV prevalence rate despite their Health Expenditure per Capita being les than Half of Europe's. However, after doing outside research, part of the reason is that many countries in Asia like China, India, etc. have extremely high population numbers which means that even a small percentage still signals a lot of people. This means that many could still be infected with HIV, but also that the governments of these highly populated countries may not be able to spend as much on health expenditure per capita. Also, in the 1990s when there was an HIV epidemic, many countries took extreme measures to install preventative programs, thus decreasing their prevalence rate drastically (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4651036/).
From this continent level aggregation, we are reinforcing the points from above that Africa health expenditure per capita is significantly below the rest of the world and their HIV prevalence rate is significantly higher.
"""
continent_econ = df_final.groupby('continent')['health_expenditure_per_capita,_ppp'].mean().reset_index()
continent_econ_sorted = continent_econ.sort_values(by = 'health_expenditure_per_capita,_ppp')
cont_econ_plot = sns.barplot(x = 'continent', y = 'health_expenditure_per_capita,_ppp', data = continent_econ_sorted,
palette = 'winter')
plt.show()
continent_hiv = df_final.groupby('continent')['%_15-49_total_hiv_prevalence'].mean().reset_index()
continent_hiv_sorted = continent_hiv.sort_values(by = '%_15-49_total_hiv_prevalence')
cont_hiv_plot = sns.barplot(x = 'continent', y = '%_15-49_total_hiv_prevalence', data = continent_hiv_sorted,
palette = 'winter')
plt.show()
"""## **3.2** Health Indicators
Next, we decided to focus on health indicators to see how they correlate with HIV prevalence. This includes metrics like adolescent fertility rate, sanitation facilites rate, improved sanitaion facilities rate, and improved water source rate. We wanted to focus on the indicators with the highest correlations with HIV rates. This includes adolescent fertility rate which has moderately positive correlation with % 15-49 HIV prevalence and improved sanitation facilitites rate and improved water soruce rate both have moderately negative correlation with % 15-49 HIV prevalence.
"""
fig, ax = plt.subplots(figsize=(8,8))
df_health = df_final.drop(columns = ['health_expenditure_per_capita,_ppp',
'health_expenditure,_public_(%_of_total_health_expenditure)',
'health_expenditure,_total_(%_of_gdp)',
'out-of-pocket_health_expenditure_(%_of_total_expenditure_on_health)',
'total_unemployment_ratio',
'urban_population_ratio',
'school_enrollment,_primary_(%_net)',
'school_enrollment,_secondary_(%_net)',
'school_enrollment,_tertiary_(%_gross)'])
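# Restrict to numeric columns so corr() is not handed string columns such as country_name.
correlation_matrix = df_health.select_dtypes(include='number').corr()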
ax = sns.heatmap(correlation_matrix, vmax=1, vmin=-1, cmap='RdBu', annot = True, fmt = '.2f')
ax.set_title('Correlation Matrix Heatmap of Health and HIV Indicators')
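"""The heatmap is used above to pick out the indicators most correlated with HIV prevalence. As a hedged sketch, the same ranking can be read directly from the correlation matrix by selecting the target column and sorting it. The toy column names and values below are made up for illustration.

```python
import pandas as pd

# Toy data with made-up values; 'hiv' stands in for the prevalence column.
toy = pd.DataFrame({
    'country_name': ['A', 'B', 'C', 'D', 'E'],
    'hiv': [1.0, 2.0, 3.0, 4.0, 5.0],
    'fertility': [10.0, 20.0, 30.0, 40.0, 50.0],
    'water': [90.0, 80.0, 70.0, 60.0, 50.0],
})

# Keep numeric columns only, then rank every other indicator by its
# correlation with the target column.
numeric = toy.select_dtypes(include='number')
target_corr = numeric.corr()['hiv'].drop('hiv').sort_values()
print(target_corr)
```

On the real data the target column would presumably be `'%_15-49_total_hiv_prevalence'`.
"""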
"""### 3.2.1 Adolescent Fertility Rate
In the map below, we can see that specific regions have much higher adolescent fertiliy rates such as Africa, South America, the Middle East, and South/Southeastern Asia. However, Africa has significantly higher rates than these other regions, so we will specifically look at this region to see there is a pattern, which we hypothesize that there is because HIV normally stems from unsafe sex practices, which could also lead to higher adolescent fertility rates because both involve lack of contraceptive protection.
"""
fertilityhiv_prim = df_health[['country_name','adolescent_fertility_rate']]
fertilityhiv_prim = fertilityhiv_prim.groupby('country_name').mean()
fertilityhiv_prim.reset_index(inplace=True)
fig = px.choropleth(data_frame = fertilityhiv_prim, locations="country_name", locationmode = 'country names',
color='adolescent_fertility_rate',
hover_name='adolescent_fertility_rate',
color_continuous_scale=px.colors.sequential.Sunset,
title = 'Average Adolescent Fertility Rate by Nation')
fig.show()
"""Visually, this graph does not provide a lot of information about whether there is a clear correlation between adolescent fertility rates and hiv prevalence among age groups 15-49. This dot plot in fact is difficult to analyze because of the concentration of dots for 0-5% of 15-49 total HIV prevalence. Although it looks like a slight negative correlation, the correlation matrix above stated there was a moderately positve correlation, indicating that there is likely concentrations of dots at low adolescent fertility rates and low HIV prevalence rates which are difficult to discern - a limitation of this graph."""
hiv_fert_countries= sns.relplot(x='adolescent_fertility_rate',
y='%_15-49_total_hiv_prevalence', palette = 'winter',
data=df_final, hue = 'country_name', height=8, aspect=1.5);
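"""One way a plot can look negatively sloped while the overall coefficient is positive is when the trend inside each cluster differs from the trend across clusters. The sketch below uses made-up numbers to show that a pooled Pearson correlation and the within-group correlations can have opposite signs; it is an illustration of the effect, not a claim about what happens in `df_final`.

```python
import pandas as pd

# Two hypothetical groups: inside each group the trend is downward,
# but the second group sits higher on both axes.
toy = pd.DataFrame({
    'continent': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'fertility': [1.0, 2.0, 3.0, 11.0, 12.0, 13.0],
    'hiv': [3.0, 2.0, 1.0, 13.0, 12.0, 11.0],
})

# Correlation over all rows together.
pooled_r = toy['fertility'].corr(toy['hiv'])

# Correlation computed separately inside each group.
within_r = {
    name: grp['fertility'].corr(grp['hiv'])
    for name, grp in toy.groupby('continent')
}
print(pooled_r, within_r)
```

Computing both on the real data, for example grouped by `'continent'`, would be one way to check whether the scatter plot and the heatmap are actually in conflict.
"""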
"""Looking at the highest adolescent fertility rates and highest HIV prevalence rates, we cannot really compare the countries because the adolescent fertility rate data includes many countries in Europe and Western Asia that do not have HIV data that is from the original dataset and not value imputed. Thus we will ignore those. Additionally, the lowest adolescent fertility rate and lowest HIV prevalence rates have overlap of Uganda, Mozambique, Malawai, and Zambia. This could indicate a slight relation between adolescent fertility rates and HIV prevalence rates. """
fert = df_final.groupby('country_name')['adolescent_fertility_rate'].mean()
fertcat_sorted = fert.sort_values()
hiv_fert = df_final.groupby('country_name')['%_15-49_total_hiv_prevalence'].mean()
hiv_fert_sorted = hiv_fert.sort_values()
# print("Lowest Adolescent Fertility Rate and Lowest HIV Prevalence")
# print(fertcat_sorted.head(15));
# print()
# print(hiv_fert_sorted.head(15));
# print()
# print()
# sort_values() is ascending, so tail(15) holds the highest values.
print("Highest Adolescent Fertility Rate and Highest HIV Prevalence")
print(fertcat_sorted.tail(15));
print()
print(hiv_fert_sorted.tail(15));
print()
"""Compared the the economic indicators, there appears to be more similarities between the continent aggregated averages for adolescent fertlity rate and HIV prevalence % for ages 15-49. The top three continents for both are South America, North America, and Africa although like we noted before, there is a huge jump for HIV prevalence in Africa comapred to a more gradual increase for adolescent fertility rates. It is interesting to note that there is more of a connection on an aggregated continent level than a specific country by country level. It is in line with our hypothesis that there would be a relation even at a continent level because typically contraceptive laws (abortion laws) are similar among regions rather than being country specific due to having a large basis in culture and religion as reasoning for instituting them."""
continent_fert = df_final.groupby('continent')['adolescent_fertility_rate'].mean().reset_index()
continent_fert_sorted = continent_fert.sort_values(by = 'adolescent_fertility_rate')
cont_fert_plot = sns.barplot(x = 'continent', y = 'adolescent_fertility_rate', data = continent_fert_sorted,
palette = 'winter')
plt.show()
continent_hiv = df_final.groupby('continent')['%_15-49_total_hiv_prevalence'].mean().reset_index()
continent_hiv_sorted = continent_hiv.sort_values(by = '%_15-49_total_hiv_prevalence')
cont_hiv_plot = sns.barplot(x = 'continent', y = '%_15-49_total_hiv_prevalence', data = continent_hiv_sorted,
palette = 'winter')
plt.show()
"""### 3.2.2 Improved Sanitation Facilities
Continents like Europe, North America, South America, Australia, and parts of Northern Asia all have relatively high improved sanitation facilities rates. However, Africa improved sanitation facility rate is relatively low compared to the rest of the world, with their improved sanitation rates almost a third to half of the world. Even Southern Asia hs a relatively lower level with around half of the rest of the world.
"""
sannhiv_prim = df_final[['country_name','improved_sanitation_facilities_rate']]
sannhiv_prim = sannhiv_prim.groupby('country_name').mean()
sannhiv_prim.reset_index(inplace=True)
fig = px.choropleth(data_frame = sannhiv_prim, locations="country_name", locationmode = 'country names',
color='improved_sanitation_facilities_rate',
hover_name='improved_sanitation_facilities_rate',
color_continuous_scale=px.colors.sequential.Sunset,
title = 'Average Improved Sanitation Facilities Rate by Nation')
fig.show()
"""There appears to be a somewhat positive correlation upon first glance (limitation of dot plo), bu we know there is a moderate negative correlation from our heatmap between the x-axis of improved sanitation facilities rate and the y-axis of HIV prevalaence %. Although there are a few outliers that have extremely high improved sanitiation facilities rates and about 5% HIV, this could either be due to our averaging method of filling in null values based on similar regions or due to their baseline for improving their sanitary facilities being really low. Furthermore, the countries with below 60% improved sanitary facilities rate have the higher HIV prevalence %, which could support the hypothesis that worse sanitary facilities increase the transpirary rate of HIV. One positive could be that the countries witht he highest total HIV prevalence have middle levels of improved sanitation facilitites which could reflect investment into their infrastructure and effects that cannot be seen for some time lag."""
hiv_san_countries= sns.relplot(x='improved_sanitation_facilities_rate',
y='%_15-49_total_hiv_prevalence', palette = 'winter',
data=df_final, hue = 'country_name', height=8, aspect=1.5);
"""It does not make sense to look at the highest improved sanitation facilities rate countries because they include Europe which there is no non-imputed HIV data for. Additionally, this analysis may not be as useful in general since it is measured improved sanitation facilities as a rate but not all of these countries had the same baseline to begin with. However, if we briefly look at the lowest improved sanitation facilities rate and highest HIV prevalence, there is overlap between Mozambique and Uganda which has been a common trend among many of these health metrics due to their higher HIV prevalance rate and overall low health and economic metrics/rates."""
san = df_final.groupby('country_name')['improved_sanitation_facilities_rate'].mean()
sancat_sorted = san.sort_values()
hiv_san = df_final.groupby('country_name')['%_15-49_total_hiv_prevalence'].mean()
hiv_san_sorted = hiv_san.sort_values()
print("Lowest Improved Sanitation Facilities Rate and Highest HIV Prevalence")
print(sancat_sorted.head(15));
print()
print(hiv_san_sorted.tail(15));
# print()
# print()
# print("Highest Improved Sanitation Facilities Rate and Lowest HIV Prevalence")
# print(sancat_sorted.tail(15));
# print()
# print(hiv_san_sorted.head(15));
# print()
"""It's important to note that on a continent aggregated level, Africa has the least improved sanitation facilitites rate by a relative margin and that they have the highest HIV % prevalence rates among ages 15-49. This means taht even if Africa is starting at a lower baseline than Europe or North America, it is stll not making sanitation facility improvement at an eqivalent rate, so it will continue falling behind on this metric. We see that Europe has shown the most improvement for sanitaiton facilities by a relative margin and has the second lowest HIV prevalence rate. HIV works by desrying CD4 T cells which are white blood cells that help one's body fight disease, so having sanitized facilities prevents general viruses from passing on easily. It also can ensure general sanitary practices like not sharing needles or coming into contact with blood, which are two major ways that HIV passes. Thus, Africa may need to invest in their sanitary facilities to indirectly decrease HIV prevelance rates. Thus, we can takeaway that there could be a deeper connection between these variables."""
continent_san = df_final.groupby('continent')['improved_sanitation_facilities_rate'].mean().reset_index()
continent_san_sorted = continent_san.sort_values(by = 'improved_sanitation_facilities_rate')
cont_san_plot = sns.barplot(x = 'continent', y = 'improved_sanitation_facilities_rate', data = continent_san_sorted,
palette = 'winter')
plt.show()
continent_hiv = df_final.groupby('continent')['%_15-49_total_hiv_prevalence'].mean().reset_index()
continent_hiv_sorted = continent_hiv.sort_values(by = '%_15-49_total_hiv_prevalence')
cont_hiv_plot = sns.barplot(x = 'continent', y = '%_15-49_total_hiv_prevalence', data = continent_hiv_sorted,
palette = 'winter')
plt.show()
"""### 3.2.3 Improved Water Source Rate
Looking at the map below, we have a significant amount of data to analyze without much required value imputation, especially compared to some other metrics. It's clear that the hghest average improved water source rates occurred in Europe, North America, South America, and Australia. Meanwhile, Africa was lagging behind significantly and Asia slightly with less water source improvement. We did some outside research (https://www.wvi.org/clean-water-sanitation-and-hygiene-wash/why-water-matters-hivaids) and found that although HIV cannot be directly spread by water, it is important because it can make one's immune system more vulnerable to the virus.
"""
sannhiv_prim = df_final[['country_name','improved_water_source_rate']]
sannhiv_prim = sannhiv_prim.groupby('country_name').mean()
sannhiv_prim.reset_index(inplace=True)
fig = px.choropleth(data_frame = sannhiv_prim, locations="country_name", locationmode = 'country names',
color='improved_water_source_rate',
hover_name='improved_water_source_rate',
color_continuous_scale=px.colors.sequential.Sunset,
title = 'Average Improved Water Source Rate by Nation')
fig.show()
"""In the dotplot, it looks like there is a positive correlation between improved water source rate and total HIV prevalence % for ages 15-49 which is interesting given that the heatmap above showed a moderately negative relationship. However, one possible reason for this there is an extremely high concentration of dots at high improved water source rate and extremely low HIV prevalence rates. To continue, we can also note that the same baseline problem occurs here as with improved sanitation facilities rate where different countries start at various baselines so it is difficult to compare countries one to one. Rather, we could see the dots with HIV water source improvement and HIV prevalence as a sign of the countries with high HIV prevalence rates investing in their water sources because they recognize it as a root of many health and immunity problems, including HIV."""
hiv_san_countries= sns.relplot(x='improved_water_source_rate',
y='%_15-49_total_hiv_prevalence', palette = 'winter',
data=df_final, hue = 'country_name', height=8, aspect=1.5);
"""Again, we cannot look at the highest improved water source rates because they include countries in Europe but HIV prevalence does not. However, even on the flip side with the lowest improved water source rate and highest HIV prevalence, only Mozambique is in common which could indicate a less strong direct relationship between these variables while taken at an individual country level."""
water = df_final.groupby('country_name')['improved_water_source_rate'].mean()
watercat_sorted = water.sort_values()
hiv_water = df_final.groupby('country_name')['%_15-49_total_hiv_prevalence'].mean()
hiv_water_sorted = hiv_water.sort_values()
print("Lowest Improved Water Source Rate and Highest HIV Prevalence")
print(watercat_sorted.head(15));
print()
print(hiv_water_sorted.tail(15));
# print()
# print()
# print("Highest Improved Water Source Rate and Lowest HIV Prevalence")
# print(watercat_sorted.tail(15));
# print()
# print(hiv_water_sorted.head(15));
# print()
"""However, on a continent aggregated level, there could be more relation between these variables. Specifically, it appears that Africa has the lowest average improved water source rate and clearly has the highest HIV prevalence rate, which could support our hypothesis that not clean water can influence immune system strength and consequently HIV rates. This is also seen on the flip side where Europe has dramatically improved their water source rate across the entire continent and they have the second lower continent aggregated HIV prevalence rate."""
continent_water = df_final.groupby('continent')['improved_water_source_rate'].mean().reset_index()
continent_water_sorted = continent_water.sort_values(by = 'improved_water_source_rate')
cont_water_plot = sns.barplot(x = 'continent', y = 'improved_water_source_rate', data = continent_water_sorted,
palette = 'winter')
plt.show()
continent_hiv = df_final.groupby('continent')['%_15-49_total_hiv_prevalence'].mean().reset_index()
continent_hiv_sorted = continent_hiv.sort_values(by = '%_15-49_total_hiv_prevalence')
cont_hiv_plot = sns.barplot(x = 'continent', y = '%_15-49_total_hiv_prevalence', data = continent_hiv_sorted,
palette = 'winter')
plt.show()
"""## **3.3** Education Indicators
Compared to the other categories, there is a higher level of correlation between the three school enrollment categories (primary, secondary, and teritiary). Additionlly, of these, secondary has the highest negative correlation, which makes sense because this is the same age in which people start engaging in sexual relationships that can transmit HIV in the first place. Additionally, the drop off for tertiary might stem because typically, a specific subset of more wealthy or overall more education people attend college-level schooling, which could also mean they have more access to contraceptive protection.
"""
fig, ax = plt.subplots(figsize=(10,10))
df_edu = df_final.drop(columns = ['health_expenditure_per_capita,_ppp',
'health_expenditure,_public_(%_of_total_health_expenditure)',
'health_expenditure,_total_(%_of_gdp)',
'out-of-pocket_health_expenditure_(%_of_total_expenditure_on_health)',
'total_unemployment_ratio',
'urban_population_ratio',
'adolescent_fertility_rate',
'improved_sanitation_facilities_rate',
'improved_water_source_rate',
])
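# Restrict to numeric columns so corr() is not handed string columns such as country_name.
correlation_matrix = df_edu.select_dtypes(include='number').corr()
ax = sns.heatmap(correlation_matrix, vmax=1, vmin=-1, cmap='RdBu', annot = True, fmt = '.2f')
ax.set_title('Correlation Matrix Heatmap of Education and HIV Indicators')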
"""To begin, we decided to map primary, secondary, and tertiary school enrollment rates to see how they shifted over time. Begininng with the primary school enrollment rates, they are extremely high, almost 90% or more in South America, Central Asia, and South/Southeast Asia but they are still lower in Africa. Note that the upperbound appears to be 100% and the lower bound appears to be slightly less than 50%, of which these lower rates all stem from Africa."""
eduhiv_prim = df_edu[['country_name','school_enrollment,_primary_(%_net)']]
eduhiv_prim = eduhiv_prim.groupby('country_name').mean()
eduhiv_prim.reset_index(inplace=True)
fig = px.choropleth(data_frame = eduhiv_prim, locations="country_name", locationmode = 'country names',
color='school_enrollment,_primary_(%_net)', hover_name='school_enrollment,_primary_(%_net)',
color_continuous_scale=px.colors.sequential.Sunset,
title = 'Average Primary School Enrollment by Nation')
fig.show()
"""Moving onto secondary school enrollment, the rates drop by at around 10-20% across the board, but even more significantly in Southern Africa, Western Africa, and Southeast Asia. For those regions, the drop off rate was 30-40% for certain countries, which shows there is a high attrition rate between primary and secondary school enrollment. We will explore secondary school enrollment primarily because it has the highest correlation with HIV and is more representative of the overall education levels than primary which is relatively high across the board and tertiary which is relatively low across the board. Note that the maximum for secondary school enrollment is about 90% and the minimum is about 20%."""
eduhiv_sec = df_edu[['country_name','school_enrollment,_secondary_(%_net)']]
eduhiv_sec = eduhiv_sec.groupby('country_name').mean()
eduhiv_sec.reset_index(inplace=True)
fig = px.choropleth(data_frame = eduhiv_sec, locations="country_name", locationmode = 'country names',
color="school_enrollment,_secondary_(%_net)", hover_name='school_enrollment,_secondary_(%_net)',
color_continuous_scale=px.colors.sequential.Sunset,
title = 'Average Secondary School Enrollment by Nation')
fig.show()
"""Compared to the primary and secondary maps, there is relatively less tertiary school enrollment across South America, Africa, and Asia, which makes sense given that with each level of education, there will be a drop off rate as people choose to enter the workforce or have families rather than pursuing more education. We can see that South America and central Asia still have the highest rates among these continents and Western and Southern Africa still have the lowest rates. However, note that even the highest countries still only have ~70% enrollment and this does not necessarily mean they completed college or their form of tertiary schooling. The drop of rate from 90% being the max and 20% being the lowest to 70% being the max and ~that0% being the lowest is significant."""
eduhiv_ter = df_edu[['country_name','school_enrollment,_tertiary_(%_gross)']]
eduhiv_ter = eduhiv_ter.groupby('country_name').mean()
eduhiv_ter.reset_index(inplace=True)
fig = px.choropleth(data_frame = eduhiv_ter, locations="country_name", locationmode = 'country names',
color='school_enrollment,_tertiary_(%_gross)', hover_name='school_enrollment,_tertiary_(%_gross)',
color_continuous_scale=px.colors.sequential.Sunset,
title = 'Average Tertiary School Enrollment by Nation')
fig.show()
"""In the following graph, we plot a dot plot showing the relation between secondary school enrollment (x-axis) and total HIV prevalence for 15-49. We made the hue countries as well to slightly distinguish between the dots. However, as clearly shown, there is a slight negative correlation between these features, which is one of the first times among all these dot plots that the true correlation has been relatively clear upon looking. This is because a majority of countries have hiv prevalence rates really close to zero, but note that a congregation of these zero HIV prevalence points happens at much higher secondary school enrollment rates. Additionally, it is clear that a majority of countries that have higher HIV prevalence rates above 5% have secondary school enrollment rates of below 40 and there are only a few near 60%."""
hiv_education_countries= sns.relplot(x='school_enrollment,_secondary_(%_net)', y='%_15-49_total_hiv_prevalence', palette = 'winter',
data=df_edu, hue = 'country_name', height=8, aspect=2);
"""Looking first at the lowest education and highest HIV prevalence, there are overlaps including Mozambique, Lesotho, Central Africa Republic, and Uganda. This indicates that there could be a connection between low education rates and high HIV prevalence. In fact, according to outside research, HIV typically emerges due to a lack of knowledge of safe sex practices as well as lack of access to contraceptives. This follows in line with this outside research and also, since most of these countries are concentrated in similar regions in Africa, it points to where global resources should be invested.
Looking on the other side at the highest education and lowest HIV prevalence countries, it is hard to compare because European countries do not have their own HIV data.
"""
education = df_edu.groupby('country_name')['school_enrollment,_secondary_(%_net)'].mean()
educat_sorted = education.sort_values()
hiv_education = df_edu.groupby('country_name')['%_15-49_total_hiv_prevalence'].mean()
hiv_educat_sorted = hiv_education.sort_values()
print("Lowest Education and Highest HIV Prevalence")
print(educat_sorted.head(15));
print()
print(hiv_educat_sorted.tail(15));
# print()
# print()
# print("Highest Education and Lowest HIV Prevalence")
# print(educat_sorted.tail(15));
# print()
# print(hiv_educat_sorted.head(15));
# print()
"""Looking at the correlation matrix below, I wanted to explore the specific relation between education and comprehensive HIV knowledge for both genders. Interestingly, males has a negative correlation and females have a positive. As the level of enrollment in school increases from primary to secondary to tertiary, the males level of comprehensive knowledge of HIV becomes less negatively correlated and closer towards no correlations. The female level of comprehensive knowledge shows a higher positive correlation between primary and secondary and a slightly lower for tertiary. This is fascinating because it implies that as primary schooling levels increase, the male comprehensive HIV knowledge decreases, which is contrary to intuition. However, the increase in primary schooling correlates to an increase in female comprehensive HIV knowledge which is on par with our intuition but I thought it would have a higher correlation. This could indicate that education by itself or the way the current education systems are set up is not an extremely effective way to increase knowledge and understanding of HIV. """
fig, ax = plt.subplots(figsize=(10,10))
df_edu_zoom = df_final.drop(columns = [
'adolescent_fertility_rate', 'total_population_living_with_hiv',
'hiv_population_with_antiretroviral_therapy_coverage',
'health_expenditure_per_capita,_ppp',
'health_expenditure,_public_(%_of_total_health_expenditure)',
'health_expenditure,_total_(%_of_gdp)',
'improved_sanitation_facilities_rate', 'improved_water_source_rate',
'hiv_uninfected_incidence_percentage_15-49',
'out-of-pocket_health_expenditure_(%_of_total_expenditure_on_health)',
'%_15-49_total_hiv_prevalence', 'total_unemployment_ratio',
'urban_population_ratio', 'urban_poverty_ratio'
])
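# Restrict to numeric columns so corr() is not handed string columns such as country_name.
correlation_matrix = df_edu_zoom.select_dtypes(include='number').corr()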
ax = sns.heatmap(correlation_matrix, vmax=1, vmin=-1, cmap='RdBu', annot = True, fmt = '.2f')
ax.set_title('Correlation Matrix Heatmap HIV Knowledge and Education Rates')
"""To take a deeper look, we decided to plot maps for both male and female knowledge of HIV and we found an explanation. Our value imputation for female HIV knowledge caused entire continents to be grouped together, with only some other countries in Africa having gathered this knowledge. Likewise, for Male HIV knowledge, there is data for specific countries in Africa, but none really beyond that. However, it is interesting to note that even then, Africa has higher male knowledge of HIV than continents taken as a whole, we makes sense give how prevalent HIV is in the continent. This is not the same for females though. """
edufem_prim = df_edu[['country_name','%_females_15-49_comprehensive_hiv_knowledge']]
edufem_prim = edufem_prim.groupby('country_name').mean()
edufem_prim.reset_index(inplace=True)
fig1 = px.choropleth(data_frame = edufem_prim, locations="country_name", locationmode = 'country names',
color= '%_females_15-49_comprehensive_hiv_knowledge', hover_name='%_females_15-49_comprehensive_hiv_knowledge',
color_continuous_scale=px.colors.sequential.Sunset,
title = 'Female HIV Comprehensive Knowledge by Nation')
fig1.show()
edumale_prim = df_edu_zoom[['country_name','%_females_15-49_comprehensive_hiv_knowledge', '%_males_15-49_comprehensive_hiv_knowledge']]
edumale_prim = edumale_prim.groupby('country_name').mean()
edumale_prim.reset_index(inplace=True)
fig2 = px.choropleth(data_frame = edumale_prim, locations="country_name", locationmode = 'country names',
color= '%_males_15-49_comprehensive_hiv_knowledge', hover_name='%_males_15-49_comprehensive_hiv_knowledge',
color_continuous_scale=px.colors.sequential.Sunset,
title = 'Male HIV Comprehensive Knowledge by Nation')
fig2.show()
"""#4.0 EDA: Continent Aggregated Analysis
In the graph below, it is clear that Africa's bottom quartile for % 15-49 Total HIV Prevalence is in line with the other continents averages and even upper bounds. Meanwhile, among the other continents, Europe and Oceania appear to have the lowest HIV Prevalence rates which are in line with our analysis from above, but North America's is slightly higher which is surprising. However, after looking at the countries who data is available, it is clear that Mexico is the primary driver and thus might be overestimating the overall continents HIV prevalence rates given that there is no data for Canada and the US.
"""
# In order to demonstrate the above point that the mean might be skewed but also to see if there is any continent that is relatively uniform
continent_outliers = sns.boxplot(x='continent',y='%_15-49_total_hiv_prevalence', data = df_final, color = 'blue')
continent_outliers.get_figure().autofmt_xdate()
continent_outliers.set_xlabel('continent')
continent_outliers.set_ylabel('% 15-49 Total HIV Prevalence')
continent_outliers.set_title('Exploring Continental Outliers')
plt.show()
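"""The quartile comparison above is read from the box plot. As a minimal sketch with made-up values, the same quartiles can be tabulated per continent so the comparison does not depend on reading the figure. The column name `hiv` is a stand-in for the prevalence column.

```python
import pandas as pd

# Toy data with made-up values, for illustration only.
toy = pd.DataFrame({
    'continent': ['Africa'] * 5 + ['Europe'] * 5,
    'hiv': [1.0, 3.0, 5.0, 10.0, 20.0, 0.1, 0.1, 0.2, 0.3, 0.4],
})

# One row per continent, one column per quantile (0.25, 0.5, 0.75).
quartiles = toy.groupby('continent')['hiv'].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

In this toy example, the first quartile for Africa sits above the third quartile for Europe, which is the kind of pattern the paragraph above describes.
"""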
"""In the violinplot and corresponding table right below that in the following cell, it is clear that even using a median creates a skewed plot where Africa's Median % 15-49 Total HIV Prevalence is significantly higher than the rest of the world, again noting the reason for why North America might be in second.However, we can see that Africa's violin plot extends high but its midpoint is still relatively near the 5% mark. Since violinplots also show density and Africa's is very skinny throughout, it shows that there is a wide variety of HIV prevalence rates among all the countries whereas the other continents have wider violinplots indicating there is more consistency/density."""
f,ax = plt.subplots(figsize=(15, 7))
aggrog = sns.violinplot(x='continent',y='%_15-49_total_hiv_prevalence',data = df_final, color='blue')
aggrog.get_figure().autofmt_xdate()
aggrog.set_xlabel('continent')
aggrog.set_ylabel('% 15-49 Total HIV Prevalence')
aggrog.set_title('HIV Distribution by Continent')
plt.show()
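# numeric_only=True skips string columns such as country_name when taking medians.
joined_continents_df = df_final.groupby(['continent']).median(numeric_only=True)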
joined_continents_df.reset_index(inplace=True)
joined_continents_df[['continent','%_15-49_total_hiv_prevalence']]
"""Overall, this section simply highlighted how prevalent HIV was in Africa compared to the rest of the world. It also indicates that there are possible geographic ramifiations of HIV such that because it is a virally spread disease, it can transcend borders and stay prevalent among regions.
# **5** Modeling
In this section we explore both regression and classification models to predict the HIV prevalence given the indicators we chose in section **2**.
## **5.1** Preprocessing
### **5.1.1** Split into Features and Label
We first define the DataFrames for our features and label. Note that the features include all the indicators as well as the country names and years. The label DataFrame consists of a single column containing the indicator value for *Prevalence of HIV, total (% of population ages 15-49)*. We first use regression models to predict the prevalence percentage, so the labels are the original percentage values.
"""
features = df_final[['country_name',
'year',
'adolescent_fertility_rate',
'improved_sanitation_facilities_rate',
'improved_water_source_rate',
#'total_population_living_with_hiv',
'health_expenditure_per_capita,_ppp',
'health_expenditure,_public_(%_of_total_health_expenditure)',
'health_expenditure,_total_(%_of_gdp)',
#'hiv_uninfected_incidence_percentage_15-49',
'out-of-pocket_health_expenditure_(%_of_total_expenditure_on_health)',
#'%_15-49_total_hiv_prevalence',
'total_unemployment_ratio',
'urban_population_ratio'
]]
hiv_prevalence = df_final[['%_15-49_total_hiv_prevalence']]
"""### **5.1.2** One-hot-encoding
The country_name column contains categorical values so we need to convert it into numerical vectors using one-hot-encoding before we can use it for modeling.
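As a minimal sketch of what this step does, the toy frame below (made-up country names and years, not taken from our dataset) shows `pd.get_dummies` replacing the categorical column with one indicator column per category.

```python
import pandas as pd

# Toy data; the country names and years are made up for illustration.
toy = pd.DataFrame({
    'country_name': ['A', 'B', 'A', 'C'],
    'year': [2000, 2000, 2001, 2001],
})

# Each distinct country becomes its own indicator column.
toy_encoded = pd.get_dummies(toy, columns=['country_name'])
print(toy_encoded.columns.tolist())
```

Because every country gets its own column, the number of features grows with the number of distinct countries.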
"""
features = pd.get_dummies(features, columns=['country_name'])
"""### **5.1.3** Compute Numerical Feature Correlation
Before running our models, we plot the correlation matrix of the numerical features as a simple check for multicollinearity. As the plot shows, several features are fairly strongly correlated. The *adolescent_fertility_rate* is negatively correlated with both *improved_sanitation_facilities_rate* and *improved_water_source_rate*, which are in turn positively correlated with each other. In addition, *health_expenditure_per_capita,_ppp* is positively correlated with *urban_population_ratio*.
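Reading pairs off a heatmap by eye can be imprecise, so a small helper that lists the strongly correlated pairs may be useful. The sketch below is an illustration only: the helper name `high_correlation_pairs`, the 0.7 threshold, and the synthetic data are our assumptions.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame, threshold=0.7):
    # Return (feature_a, feature_b, correlation) for every pair of columns
    # whose absolute Pearson correlation exceeds the threshold.
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Synthetic example: b is a noisy copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})
pairs = high_correlation_pairs(toy)
print(pairs)
```

On our data the same helper could be called as `high_correlation_pairs(numerical_features)` once that DataFrame is defined.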
"""
numerical_features = features[['year',
'adolescent_fertility_rate',
'health_expenditure_per_capita,_ppp',
'health_expenditure,_public_(%_of_total_health_expenditure)',
'health_expenditure,_total_(%_of_gdp)',
'out-of-pocket_health_expenditure_(%_of_total_expenditure_on_health)',
'improved_sanitation_facilities_rate',
'improved_water_source_rate',
'total_unemployment_ratio',
'urban_population_ratio'
]]
correlation_matrix = numerical_features.corr()
ax = sns.heatmap(correlation_matrix, vmax=1, vmin=-1, cmap='RdBu')
ax.set_title('Correlation Matrix Heatmap of all Indicators')
"""### **5.1.4** Split Data into Train and Test
We split the dataset into train and test with a ratio of 0.8/0.2.
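Unless a seed is set elsewhere, `train_test_split` draws a different random partition on every run, so the scores reported later can vary between runs. The sketch below, on toy arrays, shows how passing `random_state` makes the split repeatable. The value 42 is an arbitrary choice of ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for the features and labels.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same random_state yields the same partition on every call.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))
```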
"""
x_train, x_test, y_train, y_test = train_test_split(features, hiv_prevalence, test_size=0.2)
"""## **5.2** Regression Models
### **5.2.1** Linear Regression
In this section we first train a simple unregularized Linear Regression on the selected indicators to predict the percentage value of HIV prevalence. We report the R-squared (R2) score as the model evaluation metric.
The R2 score measures how much of the variability in the dependent variable is explained by the model. Its maximum value is 1, and a higher value indicates better performance. On held-out data such as our test set it can also be negative, which means the model predicts worse than always guessing the mean of the labels.
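To make the metric concrete, here is a minimal hand computation of R2 as 1 - SS_res / SS_tot on made-up numbers. The helper name `r2_score_manual` is ours. The third case shows that predictions worse than the mean of the labels give a score below 0.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_manual(y_true, [1.0, 2.0, 3.0, 4.0])    # exact predictions
mean_only = r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
reversed_pred = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])
print(perfect, mean_only, reversed_pred)
```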
"""
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(x_train, y_train)
y_pred = reg.predict(x_test)
score = reg.score(x_test, y_test)
print(score)
"""### **5.2.2** Ridge Regression
As we saw in section **5.1.3**, some of our features are strongly correlated with each other, so we additionally train a Ridge Regression model with the aim of mitigating this multicollinearity.
Ridge Regression is a regularized linear model that adds an L2 penalty on the weights. Under multicollinearity, the least-squares estimates are still unbiased, but their variances are large, so the predicted values can be far from the actual values. The L2 penalty accepts a small amount of bias in exchange for a lower variance.
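The effect of the L2 penalty can be sketched with the closed-form ridge solution on two nearly identical synthetic features. This is an illustration under simplifying assumptions (no intercept, made-up data, a helper name `ridge_weights` of our own), not a reproduction of what scikit-learn's `Ridge` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

def ridge_weights(X, y, alpha):
    # Closed-form minimizer of ||y - Xw||^2 + alpha * ||w||^2 (no intercept).
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_weights(X, y, alpha=0.0)     # ordinary least squares
w_ridge = ridge_weights(X, y, alpha=10.0)  # penalized solution
print(w_ols, w_ridge)
```

With `alpha=0` the two weights are poorly determined because the columns are almost collinear. The penalized weights have a smaller norm and are more stable.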
"""
from sklearn.linear_model import Ridge
reg_ridge = Ridge(alpha=10).fit(x_train, y_train)
y_pred = reg_ridge.predict(x_test)
ridge_score = reg_ridge.score(x_test, y_test)
print(ridge_score)
"""### **5.2.3** Lasso Regression
Finally, we use another type of regularized model called Lasso Regression, which applies an L1 penalty. Unlike Ridge Regression, Lasso can shrink the weights of unimportant features to exactly 0. This property makes Lasso a good choice for fitting a sparse linear model in which fewer features are used for prediction.
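The sparsity property can be sketched on synthetic data in which only two of ten features influence the target. The data and the value `alpha=0.5` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count the coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
```

Lasso sets the irrelevant coefficients to exactly zero, while Ridge only shrinks them toward zero.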
"""
from sklearn.linear_model import Lasso
reg_lasso = Lasso().fit(x_train, y_train)
y_pred = reg_lasso.predict(x_test)
lasso_score = reg_lasso.score(x_test, y_test)
print(lasso_score)
"""**Observation on regression models**
According to the R2 scores, the unregularized linear regression model in fact performs best. Ridge Regression is slightly worse, and Lasso Regression scores much lower. This may suggest that few of our features are highly redundant, so a sparse linear model drops features that contribute to the correct prediction.
## **5.3** Classification Models
In this section we turn to classification models and use them to predict a categorical value for the HIV prevalence. Previously we used the percentage value directly as the label; now we divide these values into 3 categories indicating a low/middle/high HIV prevalence rate.
For simplicity we could assign the class labels using the 3 ranges [0, 0.33), [0.33, 0.66), and [0.66, 1]. However, it is almost impossible for the HIV prevalence rate to reach the middle or high range under this definition, since the highest value of HIV prevalence in our dataset is only around 0.25. Therefore, we instead choose the thresholds by sorting the HIV prevalence values in our dataset and taking the values located one third and two thirds of the way through the sorted list.
We report the test accuracy to evaluate the models' performances.
### **5.3.1** Get Class Labels
**Create 3-class labels for HIV prevalence level**: We convert the original HIV prevalence values into high/mid/low categories according to these thresholds, and the labels are 2/1/0 respectively.
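The thresholding rule can be sketched on a handful of hypothetical prevalence values (made up for illustration, not taken from our dataset). The helper name `to_class` is ours.

```python
import pandas as pd

# Hypothetical prevalence values, for illustration only.
values = pd.Series([0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 2.5, 4.0, 9.0])

# Thresholds are the values one third and two thirds of the way
# through the sorted list.
sorted_values = values.sort_values().reset_index(drop=True)
low_threshold = sorted_values[len(sorted_values) // 3]
middle_threshold = sorted_values[len(sorted_values) * 2 // 3]

def to_class(x):
    # 0 = low, 1 = mid, 2 = high
    if x < low_threshold:
        return 0
    if x > middle_threshold:
        return 2
    return 1

classes = values.apply(to_class)
print(low_threshold, middle_threshold, classes.tolist())
```

Note that a value equal to either threshold falls into the middle class, which is one reason the three classes need not be exactly the same size.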
"""
# Sort the prevalence values, then take the values at the one-third and two-thirds positions as the low/mid/high thresholds.
prevalence_df = df_final['%_15-49_total_hiv_prevalence'].sort_values().reset_index()['%_15-49_total_hiv_prevalence']
low_threshold = prevalence_df[len(prevalence_df) // 3]
middle_threshold = prevalence_df[len(prevalence_df) * 2 // 3]
df_final['hiv_prevalence_class'] = df_final['%_15-49_total_hiv_prevalence'].apply(lambda x: 0 if x < low_threshold else 2 if x > middle_threshold else 1)
# Get labels dataframe
labels = df_final['hiv_prevalence_class']
"""We use a barplot to show the number of datapoints for each category. We can see that the datapoints are roughly balanced. But of course it is not perfectly divided into 3 groups because there are datapoints with the same values."""
# Plot counts for each class
# Pass the labels by keyword; recent seaborn versions no longer accept x positionally.
sns.countplot(x=labels)
# Combine features and labels into 1 dataframe
feature_label_df = pd.concat([features, labels], axis=1)
"""### **5.3.2** Split Data into Train and Test
Split the dataset into train and test. We still use a ratio of 0.8/0.2.
"""
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
"""### **5.3.3** Logistic Regression
#### Default Logistic Regression
The first and simplest model we use for classification is Logistic Regression with scikit-learn's default settings. Note that the default applies an L2 penalty (with C=1.0), so this baseline is not strictly unregularized. Logistic Regression is simple and fast to train. However, its main limitation is that it is a linear model, which means that the decision boundary it produces is always linear.
"""
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(max_iter=100).fit(x_train, y_train)
y_pred = log_reg.predict(x_test)
log_acc = log_reg.score(x_test, y_test)
print(log_acc)
"""#### ElasticNet Logistic Regression
Just like how we used Ridge and Lasso regularizations for our linear regression model, here we also consider a regularized version of logistic regression.
We use ElasticNet, which combines the Ridge (L2) and Lasso (L1) regularization terms, with a weighting factor that determines the strength of each. When the ratio is set to 0 or 1, it is identical to Ridge or Lasso regularization respectively. Here we use *l1_ratio* = 0.5.
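As a small sketch of how the weighting factor works, the function below computes an elastic-net penalty for a weight vector. The function name `elastic_net_penalty` and the example weights are ours, and the factor of 1/2 on the L2 term follows one common parameterization, which we assume here rather than claim to be the solver's exact internals.

```python
import numpy as np

def elastic_net_penalty(weights, l1_ratio):
    # Weighted mix of the L1 (Lasso) and L2 (Ridge) penalty terms.
    weights = np.asarray(weights, dtype=float)
    l1_term = np.sum(np.abs(weights))
    l2_term = 0.5 * np.sum(weights ** 2)
    return l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term

w = [1.0, -2.0, 0.0]
ridge_only = elastic_net_penalty(w, l1_ratio=0.0)  # pure L2 penalty
lasso_only = elastic_net_penalty(w, l1_ratio=1.0)  # pure L1 penalty
mixed = elastic_net_penalty(w, l1_ratio=0.5)       # equal mix of the two
print(ridge_only, lasso_only, mixed)
```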
"""
from sklearn.linear_model import LogisticRegression
log_en_reg = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5 , max_iter=100).fit(x_train, y_train)
y_pred = log_en_reg.predict(x_test)
log_en_acc = log_en_reg.score(x_test, y_test)
print(log_en_acc)
"""###**5.3.4** Decision Trees
As mentioned in the previous section, Logistic Regression cannot capture any non-linear relationships. Therefore we consider a more powerful yet efficient family of models - Decision Trees - to perform the classification task.
#### Random Forest
We start with the Random Forest model, an ensemble of decision trees that are trained independently of one another (and can therefore be trained in parallel), each on a bootstrap sample of the data. Averaging many trees makes a Random Forest less prone to overfitting than a single Decision Tree. However, because each tree is trained independently rather than to correct the errors of the others, the gain in accuracy is often modest.
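To compare a single tree with a forest in isolation, the sketch below trains both on a synthetic 3-class problem. The data generated by `make_classification` and all parameter values are assumptions for illustration, and the printed accuracies say nothing about the results on our dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for our data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# One fully grown tree versus an ensemble of 100 independently trained trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_acc = tree.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)
print(tree_acc, forest_acc)
```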
"""
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier().fit(x_train, y_train)
y_pred = rf.predict(x_test)
rf_acc = rf.score(x_test, y_test)
print(rf_acc)
"""#### Gradient-Boosted Tree