Python code to read, retrieve, analyze, and plot district-level findings from official (pdf) publications of the 5th National Family Health Survey of India

kalyaninagaraj/NFHS5

Read, store and analyze NFHS-5 data from district-level summaries

  1. Download State and District-level PDFs [Notebook]
    Download PDF reports of key indicators for each state/UT and each of their districts from http://rchiips.org/nfhs/.

  2. Pickle the Indicators [Notebook, Notebook]
    Save the indicators, the names of states/UTs, and their respective districts as dictionaries for easy "pickling" (serializing).

  3. Save district-level statistics to DataFrame [Notebook, PY]
    Read the PDF reports sequentially and store 104 indicator values for each of 700+ districts in a CSV file.

  4. Perform PCA, K-Means Clustering on the reported NFHS-5 data [Notebook, PY]
    Use PCA to (1) plot 2D/3D representations of all 700+ data points and (2) find k-nearest neighbors, which are then used to (3) impute missing (unavailable) values in the dataset.

    For example, the plot below on the left is a 2D representation of the original 95-dimensional data. Each dot represents a district in the dataset, and the two highlighted in red are from the state of Goa. Reducing the data's original dimensionality to 2 dimensions retains only about 34% of the variance in the data. A 3D representation (on the right below) retains roughly 40% of the variance.

[Figures: 2D representation by PCA (left), 3D representation by PCA (right)]
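
The explained-variance percentages quoted above are the cumulative share of total variance captured by the first 2 or 3 principal components. The following is a minimal sketch of that calculation, not the repository's code: it uses NumPy's SVD on synthetic data (the array `X`, the helper `pca_explained_variance`, and the choice to standardize each column are all assumptions made for illustration), so the numbers it prints will not match the 34% and 40% reported for the NFHS-5 data.

```python
import numpy as np


def pca_explained_variance(X, n_components):
    """Return (scores, explained) for the first n_components principal components.

    Hypothetical helper: standardizes each column, then uses the SVD of the
    standardized matrix. Assumes no column is constant (nonzero std).
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    ratio = S**2 / np.sum(S**2)          # variance share of each component
    scores = Z @ Vt[:n_components].T     # coordinates in the reduced space
    return scores, float(ratio[:n_components].sum())


# Synthetic stand-in for the real table: 700 rows ("districts"),
# 95 columns ("indicators"), built from 5 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(700, 5))
loadings = rng.normal(size=(5, 95))
X = latent @ loadings + 3.0 * rng.normal(size=(700, 95))

scores_2d, explained_2d = pca_explained_variance(X, 2)
scores_3d, explained_3d = pca_explained_variance(X, 3)

print(f"2 components explain {explained_2d:.1%} of the variance")
print(f"3 components explain {explained_3d:.1%} of the variance")
```

`scores_2d` and `scores_3d` hold the coordinates one would pass to a 2D or 3D scatter plot, one row per district.
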
  5. Display NFHS-5 data on interactive maps using GeoPandas [Notebook, PY]
    Generate maps to view reported statistics for each district. Missing or unavailable entries are estimated using Principal Component Analysis (PCA). The images below are screenshots of maps showing three such indicators (or statistics) for different districts in the country. The number of principal components used for imputing missing entries is chosen so as to explain 99% of the variance in the dataset.

    (a) Percentage of literate women (aged 15-49)

    Q14

    (b) Percentage of married women (aged 15-49) who follow some family planning method

    Q20

    (c) Percentage of pregnant women (aged 15-49) who are anaemic

    Q83
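
The imputation described above combines three ideas: keep enough principal components to explain 99% of the variance, find each district's nearest neighbors in that reduced space, and fill a missing entry from those neighbors. The sketch below is one plausible way to put these together and is an assumption, not the repository's implementation: the function names, the mean-fill used before PCA, the default of 5 neighbors, and the synthetic data are all hypothetical.

```python
import numpy as np


def n_components_for_variance(Z, target=0.99):
    """Smallest number of components whose cumulative variance share >= target."""
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    cumulative = np.cumsum(S**2) / np.sum(S**2)
    k = min(int(np.searchsorted(cumulative, target)) + 1, len(S))
    return k, Vt[:k]


def impute_with_pca_knn(X, target=0.99, n_neighbors=5):
    """Fill NaNs with the mean of the nearest neighbors in PCA space.

    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    col_std = np.nanstd(X, axis=0)
    col_std[col_std == 0] = 1.0                  # guard against constant columns
    filled = np.where(missing, col_mean, X)      # temporary mean-fill so PCA can run
    Z = (filled - col_mean) / col_std
    k, components = n_components_for_variance(Z, target)
    scores = Z @ components.T

    result = X.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        dist = np.linalg.norm(scores - scores[i], axis=1)
        dist[i] = np.inf                         # a row is not its own neighbor
        order = np.argsort(dist)
        for j in np.flatnonzero(missing[i]):
            # nearest rows that actually report indicator j
            donors = [r for r in order if not missing[r, j]][:n_neighbors]
            result[i, j] = X[donors, j].mean()
    return result, k


# Synthetic example: 50 rows, 8 columns, two entries removed.
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 8))
data_missing = data.copy()
data_missing[4, 2] = np.nan
data_missing[10, 5] = np.nan

completed, k = impute_with_pca_knn(data_missing)
print(f"kept {k} components; remaining NaNs: {int(np.isnan(completed).sum())}")
```

Observed values are left untouched; only entries that were `NaN` in the input are overwritten. The completed table could then be joined to the district shapefile for mapping.
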

Code Credit

@kalyaninagaraj

Resources

  1. National Family Health Survey of India (official website)
  2. fitz, or PyMuPDF (documentation)
  3. pickle (documentation)
  4. GeoPandas (documentation)
  5. District boundary data of India in the form of shapefiles sourced from Kaggle
