Causal inference is the process of inferring causes from data. This project analyzes breast cancer diagnosis data and infers causal relationships among its features and the diagnosis.
- Perform a causal inference task using Pearl’s framework.
- Infer the causal graph from observational data and then validate the graph.
- Merge machine learning with causal inference.
The data is extracted from Kaggle. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
- ID number
- Diagnosis (M = malignant, B = benign)
- Ten real-valued features are computed for each cell nucleus:
  - radius (mean of distances from center to points on the perimeter)
  - texture (standard deviation of gray-scale values)
  - perimeter
  - area
  - smoothness (local variation in radius lengths)
  - compactness (perimeter^2 / area - 1.0)
  - concavity (severity of concave portions of the contour)
  - concave points (number of concave portions of the contour)
  - symmetry
  - fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" value (the mean of the three largest values) of each of these features were computed for each image, resulting in 30 features.
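To make the 30-feature layout concrete, the sketch below builds the column names from the ten base features and the three statistics. The `<feature>_<statistic>` naming and the `se`/`worst` suffixes are assumptions about how the Kaggle CSV labels its columns, so check them against the actual file header. The tenth base feature, fractal dimension, comes from the original dataset description.

```python
# Ten base measurements taken on each cell nucleus.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Three summary statistics computed per image for every base feature.
# The suffix spellings are assumed, not taken from this README.
STATISTICS = ["mean", "se", "worst"]

# 10 base features x 3 statistics = 30 feature columns.
feature_columns = [
    f"{feature}_{stat}" for stat in STATISTICS for feature in BASE_FEATURES
]

print(len(feature_columns))
```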
An exploratory data analysis was conducted on the data and the useful insights were documented. It covers identifying and treating all missing values and outliers with appropriate methods, feature extraction, and feature scaling. The analysis is in the notebooks folder.
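The cleaning steps could look roughly like the following library-free sketch. The specific choices here (median imputation, clipping at 1.5 times the interquartile range, and z-score scaling) are assumptions made for illustration; the notebooks may use different methods. All function names are hypothetical.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers_iqr(values, k=1.5):
    """Clip values lying more than k * IQR outside the first/third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy column with one missing value and one extreme value (not real data).
raw_column = [10.0, 11.0, 12.0, None, 14.0, 15.0, 16.0, 17.0, 18.0, 1000.0]
cleaned = standardize(clip_outliers_iqr(impute_median(raw_column)))
```

Clipping rather than dropping outliers keeps every row, which matters when the same rows are later reused to learn the causal graph.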
- Split data into training and hold-out set
- Create a causal graph from all of the training data and record the insights it gives (this graph will be considered the ground truth)
- Create new causal graphs using increasing fractions of the data and compare with the ground truth graph
- The comparison can be done with a Jaccard Similarity Index, measuring the intersection and union of the graph edges
- After reaching a stable causal graph, select only variables that point directly to the target variable
- Train one model using all variables and another using only the variables selected by the graph
- Measure how much each model overfits by comparing its training performance with its performance on the hold-out set created in the data-split step
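The graph-comparison and feature-selection steps above can be sketched as follows. A graph is represented as a set of `(cause, effect)` edges, which is an assumption about the representation. The `learn_edges` callable stands in for whatever structure-learning routine the project uses and is hypothetical, as are all function names and the toy edges in the example, which are not results.

```python
from typing import Callable, Iterable, Sequence

Edge = tuple[str, str]  # (cause, effect)


def jaccard_similarity(edges_a: Iterable[Edge], edges_b: Iterable[Edge]) -> float:
    """Size of the edge intersection divided by the size of the edge union."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(a & b) / len(union)


def stability_curve(
    rows: Sequence,
    fractions: Iterable[float],
    learn_edges: Callable[[Sequence], Iterable[Edge]],
    reference_edges: Iterable[Edge],
) -> list[tuple[float, float]]:
    """Similarity to the reference graph for each fraction of the data.

    Takes the first n rows for each fraction, so rows are assumed to be
    shuffled beforehand.
    """
    reference = set(reference_edges)
    curve = []
    for fraction in fractions:
        n = max(1, int(len(rows) * fraction))
        curve.append((fraction, jaccard_similarity(learn_edges(rows[:n]), reference)))
    return curve


def direct_causes(edges: Iterable[Edge], target: str) -> list[str]:
    """Variables with an edge pointing directly at the target."""
    return sorted({cause for cause, effect in edges if effect == target})


def overfit_gap(train_score: float, holdout_score: float) -> float:
    """Training score minus hold-out score; a larger gap means more overfitting."""
    return train_score - holdout_score


# Toy edges for illustration only.
ground_truth = {
    ("radius_mean", "diagnosis"),
    ("area_mean", "diagnosis"),
    ("radius_mean", "area_mean"),
}
candidate = {
    ("radius_mean", "diagnosis"),
    ("radius_mean", "area_mean"),
    ("texture_mean", "diagnosis"),
}
similarity = jaccard_similarity(ground_truth, candidate)  # 2 shared / 4 total
selected = direct_causes(ground_truth, "diagnosis")
```

The similarity can be tracked across fractions until it stops changing, at which point the graph is treated as stable and `direct_causes` gives the reduced feature set for the second model.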