Performance of the EM algorithm and imputation methods with different missing data mechanisms (EPFL - Statistical Computation and Visualization)

kalos11/EM-Algorithm_Missing-Data

Performance of the EM algorithm and imputation methods with different missing data mechanisms

Final project of the course MATH-517 Statistical Computation and Visualization (EPFL, Fall 2023).

The goal of this project is to study the performance of inference using the Expectation Maximization algorithm and various imputation methods with different missing data mechanisms.
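To make the notion of a missing data mechanism concrete, here is a minimal sketch, not taken from the project's code. It builds an MCAR mask (entries go missing independently of the data) and a simple MAR mask (missingness depends only on an always-observed column), then measures the error of plain mean imputation. The function names, the logistic missingness model, and the 30% missing rate are all assumptions chosen for illustration.

```python
import numpy as np

def mcar_mask(shape, p_miss, rng):
    """MCAR: each entry is missing with probability p_miss, independently of the data."""
    return rng.random(shape) < p_miss

def mar_mask(X, p_miss, rng):
    """MAR sketch (assumed model): column 0 is always observed, and the
    probability that the other columns are missing depends only on column 0."""
    z = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
    prob = 1.0 / (1.0 + np.exp(-z))                         # logistic link
    prob = np.clip(prob * p_miss / prob.mean(), 0.0, 1.0)   # average rate close to p_miss
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, 1:] = rng.random((X.shape[0], X.shape[1] - 1)) < prob[:, None]
    return mask

def mean_impute(X_miss):
    """Replace each NaN by the mean of the observed values in its column."""
    col_means = np.nanmean(X_miss, axis=0)
    return np.where(np.isnan(X_miss), col_means, X_miss)

rng = np.random.default_rng(0)
cov = np.full((3, 3), 0.5) + 0.5 * np.eye(3)   # unit variances, 0.5 correlations
X = rng.multivariate_normal(np.zeros(3), cov, size=1000)

masks = {"MCAR": mcar_mask(X.shape, 0.3, rng), "MAR": mar_mask(X, 0.3, rng)}
mse = {}
for name, mask in masks.items():
    X_miss = np.where(mask, np.nan, X)   # hide the masked entries
    X_imp = mean_impute(X_miss)
    # MSE over the entries that were hidden, using the known complete data
    mse[name] = float(np.mean((X_imp[mask] - X[mask]) ** 2))
    # The rate is computed over all columns, including the fully observed one under MAR
    print(f"{name}: missing rate = {mask.mean():.2f}, imputation MSE = {mse[name]:.3f}")
```

Because the complete data are simulated, the imputation error can be computed directly, which is the kind of comparison the project runs across mechanisms and methods.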

The report is available at docs/report.pdf.

Installation and Requirements

The requirements are listed in the requirements.txt file. To install them, run the following command in the root directory of the project:

pip install -r requirements.txt

To run the notebooks, first move them into the src folder. Some deprecated notebooks need the library.py file to run.

Architecture of the repository

main-project-l-b-g/
│
├── docs/                            # Report documents
│   ├── report.pdf                     # PDF of the final report
│   ├── report.html                    # HTML of the final report
│   ├── img/                           # Images used in the report that are not generated by the code
│   └── report_files/                  # Quarto rendering files (ignore)
│
├── src/                             # Source code directory
│   ├── mask_utils.py                  # Code from a public repo to generate masks for missing data
│   ├── produce_NA.py                  # Code from a public repo to generate missing values in a complete dataset
│   ├── utils.py                       # Functions to create masks for missing data (from a public repo)
│   ├── data_generation.py             # Code to generate both complete and incomplete data (Gaussian and Student-t)
│   ├── updated_impyute.py             # Code from a public repo
│   ├── imputation.py                  # Code to impute data and perform inference given missing data;
│   │                                  # also computes MSEs given complete data
│   ├── visualization.py               # Functions to plot results
│   ├── stat_utils.py                  # General statistical utility functions
│   │
│   ├── data/                          # Data directory
│   │   └── winequality-white.csv        # Real data where the underlying model is unknown
│   │
│   ├── notebooks/                     # Jupyter notebooks used for observations and testing
│   │   └── library.py                   # Old, deprecated library needed to run some of the notebooks
│   │
│   ├── img/                           # Folder with saved plots
│   │
│   └── misc/                          # Ignore
│
├── requirements.txt                 # List of required libraries
│
├── .gitignore                       # Specifies intentionally untracked files to ignore
│
├── Makefile                         # Makefile to render the report
│
└── README.md                        # Detailed description of the project
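
The tree above describes imputation.py as performing inference given missing data. As an illustration of what EM-based inference can look like, here is a minimal sketch of the textbook EM algorithm for the mean and covariance of a multivariate Gaussian with missing entries. It is not the code in imputation.py; the function name em_gaussian, the fixed iteration count, and the simulated example are assumptions made for this sketch.

```python
import numpy as np

def em_gaussian(X_miss, n_iter=50):
    """EM estimates of the mean and covariance of a multivariate Gaussian
    from data whose missing entries are encoded as NaN (illustrative sketch)."""
    n, d = X_miss.shape
    observed = ~np.isnan(X_miss)
    # Initialise from mean-imputed data
    mu = np.nanmean(X_miss, axis=0)
    X_hat = np.where(observed, X_miss, mu)
    sigma = np.cov(X_hat, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    for _ in range(n_iter):
        # E-step: conditional mean and covariance of the missing part of each row
        X_hat = np.where(observed, X_miss, 0.0)
        C_sum = np.zeros((d, d))
        for i in range(n):
            o = observed[i]
            m = ~o
            if not m.any():
                continue                      # fully observed row
            if not o.any():
                X_hat[i] = mu                 # nothing observed: use the marginal
                C_sum += sigma
                continue
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            K = np.linalg.solve(S_oo, S_mo.T).T            # S_mo @ inv(S_oo)
            X_hat[i, m] = mu[m] + K @ (X_miss[i, o] - mu[o])
            C_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - K @ S_mo.T
        # M-step: update the parameters from the expected sufficient statistics
        mu = X_hat.mean(axis=0)
        diff = X_hat - mu
        sigma = (diff.T @ diff + C_sum) / n
    return mu, sigma, X_hat

# Simulated example: 30% of the second column is removed completely at random
rng = np.random.default_rng(1)
true_mu = np.array([1.0, -1.0])
true_sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=2000)
X_miss = X.copy()
X_miss[rng.random(2000) < 0.3, 1] = np.nan

mu_hat, sigma_hat, X_hat = em_gaussian(X_miss)
print("estimated mean:", mu_hat)
print("estimated covariance:", sigma_hat)
```

The conditional means stored in X_hat can also serve as an imputed dataset, so the same routine gives both parameter estimates and imputations that can be compared against the complete data.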

Authors, Professor and Supervisors

Authors

Professor

Supervisor

Usage

Rendering the report with Quarto is a one-liner (quarto render src --to html or quarto render src --to pdf), but the provided Makefile makes it even easier:

make html
make pdf
make # both pdf and html

The resulting webpage is in docs/index.html, which can be used directly with GitHub Pages. The PDF is at docs/report.pdf.

Last edited: 2024-01-07
