Learning from Incomplete Data

Missing values occur frequently in datasets and prevent many approaches from being applied directly. This repository collects different approaches for feature selection and classification on incomplete datasets. It also provides utilities to create synthetic datasets and to simulate different missingness mechanisms.
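
As an illustration of what simulating a missingness mechanism can look like, here is a minimal sketch of the simplest case, missing completely at random (MCAR), where every entry is removed with the same probability. It uses only NumPy and is not the repository's own utility; the function name introduce_mcar and its parameters are assumptions made for this example.

```python
import numpy as np


def introduce_mcar(X, missing_rate, seed=0):
    """Return a float copy of X with entries set to NaN completely at random.

    Hypothetical helper for illustration; not part of this repository's API.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be in [0, 1]")
    rng = np.random.default_rng(seed)
    X_missing = np.array(X, dtype=float)  # copy; NaN requires a float dtype
    # Each entry is dropped independently with probability missing_rate.
    mask = rng.random(X_missing.shape) < missing_rate
    X_missing[mask] = np.nan
    return X_missing


X = np.arange(20, dtype=float).reshape(5, 4)
X_incomplete = introduce_mcar(X, missing_rate=0.3)
print("Fraction of missing entries:", np.isnan(X_incomplete).mean())
```

Other mechanisms, where the chance of an entry being missing depends on observed or unobserved values, would replace the uniform mask with one computed from the data.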

It was developed during my Master's thesis and aims to check whether the traditional approaches (imputation or deletion) are inferior to approaches that handle missing values internally. The repository also includes a version of the RaR [1] algorithm extended with robust internal handling of missing values and with active sampling of subspaces instead of random subsampling.

[1] Shekar, Arvind Kumar, et al. "Including multi-feature interactions and redundancy for feature ranking in mixed datasets." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Cham, 2017.
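
For readers unfamiliar with the two traditional approaches mentioned above, the following sketch contrasts listwise deletion with column-mean imputation on a tiny array. It is a generic NumPy illustration under the assumption that missing values are encoded as NaN; the function names are hypothetical and it does not reproduce the baselines used in the thesis.

```python
import numpy as np


def delete_incomplete_rows(X):
    """Listwise deletion: keep only rows without any NaN."""
    return X[~np.isnan(X).any(axis=1)]


def impute_column_means(X):
    """Mean imputation: replace each NaN with the mean of its column.

    Assumes every column has at least one observed value.
    """
    X_imputed = X.copy()
    col_means = np.nanmean(X_imputed, axis=0)
    rows, cols = np.nonzero(np.isnan(X_imputed))
    X_imputed[rows, cols] = col_means[cols]
    return X_imputed


X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [5.0, np.nan],
    [7.0, 8.0],
])

X_deleted = delete_incomplete_rows(X)
X_imputed = impute_column_means(X)
print("Rows kept after deletion:", X_deleted.shape[0])
print("Imputed array:\n", X_imputed)
```

Deletion discards half of the rows in this example, while imputation keeps all rows but fills them with values that were never observed. Approaches that handle missing values internally aim to avoid both effects.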

Thesis

The attached thesis presents the theoretical foundations of RaR and discusses its strengths and weaknesses compared to related algorithms and traditional missing value handling techniques.

Datasets

Datasets can be downloaded directly from UCI (as csv) or from OpenML (as arff). To add your own dataset, create a data folder at the top level and, inside it, a csv or arff folder if one does not already exist. Inside that folder, create a folder for your dataset whose name equals the name you pass to the data loader. For arff datasets, the file containing the dataset must be named "data.arff"; for csv datasets, it must be named "data.csv". When a csv dataset is loaded for the first time, a file called "meta_data.json" is created. It stores the feature names and types that the algorithms need, and it can be edited.
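
The folder convention described above can be sketched as a small path helper. This is only an illustration of the layout as read from the text (data/<format>/<dataset name>/data.<format>); the function dataset_file is hypothetical and is not the repository's data loader.

```python
from pathlib import Path


def dataset_file(root, name, file_format):
    """Build the expected location of a dataset file.

    Hypothetical helper; the layout is an assumption derived from the README text.
    """
    if file_format not in ("csv", "arff"):
        raise ValueError("file_format must be 'csv' or 'arff'")
    return Path(root) / "data" / file_format / name / f"data.{file_format}"


# Where a csv dataset named "my_dataset" would be expected, relative to the repo root.
path = dataset_file(".", "my_dataset", "csv")
print(path.as_posix())
```

For a csv dataset, the generated "meta_data.json" would then presumably sit next to "data.csv" in the same folder, although the text does not state its location explicitly.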

Evaluation

To try out the algorithms, there is a Jupyter-style script called test.py at the top level. It can be executed in VS Code or Atom using the Jupyter or Hydrogen extension, or run as a plain script, though without breakpoints.

It shows how to load data, introduce missing values, and run the developed algorithm on the result.

Installation (Windows)

Required installations

Python 3.6 and pip (other versions might also work but are not tested)
Microsoft Visual C++ Build Tools
- https://visualstudio.microsoft.com/de/vs/community/
- Tools for Visual Studio 2017 > Download Build Tools for Visual Studio 2017 (including the Windows SDK)
- Requires the C++ build tools and the Windows 10 SDK
- Add the paths to ucrt and vcvarsall.bat to PATH
Gurobi optimizer (https://www.gurobi.com/downloads/gurobi-optimizer)
- Get a licence, install Gurobi, and run "grbgetkey <key>" from the Run dialog (Win+R)
- Run "python setup.py install" in the Gurobi folder

Steps:

Install the requirements using pip (a virtual environment is recommended)
- Using Git Bash under Windows:
- python3 -m venv rar
- source rar/Scripts/activate
pip install -r requirements.txt
- (The ecos module may fail to install; in that case, install it from a whl file from https://www.lfd.uci.edu/~gohlke/pythonlibs/)
python setup.py build_ext --inplace
(For Jupyter when using a venv and the VS Code extension):
- Jupyter must be installed globally
- Create a kernel in the venv: ipython kernel install --name=rar
- Go to the Python site-packages folder and copy the gurobipy folder to site-packages in the venv
- Test the kernel from VS Code
Add the root folder to the system path or set the workspace root in VS Code
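
After completing the steps above, a quick check like the following can confirm that the packages most likely to cause trouble are visible from the active environment. It is a minimal sketch using only the standard library; the helper missing_modules is hypothetical, and the list of module names (gurobipy and ecos, both mentioned above) is an assumption that can be extended as needed.

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Modules named in the installation notes; extend this list as needed.
required = ["gurobipy", "ecos"]

print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Missing modules:", missing_modules(required) or "none")
```

Run it once with the venv activated and once from the Jupyter kernel: if gurobipy is reported missing only in the kernel, the copy step for the gurobipy folder has probably not taken effect.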

Experiments and Statistics

The experiment folder contains scripts to compare different feature selection and classification algorithms on synthetic or real-world datasets. The data folder also contains statistics about the datasets and might include visualizations of them. If these are not present, the scripts to create the statistics and visualizations can be found in the scripts folder.
