A Hadoop-based $k$-nearest-neighbors classifier

1. Introduction

Based on the Hadoop cluster built on the virtual machine, this project implemented the K-nearest-neighbors classifier algorithm under the MapReduce framework in Java and verified its correctness on two small-scale datasets. For a detailed description of this project, please read the following documentation:

1.1 The basic idea of a KNN classifier

K-Nearest Neighbors Classifier is a parameterless model. Its basic idea is: given the training data set Train and a test case $S$, calculate the distance between $S$ and all the instances in Train, take the nearest K training set instances, and the predicted $S$ label is the label with the largest proportion in these instances.

The KNN classifier can be formulated as follows:

$$ Pr(Y=j|X=x_0)=\frac{1}{K} \sum_{i \in N_0} I(y_i = j) $$

This formula shows that given $x_0$, the conditional probability of $Y=j$ is equal to the proportion of all training instances labeled $j$ in the K neighborhood to K. It can also be seen as an approximation of the conditional probability of the Bayesian optimal classifier.

1.2 MapReduce KNN classifier

Since the KNN classifier needs to calculate the distance between test cases and all training cases, the time complexity is relatively high, and the algorithm has poor scalability to big data. With the help of the Hadoop distributed file system and parallel computing framework MapReduce, we can accelerate the KNN classification algorithm.

The basic idea of the MapReduce KNN classifier is to distribute the training data to each server and calculate the distance between the training instance and the test instance at the same time. Since the distance between different training instances and the test instance is calculated independently of each other, it conforms to the characteristics of the MapReduce framework. Thus it can achieve good acceleration and is easy to understand.

2. Build & run

2.1 Build

Maven is used as the construction tool in this project, and the construction process is automated. The command to build this project is as follows (taking the Ubuntu operating system as an example):

$ sudo apt-get install mvn
$ cd hadoop-knn-classifier && mvn package

In addition to building on the command line, this project also supports building with mainstream IDEs (such as Eclipse, IDEA, etc.), just importing them into the IDE as Maven projects.

2.2 Run

For how to run the classifier and experiment of this project, please refer to the following scripts:

data/iris/run-demo.sh
data/iris/run-exp.sh
data/iris/run-finetune.sh
data/iris/upload-data.sh

3. Citation

If you find our work useful in your research, please cite us as:

@misc{cong_hadoop-based_2021,
	title = {A {Hadoop}-based {MapReduce} \$k\$-nearest-neighbors {Classifier}},
	shorttitle = {hadoop-knn-classifier},
	url = {https://github.com/cgsdfc/hadoop-knn-classifier.git},
	abstract = {We implemented the K-nearest-neighbors classifier algorithm under the MapReduce framework in Java and verified its correctness on two small-scale datasets. The basic idea of the MapReduce KNN classifier is to distribute the training data to each server and calculate the distance between the training instance and the test instance at the same time. Since the distance between different training instances and the test instance is calculated independently of each other, it conforms to the characteristics of the MapReduce framework. Thus it can achieve good acceleration and is easy to understand.},
	author = {Cong, Feng},
	year = {2021},
}

Name		Name	Last commit message	Last commit date
Latest commit History 132 Commits
.settings		.settings
data		data
docs-zh		docs-zh
docs		docs
figure		figure
script		script
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README-zh.md		README-zh.md
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A Hadoop-based $k$-nearest-neighbors classifier

1. Introduction

1.1 The basic idea of a KNN classifier

1.2 MapReduce KNN classifier

2. Build & run

2.1 Build

2.2 Run

3. Citation

About

Languages

License

cgsdfc/hadoop-knn-classifier

Folders and files

Latest commit

History

Repository files navigation

A Hadoop-based $k$-nearest-neighbors classifier

1. Introduction

1.1 The basic idea of a KNN classifier

1.2 MapReduce KNN classifier

2. Build & run

2.1 Build

2.2 Run

3. Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Languages