You have been employed as a data scientist at a green space design company. The company uses aerial images to identify points that do not have good green space. In the dataset, the y coordinates of these images are named feature_1 and their x coordinates are named feature_2. The company plans to deploy 5 teams. Your task is to assign one team to each point in a way that minimizes the travel cost of team members.
- Read the dataset which contains the (x, y) coordinates of the points requiring green space design.
- Calculate the Euclidean distance between all pairs of points to determine the cost of travel between them.
- Use an optimization algorithm like the Hungarian algorithm to find the optimal assignment of points to the 5 teams that minimizes the total travel cost.
- Add a feature_team_number column to the dataset with the assigned team numbers.
- Visualize the team assignments on a map using matplotlib.
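The distance step in the list above can be sketched with NumPy broadcasting. This is a minimal illustration, not the code from the repository: the function name `pairwise_euclidean` and the sample coordinates are made up for the example, and the assignment step itself is not covered here.

```python
import numpy as np

def pairwise_euclidean(points):
    """Return the (n, n) matrix of Euclidean distances between rows of `points`."""
    points = np.asarray(points, dtype=float)
    # diff[i, j] is the coordinate difference between point i and point j,
    # giving an array of shape (n, n, 2) for 2-D points.
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Made-up coordinates for illustration only; the real ones come from the dataset.
sample = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = pairwise_euclidean(sample)
```

Entry `dist[i, j]` is the travel cost between points `i` and `j`; the matrix is symmetric with zeros on the diagonal.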
Use the package manager pip to install the required packages:
pip install numpy pandas matplotlib
Run the code.ipynb notebook in Jupyter Notebook to see the solution:
jupyter notebook
This will print the team assignments for each point and show a visualization of the assignments.
The output includes:
- Python code in code.py implementing the solution
- A visualization of the team assignments like the image below:
- The dataset with an added feature_team_number column indicating the team assignment for each point.
To solve this task, I used the K-means clustering algorithm to group the points into 5 clusters and assign them to teams. The specific steps taken were:
- Read the dataset `rc_task_2.csv`, which contains the (x, y) coordinates of the points.
- Visualize the points on a scatter plot using matplotlib to get a sense of how the points are distributed.
- Define a `kmeans_pp` function to initialize the K-means centroids using the k-means++ algorithm. This tends to choose initial centroids that are far from each other.
- Define a `kmeans` function to run the K-means clustering algorithm. It takes in the data `X`, the number of clusters `K`, and the maximum number of iterations `max_iter`. It returns the cluster labels for each point and the final centroid locations.
- Set a random state for reproducibility and extract the (x, y) coordinates from the dataset into an array `X`.
- Run K-means clustering with `K=5` and `max_iter=100`. This groups the points into 5 clusters.
- Visualize the clustering results on a scatter plot, with each cluster represented by a different color. The final centroid locations are marked with 'x'.
- Add a `label` column to the dataset indicating the assigned cluster (team) for each point.
- Export the updated dataset to `rc_task_2_labeled.csv`.
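The end result assigns each point to one of 5 teams (clusters). K-means groups the points so that the sum of squared distances from each point to its cluster centroid is locally minimized, which keeps the points handled by each team close together. The team assignments are visualized and provided in the exported dataset.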
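The two functions described in the steps can be sketched roughly as follows. This is an illustrative version under stated assumptions, not the repository's actual code: the `seed` and `rng` parameters, the convergence check, and the synthetic demo data are all assumptions added for the example.

```python
import numpy as np

def kmeans_pp(X, K, rng):
    """Pick K initial centroids from the rows of X with k-means++ seeding."""
    n = X.shape[0]
    # First centroid: a uniformly random data point.
    centroids = [X[rng.integers(n)]]
    for _ in range(1, K):
        chosen = np.array(centroids)
        # Squared distance from each point to its nearest chosen centroid.
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        total = d2.sum()
        # Sample the next centroid with probability proportional to d2
        # (uniform fallback if every point coincides with a centroid).
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

def kmeans(X, K, max_iter=100, seed=0):
    """Run Lloyd's algorithm; return (labels, centroids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)  # `seed` is an assumed parameter
    centroids = kmeans_pp(X, K, rng)
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its members,
        # leaving it in place if the cluster is empty.
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids

# Synthetic demo data (two well-separated blobs), not the real dataset.
demo_rng = np.random.default_rng(42)
demo = np.vstack([
    demo_rng.normal(0.0, 0.5, size=(20, 2)),
    demo_rng.normal(10.0, 0.5, size=(20, 2)),
])
labels, centroids = kmeans(demo, K=2, max_iter=100, seed=0)
```

For the task described here, the call would use `K=5` and `max_iter=100` on the array of coordinates extracted from the dataset, and the returned labels would be written to the `label` column.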