📊 Python tool for creating datasets with clusters using a normal distribution. Customize clusters, significant columns, and add variability with dummy columns. Ideal for testing clustering algorithms.

josemarialuna/RandomClustersGenerator


Random Clusters Generator


Generate datasets with defined clusters using a normal distribution. This specialized tool allows you to customize data creation, specifying the number of significant columns forming the clusters. Additionally, it provides the option to include dummy columns, adding variability and noise to your datasets.

Figures: 4 clusters with 2 significant features; representation of the two dummy features.

Key Features:

  • Defined Clusters: Create datasets with clearly defined clusters, ideal for applying clustering algorithms.
  • Normal Distribution: Utilize a normal distribution to generate data, providing realism and coherence in your datasets.
  • Significant Columns Configuration: Customize the number of columns forming clusters, allowing you to adjust the complexity of your datasets.
  • Optional Dummy Columns: Add dummy columns to introduce variability and noise in the data, providing a more realistic approach to real-world scenarios.
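The features above can be illustrated with a minimal sketch. It is not the project's actual code: the function name `generate_clusters`, the random choice of cluster centres, and the uniform distribution used for dummy columns are all assumptions made for illustration, and only the Python standard library is used.

```python
import random


def generate_clusters(clusters_num, instances, significant_num, dummy_num,
                      standard_dev, seed=None):
    """Return rows shaped as [significant..., dummy..., cluster_label].

    Hypothetical helper: the real scripts may place centres and
    draw dummy values differently.
    """
    rng = random.Random(seed)
    rows = []
    for label in range(clusters_num):
        # Assumption: each cluster gets a random centre in [0, 1).
        centre = [rng.random() for _ in range(significant_num)]
        for _ in range(instances):
            # Significant columns: normal distribution around the centre.
            significant = [rng.gauss(mu, standard_dev) for mu in centre]
            # Dummy columns: drawn independently of the cluster (noise).
            dummies = [rng.random() for _ in range(dummy_num)]
            rows.append(significant + dummies + [label])
    return rows


rows = generate_clusters(clusters_num=4, instances=10, significant_num=2,
                         dummy_num=2, standard_dev=0.05, seed=0)
print(len(rows), len(rows[0]))  # 40 rows, 5 columns each
```

A small `standard_dev` keeps the points of each cluster tight around its centre, while the dummy columns carry no cluster information at all.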

Project Status

🚀 In Development | Production Ready

This project is in constant evolution to enhance and expand its functionalities. We welcome any community contributions to make it even more robust. While the current version is ready for deployment in production environments, we are committed to continuous improvement and optimization of the code. Feel free to explore, use, and contribute to the project. Refer to the Contribution section for more details on how to get involved in development. Your feedback and suggestions are valuable to us. Together, we can make this project even better!

Project Structure

This project consists of four main files, each serving a specific purpose:

  • main_generator.py: This file is responsible for generating datasets based on the parameters provided. It allows users to create customized datasets with defined clusters using a normal distribution. Users can specify the number of significant columns forming the clusters and choose to include additional dummy columns for added variability.
  • main_generator_parameters.py: This file generates datasets based on parameters specified in a CSV file. Users can provide a CSV file containing configuration details, and the script will use this information to create datasets accordingly.
  • config_gen.py: This file generates a CSV file in the config folder containing various combinations of data parameters, including the number of clusters, the number of significant features, the number of dummy features, and standard deviation values.
  • add_dummy_columns.py: The purpose of this file is to add dummy columns to an existing CSV file. It takes a CSV file as input and appends additional columns with dummy data, introducing variability and noise to the dataset.
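As an illustration of what producing "various combinations of data parameters" could look like, here is a hedged sketch that builds every combination with `itertools.product` and writes it as CSV. The helper names `config_rows` and `write_config` are hypothetical, and the column names are borrowed from the CSV format described for main_generator_parameters.py; the real config_gen.py may differ.

```python
import csv
import io
import itertools

FIELDS = ["clusters_num", "significant_num", "dummy_num", "standard_dev"]


def config_rows(clusters, significant, dummy, std_devs):
    """Yield one dict per combination of the given parameter values."""
    for combo in itertools.product(clusters, significant, dummy, std_devs):
        yield dict(zip(FIELDS, combo))


def write_config(rows, handle):
    """Write the rows as CSV, header first, to any text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer stands in for a file in the config folder.
buffer = io.StringIO()
write_config(config_rows([4], [2, 6], [2, 3], [0.05, 0.10]), buffer)
print(buffer.getvalue())
```

With 1 x 2 x 2 x 2 input values, the sketch writes a header plus 8 configuration rows.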


How to use

Configuration Parameters in main_generator.py

  • DATA_PATH: The name of the result path where generated datasets will be saved. Example: DATA_PATH = "data"
  • CLUSTERS_NUM: The number of clusters to be generated. It must not exceed SIGNIFICANT_NUM^2. Example: CLUSTERS_NUM = 4
  • INSTANCES: The number of instances per cluster in the generated datasets. Example: INSTANCES = 10
  • SIGNIFICANT_NUM: The number of significant columns forming the clusters. It limits CLUSTERS_NUM, which must satisfy CLUSTERS_NUM <= SIGNIFICANT_NUM^2 (with SIGNIFICANT_NUM = 3, up to 9 clusters). Example: SIGNIFICANT_NUM = 3
  • DUMMY_NUM: The number of dummy columns to be included, adding variability and noise to the datasets. Example: DUMMY_NUM = 3
  • STANDARD_DEV: The standard deviation for the Normal Distribution of data, influencing the spread of the generated values. Example: STANDARD_DEV = 0.05
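The parameters above could be set and sanity-checked as in the following sketch. The constant names and example values come from the list; the `check_parameters` helper is hypothetical, and it reads the cluster constraint as an upper bound (at most SIGNIFICANT_NUM^2 clusters), which is an interpretation rather than confirmed behaviour of main_generator.py.

```python
DATA_PATH = "data"       # output folder for generated datasets
CLUSTERS_NUM = 4         # number of clusters
INSTANCES = 10           # instances per cluster
SIGNIFICANT_NUM = 3      # columns that define the clusters
DUMMY_NUM = 3            # extra noise columns
STANDARD_DEV = 0.05      # spread of the normal distribution


def check_parameters(clusters_num, significant_num, instances,
                     dummy_num, standard_dev):
    """Raise ValueError if the parameters are inconsistent (hypothetical check)."""
    if clusters_num < 1 or instances < 1 or significant_num < 1:
        raise ValueError("clusters, instances and significant columns must be >= 1")
    if dummy_num < 0:
        raise ValueError("dummy_num must be >= 0")
    if standard_dev <= 0:
        raise ValueError("standard_dev must be positive")
    # Assumed reading of the constraint: an upper bound on the cluster count.
    if clusters_num > significant_num ** 2:
        raise ValueError(
            f"clusters_num={clusters_num} exceeds "
            f"significant_num^2={significant_num ** 2}"
        )


check_parameters(CLUSTERS_NUM, SIGNIFICANT_NUM, INSTANCES, DUMMY_NUM, STANDARD_DEV)
```

Running the check before generation surfaces a bad combination early instead of partway through writing files.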

Configuration Parameters in main_generator_parameters.py

The main_generator_parameters.py script generates datasets based on configuration parameters specified in a CSV file. The CSV file should have the following columns:

  • clusters_num: The number of clusters to be generated. It influences the structure of the datasets. Example: clusters_num = 4
  • significant_num: The number of significant columns forming the clusters. It limits clusters_num, which must satisfy clusters_num <= significant_num^2. Example: significant_num = 3
  • dummy_num: The number of dummy columns to be included, adding variability and noise to the datasets. Example: dummy_num = 3
  • standard_dev: The standard deviation for the Normal Distribution of data, influencing the spread of the generated values. Example: standard_dev = 0.05

To use this script, create a CSV file with these columns and corresponding values, and then run the script by specifying the path to your CSV file.

Example CSV file:

clusters_num,significant_num,dummy_num,standard_dev
4,2,2,0.10
4,6,3,0.05
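A configuration file in this format can be parsed with the standard `csv` module, as in the sketch below. The `load_configs` helper is hypothetical and only shows how each row maps to typed parameters; how main_generator_parameters.py actually reads the file is not documented here.

```python
import csv
import io

# The example file from above, inlined so the sketch is self-contained.
EXAMPLE = """clusters_num,significant_num,dummy_num,standard_dev
4,2,2,0.10
4,6,3,0.05
"""


def load_configs(handle):
    """Parse one configuration dict per CSV row, converting the types."""
    configs = []
    for row in csv.DictReader(handle):
        configs.append({
            "clusters_num": int(row["clusters_num"]),
            "significant_num": int(row["significant_num"]),
            "dummy_num": int(row["dummy_num"]),
            "standard_dev": float(row["standard_dev"]),
        })
    return configs


configs = load_configs(io.StringIO(EXAMPLE))
for config in configs:
    print(config)
```

To read a real file instead of the inlined text, pass `open(path, newline="")` as the handle.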


Contribution

🎉 We welcome and encourage community contributions to enhance this project. Whether you want to report issues, propose new features, or submit improvements, your collaboration is valuable.

How to Contribute

  1. Fork the Repository:

    • Fork the repository to your GitHub account.
  2. Clone the Repository:

    • Clone the forked repository to your local machine.
    git clone https://github.com/josemarialuna/RandomClustersGenerator.git
    cd RandomClustersGenerator
    


License

This project is licensed under the MIT License. See the LICENSE.md file for details.


Contact

  • José María Luna-Romera - Website

