Welcome to the Data Cleaning project for the NashvilleHousing dataset. This repository contains SQL scripts that clean the dataset and improve its quality and structure, making it more usable for analysis and reporting.
This project has two main goals:
- Data Integrity: Ensure the accuracy and consistency of the data by resolving discrepancies and standardizing formats.
- Enhanced Usability: Modify the dataset to make it more user-friendly for further analysis by splitting columns and standardizing addresses and other values.
Before starting, ensure you have the following:
- MySQL Server: A working installation of MySQL or a compatible SQL server.
- SQL Knowledge: Basic to intermediate knowledge of SQL syntax and operations.
- Development Environment: Access to a SQL-friendly environment like MySQL Workbench or a similar SQL management tool.
The project is based on the NashvilleHousing data, initially provided as a CSV file. This dataset includes various details about properties in Nashville, such as sale prices, addresses, and owner information.
- Initial CSV File: The data comes from Nashville Housing Data for Data Cleaning.csv, which is included in this repository for convenience.
Create a new database schema named HousingDataCleaning and connect to it.
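For example, in MySQL Workbench or the mysql command-line client:

```sql
-- Create the project schema and switch the session to it.
CREATE DATABASE IF NOT EXISTS HousingDataCleaning;
USE HousingDataCleaning;
```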
Follow these steps to configure your system for local file imports and set up your database tables:
- Enable Local File Import: Configure your MySQL server to allow local file imports.
- Create Tables: Execute the provided setup_database.sql script to create the NashvilleHousing table.
- Import Data: Populate the table with data from the included CSV file using the LOAD DATA LOCAL INFILE command, as sketched below.
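A minimal sketch of these three steps follows. The column list here is a simplified assumption; the authoritative table definition is in setup_database.sql, and the CSV path must match wherever you saved the file.

```sql
-- 1. Enable local file imports on the server (requires admin privileges).
--    The client must also allow it, e.g. mysql --local-infile=1.
SET GLOBAL local_infile = 1;

-- 2. Simplified table definition (assumed subset of columns);
--    the full definition lives in setup_database.sql.
CREATE TABLE IF NOT EXISTS NashvilleHousing (
    UniqueID        INT,
    ParcelID        VARCHAR(50),
    PropertyAddress VARCHAR(255),
    SaleDate        VARCHAR(50),
    SalePrice       VARCHAR(50),
    OwnerName       VARCHAR(255),
    OwnerAddress    VARCHAR(255)
);

-- 3. Import the CSV; the field order must match the file's header row.
LOAD DATA LOCAL INFILE 'Nashville Housing Data for Data Cleaning.csv'
INTO TABLE NashvilleHousing
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
```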
The SQL scripts in this repository are designed to perform a series of data cleaning operations, illustrated in the sketch after this list:
- Date Standardization: Convert all dates to the ISO 8601 format (YYYY-MM-DD).
- Address Standardization: Split property addresses into separate columns for address and city to improve granularity.
- Data Validation: Check and fill missing values to ensure data completeness.
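The actual logic lives in data_cleaning_scripts.sql; the sketch below only illustrates the idea behind each step. The column names (SaleDate, PropertyAddress, ParcelID, UniqueID) and the source date format are assumptions based on the raw CSV.

```sql
-- Date standardization: parse a text date such as 'April 9, 2013'
-- (assumed input format) and rewrite it as YYYY-MM-DD.
UPDATE NashvilleHousing
SET SaleDate = DATE_FORMAT(STR_TO_DATE(SaleDate, '%M %e, %Y'), '%Y-%m-%d');

-- Address standardization: split PropertyAddress ('123 Main St, Nashville')
-- into separate street and city columns.
ALTER TABLE NashvilleHousing
    ADD COLUMN PropertySplitAddress VARCHAR(255),
    ADD COLUMN PropertySplitCity    VARCHAR(100);

UPDATE NashvilleHousing
SET PropertySplitAddress = TRIM(SUBSTRING_INDEX(PropertyAddress, ',', 1)),
    PropertySplitCity    = TRIM(SUBSTRING_INDEX(PropertyAddress, ',', -1));

-- Data validation: fill missing property addresses from other rows
-- that share the same ParcelID.
UPDATE NashvilleHousing a
JOIN NashvilleHousing b
  ON a.ParcelID = b.ParcelID
 AND a.UniqueID <> b.UniqueID
SET a.PropertyAddress = b.PropertyAddress
WHERE a.PropertyAddress IS NULL
  AND b.PropertyAddress IS NOT NULL;
```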
- Data Setup: Begin by setting up your database and importing the initial data as outlined above.
- Script Execution: Run data_cleaning_scripts.sql to perform the cleaning operations.
- Verify Results: After running the scripts, review the database to confirm that the data has been cleaned and structured as expected; a few example checks follow this list.
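These checks are illustrative only and again assume the column names used in the sketches above:

```sql
-- Dates that failed to parse end up NULL; both counts should be zero.
SELECT COUNT(*) AS unparsed_dates
FROM NashvilleHousing
WHERE SaleDate IS NULL;

SELECT COUNT(*) AS missing_addresses
FROM NashvilleHousing
WHERE PropertyAddress IS NULL;

-- Confirm the new split columns are populated.
SELECT PropertySplitAddress, PropertySplitCity
FROM NashvilleHousing
LIMIT 10;
```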
Contributions to improve the scripts or documentation are welcome:
- Fork the Repository: Create your own fork and experiment with new cleaning techniques or improvements.
- Submit Pull Requests: If you have enhancements that could benefit the broader project, consider submitting a pull request.
- Query Optimization: Continuously refine SQL queries for better performance.
- Schema Refinement: Adapt the database schema as needed to optimize data usability and performance.
- Detailed Comments: Each SQL script includes comments to explain the operations performed and the SQL techniques used.
- Learning Resources: If you are new to SQL or need a refresher, online tutorials and courses can be beneficial.
The project is expected to evolve through:
- Enhanced Data Cleaning Features: Develop more sophisticated SQL scripts for data cleaning.
- Collaboration and Community: Engage with other users to share insights and improve the project collectively.
This project is licensed under the MIT License - see the LICENSE file for details.