Skip to content

A SQL-based project that cleans and transforms a housing dataset by standardizing data formats, filling missing values, and optimizing data structure for analysis.

License

Notifications You must be signed in to change notification settings

RafaelKarcz/DataCleaningSQL

Repository files navigation

Data Cleaning for NashvilleHousing Dataset

Overview

Welcome to the Data Cleaning project for the NashvilleHousing dataset. This repository contains SQL scripts designed to clean and enhance the data quality and structure of the NashvilleHousing dataset, making it more usable for analysis and reporting.

Purpose and Functionality

This project aims to:

  • Data Integrity: Ensure the accuracy and consistency of the data by cleaning discrepancies and standardizing formats.
  • Enhanced Usability: Modify the dataset to make it more user-friendly for further analysis by splitting columns and standardizing addresses and other values.

Prerequisites

Before starting, ensure you have the following:

  • MySQL Server: Installation of MySQL or a compatible SQL server.
  • SQL Knowledge: Basic to intermediate knowledge of SQL syntax and operations.
  • Development Environment: Access to a SQL-friendly environment like MySQL Workbench or a similar SQL management tool.

Data Source

The project is based on the NashvilleHousing data, initially provided as a CSV file. This dataset includes various details about properties in Nashville, such as sale prices, addresses, and owner information.

  • Initial CSV File: The data used can be sourced from the Nashville Housing Data for Data Cleaning.csv, included in this repository for convenience.

The Set-Up

Database Initialization

Create a new database schema named HousingDataCleaning and connect to it.

Table Creation and Data Import

Follow these steps to configure your system for local file imports and set up your database tables:

  • Enable Local File Import: Configure your MySQL server to allow local file imports.
  • Create Tables: Execute the provided setup_database.sql script to create the NashvilleHousing table.
  • Import Data: Populate the table with data from the included CSV file using the LOAD DATA LOCAL INFILE command.

Detailed Script Explanations

The SQL scripts in this repository are designed to perform a series of data cleaning operations:

  • Date Standardization: Convert all date formats to ISO standard.
  • Address Standardization: Split property addresses into separate columns for address and city to improve granularity.
  • Data Validation: Check and fill missing values to ensure data completeness.

Usage Instructions

  • Data Setup: Begin by setting up your database and importing the initial data as outlined above.
  • Script Execution: Execute the data_cleaning_scripts.sql to perform the cleaning operations.
  • Verify Results: After running the scripts, review the database to ensure that the data has been cleaned and structured as expected.

Contributing

Contributions to improve the scripts or documentation are welcome:

  • Fork the Repository: Create your own fork and experiment with new cleaning techniques or improvements.
  • Submit Pull Requests: If you have enhancements that could benefit the broader project, consider submitting a pull request.

Project Evolution and Learning Path

  • Query Optimization: Continuously refine SQL queries for better performance.
  • Schema Refinement: Adapt the database schema as needed to optimize data usability and performance.

Documentation and Learning Resources

  • Detailed Comments: Each SQL script includes comments to explain the operations performed and the SQL techniques used.
  • Learning Resources: If you are new to SQL or need a refresher, online tutorials and courses can be beneficial.

Future Development

The project is envisioned to evolve with:

  • Enhanced Data Cleaning Features: Develop more sophisticated SQL scripts for data cleaning.
  • Collaboration and Community: Engage with other users to share insights and improve the project collectively.

Licensing

This project is licensed under the MIT License - see the LICENSE file for details.

About

A SQL-based project that cleans and transforms a housing dataset by standardizing data formats, filling missing values, and optimizing data structure for analysis.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published