Welcome to the Crash Data Analytics BCG repository! 🚗📊 This project focuses on analyzing crash data using data analytics techniques and tools.
This repository contains code and resources for analyzing crash data:
- **Profiling**: input data is profiled for missing values, null checks, and category checks.
- **Config-oriented flexibility**: you can pass customized parameters for the analytics questions as input (any number of questions can be chosen for analysis), and choose `.csv` or `.parquet` files as the output of your analytics.
- **Framework**: the code is built as a package (wheel file), so it can be used as a plug-and-play module by installing the wheel.
- **SOLID principles**: class structure and components follow SOLID governance, along with custom logging, exception handling, docstrings, and user comments.
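The profiling step can be illustrated with a minimal, framework-agnostic sketch. The repository itself runs on PySpark; the helper, column names, and sample rows below are hypothetical and only show the idea behind the missing-value, null, and category checks:

```python
# Minimal sketch of the profiling idea (hypothetical helper; the real
# framework performs these checks on PySpark DataFrames).

def profile_rows(rows, columns):
    """Count missing/null values and distinct categories per column."""
    missing = {col: 0 for col in columns}
    categories = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value == "":
                missing[col] += 1           # null / missing-value check
            else:
                categories[col].add(value)  # distinct-category check
    return missing, categories

# Hypothetical sample rows, loosely modeled on the crash dataset.
rows = [
    {"CRASH_ID": 1, "VEH_BODY_STYL_ID": "MOTORCYCLE"},
    {"CRASH_ID": 2, "VEH_BODY_STYL_ID": None},
]
missing, categories = profile_rows(rows, ["CRASH_ID", "VEH_BODY_STYL_ID"])
```
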
Here's a quick overview of the repository structure:
```
.
├── LICENSE
├── README.md
├── configs
│   ├── analysis_details.json
│   └── driver.json
├── crashanalysis
│   ├── __init__.py
│   ├── __pycache__
│   ├── anlaysis.py
│   ├── jobSetup.py
│   ├── loggerlib.py
│   ├── supportfunctions.py
│   └── temp.py
├── dataset
│   ├── Data
│   └── Data.zip
├── dist
│   └── crashanalysis-0.0.1-py3-none-any.whl
├── docs
│   ├── BCG_Case_Study_CarCrash_Updated_Questions.docx
│   ├── Data Dictionary.xlsx
│   └── ~$G_Case_Study_CarCrash_Updated_Questions.docx
├── logs
│   └── data_anlysis_2024-08-25 18:11:25.116556.log
├── output_analysis
│   ├── analyze_crash_data
│   ├── count_hit_and_run_vehicles
│   ├── count_two_wheelers
│   ├── get_crashes_with_high_deaths
│   ├── get_top_5_vehicle_makes
│   ├── get_top_ethnic_groups_per_body_style
│   ├── get_top_states_with_highest_accidents
│   ├── get_top_vehicle_injuries
│   ├── get_top_vehicle_makes
│   └── get_top_zipcodes_with_alcohol_crashes
├── scripts
│   ├── driver.py
│   └── spark_submit.sh
└── setup.py
```
- `setup.py` builds the package.
- `scripts/driver.py` holds the main function, which uses the wheel and calls the other functions.
- `configs/` contains the driver and analysis JSONs; keeping them as separate JSONs maintains flexibility.
- Dataset conventions: https://www.nhtsa.gov/sites/nhtsa.gov/files/documents/cris_crisdw_data_dictionary_v4.4_sub4_17_12.pdf
- **Need to adjust the metrics for a given run? Edit the config as below:**
```json
{
    "analysis_number_parmas": {
        "all_anlysis": "Y",
        "get_crashes_with_high_deaths": "2",
        "count_two_wheelers": "",
        "get_top_vehicle_injuries": "3|5",
        "get_top_vehicle_makes": "5",
        "count_hit_and_run_vehicles": "",
        "get_top_states_with_highest_accidents": "1",
        "get_top_ethnic_groups_per_body_style": "",
        "get_top_zipcodes_with_alcohol_crashes": "",
        "analyze_crash_data": "",
        "get_top_5_vehicle_makes": ""
    }
}
```
  - `"2"` for `get_crashes_with_high_deaths`: males died per crash is more than 2.
  - `"3|5"` for `get_top_vehicle_injuries`: filters records with rank between 3 and 5.
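How the driver decodes these parameter strings is not spelled out in the repository; the helper below is a hypothetical sketch of one plausible way to interpret values such as `"2"`, `"3|5"`, and the empty string:

```python
# Hypothetical decoder for the parameter strings in the config above:
#   "2"    -> a single numeric threshold (e.g. deaths per crash)
#   "3|5"  -> a rank range, written lower|upper
#   ""     -> the analysis takes no parameter

def decode_param(raw):
    if raw is None or raw == "":
        return None
    if "|" in raw:
        low, high = raw.split("|", 1)
        return (int(low), int(high))
    return int(raw)

no_param = decode_param("")          # analysis needs no parameter
threshold = decode_param("2")        # single threshold
rank_range = decode_param("3|5")     # rank between 3 and 5
```
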
- **dataset**: load the datasets into the folder (please see the tree above).
- **config**: adjust the config for the required number of questions and their parameters.
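With the config in place, the driver can decide which analyses to run. The dispatch below is an assumption, not the repository's actual code: the function names mirror the config keys, but the registry and runner are hypothetical illustrations of the plug-and-play idea:

```python
# Hypothetical dispatch: when "all_anlysis" is "Y", run every analysis
# named in the config, passing each its raw parameter string.

def run_analyses(params, registry):
    results = {}
    run_all = params.get("all_anlysis", "N") == "Y"
    for name, raw in params.items():
        if name == "all_anlysis":
            continue
        if run_all and name in registry:
            results[name] = registry[name](raw)
    return results

# Toy registry with one stand-in analysis function.
registry = {"count_two_wheelers": lambda raw: "two-wheeler count"}
out = run_analyses({"all_anlysis": "Y", "count_two_wheelers": ""}, registry)
```
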
```shell
# basic lib installation
pip3 install -r requirements.txt

# framework wheel installation
pip3 install crashanalysis-0.0.1-py3-none-any.whl

# BASE_PATH (code base path) and SPARK_HOME (location where PySpark is installed)
export BASE_PATH=path_base/Crash_data_anlytics_BCG
export SPARK_HOME=/Users/subhash/Library/Python/3.12/lib/python/site-packages/

# $BASE_PATH is the base location for code reference
$SPARK_HOME/bin/spark-submit \
    --deploy-mode client \
    --py-files $BASE_PATH/dist/crashanalysis-0.0.1-py3-none-any.whl \
    $BASE_PATH/scripts/driver.py \
    $BASE_PATH/
```
- Dev notebooks can be found here; development was done in Google Colab (a Colab notebook and an offline copy are available).
- Loggers can be pulled from the loggers location in git.
- **Tech diagram**: to be added soon!
- **GitHub Actions YAML**: adding a workflow for CodeQL analysis and CI/CD.
- **Enhanced logging and exception handling**: enhancing the loggers and exception handling a bit more.
- **Visualisation addition**: optional params that can produce appropriate charts/graphs for the questions.