In this project, data on past seismic events is used to predict future earthquakes in the Hindu Kush mountain region. The dataset contains 22 columns and 14,698 rows.
Data preparation starts with choosing which columns to keep. This project uses mag, place, depth, nst, gap and rms, because these columns were judged to contribute the most and to be necessary for the models. The other columns were removed for one of four reasons: they carried information similar to another column, more than 50% of their rows were null, their information was not important for the models, or they held only ID values used for serial numbering of the data.
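The column selection described above can be sketched with pandas. This is a minimal illustration rather than the project's actual code: the toy frame, the `dmin` and `id` example columns, and the helper name `select_columns` are assumptions made for the example, and only the six kept column names and the 50% null threshold come from the text.

```python
import numpy as np
import pandas as pd

# Columns named in the text as the ones kept for modelling.
KEEP = ["mag", "place", "depth", "nst", "gap", "rms"]

def select_columns(df, keep=KEEP, max_null_frac=0.5):
    """Return the kept columns and the names of columns that are mostly null.

    A column is reported as mostly null when more than `max_null_frac`
    of its rows are missing.
    """
    null_frac = df.isna().mean()  # share of missing rows per column
    mostly_null = null_frac[null_frac > max_null_frac].index.tolist()
    return df[keep].copy(), mostly_null

# Hypothetical toy data standing in for the real catalogue.
toy = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "mag": [4.5, 4.8, np.nan, 5.1],
    "place": ["A", "B", "C", "D"],
    "depth": [10.0, 35.0, 80.0, 120.0],
    "nst": [np.nan, 20.0, 35.0, np.nan],
    "gap": [90.0, 110.0, np.nan, 75.0],
    "rms": [0.9, 1.1, 0.8, 1.0],
    "dmin": [np.nan, np.nan, np.nan, 1.2],
})

selected, mostly_null = select_columns(toy)
```

Here `mostly_null` only reports candidates for removal; the other removal reasons (similar information, irrelevance, ID columns) still need a manual decision.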
Some of the chosen columns contain null values that must be replaced. The replacement value can be the column's mean, median or mode. Choosing among them requires looking at the distribution of each column first, so a boxplot of the columns is shown below.
As the distributions above show, the mode is the appropriate replacement for null values here, because it keeps the distribution of each column the same after the values are replaced.
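As a rough sketch of these two steps, the snippet below first computes the statistics a boxplot draws (quartiles and points beyond the 1.5 × IQR whiskers) and then fills nulls with the mode. The helper names `boxplot_summary` and `fill_with_mode` and the toy values are hypothetical; when a column has several modes, this sketch takes the smallest one, which is an assumption and not something the text specifies.

```python
import numpy as np
import pandas as pd

def boxplot_summary(series):
    """Return the statistics a boxplot draws for one numeric column."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # NaN compares as False, so missing rows are never counted as outliers.
    outliers = series[(series < lower) | (series > upper)]
    return {
        "missing": int(series.isna().sum()),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "mean": float(series.mean()),
        "n_outliers": int(outliers.count()),
    }

def fill_with_mode(df, columns):
    """Return a copy of df with nulls in `columns` replaced by each column's mode."""
    out = df.copy()
    for col in columns:
        modes = out[col].mode(dropna=True)  # sorted; may hold several values
        if not modes.empty:
            out[col] = out[col].fillna(modes.iloc[0])
    return out

# Hypothetical toy data, not values from the real dataset.
toy = pd.DataFrame({
    "gap": [1.0, 2.0, 3.0, 4.0, 100.0, np.nan],
    "nst": [20.0, 20.0, np.nan, 35.0, 20.0, 40.0],
})

summary = boxplot_summary(toy["gap"])
filled = fill_with_mode(toy, ["gap", "nst"])
```

A large gap between `mean` and `median` in the summary, or a high `n_outliers`, signals a skewed column where the mean would be a poor replacement value.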
The following regression models were compared:
- Random Forest Regression
- LightGBM Regression
- XGBoost Regression
- Gradient Boosting Regression
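One way to compare models like those listed is to fit each on the same train/test split and score it with a single metric. The sketch below is an assumption-laden illustration: it uses synthetic data, test RMSE as the metric (the text does not say which metric was used), and only the two models that ship with scikit-learn. The `lightgbm` and `xgboost` packages provide scikit-learn-compatible `LGBMRegressor` and `XGBRegressor` classes that could be added to the same dictionary if those packages are installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def compare_models(models, X, y, seed=0):
    """Fit each model on one shared split and return its test RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        scores[name] = float(np.sqrt(mse))  # RMSE, in the units of the target
    return scores

# Synthetic stand-in for the prepared features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
scores = compare_models(models, X, y)
best = min(scores, key=scores.get)  # lowest RMSE wins
```

Using one shared split keeps the comparison fair; cross-validation would give a more stable ranking at the cost of extra training time.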
Among these models, LightGBM performed best. The residual scatter plot of this model is shown below to check for heteroskedasticity.
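A residual scatter plot of this kind can be produced as sketched below, plotting residuals against predicted values. The data is synthetic and scikit-learn's `GradientBoostingRegressor` stands in for the LightGBM model so the example needs no extra package; the helper name `residual_plot` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def residual_plot(y_true, y_pred, ax=None):
    """Scatter residuals (observed - predicted) against predicted values."""
    y_pred = np.asarray(y_pred)
    residuals = np.asarray(y_true) - y_pred
    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(y_pred, residuals, s=10, alpha=0.6)
    ax.axhline(0.0, color="black", linewidth=1)  # zero-error reference line
    ax.set_xlabel("Predicted value")
    ax.set_ylabel("Residual (observed - predicted)")
    return ax, residuals

# Synthetic stand-in for the prepared features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-in model (assumption); the text reports LightGBM as the best model.
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
ax, residuals = residual_plot(y_test, model.predict(X_test))
```

If the vertical spread of the points stays roughly constant across predicted values, the residuals look homoskedastic; a funnel or fan shape suggests heteroskedasticity. Calling `ax.figure.savefig(...)` with a file path writes the plot to disk.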