This case study aims to give an idea of applying EDA in a real business scenario. In this case study, we will apply the EDA techniques towards developing a basic understanding of risk analytics in banking and financial services and understand how data is used to minimise the risk of losing money while lending to customers.
This case study aims to identify patterns which indicate if a client has difficulty paying their instalments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.
This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.
After reading and inspecting the dataset, we performed missing data analysis and the below points can be considered towards missing data imputation:
- Columns with more than 50% missing data can be dropped
- Columns with 13% or lower missing values, can be imputed with mode, median or mean depending on the type of the column and data distribution
- Categorical columns with > 20% but < 50% missing data can have a new type such as ‘Unknown’ for missing data imputation.
- The columns representing the number of enquires to Credit Bureau about the client are of float data type. We can change the data type to int.
- DAYS related columns and CNT_FAM_MEMBERS can be changed to int data type as they represent number of days and family member count respectively.
- DAYS related columns have some negative values. They can be replaced with their respective absolute values.
- We can create a new column based on DAYS_BIRTH to show the age of the applicant for better readability and then we can drop the DAYS_BIRTH column. Similarly we can convert the other DAYS columns to represent the value in years.
- The CODE_GENDER column has XNA values that can be replaced with nan.
- AMT_INCOME_TOTAL has very high valued outliers. As we know the income may vary from person to person, we can cap value here and get rid of very high incomes.
- DAYS_EMPLOYED has huge outliers. Some data points are showing close to 1000 years in service which is impossible. We can cap the value at a desired point after analysing the quantiles.
We may want to bin the applicants ages into certain categories to be able to draw some insights such as - whether the loan defaulters majorly fall into any certain age groups or which age groups are much likely to repay on time etc.
- The 'TARGET‘ column represents whether the client is a defaulter or not. If we segregate our dataset based on this column, and if the distribution turns out to be 50-50, then our data set would be BALANCED. In any other case, it would be considered as IMBALANCED.
- As per the plot, we can say that our data set is imbalanced with almost 8% defaulters. Rest all 92% were able to repay the loans.
- We then created 2 data sets to segregate our original data based on the TARGET column values to have defaulters in one dataframe and others in another.
Cash loan type contracts have higher defaulters.
- Defaulters are more in 25-40 age group.
- Above 40, the number of defaulters tends to decrease.
Most of the defaulters are in low income range.
AMT_CREDIT is strongly correlated to AMT_ANNUITY and AMT_GOODS_PRICE in both cases.
- Single and civil marriage applicants are more likely to default.
- People staying in rented apartment or with parents might default.
- Across income types, cash loan seems to be the popular contract type.
- Most of the people who have taken loans are working class and they have taken cash loans mostly compared to revolving loans.
- People who have taken cash loans are likely to default as well.
- Loan amount taken by businessman is higher compared to the other income types.
- People with maternity leave income type tend to default with higher credit amount.
- People having academic degree and higher education have more loan annuity amount compared to the other groups.
- People with lower secondary education level are more likely to fail repayment.
- People with academic degree or higher education, have no difficulty in repayment.
- With an increase in credit amount, annuity and goods price, the tendency to default decreases.
- High chances of defaulting for lower credit amount, annuity and goods price.
- After reading and inspecting the dataset, we performed missing data analysis .
- The columns with more than 50% missing data can be dropped
The DAYS columns seem to have some negative values. We must convert that to absolute value.
Annuity seems to have some higher data points. The outliers can be capped at 0.99.
There are certain highly priced goods after 0.95 quantile. Here we can set a cap value to ignore very high goods price.
Merge the application data frame and previous application data frame based on the column SK_ID_CURR. Post merging the duplicate columns those are required for our analysis can be renamed. We can drop the unwanted columns from the merged data frame.
The previous data also seems imbalanced.
- High number of defaults in case of cash loans followed by consumer loans in previous applications data.
- There are some missing contract types in the data.
- High number of defaulters have their loans approved in the past.
- Number of defaulters who have not used the offer is the minimum.
In past, most of the loans had credit amount in the lower range i.e. below 1 lakh.
Previous applications annuity was also mostly below 1 lakh.
AMT_CREDIT_PREV is highly correlated to AMT_APPLICATION and AMT_GOODS_PRICE_PREV.
Number of loans taken for Repairs purpose is higher compared to others. Hence, the number of defaulters might be high in this group.
- Most of the applications were rejected for rejection code - HC. It also has the higher number of defaulters.
- Rejection by system is very less.
People taking cash loans but have refused to name the purpose are more likely to default and people who have applied loan for buying a home are more likely to repay on time.
People who have unused offers are more likely to default even though they have comparatively high total income
- With an increase in credit amount, applied amount and goods price, the tendency to default decreases.
- High chances of defaulting for lower credit amount, applied amount and goods price.
- Age : Young people – 25 to 35 age group
- Income : Lower income group with a total income of less than 5 lakhs
- Occupation : Low-skill labourers, drivers, waiters/barmen staff
- Education : Lower / secondary education
- Gender : Males
- Income type : On maternity leave and unemployed
- Family status : Civil marriage, single/unmarried
- Housing type : Rented apartment or with parents
- Contract type: Cash loan
- Cash loan purpose : Repairs and urgent needs
- Previous loan status : Approved
- Age : Older people – above 50
- Income : Higher income group
- Occupation : Managers, High-skilled tech staff, Accountants
- Education : Higher education and academic degree
- Gender : Females
- Income type : Working class, businessmen and students
- Family status : Married
- Housing type : Own house/apartment
- Contract type: Revolving loan
- Cash loan purpose : Buying garage, home etc.
- Previous loan status : Unused offer
- Young males with lower secondary education and of lower income group and staying with parents or in a rented house, applying for low-range cash contract, should be denied.
- Females are likely to repay but not if they are on maternity leave. Hence, bank can reduce the loan amount for female applicants who are on maternity leave.
- Since people taking cash loans for repairs and urgent needs are more likely to default, bank can refuse them.
- Since the people who have unused offers are more likely to default even though they have comparatively high total income, they can be offered loan at a higher interest rate.
- Banks can target businessmen, students and working class people with academic degree/ higher education as they have no difficulty in repayment.
- Bank can also approve loans taken on purpose for buying home or garage as there less chances of defaulting.