In this project, recommendations and insights are generated by first using exploratory data analysis on the "Turkiye Student Evaluation" (TSE) dataset, which can be found here, and the "Bank Marketing Campaign" (BMC) dataset, which can be found here.
I then apply descriptive analytics to TSE, predictive analytics to BMC, and lastly, prescriptive analytics to the results from both datasets to recommend actions that a school (the TSE dataset) and a bank (the BMC dataset) can take to improve.
Section 1 - Company A (Descriptive Statistics)
Section 1 - Results Assessment and Conclusions
Section 2 - Company B (Predictive Statistics)
Section 2 - Results Assessment and Conclusions
Company A is a consulting firm for schools that recognises that student satisfaction is a KPI for most higher education institutes. There is a range of reasons why students may or may not be satisfied with their courses. The Turkiye Student Evaluation dataset gives a small insight into the complexities that drive student experience. As part of the descriptive analytics, a MANOVA test is used to indicate the factors that significantly affect the student experience.
Company B is a bank seeking data-driven recommendations for its marketing department. The data provided for this analysis is anonymised and consists of variables relating to a past and a recent marketing campaign. Using predictive analytics, a linear and a logistic regression model were built to identify the variables that are strong predictors of a client's age and of whether they have a housing loan.
The aim of this report is to determine what factor(s) may or may not drive student satisfaction at HigherEdCo Ltd. A data analytical approach will be applied to data from the “Turkiye Student Evaluation” dataset to assess the effect of select variables on each other and ultimately on student satisfaction.
The “Turkiye Student Evaluation” dataset comprises 5,820 observations of student evaluation scores for 28 Likert-scale style questions (variables) and 5 other variables capturing the ‘instr’ (instructor type), ‘class’ (course type), ‘nb.repeat’ (number of times students took the test), their 'attendance', and the ‘difficulty’ of the course as perceived by students.
Figure 1
Figure 1 above shows that instructor 3 has almost twice as many students as instructors 1 and 2 combined. The credibility of instructor impact in the analysis may be weakened by this because of the unequal instructor groups.
Figure 2
Figure 2 above shows that there is an uneven range of student observations per course. The credibility of course (class) impact in the analysis may be weakened by this because of the unequal course groups.
Figure 3
Figure 3 above shows that the mean score given by students per question (28 survey questions) is option/answer 3, the midpoint of the scale. This suggests the answers may not be an accurate reflection of the students’ opinions.
As is the case with most analytical projects, working with fewer variables is ideal for delivering clear and consistent insights. With this in mind, the suitability of the 28 survey question variables for a factor analysis was tested using the Kaiser-Meyer-Olkin (KMO) measure. As shown in the table below, all of the variables fall within the KMO range of 0.90 to 1.00, which suggests that the data is excellent for factor analysis.
Since the variables Q1 to Q28 acted as the measures for student satisfaction, points of correlation were extracted from them using PCA (Principal Component Analysis) and then validated further with parallel analysis. Both of these tests identified two components that represent the main areas of focus for the 28 survey questions. In the scree plot diagram below, components 1 and 2 both have a higher variance than the other components and a higher correlation to the 28 survey questions, as the elbow curve only begins to flatten at component 3.
Figure 4
The PCA then generated correlation scores for the two components based on the 28 survey questions. Component 2 scores correlated highly with Q1 to Q12, and component 1 scores correlated highly with Q13 to Q28, as shown in the table below.
Based on the questions posed, Q1 to Q12 are focused on course satisfaction, and Q13 to Q28 are focused on instructor satisfaction. Both of the components were added to the "TSE Analysed.csv" dataset under the following variable names:
- Component 1 is “InstrSat”
- Component 2 is “CourseSat”
The test to follow examines whether the type of instructor and course students are assigned has an effect on student satisfaction, and how large that effect is. This is broken up into two hypotheses: a null hypothesis (H0) and an alternative hypothesis (H1). The table below shows the name, classification, and category of each variable, followed by the two hypotheses.
InstrSat, CourseSat ~ Instr * Class (DV1, DV2 ~ IV1 * IV2)
H0: There is no effect of instr and/or class on InstrSat and CourseSat.
H1: There is an effect of instr and/or class on InstrSat and CourseSat.
For this analysis, a MANOVA test will be used to identify any significant effects. This test requires a few parametric assumptions to be met.
Figure 5: InstrSat (DV1)
Figure 6: CourseSat (DV2)
The parametric assumptions of normality and equal sample group sizes have both been validated, which means that hypothesis testing can now commence.
After running the omnibus test, H0 was rejected in favour of H1. The MANOVA showed a significant main effect of instructor [F(2,5806) = 39.64, p < 0.001, V = 0.03] and class [F(11,5806) = 11.55, p < 0.001, V = 0.04] on InstrSat and CourseSat.
Further investigation was carried out by running individual ANOVAs on the DVs. Both the MANOVA and the ANOVAs confirmed that instructors and courses have an effect on instructor and course satisfaction. However, the effect size (η²) and the Pillai-Bartlett trace (V) both indicate that their effect on these two dependent variables (DVs) is minor.
DV1 (InstrSat) ANOVA Results: The result of the ANOVA was significant for both instr (F(2,5806) = 22.44, p < 0.001, η² = 0.007) and class (F(11,5806) = 14.66, p < 0.001, η² = 0.03).
DV2 (CourseSat) ANOVA Results: The result of the ANOVA was significant for both instr (F(2,5806) = 57.29, p < 0.001, η² = 0.02) and class (F(11,5806) = 8.51, p < 0.001, η² = 0.03).
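To make the effect size reported above more concrete, the sketch below computes η² as the between-group sum of squares divided by the total sum of squares. It is only an illustration: the project itself was carried out in R, the scores are made-up values rather than the TSE data, and the function name `eta_squared` is hypothetical. It also uses a simplified one-way layout, whereas the report's ANOVAs include both instr and class.

```python
def eta_squared(groups):
    """Eta-squared for a one-way layout: SS_between / SS_total."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    # Total variation of every score around the grand mean.
    ss_total = sum((value - grand_mean) ** 2 for value in all_values)
    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(
        len(group) * ((sum(group) / len(group)) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total


# Hypothetical satisfaction scores for three instructors (NOT the TSE data).
scores_by_instructor = [
    [3.0, 3.2, 2.9, 3.1],
    [3.4, 3.3, 3.6, 3.5],
    [3.0, 3.1, 3.2, 2.9],
]
eta_sq = eta_squared(scores_by_instructor)
print(f"eta squared = {eta_sq:.3f}")
```

A value close to 0, such as the 0.007 to 0.03 reported above, means that group membership accounts for only a small share of the variation in scores, even when the F-test is significant.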
Confidence interval plots are another useful tool, as they support the significant results from our ANOVA tests by visualising any significant difference in the DVs across each IV. Figures 7 and 8 show the estimated mean of student scores (InstrSat and CourseSat) for each instructor with a 95% confidence interval (meaning that, if the sampling were repeated, about 95 out of 100 such intervals would be expected to contain the true mean score).
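As a rough illustration of what such an interval is, the sketch below computes an approximate 95% confidence interval for a group mean. It is a minimal sketch only: the project was done in R, the scores are hypothetical, the function name `mean_ci95` is made up, and the normal critical value 1.96 is assumed (a t critical value would be slightly wider for small groups).

```python
import math
import statistics


def mean_ci95(values):
    """Approximate 95% CI for the mean: mean +/- 1.96 * standard error."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean - 1.96 * standard_error, mean + 1.96 * standard_error


# Hypothetical instructor satisfaction scores for one instructor (NOT the TSE data).
instr_sat = [3.1, 2.8, 3.4, 3.0, 3.3, 2.9, 3.2, 3.1]
low, high = mean_ci95(instr_sat)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

When the intervals of two groups do not overlap on such a plot, the difference between their means is unlikely to be due to sampling variation alone.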
In Figure 7 below, we found that students under instructors 1 and 3 had the same level of instructor satisfaction, whereas students under instructor 2 had a higher level of instructor satisfaction than instructors 1 and 3.
Figure 7
In Figure 8 below, we found that the students from each instructor had different levels of course satisfaction. Students under instructor 1 had the highest level of course satisfaction, and students under instructor 3 had the lowest level of course satisfaction.
Figure 8
Figures 9 and 10 show the estimated mean of student scores (InstrSat and CourseSat) for each course with a 95% confidence interval (meaning that, if the sampling were repeated, about 95 out of 100 such intervals would be expected to contain the true mean score).
In Figure 9 below, we found that students shared roughly the same level of instructor satisfaction regardless of the course they were assigned. However, students from course 8 had the highest level of instructor satisfaction in comparison to the other courses.
Figure 9
In Figure 10 below, we found that course satisfaction did vary depending on the course students were assigned. Students from courses 2 and 10 had the highest level of course satisfaction in comparison to students from the other courses. However, as seen in Figure 2, the result for course 2 may be less reliable because its sample size differs markedly from that of course 10.
Figure 10
THSD Result 1
THSD Result 2
Following the significant results from the MANOVA and the individual ANOVAs, the types of instructors and the types of courses were compared for both DVs using a post-hoc test called Tukey HSD. The exact results can be seen in THSD results 1 and 2. The Tukey HSD test identifies which pairs of groups within a significant independent variable differ significantly from each other.
Based on the post-hoc tests, it appears that students assigned to instructor 2 were more satisfied with their instructor than students from other instructors. Students assigned to Instructor 1 were more satisfied with their course than students from other instructors.
The post-hoc tests also revealed that students on course 8 were more satisfied with their instructor than students from other courses, and students on courses 9 and 10 were more satisfied with their course than students on other courses.
Higher education institutes are built to provide a learning platform for all students. Since learning is the focal point of these institutes, they have a vested interest in assessing their overall effectiveness. It has been revealed that student satisfaction is a KPI for a satisfactory learning platform for most higher education institutes. An analysis of data capturing students’ opinions on a specific educational institution revealed that depending on the type of instructor and/or course a student is assigned, their level of satisfaction with the educational institution can be affected.
The aim of the report was to help institutions improve student satisfaction. This was accomplished through the following objectives:
- Determine what factors have an effect on student satisfaction and how much of an effect there is.
- Understand which factors have the most significant effect on student satisfaction in comparison to others.
Statistical charts were first used to understand the Turkiye dataset and the factors that may affect student satisfaction. They revealed a somewhat uneven spread of students across the different courses and instructors in the data. However, with the exception of course 12, there are at least 100 student observations per course and 750 student observations per instructor. This information helped to add credibility to the relevance of the analysis.
Before using inferential statistics, the dataset was modified because the large number of survey question variables captured in the dataset would have made it challenging to analyse. As a result, to aid efficiency and reduce the complexity of the analysis, a dimensionality reduction was carried out using a principal component analysis (PCA) and then a parallel analysis test (as a second confirmation). Two new factors or variables were formed: instructor satisfaction and course satisfaction (which together represent student satisfaction).
The inferential statistics method used (MANOVA) revealed that student satisfaction is affected by the type of instructor and course students were assigned; however, the effect they had was very small according to the dataset.
The analysis results indicate that the type of instructor and the course students are assigned to can have an effect on their overall satisfaction with an institution. To maintain student satisfaction levels, educational institutions may therefore need to ensure that all instructors consistently show a good teaching calibre and that the tutorial sessions for courses are tailored to the expectations and interests of students.
- The results of this dataset cannot be generalised as meaning the same effects for other educational institutions.
- Outliers affect the normality assumption.
- The data may not be an accurate representation of the opinions of students because of the data collection method (Likert-scale style questions) used.
- Loss of information because of the dimensionality reduction.
Being a part of the finance industry means that the German-Hellenic bank faces stiff competition on all fronts. With this in mind, anonymised data was collected for analysis. The aim of this report is to produce statistical information that can be used by the company’s marketing department to better understand their clients and to build more attractive services for customers.
This data comprises information for 45,211 clients (observations), with personal demographics, responses, and opinions captured in 17 fields (variables).
Figure 11
The chart above is a visual representation of the age of all clients in the dataset. Based on this chart, it appears that the analysis to follow may be most applicable to middle-aged clients, as the dataset mostly comprises clients between 30 and 50 years old.
Figure 12
The chart above is a visual representation of clients with or without a housing loan. It appears that the dataset has a fairly good proportion of both groups of clients. This will aid the statistical power of the analysis outcome because of the high number of observations.
A visual assessment of the data showed that variable 12 (duration) captures information and responses that can affect variable 17 (y), which causes some correlation between them. Since this can skew the performance of the model, variable 12 was removed before continuing the analysis.
Multiple linear regression modelling was chosen because it can help “make predictions, recognise patterns, and categorise objects” (R, 2021) on the dataset. In preparation for building a model, the data was first tested for any redundant data (this yielded no results) and then split into a train dataset (dftrain) and a test dataset (dftest). Although overfitting is unlikely due to the large number of observations (clients) recorded in the dataset, splitting the dataset allows a realistic evaluation of the model against data that was not used to train it.
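The idea of holding out part of the data can be sketched as follows. This is an illustrative Python stand-in only: the project itself was done in R, so the function `train_test_split`, the seed, and the use of row identifiers in place of the real client records are all assumptions. The 80/20 proportion and the names dftrain and dftest follow the report.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into a train and a test part."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in row identifiers; the BMC dataset has 45,211 client observations.
clients = list(range(45211))
dftrain, dftest = train_test_split(clients)
print(len(dftrain), len(dftest))
```

Every client lands in exactly one of the two parts, so the model never sees the test rows while it is being fitted. The exact row counts depend on how the splitting routine rounds, so they may differ slightly from those in the original R analysis.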
Out of this split, two separate datasets were created: dftrain with 80% and dftest with 20%. A full model was then built with ‘Age’ as the outcome variable and all other variables (except variable 12) as predictors. The model identified numerous predictors that vary in significance. Subsequently, to have the best fitting model for the data, the stepwise method was used to create a final model that has only the most impactful predictors from the full model.
This model was named ‘linstep’ and consists of the following variables:
Binary logistic regression modelling was chosen to build the second model around ‘housing’ because it helps to explore the relationship of predictors against the outcome variable (IBM, 2022) and also to assess that relationship through probabilities (percentages). The first phase of this method was the same as for the linear regression model, with the outcome variable changed to ‘housing’.
The final model was named ‘logstep’ and consists of the following variables:
Below is a summary of the regression models used:
The BIC scores for models 1 and 2 show that these stepwise models are the best of the models compared. Model 1 was reduced to 9 predictors that explain 42% of the variance in clients’ age, and Model 2 was reduced to 10 predictors that explain 26% of the variance in housing loans on the training sets. The Bayes Factor value represents how many times better the stepwise model fits the data than the full model.
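The relationship between BIC and the Bayes Factor can be sketched as below. This assumes the commonly used approximation BF ≈ exp((BIC_full − BIC_step) / 2); the report does not state how its Bayes Factor was obtained, so this is an assumption rather than a description of the original R workflow. The log-likelihoods, parameter counts, and sample size are hypothetical, as are the function names.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def approx_bayes_factor(bic_reference, bic_candidate):
    """Approximate Bayes Factor in favour of the candidate over the reference."""
    return math.exp((bic_reference - bic_candidate) / 2.0)


# Hypothetical values (NOT taken from the report's models).
bic_full = bic(log_likelihood=-1000.0, n_params=12, n_obs=500)
bic_step = bic(log_likelihood=-1001.0, n_params=9, n_obs=500)
bayes_factor = approx_bayes_factor(bic_full, bic_step)
print(f"BIC full = {bic_full:.1f}, BIC step = {bic_step:.1f}, BF = {bayes_factor:.0f}")
```

In this made-up case the stepwise model fits marginally worse but uses fewer parameters, so its BIC is lower and the Bayes Factor is greater than 1, which is the pattern the summary above describes.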
The linear model was significant overall [F(34,36136) = 761.7, p < 0.001], and the predictors mentioned above for ‘linstep’ are all significant. The logistic model was significant overall, and some of its predictors were significant. The t-values for the linear model (linstep) and the z-values for the logistic model (logstep) show this and can be seen below.
linstep model significant predictors:
logstep model significant predictors:
As mentioned previously, the dataset was split into two, with dftest held out and dftrain used to fit the model. To assess the performance of the linstep model, its predictive ability was tested against unseen data (dftest). Against dftest, the model had an R-squared of 0.42 (the same as on the training data), and the predicted client ages were fairly close to the actual ages.
Figure 13
The diagram above visualises the R-squared/performance of the model by plotting the predicted age of clients (predlin) against the actual age of clients (age) in dftest as a scatterplot. It also shows a positive linear relationship between predlin and age, as the points cluster diagonally along the line of best fit.
Figure 14
The plot above shows that the area of variance is very small, which reflects a high R-squared. This indicates that the model (linstep) built from dftrain was good at predicting the client’s age in dftest.
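For reference, the R-squared used to judge predictions on held-out data can be sketched as below. The ages are hypothetical and the function name `r_squared` is made up. The sketch uses the definition 1 − SS_residual / SS_total, which is an assumption: the original R tooling may instead report the squared correlation between predicted and actual values, which can give a slightly different number.

```python
def r_squared(actual, predicted):
    """R-squared as 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    # Squared prediction errors.
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared deviations of the actual values from their own mean.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


# Hypothetical actual and predicted client ages (NOT the BMC data).
age = [25, 32, 41, 47, 58, 36]
predlin = [30, 35, 38, 45, 50, 39]
test_r2 = r_squared(age, predlin)
print(f"R-squared on held-out data = {test_r2:.2f}")
```

A value of 1 means the predictions match the actual values exactly, while a value of 0 means the model does no better than always predicting the mean age.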
- Homoscedasticity
Figure 15
The scatterplot shows a typical fitted value vs. residual plot in which homoscedasticity is present.
- Normality
Figure 16
The Q-Q plot shows a normal distribution is followed.
For the second model, the dataset was again split into two, with dftest held out and dftrain used to fit the model. Log odds (coefficients) from the model were converted into probabilities. The summary of the probabilities shows that, generally, the type of job, method of contact, the last month of contact, and the outcome of the previous marketing campaign decreased the likelihood of a client having a housing loan. However, the levels blue-collar (job), unknown (contact), and May (month) greatly increase the likelihood of a client having a housing loan. This can be seen below.
Similar to Linstep, the predictive ability of Logstep was tested against unseen data (dftest). The predictions made represented the probability of clients having a housing loan. These predictions were quite good and can be found in the “BMC Analysed.csv” dataset.
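The conversion from log odds to probabilities mentioned above can be sketched as follows. The coefficients are hypothetical placeholders, not the fitted logstep values, and the project itself used R rather than Python. The function names are also made up for illustration.

```python
import math


def odds_ratio(coefficient):
    """Exponentiate a logistic regression coefficient to get an odds ratio."""
    return math.exp(coefficient)


def probability_from_log_odds(log_odds):
    """Logistic function: maps log odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients on the log-odds scale (NOT the fitted values).
coefficients = {"job: blue-collar": 0.45, "month: dec": -0.80}
for name, coef in coefficients.items():
    print(
        f"{name}: odds ratio = {odds_ratio(coef):.2f}, "
        f"probability = {probability_from_log_odds(coef):.2f}"
    )
```

A converted value above 0.5 (an odds ratio above 1) corresponds to a predictor that raises the likelihood of a housing loan, and a value below 0.5 corresponds to one that lowers it.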
A confusion matrix is a useful tool, as it summarises the correct and incorrect predictions in a single table. Before deriving the final confusion matrix table, however, a ROC curve chart was created to determine the best cut-off point for a more sensitive matrix.
Figure 17
The chart above is a ROC curve of the outcome variable in the test set. The higher the point is on the y-axis, the more precisely it will “measure the proportion of actual positives that are correctly identified” (Vadakkanmarveettil, 2015) in the confusion matrix. Based on the chart above, using 0.4 as the cut-off point will increase the number of positive predictions. Since the aim of this part of the analysis is to predict whether clients have a housing loan, having a more sensitive matrix aligns with this aim, as can be seen in the table below.
- 0 = clients without a housing loan
- 1 = clients with a housing loan
- TN = True Negative
- FP = False Positive
- FN = False Negative
- TP = True Positive
Finally, statistical measures were computed to show the model’s performance.
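The effect of the cut-off point on the confusion matrix can be sketched as below. The outcomes and predicted probabilities are hypothetical, not the logstep predictions, and the function names are made up. The cut-off of 0.4 is the one chosen in the report, and 0.5 is included only for comparison.

```python
def confusion_counts(actual, probabilities, cutoff=0.4):
    """Count TP, FP, TN and FN after thresholding predicted probabilities."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for outcome, probability in zip(actual, probabilities):
        predicted = 1 if probability >= cutoff else 0
        if predicted == 1:
            counts["TP" if outcome == 1 else "FP"] += 1
        else:
            counts["TN" if outcome == 0 else "FN"] += 1
    return counts


def sensitivity(counts):
    """Proportion of actual positives that were predicted as positive."""
    return counts["TP"] / (counts["TP"] + counts["FN"])


def specificity(counts):
    """Proportion of actual negatives that were predicted as negative."""
    return counts["TN"] / (counts["TN"] + counts["FP"])


# Hypothetical outcomes (1 = housing loan) and predicted probabilities.
housing = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_prob = [0.9, 0.45, 0.3, 0.42, 0.1, 0.2, 0.6, 0.55]

counts_04 = confusion_counts(housing, predicted_prob, cutoff=0.4)
counts_05 = confusion_counts(housing, predicted_prob, cutoff=0.5)
print(counts_04, sensitivity(counts_04), specificity(counts_04))
print(counts_05, sensitivity(counts_05), specificity(counts_05))
```

Lowering the cut-off classifies more clients as positive, which tends to raise sensitivity at the cost of specificity.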
- Multicollinearity
VIF values for the predictors in the logstep model are all less than 5, which indicates that there is no problematic multicollinearity between the predictors.
- Extreme outliers
There are outliers; however, they are not extreme.
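To illustrate the multicollinearity check, the sketch below computes a variance inflation factor (VIF) for the simplest case of two predictors, where VIF = 1 / (1 − r²) and r is their Pearson correlation. This is a simplification: with more predictors, each VIF comes from regressing one predictor on all the others. The data and function names are hypothetical, and the original analysis was done in R.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


def vif_two_predictors(x, y):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical values of two numeric predictors (NOT the BMC data).
predictor_a = [1, 2, 3, 4, 5]
predictor_b = [2, 1, 4, 3, 5]
vif = vif_two_predictors(predictor_a, predictor_b)
print(f"VIF = {vif:.2f}")
```

A VIF of 1 means the predictors are uncorrelated, and the threshold of 5 used above is a common rule of thumb for flagging problematic overlap.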
Businesses around the world are constantly seeking progressive improvements in the way they carry out business. Data-driven research has been found to be one of the most effective ways to help compose effective strategies in all aspects of a business. For this reason, an analysis was carried out on data containing client information for a bank.
Arising out of the analysis, a few factors that capture clients’ personal demographics, previous and present campaign information, and clients’ opinions stood out as being great predictors of clients’ age and housing loan status. For the bank, understanding them can help its marketing team re-evaluate their product or service approach and their strategic objectives.
The aim of the report is to derive useful data for the bank using a data-driven analysis. This was accomplished through the following objectives:
- Determine what factors can significantly predict the age of clients and whether or not they have a housing loan.
- Gauge the performance of these factors when making predictions.
The data used for the analysis featured a substantial amount of client information, which is ideal for any analysis. The analysis commenced by testing how well all other factors in the dataset predict client age and housing loan status (separately). For client age, the factors job, marital status, educational background, account balance, housing loan status, contact method used, month contacted, and clients’ opinions of the previous marketing campaign appear to be good predictors when modelled together. These predictors accounted for 42% of the variance in clients’ ages.
For the housing loan status of clients, the factors age, type of job, marital status, educational background, account balance, contact method used, day of last contact, month contacted, contacts during the present campaign, days post contact, and clients’ opinions of the previous marketing campaign appear to be good predictors when modelled together. To assess these predictors further, the change in the probability of a client having a housing loan in response to increases and decreases in each predictor was captured.
The summary of the probabilities shows that primarily the type of job, method of contact, the last month of contact, and the outcome of the previous marketing campaign decreased the likelihood of a client having a housing loan. However, being in a blue-collar job, being contacted by an unknown method, and/or being contacted during the month of May greatly increased the likelihood of a client having a housing loan.
The analysis results revealed that there are quite a few factors that can act as predictors for clients’ age and housing loan status, as mentioned previously. With this in mind, the bank's marketing department should make their goods and services more appealing to clients by being more strategic and specific about the clients to be targeted for marketing campaigns, products, offers, and/or promotions.
The analysis has also revealed that external elements like the month and day should most certainly be considered, as they were both proven to be significant predictors of what kind of clients are engaged. For example, a house loan promotion with low interest rates for clients under 30 years old can be run at the end of the month. This would attract clients in blue-collar jobs because of its availability (as these clients typically receive their salary around that time) and because the majority of clients at this bank are under 30 years old, based on the data.
- The significance of the test may become biased, as the p-value may be affected by the normality issue.
- Linear correlation between independent variables (predictors) diminishes their predictive power.
- The data collected could be inaccurate or outdated, which would skew the outcome of the analysis.
Gunduz, G. & Fokoue, E. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu]. Irvine, CA: University of California, School of Information and Computer Science.
S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
R, S., 2021. Data Flare Up. [Online] Available at: https://www.dataflareup.com/why-use-linear-regression-models-instead-of-neural-networks/ [Accessed 22 December 2022].
IBM, 2022. What is logistic regression? [Online] Available at: https://www.ibm.com/topics/logistic-regression [Accessed 22 December 2022].
Vadakkanmarveettil, J., 2015. Sensitivity vs. Specificity in Logistic Regression. [Online] Available at: https://www.jigsawacademy.com/sensitivity-vs-specificity-in-logistic-regression/ [Accessed 03 January 2023].
- Exploratory Data Analysis (PCA, Univariate analysis, Bivariate analysis)
- Descriptive Analytics
- Predictive Analytics
- Prescriptive Analytics
- R
- psych
- caret
- ggplot2
- paran
- lsr
- verification
- car
- sjPlot
- stats