Predicting Customer Loan Default
Quick Summary
Goal: Develop a predictive model to assess the likelihood of customer loan defaults using simulated financial and demographic data.
Process: The project followed the CRISP-DM framework, progressing through Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. The model was built using a Random Forest classifier to predict default risk based on key financial and demographic factors.
Insights:
The final model achieved 82% accuracy, exceeding the target of 70%.
Key risk indicators:
Customers with low checking account balances (CHK_ACCT) were more likely to default.
Longer loan durations (DURATION) correlated with increased risk of default.
Higher loan amounts (AMOUNT) also increased default likelihood.
The model's specificity (56%) was lower than its sensitivity (93%), meaning it was better at identifying non-defaulting customers than defaulting ones.
Ethical considerations were addressed to minimize potential biases in the dataset.
Project Summary
The Predicting Customer Loan Default project aimed to simulate a real-world credit risk assessment process by developing a machine learning model capable of identifying high-risk loan applicants. The project leveraged simulated historical customer data to train a Random Forest model, enabling the financial institution to reduce financial losses and improve lending decisions.
By applying the CRISP-DM framework, each phase of the project—ranging from data exploration to model evaluation—was systematically executed to ensure accuracy and business relevance.
Project Environment
Tools and Technologies Used:
RStudio – Data cleaning, exploratory data analysis, and model building.
Random Forest model – Used for predictive analysis, ensuring robustness in classification.
Context & Business Purpose:
The project was conducted as a simulation to explore credit risk prediction techniques.
High-risk applicants pose a financial impact of 150% of the remaining loan balance, making accurate predictions crucial.
Constraints & Limitations:
The dataset was simulated, meaning real-world noise and unpredictability were not present.
Specificity (correctly identifying defaulting customers) was lower than desired (56%), indicating room for model improvement.
Scope and Project Steps
Objectives:
Develop a predictive model for loan default classification.
Identify key financial and demographic indicators that contribute to default risk.
Improve the institution’s ability to mitigate financial losses by integrating the model into daily credit risk assessments.
Major Steps in Analysis:
Data Collection & Understanding: Analyzed 31 financial and demographic variables related to loan defaults.
Data Cleaning & Preparation:
Removed irrelevant variables (e.g., job title, number of dependents) to prevent noise.
Ensured data consistency and formatting for optimal model performance.
Feature Engineering & Model Building:
Random Forest model trained on 70% of the data, tested on the remaining 30%.
Hyperparameters optimized (1,000 trees, 4 variables per split, 5 samples per node).
Model Evaluation:
Accuracy: 82% (exceeding the target of 70%).
Sensitivity: 93% (high ability to detect non-defaulting customers).
Specificity: 56% (moderate ability to identify defaulting customers).
Ensuring Accuracy & Reliability:
Applied descriptive statistics and correlation analysis to determine key predictive variables.
Performed feature selection to eliminate variables that did not contribute to model performance.
Data Sources and Data Gathering
Source:
The dataset was simulated and did not contain real financial data.
Cleaning and Preparation:
Data cleaning was conducted in RStudio, with steps including:
Checking for missing values (none found).
Identifying outliers (no extreme anomalies detected).
Formatting categorical and numerical variables appropriately.
Challenges:
No major challenges in data collection, as the dataset was pre-structured for simulation purposes.
Data Checks and Summary Metrics
Key Statistics Analyzed:
Descriptive statistics: Average loan duration was 21 months, and the average applicant age was 35 years.
Correlation analysis:
CHK_ACCT (Checking Account Balance) had a strong negative correlation with default risk(customers with lower balances were more likely to default).
DURATION had a strong positive correlation (longer loan terms increased default likelihood).
AMOUNT (Loan Size) was a key risk factor, contributing significantly to the model’s decision-making.
Unexpected Trends & Patterns:
The model struggled more with predicting defaults (56% specificity) than non-defaults (93% sensitivity).
Loan applicants with frequent travel had a higher default risk, a pattern worth investigating further.
Combining CHK_ACCT and SAV_ACCT into a "cash-on-hand" metric could improve model performance.
Modeling and Evaluation
Selected Model: Random Forest
Key Metrics:
Accuracy: 82% (higher than the target of 70%).
Sensitivity: 93% (strong ability to detect non-defaulting customers).
Specificity: 56% (moderate ability to detect defaulting customers).
AUC (Area Under Curve): 84%, indicating a strong classification model.
Improvements Considered:
Feature Engineering: Merging checking and savings account balances could enhance predictive accuracy.
Hyperparameter Tuning: Adjusting mtry, nodesize, or number of trees could improve specificity.
Addressing Class Imbalance: Balancing data between defaulting and non-defaulting customers to reduce bias.
Deployment & Business Impact
Deployment Plan:
The model was developed as a simulation and not deployed in a real banking environment.
If implemented in real-world applications, a batch process would run each night, scoring new loan applicants and flagging high-risk cases for review.
Ethical Considerations:
The dataset was synthetic and did not contain actual customer financial information.
Variables with potential bias (e.g., marital status for men only) were removed to reduce discrimination risk.
Future iterations should confirm customer consent for data usage in real-world scenarios.
Main Takeaways
A predictive model can accurately assess loan default risk, even in a simulated setting.
Key risk indicators include checking account balance, loan duration, and loan amount.
The model performed well in predicting non-defaults but could be improved in detecting defaults.
Future refinements, such as feature engineering and hyperparameter tuning, could further enhance performance.
Though a simulation, this project demonstrates how machine learning can assist in financial risk assessment.