Predicting Customer Loan Default

Quick Summary

  • Goal: Develop a predictive model to assess the likelihood of customer loan defaults using simulated financial and demographic data.

  • Process: The project followed the CRISP-DM framework, progressing through Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. The model was built using a Random Forest classifier to predict default risk based on key financial and demographic factors.

  • Insights:

    • The final model achieved 82% accuracy, exceeding the target of 70%.

    • Key risk indicators:

      • Customers with low checking account balances (CHK_ACCT) were more likely to default.

      • Longer loan durations (DURATION) correlated with increased risk of default.

      • Higher loan amounts (AMOUNT) also increased default likelihood.

    • The model's specificity (56%) was lower than its sensitivity (93%), meaning it was better at identifying non-defaulting customers than defaulting ones.

    • Ethical considerations were addressed to minimize potential biases in the dataset.


Project Summary

The Predicting Customer Loan Default project aimed to simulate a real-world credit risk assessment process by developing a machine learning model capable of identifying high-risk loan applicants. The project leveraged simulated historical customer data to train a Random Forest model, enabling the financial institution to reduce financial losses and improve lending decisions.

By applying the CRISP-DM framework, each phase of the project—ranging from data exploration to model evaluation—was systematically executed to ensure accuracy and business relevance.

Project Environment

  • Tools and Technologies Used:

    • RStudio – Data cleaning, exploratory data analysis, and model building.

    • Random Forest model – Used for predictive analysis, ensuring robustness in classification.

  • Context & Business Purpose:

    • The project was conducted as a simulation to explore credit risk prediction techniques.

    • High-risk applicants pose a financial impact of 150% of the remaining loan balance, making accurate predictions crucial.

  • Constraints & Limitations:

    • The dataset was simulated, meaning real-world noise and unpredictability were not present.

    • Specificity (correctly identifying defaulting customers) was lower than desired (56%), indicating room for model improvement.

Scope and Project Steps

  • Objectives:

    • Develop a predictive model for loan default classification.

    • Identify key financial and demographic indicators that contribute to default risk.

    • Improve the institution’s ability to mitigate financial losses by integrating the model into daily credit risk assessments.

  • Major Steps in Analysis:

    1. Data Collection & Understanding: Analyzed 31 financial and demographic variables related to loan defaults.

    2. Data Cleaning & Preparation:

      • Removed irrelevant variables (e.g., job title, number of dependents) to prevent noise.

      • Ensured data consistency and formatting for optimal model performance.

    3. Feature Engineering & Model Building:

      • Random Forest model trained on 70% of the data, tested on the remaining 30%.

      • Hyperparameters optimized (1,000 trees, 4 variables per split, 5 samples per node).

    4. Model Evaluation:

      • Accuracy: 82% (exceeding the target of 70%).

      • Sensitivity: 93% (high ability to detect non-defaulting customers).

      • Specificity: 56% (moderate ability to identify defaulting customers).

  • Ensuring Accuracy & Reliability:

    • Applied descriptive statistics and correlation analysis to determine key predictive variables.

    • Performed feature selection to eliminate variables that did not contribute to model performance.

Data Sources and Data Gathering

  • Source:

    • The dataset was simulated and did not contain real financial data.

  • Cleaning and Preparation:

    • Data cleaning was conducted in RStudio, with steps including:

      • Checking for missing values (none found).

      • Identifying outliers (no extreme anomalies detected).

      • Formatting categorical and numerical variables appropriately.

  • Challenges:

    • No major challenges in data collection, as the dataset was pre-structured for simulation purposes.

Data Checks and Summary Metrics

  • Key Statistics Analyzed:

    • Descriptive statistics: Average loan duration was 21 months, and the average applicant age was 35 years.

    • Correlation analysis:

      • CHK_ACCT (Checking Account Balance) had a strong negative correlation with default risk(customers with lower balances were more likely to default).

      • DURATION had a strong positive correlation (longer loan terms increased default likelihood).

      • AMOUNT (Loan Size) was a key risk factor, contributing significantly to the model’s decision-making.

  • Unexpected Trends & Patterns:

    • The model struggled more with predicting defaults (56% specificity) than non-defaults (93% sensitivity).

    • Loan applicants with frequent travel had a higher default risk, a pattern worth investigating further.

    • Combining CHK_ACCT and SAV_ACCT into a "cash-on-hand" metric could improve model performance.

Modeling and Evaluation

  • Selected Model: Random Forest

  • Key Metrics:

    • Accuracy: 82% (higher than the target of 70%).

    • Sensitivity: 93% (strong ability to detect non-defaulting customers).

    • Specificity: 56% (moderate ability to detect defaulting customers).

    • AUC (Area Under Curve): 84%, indicating a strong classification model.

  • Improvements Considered:

    • Feature Engineering: Merging checking and savings account balances could enhance predictive accuracy.

    • Hyperparameter Tuning: Adjusting mtry, nodesize, or number of trees could improve specificity.

    • Addressing Class Imbalance: Balancing data between defaulting and non-defaulting customers to reduce bias.

Deployment & Business Impact

  • Deployment Plan:

    • The model was developed as a simulation and not deployed in a real banking environment.

    • If implemented in real-world applications, a batch process would run each night, scoring new loan applicants and flagging high-risk cases for review.

  • Ethical Considerations:

    • The dataset was synthetic and did not contain actual customer financial information.

    • Variables with potential bias (e.g., marital status for men only) were removed to reduce discrimination risk.

    • Future iterations should confirm customer consent for data usage in real-world scenarios.

Main Takeaways

  • A predictive model can accurately assess loan default risk, even in a simulated setting.

  • Key risk indicators include checking account balance, loan duration, and loan amount.

  • The model performed well in predicting non-defaults but could be improved in detecting defaults.

  • Future refinements, such as feature engineering and hyperparameter tuning, could further enhance performance.

  • Though a simulation, this project demonstrates how machine learning can assist in financial risk assessment.

Dataset/Files

Github

PDF Report