
Credit risk analysis

JPMorgan Chase & Co. - Quantitative Research Job Simulation from Forage

Part 2: Job Simulation

1. Introduction

Retail banks face significant risk when issuing loans because borrowers may default on their payments. This project develops a model using a Random Forest algorithm to predict each borrower's probability of default (PD). Estimating PD allows banks to quantify potential losses and set aside appropriate reserves, supporting financial stability.

2. Data Description

The provided loan data is assumed to be in a tabular format with the following features:

  • customer_id (int64): Unique identifier for each borrower.

  • credit_lines_outstanding (int64): Total amount of credit lines currently available to the borrower across all accounts.

  • loan_amt_outstanding (float64): Current outstanding balance on the specific loan being analyzed.

  • total_debt_outstanding (float64): Total outstanding debt of the borrower across all sources (loans, credit cards, etc.).

  • income (float64): Borrower's annual income.

  • years_employed (int64): Number of years the borrower has been employed at their current job.

  • fico_score (int64): Borrower's credit score, typically ranging from 300 to 850, with higher scores indicating better creditworthiness.

  • default (boolean): Indicates whether the borrower has defaulted on a loan in the past (True) or not (False). This is the target variable for the model.

The data was clean, and no inconsistencies were detected.
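
To ground the later steps, here is a minimal loading-and-inspection sketch; the file name is a placeholder for the provided dataset, and the DataFrame is stored in a variable named data that the later snippets reuse.

import pandas as pd

# Load the provided loan data; the file name is a placeholder.
data = pd.read_csv("loan_data.csv")

# Quick sanity checks: shape, column types, and missing values.
print(data.shape)
print(data.dtypes)
print(data.isna().sum())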

3. Model Development with Random Forest

3.1. Feature Engineering

While the provided features offer valuable insights, we may consider creating additional features to enhance the model's predictive power (a short code sketch follows the list):

  • Debt-to-Income Ratio: This ratio (total debt / income) can provide a clearer picture of the borrower's financial burden.

  • Credit Utilization Ratio: This ratio (loan amount / credit line) indicates how much of the available credit the borrower is using.
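
A minimal sketch of these two derived features, assuming the DataFrame data from Section 2 and the column names listed there; the new column names are illustrative.

import numpy as np

# Hypothetical derived features, following the ratios defined above.
data['debt_to_income'] = data['total_debt_outstanding'] / data['income']
data['credit_utilization'] = (
    data['loan_amt_outstanding'] / data['credit_lines_outstanding'].replace(0, np.nan)
)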

3.2. Random Forest Model

A Random Forest algorithm will be employed for its effectiveness in credit risk analysis. This ensemble method combines multiple decision trees, improving accuracy and reducing overfitting.

  • The model will be trained on the prepared data, including both original and potentially derived features.

  • Hyperparameter tuning will be performed to optimize the model's performance. This involves adjusting parameters such as the number of trees in the forest and the maximum depth of each tree; a minimal training sketch follows.
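
A minimal training and tuning sketch, assuming scikit-learn, the DataFrame data from Section 2, and illustrative grid values; the feature list and split parameters are assumptions, not prescriptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Feature matrix and target (add the derived ratios here if they were created).
feature_cols = ['credit_lines_outstanding', 'loan_amt_outstanding',
                'total_debt_outstanding', 'income', 'years_employed', 'fico_score']
X = data[feature_cols]
y = data['default']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Small, illustrative grid over the two hyperparameters mentioned above.
param_grid = {'n_estimators': [100, 300], 'max_depth': [5, 10, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
model = grid.best_estimator_
print("Best parameters:", grid.best_params_)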







4. Model Evaluation

Evaluating the model's performance is crucial to ensure its reliability. We will use metrics such as the following (computed in the sketch after the list):

  • AUC-ROC (Area Under the Receiver Operating Characteristic Curve): This metric assesses the model's ability to differentiate between defaulters and non-defaulters. A higher AUC-ROC indicates better performance.

  • Confusion Matrix: A table that visually summarizes how many predictions were correct (TP, TN) and incorrect (FP, FN) for each class (defaulter, non-defaulter).

  • Classification Report: Provides detailed metrics (precision, recall, F1-score) to evaluate how well the model identifies defaulters and non-defaulters.
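
A short evaluation sketch, assuming the fitted model and the held-out test split from the training sketch above.

from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report

# Predicted default probabilities and class labels on the held-out test set.
y_prob = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)

print("AUC-ROC:", roc_auc_score(y_test, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))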


5. Expected Loss Calculation

An expected loss function will be created to estimate potential losses associated with each loan. This function will consider:

  • Predicted Probability of Default (PD): Obtained from the trained Random Forest model for a particular loan applicant.

  • Recovery Rate: This represents the percentage of the loan amount that can be recovered in case of default. A typical value might be 10%.

  • Loan Amount: The total amount of the loan being considered.

The expected loss is calculated by multiplying the predicted PD by the loss given default: Expected Loss = PD × (1 − Recovery Rate) × Loan Amount.
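
A minimal sketch of such a function, assuming the fitted model from above and the 10% recovery rate mentioned earlier; the helper name and sample loan are illustrative.

def expected_loss(model, loan_features, loan_amount, recovery_rate=0.10):
    """Expected loss for one loan: PD * (1 - recovery rate) * loan amount."""
    pd_estimate = model.predict_proba(loan_features)[:, 1][0]  # predicted probability of default
    return pd_estimate * (1 - recovery_rate) * loan_amount

# Example: expected loss for the first loan in the test set (illustrative).
sample = X_test.iloc[[0]]
print(expected_loss(model, sample, loan_amount=sample['loan_amt_outstanding'].iloc[0]))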

6. Conclusion

This Random Forest model offers a valuable tool for predicting loan default probabilities. By estimating potential losses through the expected loss function, banks can make informed decisions regarding loan approvals, credit limits, and reserve allocation.

7. Future Considerations

  • Explore the performance of other machine learning algorithms like Logistic Regression or Gradient Boosting for comparison.

  • Analyze feature importance to identify the factors with the most significant impact on default risk. This can help refine future data collection efforts.

  • Regularly monitor and update the model as new loan data becomes available to maintain its accuracy and effectiveness.



Sub-task 2: Quantization for FICO Scores

Introduction:

This project aims to develop a Random Forest model for predicting loan default probabilities (PD) in the retail banking arm. However, a member of the risk team wants to ensure the model's effectiveness for future data sets with potentially different FICO score distributions.

Objective:  Develop a general approach to categorize FICO scores into a predefined number of buckets (corresponding to input labels for the model) while preserving the most relevant credit risk information.

Challenges:

  • Mapping FICO Scores to Ratings: We need to create a "rating map" that translates FICO scores into a rating system, with lower ratings corresponding to better FICO scores (lower credit risk).

Solution Approaches:

  1. Mean Squared Error (MSE) Quantization:

  • View this as an approximation problem.

  • Aim to minimize the squared error between the actual FICO score and a single representative value assigned to each bucket.

  2. Log-likelihood Quantization:

  • This is a more sophisticated approach that maximizes a log-likelihood function.

  • This function considers the impact of quantization on:

  • Discretization "roughness" (how well the buckets capture the data distribution)

  • Default density within each bucket (concentration of high-risk borrowers). Concretely, for bucket i with n_i records and k_i defaults, the bucket default probability is p_i = k_i / n_i, and the log-likelihood sums k_i * ln(p_i) + (n_i - k_i) * ln(1 - p_i) over all buckets.

Implementation Strategies:

  • The log-likelihood approach can be tackled by dividing it into subproblems:

  • Example: first find the best bucket boundaries over the lower part of the FICO range (say, scores up to 600), then extend that partial solution incrementally to the full range up to 850.

  • This allows for an incremental (step-by-step) solution using dynamic programming techniques.
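
A compact dynamic-programming sketch of the log-likelihood bucketing, separate from the optimization-based implementation below; the function name is hypothetical, and the approach assumes 0/1 default flags aligned with the scores. It returns the bucket edges together with the achieved log-likelihood.

import numpy as np

def ll_buckets_dp(fico_scores, defaults, num_buckets):
    """Pick bucket edges over the unique FICO values that maximize the total log-likelihood."""
    defaults = np.asarray(defaults, dtype=float)
    values, inverse = np.unique(fico_scores, return_inverse=True)
    n = np.bincount(inverse)                                  # records per unique score
    k = np.bincount(inverse, weights=defaults)                # defaults per unique score
    cn = np.concatenate(([0], np.cumsum(n)))
    ck = np.concatenate(([0.0], np.cumsum(k)))

    def bucket_ll(i, j):                                      # log-likelihood of values[i:j]
        nn, kk = cn[j] - cn[i], ck[j] - ck[i]
        if nn == 0 or kk == 0 or kk == nn:
            return 0.0
        p = kk / nn
        return kk * np.log(p) + (nn - kk) * np.log(1 - p)

    m = len(values)
    best = np.full((num_buckets + 1, m + 1), -np.inf)
    prev = np.zeros((num_buckets + 1, m + 1), dtype=int)
    best[0][0] = 0.0
    for b in range(1, num_buckets + 1):                       # buckets used so far
        for j in range(b, m + 1):                             # first j unique values covered
            for i in range(b - 1, j):                         # start index of bucket b
                cand = best[b - 1][i] + bucket_ll(i, j)
                if cand > best[b][j]:
                    best[b][j], prev[b][j] = cand, i
    # Walk back through the stored split points to recover the bucket edges.
    edges, j = [values[-1]], m
    for b in range(num_buckets, 0, -1):
        j = prev[b][j]
        edges.append(values[j])
    return sorted(edges), best[num_buckets][m]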

import numpy as np
from scipy.optimize import minimize

def quantize_fico_scores(fico_scores, defaults, num_buckets):
    """
    Quantizes FICO scores into the specified number of buckets by minimizing the
    within-bucket squared error, then evaluates the log-likelihood of the observed
    defaults under the resulting bucketing.

    Args:
        fico_scores (numpy.ndarray): Array of FICO scores.
        defaults (numpy.ndarray): Array of 0/1 default flags aligned with fico_scores.
        num_buckets (int): Number of desired buckets.

    Returns:
        Tuple of (bucket boundaries, mean squared error, log-likelihood).
    """
    fico_scores = np.asarray(fico_scores, dtype=float)
    defaults = np.asarray(defaults, dtype=float)
    lo, hi = fico_scores.min(), fico_scores.max()

    # Start from equally spaced inner boundaries
    initial_boundaries = np.linspace(lo, hi, num_buckets + 1)[1:-1]

    # Define the mean squared error of approximating each score by its bucket mean
    def squared_error(inner_boundaries):
        # Sort the inner boundaries so the optimizer cannot produce crossing buckets;
        # the top edge is pushed just past the maximum so the last bucket includes it.
        edges = np.concatenate(([lo], np.sort(inner_boundaries), [hi + 1]))
        total = 0.0
        for i in range(num_buckets):
            bucket = fico_scores[(fico_scores >= edges[i]) & (fico_scores < edges[i + 1])]
            if bucket.size:  # empty buckets contribute nothing
                total += np.sum((bucket - bucket.mean()) ** 2)
        return total / fico_scores.size

    # Minimize the mean squared error over the inner boundaries
    result = minimize(squared_error, initial_boundaries,
                      bounds=[(lo, hi)] * (num_buckets - 1),
                      method='L-BFGS-B')
    boundaries = np.concatenate(([lo], np.sort(result.x), [hi]))

    # Assign each score to a bucket index in 0 .. num_buckets - 1
    bucket_indices = np.clip(np.searchsorted(boundaries, fico_scores, side='right') - 1,
                             0, num_buckets - 1)

    # Count records and defaults per bucket
    n_per_bucket = np.bincount(bucket_indices, minlength=num_buckets)
    k_per_bucket = np.bincount(bucket_indices, weights=defaults, minlength=num_buckets)

    # Log-likelihood: sum over buckets of k*ln(p) + (n - k)*ln(1 - p), with p = k / n
    p = np.divide(k_per_bucket, n_per_bucket,
                  out=np.zeros_like(k_per_bucket), where=n_per_bucket > 0)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = k_per_bucket * np.log(p) + (n_per_bucket - k_per_bucket) * np.log(1 - p)
    log_likelihood = np.nansum(terms)  # 0 * ln(0) terms become NaN and are dropped

    # Calculate the mean squared error at the optimized boundaries
    mse = squared_error(result.x)

    # Print bucket boundaries, mean squared error, and log-likelihood
    print("Bucket boundaries (Squared Error):", boundaries)
    print("Mean Squared Error:", mse)
    print("Log-Likelihood:", log_likelihood)

    return boundaries, mse, log_likelihood

# Example usage (assumes the loan DataFrame `data` described above)
fico_scores = data['fico_score'].values
defaults = data['default'].values
num_buckets = 5
bucket_boundaries, mse, log_likelihood = quantize_fico_scores(fico_scores, defaults, num_buckets)
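
To address the rating-map challenge above, here is a small hypothetical helper that turns the computed boundaries into ratings, with a lower rating corresponding to a better score.

def fico_to_rating(score, boundaries):
    """Map a FICO score to a rating; rating 1 is the best (highest-score) bucket."""
    num_buckets = len(boundaries) - 1
    bucket = np.clip(np.searchsorted(boundaries, score, side='right') - 1, 0, num_buckets - 1)
    return num_buckets - bucket

ratings = np.array([fico_to_rating(s, bucket_boundaries) for s in fico_scores])
print(np.unique(ratings, return_counts=True))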

Benefits:

  • Quantization helps the model generalize better to unseen data by focusing on relevant trends within FICO score ranges instead of specific score values.

  • This can improve the model's robustness and applicability to future loan applications with potentially different score distributions.

Next Steps:

  • Research and implement specific algorithms for both MSE and log-likelihood quantization approaches.

  • Evaluate the impact of quantization on the model's performance using metrics like AUC-ROC.

  • Choose the quantization method that offers the best balance between accuracy and generalizability.


Conclusion:

By incorporating FICO score quantization, this project aims to enhance the Random Forest model's generalizability and effectiveness in predicting loan defaults for future loan applications.
