Computational Experiment to Compare Techniques in Large Datasets to Measure Credit Banking Risk in Home Equity Loans
Abstract:
In the 1960s, coinciding with the massive demand for credit cards, financial companies needed a method to assess their exposure to insolvency risk, and they began applying credit-scoring techniques. In the 1980s, credit-scoring techniques were extended to loans due to the increased demand for credit and to computational progress. In 2004, new recommendations of the Basel Committee on banking supervision (known as Basel II) appeared. With the ensuing global financial crisis, a new document, Basel III, appeared, introducing more demanding requirements on the control of borrowed capital.
Nowadays, one of the main problems not yet addressed is the presence of large datasets. This research focuses on calculating probabilities of default in home equity loans and on measuring the computational efficiency of some statistical and data mining methods. To do this, several Monte Carlo experiments with well-known techniques and algorithms have been developed.
These computational experiments reveal that large datasets require Big Data techniques and algorithms that yield faster and unbiased estimators.
1. Introduction
There is a variety of methodologies available for assessing credit risk, ranging from the personalized study by an expert in risk analysis to different statistical and econometric Credit Scoring methods. However, as a first step, it is not feasible to apply such expert-specific analyses to the study of home equity loans.
Credit Scoring methods are more efficient, more objective and more consistent in their predictions, so they can be used to analyze and decide on a large number of credit applications quickly and inexpensively. Credit Scoring can be considered, as observed by some authors, as a way to identify different groups within a population. One of the first proposals to solve this problem was introduced in statistics by [1], using discriminant analysis, a multivariate statistical technique, to distinguish three varieties of plants by physical measurements. [2] was the first to recognize that the same statistical techniques could be used to discriminate between good and bad loans.
Credit scoring and, in general, credit rating systems allow the risk associated with a banking operation to be assessed automatically. The risk may depend on the characteristics of the customer and of the credit, such as solvency, type of credit, maturity, loan amount and other features inherent to financial operations. It is an objective system in which the approval of credit does not depend on the discretion of the analyst, and it must be automatic in order to reduce costs and processing time.
For an automatic evaluation, it is necessary to use fast and adaptive techniques, such as machine learning, to calculate the probability of default from massive historical datasets within a reasonable period of time.
In the 1960s, the United States began to develop and apply Credit Scoring techniques for credit risk assessment, in order to estimate the probability of default [3]. From 1970, Credit Scoring models were based on statistical techniques (in particular, discriminant analysis), and their use was then generalized in 1990 [4]. Better statistical resources were developed as technology progressed, since it was necessary for financial institutions to make their risk assessment more effective and efficient.
The use of credit scoring models is not only due to the generalization of credit; banking regulation and supervision have also encouraged their use over the past three decades. Financial and credit institutions are subject to the so-called 'prudential policy', which means that sufficient equity must be maintained to ensure smooth operation and to cover the several risks to which they are subject, including credit risk [5].
During the late twentieth and early twenty-first centuries, there has been economic growth and consumer credit has increased spectacularly. The need for financial institutions to increase their market share is a current reality: the larger the volume of credit granted by a company, the greater its potential benefits, but this should be linked to an increase in quality, because otherwise the end result would be a significant deterioration in income. Statistical methods for assessing credit risk have therefore become increasingly important [6].
Since Basel II, the use of advanced credit scoring methods has become a regulatory requirement for banks and financial institutions, in order to improve the efficiency of capital allocation. Basel III introduces more demanding requirements on the control of borrowed capital, and financial institutions must increase their reserves according to their risk. Any improvement in the accuracy of credit risk assessment, even a small one, is a potential benefit to financial institutions. Over the past decades, several investigations have compared different methods for measuring risk.
Today, credit scoring models are based on mathematics, econometric techniques and artificial intelligence [7, 8]. Empirical studies by various authors present alternative approaches that compare different techniques and algorithms (decision trees, logistic regression, discriminant analysis, parametric and non-parametric methods, support vector machines, etc.); see [9 – 30].
All the methods suggested in the referenced scientific literature are suitable for classifying good or bad credit. Each methodology analyzed by the different authors has its own advantages and disadvantages. The method or algorithm to be used depends on the structure of the data, the features used, the possibility of separating the classes by means of these features and the purpose of the classification [31, 32].
Therefore, the scientific literature has not solved the problem efficiently. In addition, there is an increase in financial operations, with a consequent increase in the volume of databases. The volume of the databases managed by financial companies is so great that it is necessary to address this problem. Big Data techniques applied to massive financial datasets in order to segment risk groups are the solution: Big Data helps to extract the value of the data and thus make better decisions without the runtime component whose high cost makes the problem intractable. In this paper, two methods for solving the problem of credit scoring in home equity loans are proposed. First of all, we measure how well a loan can be classified and the cost in terms of execution time. To evaluate this, different Monte Carlo simulation experiments are performed.
The execution time component may be important in deciding which method should be applied, due to the massive volume of data: a computationally efficient method provides advantages in terms of the time required to resolve requests and can therefore be much more competitive. Our main goal is to compare credit scoring methods that can be both effective and efficient.
In Section 2, the methods used in our research are presented. In Section 3, the simulation experiments are developed. In Section 4, the results and several efficiency measures are shown. Finally, in Section 5, conclusions and recommendations are offered.
2. The Models
Let us consider two methods supported by two different models. One of them is a classical statistical procedure, Quadratic Discriminant Analysis (QDA). The second one is a data mining class of procedures, Support Vector Machines (SVM).
Quadratic Discriminant Analysis is a more advanced technique than the Linear Discriminant Analysis (LDA) formulated by Fisher [1]. LDA is a classifier that assumes a homogeneous covariance matrix for all classes. QDA does not assume that the covariance matrix of each class is homogeneous, which can lead to better classification [33], and the QDA algorithm is more recommended than LDA for large datasets [34]. In our research we use the MASS [35] package of the R software [36] with two discriminant variables.
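As an illustration, a minimal R sketch of how QDA could be fitted with the MASS package is given below; the data frames `train` and `test` and the variable names are hypothetical and do not correspond to the actual simulation code.

```r
# Minimal sketch, assuming hypothetical data frames 'train' and 'test' with a
# binary factor target 'y' and two discriminant variables 'x1' and 'x2'.
library(MASS)

qda_fit  <- qda(y ~ x1 + x2, data = train)           # estimates one covariance matrix per class
qda_pred <- predict(qda_fit, newdata = test)$class   # predicted class labels for the test set
```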
SVM algorithms are supervised models used to analyze the binary class labels of a response variable. In an SVM, a hyperplane is built in order to separate the observations for classification. Several SVM algorithms can be found in the literature; in this research we use an SVM with a linear kernel (LSVM), which is closely related to a linear programming problem. In our research we use the e1071 [37] package of the R software [36].
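Analogously, a minimal sketch of a linear-kernel SVM with the e1071 package is shown below, under the same hypothetical `train` and `test` data frames as in the QDA sketch.

```r
# Minimal sketch of a linear-kernel SVM (LSVM) with e1071, using the same
# hypothetical 'train' and 'test' data frames as in the QDA sketch.
library(e1071)

svm_fit  <- svm(y ~ x1 + x2, data = train, kernel = "linear")  # 'y' must be a factor for classification
svm_pred <- predict(svm_fit, newdata = test)                   # predicted class labels
```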
For the sake of brevity we omit the detailed formulas, since they are easy to find in the literature.
3. Simulation Experiments
The Monte Carlo simulation experiment is designed to compare the success rate of well-classified loans for the QDA and LSVM techniques and the execution time that each of them involves.
The numerical results of the simulation study are available at agustinperez.edu.umh.es/academia/research/papers/ under the Big Data 2016 tag.
Two sets of random data are generated in order to obtain training and testing datasets. The training dataset is used to obtain the model parameters (QDA and LSVM), which are then used to predict the target variable in the testing dataset. These predictions are used to calculate the success rate, based on the number of correct classifications. Each dataset is generated from a mixed regression model (fixed effects and a random effect) as follows (an R sketch of this data-generating process is given after the list):
For $i=1, \ldots, I, j=1, \ldots, n_i$:
- First explanatory variable: $x_{ij1}=\left(b_i-a_i\right) U_{ij}+a_i$, with $U_{ij}=\frac{j}{n_i+1}$, $a_i=1$, $b_i=1+\frac{1}{I}(I+i)$.
- Other explanatory variables: $x_{ij2}, \ldots, x_{ijp}$ are generated from a uniform distribution.
- Random effects and errors: $u_i \sim N\left(0, \sigma_u^2=1\right)$, $e_{ij} \sim N\left(0, \sigma_e^2=1\right)$.
- Target variable: Calculate:
$y_{ij}=\beta_0+\beta_1 x_{ij1}+\cdots+\beta_p x_{ijp}+u_i+e_{ij}, \quad \text{with } \beta_0=\cdots=\beta_p=1$
- Recategorize target variable to success and default cases:
$\begin{aligned}& y_{i j} \leq \operatorname{median}(y) \Rightarrow y_{i j}=0 \\& y_{i j}>\operatorname{median}(y) \Rightarrow y_{i j}=1\end{aligned}$
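The following R function is a minimal sketch of this data-generating process; the function name and its arguments are ours, and details such as variable names may differ from the actual code used in the experiments.

```r
# Sketch of the data-generating process described above (one dataset).
# I: number of groups, n_i: common group size, p: number of explanatory variables.
generate_dataset <- function(I, n_i, p) {
  i  <- rep(1:I, each = n_i)                    # group index
  j  <- rep(1:n_i, times = I)                   # within-group index
  a  <- 1
  b  <- 1 + (1 / I) * (I + i)
  x1 <- (b - a) * (j / (n_i + 1)) + a           # first explanatory variable
  X  <- cbind(x1, if (p > 1) matrix(runif(I * n_i * (p - 1)), ncol = p - 1))
  colnames(X) <- paste0("x", seq_len(p))
  u  <- rep(rnorm(I, mean = 0, sd = 1), each = n_i)  # random effects u_i
  e  <- rnorm(I * n_i, mean = 0, sd = 1)             # errors e_ij
  y  <- 1 + rowSums(X) + u + e                  # beta_0 = ... = beta_p = 1
  y  <- factor(as.integer(y > median(y)))       # y_ij <= median(y) -> 0, y_ij > median(y) -> 1
  data.frame(y = y, X)
}
```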
The simulation experiment follows these steps (an R sketch of the loop is given after the list):
1. Repeat $K=10^4$ times $(k=1, \ldots, K)$
1.1. Generate training and testing datasets of size $n=\sum_{i=1}^I n_i$
1.2. Calculate the model parameters with the training dataset.
1.3. Calculate the confusion matrices for QDA and LSVM with the testing dataset.
1.4. Calculate the success rate from the elements of the main diagonal of the confusion matrix, and record the total elapsed time of both methods.
2. Calculate the average success rate and the average time for each method.
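A compact sketch of this loop, reusing the hypothetical generate_dataset() function above, could look as follows; the success rate is computed from the main diagonal of the confusion matrix and the elapsed time is measured with system.time().

```r
# Sketch of the Monte Carlo loop for one combination (I, n_i, p) with K repetitions.
library(MASS)   # qda()
library(e1071)  # svm()

run_simulation <- function(K, I, n_i, p) {
  rate <- matrix(NA, K, 2, dimnames = list(NULL, c("QDA", "LSVM")))
  time <- matrix(NA, K, 2, dimnames = list(NULL, c("QDA", "LSVM")))
  for (k in 1:K) {
    train <- generate_dataset(I, n_i, p)   # step 1.1: training dataset
    test  <- generate_dataset(I, n_i, p)   #           testing dataset

    # Steps 1.2-1.4 for QDA
    t_qda <- system.time({
      fit  <- qda(y ~ ., data = train)
      pred <- predict(fit, newdata = test)$class
    })["elapsed"]
    cm <- table(pred, test$y)                      # confusion matrix
    rate[k, "QDA"] <- sum(diag(cm)) / sum(cm)      # success rate (main diagonal)
    time[k, "QDA"] <- t_qda

    # Steps 1.2-1.4 for LSVM
    t_svm <- system.time({
      fit  <- svm(y ~ ., data = train, kernel = "linear")
      pred <- predict(fit, newdata = test)
    })["elapsed"]
    cm <- table(pred, test$y)
    rate[k, "LSVM"] <- sum(diag(cm)) / sum(cm)
    time[k, "LSVM"] <- t_svm
  }
  # Step 2: average success rate and average time for each method
  list(avg_success_rate = colMeans(rate), avg_time = colMeans(time))
}
```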
The simulations are carried out for the 4 combinations of sizes (records) presented in Table 1.
For each combination of Table 1, 5 groups of explanatory variables x have been included, with p = 1, 2, 10, 50, 100 explanatory variables. With these values, we finally generated and analyzed $40 \times 10^4$ datasets for the two methods.
All the simulations and procedures have been developed on a dedicated Intel Xeon E5620 server with a 64-bit Debian Squeeze Linux operating system, 8 CPUs at 2.4 GHz and 24 GB of DDR3 RAM, and implemented in the R software [36].
4. Results
Firstly, in the simulation experiment we focus our attention on the success rate of the QDA and LSVM methods. In most of the combinations of datasets (16 out of 20), the LSVM method emerges as the best procedure in terms of success rate (see Figure 1). The percentage of well-classified loans increases with the number of explanatory variables, from 64.5% to 88.43%. For large datasets (5000 records) and a large number of explanatory variables, LSVM has better success rates; when the number of explanatory variables is equal to or less than 10, the differences are negligible.
After searching for the best method in terms of prediction efficiency, let us examine the results of our computational problem on Big Data. It is well known that LSVM is one of the slowest methods existing today; in our research we have tried to relate this to the increase in the number of explanatory variables. In Figure 2, the average execution times are plotted. For the sake of better visualization, only the values for p = 1, 2, 10 have been plotted. It can be seen that the QDA method is the fastest, and LSVM becomes far slower than QDA as the number of records n increases.
Table 1. Combinations of sizes (records) used in the simulations.

g:    1     2     3     4
I(g): 10    20    30    50
nᵢ:   100   100   100   100
n:    1000  2000  3000  5000

In Figure 2, the execution time seems to grow exponentially. The relationship between increases in p and increases in execution time could be:
• Constant. This is impossible, because in Figure 2 the time grows.
• Linear. The ratio between times is proportional to the number of explanatory variables.
• Exponential. The ratio between groups of times grows in a multiplicative way.
Figure 3 has been created in order to observe how the execution time increases with the number of explanatory variables. We have calculated the ratio between the execution time for each p and the execution time for p = 1, weighted by the increase with respect to p = 1, as follows:

$\text{Relative Increment for } p=\frac{\operatorname{time}(p)}{p \times \operatorname{time}(p=1)}$
For example, the relative time increment for $p=50$ is $\frac{\operatorname{time}(p=50)}{50 \times \operatorname{time}(p=1)}$. The $p$ in the denominator acts as a modulator: if the ratio between times were proportional to the number of explanatory variables, the value of the Relative Increment would equal 1. As can be seen in Figure 3, the values of the Relative Increment are all less than 1, which means that increasing the number of explanatory variables does not increase execution time proportionally.
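As a small numerical illustration of this formula, with made-up times that are not the results of the experiment, the relative increments can be computed as follows.

```r
# Illustrative only: hypothetical average execution times (seconds) for each p.
p        <- c(1, 2, 10, 50, 100)
avg_time <- c(0.010, 0.018, 0.070, 0.300, 0.550)   # made-up values, not experimental results

relative_increment <- avg_time / (p * avg_time[1]) # time(p) / (p * time(p = 1))
round(relative_increment, 2)                       # values below 1 indicate sub-proportional growth
```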
5. Conclusions
In our simulation experiment we attempted to create files that can represent the loans of any bank branch. For this reason we simulated datasets from n = 1000 to n = 5000 records and from p = 1 to p = 100 explanatory variables.
Two methods have been proposed, QDA and LSVM, and measures of effectiveness and efficiency have been calculated. Regarding effectiveness, we have found that LSVM is usually the best method for estimating credit risk. However, in terms of computational efficiency, LSVM takes more time than QDA to solve the same problem; in the worst case, the LSVM method takes 20 times more runtime than QDA.
A linear relationship between runtime and the number of explanatory variables has been found. It would be very productive to find a precise functional relationship between runtime and the number of explanatory variables, and it would also be very appropriate to extend the comparison to a greater number of procedures.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors are grateful to the Conselleria de Educación, Generalitat Valenciana, for the financial support under grant GVA/2016/053, which has made this research possible.
The authors declare that they have no conflicts of interest.
