Abstract (English)

Credit risk assessment provides an estimate of the likelihood that a client will not be able to pay back a given loan during a specific performance period. The aim of credit risk measurement is to classify borrowers into two groups, good clients (non-defaulters) and bad clients (defaulters), using predictive models and historical information on clients’ repayment behaviour. Credit risk measurement is one of the key factors in the long-term success of financial institutions and the sustainability of the credit business. It is widely used in financial institutions for lending decisions and asset quality monitoring, and it is one of the most widespread applications of statistical models and data mining methods in practice. Correctly assessing the likelihood of a client defaulting is therefore of utmost importance, and any increase in the accuracy of default prediction is of great interest to financial institutions. Data used for credit risk evaluation is often imbalanced, nonhomogeneous, incomplete and noisy, and exhibits nonlinear interdependencies. In imbalanced data sets the number of observations representing the minority class (bad clients) is much lower than that of the majority class (good clients). Class imbalance has been identified as one of the key challenges in data mining and has received a lot of attention in recent years. The performance of standard classification algorithms tends to deteriorate when class imbalance is present, which is particularly problematic in credit risk measurement because the cost of misclassification is highly asymmetrical. In nonhomogeneous data sets, groups of instances with similar characteristics can be identified within the initial data set. Class imbalance often varies across these groups of clients with similar features, which makes the classification process even more complex.
The purpose of this thesis is to explore improving the classification accuracy of credit risk models in these settings by using a hybrid data mining framework with adaptive evolutionary clustering and data preprocessing techniques, designed specifically for imbalanced data sets. The thesis is organized into six chapters. The introductory chapter summarizes the research motivation, background work, research goal, hypothesis, methodology and plan. Chapter 2 provides a literature review and theoretical overview, focusing on data mining methods, namely classification and clustering algorithms, credit risk models and genetic algorithms. A lot of research effort has been devoted to evaluating classification algorithms in credit scoring, ranging from traditional statistical methods and machine learning algorithms to hybrid and ensemble classifiers and, in recent years, deep learning methods. A number of benchmark studies have compared the classification accuracy of different classification algorithms. However, there is no consensus among researchers as to which algorithms yield the best performance; it seems that the choice of algorithm should take into account the domain, the characteristics of the data set and the performance criterion. Nevertheless, for credit risk assessment, hybrid models that combine several individual methods have shown promising results. The design and further improvement of hybrid models in order to improve prediction accuracy represents the direction in which future research is headed. The aim of clustering is to group data instances into clusters in which instances resemble each other more strongly than they resemble instances belonging to other groups. Research on the impact of clustering on the classification accuracy of credit risk models is limited; it has focused mainly on using clustering methods to identify representative examples prior to applying classification algorithms.
The theoretical overview is followed by a description of data-related issues common in credit risk assessment and of approaches for addressing them. Such issues occur very frequently, as the data used to assess credit risk often contains incorrect or missing values and is imbalanced and nonhomogeneous. Adequate attention should be paid to them, as non-representative instances or instances with incorrect values can impair the quality of the predictive model. A number of approaches have been proposed to deal with imbalanced data sets. These can be broadly categorized into three groups: 1) data preprocessing techniques that alter the underlying class distribution in order to diminish the impact of class imbalance, 2) cost-sensitive learning, where the higher misclassification cost of minority class instances is factored in when constructing the classifier in order to minimize the overall cost, and 3) adjusting classification algorithms to take the class imbalance problem into account. A few comparisons of these approaches have been performed, but they reached different conclusions on which approach yields the highest accuracy, indicating that further analysis is needed. Special attention should be paid to the choice of performance metric when evaluating predictive models built from imbalanced data sets: classification performance should be evaluated using metrics that take the skewed class distribution into account. To summarize, the focus of research so far has been on proposing new algorithms and comparing the classification accuracy of different classification algorithms, whereas improving classification accuracy through clustering approaches has been somewhat disregarded.
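The first of these approaches, altering the underlying class distribution, can be illustrated with a minimal sketch of random oversampling and undersampling. This is a generic illustration of the idea, not the specific techniques evaluated in the thesis, and all names are hypothetical:

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority instances until the classes balance."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class the size of the minority."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Hypothetical portfolio: 95 good clients (majority), 5 defaulters (minority)
good_clients = [("good", i) for i in range(95)]
bad_clients = [("bad", i) for i in range(5)]

over_maj, over_min = random_oversample(good_clients, bad_clients)
under_maj, under_min = random_undersample(good_clients, bad_clients)
```

Oversampling keeps all available information but repeats minority instances, while undersampling discards majority instances; the more sophisticated, heuristic-guided variants discussed in the thesis choose which instances to duplicate or discard rather than picking them at random.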
Therefore, the impact of clustering on the classification accuracy of credit risk models built on imbalanced, nonhomogeneous, incomplete, noisy and nonlinear input data sets, where both the level of imbalance and the misclassification cost vary, was identified as a research gap. The central part of the thesis, the hybrid AEGiK framework based on adaptive evolutionary clustering, is proposed and described in detail in Chapter 3. The AEGiK framework is designed to identify clusters in input data, which can be imbalanced, nonhomogeneous, noisy or incomplete, as a preparatory phase for classification, in order to increase classification accuracy. The adaptive evolutionary mechanism repeats a multi-stage process in each step of the algorithm, consisting of feature selection, clustering, feature selection for classification within each identified cluster, and classification. As a result, the cluster configuration and classification methods are continuously evaluated and evolve with each step: the generation of new hypotheses is based on the current ones, leading to better hypotheses in the next step. Given the importance of the overall design of the framework, the chromosome encoding scheme, genetic operators and fitness function were carefully designed and tailored to the context of evolutionary data clustering and credit risk assessment in a synergistic fashion, to utilize the full potential of the hybrid framework. A hybrid approach was used for chromosome encoding, combining several encoding schemes: 1) a modified centroid-based representation, 2) an instance-based representation and 3) a cluster-dimensionality-based representation. Genetic operators were tailored specifically to exchange cluster configuration information and to minimize the possibility of incorrect or illogical solutions. Fitness-proportionate selection was assumed.
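Fitness-proportionate (roulette-wheel) selection, mentioned above, can be sketched as follows. This is the textbook form of the operator, assuming non-negative fitness values; it is an illustration, not the thesis's exact implementation:

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitnesses))
    cumulative = 0.0
    for individual, fit in zip(population, fitnesses):
        cumulative += fit
        if pick <= cumulative:
            return individual
    return population[-1]  # guard against floating-point rounding

# Hypothetical candidate cluster configurations and fitness values
rng = random.Random(42)
pop = ["config_a", "config_b", "config_c"]
fits = [1.0, 3.0, 6.0]
counts = {c: 0 for c in pop}
for _ in range(10_000):
    counts[roulette_select(pop, fits, rng)] += 1
```

Over many draws, `config_c` (fitness 6 out of a total of 10) is selected roughly 60% of the time, so fitter cluster configurations contribute more offspring while weaker ones still retain a chance of being selected.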
Three crossover operators were used: 1) single-point crossover, 2) uniform crossover and 3) linear crossover, as well as four mutation operators: 1) changing the cluster label, 2) adding Gaussian noise, 3) replacing a centroid and 4) eliminating and splitting clusters. The framework assumes hard partitioning, where clusters are non-overlapping. A customized initialization procedure explores the potential hypothesis space. An important factor is the level of imbalance of the sample, which can have a significant impact on the overall result. In credit risk assessment, default prediction is of greater interest, since misclassifying a bad client attracts a much higher cost than misclassifying a good client. The issue of class imbalance was addressed using two approaches, data preprocessing techniques and cost-sensitive learning, whereas data quality issues were addressed by employing adequate data preparation methods. Data preprocessing techniques alter the original imbalanced data sample with the aim of producing a more balanced class distribution. They can be divided into three categories: 1) oversampling, where the number of minority class instances is increased, 2) undersampling, where the number of majority class instances is reduced, and 3) hybrid methods that combine both undersampling and oversampling techniques. With the cost-sensitive learning approach, cost information is embedded in the learning algorithm by taking into account the actual costs of misclassifying the different classes. The fitness function that governs the search of the hypothesis space simultaneously takes into account the validity of the proposed cluster configuration and the misclassification cost. In order to avoid overfitting and crowding, a number of restrictions were introduced at several stages of the algorithm.
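One hedged way to combine cluster validity and misclassification cost into a single fitness value, as the fitness function described above does, is a weighted sum of a validity score and a normalized cost term. The weighting scheme, the 5:1 cost ratio and the normalization below are assumptions for illustration, not the thesis's actual formula:

```python
def fitness(validity, fn, fp, n_bad, n_good,
            cost_fn=5.0, cost_fp=1.0, alpha=0.5):
    """Higher is better. `validity` is a cluster-validity index in [0, 1];
    the cost term is the misclassification cost normalized by the worst
    case (every instance misclassified). Cost weights are illustrative:
    a missed defaulter (fn) is assumed 5x costlier than a rejected
    good client (fp)."""
    worst = cost_fn * n_bad + cost_fp * n_good
    cost = (cost_fn * fn + cost_fp * fp) / worst
    return alpha * validity + (1 - alpha) * (1 - cost)

# Two hypothetical configurations with equal cluster validity but a
# different error mix on a 200-client sample with 20 defaulters:
better = fitness(validity=0.7, fn=2, fp=10, n_bad=20, n_good=180)
worse = fitness(validity=0.7, fn=10, fp=2, n_bad=20, n_good=180)
```

Because false negatives are weighted more heavily, the configuration that misses fewer defaulters scores higher even though it makes more errors on good clients; the `alpha` parameter controls the trade-off between cluster quality and classification cost.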
An additional advantage of the framework is its wide applicability, since no assumptions are made about the relationship between the independent and target attributes. The proposed AEGiK framework is a more advanced approach to identifying homogeneous clusters, offered as an alternative to expert-based clustering, which is commonly used in practice even though it is subject to bias and preconceived notions. Processing begins with the preparation of the training data set, where observed data-related deficiencies are eliminated. Afterwards, data preprocessing techniques are applied, altering the underlying class distribution. Following the data preparation phase, the initial population is generated and evaluated. The core of the algorithm consists of repeating the crossover and mutation operations, generating and evaluating new clustering configurations and associated classification methods, until the stopping condition is satisfied. The sequence and method of applying genetic operators reduce the possibility of crowding, a phenomenon where extremely good individuals reproduce rapidly and "overwhelm" the population, which may reduce population diversity and slow down further progress of the algorithm. The result is the configuration of clusters and associated classification methods that achieves the highest value of the fitness function. Based on the formal definition of AEGiK, a prototype was designed and evaluated using a corporate portfolio and related financial reports. The prototype and the experimental evaluation of the proposed framework are described in Chapter 4. In addition to analysing the impact of using the AEGiK framework, the impacts of changing the level of imbalance and of using various data preprocessing techniques and classification algorithms were also analysed. Evaluation was performed using appropriate cost-sensitive performance metrics, comparing the proposed framework with a benchmark method, and using an independent test set.
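The processing flow just described can be summarized in pseudocode. All names are placeholders for the framework's components, not actual AEGiK identifiers, and details such as the replacement strategy are elided:

```
AEGiK(training_data):
    data ← prepare(training_data)           # fix missing/incorrect values
    data ← preprocess(data)                 # rebalance the class distribution
    population ← initialize(data)           # candidate cluster configurations
    evaluate each individual                # fitness = validity + cost terms
    repeat until stopping condition:
        parents ← select(population)        # fitness-proportionate selection
        offspring ← mutate(crossover(parents))
        evaluate offspring; form next population
    return the configuration of clusters and the associated
        classification methods with the highest fitness
```

Each individual thus encodes both a clustering of the data and, per cluster, a classification method, so the evolutionary search optimizes the two jointly rather than clustering first and classifying afterwards.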
Logistic regression was used as the benchmark method, given that it is the standard and one of the most commonly used methods for credit scoring among practitioners. Initially, the results of the benchmark method are presented, followed by the results of the proposed AEGiK framework, with and without data preprocessing techniques. Results are discussed in detail in Chapter 5, followed by the final chapter providing a general overview of the thesis, conclusions, and possible future research directions. The results confirm that, by using the AEGiK framework without data preprocessing techniques, classification accuracy can be increased and misclassification cost reduced by dividing the data set into homogeneous clusters through adaptive evolutionary clustering prior to classification, thus confirming the original research hypothesis. This conclusion was confirmed on both the training data set and the test data set. Improvement in classification performance was observed relative to the initial configuration of the algorithm and to the benchmark classification method. Results also indicate that almost all of the data preprocessing techniques reduce misclassification cost. The results further provide valuable insight into the suitability of different classification algorithms and data preprocessing techniques by assessing their sensitivity to class imbalance. Variability in the efficiency of different data preprocessing techniques was noted: undersampling techniques displayed better performance on average than oversampling techniques, especially undersampling techniques guided by heuristics. However, for the majority of the techniques, performance improved as class imbalance decreased. The impact of data preprocessing techniques on classification accuracy depending on the level of imbalance was also analysed.
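A cost-sensitive performance metric of the kind used for this evaluation weights the two error types differently, which is why it can rank models that plain accuracy cannot distinguish. A minimal sketch (the 5:1 cost ratio and the example counts are illustrative assumptions, not figures from the thesis):

```python
def misclassification_cost(fp, fn, n, cost_fp=1.0, cost_fn=5.0):
    """Average misclassification cost per scored client. A missed
    defaulter (false negative) is assumed to cost five times a
    rejected good client (false positive)."""
    return (cost_fn * fn + cost_fp * fp) / n

# Two hypothetical models with identical 90% accuracy on 200 clients
# (20 of them defaulters), but a very different mix of errors:
model_a = misclassification_cost(fp=18, fn=2, n=200)   # catches defaulters
model_b = misclassification_cost(fp=2, fn=18, n=200)   # misses defaulters
```

Both models misclassify 20 of 200 clients, yet the first incurs a far lower expected cost because its errors fall mostly on the cheap side, which is exactly the distinction a cost-insensitive accuracy figure hides on imbalanced credit data.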
In this case, too, it was observed that most of the data preprocessing techniques reduced misclassification cost and increased classification accuracy. Different data preprocessing techniques showed different levels of sensitivity to the level of imbalance; on average, undersampling methods were less sensitive to it than oversampling methods. The classification algorithms that result in the lowest misclassification cost, as selected by the AEGiK framework, are presented. The analysis was performed separately for the case without data preprocessing, and then by data preprocessing method and level of imbalance. It was observed that classification algorithms display different levels of sensitivity to the level of imbalance, especially when a cost-sensitive performance metric is used. The most successful algorithm was the neural network, which in most cases yielded the lowest misclassification cost given the clustering configuration determined by the AEGiK framework. The hybrid framework prototype and the experimental evaluation described confirmed the initial research goal, namely that it is possible to identify the optimal configuration of a credit risk model that maximizes classification accuracy, as well as the initial hypotheses. Credit scoring remains a very important research topic for both academics and financial institution practitioners, since any improvement in classification accuracy would bring significant savings to financial institutions. Given the recent advances in automating lending decisions, online applications, emerging new business models, the ever-growing richness of collected data, the big data trend and capital adequacy optimization, we foresee that this will remain a very active research area.