Machine Learning predicts the likelihood of debt repayment

In 2014, Kaggle, an organizer of data analysis and machine learning contests, announced a contest to predict a loan applicant’s repayment based on their background information. Imperial College London hosted the competition and offered a large table of anonymized financial data about the loan applications, applicants and their loan installments.

The aim of the competition was to create a program that would best predict an applicant’s loan repayment based on their background information. A table of anonymized, unlabeled background information was made available, listing details of all the repayments. In addition to the anonymization, the column titles were omitted, so each column in the table listed information about some unstated variable of unknown purpose.

For a group of repayments called the training set, the lists also included information on whether the repayment had been paid by the due date, the debt was in default, or there were other repayment irregularities. The contestants were expected to train their prediction engine with the background data from the training set, generate answers to the competition entries with their engine and submit the answers for evaluation.

I figured the machine learning application would have to base its prediction of the likelihood of repayment on how much the circumstances of a payment transaction resembled a successful repayment and how much they resembled a failure to pay. The ML would then be used to predict the expected loss as accurately as possible.

A method was required for analyzing a large set of background factors. The contents of the material were not at all clear from the table presentation. Unknown data interfered with making an effective interpretation, and the amount of data and the number of variables were too large to be handled effectively on a spreadsheet. A suitable multivariate analysis tool had to be developed. Numerical background information could then be treated as a large vector space.

Nearest Neighbor Networks

I had worked on Nearest Neighbor Networks (NNN) before the competition. NNN works on multivariate data in which each sample is treated as a vector in a multidimensional vector space. In the NNN system, a number of the most similar vectors are found for each vector in the vector space. These similar vectors are by definition close to the original vector by some metric, for example in the Euclidean sense, so they are the original vector’s neighbors.
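
The neighbor search described above can be illustrated with a minimal sketch. This is not the NNN implementation used in the competition; the function names `euclidean` and `nearest_neighbors` and the sample data are hypothetical, and plain Python is used only to keep the example free of dependencies.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, samples, k):
    """Return the indices of the k samples closest to query, nearest first."""
    order = sorted(range(len(samples)),
                   key=lambda i: euclidean(query, samples[i]))
    return order[:k]

# Hypothetical 2-dimensional samples (the "countryside" case).
samples = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [5.0, 5.0],
]

# The two samples closest to the query point.
print(nearest_neighbors([0.2, 0.1], samples, k=2))  # [0, 1]
```

The brute-force scan compares the query against every sample, which is simple but slow for large tables; a real system would likely need an index structure or an approximation.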

Consider a parallel of the 3-dimensional situation: neighbors in a town. In an apartment building, the nearest neighbors can be found on the other side of the wall, above and also below. The nearest neighbors in the direction of the windows may be found across the street. For a parallel of the 2-dimensional vector space, think of the countryside, where neighbors may be found anywhere on the map in any direction of the compass.

In machine learning applications, the number of dimensions in the vector space may exceed 200. The dimensions and geometry of the samples in such a vector space cannot be interpreted intuitively. An algorithm is needed that interprets the relationship between the observations on behalf of the human being.

The interpreter may resort to some kind of dimensionality reduction to provide a 2D visual output of the data which the user may browse. The interpreter may also be programmed to measure the quality of the interpretation in some tangible way. I chose the latter way. The quality of the interpretation would be measured by the accuracy of the predictions created when using those interpretations.

In the loan portfolio problem, I assumed that loan applicants with sufficiently similar background factors are likely to produce a similar outcome in the loan repayment. Against this assumption, I could measure the correctness of any definition of similarity, or any interpretation which makes two or more samples “neighbors” in the vector space. Any interpretation which results in a more accurate estimate of the debt repayments is a better interpretation.

Various background factors would certainly have variable relevance. The number of a debtor’s current payment defaults would be of more value than their postal code. Had the organizer placed random numbers in the data table, their relevance would be even lower.

NNN was intended to enable combining the data properly. If NNN could be used to find out which of the debtors were similar enough to be included in the same group for statistical analysis, the loan repayment statistics of all those debtors might be utilized to predict the loan repayment outcome of an individual debtor.

An accurate prediction would then rely on an adequate sample size where the samples are weighted according to their relevance. If the NNN produced a mistaken interpretation of similarities between debtors, some of the similar debtors’ statistics might get too little or no weight in the prediction. Also, the prediction might be polluted by data of debtors whose statistics are given too much weight considering their relevance to the case.

Assessing the relevance of the background factors

The predicted debt loss in a given case would then be calculated as a weighted average of all the relevant debt losses, weighted by their relevance. A highly similar debt/debtor case would receive a large weight, and a dissimilar case little to no weight. This kind of calculation required normalizing the rate of change of relevance across vector components.
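
As a rough illustration of the relevance-weighted average, the sketch below turns distance into relevance with a Gaussian kernel. The kernel shape, the `bandwidth` parameter and the function names are assumptions made for this example; the text does not specify which relevance function was actually used.

```python
import math

def relevance(distance, bandwidth=1.0):
    """Assumed Gaussian kernel: 1.0 at distance 0, falling toward 0 far away."""
    return math.exp(-(distance / bandwidth) ** 2)

def predict_loss(query, samples, losses, bandwidth=1.0):
    """Relevance-weighted average of the known losses of all samples."""
    weights = []
    for sample in samples:
        distance = math.sqrt(sum((q - s) ** 2 for q, s in zip(query, sample)))
        weights.append(relevance(distance, bandwidth))
    total = sum(weights)
    if total == 0.0:
        # Every sample is too far away to count: fall back to the plain mean.
        return sum(losses) / len(losses)
    return sum(w * loss for w, loss in zip(weights, losses)) / total

# Hypothetical one-dimensional data: two known cases with losses 0 and 10.
samples = [[0.0], [2.0]]
losses = [0.0, 10.0]

print(predict_loss([1.0], samples, losses))  # midway, so roughly 5
print(predict_loss([0.1], samples, losses))  # close to the first case, so near 0
```

A query equally far from both cases gets the average of their losses, while a query close to one case is dominated by that case.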

The normalization would be accomplished by multiplying the background information vectors with a weight vector before handing them over to the NNN. A weight vector with components in the range [0,1] would zero out meaningless components and keep the most meaningful components. The resulting vector, the element-wise product of the original background vector and the weight vector, would satisfy the requirements for normalization.
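
A small sketch of this component-by-component scaling follows. The column meanings, values and function names are hypothetical; the point is only that a zero weight removes a component from the distance calculation, while a weight of one keeps it in full.

```python
import math

def apply_weights(vector, weights):
    """Scale each component of the vector by the matching weight."""
    return [v * w for v, w in zip(vector, weights)]

def weighted_distance(a, b, weights):
    """Euclidean distance between two vectors after scaling both."""
    scaled_a = apply_weights(a, weights)
    scaled_b = apply_weights(b, weights)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(scaled_a, scaled_b)))

# Hypothetical columns: [payment defaults, postal code, random noise]
a = [2.0, 10100.0, 0.37]
b = [2.0, 99800.0, 0.91]

# Unweighted, the postal code dominates and the two debtors look far apart.
print(weighted_distance(a, b, [1.0, 1.0, 1.0]))

# With the irrelevant columns zeroed out, the debtors are identical.
print(weighted_distance(a, b, [1.0, 0.0, 0.0]))  # 0.0
```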

A unit movement in any direction in the normalized vector space would produce the same change in relevance. All the surrounding samples at or above a given relevance would then lie within some fixed distance r of the sample, in a volume shaped like a hypersphere.

The weight vector would have to be found empirically. A series of weight vectors could be tested in an optimization scheme. The predictive value of the NNN system with some weight vector would be measured and compared to that of another weight vector. The weights could be altered randomly and the optimization scheme could be made to prefer vectors producing better predictions. With the right kind of scheme the weight vector would hopefully converge to an optimal point.
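
One way to picture such a scheme is a simple random hill climb, sketched below. This is an illustration under stated assumptions, not the optimizer used in the competition: the Gaussian kernel, the leave-one-out mean absolute error as the quality measure, the step size, the bandwidth and the synthetic data are all choices made for the example.

```python
import math
import random

def loo_error(samples, losses, weights, bandwidth=0.2):
    """Leave-one-out mean absolute error of the relevance-weighted prediction."""
    total_error = 0.0
    for i, query in enumerate(samples):
        numerator = denominator = 0.0
        for j, other in enumerate(samples):
            if i == j:
                continue  # never let a sample predict itself
            d2 = sum((w * (q - o)) ** 2
                     for w, q, o in zip(weights, query, other))
            kernel = math.exp(-d2 / bandwidth ** 2)
            numerator += kernel * losses[j]
            denominator += kernel
        prediction = numerator / denominator if denominator > 0.0 else 0.0
        total_error += abs(prediction - losses[i])
    return total_error / len(samples)

def optimize_weights(samples, losses, steps=200, step_size=0.2, seed=0):
    """Randomly perturb the weights and keep only changes that reduce the error."""
    rng = random.Random(seed)
    best = [1.0] * len(samples[0])
    best_err = loo_error(samples, losses, best)
    for _ in range(steps):
        candidate = [min(1.0, max(0.0, w + rng.uniform(-step_size, step_size)))
                     for w in best]
        err = loo_error(samples, losses, candidate)
        if err < best_err:
            best, best_err = candidate, err
    return best, best_err

# Synthetic data: the loss depends only on the first column; the second is noise.
data_rng = random.Random(1)
samples = [[data_rng.random(), data_rng.random()] for _ in range(30)]
losses = [10.0 * s[0] for s in samples]

baseline = loo_error(samples, losses, [1.0, 1.0])
best, best_err = optimize_weights(samples, losses)
print("error with all weights at 1:", baseline)
print("error after the search:", best_err, "with weights", best)
```

Because a candidate is accepted only when it lowers the error, the result is never worse than the starting point, but a greedy search like this can stall in a local optimum and offers no guarantee of reaching the best weight vector.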

Surprise in the competition

I was working on the optimization system when the news hit: the competition host had made an error. They had included loan repayment information in the background information. The answers were hidden in the data set, as if the teacher had left the exam answers on the table and left the room.

The organizers pulled the answers, but the damage had already been done. The right answers were out in the open. The score charts showed a huge number of high-quality competition entries appearing all of a sudden. The competition continued on schedule, but it could no longer be ascertained whether any given entry was the result of fair or foul play.

The competition community theorized that the programs created by the members could be used after the end of the competition to evaluate some other data, thereby negating the advantage of having used the leaked data. It is unclear whether this was ever done.

Finishing up

The results from the iterated NNN system with the improved weight vector turned out pretty well. The method was able to provide quite an accurate prediction of the likelihood of repayment of the loan. The iteration produced a weight vector which corresponded to the relevance of the variables pretty accurately.

For some of the variables, my approach did not work at all. There were, for example, account numbers, loan numbers, etc. which revealed the borrower’s other loans. Treating these variables with NNN in a linear vector space was an error. The weighting system weighted out these factors even though another algorithm could have derived information from them. NNN was a compromise that worked out well, but more could have been done.
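
To show why identifiers do not fit a linear vector space, here is a hedged sketch of one thing another algorithm could do with them: group rows by identifier and count each borrower’s other loans. The account numbers and function names are invented for the example, and this is not something the competition entry did.

```python
from collections import defaultdict

def loans_per_account(account_ids):
    """Map each account identifier to the row indices that share it."""
    groups = defaultdict(list)
    for row, account in enumerate(account_ids):
        groups[account].append(row)
    return dict(groups)

def other_loan_counts(account_ids):
    """For each row, count the other rows carrying the same account identifier."""
    groups = loans_per_account(account_ids)
    return [len(groups[account]) - 1 for account in account_ids]

# Hypothetical identifiers: 40017 and 40018 are numerically adjacent but
# unrelated, while rows 0 and 3 belong to the same account.
account_ids = [40017, 40018, 91250, 40017]

print(other_loan_counts(account_ids))  # [1, 0, 0, 1]
```

A distance metric would treat 40017 and 40018 as near neighbors, which means nothing, whereas exact matching recovers the link between rows 0 and 3. The resulting count is an ordinary numeric feature that could then be fed into a vector-space method.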

My experience of the NNN application was positive. Clearly the method could be used well with continuous variables. The process revealed, however, that a Decision Tree solution, or maybe a Random Forest, might have been of better quality, at least for some variables describing categories or object identities.

Given the data set, the application was able to give strong recommendations for accepting or denying loan applications. There are limitations to this approach. Because it was built without an understanding of the financial landscape and without expert opinion, the application wasn’t particularly suited to a changing financial environment.

Nevertheless, the project can be considered a success. The process of finding a suitable normalization vector went smoothly. The predictive value of the application was pretty good. The process paved the way for further practical improvements to the NNN implementation and kernel estimation.

A Machine Learning predictor can’t be completely accurate, but I can vouch for the statement made in the title: Machine Learning predicts the likelihood of debt repayment. More predictive accuracy can be gained with more data, more work and increased research efforts. The kind of ML available in 2014 predicted repayments then, and the ML applications available today certainly predict repayments now.

Limitations of data fitting predictors

The kind of data fitting approach used in my predictor and other competition entries finds the few risky debtors from a group of loan applicants, but it can’t predict a financial crisis in a changing situation. All that the prediction engine does is interpret new data in terms of old data. The predictions fail when the old data is not applicable to the new situation, and they can fail either way.

An automatic loan application assessment system would have to have some safeguards to protect the creditor. Other parties like the general public and the financial supervisory authorities might be interested to know what kinds of decision making processes the financial institutions use in making their credit decisions.

Automated loan application systems have been in use for some time now. Most of them seem to be rule-based with manually encoded rules, with an unknown amount of the decision-making process relying on Machine Learning. Some use Machine Learning extensively to derive credit decisions from a large array of weak hints, like online behavior, hobbies etc.

The data-driven predictor isn’t any more discriminatory than it is fair. Depending on how it is used, it can be either. Therein lies the dilemma: conservative prediction based on historical data might reduce the risk to the creditor and to the general public from a possible bailout, but it might also maintain and amplify existing prejudices. The creditor might miss business opportunities from an emerging client base and new sectors due to its overreliance on historical data. A predictor that de-emphasizes historical data and makes positive credit decisions in unknown circumstances creates a risk of unknown size.

There are some unknowns in autonomous financial decision making systems, some of which might make the financial sector ever more volatile and increase risk. Nevertheless, a great deal more of relevant information can be properly handled with a Machine Learning based system in a financial environment than could be handled without. Risk-limiting methods can and should be utilized to ensure safe operation of Machine Learning systems and the business operations relying on such systems.

All interested parties should acquire sufficient information regarding automatic and ML-assisted financial decision-making. Those who are interested in developing a predictor or having a predictor developed for them would do well to perform a careful study of the design goals and the options available. A properly designed ML system fulfills the design goals and can be used as part of a financial decision-making process which performs profitably, complies with regulations and meets the expectations of the stakeholders.