Credit rating algorithm of corporate bonds based on Gaussian process mixture model and improved K-means

Abstract: The primary challenge in credit analysis revolves around uncovering the correlation between repayment terms and yield to maturity, which constitutes the interest rate term structure, an essential model for corporate credit term evaluation. Presently, interest rate term structures are predominantly examined through economic theoretical models and quantitative models. However, predicting treasury bond yields remains a challenging task for both approaches. Leveraging the clustering analysis algorithm theory and the attributes of an insurance company’s customer database, this paper enhances the K-means clustering algorithm, specifically addressing the selection of initial cluster centers in extensive sample environments. Utilizing the robust data fitting and analytical capabilities of the Gaussian process mixture model, the study applies this methodology to model and forecast Treasury yields. Additionally, the research incorporates customer credit data from a property insurance company to investigate the application of clustering algorithms in the analysis of insurance customer credit.


Introduction
Credit represents a class of marketable securities [1], offering holders stable cash flow returns at specific future times [2]. Various factors influence bonds, encompassing both micro-entities and the macro-environment [3]. Credit categories include national bonds, policy bank financial bonds, corporate bonds, and municipal bonds based on the issuing entity [4]. Government bonds, issued based on the nation's credit, possess the highest credit rating [5]. Owing to their distinctive issuing entity, government bonds frequently serve as benchmarks for pricing other types of credit. The key determinants of credit value in practice include the issuing entity, denomination, coupon interest, repayment method, repayment period, and yield. Interest refers to the compensation received by the lender over a specific period, and the ratio of interest to the lent amount over that time is the interest rate or rate of return on funds [6].
Corporate credit rating is a management activity in which an independent social intermediary assesses the reliability and safety of a company's borrowing and lending behavior and provides an assessment report with professional symbols according to a specified methodology [7]. In essence, it is an evaluation of the enterprise's ability to repay principal and interest as promised, assessing the credit risk of the bond [8]. The credit rating solely judges the credit risk of the issued credit and does not reflect the rated credit's profitability and liquidity level. Therefore, rating results aid credit investors in gauging credit risk but should not be the sole basis for credit buying, selling, or holding decisions [9].
Given that ratings only assess a credit's risk without considering other factors like market price, supply and demand, and investor preferences, they serve as just one factor in investment decisions, not the sole basis [10]. When making credit investment decisions, investors must consider both the risk and return aspects of credit.
Credit ratings have a validity period, reflecting a specific credit's creditworthiness only during that period. Even within this timeframe, a credit's rating may change due to external environmental and internal operational conditions of the debt issuer [11].
A rating agency holds no legal responsibility for an investor's use of a rating. A credit rating from an agency serves as an indication to investors regarding the risk profile of various credits. It represents the agency's opinion, and investors are not obliged to share or adhere to it. Legally, there is no direct connection between the rating agency and the consequences investors face when using rating results [12].
The Gaussian process mixture model (MGP) is a potent statistical learning tool with robust learning and fitting capabilities. MGP models effectively describe multimodal data and reflect data volatility. They can be categorized into generative and discriminative models from the generative process perspective, and into mixing in the time domain (MGP models) and mixing in the output space (mixGP models) from the mixing mode perspective.
The enhancement of the corporate credit rating system has spurred considerable scholarly interest in the rating methodology, a pivotal component of the system. Fitzpatrick (1932) conducted a univariate bankruptcy prediction study using ratios like net income to stockholders' equity and stockholders' equity to debt to predict firm bankruptcy [13]. Another study [14] resulted in the well-known Z-score model and ZETA credit risk model, which utilized multivariate discriminant analysis for rating debt securities. Neural network analysis was applied to predict the financial crisis of Italian companies in [15]. In recent years, domestic scholars have delved deeper into this area, as seen in [16], which employed the internal rating method to enhance the current credit rating method of commercial banks in China [17]. This method considers not only the target data but also the relevance of each indicator, providing more valuable information and credit insights [18].
Given the limitations of financial factors in corporate credit rating analysis, such as lag, incompleteness (due to the largely incomplete or even false information disclosed in financial statements), and short-term focus, scholars are increasingly focusing on the role of non-financial factors in corporate credit rating. They argue that credit-issuing companies operate in an open system, subject to external factors, making non-financial factors early warning signs of future loan risk [19,20].
This paper employs the MGP model to analyze corporate credit term structure data. Treasury yield data represent "time-flow" data, with each data point correlated with neighboring points. This correlation is depicted by the covariance matrix of the MGP model. Due to policy influences and other factors at different time points, the volatility of Treasury yield data varies over time. The MGP model captures this differential volatility by expressing it through each GP component separately. These components describe local variations and are combined to enhance the MGP model's overall representation of data variability.

Gaussian Process (GP) Model
Mathematically, $Y(X)$ is considered a Gaussian process if, for any given $N$ and $X = (x_1, \dots, x_N)$, the corresponding $Y = (y_1, \dots, y_N)$ follows a Gaussian distribution. In mathematical terms, a Gaussian process can be expressed as $Y \sim \mathcal{N}(m(X), K(X, X))$, where $m(X)$ is the mean function and $K(X, X)$ is the covariance matrix. In general problems, it is often assumed that $m(X) = 0$. In this paper, the Squared Exponential (SE) covariance function is utilized. For ease of representation, let the parameter be $\theta = \{\sigma_1^2, \sigma_2^2, \sigma_3^2\}$; the parameter learning of the GP model is efficiently performed using the maximum likelihood estimation algorithm.
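To make the role of the three parameters concrete, the following minimal sketch evaluates an SE covariance matrix and the GP log marginal likelihood, which is the objective that maximum likelihood estimation would maximize. The parameterization is an assumption, because the section does not give the formula: `s1` is taken as the signal variance, `s2` as the squared length-scale, and `s3` as a noise variance added on the diagonal. All function names are illustrative, and only the standard library is used.

```python
import math

def se_kernel(xa, xb, s1, s2):
    """Squared Exponential covariance between two scalar inputs.
    Assumed form: s1 * exp(-(xa - xb)^2 / (2 * s2))."""
    return s1 * math.exp(-((xa - xb) ** 2) / (2.0 * s2))

def cov_matrix(xs, s1, s2, s3):
    """Covariance matrix K + s3 * I (s3 is an assumed noise variance)."""
    n = len(xs)
    return [[se_kernel(xs[i], xs[j], s1, s2) + (s3 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def log_marginal_likelihood(xs, ys, s1, s2, s3):
    """log N(ys | 0, K + s3 * I) for a zero-mean GP."""
    low = cholesky(cov_matrix(xs, s1, s2, s3))
    n = len(xs)
    # Forward substitution solves low * v = ys, so ys^T K^{-1} ys = v^T v.
    v = []
    for i in range(n):
        v.append((ys[i] - sum(low[i][k] * v[k] for k in range(i))) / low[i][i])
    quad = sum(t * t for t in v)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (quad + log_det + n * math.log(2.0 * math.pi))

# Toy data: a smooth curve sampled at five points.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.sin(x) for x in xs]
smooth = log_marginal_likelihood(xs, ys, 1.0, 1.0, 0.01)      # long length-scale
rough = log_marginal_likelihood(xs, ys, 1.0, 0.0001, 0.01)    # very short length-scale
```

On this smooth toy curve the longer length-scale receives the higher likelihood, which is the kind of comparison a maximum likelihood search over $\theta$ performs.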

MGP Model
In this paper, an MGP model in the form of a generative model is used, where the GP components are independent of each other. It is assumed that the mixture includes $C$ GP components, and the $c$-th GP model is denoted as $GP_c$. The Gaussian process mixture model generates the sample dataset $D$ in three steps:
1. First, the hidden variable $z_n^c$ is introduced to describe the attribution of the $n$-th sample to the $c$-th GP component: $z_n^c = 1$ indicates that the sample belongs to component $c$, and the mixing proportions of the $C$ components sum to one.
2. Under the condition $z_n^c = 1$, the sample input $x_n$ follows a normal distribution with mean $\mu_c$ and a component-specific covariance.
3. The sample labels, inputs, and outputs assigned to the $c$-th GP component are collected, and the outputs are generated by the $c$-th GP given the inputs.
From the above three steps, we can see that the information flow direction of the MGP model is "Z → X → Y", which is consistent with the characteristics of the Treasury yield and the application scenario of the MGP model. Based on the dataset $D$, the log-likelihood function of the MGP can be derived, in which $\Theta$ and $\Psi$ denote the hyperparameters and parameters of the MGP model, respectively.
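The three generative steps can be written out as follows. This is a hedged reconstruction in conventional notation rather than the paper's own equations: the symbols $\pi_c$ (mixing proportions), $S_c$ (input covariance of component $c$), and $K(\cdot, \cdot; \theta_c)$ (covariance function of the $c$-th GP) are assumptions, and the last line is the complete-data log-likelihood conditional on the assignments $z_n^c$.

```latex
% Assumed notation: \pi_c, S_c and K(.,.;\theta_c) are not given in the text.
\begin{align}
p(z_n^c = 1) &= \pi_c, \qquad \sum_{c=1}^{C} \pi_c = 1, \\
p(x_n \mid z_n^c = 1) &= \mathcal{N}(x_n \mid \mu_c, S_c), \\
Y_c \mid X_c &\sim \mathcal{N}\bigl(0, K(X_c, X_c; \theta_c)\bigr), \\
\log p(D, Z \mid \Theta, \Psi) &= \sum_{c=1}^{C} \Bigl[ \sum_{n=1}^{N} z_n^c
  \bigl( \log \pi_c + \log \mathcal{N}(x_n \mid \mu_c, S_c) \bigr)
  + \log \mathcal{N}\bigl(Y_c \mid 0, K(X_c, X_c; \theta_c)\bigr) \Bigr].
\end{align}
```

Under this reading, $\Theta = \{\theta_c\}$ collects the GP hyperparameters and $\Psi = \{\pi_c, \mu_c, S_c\}$ collects the remaining parameters.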

Algorithm Design
In practice, the main algorithms for learning the parameters of MGP models are the MCMC algorithm, the variational Bayesian (VB) algorithm, and the EM algorithm; in this paper, we use the EM algorithm to learn the hyperparameters. Although the MCMC algorithm is generally able to obtain more accurate estimates, it requires a large number of samples, is inefficient, and produces unstable results. The VB algorithm assumes that the parameters and hidden variables in the model are independent of each other; this is an effective computational simplification, but it often causes the estimates to deviate from the true values and degrades the learning results.
The core idea of the Hardcut EM algorithm is to convert the posterior distribution of each sample into a 0-1 distribution using the maximum a posteriori criterion and then assign the sample to the model component with posterior probability p = 1. Owing to the distribution characteristics of the data, most samples already have a posterior distribution close to 0-1, so the error of the hard-cut strategy is small for these samples; for samples at the edges of the model components, the hard-cut strategy generates larger errors, but the total error remains small because such samples are few. In addition, the Hardcut EM algorithm greatly simplifies the calculation of the Q function in the EM algorithm and improves the speed of the algorithm. The algorithm proceeds as follows:
1. Initialization: use the k-means algorithm to classify the samples in $D$ into $C$ classes, and initialize $(\Theta, \Psi)$.
2. M-step: update the model parameters according to the current classification, and learn the hyperparameters $\theta_c$ of each GP component independently using maximum likelihood estimation.
3. E-step: update the posterior probability of $z_n^c = 1$ for each sample, and update the category information of the sample according to the maximum a posteriori principle.
4. Termination: stop when the change between two successive iterations is less than the threshold; otherwise, return to step 2.
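The iteration above can be sketched in plain Python for scalar inputs. Several simplifications are assumptions of this sketch and not the paper's procedure: the GP hyperparameters are chosen by a coarse grid search over the marginal likelihood instead of a gradient-based maximum likelihood routine, the E-step scores each output with the marginal GP variance only, and the loop stops when no assignment changes. The data are synthetic, and every function name is illustrative.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def gp_loglik(xs, ys, theta):
    """GP log marginal likelihood; theta = (signal var, squared length-scale, noise var)."""
    sf2, ell2, sn2 = theta
    n = len(xs)
    k = [[sf2 * math.exp(-(xs[i] - xs[j]) ** 2 / (2 * ell2)) + (sn2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    low = cholesky(k)
    v = []
    for i in range(n):  # forward substitution: low * v = ys
        v.append((ys[i] - sum(low[i][m] * v[m] for m in range(i))) / low[i][i])
    log_det = 2 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (sum(t * t for t in v) + log_det + n * math.log(2 * math.pi))

def log_normal(value, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)

def kmeans_1d(xs, C, iters=20):
    """Step 1: plain k-means on scalar inputs, started from spread-out quantiles."""
    srt = sorted(xs)
    centers = [srt[(2 * c + 1) * len(xs) // (2 * C)] for c in range(C)]
    for _ in range(iters):
        labels = [min(range(C), key=lambda c: (x - centers[c]) ** 2) for x in xs]
        for c in range(C):
            members = [x for x, z in zip(xs, labels) if z == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def m_step(xs, ys, labels, C, grid):
    """Step 2: per-component parameters, then GP hyperparameters by grid search."""
    params = []
    for c in range(C):
        xc = [x for x, z in zip(xs, labels) if z == c]
        yc = [y for y, z in zip(ys, labels) if z == c]
        if not xc:
            raise ValueError("component %d lost all samples" % c)
        mu = sum(xc) / len(xc)
        var = sum((x - mu) ** 2 for x in xc) / len(xc) + 1e-6
        theta = max(grid, key=lambda t: gp_loglik(xc, yc, t))
        params.append((len(xc) / len(xs), mu, var, theta))
    return params

def e_step(xs, ys, params):
    """Step 3: hard assignment to the component with the largest posterior score."""
    labels = []
    for x, y in zip(xs, ys):
        scores = [math.log(pi) + log_normal(x, mu, var) + log_normal(y, 0.0, th[0] + th[2])
                  for pi, mu, var, th in params]
        labels.append(scores.index(max(scores)))
    return labels

def hardcut_em(xs, ys, C, grid, max_iter=10):
    labels = kmeans_1d(xs, C)
    for _ in range(max_iter):
        params = m_step(xs, ys, labels, C, grid)
        new_labels = e_step(xs, ys, params)
        changed = sum(a != b for a, b in zip(labels, new_labels))
        labels = new_labels
        if changed == 0:  # step 4: assignments stopped changing
            break
    return labels, params

# Synthetic data: a calm regime followed by a volatile regime.
random.seed(0)
xs = [0.2 * i for i in range(15)] + [5.0 + 0.2 * i for i in range(15)]
ys = ([0.3 * math.sin(x) + random.gauss(0, 0.05) for x in xs[:15]]
      + [2.0 * math.sin(3 * x) + random.gauss(0, 0.05) for x in xs[15:]])
grid = [(sf2, ell2, 0.01) for sf2 in (0.1, 1.0, 4.0) for ell2 in (0.05, 0.5, 2.0)]
labels, params = hardcut_em(xs, ys, 2, grid)
```

Because each sample carries a single hard label, the M-step decomposes into independent per-component problems, which is where the speed advantage over a soft-assignment EM comes from.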

Algorithm Description
Data types in real databases are complex, and a data object often contains several variables of different types at the same time. It is necessary to process the data before performing calculations. Assume that the data set contains $p$ variables of different types. To simplify the calculation process, the variables of different types are transformed to a common value space [0.0, 1.0], and the dissimilarity $H(i, j)$ between objects $i$ and $j$ is defined as

$$H(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}},$$

where $f$ represents the variable, and $x_{if}$ or $x_{jf}$ represents the metric of object $i$ or object $j$ on variable $f$.
1. When $f$ is a binary or nominal variable, $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$. If $x_{if}$ or $x_{jf}$ is missing, or $x_{if} = x_{jf} = 0$ and $f$ is an asymmetric binary variable, the indicator term $\delta_{ij}^{(f)} = 0$; otherwise $\delta_{ij}^{(f)} = 1$.
2. When $f$ is an ordinal variable, assume that the variable $f$ has $N_f$ states, corresponding to the sequence $1, \dots, N_f$. Let $r_{if} \in \{1, \dots, N_f\}$ be the rank corresponding to $x_{if}$; then the weight $z_{if} = (r_{if} - 1)/(N_f - 1)$ can be used instead of $x_{if}$.
3. When $f$ is an interval-scaled variable, the metric is standardized. The mean absolute deviation $S_f$ is calculated as $S_f = \frac{1}{n}\left(|x_{1f} - m_f| + \dots + |x_{nf} - m_f|\right)$, where $n$ is the number of objects and $m_f$ is the average of the measured values of $f$, i.e., $m_f = \frac{1}{n}(x_{1f} + \dots + x_{nf})$. The normalized metric is then $Z_{if} = (x_{if} - m_f)/S_f$. When calculating the dissimilarity, $d_{ij}^{(f)} = |x_{if} - x_{jf}| / (\max_h x_{hf} - \min_h x_{hf})$, where $h$ traverses all non-missing objects of the variable $f$.

Three kinds of distance are involved in this paper: point-to-point distance, point-to-cluster distance, and cluster-to-cluster distance.
1. The distance between points is the commonly used Euclidean distance, i.e., $d(i, j) = \sqrt{\sum_{f} (x_{if} - x_{jf})^2}$.
2. The distance between a point and a cluster.
3. The distance between clusters, defined from the average values of the two clusters.

Suppose $P$ non-repeating small sample sets $u_1, u_2, \dots, u_P$ are randomly selected from the target database, each small sample set $u_i$ ($1 \le i \le P$) contains $n$ objects, the number of output classes is $K$, $c_m$ denotes a cluster of small samples, and $i = r = 1$.
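The mixed-type dissimilarity described in this section can be sketched as below. It follows the standard textbook treatment of mixed variables, which is assumed to be what the section intends: missing values are marked with `None`, ordinal values are mapped to [0, 1] before comparison, and numeric differences are divided by the observed range of the variable. The variable kinds, the example customers, and all names are hypothetical.

```python
def ordinal_weight(rank, n_states):
    """Map a rank in 1..n_states onto [0, 1]."""
    return (rank - 1) / (n_states - 1)

def mixed_dissimilarity(a, b, kinds, ranges):
    """Weighted average of per-variable dissimilarities between objects a and b.

    kinds[f] is one of "nominal", "asym_binary", "ordinal", "interval".
    ranges[f] is max - min of variable f for ordinal/interval variables
    (ordinal values are assumed to be already mapped by ordinal_weight).
    """
    num = 0.0
    den = 0.0
    for f, kind in enumerate(kinds):
        xa, xb = a[f], b[f]
        if xa is None or xb is None:
            continue  # indicator term is 0: a value is missing
        if kind == "asym_binary" and xa == 0 and xb == 0:
            continue  # indicator term is 0: joint absence carries no information
        if kind in ("nominal", "asym_binary"):
            d = 0.0 if xa == xb else 1.0
        else:
            d = abs(xa - xb) / ranges[f] if ranges[f] else 0.0
        num += d
        den += 1.0
    return num / den if den else 0.0

# Hypothetical customers: gender, claim flag, age band (8 states), premium amount.
kinds = ["nominal", "asym_binary", "ordinal", "interval"]
ranges = [None, None, 1.0, 4000.0]
cust_a = ["F", 0, ordinal_weight(3, 8), 1200.0]
cust_b = ["M", 0, ordinal_weight(5, 8), 2000.0]
h_ab = mixed_dissimilarity(cust_a, cust_b, kinds, ranges)
```

In this example the claim flag is skipped because both customers have the value 0, so the result averages the remaining three variables.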

Fuzzy Evaluation Method
Enterprise credit rating is inherently fuzzy. Firstly, the influence of each indicator is obtained using a one-dimensional linear affiliation function; the result is called the "single-factor affiliation", and each indicator is evaluated individually. Secondly, according to the weight of each indicator, the composite operation of the fuzzy matrix is performed on the single-factor affiliations to calculate the comprehensive affiliation, and the index value of the comprehensive assessment is obtained. Thirdly, the credit status of the enterprise is assessed according to the index value of the comprehensive assessment.
Therefore, this paper selects five indicators: total assets, return on assets, turnover rate of total assets, gearing ratio and long-term debt ratio to evaluate the business risks, financial status and debt issuance projects of debt issuing enterprises (as shown in Table 1).
After analyzing the large sample, the optimal, actual, and impermissible values of each indicator are obtained. Let the actual value of the $i$-th indicator of a credit be $A_i$, let the standard value (weight) of the indicator be $D_i$, and let $A$ be the composite index of the credit.
The single-factor affiliation $d_i$ of each indicator is computed from its actual, optimal, and impermissible values: for a positive indicator, $d_i$ increases with the actual value, and for an inverse indicator, $d_i$ decreases with the actual value. The total assets, return on assets, and total asset turnover in this example are positive indicators, while the corporate gearing ratio and long-term debt ratio are inverse indicators. The composite index $A$ is obtained by combining the single-factor affiliations $d_i$ according to the weights $D_i$. The creditworthiness of a company is assessed based on the value of the comprehensive assessment index: the closer the result is to 0, the worse the creditworthiness, and the closer it is to 1, the better the creditworthiness.
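A small sketch of this evaluation is given below. The membership formula is an assumption, since the section does not state it: a linear function that equals 0 at the impermissible value and 1 at the optimal value, clipped to [0, 1], with the composite index taken as the weight-normalized sum. With this form, inverse indicators need no separate branch because their optimal value lies below the impermissible value. All numbers are hypothetical.

```python
def membership(actual, optimal, impermissible):
    """Assumed linear single-factor affiliation, clipped to [0, 1].
    Works for positive indicators (optimal > impermissible) and
    inverse indicators (optimal < impermissible)."""
    d = (actual - impermissible) / (optimal - impermissible)
    return min(1.0, max(0.0, d))

def composite_index(indicators):
    """indicators: list of (actual, optimal, impermissible, weight) tuples."""
    total_weight = sum(w for _, _, _, w in indicators)
    weighted = sum(w * membership(a, o, i) for a, o, i, w in indicators)
    return weighted / total_weight

# Hypothetical values for the five indicators (equal weights).
indicators = [
    (50.0, 100.0, 0.0, 0.2),   # total assets (positive)
    (0.06, 0.10, 0.00, 0.2),   # return on assets (positive)
    (0.8, 1.0, 0.2, 0.2),      # total asset turnover (positive)
    (0.5, 0.3, 0.8, 0.2),      # gearing ratio (inverse)
    (0.4, 0.2, 0.6, 0.2),      # long-term debt ratio (inverse)
]
score = composite_index(indicators)
```

A score near 1 would indicate good creditworthiness and a score near 0 poor creditworthiness, matching the interpretation given in the text.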

Results
In the experiment, we first modeled the difference between the 10-year Treasury yield and the 5-year Treasury yield, denoted as "105"; next, we modeled the difference between the 5-year Treasury yield and the 1-year Treasury yield, denoted as "5-1"; and finally, we modeled the 10-year Treasury yield, denoted as "10". Figure 1 shows the curves of "105", CPI, IP, and the interbank 7-day pledged repo rate. Since the CPI and IP are updated monthly by the National Bureau of Statistics, they are converted to daily values by linear interpolation to maintain consistency.
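The monthly-to-daily conversion can be illustrated with a short sketch. It assumes, for simplicity, that the releases are equally spaced by a fixed number of days, which real calendars do not satisfy exactly; the function name and the sample values are hypothetical.

```python
def interpolate_daily(monthly, days_per_step):
    """Linearly interpolate between successive monthly values.

    monthly: values at successive release dates.
    days_per_step: assumed constant number of days between releases.
    """
    daily = []
    for left, right in zip(monthly, monthly[1:]):
        for k in range(days_per_step):
            daily.append(left + (right - left) * k / days_per_step)
    daily.append(monthly[-1])  # keep the final release value
    return daily

# Hypothetical CPI readings for three consecutive months.
cpi_monthly = [2.0, 2.6, 2.3]
cpi_daily = interpolate_daily(cpi_monthly, 30)
```

Each release date keeps its published value, and the days in between lie on the straight line joining neighboring releases.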
Based on the form of the data, the paper applies the further improved K-means clustering algorithm to the credit information classification of individual insurance customers.With the help of insurance professionals, some individual customer attributes and business indicators are extracted from the customer information database of a property and casualty insurance company to describe individual customer credit, such as age, gender, education, marital status, employment status, renewal rate, claim rate, and premium amount.
In the experiment, the data from the insurance customer information database for the past two years are selected as the target database, and five small sample sets, each containing 400 customer records, are randomly selected to form the large sample set. The data objects contain various types of variables, which need to be processed before clustering. For example, the age attribute is divided into 8 intervals such as [20], etc., and the corresponding weight z is calculated with {1, 2, ..., 8} as the corresponding state values and used as the value of the variable. The number of output categories is set to 4. Due to space limitations, Table 2 shows some of the processed data.
Both the original K-means clustering algorithm and the improved K-means algorithm were applied to the data; the improved algorithm is intended to reduce the processing time for large sample sets. The analysis results indicate that the probability values of the differences between categories are less than 0.001, and the clustering effect is good. After clustering the sample data multiple times, the stability of the improved algorithm is 0.795, higher than that of the original algorithm.
Furthermore, we use the term spread 5-1, the term spread 105, and the 10-year Treasury yield as time series datasets, respectively. Firstly, the time series data are reconstructed using different regression (or recursive) orders and sampling intervals, where p is the regression (or recursive) order and d is the sampling interval. Secondly, the MGP model, the RBF model, and the SVM regression model are applied to the three datasets. In the experiments, we selected p = 1, ..., 6 and d = 1, ..., 8, conducting 48 sets of experiments. Table 3 shows the best experimental results for each algorithm over the 48 sets of experiments on the reconstructed data of the three datasets, together with the corresponding p and d [21,22].
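The reconstruction and the search over the 48 settings can be sketched as follows. The input layout is an assumption, since the section does not spell it out: each input holds p lagged values spaced d steps apart and the output is the current value. The "model" here is a placeholder that repeats the most recent lag; in an actual experiment an MGP, RBF, or SVM regressor would be fitted at that point. The series is toy data, not Treasury yields.

```python
import math

def reconstruct(series, p, d):
    """Build (input, output) pairs from a time series.
    Assumed layout: input = [y[t - p*d], ..., y[t - 2*d], y[t - d]], output = y[t]."""
    inputs, outputs = [], []
    for t in range(p * d, len(series)):
        inputs.append([series[t - k * d] for k in range(p, 0, -1)])
        outputs.append(series[t])
    return inputs, outputs

def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted, actual)) / len(actual))

# The 48 (p, d) settings: p = 1..6 and d = 1..8.
settings = [(p, d) for p in range(1, 7) for d in range(1, 9)]

series = [float(v) for v in range(10)]  # toy data
inputs, outputs = reconstruct(series, p=2, d=3)

# Placeholder model: predict the most recent lag for every setting that yields data.
scores = {}
for p, d in settings:
    xs, ys = reconstruct(series, p, d)
    if ys:
        scores[(p, d)] = rmse([x[-1] for x in xs], ys)
best = min(scores, key=scores.get)
```

Selecting `best` by RMSE mirrors how the best p and d are reported per algorithm and per dataset.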
From Table 3, we can see that the MGP model obtains the best prediction error RMSE for all three data reconstructions, and we can also observe that the p and d of the reconstructions with the best prediction error differ for different data.This makes it challenging to obtain the optimal p and d in practical applications.Obtaining optimal p and d by model selection algorithms is a promising research direction in the future.In terms of running time, the MGP model still takes the longest time, which is consistent with the results of the first set of experiments.

Conclusions
The exploration and study of corporate credit term structures have garnered significant attention due to their substantial value in corporate credit analysis and market investment. This topic has become a crucial area in financial engineering, attracting scholars and investors alike. This paper begins with an analysis of domestic and international approaches to interest rate term structures. It observes that existing studies are limited to exploring the characteristics of interest rate term structures based on known market behavior. By examining a substantial amount of historical data, this paper identifies three key factors influencing the term structure of government bond interest rates: the inflation index CPI, the growth rate of industrial value added IP, and a crucial measure of market funding, the interbank 7-day pledged repo rate. Breaking away from the traditional academic framework, this paper employs a Gaussian process mixture (MGP) model to predict future behavior. This approach considers market participants' perspectives while respecting historical changes in the market. Experimental results demonstrate that the MGP model achieves more accurate prediction results than the other machine learning algorithms tested. It also exhibits a significant advantage over traditional linear regression algorithms in capturing market dynamics.

Table 1. Indicators of Corporate Bond Credit Ratings

Table 2. Processed Insurance Customer Information Data

Table 3. Results of Regression Analysis of Each Algorithm on Three Sets of Recombination Data