Modeling of Human Development Index Using Ridge Regression Method

This article models the factors affecting the Human Development Index (HDI) in North Sumatera in 2015 using the ridge regression method. Ridge regression is used because the HDI data exhibit multicollinearity, so the least squares regression method, the regression method commonly used in statistical modeling, is no longer appropriate. This study compares the models obtained by applying the least squares method and the ridge regression method to the HDI data, and shows that ridge regression produces a better model and removes the effect of multicollinearity, while the least squares method does not. The factors that significantly affect HDI in the 2015 North Sumatera data are average length of schooling and total expenditure per capita per month. The goodness-of-fit of the ridge regression model is 81.81%, which indicates that the model is good and acceptable.


Introduction
The basic target of development is human development, and its main goal is to create an environment that enables the people of a nation to enjoy long, healthy, and productive lives. Human development places humans as the ultimate goal of development rather than as a tool of development. The success of human development can be seen from the extent to which fundamental human problems, namely health and education, can be solved [1][2][3].
The Human Development Index (HDI) is an indicator used to measure important aspects of the quality of economic development outcomes, namely longevity, health, and a productive life [2][3][4][5]. These three elements are very important in determining the level of HDI; they are interrelated and cannot stand alone. In addition, HDI is also influenced by other factors, such as economic growth, infrastructure, and government policy. The HDI of an area therefore increases when these three elements improve. An area is said to have good economic development when its HDI value is high; in other words, there is a positive correlation between the HDI value and the degree of success of economic development.
Since HDI is an important indicator of the success of a region's economic development, the level of HDI in an area should be identified periodically. In modeling the factors that determine HDI, the method commonly used is ordinary least squares (OLS). However, OLS is not appropriate when the independent variables are multicollinear, as is the case in the HDI data considered here.
One method for solving the multicollinearity problem is ridge regression. The regression coefficients produced by this method are more stable, and their variances are smaller than those of the classical OLS method [7,8]. This method is essentially a modification of OLS: the correlation matrix of the independent variables is adjusted by the ridge procedure so that the regression coefficient estimates can be obtained easily.
Based on the description above, the problem addressed in this research is the modeling of the factors that affect HDI in North Sumatra in 2015 using the ridge regression method. This article also compares the results of the OLS analysis and the ridge regression method.

Data
The data used in this study are secondary data, namely the Human Development Index (HDI) and its assumed influencing factors for each of the 33 districts in North Sumatera in 2015. The factors assumed to affect HDI are number of poor people (X1), population density (X2), GRDP (Gross Regional Domestic Product) (X3), total expenditure per capita per month (X4), average length of schooling (X5), number of educational facilities (X6), and number of health facilities (X7). The data were obtained from the Central Statistics Agency (BPS) of North Sumatra (BPS, 2016). The data have a multicollinearity problem, and are therefore modeled using both the OLS and the ridge regression method.

Ridge Regression Analysis
The ridge regression estimator is biased, but its variance is small. Ridge regression is one method that can be used to solve a multicollinearity problem categorized as less than perfect. The method modifies OLS [7], [8] by adding a bias constant k to the diagonal of the Z′Z correlation matrix, so that the ridge coefficient estimates are influenced by the magnitude of k. The values of k for the ridge regression coefficients generally lie between 0 and 1. In its simplest form, the ridge regression procedure is as follows. Let Z be the centered and scaled form of the matrix X, so that the regression problem is in correlation form. The vector of ridge predictions is obtained by minimizing the mean square error (MSE) of the fitted regression model. The ridge regression estimator is [7], [10]:

β̂*_R = (Z′Z + kI)⁻¹ Z′Y,    (1)

where β̂*_R is the ridge regression estimator, k is the ridge parameter, I is the identity matrix, and Z is the matrix X after centering and scaling.
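The ridge estimator in Equation (1) can be sketched numerically as follows. This is an illustrative example with synthetic data, not the paper's computation; the value k = 0.1 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(33, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=33)   # induce multicollinearity
y = X @ np.array([1.0, 2.0, 1.5]) + rng.normal(size=33)

def ridge_estimate(X, y, k):
    # Center X and scale each column to unit length (correlation form),
    # center y, then solve (Z'Z + kI) beta = Z'y.
    Xc = X - X.mean(axis=0)
    Z = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    yc = y - y.mean()
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ yc)

beta_r = ridge_estimate(X, y, k=0.1)
print(beta_r)
```

Setting k = 0 recovers the OLS solution in correlation form; increasing k shrinks the coefficients toward zero.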
Mathematically, the ridge estimator β̂*_R is related to the OLS estimator β̂*_OLS as follows [8]:

β̂*_R = (I + k(Z′Z)⁻¹)⁻¹ β̂*_OLS.    (2)

The properties of the ridge regression estimator are:

EKSAKTA: Berkala Ilmiah Bidang MIPA (Science Periods EKSAKTA of MIPA)
Minimum variance: the variance–covariance matrix of the ridge regression estimator is

Var(β̂*_R) = σ²(Z′Z + kI)⁻¹ Z′Z (Z′Z + kI)⁻¹,    (3)

while the variance of the OLS estimator is

Var(β̂*_OLS) = σ²(Z′Z)⁻¹.    (4)

Comparing the OLS variance in Equation (4) with the ridge variance in Equation (3) shows that, for k > 0, the variance of the ridge regression estimator is smaller than the OLS variance.
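The variance comparison in Equations (3) and (4) can be checked numerically. The sketch below uses a synthetic collinear design with σ² set to 1 and compares the total variance (trace of each covariance matrix); the design and k = 0.1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
Z = rng.normal(size=(33, 3))
Z[:, 2] = Z[:, 0] + 0.05 * rng.normal(size=33)   # collinear design
Zc = Z - Z.mean(axis=0)
Z = Zc / np.sqrt((Zc ** 2).sum(axis=0))          # unit-length scaling

G = Z.T @ Z
k = 0.1
A = np.linalg.inv(G + k * np.eye(3))
var_ridge = np.trace(A @ G @ A)        # trace of (Z'Z+kI)^-1 Z'Z (Z'Z+kI)^-1
var_ols = np.trace(np.linalg.inv(G))   # trace of (Z'Z)^-1
print(var_ridge < var_ols)             # True
```

In eigenvalue terms, the ridge total variance is Σ λᵢ/(λᵢ+k)², which is strictly less than the OLS total variance Σ 1/λᵢ for any k > 0.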

The Selection for Bias Constant, k
The magnitude of the bias constant k must be chosen carefully: the desired k yields a relatively small bias and a relatively stable parameter estimate. There are several ways to choose k; one of them is to minimize the MSE of the ridge regression [8], which gives

k = p σ̂² / (β̂′β̂),    (6)

where p is the number of parameters excluding β₀, and σ̂² and β̂ are obtained from the OLS estimation method. Kibria & Banik (2016) used an iterative procedure to determine the value of k, with the following steps:
1. Take Equation (6) as the initial value, k₀ = p σ̂² / (β̂′β̂).
2. Substitute k₀ into the ridge estimator to obtain β̂*_R(k₀), then compute k₁ = p σ̂² / (β̂*_R(k₀)′ β̂*_R(k₀)).
3. Repeat the substitution until the value of k converges.
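The iterative procedure above can be sketched as follows. This is an illustration on synthetic standardized data; the convergence tolerance and iteration cap are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(33, 3))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)
y = Z @ np.array([0.5, 1.0, -0.3]) + rng.normal(scale=0.5, size=33)
y = y - y.mean()

n, p = Z.shape
beta_ols, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta_ols
sigma2 = resid @ resid / (n - p - 1)             # OLS estimate of sigma^2

k = p * sigma2 / (beta_ols @ beta_ols)           # initial value, Eq. (6)
for _ in range(100):
    beta_r = np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ y)
    k_new = p * sigma2 / (beta_r @ beta_r)       # update using ridge estimate
    if abs(k_new - k) < 1e-10:                   # assumed stopping rule
        k = k_new
        break
    k = k_new
print(k)
```

Because each update shrinks the coefficient vector, the sequence of k values is non-decreasing and typically converges in a few iterations.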

After the ridge parameter k and the corresponding parameter estimates have been obtained, the parameter values can be transformed back to their form before standardization using the formulas [7]:

b_j = (s_y / s_j) β̂*_{R,j},    b_0 = ȳ − Σ_j b_j x̄_j,

where s_y and s_j are the sample standard deviations of Y and X_j. The significance of the regression can then be tested simultaneously with the F test (ANOVA test) and individually with the t test.

Results and Discussions

3.1 Multicollinearity Test
The multicollinearity of the HDI data for North Sumatra in 2015 is tested below in several ways:

1. VIF and Tolerance Values.
A multicollinearity problem can be detected using the VIF and tolerance values of each independent variable: if the VIF of an independent variable is greater than 10, or its tolerance is less than 0.1, then less-than-perfect multicollinearity is present. Table 1 presents the results of the multicollinearity test on the HDI data. Table 1 shows that the VIF values of X1, X3, X6 and X7 are greater than 10, and their tolerance values are less than 0.1. It can therefore be concluded that there is an imperfect multicollinearity problem among the independent variables.
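The VIF and tolerance diagnostics used above can be computed from the diagonal of the inverse correlation matrix (VIF_j is the j-th diagonal entry of R⁻¹, and tolerance is its reciprocal). The sketch below uses synthetic data with one strongly collinear pair; it is not the paper's Table 1.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(33, 3))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=33)   # strong collinearity

R = np.corrcoef(X, rowvar=False)    # correlation matrix of the predictors
vif = np.diag(np.linalg.inv(R))     # VIF_j = 1 / (1 - R_j^2)
tol = 1.0 / vif                     # tolerance
flagged = vif > 10                  # rule of thumb from the text
print(vif.round(2), tol.round(4), flagged)
```

Here the two collinear columns are flagged (VIF well above 10, tolerance below 0.1), while the independent column is not.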

2. Determinant of the Correlation Matrix. Multicollinearity can also be detected using the determinant of the correlation matrix: a determinant value close to 0 indicates an imperfect multicollinearity problem.
The matrix R contains the correlation coefficients between the selected independent variables in the HDI data; its determinant is 0.000060719. Since this value is close to 0, the correlation matrix is almost singular, and it can be concluded that there is an imperfect multicollinearity problem among the independent variables.

3. Condition Value.
Multicollinearity can also be measured by the ratio of the largest to the smallest eigenvalue of the correlation matrix, called the condition value. The eigenvalues were calculated using the NCSS software. The condition value obtained for the HDI data is larger than 100, so it can be concluded that imperfect multicollinearity exists among the independent variables.
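Both the determinant and the condition value diagnostics can be sketched in a few lines. The example below uses the same kind of synthetic collinear data as before (not the HDI data), with the thresholds from the text: a determinant near 0 and a condition value above 100 both indicate multicollinearity.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(33, 3))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=33)   # collinear pair

R = np.corrcoef(X, rowvar=False)
det_R = np.linalg.det(R)                 # near 0 under multicollinearity
eigvals = np.linalg.eigvalsh(R)          # eigenvalues of the correlation matrix
cond = eigvals.max() / eigvals.min()     # condition value
print(det_R, cond)
```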

Modeling the HDI Data Using OLS
In this section the HDI model is estimated using the OLS method. The regression analysis was performed with the SPSS software; the parameter estimates are presented in Table 2 and the ANOVA results in Table 3. The hypotheses for the simultaneous test are as follows:
H0: the variables X simultaneously have no effect on the predicted value of Y;
H1: there is at least one variable X that affects the predicted value of Y.
Table 3 shows that the value of Fcount exceeds Ftable. This means that at least one of the independent variables significantly affects the predicted value of Y. A t test should therefore be carried out to determine which independent variables are significant, with the following hypotheses:

H0: the independent variable has no significant individual effect on the predicted value of Y;
H1: the independent variable individually has a significant effect on the predicted value of Y.
The t test statistics are presented in Table 4. Based on Table 4, the tcount values of variables X4 and X5 are greater than ttable, meaning that X4 and X5 individually have a significant influence on the estimated value of Y. The coefficient of determination R² is 94.3 percent, which means that 94.3 percent of the variability of the dependent variable is explained by the regression model.
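The simultaneous F test, individual t statistics, and R² reported above can be computed by hand from an OLS fit. The sketch below does this on synthetic data (not the HDI data), using only NumPy; critical values would normally come from F and t tables.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 33, 3
X = rng.normal(size=(n, p))
y = 2 + X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])            # add intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
fitted = X1 @ beta
sse = np.sum((y - fitted) ** 2)                  # residual sum of squares
ssr = np.sum((fitted - y.mean()) ** 2)           # regression sum of squares
sst = np.sum((y - y.mean()) ** 2)

r2 = ssr / sst                                   # coefficient of determination
f_count = (ssr / p) / (sse / (n - p - 1))        # ANOVA F statistic
mse = sse / (n - p - 1)
se = np.sqrt(mse * np.diag(np.linalg.inv(X1.T @ X1)))
t_count = beta / se                              # individual t statistics
print(round(r2, 3), round(f_count, 1))
```

Fcount is compared against the F table with (p, n − p − 1) degrees of freedom, and each tcount against the t table with n − p − 1 degrees of freedom.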

The HDI data have a multicollinearity problem, and applying the OLS method to such data has the following consequences. First, in the partial test only a few variables are statistically significant, as presented in Table 4. Second, the sign of an estimated regression coefficient may be wrong: the coefficient of the number of poor people (X1) should be negative, because previous research shows that the number of poor people has a negative relationship with HDI. Third, the standard errors of the OLS estimates are quite large. Thus, the OLS method should not be applied to data with a multicollinearity problem.

Modeling with Ridge Regression Method
To solve the multicollinearity problem in the 2015 North Sumatera HDI data, the ridge regression method was used [8]. The first step in estimating the ridge regression parameters is to center and scale the data [12], [13]. Using the iterative procedure, the best value of k obtained is 0.300719. At k = 0.300719 the estimates of all model parameters are as presented in Table 5. The parameter estimates are then transformed back to the original scale, before standardization; the values are presented in Table 6.
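The full workflow of this section, standardize, estimate at a chosen k, then transform the coefficients back to the original units, can be sketched as follows. The data are synthetic, and k = 0.3 is used only to loosely mirror the reported k = 0.300719; none of the numbers correspond to Tables 5 and 6.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(33, 4)) * np.array([1.0, 5.0, 0.5, 2.0]) + 10
y = X @ np.array([2.0, -0.5, 3.0, 1.0]) + rng.normal(size=33)

# Standardize predictors and response.
xbar, sx = X.mean(axis=0), X.std(axis=0, ddof=1)
ybar, sy = y.mean(), y.std(ddof=1)
Z = (X - xbar) / sx
w = (y - ybar) / sy

n, p = Z.shape
k = 0.3
# Ridge estimate in correlation form: (R + kI)^-1 r, where R = Z'Z/(n-1).
beta_std = np.linalg.solve(Z.T @ Z / (n - 1) + k * np.eye(p),
                           Z.T @ w / (n - 1))

# Back-transform to the original units.
beta = beta_std * sy / sx
intercept = ybar - beta @ xbar
print(intercept, beta)
```

The back-transform uses the standard relations b_j = (s_y/s_j)·β̂*_j and b_0 = ȳ − Σ b_j x̄_j.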

Comparison of Parameter Estimation Model with Ridge Regression and Ordinary Least Square (OLS)
This section compares the results of the ordinary least squares (OLS) method and ridge regression, presented in Table 7. Based on Table 7, the VIF values of all independent variables under ridge regression are less than 10; thus the multicollinearity problem has been solved by the ridge regression.

In addition, the sign of the estimated parameter for the number of poor people (X1) now agrees with theory, i.e., it is negative. Table 7 also shows that all standard errors of the ridge regression are smaller than those of the OLS method. It can therefore be concluded that, in the presence of multicollinearity, the ridge regression method produces a better model than the OLS method.

Test of the Significance of Model Parameters
The significance of the model parameters is then tested with the following hypotheses:
H0: the variables X simultaneously have no effect on the predicted value of Y;
H1: there is at least one variable X that affects the predicted value of Y.
The results are given in Table 8. Based on Table 8, the value of Fcount is greater than Ftable, so H0 is rejected: at least one of the independent variables has a significant effect on the dependent variable. A t test must then be performed to determine which independent variables significantly influence HDI; the results of the t test are presented in Table 9.