Development of Multivariate Flood Damage Function for Flood Damage Assessment in Gunsan City, Korea

Insang Yu; Imee V. Necesito; Hayong Kim; Tae Sung Cheong; Sangman Jeong

doi:10.9798/KOSHAM.2017.17.2.247

Abstract

Despite Korea’s growing economy and improving disaster prevention techniques, natural disasters such as floods, typhoons and droughts still threaten its people. In particular, floods induced by typhoons have caused significant damage to property and human life. Estimating flood damage is essential to preparing countermeasures that mitigate flood disasters. In this regard, the Ministry of Public Safety and Security (MPSS), the government institution designated to assess and analyze damages and losses and to evaluate the disaster risk of affected areas in accordance with their disaster risk management plans, is now developing a new method for estimating damages and losses. This study aims to develop flood damage functions that estimate the flood damage of Gunsan City by building type (residential, commercial and agricultural facilities) using Ordinary Least Squares (OLS) regression and Geographically Weighted Regression (GWR). The models are built with flood depth, flood duration, inundated area, family income and land price as the parameter variables. Both OLS and GWR were evaluated in this study, and GWR was found to be the better fit for flood damage estimation.

Keywords: Flood Damage Function, Geographically Weighted Regression, Ordinary Least Squares Regression

1. Introduction

Korea has become one of the leading countries in the Northwestern Pacific region since the 1960s owing to its rapid economic growth (Kim et al., 2007). However, this growth has been offset by the damages and losses caused by weather-related disasters. Korea has gone through many disastrous events, most of them brought about by typhoons. Typhoons Rusa (2002), Maemi (2003) and Sanba (2012) caused significant damage to property and human life in the country. The losses, both economic and insured, were estimated at billions of US dollars.

Due to these catastrophic events, Korea’s National Disaster Management Institute (NDMI) is now taking steps to better prevent and respond to the increasing damages and losses during flood events. The Multi-Dimensional Flood Damage Analysis (MDFDA), introduced decades ago for estimating flood damage, has been shown to over- or underestimate flood damages. Because the method originated in Japan and was not derived from actual living conditions in Korea, its factors may not be accurately applicable.

Korea is subject to natural hazards such as typhoons, floods, droughts, landslides, snowstorms, tsunamis and earthquakes, at both small and large scales. At present the nation also has a growing population of 51,202,130 and a population density of about 513 people per sq. km (Ministry of Security and Public Administration, 2014). With this approximately 4% increase in population since 2009 and the corresponding growth in vulnerability to hazards, better disaster management must be prioritized to ensure the safety of its people.

Flood damage refers to all the varieties of harm caused by flooding (Messner and Meyer, 2005). Tangible damages are those that can be evaluated quantitatively in economic terms, such as damage to lifelines and buildings, and can therefore be expressed monetarily. Intangible damages, such as loss of life, cannot be measured in money. Flood damage is further divided into direct damage, caused by physical contact with floodwater, and indirect damage, caused by the interruption and disruption of economic and social activities as a consequence of direct flood damage (Fig. 1). Examples of indirect damage include traffic disruptions, reduced productivity and reduced competitiveness of economic sectors due to affected public services (Smith and Ward, 1998). Indirect damage is the most challenging part of the recovery phase.

Fig. 1

Concept Diagram of Direct and Indirect Damages

The flood depth-damage curve (Smith, 1994), also known as the loss function (Smith, 1994; White, 1945, 1964), is the most frequently used way to estimate damage. There are two ways to derive a depth-damage curve. One is statistical analysis of damage data collected after a flood; the other is hypothetical analysis, in which flood conditions are simulated to generate a synthetic depth-damage curve (Smith, 1994; Dutta et al., 2003). The U.S. Army Corps of Engineers used data collected after the 1983, 1986, 1995 and 1996 flood events in California’s Central Valley. Penning-Rowsell et al. (1977) divided buildings into 21 categories and determined a total of 168 depth-damage curves, one for each building type under two flood durations and four societal conditions. Rainfall, topography, and meteorological, physical and human factors such as flood prevention measures also influence the resulting damage (Yang et al., 2005). Besides building type (Smith, 1994; FEMA, 1977; McBean et al., 1988; Chang et al., 2008), other parameters were also found to affect the damage: income per household (Lekuthai et al., 2001; McBean et al., 1988), flood forecast and alarm systems (Wind et al., 1999; David, 2001; Du Plessis, 2002), advance warning time before the flood (Penning-Rowsell et al., 1977; Thieken et al., 2005), flood experience (McBean et al., 1988; Wind et al., 1999; McPherson et al., 1977), disaster prevention (Penning-Rowsell et al., 1977), flood frequency (Lekuthai et al., 2001; McBean et al., 1988; Thieken et al., 2005), flood velocity (Smith, 1994; CH2M HILL, 1974; Black, 1975; Beck et al., 2002), number of family members per household (McBean, 1988; Shaw et al., 2005) and building location (Chang et al., 2008; Shaw et al., 2005). Because flood damage is caused by various factors, Shaw et al. (2005) suggested the use of a multiple regression analysis model.

This study develops multivariate flood damage functions for the residential, commercial and agricultural sectors based on data collected on the following parameters: flood impact, building characteristics, socio-economic status and damage after the flood event of August 2012 in Gunsan City.

2. Study Area

Gunsan City (see Fig. 2) is a city in North Jeolla Province (Jeollabuk-do) located south of the Geum River. It sits on the fertile western Honam plain, where much rice is harvested. Currently, Gunsan’s economy thrives on fishing and agriculture. The city has a total area of 680.11 sq. km and a population of 278,495 in 111,275 households, giving a population density of about 410/km². Jeollabuk-do comprises 14 districts and ranks 8th among the country’s most flood-vulnerable provinces. This motivated the authors to provide a statistical analysis based on the data available for Gunsan City. The area has experienced several flood events in the past. The city has also experienced significant population growth, especially during the last decade. As the city becomes more urbanized, its vulnerability to flood damage increases as well.

Fig. 2

Gunsan City, Korea

2.1 Flood Impact Parameters

Certain factors have been shown to affect the impacts of flooding. Knowing these factors helps improve awareness and preparedness in the disaster risk management system. Water depth, flood duration, inundated area, family income and land price are the factors influencing flood damage that are used in this study.

Stage-damage curves are the usual representation of flood damage versus flood depth, with the latter as the independent variable. They are one way to predict the most probable damage for a given flood depth. Flood depth is also one of the most abundant kinds of data from flooding incidents, because it is easy to observe and therefore easy to quantify. Flood duration is another factor in determining the impacts of floods. Run-up time and the time between the first warning and the actual flood define flood duration. In cases such as flash floods, where the run-up time is short, the threat is perilous and damage results. Inundated area is also regarded as a factor in this study. The larger the area submerged, the larger the damage it can cause, since more people, buildings, infrastructure and other property within that area are affected.

Family income is another factor to be considered. The number of people living in a given type of building, whether residential, commercial or otherwise, is taken into account. For example, for a family of three living in a house in which two members (the father and the mother) are working, the expected income might be about five million won (5,000 USD). In that case, the value of the owner’s property can be estimated from the income using a capitalization rate method. Since flooded area is one of the considerations for the monetary impact of floods, land price should also be taken into consideration. In Korea, commercial and agricultural land carry great weight in the economy. If such land is flooded, the resulting damage is included in the total reported damage. For example, damage to flooded agricultural land also accounts for the fruits and vegetables that would otherwise have been harvested.

All of the factors stated above contribute to flood impacts. The impacts may vary in degree, especially across the quantified direct, indirect, tangible and intangible damages, a breakdown that is outside the scope of this study.

3. Methodology

3.1 Ordinary Least Squares

Ordinary Least Squares (OLS) regression is a standard way to express the relationship among the parameter variables. One condition of OLS is that the error terms ε_i are assumed to be independent and identically distributed (IID) random variables with mean zero and constant variance σ². The general equation of the OLS model is:

(1)

$$y_i = \beta_0 + \sum_{j=1}^{p} X_{ij}\,\beta_j + \varepsilon_i$$

Where:

β_0 = Intercept coefficient

β_j = Slope coefficient for the jth independent variable X_ij

ε_i = Random error term

I = n × n identity matrix (under the IID assumption, the covariance matrix of the errors is σ²I)

In matrix form, the model can be written as:

(2)

Y = Xβ + ε

Moreover, under the assumptions of independent errors and constant variance, the estimate of β takes the form:

(3)

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

Where:

(X^T X)^{-1} = Inverse of the matrix X^T X

T = Matrix transpose
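The estimator in Eq. (3) can be sketched in a few lines of NumPy. This is a minimal illustration on made-up, noise-free data and not the software used in the study; the function name `ols_fit` and the sample values are assumptions introduced here.

```python
import numpy as np

def ols_fit(X, y):
    """Return the OLS coefficients of y = X b + e.

    X must already contain a column of ones for the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solve (X^T X) b = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noise-free data: damage = 100 + 50 * depth + 2 * duration
depth = np.array([0.1, 0.5, 1.0, 1.5, 2.0])
duration = np.array([10.0, 30.0, 20.0, 60.0, 40.0])
X = np.column_stack([np.ones_like(depth), depth, duration])
y = 100.0 + 50.0 * depth + 2.0 * duration

beta_hat = ols_fit(X, y)  # intercept, depth slope, duration slope
```

Solving the normal equations with `np.linalg.solve` gives the same result as the explicit inverse in Eq. (3) and is usually the more stable choice numerically.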

Analysis of Variance (ANOVA) is one method to check whether the regression model as a whole is statistically significant, using the F-value or its p-value. The t-test (p-value), in contrast, is used to check the statistical significance of each individual parameter variable. The formula for the t-test is as follows.

(4)

$$T = \frac{\hat{\beta}}{s/\sqrt{s_{xx}}}$$

Where s is the residual standard error, and the following sub-equations are needed to evaluate the equation above:

(5)

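$$\hat{\beta} = \frac{s_{xy}}{s_{xx}}$$

(6)

$$s_{xy} = \sum (x - \bar{x})(y - \bar{y})$$

(7)

$$s_{xx} = \sum (x - \bar{x})^2$$

Lastly, the value of R-squared is calculated as

(8)

$$R^2 = \frac{s_{xy}^2}{s_{xx}\, s_{yy}}$$

Where:

(9)

$$s_{yy} = \sum (y - \bar{y})^2$$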
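The quantities in Eqs. (4) to (9) can be computed directly for a single explanatory variable. The sketch below assumes that s in Eq. (4) is the residual standard error with n − 2 degrees of freedom, which the text does not state explicitly; the function name and the five data points are hypothetical.

```python
import math

def simple_regression(x, y):
    """Return (slope, t statistic, R-squared) for a one-variable regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # Eq. (6)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)                        # Eq. (7)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)                        # Eq. (9)
    beta = s_xy / s_xx                                               # Eq. (5)
    r_squared = s_xy ** 2 / (s_xx * s_yy)                            # Eq. (8)
    # Assumption: s is the residual standard error with n - 2 degrees of freedom.
    s = math.sqrt((s_yy - beta * s_xy) / (n - 2))
    t_stat = beta / (s / math.sqrt(s_xx))                            # Eq. (4)
    return beta, t_stat, r_squared

# Hypothetical sample
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
beta, t_stat, r_squared = simple_regression(x, y)
```

For this sample the slope and R-squared both come out at 0.6, and the t statistic is about 2.12.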

Redundancy among the model variables is expressed through collinearity. In this research, the multicollinearity condition number was used to detect this problem in the model. If the multicollinearity condition number exceeds 30, the model exhibits large variances and covariances, wide confidence intervals and insignificant coefficients. A value above this threshold therefore implies that the model is not reliable.
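One common way to obtain such a condition number is to scale each column of the design matrix to unit length and take the ratio of the largest to the smallest singular value. The paper does not say how its value was computed, so the sketch below is an assumption about the procedure, with hypothetical data and names.

```python
import numpy as np

def condition_number(X):
    """Ratio of largest to smallest singular value of the column-scaled matrix."""
    X = np.asarray(X, dtype=float)
    X_scaled = X / np.linalg.norm(X, axis=0)   # unit-length columns
    singular_values = np.linalg.svd(X_scaled, compute_uv=False)
    return singular_values.max() / singular_values.min()

# Two unrelated columns: no collinearity
independent = np.eye(2)

# Two almost identical columns: severe collinearity
nearly_collinear = np.array([[1.0, 1.01],
                             [2.0, 2.00],
                             [3.0, 3.00],
                             [4.0, 4.00]])

cn_independent = condition_number(independent)
cn_collinear = condition_number(nearly_collinear)
flagged = cn_collinear > 30   # threshold used in the text
```

The first matrix gives a condition number of 1, while the second exceeds the threshold of 30 by a wide margin.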

The normality of the data sets can be checked with the Shapiro-Wilk test, whose statistic is given below:

(10)

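$$W = \frac{\left(\sum_{i=1}^{n} a_i\, x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Where:

x_i = sample value, with x_(i) the ith smallest sample value

a_i = constant generated from the means, variances and covariances of the order statistics of a sample from the normal distribution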

If the Shapiro-Wilk statistic W is too small, the distribution is said to be non-normal. In that case the Jarque-Bera test can also be performed to test normality. The equation is as follows:

(11)

$$JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right)$$

Where the following sub-equations are needed to evaluate the equation above:

(12)

$$S = \frac{\sum_{i=1}^{n} (Y_i - \mu)^3 / n}{s^3}$$

(13)

$$K = \frac{\sum_{i=1}^{n} (Y_i - \mu)^4 / n}{s^4} - 3$$

Where:

n = number of observations

Y = sample data

μ = mean

s = standard deviation
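Eqs. (11) to (13) translate directly into code. The sketch below assumes that s is the divide-by-n standard deviation, which the definitions above leave open; the function name and both samples are hypothetical.

```python
def jarque_bera(sample):
    """Return (JB, S, K) following Eqs. (11) to (13)."""
    n = len(sample)
    mu = sum(sample) / n
    # Assumption: s is the population (divide-by-n) standard deviation.
    s = (sum((y - mu) ** 2 for y in sample) / n) ** 0.5
    skewness = sum((y - mu) ** 3 for y in sample) / n / s ** 3         # Eq. (12)
    kurtosis = sum((y - mu) ** 4 for y in sample) / n / s ** 4 - 3.0   # Eq. (13)
    jb = n / 6.0 * (skewness ** 2 + kurtosis ** 2 / 4.0)               # Eq. (11)
    return jb, skewness, kurtosis

# A symmetric sample and a strongly right-skewed sample
jb_sym, s_sym, k_sym = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
jb_skew, s_skew, k_skew = jarque_bera([1.0, 1.0, 1.0, 1.0, 50.0])
```

The symmetric sample has zero skewness and a small JB value, whereas the skewed sample has positive skewness and a larger JB value, which is the direction in which normality is rejected.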

Hence, if the Jarque-Bera statistic (JB) is statistically significant, the normality assumption is rejected. In addition, the Breusch-Pagan test for random coefficients and the White specification-robust test were both performed to check for the presence of spatial heteroscedasticity.

To illustrate, if

(14)

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$

Where:

i = 1, … N

E(ε_i) = 0

Then, the auxiliary regression is given by

(15)

$$Z_i^2 = \phi + \delta X_i + \upsilon_i$$

Where the following sub-equations are needed to evaluate the equation above:

(16)

$$s^2 = \frac{\sum \hat{u}_i^2}{N}$$

(17)

$$Z_i^2 = \frac{\hat{u}_i^2}{s^2}$$

Note that the symbol “^” indicates an estimated value, so that û_i is the estimated residual of Eq. (14). Consequently, if the coefficient δ of X_i is 0, the error variance is homoscedastic; otherwise, it is heteroscedastic. The same logic applies to the White test. If the regression model is

(18)

$$Y_i = \alpha + \beta_1 X_i + \beta_2 W_i + \varepsilon_i$$

Then the auxiliary regression model is

(19)

$$\hat{u}_i^2 = \phi + \delta_1 X_i + \delta_2 W_i + \delta_3 X_i^2 + \delta_4 W_i^2 + \delta_5 X_i W_i + \upsilon_i$$

If the coefficients δ_1 to δ_5 are all 0 in the White test, the error variance is homoscedastic; if not, it is heteroscedastic.
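The Breusch-Pagan auxiliary regression of Eqs. (14) to (17) can be sketched as follows for one explanatory variable. The statistic returned here is the classical form, half the explained sum of squares of the auxiliary regression. The paper does not state which form it used, so that choice, the function name and the synthetic data are all assumptions.

```python
import numpy as np

def breusch_pagan(x, y):
    """Return (LM statistic, delta) for a one-regressor Breusch-Pagan test."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # fit Eq. (14)
    u_hat = y - X @ beta                            # estimated residuals
    s2 = np.sum(u_hat ** 2) / len(y)                # Eq. (16)
    z2 = u_hat ** 2 / s2                            # Eq. (17)
    coef, *_ = np.linalg.lstsq(X, z2, rcond=None)   # fit Eq. (15)
    explained = np.sum((X @ coef - z2.mean()) ** 2)
    return explained / 2.0, coef[1]

x = np.arange(10.0)
signs = np.where(np.arange(10) % 2 == 0, 1.0, -1.0)
y_const = 2.0 + 3.0 * x + signs             # error size does not depend on x
y_grow = 2.0 + 3.0 * x + 0.5 * x * signs    # error size grows with x

lm_const, delta_const = breusch_pagan(x, y_const)
lm_grow, delta_grow = breusch_pagan(x, y_grow)
```

In the first data set the estimated δ is zero up to rounding, matching the homoscedastic case. In the second, δ is positive and the statistic is larger, matching the heteroscedastic case.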

3.2 Geographically Weighted Regression

The goal of Geographically Weighted Regression (GWR) is to indicate the presence of non-stationarity, where the locally weighted regression coefficients move away from their global values (Bivand, 2014). GWR allows for the possibility that the coefficient values of the global model are less accurate than those of a local model. Local variation in the coefficients can be taken as an indication of non-stationarity. In some studies, GWR provided a better specification than global models such as OLS (Yrigoyen et al., 2008).

To take the spatially varying characteristics of flood damage into account, the damage function can be modified using the Geographically Weighted Regression (GWR) method:

(20)

$$y_i = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\, x_i + \beta_2(u_i, v_i)\, x_i^2 + \varepsilon_i$$

Wherein β₀(u_i, v_i), β₁(u_i, v_i) and β₂(u_i, v_i) are the realizations of continuous functions at point i, and ε_i is the residual at point (u_i, v_i). Since GWR recognizes the possibility of spatial variation (Chang et al., 2008), the GWR estimate is:

(21)

$$\hat{\beta}(u_i, v_i) = (X^T W X)^{-1} X^T W Y$$

Wherein W is an n × n matrix whose off-diagonal elements are zero and whose diagonal elements denote the geographical weighting of the observed data for point i. The weight of each observation is given by:

(22)

$$w_{ij}(u_i, v_i) = \left(1 - \left(\frac{d_{ij}(u_i, v_i)}{h}\right)^3\right)^3$$

Where d_ij is the Euclidean distance between observations i and j, and h is the bandwidth.
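A local GWR fit at one location can be sketched with the weights of Eq. (22) and the estimator of Eq. (21). The sketch assumes that observations at or beyond the bandwidth h receive zero weight, as is usual for this kernel but not stated in the text. The function names, coordinates and data are hypothetical.

```python
import numpy as np

def tricube_weights(coords, i, h):
    """Weights of all observations relative to point i, following Eq. (22)."""
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords - coords[i], axis=1)   # Euclidean distances d_ij
    w = (1.0 - (d / h) ** 3) ** 3
    w[d >= h] = 0.0   # assumption: no weight at or beyond the bandwidth
    return w

def gwr_local_fit(X, y, coords, i, h):
    """Local coefficients at point i, following Eq. (21)."""
    w = tricube_weights(coords, i, h)
    XtW = X.T * w     # equivalent to X^T @ diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

# Five hypothetical observation sites on a line, exact relation y = 1 + 2x
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
x = np.array([0.5, 1.0, 2.0, 3.5, 5.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

local_beta = gwr_local_fit(X, y, coords, i=2, h=3.0)
w_edge = tricube_weights(coords, i=0, h=1.5)
```

Because the relation in this toy data is the same everywhere, the local fit recovers the global coefficients. With real data the coefficients change from point to point, which is the non-stationarity GWR is meant to reveal.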

3.3 Box-Cox Method

The Box-Cox method is a useful way to normalize non-normal datasets. The Box-Cox transformation is defined as:

(23)

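$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \log(y), & \text{otherwise} \end{cases}$$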

The λ is treated as a parameter in the likelihood function and the profile likelihood function is evaluated in order to get the optimal λ value.
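The transformation and the search for λ can be sketched as below. The profile log-likelihood used here is the standard one for a normal model of the transformed data, and the grid search is a simplification. Both, along with the function names and the sample, are assumptions rather than the procedure used in the study.

```python
import math

def box_cox(y, lam):
    """Apply the Box-Cox transformation of Eq. (23) to positive values."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def profile_log_likelihood(y, lam):
    """Profile log-likelihood of lambda under a normal model, up to a constant."""
    n = len(y)
    z = box_cox(y, lam)
    z_bar = sum(z) / n
    variance = sum((v - z_bar) ** 2 for v in z) / n
    return -0.5 * n * math.log(variance) + (lam - 1.0) * sum(math.log(v) for v in y)

def best_lambda(y, grid):
    """Pick the grid value of lambda with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_log_likelihood(y, lam))

# Hypothetical sample whose logarithms are evenly spaced around zero
sample = [math.exp(v) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)]
chosen = best_lambda(sample, [-1.0, -0.5, 0.0, 0.5, 1.0])
```

For this sample the search selects λ = 0, the logarithmic case, as expected for data that are symmetric on the log scale.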

4. Results and Discussion

4.1 Descriptive Statistics

Descriptive statistics of the model variables are provided as a general overview of the data sets. The differences among the mean values of the parameters are due to the different units used for each parameter. Additionally, the standard deviations reveal variability in the dispersion of the variables.

The statistics show that the standard error of the mean is quite high, so the sample means are not precise estimates and are likely to lie far from the true population means. The standard deviations of most variables are also high, which shows that the data points are spread over a large range of values.

Skewness, on the other hand, is positive for all variables except the flood duration (dur) of commercial facilities (see Tables 1 to 3). Positive values indicate distributions skewed to the right, and the negative value indicates one skewed to the left. The Shapiro-Wilk test consistently gave p-values of less than 0.01 (p-value = 0.00000), which rejects the assumption of normality. Thus, to allow a clear comparison, the authors normalized the datasets with the Box-Cox transformation; the coefficient results for the untransformed and transformed datasets are shown in Tables 4 and 5.

Table 1

Descriptive Statistics of the Model Variables for Residential Facilities

	*dam	*dep	*dur	*far	*inc	*lp
N	496	496	496	496	496	496
Min	100.000	0.010	1.000	12.000	1.000	19.459
Max	60,000.000	2.000	965.000	37,008.000	5.000	2,367.230
Mean	2,656.661	0.262	185.718	706.450	2.050	337.611
SE Mean	265.929	0.019	5.006	175.255	0.026	8.884
Std Dev	5,922.519	0.425	111.498	3,903.124	0.572	197.867
Skewness	4.728	2.492	2.144	8.572	0.917	3.865
Kurtosis	28.638	5.555	12.501	74.228	3.408	27.660

*dam-damage amount *dep-flood depth *dur-flood duration *far-inundated area *inc-family income *lp-land price
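As a quick consistency check on Table 1, the standard error of the mean should equal the standard deviation divided by the square root of N. The sketch below applies this standard identity to the damage column; the function name is illustrative and not part of the study.

```python
import math

def standard_error(std_dev, n):
    """Standard error of the mean from a standard deviation and a sample size."""
    return std_dev / math.sqrt(n)

# Damage amount (dam) column of Table 1: Std Dev = 5,922.519, N = 496
se_dam = standard_error(5922.519, 496)
```

The result agrees with the reported SE Mean of 265.929 to three decimals.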

Table 2

Descriptive Statistics of the Model Variables for Commercial Facilities

	*dam	*dep	*dur	*far	*inc	*lp
N	752	752	752	752	752	752
Min	100.000	0.010	4.000	2.000	1.000	19.459
Max	2.1×10⁵	2.250	299.000	1×10⁵	9.000	2.596×10³
Mean	6.54×10³	0.392	183.019	1×10³	2.407	550.160
SE Mean	597.108	0.020	2.792	187.346	0.034	15.120
Std Dev	1.64×10⁴	0.548	76.566	5.14×10³	0.926	414.641
Skewness	6.971	2.070	-0.511	12.195	2.723	2.003
Kurtosis	62.597	3.398	-0.904	199.362	14.661	4.972

*dam-damage amount *dep- flood depth *dur-flood duration *far-inundated area *inc-family income *lp-land price

Table 3

Descriptive Statistics of the Model Variables for Agricultural Facilities

	*dam	*dep	*dur	*far	*inc	*lp
N	30	30	30	30	30	30
Min	100.000	0.020	14.000	123.000	1.000	8.760
Max	21,008.000	1.960	973.000	4,409.000	4.000	526.244
Mean	2,718.433	0.722	207.700	959.967	1.933	96.172
SE Mean	964.748	0.109	31.118	173.444	0.172	27.340
Std Dev	5,284.144	0.598	170.440	949.994	0.944	149.749
Skewness	2.515	0.632	3.123	2.102	0.929	2.150
Kurtosis	5.933	-0.974	14.222	5.119	0.233	3.750

*dam - damage amount *dep- flood depth *dur-flood duration *far-inundated area *inc-family income *lp-land price

Table 4

OLS and GWR Results for Untransformed Datasets

Parameter Variables	Coefficient (OLS) Residential	Coefficient (GWR) Residential	Coefficient (OLS) Commercial	Coefficient (GWR) Commercial	Coefficient (OLS) Agricultural	Coefficient (GWR) Agricultural
Intercept	-149.248	1203.646	2692.710	754.041	-4790.713	-7643.386
dep	7339.759	5742.459	11892.550	17247.551	3509.237	3001.392
dur	-0.041	1.142	-3.954	-0.135	6.721	22.325
far	-0.014	1.233	0.203	-5.403	0.936	3.370
inc	491.837	94.574	134.682	435.851	1222.567	-5.295
lp	-0.322	-4.341	-1.155	-2.476	3.301	-11.530

*dam-damage amount *dep- flood depth *dur-flood duration *far-inundated area *inc-family income *lp-land price

Table 5

OLS and GWR Results for Transformed Datasets

Parameter Variables	Coefficient (OLS) Residential	Coefficient (GWR) Residential	Coefficient (OLS) Commercial	Coefficient (GWR) Commercial	Coefficient (OLS) Agricultural	Coefficient (GWR) Agricultural
Intercept	-4.0×10⁻⁶	-1.842×10⁻³	0.230	-1.842×10⁻³	0.073	-0.035
BCDEP	9.59×10⁻⁴	6.140×10⁻⁴	-0.150	6.140×10⁻⁴	-0.025	-0.021
BCDUR	2.0×10⁻⁶	1.3×10⁻⁵	2.7×10⁻⁵	1.3×10⁻⁵	-7.11×10⁻⁴	-0.003
BCFAR	1.0×10⁻³	1.453×10⁻³	0.025	1.453×10⁻³	0.452	0.346
BCINC	5.5×10⁻⁵	8.0×10⁻⁵	1.0×10⁻³	8.0×10⁻⁵	-0.020	0.014
BCLP	-8.0×10⁻⁵	2.85×10⁻⁴	2.522×10⁻³	2.85×10⁻⁴	2.57×10⁻⁴	0.428

*BCDAM - transformed damage amount *BCDEP - transformed flood depth *BCDUR - transformed flood duration

*BCFAR - transformed inundated area *BCINC - transformed family income *BCLP - transformed land price

4.2 Analysis of OLS and GWR Results

The coefficients obtained from the untransformed OLS model of residential facilities all showed a negative relationship except for flood depth and family income. The coefficient of flood depth is 7339.759 and that of family income is 491.837, while flood duration has -0.041, inundated area -0.014 and land price -0.322. The positive coefficients mean that damage is expected to increase with every increase in flood depth and in the income of the affected families, whereas the negative coefficients mean that damage is expected to decrease as flood duration, inundated area and land price increase. The constant of the regression is -149.248, which differs somewhat from those of the other untransformed OLS models. A nonzero damage estimate when all five parameters are zero could signify that missing variables exist. For the transformed datasets, the intercept (Intercept = -4.0×10⁻⁶) showed a negative relationship with the damage amount, as did the land price (BCLP = -8.0×10⁻⁵). The inundated area (BCFAR = 1.0×10⁻³), flood duration (BCDUR = 2.0×10⁻⁶), family income (BCINC = 5.5×10⁻⁵) and flood depth (BCDEP = 9.59×10⁻⁴) are positive, with BCFAR having the highest value. For the transformed data, the inundated area therefore has the greatest bearing on flood damage, followed by flood depth. In short, flood depth has the largest coefficient, and hence the highest influence, in the untransformed datasets, while inundated area has the largest in the transformed datasets.

The coefficients obtained from the untransformed OLS model of commercial facilities showed a positive relationship except for flood duration and land price. Flood depth has 11892.550, flood duration -3.954, inundated area 0.203, family income 134.682 and land price -1.155. Damage is thus expected to increase with every increase in flood depth, flooded area and the income of the affected families, and to decrease as flood duration and land price increase. The constant of the regression is +2692.710, which indicates a positive relationship with the amount of flood damage. For the transformed datasets, the intercept (Intercept = 0.230) also showed a positive relationship with the damage amount. The same goes for the inundated area (BCFAR = 0.025), land price (BCLP = 2.522×10⁻³) and flood duration (BCDUR = 2.7×10⁻⁵), with BCFAR again having the highest value. The rest are negative (BCDEP = -0.150 and BCINC = -0.010). This shows that, for the transformed data, the inundated area has the greatest bearing on flood damage, followed by flood depth.

The coefficients obtained from the untransformed OLS model of agricultural facilities all showed a positive relationship. Flood depth has 3509.237, flood duration 6.721, inundated area 0.936, family income 1222.567 and land price 3.301. Damage is thus expected to rise with every increase in flood depth, flood duration, flooded area, the income of the affected families and the land price of the affected region. The constant of the regression, however, is -4790.713, which indicates a negative relationship with the amount of flood damage. For the transformed datasets, the intercept (Intercept = 0.073) showed a positive relationship with the damage amount. The same goes for the inundated area (BCFAR = 0.456) and land price (BCLP = 2.57×10⁻⁴), with the former having the highest value. All the other parameter variables are negative (BCDEP = -0.025, BCDUR = -7.11×10⁻⁴ and BCINC = -0.020). This shows that, for the transformed data, the inundated area has the greatest bearing on flood damage, followed by family income.

Some coefficients that are positive in OLS become negative in GWR and vice versa, in both the untransformed and the transformed datasets. This indicates that the factors have different effects under global and local conditions. Several evaluation measures, namely the coefficient of determination (R²), the log-likelihood and the AIC, were used to evaluate the OLS and GWR models (see Tables 6 and 7).

Table 6

OLS and GWR Evaluation for Untransformed Datasets

Parameter	OLS (Residential)	GWR (Residential)	OLS (Commercial)	GWR (Commercial)	OLS (Agricultural)	GWR (Agricultural)
R²	0.284	0.614	0.175	0.566	0.270	0.979
Log-likelihood	4928.850	4775.675	8291.250	8049.553	294.519	241.335
AIC	9871.699	9752.683	16596.500	16435.690	603.038	534.615

Table 7

OLS and GWR Evaluation for Transformed Datasets

Parameter	OLS (Residential)	GWR (Residential)	OLS (Commercial)	GWR (Commercial)	OLS (Agricultural)	GWR (Agricultural)
R²	0.138	0.481	0.190	0.320	0.305	0.873
Log-likelihood	2904.051	3030.034	1446.733	1512.389	68.421	93.960
AIC	-5794.102	-5845.397	-2879.465	-2671.310	-122.843	-134.956

The R-squared value, or coefficient of determination, improved from OLS to GWR for all of the above models. For the untransformed datasets, the commercial and residential facilities reached only 0.57 and 0.61, while the agricultural facilities achieved the highest R-squared value of 0.98. However, the coefficient of determination is not the sole criterion for judging whether the GWR model significantly improves on the OLS model. The log-likelihood and the AIC are two further measures needed to evaluate model performance. The log-likelihood values of the GWR models for the untransformed datasets were lower than those of OLS. As for the AIC, a common rule is that the AIC values must differ by at least 3 for a model to be considered an improvement. All of the generated models were found to satisfy this condition.
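The relation between the reported log-likelihood and AIC values can be checked with the standard definition AIC = 2k − 2 ln L. In the sketch below, k = 7 is not stated in the paper; it is inferred here because it reproduces the reported OLS values. The Table 6 value is entered with a negative sign because the reported AIC appears consistent only under that reading. Both points are assumptions.

```python
def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

K_PARAMS = 7  # assumption inferred from the reported values, not stated in the paper

# Table 7, OLS residential: log-likelihood 2904.051, reported AIC -5794.102
aic_res_transformed = aic(2904.051, K_PARAMS)

# Table 6, OLS residential: reported 4928.850, entered here as a negative log-likelihood
aic_res_untransformed = aic(-4928.850, K_PARAMS)

# AIC difference between OLS and GWR for untransformed residential data (Table 6)
delta_res = 9871.699 - 9752.683
```

Under these assumptions both reported OLS residential AIC values are reproduced to within rounding. The OLS-to-GWR difference for the untransformed residential data is about 119, well above the threshold of 3 mentioned above.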

Several tests were also performed to analyze the datasets. For residential facilities, the t-test for the untransformed data showed that flood depth (dep) is the only parameter variable that is statistically significant at the 1% significance level (p-value = 0.00000). For the transformed data, BCDEP (flood depth) and BCFAR (inundated area) were statistically significant at the 1% and 5% levels, respectively.

The Jarque-Bera test also rejected the normality of the residual distribution for the untransformed data (JB = 8812.902, p-value = 0.00000). Moreover, the Breusch-Pagan test (BP = 1152.643, p-value = 0.00000), Koenker-Basset test (KB = 106.607, p-value = 0.582) and the White specification-robust test (WT = 132.148, p-value = 0.00000) confirmed the presence of spatial heteroscedasticity, which means the model is spatially non-stationary. For the transformed data, JB = 53443.481 (p-value = 0.00000), which means the residuals are still not normally distributed. The Breusch-Pagan test (BP = 98.723, p-value = 0.00000), Koenker-Basset test (KB = 3.776, p-value = 0.582) and the White specification-robust test (WT = 7.106, p-value = 0.996) all failed to indicate that the data are spatially non-stationary.

For the datasets of residential facilities, the multicollinearity condition number was found to be 10.950 for the untransformed data and 42.476 for the transformed data. The untransformed value is below the threshold of 30, while the transformed value is above it. Thus, the transformed model has a multicollinearity issue.

For the untransformed datasets of commercial facilities, only 17.49% of the variation in the dependent variable is explained. This model thus accounts for only approximately 17.49% of the flood damage in the 2012 flood event in Gunsan City. For the transformed data, this increased to 19.00%.

The t-test for the untransformed data showed that flood depth (dep) is the only parameter variable that is statistically significant at the 1% significance level (p-value = 0.00000). For the transformed data, BCDEP (flood depth) is statistically significant at the 1% level together with BCINC (family income). In this case, flood depth is indeed the most influential of the five factors.

The Jarque-Bera test rejected the normality of the residual distribution for the untransformed data (JB = 96078.419, p-value = 0.00000). Additionally, the Breusch-Pagan test (BP = 2656.296, p-value = 0.00000), Koenker-Basset test (KB = 94.588, p-value = 0.00000) and the White specification-robust test (WT = 117.630, p-value = 0.00000) confirmed the presence of spatial heteroscedasticity, which means the model is spatially non-stationary. For the transformed data, JB = 8.797 (p-value = 0.012), which means the residuals are still not normally distributed. The Breusch-Pagan test (BP = 6.184, p-value = 0.289), Koenker-Basset test (KB = 5.205, p-value = 0.391) and the White specification-robust test (WT = 17.955, p-value = 0.590) all failed to indicate that the data are spatially non-stationary.

The test for multicollinearity was also performed on the commercial facilities datasets. The multicollinearity condition number was found to be 10.254 for the untransformed data and 38.679 for the transformed data. The untransformed value is below the threshold of 30, while the transformed value is above it. Therefore, the transformed model has a multicollinearity issue.

For the untransformed datasets of agricultural facilities, only 26.97% of the variation in the dependent variable is explained. This model thus accounts for only approximately 26.97% of the flood damage in the 2012 flood event in Gunsan City. For the transformed data, this increased to 30.48%.

All the resulting coefficients of the parameter variables are given in the same units as their associated explanatory variables. The coefficient reflects the expected change in the dependent variable for every 1 unit change in the associated explanatory variable, holding all other variables constant.
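To illustrate how such coefficients are read, the sketch below evaluates the untransformed agricultural OLS function of Table 4 at the mean values of Table 3. The function and variable names are hypothetical. Because an OLS fit with an intercept passes through the variable means, the prediction at the means should be close to the mean damage of 2,718.433 reported in Table 3.

```python
# Untransformed OLS coefficients for agricultural facilities (Table 4)
AGRI_OLS = {
    "intercept": -4790.713,
    "dep": 3509.237,   # flood depth
    "dur": 6.721,      # flood duration
    "far": 0.936,      # inundated area
    "inc": 1222.567,   # family income
    "lp": 3.301,       # land price
}

def predict_damage(coef, dep, dur, far, inc, lp):
    """Evaluate the linear damage function for one observation."""
    return (coef["intercept"] + coef["dep"] * dep + coef["dur"] * dur
            + coef["far"] * far + coef["inc"] * inc + coef["lp"] * lp)

# Mean values of the agricultural variables (Table 3)
means = dict(dep=0.722, dur=207.700, far=959.967, inc=1.933, lp=96.172)
at_means = predict_damage(AGRI_OLS, **means)

# Effect of one extra unit of flood depth, all other variables held constant
deeper = dict(means, dep=means["dep"] + 1.0)
depth_effect = predict_damage(AGRI_OLS, **deeper) - at_means
```

The prediction at the means differs from the tabulated mean damage only by rounding in the published coefficients, and the one-unit depth effect equals the depth coefficient, 3509.237.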

However, the t-test for the untransformed data showed that flood depth (dep) is the only parameter variable that is statistically significant at the 5% significance level (p-value = 0.04). The t-test is used to assess whether an explanatory variable is statistically significant. The null hypothesis is that the coefficient is, for all intents and purposes, equal to zero (and consequently is not helping the model). When the probability is very small, the chance of the coefficient being essentially zero is also small. This again shows that flood depth is the most influential of the five factors.

The Jarque-Bera Test also rejected the normality of the residual distribution for the untransformed data, at the 5% level (JB = 6.705, p-value = 0.04). The Breusch-Pagan Test (BP = 26.905, p-value = 0.00006) and the Koenker-Bassett Test (KB = 16.123, p-value = 0.00065) indicated the presence of spatial heteroscedasticity, although the White specification-robust test (WT = 29.427, p-value = 0.080) did not reject homoscedasticity at the 5% level. This points to the untransformed model being spatially non-stationary. However, the transformed data showed JB = 1.649 (p-value = 0.438), which means the residuals are consistent with a normal distribution. In addition, the Breusch-Pagan Test (BP = 1.552, p-value = 0.907), the Koenker-Bassett Test (KB = 2.795, p-value = 0.732) and the White specification-robust test (WT = 23.924, p-value = 0.246) all failed to detect heteroscedasticity, so there is no evidence that the transformed model is spatially non-stationary.

The test for multicollinearity was also performed. The condition number was 8.283 for the untransformed data and 13.574 for the transformed data. In this final case, both values are less than 30, which indicates that there is no multicollinearity problem.

5. Conclusions

This study did not merely address the difficulties of handling thousands of records. As set out in this master's thesis, the datasets from Gunsan City’s August 12 flood event were explored statistically in order to arrive at flood damage functions that estimate the amount of flood damage in the city. These functions were expressed in terms of flood depth, flood duration, inundated area, family income and land price. Flood depth has long been known to influence flood damage. This paper, however, aimed to identify the ‘other’ possible factors that could contribute to the flood damage estimation model. The candidate models were obtained using Ordinary Least Squares (OLS) Regression and Geographically Weighted Regression (GWR).

Both OLS and GWR were used to generate the functions. GWR, however, proved to be the more suitable fit for all three sets of facilities. The coefficients of determination for residential, commercial and agricultural facilities are 0.0614, 0.566 and 0.979, respectively. The corresponding log-likelihood values were 4775.675, 8049.553 and 241.335, and the AIC values were 9752.683, 16435.690 and 534.615.

The author sought to resolve the normality issue without reducing the reliability of the models. To do so, the Box-Cox Method was applied (refer to the transformed datasets). After the data were transformed toward a normal distribution, all other tests were passed. This suggests that the normality issue stems from extreme values, which skew the distribution, as well as from the sorting of the data and the facility classification.
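For reference, the standard one-parameter Box-Cox transformation can be sketched as follows. The value of the parameter (called `lam` here) and the variables to which the authors applied the transformation are not given in this section, so both are left to the caller; the function name is hypothetical.

```python
import math


def box_cox(y, lam):
    """One-parameter Box-Cox transform of a single positive value.

    (y**lam - 1) / lam for lam != 0, and log(y) for lam == 0.
    The transform is defined only for y > 0.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam
```

In practice the parameter is usually chosen to make the transformed values as close to normally distributed as possible, for example by maximum likelihood. A parameter below 1 compresses large values more than small ones, which is why the transformation reduces the influence of extreme damage values on the residual distribution.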

Regarding the normality issue in the untransformed model, one reason is the presence of extreme values that resulted in a skewed distribution. Because only one flood event was considered, extreme values are inevitable, and no other flood events were available to support these extreme observations. The author therefore recommends including more flood events to improve the models. The extreme values are important, because such values are the very reason for estimating flood damage. Deleting them without a valid reason is therefore unacceptable.

The data used in the study were sorted and classified into three groups: residential, commercial and agricultural facilities. Because the methodology sorts the data into these three groups before analysis, some data were removed (1,278 out of 3,111 facilities), affecting the lower, middle and upper specification of the datasets. In addition, the fact that facilities of a given type, e.g. commercial facilities, exist in more than one zone also affects the normality issue. Therefore, it is recommended that this grouping be changed to a zonal classification, which would improve both the normality and the models.