Machine Learning for Classification of Flash Flood Events
Using Machine Learning to analyse the impact of Spatiotemporal storm features in the classification of flash flood events in the Lower Mekong Basin
This study was done in cooperation with the ADCP
Researcher:
Irivbogbe Hudson E.
Mentors:
Dr. Gerald Corzo Perez
Mr. Miguel Laverde-Barajas
Supervisor:
Prof. Dr. Dimitri Solomatine
Summary
Flash floods are recurring hazards that have substantially impacted the Lower Mekong Basin due to the frequent occurrence of heavy rainfall, which is triggered by storms, especially during the monsoon period. Due to the rapid onset of flash floods, they have become very costly hazards in the Lower Mekong Basin. It is, therefore, imperative to proactively prepare for flash floods by predicting their occurrence. However, such predictions require a strong understanding of the triggering factors that cause flash floods. Since the floods are triggered mainly by storms, understanding the spatiotemporal dynamics of the storm’s physical variables can aid more accurate and improved flash flood predictions and preparedness.
Hence, this study analyses the impact of spatiotemporal rainstorm features and static triggering factors in the classification of flash floods using machine learning models. Flash flood events within the monsoon periods of 2014 to 2018 were gathered. A database of the flash flood events at the sub-basin level, together with their related static variables and the spatiotemporal features extracted from rainstorm datasets, was assessed for model training and testing. The spatiotemporal storm datasets were obtained from the historical database of the SERVIR-Mekong Project. Spearman’s correlation analysis of the flash flood triggering variables shows that the storm's total magnitude is highly correlated with the rainstorm volume and maximum intensity, with values of ρ = +0.97 and ρ = +0.84, respectively. Among the static features, slope has a strong negative correlation with TWI (ρ = -0.9) and a strong positive correlation with SPI (ρ = +0.9). Random Forest and XGBoost classifiers were used to build global models for the study location, while XGBoost was used to build regional models. Randomised search hyperparameter optimization with five-fold cross-validation was used to determine the optimal values for selected parameters. Permutation feature importance and SHAP were used to perform model-agnostic interpretations of the models’ predictions, to better understand each feature's impact on the models' capability to classify flash floods in the Lower Mekong Basin.
The global model classification results show that XGBoost performs better than Random Forest in classifying the flash floods. XGBoost classified the flash floods with an accuracy of 99% and an F1-score of 0.88, while Random Forest classified them with an accuracy of 98.1% and an F1-score of 0.72. The feature importance analysis shows that the spatiotemporal features generally have a higher impact on the classification models’ performance than the static features. Both models rank the rainstorm volume as the most important feature. The regional models, on the other hand, produce slightly different results from region to region. The results show that the combination of input variables for each region affects the performance of the model for that region. However, the spatiotemporal features extracted from the rainstorm are ranked as the most important features in all the regional models. Three out of the four regional models rank the rainstorm volume as the most important feature.
This study demonstrates the efficacy of using machine learning in classifying flash flood events across a large domain, using various input variables. The findings show the effectiveness of the methodology in analysing the static and spatiotemporal storm features.
Problem Description
Earth observation has contributed hugely to the monitoring of extreme events. However, it remains challenging to use earth observation systems to predict extreme events, especially those of short duration, such as flash floods. This is because the spatiotemporal dynamics of storm events are not well understood. Available data-driven modelling approaches mainly use the storm intensity as the driving variable, leaving out other physical variables of the storm events such as storm extension, duration, volume, etc. Hence, incorporating spatiotemporal storm variables into data-driven models can aid more accurate flash flood predictions and serve as an alternative approach to preparing for flash floods. The research question is as follows:
How important are spatiotemporal storm features in the classification of flash flood events in the Lower Mekong Basin, using machine learning?
To answer the main question above, the following sub-questions need to be answered.
• What kind of association exists among the variables that trigger the flash floods in the Lower Mekong Basin?
• Given the data, which of the machine learning algorithms applied in this study is more suited to classify the flash flood events in the Lower Mekong Basin?
• Which of the spatiotemporal rainstorm variables mainly impact the machine learning models' performance, and are they more important than the static variables in classifying the flash flood events in the Lower Mekong Basin?
Objectives
This research's main objective is to analyse the impact of the static and spatiotemporal rainstorm variables in the machine learning classification of flash flood events in the Lower Mekong Basin. The specific objectives are as follows:
• To analyse the correlation between the flash flood triggering variables extracted from remotely sensed data covering the Lower Mekong Basin.
• To develop machine learning models and compare their performance in the classification of flash flood events at the sub-basin level of the study area.
• To interpret the models by analysing the feature importance using model-agnostic approaches.
Methodology
Data Collection
Collection of Historical Flash Flood Events and Their Locations
The research commenced with the compilation of a robust inventory of past flash flood events. It was also important to base this inventory on records from reliable sources. Flash flood events covering the monsoon periods of 2014 to 2018 were acquired for this research. The sources from which they were obtained, as well as the flash flood inventory compilation process, are described as follows:
Sources of Historical Flash Flood Events
In this research, the main sources that served as the basis for the compilation of past flash flood events are as follows:
Annual Mekong Flood Report (AMFR) from 2014 to 2018.
Emergency Events Database (EM-DAT | the international disasters database).
ASEAN Coordinating Centre for Humanitarian Assistance on disaster management (AHA).
GLobal IDEntifier (GLIDE) Number database.
Floodlist.
Extreme Gradient Boosting (XGBoost)
The Extreme Gradient Boosting (XGBoost) algorithm is a well-known implementation of the gradient boosting machine. It was proposed by Chen and Guestrin (2016) and is centred on boosting. Boosting here means combining the predictions of a series of weak learners into a strong learner through additive training. XGBoost aims to prevent overfitting while optimizing the use of computational resources. To achieve this, it simplifies the objective function so that the predictive and regularization terms can be combined while maintaining an optimal computation speed. In addition, calculations are automatically executed in parallel while the model is being trained (Fan et al. 2018).
XGBoost is used in solving supervised learning problems, and unlike random forest, where the simultaneous training of forest trees is executed, XGBoost can create ensembles by implementing an iteration-based training of the decision trees on training samples to which weights are attached. These weights are updated at each iteration to reveal the ensemble's residual error at that iteration step (La Cava et al. 2019).
To further expatiate on the working process of XGBoost, Fan et al. (2018) gave the general principle and process involved in optimizing with XGBoost (additive learning). This is given as follows;
1. The starting point involves fitting the first learner to all of the input data.
2. Create the next model and fit it on the residuals to address the shortcomings of the previous weak learner.
3. Repeat the fitting process described in step two until a stopping criterion is reached.
4. Obtain the final model prediction by computing the sum of the predictions given by each learner.
The general function denoting the prediction at step t is given by Equation 2.4-1 as follows:
Equation 2.4-1
$$f_i^{(t)} = \sum_{j=1}^{t} f_j(a_i) = f_i^{(t-1)} + f_t(a_i)$$
where $f_t(a_i)$ is the learner at step $t$; $f_i^{(t)}$ is the prediction at step $t$; $f_i^{(t-1)}$ is the prediction at step $t-1$; and $a_i$ is the input variable.
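The additive training steps and Equation 2.4-1 can be illustrated with a minimal sketch. It is not the XGBoost implementation: it assumes a squared-error loss, uses one-split regression stumps as weak learners, and adds a shrinkage factor (learning_rate) that does not appear in Equation 2.4-1. The helper names (fit_stump, boost, mse) and the toy data are hypothetical.

```python
# Sketch of additive training: each new learner is fitted to the residuals
# of the current ensemble, and predictions are the running sum of learners.

def fit_stump(xs, residuals):
    """Fit a one-split regression stump (weak learner) to the residuals."""
    best = None
    # Try every split point except the largest value, so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return a predict function and the fitted training predictions."""
    learners = []
    preds = [0.0] * len(xs)  # prediction before any learner is added
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # next learner, fitted on residuals
        learners.append(stump)
        # f^(t) = f^(t-1) + (shrunken) f_t, as in Equation 2.4-1
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        # Final prediction: sum of the contributions of all learners.
        return sum(learning_rate * f(x) for f in learners)

    return predict, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (made up for illustration only).
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
predict, fitted = boost(xs, ys, n_rounds=20)
```

Comparing mse(ys, fitted) after one round and after twenty rounds shows the training error shrinking as learners are added, which is the behaviour that step three relies on until the stopping criterion is met.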
To prevent overfitting while maintaining the computational speed of the model, XGBoost evaluates the model’s goodness using a regularised form of the original objective function, given by Equations 2.4-2 and 2.4-3 as follows:
Equation 2.4-2
$$L^{(t)} = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{k=1}^{t} \Omega(f_k)$$
Equation 2.4-3
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert \omega \rVert^{2}$$
$L^{(t)}$ is the regularised objective to be minimised; $l$ is the loss function that measures the difference between the prediction ($\hat{y}_i$) and the target ($y_i$); $\Omega$ is the additional term that aids the prevention of overfitting. $T$ denotes the number of leaves, $\omega$ is the vector of scores in the leaves, $\lambda$ is the regularization parameter, and $\gamma$ is the minimum loss required for the additional partition of a node. A detailed description of XGBoost and its computational processes can be found in Chen and Guestrin (2016).
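As a small worked illustration of Equations 2.4-2 and 2.4-3, the sketch below evaluates the regularised objective for a toy case. It assumes a squared-error loss for $l$ and represents each tree only by its list of leaf scores; the function names and the numbers are hypothetical and are not taken from the study.

```python
def tree_penalty(leaf_scores, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for a single tree."""
    n_leaves = len(leaf_scores)                        # T
    squared_norm = sum(w * w for w in leaf_scores)     # ||w||^2
    return gamma * n_leaves + 0.5 * lam * squared_norm


def regularised_objective(y_true, y_pred, trees, gamma, lam):
    """L = sum of per-sample losses + sum of per-tree penalties."""
    # Squared error is used here only as an example loss function l.
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    penalty = sum(tree_penalty(t, gamma, lam) for t in trees)
    return loss + penalty


# One tree with two leaves scoring +0.5 and -0.5 (illustrative values).
penalty = tree_penalty([0.5, -0.5], gamma=1.0, lam=2.0)
# penalty = 1.0 * 2 + 0.5 * 2.0 * (0.25 + 0.25) = 2.5
objective = regularised_objective([1, 0], [0.5, 0.5], [[0.5, -0.5]], 1.0, 2.0)
# objective = (0.25 + 0.25) + 2.5 = 3.0
```

Larger values of gamma penalise trees with many leaves, while larger values of lambda penalise large leaf scores; both push the optimiser towards simpler trees.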
Spatiotemporal Characteristics of Rainstorms in Machine learning - Present Study
The studies discussed in section 2.3 show the efficacy of using machine learning models to predict flash floods. Most of the studies used similar flash flood triggering factors and explored different machine learning models to compare their predictive accuracy. However, almost none of them fully explored the possibility of incorporating spatiotemporal characteristics (features) such as those extracted from rainstorm datasets.
Although Alipour et al. (2020) attempted to use a few selected spatiotemporal features, their research focused on implementing a flash flood damage prediction model, which is not the case in this study. The inclusion of a robust number of watershed properties as predictors in their model was not considered. In addition, their study did not compare the machine learning model explored (Random Forest) with any other machine learning model, which is a common step in most studies on flash flood prediction using machine learning. This present research seeks to address these issues and, more importantly, to include a broader range of spatiotemporal storm features in classifying flash flood events using machine learning algorithms.
Hence, this present study aims to complement the wider body of completed and ongoing research. Two machine learning algorithms, Random Forest and Extreme Gradient Boosting (XGBoost), are implemented to classify the events, and their performance is compared. A paramount aspect of the research is the analysis of the sensitivity of a wide range of flash flood triggering features, both static and dynamic (spatiotemporal characteristics extracted from the rainstorm dataset), and the interpretation of the machine learning models using a model-agnostic approach. All input variables used in this study were extracted from remotely sensed data.
Results
Flash floods impacted each country differently during these years, and this is more evident in the distribution of the monthly events, as shown in figure 5.1-2 and figure 5.1-3. These figures show that many of these events occurred in July, one of the key monsoon months in which rainfall events intensify, as indicated by the inventory sources (Mekong River Commission 2018). Such intense rainfall usually continues into August and, as seen from the inventory compilation, a significant number of the events captured in this study for 2014, 2015, 2016 and 2018 occurred in August. It is also important to note that fewer events were gathered for the months of September and October, with October having the lowest number of events.
Year     Cambodia   Lao PDR   Thailand
2014     16         9         38
2015     11         21        2
2016     4          7         16
2017     13         15        16
2018     14         16        11
Total    58         68        83
Comparison of Extreme events
The flash flood inventory compilation also showed that several provinces in each of the countries had experienced recurrent flash flood events, some of them more than once in a single year, especially during the monsoon season. As shown in the inventory list, some Cambodian provinces, such as Kampong Cham, Kampot, Kampong Thom, Phnom Penh and Preah Vihear, were impacted by flash flood events. Provinces such as Kampot and Kampong Thom experienced recurring flash floods at different times within the period considered. Lao PDR, located in the western part of the study location, has also faced severe flash flood events, with notable hits in provinces such as Luang Prabang, Champasak, Bolikhamxai and Houaphan. In Thailand, a large number of provinces have experienced flash flood events. From the compilation, notable provinces that experience recurrent flash floods are Chiang Mai, Phayao, Sakon Nakhon and Ubon Ratchathani, most of which are located in the northern and north-eastern parts of the country, where devastating flash floods have been recorded over time (Mekong River Commission 2014).
Conclusion
The study applied machine learning techniques to analyse the impact of a large range of spatiotemporal rainstorm features and static causative features in the classification of flash floods in the Lower Mekong Basin. The input features were extracted from remotely sensed data and proved highly reliable as inputs to the machine learning classification models. The impact of these features on the flash flood classification was determined and analysed using model-agnostic methods.
The main research question read as follows:
How important are spatiotemporal storm features in the classification of flash floods in the Lower Mekong Basin, using machine learning?
To answer the main question above, some sub-questions were constructed and answered. The answers to these sub-questions are given as follows;
What kind of association exists among the variables that trigger the flash floods in the Lower Mekong Basin?
The correlation analysis shows a strong association among some of the triggering variables extracted from the remotely sensed data covering the LMB. However, these associations are strongly clustered: most spatiotemporal storm variables are highly correlated with one another, and the same holds for most static features. The storm characteristics Volume, Total magnitude and Maximum intensity are highly correlated with each other, with the correlation between the total magnitude and the volume being the highest. Therefore, although feature selection before model development is not the focus of this study, either the total magnitude or the volume could be selected as a model input without the need to include both. Similarly, among the static features, slope is highly correlated with TWI and SPI. Furthermore, a correlation threshold may serve as the basis for selecting the features needed for the model.
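The idea of using a correlation threshold for feature selection can be sketched as follows. This is an illustrative assumption, not the procedure used in the study: it computes Spearman's ρ as the Pearson correlation of average ranks and greedily keeps a feature only if its |ρ| with every already kept feature is below a threshold. The function names, the threshold of 0.9 and the toy values are all hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(a), average_ranks(b))


def drop_correlated(features, threshold=0.9):
    """Greedy filter: keep a feature only if |rho| < threshold with all kept ones."""
    kept = []
    for name in features:
        if all(abs(spearman(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy values (made up): total_magnitude increases monotonically with volume.
features = {
    "volume": [10, 25, 40, 80, 120],
    "total_magnitude": [12, 30, 38, 90, 150],
    "slope": [5, 1, 9, 3, 7],
}
selected = drop_correlated(features, threshold=0.9)  # ["volume", "slope"]
```

With this greedy rule the result depends on the order in which features are visited, so in practice the order would need to reflect which of two correlated features is preferred.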
Given the data, which of the machine learning algorithms applied in this study is more suited to classify the flash flood events in the Lower Mekong Basin?
The machine learning task in this study was the classification of flash floods in the LMB, for which several machine learning algorithms (classifiers) could be used. In this study, two algorithms were implemented: Random Forest and XGBoost. Both were used in the global predictive classification of flash floods. The model classification results and evaluation show that XGBoost performed better than Random Forest in predicting the flash floods, despite the imbalanced dataset used in this study. Both algorithms are ensembles and are known to perform comparatively well even when the available dataset is imbalanced. This could partially explain why they both performed well. However, since XGBoost performed better than Random Forest, only XGBoost was used to develop the regional models, which also provided accurate predictions.
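Because the dataset is imbalanced, accuracy alone can be misleading, which is why the F1-score is reported alongside it. The sketch below illustrates this with hypothetical counts that are not taken from the study: 1000 samples, of which 20 are flash flood events.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = flash flood."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0  # no positive correctly found: precision/recall give F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels: 20 events followed by 980 non-events.
y_true = [1] * 20 + [0] * 980

# A classifier that never predicts a flood still reaches 98% accuracy.
always_negative = [0] * 1000

# A classifier that finds 15 of the 20 events and raises 5 false alarms.
partial = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 975
```

Here accuracy(y_true, always_negative) is 0.98 while its F1-score is 0.0, and the second classifier reaches an accuracy of 0.99 with an F1-score of 0.75. Two models with nearly identical accuracy can therefore differ substantially in F1-score.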
Which of the spatiotemporal storm variables (factors) mainly impact the machine learning models' performance, and are they more important than the static variables in classifying the flash floods?
From the model-agnostic interpretations performed on the global models, the rainstorm volume has the largest impact on the models' performance. The total magnitude and the maximum intensity also contribute to the performance of the models. The rainstorm volume is assigned the highest permutation importance score. Should the Volume feature be removed from the list of features, the F1-score for Random Forest and XGBoost would drop by 0.528 and 0.527, respectively. SHAP feature importance gives similar results, indicating the rainstorm volume as the feature that contributes most to the model predictions.
As discussed in section 5.4.2, the permutation feature importance analysis of the four regional models shows that three of them rank the volume as the most important feature, making “volume” the strongest predictor. These results confirm that the models mainly depend on the rainstorm volume to classify flash floods accurately.
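The mechanics of permutation feature importance can be sketched as below. In its usual form the method shuffles the values of one feature across samples, rather than dropping the feature, and records the resulting fall in the score. The sketch is a simplified illustration under stated assumptions: the "model" is a hypothetical fixed rule that flags a flood when the volume reaches 75, the data are made up, and the function names are not from any library.

```python
import random


def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def permutation_importance(predict, rows, y_true, feature, n_repeats=10, seed=0):
    """Mean drop in F1 after shuffling one feature column across the rows."""
    rng = random.Random(seed)
    baseline = f1_score(y_true, [predict(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row with only the chosen feature replaced.
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        score = f1_score(y_true, [predict(r) for r in permuted])
        drops.append(baseline - score)
    return sum(drops) / n_repeats


# Toy data (illustrative only): labels depend on volume, not on slope.
rows = [{"volume": 5 * i, "slope": (7 * i) % 11} for i in range(20)]
y_true = [1 if r["volume"] >= 75 else 0 for r in rows]


def predict(row):
    # Hypothetical stand-in for a trained classifier.
    return 1 if row["volume"] >= 75 else 0


importance = {
    name: permutation_importance(predict, rows, y_true, name)
    for name in ("volume", "slope")
}
```

Shuffling slope leaves the score unchanged because the stand-in model never reads it, whereas shuffling volume breaks the link between the feature and the label and lowers the F1-score. A large drop therefore marks a feature the model depends on.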
As shown in the permutation importance results obtained, most of the spatiotemporal storm variables have a higher impact on the model predictions when compared to the static variables. Five spatiotemporal features against twelve static features were used in the model. For every model built, three out of the five spatiotemporal features rank highest in feature importance. The variables are Volume, total magnitude and Maximum intensity of the rainstorm. For all models built, the static variables were not ranked higher than these features. But it is also important to state that the model performance (global and regional) would not have been achieved without the combined effect of the static and spatiotemporal rainstorm features, even though the models, without a doubt, mainly depend on the spatiotemporal features for highly accurate and precise predictions.
In addition to the above, some of these static variables could act as dynamic variables. This is because changes in land use and configuration of the topography, especially in developing areas of the basin, will lead to changes in how flash floods are triggered in those locations. This further confirms that though the features from rainstorms are highly important, static features cannot be ignored even though they undergo long term changes.
Hence, spatiotemporal storm variables are generally more important than static features. However, combining both static and spatiotemporal storm features leads to a more precise classification of flash floods using machine learning models.