Exploring Machine Learning Classification Models for Estimating the Occurrence of Moving Area Displacements

Case Study: Ometepe Island, Nicaragua

Researcher: Victor Arcia

Supervisor: Prof. Dimitri Solomatine

Mentors: Dr Ir. Gerald A. Corzo (IHE Delft) and Dr Heyddy Calderon (IGG Nicaragua)

Overview

This research analyses the use of spatio-temporal Remote Sensing information and Machine Learning (ML) to estimate the occurrence of displacements of Moving Areas. Mass Movements are frequent in Central American countries, mainly due to extreme hydro-meteorological events combined with seismic activity and the characteristics of the geological formations in the region. A common problem in these countries is the lack of databases recording the detailed location of extreme events (e.g. landslides). A particular case is Ometepe Island, located in Lake Cocibolca, Nicaragua, where a high number of landslides occur every year. The island has two volcanoes, and Mass Movements happen often enough to be an important concern for the local population. The triggering factors for these Mass Movements are under continuous research, and there is currently an initiative to evaluate new techniques for forecasting such events.

Since past extreme events have happened during the rainy season, there is a clear physical justification for linking the large active movements that end in landslides with high precipitation. This, combined with the slope cohesion theories applied by engineers in the region, in which the weight of water is one of the main triggers when soils are saturated, makes this area important to explore. This work is divided into two main parts: the first is the identification of movements with remote sensing techniques, and the second is the evaluation of the spatiotemporal correlation of rain patterns with the spatial information of the movements. First, Remote Sensing was used to create the Displacement Occurrence Inventory, applying the InSAR technique to Sentinel-1 SAR images from 2015 to 2020. SNAP software was used to locate occurrences of displacement on the island.

The displacements identified from the Remote Sensing images were used to characterise the intensity and location of displacement. Second, a time window of aggregated spatial precipitation was analysed and transformed into different features to develop ML models. This study tested Logistic Regression (LR), Random Forest (RF) and Support Vector Machine (SVM) to detect the occurrence of displacement in a particular area of the island. The task was framed as a binary classification problem (event or no event), and the results of the models were analysed performance-wise and compared with each other. RF was the algorithm that produced the best results, but an important factor to consider is the coherence threshold, as it can change the estimation of displacements when entering the variables into the ML algorithms. Nevertheless, the models showed a correlation between precipitation and displacement occurrence. For more accurate results, these models should include not only remote sensing data but also corroborated data from the study area. This methodology is one of the first steps towards a larger framework of spatiotemporal analysis for forecasting using Machine Learning.

Problem description

Mass movements are amongst the major hazards to humanity, taking the lives of over 55 000 people between 2004 and 2016 alone (Froude, 2018) and causing economic losses of 4.7 billion euros per year in Europe (Haque, U., Blum, P., da Silva, P.F. et al., 2016). They affect many regions of the world, including Nicaragua.

Nicaragua has been classified as the fourth most climate-vulnerable country in the world over the last 20 years (Kreft, 2015). On top of that, landslides have been the leading cause of death from natural disasters in the last 15 years, accounting for 74.6% of the recorded deaths (Instituto de Geología y Geofísica (IGG), 2017).

Ometepe Island has two volcanoes: Concepcion (active) and Maderas (extinct) and mass movements are frequent, caused by

Remote sensing analysis

Initially, standard processing with the SNAP software, developed by ESA (European Space Agency), was used to create the Displacement Maps from the SAR images. The processing covers the following steps: co-registration, debursting, interferogram creation, filtering, unwrapping and terrain correction.

After these steps, the results were not as expected. The C-band signal of the Sentinel-1 satellite cannot penetrate dense forest cover; for this reason, during the test stage the wrapped and unwrapped interferograms carried a lot of noise due to low coherence between the emitter and receiver signals.

Consequently, it was decided to focus on the coherence product, as it could directly indicate where displacements occurred. With this approach the actual magnitude of the displacement cannot be estimated, but its occurrence can be measured for each pixel of the image. This was the approach taken in this research; however, one question remains: what threshold separates noise from actual displacements?
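As an illustration of how a coherence threshold turns a coherence raster into a per-pixel occurrence map, the following minimal sketch may help. It assumes, purely for illustration, that pixels whose coherence falls below the threshold are flagged as displacement occurrences; the direction of the comparison, the function name occurrence_mask and the 3x3 raster are hypothetical and not taken from the study. The thresholds are the ones tested in this research.

```python
import numpy as np


def occurrence_mask(coherence, threshold):
    """Return 1 where coherence is below the threshold, else 0.

    ASSUMPTION: "below threshold" means "displacement occurrence".
    The comparison direction is illustrative, not the study's rule.
    """
    return (coherence < threshold).astype(int)


# Hypothetical 3x3 coherence raster (values between 0 and 1).
coherence = np.array([
    [0.99, 0.94, 0.98],
    [0.96, 0.90, 0.97],
    [0.99, 0.95, 0.93],
])

# The number of flagged pixels changes with each tested threshold.
for threshold in (0.95, 0.96, 0.97):
    mask = occurrence_mask(coherence, threshold)
    print(threshold, int(mask.sum()))
```

Even on this tiny raster, raising the threshold by 0.01 changes how many pixels are labelled as events, which is why the choice of threshold matters for the models trained afterwards.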

Machine learning methodology

The ML methods used in this study were Logistic Regression (LR), Random Forest (RF) and Support Vector Machine (SVM). RF was configured with 30 trees, and the SVM used the Radial Basis Function (RBF) as its kernel. The dataset was split into training and testing sets in a 70/30 proportion. To ensure the best performance of the models, the statistical characteristics of the displacement maps (DM) were distributed evenly across the training and testing sets. This means that the proportion of events and no events was the same in both sets.
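The setup described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn as the library (the source does not name one) and using synthetic stand-in data; the feature matrix X, the labels y and the random seeds are hypothetical. The 30 trees, the RBF kernel, the gamma of 0.1 and the 70/30 split with equal class proportions follow the description in this document.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in data: 200 samples, 3 precipitation-derived features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # 1 = event

# 70/30 split; stratify=y keeps the event / no-event ratio equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

models = {
    "LR": LogisticRegression(),
    "RF": RandomForestClassifier(n_estimators=30, random_state=0),
    "SVM": SVC(kernel="rbf", gamma=0.1),
}

# Fit each model and record its accuracy on the held-out 30%.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```

Stratifying the split is one straightforward way to obtain the same proportion of events and no events in both sets; the study may have achieved this differently.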


Support Vector Machine (SVM)


Support Vector Machine is a powerful and versatile supervised ML algorithm that can be used in regression and classification problems. SVM uses kernels to transform the input data into the required form. There are various types of kernels, such as the Linear, Polynomial and Radial Basis Function kernels; the latter is the one used in this study (Steinwart, 2008).


K(x, xi) = exp(−gamma ∗ sum((x − xi)^2))


where gamma is a parameter, ranging from 0 to 1, that needs to be specified. The default value for this method, which was also the value chosen for this model, is 0.1.
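To make the role of gamma concrete, the Radial Basis Function kernel can be evaluated by hand as the exponential of minus gamma times the squared Euclidean distance between two samples. The sketch below is only an illustration; the function name rbf_kernel and the example vectors are hypothetical, while gamma = 0.1 is the value reported in this study.

```python
import math


def rbf_kernel(x, x_i, gamma=0.1):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-gamma * squared_distance)


# Identical samples give the maximum similarity of 1.0.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))

# Samples 5 units apart: exp(-0.1 * 25) = exp(-2.5).
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))
```

A larger gamma makes the similarity decay faster with distance, so each training sample influences a smaller neighbourhood.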


Random Forest (RF)


Random Forest (RF) is a supervised learning algorithm that builds a series of decision trees (DT), each with high variance and low bias, and combines their results by averaging, which can lead to better performance than LR. Each DT is trained on a different part of the dataset, varying both in the data itself and in the amount of data used.


The following figure shows how the RF classifier works. It is a series of decision trees that are independent of each other; the depth of each tree varies with the selected sample, but in the end voting is performed to determine the final class.
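The voting step can be illustrated with a few lines of code. This is a minimal sketch of hard majority voting over the class predicted by each tree; the function name majority_vote and the example votes are hypothetical, and library implementations may combine trees differently (for example by averaging predicted probabilities).

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees.

    On a tie, the class that was counted first is returned.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one pixel (1 = event, 0 = no event).
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))
```

With an even number of trees, such as the 30 used here, ties are possible, so a tie-breaking rule has to be chosen.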


In total, one condition was changed and tested during the training and testing stage: the Coherence Threshold, with values of 0.95, 0.96 and 0.97. The models were run for each coherence threshold. As this is a binary classification problem, each model is trained and tested to generate one confusion matrix per ML method. The categories of the confusion matrix are the commonly used ones:

• True Positive (TP): when the outcome of the model (prediction) is positive and the actual result is also positive;

• False Positive (FP): when the outcome of the model (prediction) is positive but the actual result is negative;

• False Negative (FN): when the outcome of the model (prediction) is negative but the actual result is positive;

• True Negative (TN): when the outcome of the model (prediction) is negative and the actual result is also negative.
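The four categories can be counted directly from paired actual and predicted labels. The sketch below is illustrative only; the function name confusion_counts and the example labels are hypothetical and do not come from the study's results.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN and TN for binary labels (1 = event, 0 = no event)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        elif predicted == 0 and actual == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical labels for eight pixels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(counts, accuracy)
```

Summary measures such as accuracy, precision and recall follow from these four counts, which makes the confusion matrix a convenient basis for comparing the three methods at each coherence threshold.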


Experiments



In total, one condition was changed and tested during the training and testing stage: the Coherence Threshold. The coherence thresholds tested were 0.95, 0.96 and 0.97.


• Objective 1: To develop a historic inventory of surface displacements of Ometepe Island, Nicaragua, from Remote Sensing data.

• Conclusion: It was possible to develop the historic inventory of displacement occurrences from 2015 to 2020. However, many factors need to be taken into account, such as the number of images, the coherence values, and the minimum cluster size.


• Objective 2: To evaluate the performance of the models built with Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM).

• Conclusion: Random Forest was the best-performing model overall, followed closely by Logistic Regression. The results of the models are very sensitive to the coherence threshold, so the value chosen should be the one that yields results most similar to corroborated data from the study area.

• Objective 3: To analyse the applicability of the resulting models in other case scenarios.

• Conclusion: Machine Learning can be a good technique for estimating the occurrence of displacements as long as the number of events is roughly equal to the number of no events. For more accurate results, the possibility of including not only remote sensing data but also corroborated data from the study area should be explored.