Water resources decision making analysis using Natural Language Processing

An Exploratory Analysis of Digital Information using Natural Language Processing for the Planning and Decision Making Process of Water Resources in Bolivia

Researcher: Camilo Gonzales

Mentors: Gerald Corzo & Hector Angarita

Supervisor: Dimitri Solomatine

Abstract


In recent years, communities have become far more involved in the planning and decision-making processes of Integrated Water Resources Management. However, differences between competing stakeholders hinder the identification of the variables that matter most for decision-making. In addition, the COVID-19 situation has prevented the face-to-face activities with the community in which fundamental information for the planning process is collected. Against this backdrop, and with the aim of complementing the characterization of a water system and providing an alternative that supports the planning and decision-making process, this research analyzes digital information from public media sources, extracting useful information from articles associated with a basin. The case study is the La Paz - Choqueyapu river basin in Bolivia. Articles related to water resources were extracted from 6 representative newspapers of that country. An exploratory analysis of this information is carried out and linked to historical records of hydrological phenomena such as precipitation over the last decade, finding a good correlation between the two sources of information. Through the application of Named Entity Recognition, it was possible to identify entities associated with water bodies, dams, authorities, and communities present in the basin.

Each article is labeled with a positive or negative sentiment according to its content in order to carry out a qualitative analysis of the basin. From the extracted articles and their associated sentiment labels, sentiment text classification models are built in the context of water resources using different word embedding techniques and machine learning classification algorithms. The best-performing model was the SVM algorithm with a linear kernel and Word2vec continuous bag-of-words embeddings, which obtained 84% accuracy. This result was compared with the 63% obtained with the Spanish Sentiment Analysis library, showing a substantial improvement in the classification of Spanish-language texts associated with water resources. Finally, by finding the most frequent words in positive or negative contexts, variables that are important for improving the planning and decision-making process can be identified.
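
As an illustration of the final step, the sketch below counts the most frequent words per sentiment label. It is a minimal example under stated assumptions, not the procedure used in the thesis: the four example sentences, the short stop-word list, and the function names `tokenize` and `top_words_by_sentiment` are all hypothetical and were written only for this illustration.

```python
import re
from collections import Counter

# Hypothetical, hand-written example articles (NOT data from the study).
ARTICLES = [
    ("La represa garantiza agua potable para la ciudad", "positive"),
    ("La nueva planta mejora el agua potable del barrio", "positive"),
    ("La contaminación del río afecta a la comunidad", "negative"),
    ("Vecinos denuncian contaminación y escasez de agua", "negative"),
]

# Small illustrative Spanish stop-word list (an assumption);
# a real analysis would need a much fuller list.
STOPWORDS = {"la", "el", "del", "de", "para", "a", "y", "los", "las", "en"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def top_words_by_sentiment(articles, n=3):
    """Return the n most frequent tokens for each sentiment label."""
    counts = {}
    for text, label in articles:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return {label: counter.most_common(n) for label, counter in counts.items()}


if __name__ == "__main__":
    for label, words in top_words_by_sentiment(ARTICLES).items():
        print(label, words)
```

Words that dominate one label but not the other are candidates for the variables of interest. Words that are frequent under both labels mostly reflect the shared topic and carry little information about sentiment.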


ThesisCamiloGonzalezAyala