Present-day instrumentation networks in rivers produce huge quantities of multi-dimensional data. Although numerous machine learning tools can extract trends, find patterns, and predict future states from such data, it is crucial to tune these techniques to the semantic content of the data. Hydrology is a data-intensive science that requires efficient mining of trajectories of measurements taken at different times and positions. The underlying dynamics are highly non-linear over short time windows (minutes) and become chaotic in the long term, although they are governed by a periodic annual cycle. In this project we deal with a multi-dimensional time-series dataset collected in parallel from multiple sampling and control hydrological stations on the Orava River in Slovakia. We investigate how conventional prediction systems can be optimized and adapted to make full use of the spatio-temporal hydrological data we have obtained.
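As a concrete illustration of what such a dataset looks like in practice, the following minimal Python sketch aligns per-station measurements on a common time axis. The file names, column names, and sampling interval are assumptions made for illustration only, not the actual format of the Orava data.

    import pandas as pd

    # Hypothetical file layout: one timestamped CSV per station.
    station_ids = ["station_01", "station_02", "station_03"]  # up to 16 in practice

    frames = []
    for sid in station_ids:
        df = pd.read_csv(f"{sid}.csv", parse_dates=["timestamp"], index_col="timestamp")
        # Keep the water-height column, labelled by its station id.
        frames.append(df["water_height"].rename(sid))

    # Align all stations on a common time axis; missing readings become NaN.
    data = pd.concat(frames, axis=1).sort_index()

    # Resample to a uniform interval and fill short gaps only.
    data = data.resample("15min").mean().interpolate(limit=4)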
The purpose of this project is to predict the water height at several locations along the Orava River. These values could in turn be used to predict floods. The first task is a regression problem, while the second is a classification problem. To construct a prediction system, we will not rely on conventional hydraulics theory but will instead use machine learning tools. The data are collected from 16 hydrological measurement stations below the Orava reservoir. An initial goal is to make predictions based on each station's measurements alone. The predicted quantities could be the expected water-height trend (up/down), the expected runoff, and a long-term water height that could be used for flood prediction. Afterwards, we will integrate all the stations' measurements and base our predictions on this combined dataset; both steps are sketched below. The challenge of this task is to extract as much information as possible from these multiple data sources and to design a data mining procedure that takes the semantic content of this information into account. We could then try to minimize the time the model needs for training.
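As a rough sketch of the single-station step, the up/down trend prediction can be cast as binary classification over lagged water-height readings. The model choice (a random forest), the 12-step window, and the reuse of the data table from the earlier sketch are illustrative assumptions, not the final design.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    def make_lagged(series, n_lags=12):
        # Feature matrix of the last n_lags readings; label is 1 if the level rises next step.
        X, y = [], []
        for t in range(n_lags, len(series) - 1):
            X.append(series[t - n_lags:t])
            y.append(int(series[t + 1] > series[t]))
        return np.array(X), np.array(y)

    # One station's series, taken from the aligned table sketched earlier.
    heights = data["station_01"].dropna().to_numpy()
    X, y = make_lagged(heights, n_lags=12)

    # Time-ordered splits avoid leaking future samples into the training folds.
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(clf, X, y, cv=TimeSeriesSplit(n_splits=5))

The integrated variant mentioned above would then stack the lagged windows of all 16 station columns into a single feature vector, so that spatial relations between stations can inform the prediction.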