The Complex Event Machine Learning framework

The Complex-Event Machine Learning (CEML) [CEML introduction] is a framework that combines Complex-Event Processing (CEP) and Machine Learning (ML) and applies them to the IoT. The framework was developed to be deployable everywhere, from the edge of the network to the cloud, and it manages itself and works autonomously. To do so, the framework must automate the learning process and the deployment management. This process can be broken down into five phases: (1) the data is collected; (2) the data is pre-processed for attribute extraction; (3) the learning takes place; (4) the learning is evaluated; (5) when the evaluation shows that the model is ready, the deployment takes place. All these phases happen continuously and repetitively while the environment constantly changes, so the model and the deployment must adapt as well. The following sections briefly describe the different aspects that CEML covers.
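As a rough illustration of the whole loop, consider the following minimal Python sketch. Everything in it (the toy model, the function names, the threshold) is an assumption made for illustration, not CEML's actual API:

```python
import random

class ToyModel:
    """Illustrative incremental learner (a running mean), not CEML's API."""
    def __init__(self):
        self.n, self.mean = 0, 0.0
    def predict(self):
        return self.mean
    def learn(self, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n

def collect():
    """(1) Data collection; in CEML this would arrive over the network."""
    x = random.random()
    return x, 2.0 * x  # raw reading and its ground truth

model, deployed = ToyModel(), False
for _ in range(100):
    x, y = collect()
    x = round(x, 2)                # (2) stand-in for pre-processing
    prediction = model.predict()   # (3) predict before learning, so that
    error = abs(prediction - y)    # (4) the evaluation is unbiased
    model.learn(y)                 #     then update the model
    deployed = error < 0.5         # (5) deploy only while the model is ready
```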

Data Propagation Phase

Data in the IoT is produced by many devices, in many places, and over diverse protocols and formats. Although this deliverable does not address the problem of data heterogeneity in detail, the learning agents require a mechanism to acquire the data and manage its heterogeneity. The mechanism must be scalable and, at the same time, the protocol should handle the asynchronous nature of the IoT. Finally, the protocol must provide tools that match the publish/subscribe characteristics of CEP engines. Therefore, we have chosen MQTT, a well-established client-server publish/subscribe messaging transport protocol. Its topic-based messaging provides a mechanism to manage data heterogeneity by relating topics to payloads. MQTT allows deployments on several architectures, operating systems, and hardware platforms, which are basic constraints at the edge of the network. The protocol is payload agnostic, and as such allows maximum flexibility to support several types of payloads.
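For illustration, a learning agent's data acquisition could look like the following sketch, using the Eclipse Paho Python client (paho-mqtt 1.x API). The broker address, topic hierarchy, and JSON payloads are assumptions, not defaults of the framework:

```python
import json
import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    # Topic hierarchies let the agent relate topics to payload types.
    client.subscribe("sensors/+/temperature")

def on_message(client, userdata, msg):
    # MQTT is payload agnostic; here we assume JSON payloads.
    sample = json.loads(msg.payload)
    print(msg.topic, sample)

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)  # assumed local broker
client.loop_forever()
```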

Data Pre-Processing (Munging) Phase

Usually ML is tied to stored datasets, which incurs several drawbacks. Firstly, the learning can take place only with persisted data. Secondly, the generated models are typically based on historical data, not current data. In the IoT, both constraints have dire consequences. It is neither feasible nor profitable to store all data, and embedded devices do not have enough storage capacity to use such ML algorithms on them. Furthermore, IoT deployments are commonly exposed to ever-changing environments.
Using historical data for off-line learning can produce drifted models that have learnt old patterns rather than current ones. Although some IoT platforms support storage of historical data, it may be too time- and storage-consuming to create large enough time series. Therefore, there is also a need for non-persistent data manipulation tools. This is precisely what the CEP engine provides in the CEML framework: the CEP engine decides which data is manipulated, and how, using predefined CEP statements deployed in the engine. Each statement can be seen as a topic to which each learning model is subscribed, and every update delivered to the subscribers provides a sample to be learnt in the learning phase (see the sketch below).
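The following minimal Python sketch mimics this behaviour: a "statement" maintains a sliding window (nothing is persisted beyond it), extracts an attribute, and pushes a sample to every subscribed model on each update. The classes and names are illustrative, not CEML's API:

```python
from collections import deque

class Statement:
    """Stands in for a CEP statement: a window plus its subscribers."""
    def __init__(self, size):
        self.window = deque(maxlen=size)   # data never persisted beyond this
        self.subscribers = []
    def subscribe(self, model):
        self.subscribers.append(model)
    def update(self, value):
        self.window.append(value)
        sample = sum(self.window) / len(self.window)  # attribute extraction
        for model in self.subscribers:
            model.on_sample(sample)        # each update yields a sample

class PrintModel:
    """Placeholder for a learning model subscribed to the statement."""
    def on_sample(self, sample):
        print("sample:", sample)

stmt = Statement(size=5)
stmt.subscribe(PrintModel())
for v in [20.1, 20.4, 21.0, 20.8, 21.3]:
    stmt.update(v)
```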

Learning Phase

There is no pre-selection of algorithms in the framework; they are selected according to the restrictions imposed by the problem domain. For example, on extremely constrained devices, algorithms such as Algorithm Output Granularity (AOG) may be the right choice. In other cases where the model changes quickly, one-shot algorithms may be the best fit. Artificial Neural Networks are good for complex problems, but only with stable phenomena. This means that the algorithm selection should be made case by case. The framework provides mechanisms for the management and deployment of the learning models and for the process of feeding the models with samples. In general, the process is based on incremental learning, albeit with online and non-persistent data. The process can be summarized as follows: a sample provided by the previous phase, without its target/label/ground truth, is used to generate a prediction. The prediction is then sent to the next phase. Thereafter, the sample with its target/label/ground truth is applied to update the model. Thus, every update is used for the learning process, as sketched below.
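A minimal sketch of this predict-then-update cycle, assuming a toy online linear model (the learner, the learning rate, and the stream are illustrative assumptions):

```python
class OnlineLinearModel:
    """Toy incremental learner updated by per-sample gradient descent."""
    def __init__(self, lr=0.01):
        self.w, self.b, self.lr = 0.0, 0.0, lr
    def predict(self, x):
        return self.w * x + self.b
    def update(self, x, y):
        err = self.predict(x) - y      # the label is used only here
        self.w -= self.lr * err * x
        self.b -= self.lr * err

model = OnlineLinearModel()
stream = [(x, 3.0 * x + 1.0) for x in range(10)]
for x, y in stream:
    y_hat = model.predict(x)  # 1) predict without the ground truth
    # 2) the prediction would be forwarded to the validation phase here
    model.update(x, y)        # 3) then learn from the labelled sample
```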

Continuous Validation Phase

This section describes how the validation of the learning models is done inside CEML. This phase neither influences the learning process nor validates the CEML framework itself.
ML model validation is a challenging topic in real-time environments, and evaluation in distributed environments or on embedded devices is not addressed extensively in the literature. There are two possible strategies: either we hold out an evaluation dataset by taking a control subset for a given time frame (time window), or we use Predictive Sequential (also known as prequential) validation, in which we assess each sequential prediction against the observation. The following subsections describe the continuous validation.

Window Evaluation (Classification Validation)

Instead of holding out samples for validation, we assess the predictions made before the learning takes place. All predictions are assessed each time an update arrives. Each assessment produces an entry for a confusion matrix, which is added to an accumulated confusion matrix. This matrix contains the accumulation of all previously assessed predictions; in other words, it does not describe the current validation state of the model, but rather its trajectory. From the accumulated matrix, the validation metrics (e.g., Accuracy, Precision, Sensitivity) are calculated, as sketched below. This methodology has advantages as well as some drawbacks, explained more extensively in Mixing and moving complex event processing and machine learning to the edge of the network for IoT applications.
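A minimal sketch of such an accumulated confusion matrix for a binary classifier (the API is illustrative, not CEML's):

```python
class AccumulatedConfusionMatrix:
    """Accumulates every assessed prediction; reflects the model's
    trajectory, not only its current state."""
    def __init__(self):
        self.tp = self.fp = self.tn = self.fn = 0
    def assess(self, predicted, actual):
        if predicted and actual:       self.tp += 1
        elif predicted and not actual: self.fp += 1
        elif not predicted and actual: self.fn += 1
        else:                          self.tn += 1
    def accuracy(self):
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0
    def precision(self):
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0
    def sensitivity(self):
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

cm = AccumulatedConfusionMatrix()
for pred, truth in [(1, 1), (0, 1), (1, 0), (1, 1)]:
    cm.assess(pred, truth)
print(cm.accuracy(), cm.precision(), cm.sensitivity())
```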

Accumulated Evaluation (Regression Validation)

This method is based on the prequential approach described in Knowledge Discovery from Data Streams: the validation metric is calculated for each update and accumulated with the previous metric results. The previously accumulated metrics are weighted ("faded") so that their importance is reduced over time. This is because, as the model gets updated, old errors become less significant. A sketch of this accumulation follows.
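One common fading-factor formulation of the prequential error keeps a faded sum of losses S and a faded count B, with S_i = loss_i + alpha * S_{i-1}, B_i = 1 + alpha * B_{i-1}, and metric E_i = S_i / B_i. The squared-error loss and the value of alpha below are assumptions made for illustration:

```python
class FadingPrequentialError:
    """Prequential error with a fading factor that de-emphasizes old errors."""
    def __init__(self, alpha=0.99):
        self.alpha = alpha   # weight that "fades" older errors
        self.s = 0.0         # faded sum of losses
        self.b = 0.0         # faded count of updates
    def update(self, predicted, actual):
        loss = (predicted - actual) ** 2
        self.s = loss + self.alpha * self.s
        self.b = 1.0 + self.alpha * self.b
        return self.s / self.b   # current accumulated metric

metric = FadingPrequentialError(alpha=0.95)
for y_hat, y in [(1.0, 1.2), (2.0, 1.9), (3.0, 3.5)]:
    print(metric.update(y_hat, y))
```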

Deployment Phase

The continuous validation opens the possibility of assessing the status of the model each time a new update arrives, e.g., whether it is accurate or not. With this, the CEML framework can decide at any time whether the model should be deployed into the system. If the model is behaving well, it is deployed; otherwise, it is removed from the deployment. The decision is made by user-provided thresholds on the evaluation metrics: if the thresholds are reached, CEML inserts the model into the CEP engine and starts processing the streams using the model; if the model falls below them, it is removed from the CEP engine. A sketch of this decision follows.
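A minimal sketch of that decision logic; the thresholds, metric names, and engine interface are illustrative assumptions, not CEML's actual API:

```python
class FakeCepEngine:
    """Stand-in for the CEP engine's deployment interface."""
    def __init__(self):
        self.deployed = set()
    def is_deployed(self, model):
        return model in self.deployed
    def deploy(self, model):
        self.deployed.add(model)      # model starts processing the streams
    def undeploy(self, model):
        self.deployed.discard(model)  # model stops processing the streams

def manage_deployment(engine, model, metrics, thresholds):
    """Deploy the model only while every user-provided threshold is met."""
    ready = all(metrics[name] >= minimum
                for name, minimum in thresholds.items())
    if ready and not engine.is_deployed(model):
        engine.deploy(model)
    elif not ready and engine.is_deployed(model):
        engine.undeploy(model)

engine = FakeCepEngine()
manage_deployment(engine, "model-1", {"accuracy": 0.93}, {"accuracy": 0.90})
print(engine.is_deployed("model-1"))  # True: threshold reached
manage_deployment(engine, "model-1", {"accuracy": 0.70}, {"accuracy": 0.90})
print(engine.is_deployed("model-1"))  # False: model removed
```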