AI and Machine Learning-Based Anomaly Detection

Alex Moskvin
Sep 13, 2021

In today’s digital environments, anomalies in data patterns occur almost every day. Affecting both data in transit and data at rest, these anomalies can cause large interruptions if they are not intercepted. Some real-world data anomalies – and their impact – include the following:

In wind-farm energy production, a lack of predictive maintenance may lead to severe hardware failure.
Application and web delays can cost businesses significant income losses. Large web-based retailers lose billions of dollars in annual sales due to these delays.
Banking fraud occurs constantly. Half of companies have experienced banking fraud in the past three years; some of the lost revenue was never returned.

In 2020, over 80% of companies worldwide faced some downtime. A typical outage lasts around four hours and costs around $2 million.

To uncover data anomalies, most organizations are still using manual or outdated statistical detection techniques. That’s why a lot of anomalies go undetected and cause damage. These outdated approaches will no longer be relevant within a few years for the following reasons:

Cyber-threats are becoming more sophisticated. It is harder to detect fraudulent behavior among large volumes of data. Hackers are introducing new methods to misuse existing vulnerabilities and the lack of efficient fraud detection tools. In recent years, there have been large data breaches with huge losses, signaling the need for improved prevention and detection of potential fraud.
The business environment is also evolving. Database inconsistencies can cost a business several million dollars in lost revenue. To prevent this, companies should build real-time threat analysis systems and make use of advanced business intelligence (BI) to improve their decision making, minimize downtime, and scale up their effectiveness.
We are living in the era of big data. In the past two years, worldwide users produced 90% of global data. Without BI tools and automations, engineers cannot effectively analyze huge volumes of data. To illustrate this issue, consider the fact that it is very difficult to detect a network anomaly in a distributed system, as it’s almost impossible to manually recognize true signals from the noise. Manual alerts can identify only certain failures; beyond that, they are not effective.

Machine Learning Basics

Machine learning is based on the ability of specialized software to build its own logic and independently perceive connections among data. There are three main approaches to machine learning:

Supervised learning involves forming a labeled training data set and using it to teach the algorithm. It can solve both regression and classification problems.
Unsupervised learning does not use a labeled training set and, instead, only deploys techniques for finding hidden patterns in the data set being analyzed. An example is the grouping of anomalies (e.g., fraud) into clusters with similar characteristics.
Reinforcement learning is usually formalized as a Markov decision process, in which the new state and the reward depend only on the previous state and the action taken, not on the entire history of the process. Rewards implicitly define the goal of the agent: the agent chooses its actions so as to maximize the reward while doing the required job.

In practice, machine learning can be realized via an iterative process with the following steps:

Provide relevant data to solve a given problem.
Prepare data for analysis by machine learning, with the aim of working only with reliable data (i.e., filter/clean the training data set).
Select an appropriate machine-learning algorithm.
Train the algorithm (supervised, unsupervised, or reinforcement learning) to form a quality model.
Evaluate the model to select the algorithm with the best performance on a specific problem.
Distribute the created model to users in the form of an application.
Let users solve the problem based on their own data, which the model has not seen before.
Assess the validity of the solution to the problem and then return to the beginning of the process until a sufficiently good solution is obtained.
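
The first five steps above can be sketched in a few lines of scikit-learn. This is only a minimal illustration: the data is synthetic, and the choice of logistic regression is an assumption made for the example, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: provide data (two synthetic, well-separated classes; assumed for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (250, 3)), rng.normal(2, 1, (250, 3))])
y = np.array([0] * 250 + [1] * 250)
X[::50, 0] = np.nan                      # simulate a few unreliable records

# Step 2: keep only reliable rows
mask = ~np.isnan(X).any(axis=1)
X, y = X[mask], y[mask]

# Steps 3-4: select an algorithm and train it on a training subset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 5: evaluate the model on data it has not seen
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {test_accuracy:.2f}")
```

If the held-out score is not good enough, the loop returns to the data or to the choice of algorithm, as described in the last step.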

Application of Machine Learning for Anomaly Detection in a Real-Time Data Flow

Real-time data flow (e.g., streaming) is a sequence of data items that is real-time, continuous, and ordered either implicitly (by arrival time) or explicitly (by timestamp). Real-time data occurs in many industries: e-commerce, banking (transaction logs), online auctions, telecom (phone calls), etc. It is characterized by the inability to control the order in which items arrive, as well as the inability to store the entire stream locally.

There are eight ways to detect anomalies in real-time data:

Density
Distance
Projection
Regression
Support vector machine
Decision tree
Grouping/clustering
Time series

Density-based methods estimate the probability distribution of the data (or, in nonparametric methods, the probability density at each point), and data points in the low-probability range are considered to be anomalies. This means that when new data arrives, its density is estimated, and if the estimated density is below a given threshold value, this data (value) is considered an anomaly.
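
As a rough sketch of this idea, the snippet below fits a kernel density estimate on synthetic "normal" data and flags a new value when its log-density falls below a threshold. The bandwidth and the 1st-percentile threshold are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=(500, 1))  # synthetic history

# Nonparametric density estimate fitted on the historical data
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(normal_data)

# Assumed threshold: the 1st percentile of the log-density over the history
threshold = np.percentile(kde.score_samples(normal_data), 1)

def is_anomaly(value):
    """Return True when the estimated log-density of `value` is below the threshold."""
    log_density = kde.score_samples(np.array([[value]]))[0]
    return bool(log_density < threshold)

print(is_anomaly(0.1), is_anomaly(8.0))
```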

Distance-based methods define anomalies based on how far an observation is from its nearest “neighbors.”
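
A simple way to express this in code is to score each observation by its mean distance to the k nearest reference points. The helper `knn_distance`, the value of k, and the 99th-percentile threshold are hypothetical choices made for this sketch.

```python
import numpy as np

def knn_distance(point, reference, k=5):
    """Mean Euclidean distance from `point` to its k nearest neighbours in `reference`."""
    distances = np.linalg.norm(reference - point, axis=1)
    return float(np.sort(distances)[:k].mean())

rng = np.random.default_rng(1)
reference = rng.normal(size=(300, 2))    # synthetic history of normal observations

# Assumed threshold: 99th percentile of leave-one-out scores within the reference set
loo_scores = [knn_distance(p, np.delete(reference, i, axis=0))
              for i, p in enumerate(reference)]
threshold = np.percentile(loo_scores, 99)

near_score = knn_distance(np.array([0.0, 0.0]), reference)
far_score = knn_distance(np.array([6.0, 6.0]), reference)
print(near_score > threshold, far_score > threshold)
```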

Projection-based methods project data onto another, most often lower-dimensional, space, with anomalies being those values or points with a high reconstruction error.
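
One common instance of this approach uses PCA as the projection. The sketch below, built on synthetic data that lies close to a plane, measures how far each point is from its reconstruction; the 99th-percentile cut-off is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic 3-D data lying close to the plane z = x + y
xy = rng.normal(size=(400, 2))
data = np.column_stack([xy, xy.sum(axis=1) + rng.normal(scale=0.05, size=400)])

pca = PCA(n_components=2).fit(data)

def reconstruction_error(points):
    """Distance between each point and its reconstruction from the 2-D subspace."""
    reconstructed = pca.inverse_transform(pca.transform(points))
    return np.linalg.norm(points - reconstructed, axis=1)

threshold = np.percentile(reconstruction_error(data), 99)   # assumed cut-off
on_plane_error = reconstruction_error(np.array([[1.0, 1.0, 2.0]]))[0]
off_plane_error = reconstruction_error(np.array([[1.0, 1.0, -3.0]]))[0]
print(on_plane_error > threshold, off_plane_error > threshold)
```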

Regression

Regression-based methods create a regression model and define anomalies based on deviations from the predicted value. The method calculates the prediction interval from recent historical data as it detects the anomaly. Training is done as a series with 10-fold cross-validation. There are two outcomes: either the anomalies are included in the window used to calculate the predictions, or they are discarded. This determines whether the model is adaptable to anomalies or not. There are four different predictors:

A naive predictor assumes that the attributes are independent of the class, which is why it is called naive. It is very simple and not very useful in most cases.
A nearest cluster predictor assigns the training data to clusters using k-means, predicts to which cluster the new data belongs, and calculates the cluster mean as the value of the new data. Since it operates in series, it is not very adaptable to changes in data distribution.
A single-layer linear network predictor involves a linear combination of values in the window. Because its weights are updated in each iteration, it is the best option for online learning problems.
A multilayer perceptron predictor involves a nonlinear combination of input variables (neural network).
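
To make the window-and-interval idea concrete, here is a minimal sketch that uses a plain linear trend fitted to the most recent window as the predictor (a simplification, not one of the four predictors listed above). The window size, the multiplier `z`, and the decision to discard flagged values are assumptions for illustration.

```python
import numpy as np

def detect(series, window=30, z=4.0):
    """Flag points that fall outside a prediction interval built from recent history."""
    history = list(series[:window])      # warm-up data, assumed to be anomaly-free
    flagged = []
    t = np.arange(window)
    for i in range(window, len(series)):
        recent = np.array(history[-window:])
        slope, intercept = np.polyfit(t, recent, deg=1)    # linear trend over the window
        residual_std = np.std(recent - (slope * t + intercept))
        predicted = slope * window + intercept             # one-step-ahead prediction
        if abs(series[i] - predicted) > z * residual_std:
            flagged.append(i)
            history.append(predicted)    # discard the anomaly so it does not distort the window
        else:
            history.append(series[i])
    return flagged

rng = np.random.default_rng(3)
series = 0.05 * np.arange(200) + rng.normal(scale=0.1, size=200)
series[150] += 3.0                       # injected spike
flagged = detect(series)
print(flagged)
```

Appending `series[i]` instead of `predicted` in the flagged branch would give the other variant described above, in which anomalies stay in the window and the model adapts to them.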

Support Vector Machines

The support vector machine (SVM) method is an approach to classification where support vectors form the boundaries of a class. In detecting anomalies, a one-class SVM can be used to define a normal class, and data points that are outside the class boundaries can be defined as anomalies.

Davy et al. (2006) constructed a kernel-based SVM model, considering that kernels are not sensitive to data dimensions and do not require the data to fit into certain statistical distributions. The boundaries of a normal class are defined so that most points lie within them. To build the model, they used data points that were classified as normal. Data points that were not classified as normal could also be used, but only if the number of anomalies is very small (so their impact would be negligible). Incremental addition and removal of vectors is an iterative process that can lead to a bottleneck during computation.

Decision Trees

Decision tree-based methods create classification or regression decision trees, and according to them, anomalies are those values that deviate from the direction in which the decision tree has moved.

Tree-based methods create a tree structure from data, where the tree is updated with new data, but if new data causes significant changes in the tree structure, the model needs to be re-trained.

To detect an anomaly, Tan (2011) used a single-class method based on half-space trees formed by a set of nodes, each node containing a certain amount of data within a subset of data streams. This model assumes that the data values are in the range of 0 to 1, or that the data should be mapped in that interval.

A similar method was developed by Guha (2016), but it is based on a random cut forest, which is produced by randomly splitting individual attributes (at a certain time) at a randomly selected location within the interval of that attribute. The complexity of this model is computed as the sum of the depths of the forest nodes. Data is marked as an anomaly if the common distribution of nodes is significantly different from the distribution that excludes that data. Unlike Tan’s method, here the interval of data values is not strictly defined, but it is necessary to know the interval in which the data occurred.
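
scikit-learn does not ship half-space trees or random cut forests, so the sketch below uses the related Isolation Forest instead. It is a different algorithm, shown only to illustrate how a forest of randomly split trees can separate normal points from outliers; the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
train = rng.normal(loc=0.5, scale=0.05, size=(500, 2))   # synthetic normal data inside [0, 1]

forest = IsolationForest(n_estimators=100, random_state=0).fit(train)

# predict returns +1 for inliers and -1 for outliers
labels = forest.predict(np.array([[0.5, 0.5], [0.95, 0.05]]))
print(labels)
```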

Grouping (Clustering)

Methods based on grouping data into clusters define anomalies in two ways. The first way is to observe how far some data is from other clusters or cluster centers, and the second is to observe whether the data forms an atypical cluster.

Clusters are stored as condensed information: the number of data (points) in the cluster, the sum of the cluster data, and the sum of the squares of the cluster data.

The centroid is the average of the points, and radius describes the tightness of the cluster, or how far the points are from the centroid. Centroid and radius can be calculated from cluster features, and cluster features of two clusters can be easily merged.
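
The condensed representation can be sketched as a small class. The name `ClusterFeature` and its fields are hypothetical; the radius is taken here as the root mean squared distance of the points to the centroid.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ClusterFeature:
    n: int                   # number of points in the cluster
    linear_sum: np.ndarray   # per-dimension sum of the points
    square_sum: float        # sum of the squared norms of the points

    @classmethod
    def from_points(cls, points):
        points = np.asarray(points, dtype=float)
        return cls(len(points), points.sum(axis=0), float((points ** 2).sum()))

    def centroid(self):
        return self.linear_sum / self.n

    def radius(self):
        # sqrt(SS / N - ||LS / N||^2), clipped at zero against rounding errors
        c = self.centroid()
        return float(np.sqrt(max(self.square_sum / self.n - c @ c, 0.0)))

    def merge(self, other):
        # Merging two clusters only requires adding their summaries
        return ClusterFeature(self.n + other.n,
                              self.linear_sum + other.linear_sum,
                              self.square_sum + other.square_sum)

a = ClusterFeature.from_points([[0.0, 0.0], [2.0, 0.0]])
b = ClusterFeature.from_points([[1.0, 3.0]])
merged = a.merge(b)
print(merged.centroid(), merged.radius())
```

Because only three summaries are stored, the original points never need to be kept in memory, which is what makes this representation attractive for streams.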

A tree node contains one or more cluster features. Cluster features also have a threshold that determines whether they can absorb additional data points: after absorption, the radius must be less than the threshold’s value, or the distance to the data point must be less than the threshold. When a new data point arrives, its nearest cluster is found and checked to see whether the threshold would be exceeded. If it is not exceeded, the point is added to the cluster. Otherwise, a new cluster is formed, but only if the maximum number of clusters has not been reached. If the maximum number of clusters is reached, the threshold value is increased and the model is rebuilt.
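
The assignment rule can be sketched without the tree structure, using a flat list of clusters and the distance-to-centroid variant of the threshold test. The function `assign` and the dictionary layout are hypothetical, and the rebuild step is only signalled, not implemented.

```python
import numpy as np

def assign(point, clusters, threshold, max_clusters):
    """Add `point` to the nearest cluster if it is within `threshold`, else open a new one.

    `clusters` is a list of dicts with keys "n" and "sum". Returns the index of the
    cluster used, or None when a new cluster is needed but `max_clusters` is reached
    (the caller would then raise the threshold and rebuild the model).
    """
    point = np.asarray(point, dtype=float)
    if clusters:
        centroids = np.array([c["sum"] / c["n"] for c in clusters])
        distances = np.linalg.norm(centroids - point, axis=1)
        nearest = int(np.argmin(distances))
        if distances[nearest] <= threshold:
            clusters[nearest]["n"] += 1
            clusters[nearest]["sum"] = clusters[nearest]["sum"] + point
            return nearest
    if len(clusters) < max_clusters:
        clusters.append({"n": 1, "sum": point.copy()})
        return len(clusters) - 1
    return None

clusters = []
results = [assign(p, clusters, threshold=1.0, max_clusters=2)
           for p in ([0, 0], [0.5, 0], [5, 5], [10, 10])]
print(results)
```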

In his model for anomaly detection, Burbeck (2007) assumes that training data does not contain anomalies. The model consists of the cluster’s number (ID) and the index of a tree, whose leaves contain clusters. Clusters can be removed from the tree if data corresponding to those clusters has not appeared for some time.

Time Series

Time-series methods define anomalies as deviations from a model constructed for the series. Numerous authors have developed such models, but we will mention two.

Khreich (2012) bases his method on hidden Markov models (HMM). According to him, there are ways to incrementally update the HMM, but the performance is not good for streaming data. His model does not update the existing HMM, but builds a new one as new data series arrive. To merge HMMs, an incremental Boolean combination is used that aims to maximize the receiver operating characteristic curve.

Yamanishi (2002) focused on an auto-regressive model that refers to a data set that contains both categorical and numerical variables. Learning takes place through the sequential classification of estimates, which the auto-regressive (AR) model makes for particular data in the stream.

An AR model of order k expresses the current value as a linear combination of the values at the k previous time points. The data anomaly can be calculated by comparing the distance of the distributions before and after the occurrence of that data (time points) or by calculating the density distance.
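
As a minimal sketch of the AR idea (not Yamanishi's sequential learning scheme), the code below fits an AR(2) model by ordinary least squares on a synthetic series and compares the prediction error of a hypothetical outlying value with the typical residual. The helpers `fit_ar` and `predict_next` and the factor of 4 are assumptions.

```python
import numpy as np

def fit_ar(series, k):
    """Least-squares fit of x_t = c + a_1 * x_(t-1) + ... + a_k * x_(t-k)."""
    rows = [series[t - k:t][::-1] for t in range(k, len(series))]
    design = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coeffs, *_ = np.linalg.lstsq(design, series[k:], rcond=None)
    return coeffs                        # [c, a_1, ..., a_k]

def predict_next(coeffs, last_k):
    """One-step prediction from the k most recent values (oldest first)."""
    return coeffs[0] + coeffs[1:] @ np.asarray(last_k)[::-1]

# Synthetic AR(2) series, assumed for illustration
rng = np.random.default_rng(5)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal(scale=0.1)

coeffs = fit_ar(x, k=2)
residual_std = np.std([x[t] - predict_next(coeffs, x[t - 2:t]) for t in range(2, 500)])

# Hypothetical outlying value: the last observation shifted by 2.0
spike_error = abs((x[-1] + 2.0) - predict_next(coeffs, x[-3:-1]))
print(coeffs.round(2), spike_error > 4 * residual_std)
```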

Benefits for Businesses

By applying machine learning, companies can improve the speed and efficiency of anomaly and fraud detection in their environment. Since anomalies can occur in any type of data, a wide variety of different industries can gain large benefits by using AI-powered anomaly detection.

Telecom: Telecom operators tend to want to improve the network’s health and operation, so they actively apply tools for network anomaly detection. Some providers use fraud detection techniques based on machine learning to uncover real-time atypical behaviors, deviant charges, phishing calls, and even inefficient or inaccurate sales patterns across their organization.
Healthcare: Fraud detection in this field is another hot area for employing machine learning and AI techniques. Machine learning can enhance the speed and accuracy of diagnosis procedures; for example, unsupervised anomaly detection is used in CT scanning image analysis. Real-time anomaly detection will also become necessary with the increased demand for medical IoT devices and health wearables. Anomaly detection in these use cases can help spot the early signs of a health issue and alert the patient and medical staff. Finally, machine-learning algorithms can be used to find fraud (corruption) in medical payments, forming the basis of cutting-edge insurance fraud detection techniques applied in medical institutions.
Supply chain: Inefficiency of the supply chain can cause significant disruptions in many industries (e.g., retail) and also lower productivity and revenue. Machine learning-based anomaly detection can be applied for demand planning and provides better predictions than traditional analytics in most cases. AI tools can also improve inventory management, automate root-cause analysis, and enhance supply and production planning.

Anomaly Detection Example

Let’s consider a simplified example of using anomaly detection in a wind power production system.

A wind turbine is a complex electromechanical system that consists of several components and subsystems. There are multiple reasons why anomaly detection and predictive maintenance are highly important for this kind of system:

Wind turbines are often located in remote locations and downtime can last for days before the required spare parts reach their destination.
Errors classified as major failures, meaning that the failure has an associated downtime greater than one day, constitute 25% of the number of errors but are responsible for 95% of the downtime.
Not only can predictive maintenance reduce the number of failures by correcting errors, but it can also reduce the number of redundant hours being spent on routine controls or maintenance of well-functioning components.

The major components of a wind turbine include rotor, bearings, mechanical shaft, gearbox, generator, power electronic interface, and sensors as shown in the figure below.

The wind turbine system is subject to several types of faults within various components as shown below.

Continuous monitoring of wind turbine health using early fault detection methods helps to optimize the maintenance period, save maintenance expenses and, even more importantly, generate warnings in due time to avoid major damage or even technical disasters.

In order to build representative models for the system, it is necessary to have the historical monitoring data of an adequate timeframe for the customization of the models. In this project, we used open-source data found on Kaggle (URL: https://www.kaggle.com/wallacefqq/wind-turbines-scada-datasets) obtained from the SCADA (Supervisory Control And Data Acquisition) system.

SCADA systems are used for both monitoring and controlling industrial systems remotely, and they provide an efficient way for industries to gather and analyze data in real time. Hundreds of thousands of sensors are often used in larger SCADA systems, generating large amounts of data.

Our data has the following columns:

Unit location: Data from two locations (WTG40 and WTF43) is present. In our case, we will only consider WTG40.
Timestamp local: The data is recorded between 2020-03-01 00:00 and 2020-12-31 23:50:00.
Windspeed: average wind speed in meters per second.
Power: power production in kWh
Wind direction angle (degrees)
Rotor RPM: rotor speed in rotations per minute
Pitch Angle
Generation
Wheel hub temperature
Ambient temperature
Tower bottom ambient temperature
Failure time

While we have a failure time in the dataset, the exact failure time is unclear. The dataset does not clearly state the runtime between failures or the total runtime before failure.

Let’s use the one-class SVM unsupervised learning algorithm to detect anomalies. The following examples use Python, pandas, and scikit-learn to implement the model.

Let’s first look at our data:

import pandas as pd

df = pd.read_csv('../input/windturbinescada/wt-scada.csv')
df.describe()

Next, we need to perform the data cleanup.

# Replace spaces in column names with underscores
df.columns = df.columns.str.replace(' ', '_')

# Select the WTG40 turbine by location; copy so new columns can be added safely
grouped = df.groupby(df.unitlocation)
wtg40 = grouped.get_group("WTG40").copy()

# Failure times
fault_time_wtg40 = wtg40.failure_time.unique()

# Cumulative runtime (10 per record) and the failure status derived from it
wtg40['total_runtime'] = [i*10 for i in range(1, len(wtg40)+1)]
wtg40['is_fault'] = wtg40['total_runtime'] < fault_time_wtg40.max()

# Remove columns that are unnecessary
df = wtg40.drop(['unitlocation', 'ttimestamplocal','total_runtime', 'failure_time'], axis=1)

# Encode the boolean status as integers: True -> 1, False -> 0
df['is_fault'] = df['is_fault'].astype(int)

# Final datasets for training
y = df['is_fault']
X = df.drop(['is_fault'], axis=1)

Before proceeding further, let’s check whether any null values are present in the dataset.

X.isnull().values.any()
> False

“y” contains failure status where:

1 -> points to a failure/outlier
0 -> points to a normal operation.

Let’s check what our data set looks like after the cleanup:

X.describe()

Before starting with the model, we would like to reduce the dimensionality to build a simpler predictive model that may have better performance when making predictions on new data.

The most popular technique for dimensionality reduction in machine learning is Principal Component Analysis, or PCA for short. This is a technique that comes from the field of linear algebra and can be used as a data preparation technique to create a projection of a dataset prior to fitting a model.

Let’s apply PCA with a number of components equal to two.

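from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features, then apply PCA to reduce dimensionality
X = StandardScaler().fit_transform(X)
pcamodel = PCA(n_components=2)
X = pcamodel.fit_transform(X)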
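
Before relying on two components, it can be worth checking how much of the variance they retain. The sketch below does this on a synthetic stand-in for the sensor table (five correlated columns driven by two hidden factors), since the real figures depend on the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
# Synthetic stand-in: 5 correlated columns generated from 2 hidden factors plus noise
factors = rng.normal(size=(1000, 2))
features = factors @ rng.normal(size=(2, 5)) + rng.normal(scale=0.1, size=(1000, 5))

scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2).fit(scaled)

# Share of the total variance captured by the two components
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained by 2 components: {retained:.1%}")
```

On the turbine data, the same check would be `pcamodel.explained_variance_ratio_.sum()` after fitting.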

The first (PC1) and second (PC2) principal components found by PCA look as follows:

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], color='blue')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

Let’s split testing and training data sets:

X_train = X[:2000]
X_test = X[2000:2010]
y_train = y[:2000]
y_test = y[2000:2010]

Then we fit and train our model

from sklearn.svm import OneClassSVM

clf = OneClassSVM(nu=0.15, gamma=0.35)
clf.fit(X_train)
clf.get_params(deep=True)

Let’s generate some abnormal novel observations and classify them

import numpy as np

X_outliers = np.random.uniform(low=-5, high=0, size=(10, 2))
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

Let’s plot the result in two-dimensional space

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the decision function on a grid to draw the learned boundary
xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(1)
plt.title("Fault Detection in Wind Turbine")
# The zero level of the decision function is the boundary of the normal class
plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='green')
train_points = plt.scatter(X_train[:, 0], X_train[:, 1], color='blue')
test_points = plt.scatter(X_test[:, 0], X_test[:, 1], color='black')
outlier_points = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], color='red')
plt.xlim((xx.min(), xx.max()))
plt.ylim((yy.min(), yy.max()))
plt.legend([train_points, test_points, outlier_points],
           ["train", "test", "outlier"],
           loc="best",
           prop={"size": 11})
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

Let’s check the model accuracy

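from sklearn import metrics
y_pred_fault = (y_pred_train == -1).astype(int)  # map -1 (outlier) / +1 (inlier) to the 1/0 labels used in y
print("accuracy: ", metrics.accuracy_score(y_train, y_pred_fault))
print("area under curve (auc): ", metrics.roc_auc_score(y_train, y_pred_fault))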

Making use of the model

To use the model on new data (e.g., in JSON format), we could do something like this:

# `data` is assumed to be cleaned, scaled, and PCA-projected in the same way as X_train
data = pd.read_json(some_json)
clf.predict(data)

If the output is -1, the model has predicted the data to be an outlier (which means wind turbine failure); +1 means an inlier (normal operation). To use the model outside of our development environment, we need to save it to disk. Fortunately, this is quite straightforward:

import joblib
outputfile = 'oneclass_v1.model'
joblib.dump(clf, outputfile, compress=9)

Then, in our deployed code, we can load the model back in with:

import joblib
model = joblib.load('oneclass_v1.model')
# then predict with model.predict(..)
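
Since new data has to pass through the same scaling and PCA projection as the training data, one option is to bundle these steps with the detector in a scikit-learn pipeline and persist the whole object. The sketch below uses synthetic data and a hypothetical file name; the `nu` and `gamma` values are copied from the example above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
train = rng.normal(size=(500, 6))        # synthetic stand-in for the cleaned sensor data

# Bundle preprocessing and the detector so that they are saved and applied together
pipeline = make_pipeline(StandardScaler(),
                         PCA(n_components=2),
                         OneClassSVM(nu=0.15, gamma=0.35))
pipeline.fit(train)

path = os.path.join(tempfile.mkdtemp(), "oneclass_pipeline.joblib")  # hypothetical file name
joblib.dump(pipeline, path, compress=9)

# In the deployed code: load once, then predict on raw (unscaled) rows
restored = joblib.load(path)
new_rows = rng.normal(size=(5, 6))
same = np.array_equal(pipeline.predict(new_rows), restored.predict(new_rows))
print(same)
```

With this layout, the deployed code never needs to know which preprocessing was used during training.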

Conclusion

About 30% of all industrial equipment does not benefit from predictive maintenance technologies and instead relies on periodic inspections to detect anomalies.

Detecting anomalies can stop a minor issue from becoming a widespread, time-consuming problem. By using the latest machine learning methods, you can track trends, identify opportunities and threats, and gain a competitive advantage with anomaly detection. Do you want to leverage anomaly detection and stay ahead of the curve? Choose a technology partner that can check and improve your business’s technological readiness. Contact us today!

Have a question?

- Ahtri tn 12, Tallinn, 15551, Estonia
- 18 Yunosti ave., Vinnytsia, 21000, Ukraine
- 275 New North Road, London, England, N1 7AA

info@plexteq.com

+372 6 10 42 43
+380 67 395 35 34
