What is data anomaly detection?
Data anomaly detection is a technique for identifying unusual patterns, called outliers, that do not conform to expected behavior. It has many applications in business, ranging from intrusion detection to system health monitoring (such as spotting a malignant tumor in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments.
This overview presents several methods for detecting anomalies, as well as how to build a detector in Python using a simple moving average (SMA), or low-pass filter.
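As a first illustration of the SMA idea, here is a minimal sketch rather than a definitive implementation. It compares each value with the average of the points just before it and flags the value when the gap is large relative to the spread of that window. The function name `sma_anomalies`, the `window` and `n_sigmas` parameters, and the `readings` data are all assumptions made up for this example.

```python
from statistics import mean, pstdev


def sma_anomalies(values, window=5, n_sigmas=3.0):
    """Return indices whose value is far from the trailing moving average.

    A point is flagged when it differs from the mean of the previous
    `window` points by more than `n_sigmas` standard deviations of
    that same window. (Assumed rule, chosen for illustration.)
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]   # the `window` points before index i
        sma = mean(history)              # simple moving average
        spread = pstdev(history)         # spread of the trailing window
        # A perfectly flat window has zero spread; it is skipped here.
        if spread > 0 and abs(values[i] - sma) > n_sigmas * spread:
            flagged.append(i)
    return flagged


# Hypothetical sensor readings with one obvious spike at index 7.
readings = [10, 11, 10, 12, 11, 10, 11, 30, 11, 10, 12, 11]
print(sma_anomalies(readings))  # [7]
```

A longer window smooths more aggressively but reacts more slowly to genuine level shifts, so both parameters usually need tuning on real data.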
Various data anomaly detection concepts and techniques
What Are Anomalies?
Before getting started, it is important to establish some boundaries on the definition of an anomaly. Anomalies can be broadly categorized as:
1. Point anomalies: A single instance of data is anomalous if it lies too far from the rest. Business use case: Detecting credit card fraud based on “amount spent.”
2. Contextual anomalies: The abnormality is context-specific. This kind of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.
3. Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone is unexpectedly trying to copy data from a remote machine to a local host, an anomaly that would be flagged as a potential cyber attack.
Anomaly detection is similar to, but not entirely the same as, noise removal and novelty detection.
Novelty detection is concerned with identifying a previously unobserved pattern in new observations that were not included in the training data, such as a sudden interest in a new channel on YouTube during Christmas. Noise removal (NR) is the process of removing noise from an otherwise meaningful signal.
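To make the difference between point and contextual anomalies concrete, the sketch below scores each value only against other values that share its context. The same $100 of food spending is then judged differently in the holiday season and in a regular week. The function name, the `n_sigmas` threshold, the context labels, and the spending figures are all hypothetical.

```python
from statistics import mean, pstdev


def contextual_anomalies(records, n_sigmas=2.0):
    """records is a list of (context, value) pairs.

    Returns the indices of values that are unusual relative to the
    other records sharing the same context.
    """
    by_context = {}
    for context, value in records:
        by_context.setdefault(context, []).append(value)
    # Mean and spread are computed separately for every context.
    stats = {c: (mean(v), pstdev(v)) for c, v in by_context.items()}

    flagged = []
    for i, (context, value) in enumerate(records):
        mu, sigma = stats[context]
        if sigma > 0 and abs(value - mu) > n_sigmas * sigma:
            flagged.append(i)
    return flagged


# Made-up daily food spending in dollars.
holiday = [95, 100, 105, 100, 98, 102]
regular = [20, 25, 22, 18, 24, 21, 100, 23, 19, 22]
records = [("holiday", v) for v in holiday] + [("regular", v) for v in regular]

for i in contextual_anomalies(records):
    print(i, records[i])
```

With this data only the $100 spent in a regular week is flagged; the $100 holiday days are not, because they are typical for their context.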
Using Machine Learning (supervised and unsupervised) for anomaly detection
Most anomaly detection techniques use labels to make the final decision on whether an instance is normal or anomalous. Obtaining labeled data that is accurate and representative of all types of behavior is difficult and often prohibitively expensive.
Anomaly detection techniques can be divided into three modes based on the availability of labels:
Supervised Anomaly Detection: Techniques of this kind assume that a training data set with accurate and representative labels for both normal and anomalous instances is available.
In such cases, the usual approach is to build a predictive model for the normal and anomalous classes. Each test instance is then run through this model to determine which class it belongs to.
However, these techniques share some challenges:
• Far fewer anomalous instances than normal ones are available, and the “normal” examples may themselves contain an unknown set of outliers. The latter issue is known as the Positive-Unlabeled Classification (PUC) problem.
• Obtaining accurate and representative labels, especially for the anomalous class, is difficult because an anomaly is often determined by a combination of multiple attributes. This situation is common in scenarios such as fraud detection.
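The supervised approach described above, a predictive model for the normal and anomalous classes, can be sketched with a deliberately simple classifier. The text does not name a specific model, so the nearest-centroid rule used here is only one assumed choice; the feature layout (amount spent, transactions per hour) and the tiny labeled data set are also invented for illustration.

```python
from math import dist


def fit_centroids(samples, labels):
    """Return {label: centroid} computed from labeled training data."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {
        y: tuple(sum(coords) / len(xs) for coords in zip(*xs))
        for y, xs in groups.items()
    }


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(x, centroids[label]))


# Hypothetical features: (amount spent, transactions per hour).
samples = [(20, 1), (35, 2), (15, 1), (40, 1), (900, 8), (1200, 10)]
labels = ["normal", "normal", "normal", "normal", "anomaly", "anomaly"]

model = fit_centroids(samples, labels)
print(predict(model, (30, 2)))     # normal
print(predict(model, (1000, 9)))   # anomaly
```

Note how few anomalous examples the model sees; this is the class imbalance described in the first bullet, and in practice it often calls for reweighting or resampling.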
Semi-Supervised Anomaly Detection: Techniques of this kind assume that the training data has labeled instances for the normal class only. Since they do not require labels for the anomalous class, they are more widely applicable than supervised techniques. One example is a semi-supervised algorithm applied to an online social network.
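A minimal sketch of the semi-supervised setting follows: the model is fitted only on instances known to be normal and then used to judge new ones. Modeling "normal" as a single Gaussian with a three-sigma cutoff is an assumption made for brevity, and the class name `NormalOnlyDetector` and the measurements are hypothetical.

```python
from statistics import mean, stdev


class NormalOnlyDetector:
    """Learns what 'normal' looks like from normal-class examples only."""

    def __init__(self, n_sigmas=3.0):
        self.n_sigmas = n_sigmas

    def fit(self, normal_values):
        # No anomalous examples are needed at training time.
        self.mu = mean(normal_values)
        self.sigma = stdev(normal_values)
        return self

    def is_anomaly(self, value):
        return abs(value - self.mu) > self.n_sigmas * self.sigma


# Made-up measurements, all labeled as normal.
normal_training = [50.1, 49.8, 50.3, 49.9, 50.0, 50.2, 49.7, 50.0]
detector = NormalOnlyDetector().fit(normal_training)

print([detector.is_anomaly(v) for v in [50.1, 49.6, 53.0]])
# [False, False, True]
```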
Unsupervised Anomaly Detection: These techniques do not require labeled training data and are therefore the most widely applicable. Unsupervised methods can “pretend” that the entire data set belongs to the normal class, build a model of the normal data, and regard deviations from that model as anomalies. Many semi-supervised techniques can be adapted to operate in an unsupervised mode by using a sample of the unlabeled data set as training data. Such an adaptation assumes that the test data contains only a small number of anomalies and that the model learned during training is robust to these few anomalies.
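The unsupervised idea can be sketched by fitting a model of "normal" to the whole unlabeled data set using statistics that a few anomalies cannot easily distort. The example below uses the median and the median absolute deviation (MAD) with the commonly used modified z-score (constant 0.6745, cutoff 3.5). Choosing this particular robust score is an assumption of the sketch, and the data is made up.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Flag indices whose modified z-score exceeds `threshold`.

    The median and MAD are estimated from all values, anomalies
    included, which works because both are robust to a few outliers.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to compare against in this simple sketch
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Unlabeled, made-up data containing two unusual values.
data = [12, 13, 12, 11, 14, 13, 12, 95, 13, 12, 11, 13, -40]
print(mad_anomalies(data))  # [7, 12]
```

A plain mean and standard deviation would be pulled toward 95 and -40 here, which is exactly the robustness concern raised above.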
Clustering algorithms for anomaly detection
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It looks for areas of high density and assigns clusters to them, whereas points in less dense regions are not included in any cluster and are labeled as noise, that is, as anomalies. This is one of the main reasons to prefer DBSCAN: not only can it detect anomalies in test data, but anomalies in the training data are also detected and do not distort the results.
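The density idea can be shown with a compact from-scratch sketch. It is meant for reading, not production use; a library implementation would normally be preferred. The parameter names `eps` (neighborhood radius) and `min_samples` (points needed for a dense region, counting the point itself) and the sample coordinates are assumptions of this example.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE        # may later be relabeled as a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue             # already assigned, do not expand again
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is core, keep growing
    return labels


# Two dense groups and one isolated point (made-up coordinates).
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 0.5),
]
print(dbscan(points, eps=0.5, min_samples=3))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point never joins a cluster, so it is reported as noise without any separate thresholding step.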
Gaussian mixture models
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions. The algorithm tries to recover the original Gaussians that generated the data. To do so, it uses the expectation-maximization (EM) algorithm, which starts from n randomly initialized Gaussian distributions and then iteratively adjusts their parameters, looking for a combination that maximizes the likelihood of the points having been generated by that mixture.
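The EM loop can be sketched for one-dimensional data as follows. This is a simplified illustration under stated assumptions: the starting means are placed at evenly spaced quantiles instead of randomly, so that the example is reproducible, and a point is treated as anomalous when its log-likelihood falls below the lowest value seen in training. All function names and the data are hypothetical.

```python
from math import exp, log, pi, sqrt


def gaussian_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)


def fit_gmm_1d(data, k=2, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    n = len(data)
    ordered = sorted(data)
    # Deterministic start (assumption): means at quantiles, shared variance.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall = sum(data) / n
    variances = [max(sum((x - overall) ** 2 for x in data) / n, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: how responsible is each component for each point?
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # avoid a collapsed component
    return weights, means, variances


def log_likelihood(x, model):
    weights, means, variances = model
    density = sum(w * gaussian_pdf(x, m, v)
                  for w, m, v in zip(weights, means, variances))
    return log(max(density, 1e-300))  # floor avoids log(0) for far points


# Made-up data drawn around two centers, near 0 and near 10.
training = [-0.5, -0.2, 0.0, 0.1, 0.3, 0.4, -0.3, 0.2,
            9.6, 9.9, 10.0, 10.2, 10.4, 9.8, 10.1, 10.3]
model = fit_gmm_1d(training, k=2)
threshold = min(log_likelihood(x, model) for x in training)


def is_anomaly(x):
    return log_likelihood(x, model) < threshold


print(is_anomaly(0.15), is_anomaly(5.0))
```

A value between the two groups, such as 5.0, has very low likelihood under both components and is flagged, while 0.15 sits comfortably inside the first one.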
Why not K-Means?
While K-Means is perhaps the best-known and most commonly used clustering algorithm for other applications, it is not well suited to this one. The main reason is that it only works well when clusters are expected to have fairly regular shapes; as soon as this condition is not met, the model cannot separate the clusters successfully.
Another reason is that every point is assigned to a cluster, so if the training data contains anomalies, these points will belong to clusters and will probably affect their centroids and, especially, the radius of the clusters. This can cause anomalies in the test set to go undetected because the threshold distance has increased. It is even possible to form a cluster made up entirely of anomalies, since there is no lower limit on the number of points in a cluster. If you have no labels (and you probably do not, since otherwise there are better methods than clustering), new data may appear to belong to a normal-behavior cluster when it is actually a well-defined anomaly.
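The effect of a training anomaly on the centroid and radius can be seen in a small sketch. To keep it short, it uses a single cluster (k = 1), where the K-Means centroid is simply the mean of the points, and it takes the cluster radius to be the distance to the farthest member. Both simplifications, the helper names, and the coordinates are assumptions of this example.

```python
from math import dist


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def radius(points, center):
    # Threshold distance: the farthest training point from the centroid.
    return max(dist(p, center) for p in points)


def flags(train, point):
    """True if `point` lies outside the cluster learned from `train`."""
    c = centroid(train)
    return dist(point, c) > radius(train, c)


clean = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.2), (0.9, 0.9)]
contaminated = clean + [(6.0, 6.0)]   # one anomaly slipped into training
test_point = (4.0, 4.0)               # clearly unlike the normal points

print("clean training flags it:       ", flags(clean, test_point))
print("contaminated training flags it:", flags(contaminated, test_point))
```

With clean training data the test point is flagged. Once a single anomaly is part of the cluster, the centroid shifts and the radius grows enough that the same test point is accepted as normal.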