Monika Wysoczańska Feb 23, 2021 - Artificial Intelligence

Searching for a black sheep in your data – Introduction to Anomaly Detection

Authors: Monika Wysoczańska, Paweł Kotowski. Image source: [6]

Imagine that one day you go to a shop and cannot pay for your purchase because your bank has blocked your card. When you call the bank, it turns out that someone attempted a suspicious transaction with your card details on the other side of the globe. Since you are not travelling at the moment, your first reaction is surprise at the awkward situation in your local shop. After a while, though, you realize that your bank may have just saved you from losing a significant amount of money by detecting suspicious behavior: an anomaly.

What is anomaly detection?

Anomaly detection is the identification of rare occurrences, items, or events of concern whose characteristics differ from the majority of the processed data. Anomalies, or outliers as they are also called, can indicate bank fraud, such as the case described in the introduction, as well as other security problems, structural defects, and even medical problems.

In medicine, applications of anomaly detection range from medical image analysis, such as examining X-rays for structural anomalies in bones, to the analysis of sensor data such as heart rate or glucose levels, where patients are alerted whenever their measurements indicate a potential disease.
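To make the sensor-data case concrete, here is a minimal sketch of a rolling-window detector. It compares each new reading with the median of the last few readings and raises an alert when the difference is much larger than the recent spread. The function name, the window size, the multiplier `k`, and the heart-rate numbers are all assumptions made up for illustration; they are not clinical thresholds.

```python
from collections import deque
import statistics

def rolling_alerts(readings, window=5, k=4.0, min_spread=1.0):
    """Yield (index, value) for readings far from the recent window's median."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            center = statistics.median(recent)
            # Median absolute deviation: a spread estimate that outliers barely move
            spread = statistics.median(abs(r - center) for r in recent)
            if abs(value - center) > k * max(spread, min_spread):
                yield i, value
                continue  # keep the outlier out of the baseline window
        recent.append(value)

# Hypothetical heart-rate readings in beats per minute
heart_rate = [72, 74, 71, 73, 75, 74, 72, 140, 73, 74, 71]
alerts = list(rolling_alerts(heart_rate))
print(alerts)  # [(7, 140)]
```

Keeping flagged readings out of the window stops a single spike from distorting the baseline, but it also means a lasting shift in the signal would keep raising alerts. Whether that is desirable depends on the application.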

Another interesting medical use case is predicting outbreaks of infectious diseases based on collective information from official data sources such as the Centers for Disease Control and Prevention or the World Health Organization. This is what happened on December 30, 2019, when BlueDot, a Toronto-based startup, alerted its private sector and government clients about a cluster of “unusual pneumonia” cases around a market in Wuhan, China. This turned out to be the first recognition of COVID-19, a couple of days before the WHO released its statement.

Anomaly detection is a general term for finding observations that are somehow different from ‘typical’ ones. There is no single technique that fits all cases; however, most of them share the same requirement of processing massive amounts of data. This is where Machine Learning (ML) comes into play, since ML methods allow us to automatically find interesting patterns in large data collections.

Intro to ML techniques for anomaly detection

Contrary to its simple definition, anomaly detection is generally not a trivial task. Ideally, we would like to feed all the data to our model and automatically get the observations that deviate from a typical pattern as outputs, but it is not that simple. The question remains: what is ‘typical’, and where do we set the boundary between normal and abnormal? Depending on the available data, we can divide Machine Learning approaches to anomaly detection into three main groups:

  • Supervised – labels of normal/anomalous data points are given. In this case we explicitly show our model what we consider normal and what we consider an outlier. In fact, it boils down to a typical classification problem. However, such an approach suffers from two main problems. First, obtaining labels is hard, if not impossible, as anomalous instances might not be known in advance. Second, we typically observe far more normal data points than anomalous ones, so the approach faces a class imbalance problem.
  • Semi-supervised – only normal instances are present in the data set. The goal here is to build a representational model of normal data and check whether incoming data can be explained by the model. A typical example technique is One-class SVM.
  • Unsupervised – both normal and anomalous instances are present in the data set, but no labels are given. Here we assume that normal instances are far more frequent than, and different from, abnormal ones. These methods include probability density function modelling, clustering, decision trees, and others, for example Isolation Forest, GMMs, K-means, and DBSCAN.
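The difference between the semi-supervised and unsupervised settings can be sketched in a few lines of standard-library Python. Note that this deliberately swaps in much simpler statistics than the methods named in the list: a mean/standard-deviation model stands in for One-class SVM, and a median/MAD score stands in for methods such as Isolation Forest. The function names, thresholds, and transaction amounts are hypothetical.

```python
import statistics

def fit_normal_model(normal_values):
    """Semi-supervised: learn mean and standard deviation from normal-only data."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def flag_with_model(model, values, z_threshold=3.0):
    """Flag values that the model of normal data cannot explain (|z| too large)."""
    mean, std = model
    return [abs(v - mean) / std > z_threshold for v in values]

def flag_unsupervised(values, threshold=3.5):
    """Unsupervised: no labels, so rely on median and MAD, which rare outliers barely move."""
    center = statistics.median(values)
    mad = statistics.median(abs(v - center) for v in values)
    # 0.6745 rescales the MAD so the score is comparable to a z-score for Gaussian data.
    # Assumes mad > 0, i.e. the values are not almost all identical.
    return [0.6745 * abs(v - center) / mad > threshold for v in values]

# Hypothetical transaction amounts
normal_history = [20.0, 22.5, 19.0, 21.0, 23.0, 18.5, 20.5, 21.5]
incoming = [20.0, 24.0, 950.0]

# Semi-supervised: train on known-normal history, then score incoming data
model = fit_normal_model(normal_history)
semi_flags = flag_with_model(model, incoming)
print(semi_flags)  # [False, False, True]

# Unsupervised: normal and anomalous points are mixed and unlabeled
mixed = normal_history + [950.0]
unsup_flags = flag_unsupervised(mixed)
print(unsup_flags.index(True))  # 8
```

In the unsupervised case a plain mean and standard deviation would be inflated by the anomaly itself, which is why the sketch uses the median instead. This only works under the assumption stated in the list: anomalies are rare compared to normal points.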

The choice of a specific method within each of these groups depends on the characteristics of the data and on how easily the model can be put into production. High-dimensional data can be projected into a lower-dimensional subspace where anomalies are more apparent and easier to detect. Neural networks (autoencoders) are often used for this purpose. On the other hand, for low-dimensional data and limited computational power in the target setting, simple linear models might be the way to go. Getting to know the nature of the data before training models is important, as it may save a lot of unnecessary training of complex models when simple ones work just as well and are orders of magnitude faster. Faster models also allow for much easier and more complete validation of the solution under multiple conditions. The term for familiarizing yourself with data through various statistics, correlation analysis, histograms, and many other visualizations is exploratory data analysis, and it is an inherent part of every machine learning project.
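The idea of projecting data into a lower-dimensional subspace can be illustrated with a linear stand-in for an autoencoder: principal component analysis (PCA) with a reconstruction error. The sketch below is not an autoencoder and is restricted to 2-D points projected onto a 1-D line so that it fits in plain Python; the function names and data points are hypothetical. A point is suspicious when it lies far from its own projection, even if each of its coordinates looks ordinary on its own.

```python
import math
import statistics

def principal_axis(points):
    """Return the data mean and the unit vector of largest variance (2-D PCA)."""
    xs, ys = zip(*points)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Closed-form angle of the first principal component of a 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def reconstruction_error(point, mean, axis):
    """Distance between a point and its projection onto the 1-D subspace."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    along = dx * axis[0] + dy * axis[1]          # coordinate inside the subspace
    rx, ry = along * axis[0], along * axis[1]    # reconstructed offset from the mean
    return math.hypot(dx - rx, dy - ry)

# Hypothetical normal data: two correlated measurements, roughly y = 2x
normal_points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1), (5.0, 9.9)]
mean, axis = principal_axis(normal_points)

typical_error = reconstruction_error((3.5, 7.0), mean, axis)  # follows the pattern
odd_error = reconstruction_error((4.0, 3.0), mean, axis)      # each value in range, pair is not

print(round(typical_error, 2), round(odd_error, 2))
```

The point (4.0, 3.0) has both coordinates inside the ranges seen in the normal data, yet its reconstruction error is far larger than that of any normal point, because it breaks the relationship between the two measurements. An autoencoder follows the same logic with a learned, nonlinear projection.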

None of these steps would be possible without a stable and trustworthy data collection pipeline. In an unsupervised or semi-supervised setting, non-anomalous data can be collected and explored first. Then, to train the model and validate it properly, anomalous data reflecting a wide range of non-normal conditions should be fed through the model. The process sometimes has to be repeated multiple times, for example to test the solution under different normal conditions. Automating the collection-exploration-training pipeline is crucial for drawing the right conclusions about the model's capabilities.
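A minimal sketch of such a train-then-validate loop is shown below, under simplifying assumptions: the "model" is just an upper alarm threshold learned from normal data, and each operating condition comes with a small labeled validation set. The condition names, readings, and labels are invented for illustration. Precision and recall are reported instead of accuracy, because with rare anomalies a detector that never raises an alarm would still score high on accuracy.

```python
import statistics

def fit_threshold(normal_values, k=3.0):
    """Learn an upper alarm threshold from normal-only data: mean + k * std."""
    return statistics.mean(normal_values) + k * statistics.stdev(normal_values)

def evaluate(values, labels, threshold):
    """Return (precision, recall); labels use True for a known anomaly."""
    flags = [v > threshold for v in values]
    tp = sum(f and l for f, l in zip(flags, labels))
    fp = sum(f and not l for f, l in zip(flags, labels))
    fn = sum(l and not f for f, l in zip(flags, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical sensor readings collected under two different normal conditions
conditions = {
    "idle": {
        "normal": [1.0, 1.2, 0.9, 1.1, 1.0, 0.8],
        "values": [1.0, 1.1, 5.0, 0.9, 4.0],
        "labels": [False, False, True, False, True],
    },
    "load": {
        "normal": [10.0, 10.5, 9.5, 10.2, 9.8, 10.0],
        "values": [10.1, 14.0, 9.9, 10.4, 10.9],
        "labels": [False, True, False, False, True],
    },
}

# Repeat the same fit-and-validate step for every condition
results = {}
for name, data in conditions.items():
    threshold = fit_threshold(data["normal"])
    results[name] = evaluate(data["values"], data["labels"], threshold)

print(results)  # {'idle': (1.0, 1.0), 'load': (1.0, 0.5)}
```

In this toy run the detector catches every anomaly in the "idle" condition but misses the subtle one under "load", which is exactly the kind of conclusion that only shows up when validation is repeated across conditions.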

To sum up, Machine Learning techniques can be highly useful for anomaly detection, a task with a broad range of practical applications. The choice of a specific approach depends mostly on the nature of the data we are dealing with, provided that we first build a reliable pipeline for data collection. In the next blog post of this series, we will see how ML techniques can be applied in the automotive and EdgeAI context.

 

References:

 

1. https://deepai.org/machine-learning-glossary-and-terms/anomaly-detection (Accessed: 3.02.2021)

2. Charu C. Aggarwal. 2016. Outlier Analysis (2nd. ed.). Springer Publishing Company, Incorporated.

3. A review on outlier/anomaly detection in time series data: https://arxiv.org/abs/2002.04236

4. Unsupervised Anomaly Detection for X-Ray Images https://arxiv.org/pdf/2001.10883.pdf

5. How Canadian AI start-up BlueDot spotted Coronavirus before anyone else had a clue. https://diginomica.com/how-canadian-ai-start-bluedot-spotted-coronavirus-anyone-else-had-clue (Accessed: 3.02.2021)

6. https://pixabay.com/pl/photos/czarna-owca-owiec-dane-dane-clay-3702973/

 
