[CSUR21] Outlier Detection: Methods, Models, and Classification (I) - Overview

Author: Steven Date: May 23, 2022 Updated On: May 27, 2022
Boukerche, Azzedine, Lining Zheng, and Omar Alfandi. “Outlier Detection: Methods, Models, and Classification.” ACM Computing Surveys 53, no. 3 (May 31, 2021): 1–37. https://doi.org/10.1145/3381028.

## 1. Outlier Detection

• Very high dimensionality
• High-speed, unbounded data streams
• Large-scale distributed settings

## 2. Definition

• (1) Outliers are different from the norm with respect to their features;
• (2) outliers are rare in a dataset compared to normal instances.

The terms "outliers" and "anomalies" are often used interchangeably, but there is a subtle conceptual difference between them:

In general, anomalies suggest a different underlying generative mechanism. In contrast, outliers tend to emphasize statistical rarity and deviation, and whether they are generated by a different mechanism is not straightforwardly addressed.

• swamping: mistakenly identifying normal instances as outliers
• masking: closely clustered outliers making themselves hard to be detected
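
Masking is easy to reproduce with a simple global statistic. The sketch below is only an illustration, not a method from the survey: it assumes a plain z-score detector with a threshold of 3, and the function name `zscore_outliers` and the data are made up for this example.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / std > threshold]

# Illustrative data (made up for this sketch): 20 normal readings around 10.
normal = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10] * 2

# A single extreme value stands out clearly.
print(zscore_outliers(normal + [50]))                  # [50]

# A tight cluster of extreme values shifts the mean and inflates the
# standard deviation, so none of them crosses the threshold: masking.
print(zscore_outliers(normal + [50, 51, 49, 50, 50]))  # []
```

In the second call the five extreme values pull the mean to 18 and the standard deviation to about 16, so their z-scores drop to roughly 2 and all of them go undetected.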

## 3. Classification

### 3.1 Types of Outliers

• point outliers: an individual data instance that deviates largely from the rest of the dataset.
• The detection of local outliers relies on the characteristic differences (e.g., the difference in neighborhood density) between the outlier and its nearest neighbors, whereas global outliers address the difference with the entire dataset.
• collective outliers: collection of data instances that appear anomalous with respect to the rest of the entire dataset. However, each instance within the collection may not constitute an outlier individually.
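
The local versus global distinction above can be made concrete with a small 1-D sketch. It rests on assumptions that are not in the survey: the "global" score is taken to be the distance to the k-th nearest neighbor, and the "local" score is a simplified density ratio (a point's k-th neighbor distance divided by the average k-th neighbor distance of its neighbors). This ratio is in the spirit of LOF but is not the actual LOF formula; all function names and data are hypothetical.

```python
def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (1-D, absolute distance)."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: abs(points[j] - points[i]))[:k]

def kth_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor."""
    return abs(points[knn(points, i, k)[-1]] - points[i])

def global_score(points, i, k):
    """Global view: a large k-th neighbor distance means 'far from everything'."""
    return kth_distance(points, i, k)

def local_score(points, i, k):
    """Local view: compare a point's k-th neighbor distance with that of its neighbors."""
    neighbors = knn(points, i, k)
    avg_neighbor_dist = sum(kth_distance(points, j, k) for j in neighbors) / k
    return kth_distance(points, i, k) / avg_neighbor_dist

# Made-up data: a dense cluster (0..4), a point just outside it (10),
# and a sparse cluster (100..180).
points = [0, 1, 2, 3, 4, 10, 100, 120, 140, 160, 180]
k = 2
indices = range(len(points))

top_global = max(indices, key=lambda i: global_score(points, i, k))
top_local = max(indices, key=lambda i: local_score(points, i, k))
print(points[top_global], points[top_local])  # 100 10
```

The global score ranks a member of the sparse cluster highest, because absolute distances there are large. The local ratio instead singles out the point 10, which is unremarkable in absolute terms but far less densely surrounded than its own neighbors.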

### 3.2 Classification of Outlier Detection Methods

• supervised (binary classification with imbalanced training data)
• semi-supervised (only normal labels or a majority of unlabeled data and small amount of labeled data)
• unsupervised (the focus of this survey)

• Advanced approaches are developed upon the fundamental ones, to address new challenges
• New challenges include high-dimensional data (“curse of dimensionality”), unbounded and dynamic data streams, big data in a distributed setting, and effective usage of very limited labels
• Proximity-based approaches rely on nearest-neighbor-based techniques or clustering algorithms to quantify an outlier’s proximity to nearby data points; see the second note in this series
• Projection-based methods adopt techniques such as LSH and space-filling curves, to convert the original data into a new space/structure with reduced complexity, where the outlier scores are defined based on the characteristics of new space; see the third note in this series
• “distance-based” vs “density-based”: the two concepts overlap, which makes the taxonomy somewhat messy, since computing a density usually depends on computing distances. A typical density-style definition is "a point is an outlier if it has fewer than a given number of neighbors within a specified radius". The paper therefore refers to both as nearest-neighbor-based.
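
The radius-and-count definition quoted above can be sketched in a few lines. This is a brute-force illustration under assumed parameters, not the survey's algorithm: the function name `radius_outliers`, the `radius` and `min_neighbors` values, and the data are all hypothetical.

```python
import math

def radius_outliers(points, radius, min_neighbors):
    """Flag points that have fewer than min_neighbors other points within radius."""
    outliers = []
    for i, p in enumerate(points):
        # Count neighbors by measuring the distance to every other point.
        count = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if count < min_neighbors:
            outliers.append(p)
    return outliers

# Made-up 2-D data: a small cluster near the origin and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
print(radius_outliers(data, radius=1.5, min_neighbors=2))  # [(5, 5)]
```

The neighbor count within a fixed radius acts as a crude density estimate, yet it is obtained purely from pairwise distances, which is why the two labels are hard to separate. The double loop is quadratic in the number of points and is meant only to show the definition.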
