SAP HANA Anomaly function - PAL

former_member186543
Active Contributor
0 Kudos

Hi SAP PA team,

I have been working on modelling anomaly detection in SAP HANA and found that SAP HANA PAL provides an "ANOMALY" function which, as per its documentation, uses K-Means with a distance function (it identifies the furthest points as outliers).

However, consider a dataset whose points fit an ellipse that is wide along one axis and narrow along the other. PAL's ANOMALY function is likely to fail in this scenario.

The true anomaly (red) lies closer to the centroid X, yet the yellow point will be detected as the anomaly since it is further from the centroid:


In this case, non-anomalous examples may lie far from the centroid yet have high probability under the data distribution, while an anomalous example with low probability can sit very close to the centroid. For anomaly detection we should therefore prefer a probability-based model such as a GMM (Gaussian mixture model) over the ANOMALY function.
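
To make this concrete, here is a rough sketch (plain Python with scikit-learn and made-up data, not PAL code) of the picture above: on a wide, flat cluster the distance-to-centroid view ranks the yellow point as the stronger anomaly, while a single full-covariance Gaussian assigns the red point the much lower likelihood.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Elongated ("wide and flat") cluster: large variance on x, small on y.
    normal = rng.multivariate_normal(mean=[0, 0], cov=[[25, 0], [0, 0.25]], size=500)

    yellow = np.array([[14.0, 0.0]])   # far along the wide axis, but consistent with the cluster
    red    = np.array([[0.0, 3.0]])    # close to the centroid, but far off the narrow axis

    X = np.vstack([normal, yellow, red])

    # "K-means with distance" view: rank by Euclidean distance from the centroid.
    centroid = X.mean(axis=0)
    print("distance  yellow:", np.linalg.norm(yellow - centroid),
          "red:", np.linalg.norm(red - centroid))

    # Probability view: fit one full-covariance Gaussian and compare log-likelihoods.
    gmm = GaussianMixture(n_components=1, covariance_type="full").fit(X)
    print("log-lik   yellow:", gmm.score_samples(yellow)[0],
          "red:", gmm.score_samples(red)[0])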

We would like an expert from the PA team (Orla Cullen) to shed more light on using the "ANOMALY" function in this scenario, and on why K-means with distance was implemented as the default algorithm for identifying anomalies rather than probability of presence.

Thanks,

Hasan Rafiq

Accepted Solutions (1)

Former Member
0 Kudos

Hi Hasan,

Happy to see your question, and hope my answer will help.

There are several ways to categorize anomaly detection:

  1. Supervised Anomaly Detection
  2. Semi Supervised Anomaly Detection
  3. Unsupervised Anomaly Detection.

Supervised Anomaly Detection: Here the assumption is that, in a controlled setup, an expert has correctly labelled the anomalies, so a model can be built to identify anomalies in fresh data. For many applications the anomalies are not known in advance, hence this approach has limited applicability.

Semi Supervised Anomaly Detection: Here the basic idea is to train a model using normal data without anomalies. The model learns the normal behaviour, and anomalies in new data are defined based on how much they deviate from that model. Density estimation methods like GMM or Kernel Density Estimation fall into this category.

Unsupervised Anomaly Detection: These techniques do not require any labels. The idea is that the algorithm identifies anomalies solely based on intrinsic properties of the dataset. Typically, distances or densities are used to estimate the normal patterns in the data, and anything that does not conform to them is termed an anomaly.
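
To illustrate the semi-supervised case with a minimal sketch (my own synthetic example, not PAL code): fit a density model on normal data only, then flag new points whose log-density falls below a threshold taken from the training scores.

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(1)

    # Training data assumed to contain only normal behaviour.
    normal_train = rng.normal(loc=[0, 0], scale=[5.0, 0.5], size=(1000, 2))

    # Kernel Density Estimation of the normal behaviour.
    kde = KernelDensity(bandwidth=0.5).fit(normal_train)

    # Threshold: e.g. the 1st percentile of the training log-densities.
    threshold = np.percentile(kde.score_samples(normal_train), 1)

    new_points = np.array([[8.0, 0.0],    # far out on the wide axis, still plausible
                           [0.0, 3.0]])   # near the centroid, but off the narrow axis
    print(kde.score_samples(new_points) < threshold)   # the second point should be flagged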

Now coming to your question: there is no single preferred way; the technique a data scientist picks will depend on the context of the workflow and the business problem. Yes, PAL anomaly detection uses K-means, but there are other clustering algorithms with different proximity measures that the data scientist could use to define how he/she wants to identify anomalies, also depending on how they are going to be used.

As you propose, GMM is also one of the techniques that could be used. For now, if that is what you need for the business problem you are working on, it is possible to include it in Expert Analytics via R-Extensions. Having said that, regarding your suggestion to have GMM-based anomaly detection, I would propose you submit this idea on the Idea Place: SAP BusinessObjects Predictive Analytics: Home.

Cheers

Paul

Henry_Banks
Product and Topic Expert
0 Kudos

Very helpful for my own understanding, Paul, thank you.

former_member186543
Active Contributor
0 Kudos

Hi Paul,

Thanks a lot for your response !

Basically I was a little bit confused as to why K-means with distance has been given as the default implementation for "anomaly" detection.

The scenario I have quoted above is quite a common one, and given practical machine learning experience, we expected that anomalies would be detected on the basis of probability.

So giving priority to distance over probability is the point that confused us. In other words, if we run the ANOMALY function on such a training set, we are likely to achieve a very low F1 score, or, to be precise, very high false positives and false negatives on the cross-validation set.
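
Just to illustrate what I mean by a low F1 score (with hypothetical labels, not actual PAL output): when false positives and false negatives are both high, precision and recall drop, and so does their harmonic mean.

    from sklearn.metrics import f1_score, precision_score, recall_score

    y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]   # 1 = true anomaly
    y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]   # flagged points: 1 TP, 3 FP, 2 FN

    print(precision_score(y_true, y_pred))    # 1 / (1 + 3)  = 0.25
    print(recall_score(y_true, y_pred))       # 1 / (1 + 2) ~= 0.33
    print(f1_score(y_true, y_pred))           # harmonic mean ~= 0.29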

Just to add: As per your suggestion, I have submitted this as an idea: GMM based Anomaly detection algorithm : View Idea

Thanks,

Hasan Rafiq

Former Member
0 Kudos

  Keep up your contributions in the area of Data Science and keep sending your feedback about the product.

achab
Product and Topic Expert
0 Kudos

Hi Hasan Rafiq, thanks for the question. Can you please give credit to Paul for the great answer, and flag the post as "Answered"?

former_member186543
Active Contributor
0 Kudos

Hi Antoine,

Indeed I am overwhelmed by and thankful for Paul's great response, but somehow my question is still unanswered.

I understand from the answer that in all three methods (supervised, semi-supervised, unsupervised) density (probability) can be used as an estimate, whereas K-means distance applies only to the unsupervised case. But when we as developers and data scientists started to apply the "ANOMALY" function in the PAL package, our expectation, based on the usual definition of the anomaly detection problem, was that it would be probability based.

I am afraid anyone with prior experience in machine learning would conclude that K-means with distance should not have been provided as the default implementation for the "ANOMALY" function.

So the question is: what was the reason for choosing this approach as the default implementation?

We just want to be completely aligned with how SAP approached this logic.

Thanks a lot to you guys for helping us throughout!

Thanks,

Hasan

achab
Product and Topic Expert
0 Kudos

Hi Hasan,

Thanks for clarifying further. I am looping in SAP colleagues from the PAL team to comment on your question.

Best regards,

Antoine

Former Member
0 Kudos

Hi Hasan,

As Paul mentioned, there is no single preferred way to do outlier detection. From my point of view, a distance-based approach like k-means tends to form spherical clusters, which doesn't seem to fit your original data. Though GMM works on your dataset, it assumes that the data are generated from a mixture of normal distributions, which is not universally true. When data don't follow a normal distribution, we might need to use a density-based algorithm like DBSCAN. In many real cases there are categorical variables, and in those cases GMM might not be a good choice either, due to the distribution assumption.
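
As a quick sketch of the density-based option (synthetic data, not a PAL call): DBSCAN labels points that fall in low-density regions as noise (cluster label -1), without assuming any particular distribution.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(2)

    # A ring-shaped cluster (clearly non-Gaussian), a dense blob, and two isolated points.
    angles = rng.uniform(0, 2 * np.pi, 300)
    ring = np.c_[np.cos(angles), np.sin(angles)] * 5 + rng.normal(0, 0.2, (300, 2))
    blob = rng.normal([0, 0], 0.3, (100, 2))
    isolated = np.array([[10.0, 10.0], [-9.0, 8.0]])

    X = np.vstack([ring, blob, isolated])

    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
    print("points labelled as noise / anomalies:", (labels == -1).sum())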

From my experience in machine learning, there is no clear evidence that GMM is better than k-means. Again we need to choose the right algorithm according to the data and the problem.

BTW, GMM is included in PAL as well. You are welcome to use it.

Best regards,

Xingtian

former_member186543
Active Contributor
0 Kudos

Hi Xingtian,

Thanks a lot for your response !

I understand it better now and will use GMM from PAL for our scenario.

/Hasan

Answers (0)