SAP HANA Anomaly function - PAL

Hi SAP PA team,

I have been trying to model anomaly detection in SAP HANA, and I found that SAP HANA PAL provides a function, "ANOMALY", which according to its documentation uses K-Means with a distance function (it identifies the points furthest from the centroid as outliers).

However, consider a data set whose shape is closer to a wide, flat ellipse (high width, low height). PAL's ANOMALY function is likely to fail in this scenario.

The true anomaly is the red point, which lies close to the centroid X; however, the yellow point will be detected as the anomaly since it is further from the centroid:


In this case a non-anomalous point (high probability) may lie far from the centroid, while an anomalous point (low probability) may appear very close to it. Hence, for anomaly detection, we should prefer a probability-based model such as the GMM (Gaussian mixture model) function over the ANOMALY function.
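To make the scenario concrete, here is a minimal sketch in plain Python (not PAL code). It assumes a single Gaussian fitted to synthetic data instead of a full GMM, and the points `yellow` and `red` are hypothetical stand-ins for the two points described above. It compares the Euclidean distance to the centroid with the log-density under the fitted Gaussian.

```python
import math
import random

def fit_gaussian_2d(points):
    """Estimate the mean and the 2x2 covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)

def euclidean(p, mean):
    """Plain distance to the centroid, as a distance-based detector would use."""
    return math.hypot(p[0] - mean[0], p[1] - mean[1])

def log_density(p, mean, cov):
    """Log of the 2-D Gaussian density; lower means less probable."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    m2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * m2 - math.log(2 * math.pi) - 0.5 * math.log(det)

# Synthetic wide, flat ellipse: large spread in x, small spread in y (assumed values).
rng = random.Random(0)
data = [(rng.gauss(0, 10), rng.gauss(0, 0.5)) for _ in range(500)]
mean, cov = fit_gaussian_2d(data)

yellow = (15.0, 0.0)  # far from the centroid, but along the long axis (normal)
red = (0.0, 4.0)      # close to the centroid, but off the short axis (anomalous)

for name, p in (("yellow", yellow), ("red", red)):
    print(name, "distance:", round(euclidean(p, mean), 2),
          "log-density:", round(log_density(p, mean, cov), 2))
```

Under these assumptions the yellow point has the larger distance but the higher density, so a pure distance ranking would flag it first, while a density ranking would flag the red point.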

We would like an expert from the PA team (@Jayanta Roy Orla Cullen) to shed more light on using the "ANOMALY" function in this scenario, and on why K-Means with distance was implemented as the default algorithm for identifying anomalies rather than a probability-based approach.

Thanks,

Hasan Rafiq


1 Answer

  • Best Answer
    Former Member
    Sep 06, 2016 at 07:41 AM

    Hi Hasan,

    Happy to see your question, and hope my answer will help.

    There are several ways to categorize anomaly detection:

    1. Supervised Anomaly Detection
    2. Semi Supervised Anomaly Detection
    3. Unsupervised Anomaly Detection.

    Supervised Anomaly Detection: Here the assumption is that, in a controlled setup, an expert has correctly labelled the anomalies, so a model can be built to identify anomalies in fresh data. For many applications the anomalies are not known in advance, so this approach has limited applicability.

    Semi Supervised Anomaly Detection: Here the basic idea is to train the model using normal data without anomalies. The model learns the normal behaviour, and an anomaly in new data is defined by how much it deviates from that model. Density estimation methods like GMM or Kernel Density Estimation fall into this category.

    Unsupervised Anomaly Detection: These techniques do not require any labels. The idea is that the algorithm identifies anomalies solely based on intrinsic properties of the dataset. Typically distances or densities are used to give some estimate of the normal patterns in the data, and anything that does not comply with these is termed an anomaly.
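The semi-supervised idea can be sketched as follows. This is only an illustration, assuming a single 1-D Gaussian in place of a GMM or kernel density estimate. The training values, the 1% cut-off, and the helper names `log_pdf` and `is_anomaly` are all hypothetical choices.

```python
import math
import random
from statistics import NormalDist

# Training data is assumed to contain only normal behaviour (synthetic here).
rng = random.Random(1)
normal_train = [rng.gauss(50.0, 5.0) for _ in range(1000)]

# "Learn" normal behaviour: fit a Gaussian to the clean training data.
model = NormalDist.from_samples(normal_train)

def log_pdf(x):
    """Log-density under the fitted Gaussian (computed directly to avoid log(0))."""
    z = (x - model.mean) / model.stdev
    return -0.5 * z * z - math.log(model.stdev * math.sqrt(2 * math.pi))

# Threshold: the log-density below which the lowest 1% of training points fall.
train_scores = sorted(log_pdf(x) for x in normal_train)
threshold = train_scores[int(0.01 * len(train_scores))]

def is_anomaly(x):
    """A new point is anomalous if it is less probable than the threshold."""
    return log_pdf(x) < threshold

print(is_anomaly(51.0), is_anomaly(90.0))
```

The threshold is the main design choice here: it sets how much deviation from the learned model is tolerated before a point is called an anomaly.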
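A distance-based unsupervised approach of the kind described in the question (K-Means, then treating the furthest points as outliers) could look roughly like this. It is a plain-Python sketch, not PAL's implementation. The data, the simplistic "first k points" initialisation, and the function names are assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, iters=20):
    """Very small k-means; initial centroids are simply the first k points."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean; keep it if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids

def farthest_points(points, centroids, top_n):
    """Return the top_n points with the largest distance to their nearest centroid."""
    def score(p):
        return min(math.dist(p, c) for c in centroids)
    return sorted(points, key=score, reverse=True)[:top_n]

# Two tight synthetic clusters, interleaved, plus one hypothetical outlier.
rng = random.Random(2)
data = []
for _ in range(100):
    data.append((rng.gauss(0, 0.5), rng.gauss(0, 0.5)))
    data.append((rng.gauss(10, 0.5), rng.gauss(10, 0.5)))
outlier = (5.0, 30.0)
data.append(outlier)

centroids = kmeans(data, k=2)
flagged = farthest_points(data, centroids, top_n=1)
print(flagged)
```

No labels are used anywhere. The only inputs are the number of clusters and how many of the furthest points to report, which is why this approach works well for compact, roughly round clusters and less well for elongated ones.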

    Now, coming to your question: there is no one preferred way. The technique a data scientist picks will depend on the context of the workflow and the business problem. Yes, PAL anomaly detection uses K-Means, but there are other clustering algorithms with different proximity measures that a data scientist could use, depending on how he/she wants to identify anomalies and how the results are going to be used.

    As you propose, GMM is also one of the techniques that could be used. For now, if that is what your business problem requires, it is possible to include it in Expert Analytics via R extensions. Having said that, regarding your suggestion to have GMM-based anomaly detection, I would propose that you submit this idea to the Idea Place: SAP BusinessObjects Predictive Analytics: Home.

    Cheers

    Paul
