I have another question about SAP PAL Anomaly Detection (AD), which in my case is installed on HANA version 1.
I have a column that holds ordinal data, encoded as Red = 1, Orange = 2 and Yellow = 3, so that a single variable captures the distances between the colours. However, the centres for this variable/column are reported as 0, which does not make sense given that I encoded it numerically. Could it be that AD filtered the variable out because of its characteristics?
In case this is specific to the AD algorithm: to my understanding, AD works like K-Means. Could I simply use a K-Means variant that determines the number of clusters itself as an alternative? I would of course have to calculate the distances to the cluster centres myself, but that is a topic for another thread.
Thanks in advance
Best regards
Nicholas
Hello,
Your column containing colors is normally called categorical data in PAL.
Normally it should be transformed into binary columns, such as (1, 0, 0) for Red, (0, 1, 0) for Orange and (0, 0, 1) for Yellow. The number of columns equals the number of categories, and each category sets exactly one column to 1 and leaves the rest at 0.
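As a rough illustration of that transformation, here is a minimal Python sketch. It is done outside of HANA and is not PAL code; the helper name `one_hot` and the sample rows are made up for the example.

```python
CATEGORIES = ["Red", "Orange", "Yellow"]  # one binary column per category

def one_hot(value, categories=CATEGORIES):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

rows = ["Red", "Yellow", "Orange"]  # hypothetical sample data
encoded = [one_hot(v) for v in rows]
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```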
Since AD does not support categorical values, the transformation has to be done manually.
Alternatively, you can use K-means instead, as you mentioned. But after you get the clusters you have to calculate the distances and identify the anomalous points yourself, for example with an SQL query.
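To make the "by hand" part concrete, the following is only a sketch in plain Python, not the PAL procedure. It assumes fixed starting centres, made-up sample points, and a hypothetical cut-off of one anomaly; in practice the centres would come from the K-means result table.

```python
import math

def nearest(point, centres):
    """Return (index, distance) of the centre closest to `point`."""
    dists = [math.dist(point, c) for c in centres]
    idx = min(range(len(centres)), key=dists.__getitem__)
    return idx, dists[idx]

def kmeans(points, centres, iterations=20):
    """Plain Lloyd iterations starting from the given centres."""
    centres = [tuple(c) for c in centres]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for p in points:
            idx, _ = nearest(p, centres)
            groups[idx].append(p)
        # Recompute each centre as the mean of its group;
        # keep the old centre if a cluster ends up empty.
        centres = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centres)
        ]
    return centres

points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 1.0)]  # the last point sits far from both groups
centres = kmeans(points, centres=[(1.0, 1.0), (5.0, 5.0)])

# Rank points by distance to their own centre; the largest are the candidates.
scored = sorted(points, key=lambda p: nearest(p, centres)[1], reverse=True)
anomalies = scored[:1]  # hypothetical cut-off: keep only the top 1
print(anomalies)  # [(9.0, 1.0)]
```

The same ranking could be expressed in SQL by joining each row to its assigned centre and ordering by the computed distance.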
Best Regards
Zee
Hello Zee again :),
What you are suggesting is something like one-hot encoding.
I think I need to rephrase my question. The variable is definitely categorical, but there is an ordering or hierarchy within it, for example Red, Orange, Yellow. Red and Yellow are further away from each other than Red and Orange. Therefore, I don't want to create three binary variables, which would not capture this similarity between two categories. Instead, I want to encode Red = 1, Orange = 2 and Yellow = 3. This way the variable reflects the appropriate distance between the colours.
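To show what I mean, here is a small Python sketch comparing the two encodings, assuming Euclidean distance is the metric in use. The helper names are only for illustration.

```python
import math

ordinal = {"Red": 1, "Orange": 2, "Yellow": 3}
one_hot = {"Red": (1, 0, 0), "Orange": (0, 1, 0), "Yellow": (0, 0, 1)}

def ordinal_distance(a, b):
    """Distance between two colours under the ordinal encoding."""
    return abs(ordinal[a] - ordinal[b])

def one_hot_distance(a, b):
    """Euclidean distance between two colours under one-hot encoding."""
    return math.dist(one_hot[a], one_hot[b])

for a, b in [("Red", "Orange"), ("Red", "Yellow")]:
    print(a, b, ordinal_distance(a, b), round(one_hot_distance(a, b), 3))
# Red Orange 1 1.414
# Red Yellow 2 1.414
```

With one-hot encoding every pair of colours is equally far apart, while the ordinal encoding keeps Red closer to Orange than to Yellow.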
Unfortunately, I could not get the anomaly detection to work with this idea, since the centres table gave me zeros throughout for this variable (colours in this case). Possibly it was excluded by the method; I am wondering whether there is a mathematical reason for its exclusion. I am going to try K-Means or G-Means in this case and see whether that improves things.
Best regards
Nicholas
Hello,
Well, the reason is simple.
AD does not recognize your ordering; it simply treats the column as a numerical variable.
K-means does not recognize your ordering either.
In this situation I suggest starting the encoding at 0, which still preserves your ordering and similarity.
Then, if the result is a float, assign it to the nearest category.
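A minimal Python sketch of that idea, outside of PAL: the names `LEVELS`, `encode` and `snap` are made up for the example, and the centre is just the mean of a few sample codes rather than a real AD or K-means result.

```python
LEVELS = ["Red", "Orange", "Yellow"]  # list index = zero-based ordinal code

def encode(colour):
    """Map a colour to its zero-based ordinal code."""
    return LEVELS.index(colour)

def snap(value):
    """Map a float centre coordinate back to the closest category."""
    idx = min(range(len(LEVELS)), key=lambda i: abs(i - value))
    return LEVELS[idx]

codes = [encode(c) for c in ["Red", "Orange", "Orange", "Yellow"]]
centre = sum(codes) / len(codes)  # stand-in for a cluster centre coordinate
print(centre, snap(centre))   # 1.0 Orange
print(snap(0.4), snap(1.7))   # Red Yellow
```

On an exact tie such as 0.5, this version picks the lower category; that is a choice of the sketch, not something the algorithm prescribes.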
Hope it helps.
Best Regards
Zee