
k-means multiple categorical columns for similarity analysis

Dec 21, 2016 at 11:04 AM


Hi,

I am trying to implement a machine learning requirement: we want to find similar incidents / support tickets in our database on the basis of attributes such as product category, priority, impact, code group, and other categorical and numerical attributes.

Based on the SAP HANA PAL documentation and my ML knowledge, I believe we can try the k-means clustering algorithm; however, we have many categorical columns. The SAP documentation says that weights can be assigned to categorical columns, but is it possible in HANA PAL to assign different weights to different categorical columns in the k-means input?

If it's not possible, which other clustering algorithm in PAL would meet the requirement?
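To illustrate what per-column weighting means for a Euclidean method such as k-means, here is a minimal sketch in plain Python, not PAL. It one-hot encodes each categorical column and scales that column's block by the square root of its weight, so a mismatch on a column contributes 2 × weight to the squared distance. The column names, levels, and weights are hypothetical, and whether pre-encoded input like this suits PAL is not confirmed in this thread.

```python
import math

def encode(row, categories, weights):
    """One-hot encode categorical values, scaling each column's block by sqrt(weight)."""
    vec = []
    for col, levels in categories.items():
        scale = math.sqrt(weights[col])
        vec.extend(scale if row[col] == level else 0.0 for level in levels)
    return vec

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical columns, levels, and weights (assumptions for illustration only)
categories = {
    "priority": ["low", "medium", "high"],
    "code_group": ["A", "B"],
}
weights = {"priority": 3.0, "code_group": 1.0}

t1 = encode({"priority": "high", "code_group": "A"}, categories, weights)
t2 = encode({"priority": "low", "code_group": "A"}, categories, weights)
t3 = encode({"priority": "high", "code_group": "B"}, categories, weights)

d_priority = sq_dist(t1, t2)    # tickets differ on priority only
d_code_group = sq_dist(t1, t3)  # tickets differ on code_group only
```

With these weights, a priority mismatch separates two tickets three times as much (in squared distance) as a code group mismatch.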

Thanks,

Hasan


3 Answers

Best Answer
Former Member
Jan 23, 2017 at 05:23 AM

Hi Hasan, currently we cannot set different weights for different categorical columns. For the second question you raised in your reply, PAL has a function called cluster assignment, which labels new data according to the cluster results of previously run clustering functions.
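As a generic illustration of what cluster assignment does conceptually (this is plain Python, not the PAL function or its interface), a new record can be labelled with the index of the nearest center from an earlier clustering run. The centers and points below are hypothetical.

```python
import math

def assign_cluster(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

# Hypothetical centers from a previous k-means run, and two new records
centers = [(0.0, 0.0), (10.0, 10.0)]
new_points = [(1.0, 2.0), (9.0, 8.5)]

labels = [assign_cluster(p, centers) for p in new_points]
```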

Best regards,

Xingtian

Antoine CHABERT
Dec 21, 2016 at 03:34 PM

Hi Hasan, what's your business question? Will you be using supervised or unsupervised clustering?

Thanks & regards,

Antoine

Hasan Rafiq
Dec 22, 2016 at 06:30 AM

Hi Antoine, it's good to see you again, my friend.

The business problem is to find similar notifications (and also rank them by similarity, i.e. closeness to the new input) in the system, created on the basis of failing material, return code from material, code group, reason code, etc.

However, I don't think this needs to be supervised learning, as clustering can be incorporated easily and dynamically without labelled data; in fact, the training data will keep changing as materials change with new and better solutions.

To my knowledge, the right solution would have been to place the whole training data in a Euclidean space (as k-means / DBSCAN do) and find the points closest to the new point (nearest neighbor search, NNS), but no such algorithm exists in PAL.

However, I am not sure this is possible with k-means, as it finds the cluster centers and the ranking would only be around a center (not around my new data point). Currently I think there's no choice other than using k-means or DBSCAN and ranking by distance from the cluster center only.
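The nearest neighbor idea above can be sketched as a brute-force scan in plain Python (not PAL). The sketch assumes a Gower-style distance: a weighted mismatch on each categorical column plus a range-scaled absolute difference on each numerical column. The record fields, the weights, and the `downtime_hours` attribute are hypothetical.

```python
def mixed_distance(a, b, cat_weights, num_ranges):
    """Weighted categorical mismatches plus range-scaled numerical differences."""
    d = 0.0
    for col, w in cat_weights.items():
        d += w * (a[col] != b[col])       # adds w when the categories differ
    for col, rng in num_ranges.items():
        d += abs(a[col] - b[col]) / rng   # scaled to roughly [0, 1]
    return d

def rank_similar(query, records, cat_weights, num_ranges, top_n=3):
    """Return (distance, id) pairs for the top_n records closest to query."""
    scored = [(mixed_distance(query, r, cat_weights, num_ranges), r["id"])
              for r in records]
    return sorted(scored)[:top_n]

# Hypothetical notifications and settings (assumptions for illustration only)
records = [
    {"id": 1, "code_group": "A", "reason_code": "R1", "downtime_hours": 10},
    {"id": 2, "code_group": "A", "reason_code": "R2", "downtime_hours": 12},
    {"id": 3, "code_group": "B", "reason_code": "R1", "downtime_hours": 40},
]
cat_weights = {"code_group": 2.0, "reason_code": 1.0}
num_ranges = {"downtime_hours": 30.0}  # max minus min over the data

query = {"code_group": "A", "reason_code": "R1", "downtime_hours": 11}
ranked = rank_similar(query, records, cat_weights, num_ranges)
```

Here the ranking is centred on the new notification itself rather than on a cluster center. The scan is linear in the number of records, so restricting it to the members of the query's cluster is one way to reduce the work on large tables.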

Thanks,

Hasan
