k-means multiple categorical columns for similarity analysis

former_member186543
Active Contributor
0 Kudos

Hi,

I am trying to implement a machine learning requirement: we want to find similar incidents / support tickets in our database on the basis of attributes such as product category, priority, impact, code group, and other categorical as well as numerical attributes.

Based on the SAP HANA PAL documentation and my ML knowledge, I believe we can try the k-means clustering algorithm; however, we have many categorical columns. The SAP documentation says that a weight can be assigned to categorical columns, but is it possible in HANA PAL to assign different weights to different categorical columns in the k-means input?
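To make the idea of per-column weights concrete, here is a minimal sketch in plain Python. It is not a PAL feature or PAL API; it is a generic preprocessing workaround that assumes the categorical columns can be one-hot encoded into numeric columns before clustering. The column names, levels, and weights are hypothetical. Scaling each column's one-hot block by sqrt(weight / 2) makes a mismatch in that column contribute exactly `weight` to the squared Euclidean distance that k-means uses.

```python
import math

def encode_row(row, categories, weights):
    """One-hot encode a row, scaling each column's block by sqrt(weight / 2)."""
    vector = []
    for column, levels in categories.items():
        scale = math.sqrt(weights[column] / 2.0)
        for level in levels:
            vector.append(scale if row[column] == level else 0.0)
    return vector

def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical columns, levels, and weights (assumptions for illustration only).
categories = {
    "priority": ["low", "medium", "high"],
    "product_category": ["billing", "logistics"],
}
weights = {"priority": 1.0, "product_category": 4.0}

ticket_a = {"priority": "high", "product_category": "billing"}
ticket_b = {"priority": "low", "product_category": "billing"}     # differs in priority
ticket_c = {"priority": "high", "product_category": "logistics"}  # differs in category

vec_a = encode_row(ticket_a, categories, weights)
d_ab = squared_distance(vec_a, encode_row(ticket_b, categories, weights))
d_ac = squared_distance(vec_a, encode_row(ticket_c, categories, weights))
print(d_ab, d_ac)  # the product_category mismatch costs about four times as much
```

Whether the encoded columns can then be passed to a given clustering function as ordinary numeric input is something to verify against that function's documentation.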

If that is not possible, which other clustering algorithm in PAL would meet the requirement?

Thanks,

Hasan

Accepted Solutions (1)

Former Member
0 Kudos

Hi Hasan, currently we cannot set different weights for different categorical columns. As for the second question you raised in your reply, PAL has a function called cluster assignment, which labels new data according to the cluster results of previously run clustering functions.
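As a rough illustration of what labelling new data against an existing clustering result means for k-means-style output, here is a minimal plain-Python sketch. It is not the PAL procedure and does not use any PAL API; the centroids and the new data point are hypothetical, and it assumes purely numeric, already scaled features.

```python
import math

def assign_cluster(point, centroids):
    """Return (cluster_id, distance) for the centroid closest to point."""
    best_id, best_dist = None, math.inf
    for cluster_id, centroid in centroids.items():
        dist = math.dist(point, centroid)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id, best_dist

# Hypothetical centroids from an earlier clustering run.
centroids = {0: (1.0, 1.0), 1: (8.0, 9.0)}
new_ticket = (7.0, 8.5)

cluster_id, distance = assign_cluster(new_ticket, centroids)
print(cluster_id, distance)
```

The new record only receives a cluster label and a distance to that cluster's center; it is not compared against individual historical records.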

Best regards,

Xingtian

Answers (2)

former_member186543
Active Contributor
0 Kudos

Hi Antoine, it's good to see you again, my friend.

The business problem is to find similar notifications in the system (and also rank them by similarity, i.e. closeness to the new input), based on attributes such as failing material, return code from material, code group, reason code, etc.

However, I don't think this needs to be supervised learning, as clustering can be applied easily and dynamically without labelled data; in fact, the training data will keep changing as materials change with new and better solutions.

As far as I know, the right solution would be to place the whole training data in a Euclidean space (as k-means / DBSCAN do) and find the points closest to the new point (nearest neighbor search, NNS), but no such algorithm exists in PAL.

However, I am not sure this is possible with k-means, as it finds the center of each cluster, so the ranking will only be relative to that center (not to my new data point). Currently I think there is no choice other than using k-means or DBSCAN and ranking by distance from the cluster center only.
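For comparison, ranking around the new data point itself rather than around a cluster center can be sketched as a brute-force nearest neighbor search. This is plain Python, not a PAL function; the column names, weights, and records are hypothetical, the numeric column is assumed to be scaled to [0, 1], and the distance (squared numeric difference plus a weighted penalty per categorical mismatch) is just one simple choice among many.

```python
def mixed_distance(a, b, numeric_cols, categorical_weights):
    """Squared numeric differences plus a weighted penalty per categorical mismatch."""
    total = 0.0
    for col in numeric_cols:
        total += (a[col] - b[col]) ** 2
    for col, weight in categorical_weights.items():
        if a[col] != b[col]:
            total += weight
    return total

def rank_similar(new_row, rows, numeric_cols, categorical_weights, top_n=3):
    """Return up to top_n (distance, id) pairs, closest first."""
    scored = [
        (mixed_distance(new_row, row, numeric_cols, categorical_weights), row["id"])
        for row in rows
    ]
    scored.sort()
    return scored[:top_n]

# Hypothetical historical notifications.
notifications = [
    {"id": 101, "code_group": "MECH", "reason_code": "R1", "impact": 0.2},
    {"id": 102, "code_group": "ELEC", "reason_code": "R1", "impact": 0.9},
    {"id": 103, "code_group": "MECH", "reason_code": "R2", "impact": 0.3},
]
new_notification = {"code_group": "MECH", "reason_code": "R1", "impact": 0.25}

ranked = rank_similar(
    new_notification,
    notifications,
    numeric_cols=["impact"],
    categorical_weights={"code_group": 2.0, "reason_code": 1.0},
)
for distance, notification_id in ranked:
    print(notification_id, round(distance, 4))
```

A brute-force scan like this compares the new record with every stored record, so for large tables one would typically first restrict the candidates (for example to the records in the assigned cluster) and rank only within that subset.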

Thanks,

Hasan

achab
Product and Topic Expert
0 Kudos

Hi Hasan, what's your business question? Will you be using supervised or unsupervised learning?

Thanks & regards,

Antoine