Solved: k-means multiple categorical columns for similarity analysis

former_member186543 · 12-21-2016

Hi,

I am trying to implement a machine learning requirement: we want to find similar incidents / support tickets in our database based on attributes such as product category, priority, impact, code group, and other categorical as well as numerical attributes.

Based on SAP HANA PAL and my own ML knowledge, I believe we can try the K-means clustering algorithm; however, we have many categorical columns. The SAP documentation says that weights can be assigned to categorical columns, but is it possible in HANA PAL to assign different weights to different categorical columns in the K-means input?
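To make the question concrete, here is a minimal sketch of what "different weights for different categorical columns" would mean for a distance measure. It is plain Python, not PAL; the column names, the weights, and the assumption that numeric columns are already scaled to a comparable range are all hypothetical.

```python
from math import sqrt

# Hypothetical column names and weights, chosen only for illustration.
CATEGORICAL_WEIGHTS = {"product_category": 2.0, "priority": 1.0, "impact": 0.5}
# Assumed to be scaled to a comparable range (e.g. 0..1) beforehand.
NUMERIC_COLUMNS = ["open_days_scaled"]


def weighted_distance(a, b):
    """Distance between two tickets given as dicts.

    Each categorical column adds its own weight when the two values differ
    (and 0 when they match); each numeric column adds a squared difference.
    """
    total = 0.0
    for column, weight in CATEGORICAL_WEIGHTS.items():
        if a[column] != b[column]:
            total += weight
    for column in NUMERIC_COLUMNS:
        total += (a[column] - b[column]) ** 2
    return sqrt(total)


t1 = {"product_category": "HANA", "priority": "high", "impact": "low", "open_days_scaled": 0.2}
t2 = {"product_category": "HANA", "priority": "low", "impact": "low", "open_days_scaled": 0.2}
t3 = {"product_category": "BW", "priority": "high", "impact": "low", "open_days_scaled": 0.2}

print(weighted_distance(t1, t2))  # tickets differ only in priority
print(weighted_distance(t1, t3))  # tickets differ only in product_category
```

With these illustrative weights, a mismatch in product category pushes two tickets further apart than a mismatch in priority, which is the behaviour the question is asking for.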

If that is not possible, which other clustering algorithm in PAL would meet the requirement?

Thanks,

Hasan

Former Member · 01-23-2017

Hi Hasan, currently we cannot set different weights for different categorical columns. As for the second question you raised in your reply, PAL has a function called cluster assignment, which labels new data according to the cluster results of previously run clustering functions.
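The idea behind cluster assignment can be sketched in a few lines of plain Python. This is not the PAL procedure or its signature (neither is shown in this thread); the centroids and the new record are made-up values, and the records are assumed to be numerically encoded already.

```python
from math import dist

# Hypothetical centroids from an earlier clustering run: cluster id -> center.
centroids = {0: (0.0, 0.0), 1: (5.0, 5.0), 2: (0.0, 8.0)}


def assign_cluster(point, centroids):
    """Return the id of the centroid closest to the point (Euclidean distance)."""
    return min(centroids, key=lambda cid: dist(point, centroids[cid]))


new_ticket = (4.0, 6.0)  # hypothetical encoded record that was not in the training data
print(assign_cluster(new_ticket, centroids))
```

The new record is labelled with an existing cluster without re-running the clustering itself.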

Best regards,

Xingtian

former_member186543 · 12-22-2016

Hi Antoine, it's good to see you again, my friend.

The business problem is to find similar notifications in the system (and also rank them by similarity, i.e. closeness to the new input) on the basis of failing material, return code from material, code group, reason code, etc.

However, I don't think this needs to be supervised learning: clustering can be applied easily and dynamically without labelled data, and in fact the training data will keep changing as materials change with new and better solutions.

As far as I know, the right solution would have been to place the whole training data in a Euclidean space (as in k-means / DBSCAN) and find the closest point to the new point (nearest neighbor search, NNS), but no such algorithm exists in PAL.
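For reference, the nearest neighbor search described above can be sketched as a brute-force ranking in plain Python. The notification ids and feature vectors are invented for illustration, and the categorical attributes are assumed to be numerically encoded (for example one-hot) before this step.

```python
from math import dist

# Hypothetical, already encoded notifications: id -> feature vector.
notifications = {
    "N1": (1.0, 0.0, 0.3),
    "N2": (0.0, 1.0, 0.9),
    "N3": (1.0, 0.0, 0.5),
    "N4": (0.0, 0.0, 0.1),
}


def rank_similar(query, records, top_k=3):
    """Return (id, distance) pairs for the top_k records closest to the query."""
    scored = [(rid, dist(query, vec)) for rid, vec in records.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_k]


new_notification = (1.0, 0.0, 0.35)
for rid, d in rank_similar(new_notification, notifications):
    print(rid, round(d, 3))
```

Here the ranking is centered on the new data point itself, not on a cluster center. A brute-force scan compares the query against every stored record, so it is only a sketch of the idea, not a statement about how this would scale on a large table.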

However, I am not sure whether this is possible with k-means, as it finds the center of each cluster and the ranking will only be around that center (not around my new data point). Currently I think there is no other choice than to use k-means or DBSCAN and rank using distances from the cluster center only.

Thanks,

Hasan

achab · 12-21-2016

Hi Hasan, what's your business question? Will you be using supervised or unsupervised clustering? Thanks & regards, Antoine
