k-means multiple categorical columns for similarity analysis

Hi,

I am trying to implement a machine learning requirement: finding similar incidents / support tickets in our database based on attributes such as product category, priority, impact, and code group, along with other categorical and numerical attributes.

Based on the SAP HANA PAL documentation and my ML knowledge, I believe we can try the k-means clustering algorithm; however, we have many categorical columns. The SAP documentation says that a weight can be assigned to categorical columns, but is it possible in HANA PAL to assign different weights to different categorical columns in the k-means input?

If that is not possible, which other PAL clustering algorithm would meet the requirement?
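
For illustration only, and outside PAL: a common workaround when a k-means implementation offers a single weight for all categorical columns is to one-hot encode each column yourself and scale it before clustering. The sketch below is plain Python; the column names, levels, and weights are hypothetical. It scales each column so that a mismatch in that column adds its weight to the squared Euclidean distance. Whether pre-encoded input behaves this way inside PAL's k-means is an assumption that would need to be verified.

```python
import math

# Hypothetical per-column weights and category levels (illustrative only).
WEIGHTS = {"product_category": 3.0, "priority": 1.0}
CATEGORIES = {
    "product_category": ["hardware", "software"],
    "priority": ["low", "high"],
}

def encode(row, categories, weights):
    """One-hot encode each categorical column, scaled so that a mismatch
    in column c adds weights[c] to the squared Euclidean distance."""
    vec = []
    for col, levels in categories.items():
        # A mismatch flips two one-hot positions, hence the division by 2.
        scale = math.sqrt(weights[col] / 2.0)
        vec.extend(scale if row[col] == level else 0.0 for level in levels)
    return vec

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = encode({"product_category": "hardware", "priority": "low"}, CATEGORIES, WEIGHTS)
b = encode({"product_category": "software", "priority": "low"}, CATEGORIES, WEIGHTS)
c = encode({"product_category": "hardware", "priority": "high"}, CATEGORIES, WEIGHTS)

# a and b differ only in product_category; a and c differ only in priority.
print(sq_dist(a, b), sq_dist(a, c))
```

With these weights, a product category mismatch counts three times as much as a priority mismatch in the distance that k-means minimises.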

Thanks,

Hasan

3 Answers

  • Best Answer
    Former Member
    Jan 23, 2017 at 05:23 AM

    Hi Hasan, currently it is not possible to set different weights for different categorical columns. Regarding the second question you raised in your reply: PAL has a function called cluster assignment, which labels new data according to the results of a previously run clustering function.
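
Conceptually, assigning a new record to an existing k-means result amounts to finding the nearest cluster center. The sketch below shows that idea in plain Python. It is not PAL's cluster assignment API, and the centroids and the new ticket are made-up numeric vectors.

```python
def assign_cluster(point, centroids):
    """Return the index of the centroid nearest to point (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

# Hypothetical centroids from an earlier clustering run.
centroids = [[0.0, 0.0], [10.0, 10.0]]
new_ticket = [8.0, 9.5]

label = assign_cluster(new_ticket, centroids)
print(label)
```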

    Best regards,

    Xingtian

  • Dec 21, 2016 at 03:34 PM

    Hi Hasan, what's your business question? Will you be using supervised or unsupervised learning? Thanks & regards, Antoine

  • Dec 22, 2016 at 06:30 AM

    Hi Antoine, it's good to see you again, my friend.

    The business problem is to find similar notifications in the system (and also rank them by similarity, i.e. closeness to the new input), based on attributes such as failing material, return code from the material, code group, reason code, etc.

    However, I don't think this needs to be supervised learning: clustering can be incorporated easily and dynamically without labelled data, and in fact the training data will keep changing as materials change with new and better solutions.

    As far as I know, the right solution would be to place the whole training data in a Euclidean space (as k-means / DBSCAN do) and find the points closest to the new point (nearest neighbor search, NNS), but no such algorithm exists in PAL.
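
As a rough sketch of the nearest neighbor idea described above, the plain Python below ranks stored records by a weighted, Gower-style mixed distance to a new record: 0/1 mismatch for categorical attributes and range-normalised absolute difference for numeric ones. The attribute names, weights, value range, and data are all hypothetical, and this is a brute-force illustration rather than anything PAL provides.

```python
def mixed_distance(a, b, numeric_ranges, weights):
    """Weighted Gower-style distance between two records (dicts)."""
    total = 0.0
    for attr, w in weights.items():
        if attr in numeric_ranges:
            lo, hi = numeric_ranges[attr]
            # Numeric attribute: absolute difference scaled to [0, 1].
            d = abs(a[attr] - b[attr]) / (hi - lo) if hi > lo else 0.0
        else:
            # Categorical attribute: 0 if equal, 1 otherwise.
            d = 0.0 if a[attr] == b[attr] else 1.0
        total += w * d
    return total / sum(weights.values())

def rank_similar(query, records, numeric_ranges, weights, top_n=3):
    """Return (distance, id) pairs for the top_n records closest to query."""
    scored = [(mixed_distance(query, r, numeric_ranges, weights), r["id"])
              for r in records]
    return sorted(scored)[:top_n]

# Hypothetical notifications and settings.
records = [
    {"id": 1, "code_group": "A", "reason_code": "R1", "return_qty": 10},
    {"id": 2, "code_group": "A", "reason_code": "R2", "return_qty": 12},
    {"id": 3, "code_group": "B", "reason_code": "R1", "return_qty": 50},
]
weights = {"code_group": 2.0, "reason_code": 1.0, "return_qty": 1.0}
numeric_ranges = {"return_qty": (0, 100)}
query = {"id": None, "code_group": "A", "reason_code": "R1", "return_qty": 11}

ranking = rank_similar(query, records, numeric_ranges, weights)
print(ranking)
```

Because each attribute carries its own weight, this kind of distance also allows different categorical columns to count differently, at the cost of comparing the new record against every stored one.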

    However, I am not sure whether this is possible with k-means, as it finds the center of each cluster and the ranking will only be around that center (not around my new data point). Currently I think there is no choice other than using k-means or DBSCAN and ranking by distance from the cluster center only.
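
One possible middle ground, sketched here under the assumption that the cluster labels, the centroids, and the encoded points can all be read back from the clustering run: use the nearest centroid only to select a cluster, then rank that cluster's members by their distance to the new point itself. The data below is made up, and the function name is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rank_within_cluster(query, points, labels, centroids, top_n=5):
    """Pick the query's nearest centroid, then rank that cluster's members
    by distance to the query (not to the centroid)."""
    cluster = min(range(len(centroids)),
                  key=lambda i: sq_dist(query, centroids[i]))
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    return sorted(members, key=lambda i: sq_dist(query, points[i]))[:top_n]

# Hypothetical encoded records, their cluster labels, and the centroids.
points = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [9.0, 9.0], [10.0, 10.0]]
labels = [0, 0, 0, 1, 1]
centroids = [[1.0, 1.0 / 3.0], [9.5, 9.5]]
query = [2.0, 0.5]

order = rank_within_cluster(query, points, labels, centroids)
print(order)
```

Here the record closest to the new point is index 2, whereas ranking by distance to the cluster center would put index 1 first. A caveat of this shortcut is that the true nearest neighbor can sit in a neighbouring cluster when the new point lies near a cluster boundary.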

    Thanks,

    Hasan
