HANA Text Analysis : Create variant with pattern & symbols

somnath
Active Participant
0 Kudos

Hello,

I am very new to HANA text analysis, and my scenario requires a custom dictionary and custom configurations. I can manage the $TA data, but I am stuck on the two scenarios below.

Could someone please advise?

Scenario 1- I have a string like: MB, <45W0

In my custom dictionary, I want to create the element as:

<entity_name standard_form="MB<45W">
     <variant name="MB, <45W0"/>
</entity_name>

The dictionary rejects the '<' symbol but, interestingly, accepts the '>' sign. Is there a way to include the '<' sign in a variant name?

Scenario 2: The string is USB, 12W12345 or it may be USB, 12W45678

So instead of creating two separate entities, I want to use a pattern like

<entity_name standard_form="USB 12W">
    <variant name="USB, 12W*"/>
</entity_name>

But the wildcard character is not accepted in this form; it has to be separated by a blank space, like

<entity_name standard_form="USB 12W">
    <variant name="USB, 12W *"/>
</entity_name>

But if I do that, indexing fails to tokenise the string as expected, so the entity is not captured for analysis.

Please let me know how to deal with these.

Thanks in advance!

- Regards, Somnath

Accepted Solutions (0)

Answers (1)


somnath
Active Participant
0 Kudos

Yes, I am answering my own question :), since beginners like me might benefit if they run into a similar issue.

So I had two contexts:

Scenario 1- I have a string like: MB, <45W0

My problem was getting the '<' symbol into the dictionary as-is. This has nothing to do with the dictionary specifically; it is a basic XML/HTML rule that the symbol must be replaced with the entity "&lt;" (yes, ending with a semicolon). A literal '<' is not allowed inside an XML attribute value, whereas '>' is, which is why '>' was accepted.

<entity_name standard_form="MB&lt;45W">
     <variant name="MB, &lt;45W0"/>
</entity_name>
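If the dictionary entries are generated by a script rather than typed by hand, the escaping can be left to a standard XML helper. The following is only a minimal sketch using Python's standard library; the function name `dictionary_entry` is made up for illustration, and the element layout simply mirrors the snippet in this thread rather than a complete dictionary file.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET


def dictionary_entry(standard_form, variants):
    """Build one <entity_name> block with XML-escaped attribute values.

    Hypothetical helper: the layout follows the snippet in this thread,
    not a full HANA dictionary file.
    """
    # quoteattr() escapes &, < and > and adds the surrounding quotes.
    lines = [f"<entity_name standard_form={quoteattr(standard_form)}>"]
    for variant in variants:
        lines.append(f"    <variant name={quoteattr(variant)}/>")
    lines.append("</entity_name>")
    return "\n".join(lines)


entry = dictionary_entry("MB<45W", ["MB, <45W0"])
print(entry)

# Round trip: an XML parser turns the entity back into the literal '<'.
parsed = ET.fromstring(entry)
```

Reading the entry back with an XML parser yields the original strings again, so the entity only exists in the file itself and not in the value being matched.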

Scenario 2: The string is USB, 12W12345 or it may be USB, 12W45678

Here I realised that this string cannot be tokenised the way I wanted, because there is no space between 12W and 12345.

Moreover, using a wildcard leads to performance issues, so I am not going down this route and am trying an alternative instead.
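To illustrate the tokenisation point: one conceivable workaround (an assumption, not something confirmed in this thread and not a HANA feature) is to split the fused value before the text is loaded, so that "12W" becomes a token of its own. The sketch below uses plain Python; the regular expression and the name `split_fused_token` are hypothetical.

```python
import re

# Hypothetical pattern: a wattage such as "12W" immediately followed by digits.
_FUSED = re.compile(r"\b(\d+W)(\d+)\b")


def split_fused_token(text):
    """Insert a space between the wattage and the trailing digits."""
    return _FUSED.sub(r"\1 \2", text)


samples = ["USB, 12W12345", "USB, 12W45678"]
normalized = [split_fused_token(s) for s in samples]
print(normalized)
```

Note that this pattern would also split a value such as "45W0" from Scenario 1, so it would need narrowing before being used on real data.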

- Thanks, Somnath