cancel
Showing results for 
Search instead for 
Did you mean: 

Problem with REGEXP_SUBSTR

Former Member
0 Kudos

I am having a problem with REGEXP_SUBSTR(). I am trying to extract "captureThis" from the following string:

Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah'

The rules are that it must be preceded by "Token1" followed by arbitrary text, terminated by "Token2". Except that if the same pattern appears in parentheses in the arbitrary text, everthing in parentheses should be ignored.

If I run the following REGEXP_SUBSTR() statement, I get a result that correctly ends in "captureThis"

select REGEXP_SUBSTR('Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah',
    'Token1\\s+((\\([^\\)]+\\))|(((?!Token2).)+))*\\s+Token2\\s+\\S*' ,
  1,1);

Result: Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis

However, if I use a Positive Lookbehind Zero-Width Assertion to filter out the tokens, I get a different result:

select REGEXP_SUBSTR('Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah',
'(?<=Token1\\s+((\\([^\\)]+\\))|(((?!Token2).)+))*\\s+Token2\\s+)\\S*'
,  1,1);

Result: ignoreThis (I need it to be "captureThis")

I also found an interesting result by playing with the fourth parameter, occurrence-number. In the first statement above, occurrence-number 1 is a string ending in "captureThis", and occurrence-number 2 is a string ending in "ignoreThis".

However, in the second statement, with the Lookbehind, the order is reversed. Occurrence-number 1 is "ignoreThis", and occurrence-number 2 is "captureThis".

Is there any way to alter the regular expression in the second statement so occurrence-number 1 will be "captureThis"?

Thanks,

Eric

Accepted Solutions (1)

Accepted Solutions (1)

Ali_Chalhoub
Advisor
Advisor
0 Kudos

Hello,

I have discussed this with my development team and here is our analysis:

Here our breakdown and annotation of the of the expression:


'(?<=Token1\\s+((\\([^\\)]+\\))|(((?!Token2).)+))*\\s+Token2\\s+)\\S*'

'
(?<=                                                       zero width look-behind assertion
    Token1                                                             match Token1
    \\s+                                                   one or more spaces
    (
      (                                                                                        alternative 1
        \\(                                                                    match '('
                [^\\)]+                                                                    match anything but ')'
        \\)                                                                    match ')'
      )
     |
      (                                                                                        alternative 2
        (
          (?!Token2)                                                                  look ahead and do _not_ match 'Token2' (negative zero width look ahead assertion)
          .                                                                       match any character
        )+                                                                      match one or more - i.e. match any character up to (but not including) 'Token2'
      )
    )*                                                                        match zero or more of the two alternatives
    \\s+
    Token2
    \\s+
)                                                              end zero width look-behind assertion
\\S*
'

Now note that the zero width look-behind assertion will match
                Token1 blah, blah, (Token1 blah
because the ‘(‘ in this string matches alternative 2 (i.e. the dot that does not precede ‘Token2’) !!

Hence regexp_subtr is working as intended.

If that answer your question, please mark the question answered.

Thank you

Former Member
0 Kudos

Both statements will correctly locate both "captureThis" and "ignoreThis", but not in the same order. In my case, I need occurrence-number 1 to be the one I'm looking for.

Is there any way to alter the regular expression in the second statement so "captureThis" will be occurrence-number 1 instead of occurrence-number 2?

Another way to put it is, once it's located the first Token1, is there any way to get it to ignore the Token2 that's inside the parentheses?

Ali_Chalhoub
Advisor
Advisor
0 Kudos

Can you tell us exactly in details what result you are looking for? Both statement are correct, but not the result that you are looking for. So we need to know exactly what are you looking for because we are not following your request. Please be specific in every step.

Thank you

Former Member
0 Kudos

In this statement

select REGEXP_SUBSTR(

'Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah',

'(?<=Token1\\s+((\\([^\\)]+\\))|(((?!Token2).)+))*\\s+Token2\\s+)\\S*',  1,1);

I want the regular expression to completely ignore everything inside the parentheses

"(Token1 blah Token2 ignoreThis blah)",

so it returns "captureThis" when I pass 1 as the third parameter (occurrence-number).

Ali_Chalhoub
Advisor
Advisor
0 Kudos

So your RE matches ignoreThis because the zero width look behind assertion is matching  “Token1 blah Token2 “ – i.e. a substring within the input.

What we need is  that the RE given to regexp_substr() is not anchored at the beginning of the string. (it needs a ‘^’ at the beginning of the RE to do this).

So what exactly you want to match?

If the answer is that the intention was that the zero width look behind assertion is to match from the beginning then this RE seems to work:

select REGEXP_SUBSTR('Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah',
'(?<=^Token1\\s+((\\([^\\)]+\\))|[^(])*\\s+Token2\\s+)\\S*'
,  1,1);

Note the anchor (^) before the first Token1 and note that I have replaced the dot with just ‘match anything that is not ‘(‘. i.e. [^(]

Regarding the (?!Token2) in the second alternative – it serves no purpose.

Let me know if that helps.

Thank you

Former Member
0 Kudos

Finally, the light has dawned. Thanks for your help and your patience. My confusion can be explained (or cleared up) by imagining a cursor position as the regular expression engine is scanning the input string.

In my first statement (without the look-behind), the cursor begins on the first instance of Token1, correctly passes the stuff in parentheses as an example of alternative 1, and stops at the first word following the 2nd Token2, which is "captureThis".

I was wrongly expecting the second statement to behave the same way. But because of the look-behind, the cursor begins on the first word following a Token2, which is "ignoreThis". It then scans backwards and finds a match consisting of what's inside the parentheses.

So now it makes sense to me that simply adding a look-behind to an otherwise identical regular expression applied to the identical input string would produce results in reverse order. The first statement uses instances of Token1 to anchor its searches, while the second uses Token2 to anchor its searches.

Unfortunately, I can't use your anchor(^) solution for my problem because there will be text before Token1. But I can use trimmed results from my non-look-behind version to get what I need in the order I need. Also, you're right about (?!Token2). I can take it out.

Thanks,

Eric

Answers (1)

Answers (1)

Ali_Chalhoub
Advisor
Advisor
0 Kudos

So you want the result of the second expression to be like the first:

Result: Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis

Please confirm!!!