Skip to Content

Problem with REGEXP_SUBSTR

I am having a problem with REGEXP_SUBSTR(). I am trying to extract "captureThis" from the following string:

Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah'

The rules are that it must be preceded by "Token1" followed by arbitrary text, terminated by "Token2". Except that if the same pattern appears in parentheses in the arbitrary text, everthing in parentheses should be ignored.

If I run the following REGEXP_SUBSTR() statement, I get a result that correctly ends in "captureThis"

select REGEXP_SUBSTR('Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah',
    'Token1\\s+((\\([^\\)]+\\))|(((?!Token2).)+))*\\s+Token2\\s+\\S*' ,
  1,1);

Result: Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis

However, if I use a Positive Lookbehind Zero-Width Assertion to filter out the tokens, I get a different result:

select REGEXP_SUBSTR('Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah',
'(?<=Token1\\s+((\\([^\\)]+\\))|(((?!Token2).)+))*\\s+Token2\\s+)\\S*'
,  1,1);

Result: ignoreThis (I need it to be "captureThis")

I also found an interesting result by playing with the fourth parameter, occurrence-number. In the first statement above, occurrence-number 1 is a string ending in "captureThis", and occurrence-number 2 is a string ending in "ignoreThis".

However, in the second statement, with the Lookbehind, the order is reversed. Occurrence-number 1 is "ignoreThis", and occurrence-number 2 is "captureThis".

Is there any way to alter the regular expression in the second statement so occurrence-number 1 will be "captureThis"?

Thanks,

Eric

Add comment
10|10000 characters needed characters exceeded

  • Get RSS Feed

2 Answers

  • Best Answer
    Jan 20, 2015 at 11:02 PM

    Hello,

    I have discussed this with my development team and here is our analysis:

    Here our breakdown and annotation of the of the expression:


    '(?<=Token1\\s+((\\([^\\)]+\\))|(((?!Token2).)+))*\\s+Token2\\s+)\\S*'

    '
    (?<=                                                       zero width look-behind assertion
        Token1                                                             match Token1
        \\s+                                                   one or more spaces
        (
          (                                                                                        alternative 1
            \\(                                                                    match '('
                    [^\\)]+                                                                    match anything but ')'
            \\)                                                                    match ')'
          )
         |
          (                                                                                        alternative 2
            (
              (?!Token2)                                                                  look ahead and do _not_ match 'Token2' (negative zero width look ahead assertion)
              .                                                                       match any character
            )+                                                                      match one or more - i.e. match any character up to (but not including) 'Token2'
          )
        )*                                                                        match zero or more of the two alternatives
        \\s+
        Token2
        \\s+
    )                                                              end zero width look-behind assertion
    \\S*
    '

    Now note that the zero width look-behind assertion will match
                    Token1 blah, blah, (Token1 blah
    because the ‘(‘ in this string matches alternative 2 (i.e. the dot that does not precede ‘Token2’) !!

    Hence regexp_subtr is working as intended.

    If that answer your question, please mark the question answered.

    Thank you

    Add comment
    10|10000 characters needed characters exceeded

    • Finally, the light has dawned. Thanks for your help and your patience. My confusion can be explained (or cleared up) by imagining a cursor position as the regular expression engine is scanning the input string.

      In my first statement (without the look-behind), the cursor begins on the first instance of Token1, correctly passes the stuff in parentheses as an example of alternative 1, and stops at the first word following the 2nd Token2, which is "captureThis".

      I was wrongly expecting the second statement to behave the same way. But because of the look-behind, the cursor begins on the first word following a Token2, which is "ignoreThis". It then scans backwards and finds a match consisting of what's inside the parentheses.

      So now it makes sense to me that simply adding a look-behind to an otherwise identical regular expression applied to the identical input string would produce results in reverse order. The first statement uses instances of Token1 to anchor its searches, while the second uses Token2 to anchor its searches.

      Unfortunately, I can't use your anchor(^) solution for my problem because there will be text before Token1. But I can use trimmed results from my non-look-behind version to get what I need in the order I need. Also, you're right about (?!Token2). I can take it out.

      Thanks,

      Eric

  • Jan 20, 2015 at 05:58 PM

    So you want the result of the second expression to be like the first:

    Result: Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis

    Please confirm!!!

    Add comment
    10|10000 characters needed characters exceeded