R regex compiler working differently for the given regex

Mandy8055 :

I was working on refining this answer and found that the regex given below does not behave in R the way its meaning suggests.

 +?on.*$

According to my understanding of regex, the above regex matches:

one or more spaces (matched lazily), followed by on, followed by anything (except a newline) up to the end of the line.

INPUT:

Posted by ondrej on 29 Feb 2020.
Posted by ona'je on 29 Feb 2020.

OUTPUT (what I expect when the text matched by the above regex is replaced by ""):

Posted by
Posted by 

When I test it in Python (implementation here), JavaScript and Java (implementation here), I get the result I expected.

const myString = "Posted by ondrej on 29 Feb 2020.\nPosted by ona'je on";

console.log(myString.replace( new RegExp(" +?on.*$","gm"),""));
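To make the expected behaviour concrete, the matched text itself can be inspected rather than only the replaced result. This is a minimal sketch in JavaScript, using both full input lines; the names input, pattern and matchedParts are introduced here for illustration only.

```javascript
// Both input lines from the question, joined by a newline.
const input = "Posted by ondrej on 29 Feb 2020.\nPosted by ona'je on 29 Feb 2020.";

// Same pattern as above: "g" finds every match, "m" lets $ match at each line end.
const pattern = / +?on.*$/gm;

// Collect the text matched on each line.
const matchedParts = Array.from(input.matchAll(pattern), (m) => m[0]);

// matchedParts holds " ondrej on 29 Feb 2020." and " ona'je on 29 Feb 2020."
console.log(matchedParts);

// Removing those matches leaves "Posted by" on both lines.
console.log(input.replace(pattern, ""));
```

The match starts at the first space that is followed by on, and the greedy .* then runs to the end of that line.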

On the other hand, if I implement the same regex in R (implementation here), I get the result

Posted by ondrej
Posted by ona'je

and this is unexpected.

Doubt

I thought that maybe the regex parser for R works differently (perhaps from right to left). I read the documentation on how regex works in R but found nothing that differs from other languages for the above regex. I may be missing something here. I am not well-versed in R, but as far as my regex knowledge goes, I believe the above regex should work in R the same way it works in Java, JavaScript and Python (and perhaps PCRE too), as it does in every standard regex engine I know of. My question is: why does the above regex work differently in R?

Wiktor Stribiżew :

It looks like the TRE regex engine (used by default in base R regex functions), which is based on the regex library initially written by Henry Spencer in 1986, matches the shortest match at the end of the string if the first quantified pattern in the regular expression is lazy and the expression ends with the $ anchor.

Compare these cases:

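Data <- c("Posted by ondrej on 29 Feb 2020.", "Posted by ona'je on 29 Feb 2020.")  # the two input lines from the question
sub(" +?on.*$", "", Data)  # "Posted by ondrej" "Posted by ona'je"
sub(" +?on.*", "", Data)   # "Posted bydrej on 29 Feb 2020." "Posted bya'je on 29 Feb 2020."
sub(" +?on(.*)", "", Data) # as expected
sub(" +on.*", "", Data)    # as expected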

What is going on?

  • The first case is sub(" +?on.*$", "", Data). Here the first quantified pattern sets the greediness of all the quantifiers on the same level of the regex. So the second quantifier, *, is treated as lazy even without a ? after it, because the leading space was quantified with +?, a lazy quantifier. This is a known TRE "bug", also present in some other regex engines based on Henry Spencer's regex library.

  • The second one, sub(" +?on.*", "", Data), matches the same way as if it were written " +?on.*?" (again, because the first quantified pattern sets the greediness to lazy on that level). That only matches one or more spaces and then on, since a lazy .*? at the end of a pattern matches nothing.

  • The third one, sub(" +?on(.*)", "", Data), yields the expected results because the second quantified pattern, .*, is on a different level (one level deeper, inside the group), so its greediness is not affected by the +? on the outer level. So (.*) matches greedily here.

  • The fourth one, sub(" +on.*", "", Data), yields the expected results because the first quantified pattern is greedy, so the next quantified pattern is greedy as well.
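What the second and fourth cases amount to can be approximated in an engine without this quirk by writing the laziness out explicitly. The sketch below uses JavaScript (the language of the snippet in the question), not TRE, so it only illustrates the rewritten patterns; the names line, lazyTail, greedyTail and lazyAnchored are illustrative. The first case is not reproduced by this rewrite, because it depends on how TRE picks the match at the end of the string.

```javascript
// One input line from the question.
const line = "Posted by ondrej on 29 Feb 2020.";

// Case 2 with the laziness written out: ".*?" at the end of a pattern
// matches the empty string, so only " on" is removed.
const lazyTail = line.replace(/ +?on.*?/, "");
console.log(lazyTail); // "Posted bydrej on 29 Feb 2020."

// Case 4: every quantifier is greedy, so ".*" runs to the end of the line.
const greedyTail = line.replace(/ +on.*/, "");
console.log(greedyTail); // "Posted by"

// Explicitly lazy tail anchored with $: a backtracking engine still starts at
// the leftmost " on" and extends the lazy ".*?" until $ can match.
const lazyAnchored = line.replace(/ +?on.*?$/, "");
console.log(lazyAnchored); // "Posted by"
```

The last call shows why simply reading .* as .*? does not explain the first case in a backtracking engine: there the result is still "Posted by", whereas the R output reported above is "Posted by ondrej".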
