I'm trying to incrementally build the following regex pattern in Python from reusable pattern components. I'd expect the pattern p to match the text in lines completely, but it ends up matching only the first line.
import re
nbr = re.compile(r'\d+')
string = re.compile(r'(\w+[ \t]+)*(\w+)')
p1 = re.compile(rf"{string.pattern}\s+{nbr.pattern}\s+{string.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{string.pattern}")
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")
lines = ("aaaa 100284 aaaa\n"
         "aaaa 365870 bbbb\n"
         "757166 cccc\n"
         "111054 cccc\n"
         "999657 dddd\n"
         "999 eeee\n"
         "2955 ffff\n")
match = p.search(lines)
print(match)
print(match.group(0))
Here's what gets printed:
<re.Match object; span=(0, 16), match='aaaa 100284 aaaa'>
aaaa 100284 aaaa
The problem is here:
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")
In p, the \n is appended to p1orp2, but this interacts with the scope of the | in p1orp2: the added \n belongs to the second option only, not to the first. It is the same as if you had attached that \n in the definition of p1orp2 already:
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")
...while you really want to allow the p1 pattern to be followed by \n as well:
p1orp2 = re.compile(rf"{p1.pattern}\n|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")
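The scoping rule described above can be checked on a toy pattern. This is only a minimal sketch: the pattern ab|cd\n and the name unscoped are made up for illustration and are not part of the original code. Because | has the lowest precedence among regex operators, the trailing \n attaches to the second option only.

```python
import re

# Hypothetical toy pattern: reads as "ab" OR "cd\n", not "(ab or cd) then \n".
unscoped = re.compile(r"ab|cd\n")

print(bool(unscoped.fullmatch("ab")))    # True: first option, no newline needed
print(bool(unscoped.fullmatch("cd\n")))  # True: second option includes the newline
print(bool(unscoped.fullmatch("ab\n")))  # False: the newline is not part of the first option
```

The last call is the situation from the question: a line matching the first option cannot consume the newline that follows it.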
To achieve that with the \n where it was, you could use parentheses in the definition of p1orp2 so that they limit the scope of the | operator:
p1orp2 = re.compile(rf"({p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")
With this change it will work as you intended.
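As a quick check, the parenthesized version can be run against the sample text from the question. This sketch reuses the names from the question unchanged; only the final comparison is added here to confirm that the whole text is consumed.

```python
import re

nbr = re.compile(r'\d+')
string = re.compile(r'(\w+[ \t]+)*(\w+)')
p1 = re.compile(rf"{string.pattern}\s+{nbr.pattern}\s+{string.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{string.pattern}")
# The parentheses keep the | inside p1orp2, so a \n appended later
# applies to both options rather than only to p2.
p1orp2 = re.compile(rf"({p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")

lines = ("aaaa 100284 aaaa\n"
         "aaaa 365870 bbbb\n"
         "757166 cccc\n"
         "111054 cccc\n"
         "999657 dddd\n"
         "999 eeee\n"
         "2955 ffff\n")

match = p.search(lines)
# Added check: the match should now cover the entire input.
print(match.group(0) == lines)  # True
```

If the extra capture group is unwanted, a non-capturing group (?:...) limits the scope of | in the same way without changing the group numbering.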