Python regex pattern building

Joe

I'm trying to incrementally build the following regex pattern in Python using reusable pattern components. I'd expect the pattern p to match the text in lines completely, but it ends up matching only the first line.

import re
nbr = re.compile(r'\d+')
string = re.compile(r'(\w+[ \t]+)*(\w+)')
p1 = re.compile(rf"{string.pattern}\s+{nbr.pattern}\s+{string.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{string.pattern}")
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")

lines = ("aaaa 100284 aaaa\n"
         "aaaa 365870 bbbb\n"
         "757166 cccc\n"
         "111054 cccc\n"
         "999657 dddd\n"
         "999 eeee\n"
         "2955 ffff\n")

match = p.search(lines)
print(match)
print(match.group(0))

Here's what gets printed:

<re.Match object; span=(0, 16), match='aaaa 100284 aaaa'>
aaaa 100284 aaaa

trincot

The problem is here:

p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")

In p, the \n is appended to the text of p1orp2, and because | has the lowest precedence of all regex operators, the added \n becomes part of the second alternative only, not the first. It is the same as if you had attached that \n in the definition of p1orp2:

p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")

...while you really want to allow the p1 pattern to be followed by \n as well:

p1orp2 = re.compile(rf"{p1.pattern}\n|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")
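
To see the effect in isolation, here is a minimal sketch with two toy fragments. The names first, second, ungrouped and explicit are made up for this illustration and are not the patterns from the question. It compares a pattern where the \n trails the whole alternation with one where each alternative carries its own \n:

```python
import re

# Toy stand-ins (hypothetical, much simpler than the patterns in the question).
first = r"[a-z]+"
second = r"\d+"

# The trailing \n is parsed as part of the second alternative only.
ungrouped = re.compile(rf"({first}|{second}\n)+")
# Each alternative carries its own \n, as in the corrected p1orp2 above.
explicit = re.compile(rf"({first}\n|{second}\n)+")

text = "abc\n123\n"

ungrouped_match = ungrouped.match(text).group(0)
explicit_match = explicit.match(text).group(0)

print(repr(ungrouped_match))  # 'abc'
print(repr(explicit_match))   # 'abc\n123\n'
```

The first pattern stops after abc because the first alternative cannot consume the newline, so the repetition has nothing to continue with.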

To get the same effect while keeping the \n where it was, wrap the alternation in parentheses in the definition of p1orp2, which limits the scope of the | operator:

p1orp2 = re.compile(rf"({p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")

With this change it will work as you intended.
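
As a quick check, the following sketch rebuilds the pattern from plain strings and confirms that the match covers the whole text. It makes two assumptions that are not in the code above: a hypothetical helper named grouped, and non-capturing groups (?:...) in place of plain parentheses, so that embedding a fragment does not shift the numbering of any capture groups around it.

```python
import re


def grouped(fragment: str) -> str:
    """Hypothetical helper: wrap a pattern fragment in a non-capturing group."""
    return f"(?:{fragment})"


# Same building blocks as in the question, kept as raw strings.
nbr = r"\d+"
string = r"(?:\w+[ \t]+)*\w+"
p1 = rf"{string}\s+{nbr}\s+{string}"
p2 = rf"{nbr}\s+{string}"

# Grouping the alternation keeps the | from swallowing whatever is appended later.
p1orp2 = grouped(rf"{p1}|{p2}")
p = re.compile(rf"(?:{p1orp2}\n)+")

lines = ("aaaa 100284 aaaa\n"
         "aaaa 365870 bbbb\n"
         "757166 cccc\n"
         "111054 cccc\n"
         "999657 dddd\n"
         "999 eeee\n"
         "2955 ffff\n")

match = p.search(lines)
print(match.group(0) == lines)  # True
```

Wrapping every composite fragment before reuse is one way to keep the components safe to combine. Note that \s+ also matches newlines, so a single p1 match can span two lines; this check only confirms that the overall match covers the text.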
