Match the first occurrence of an optional pattern

Junitar

I am trying to extract names in messy strings like the following:

genus species subsp. name […] x name […] var. name; genus2 species2 subsp. name2 var. name2  
genus species subsp. name […] x name […] var. name  
genus species subsp. name […] var name  
genus species subsp. name var. name  
genus species subsp. name

Here […] can be any sequence of characters with no regular pattern.

The desired output is:

subsp. name x name var. name  
subsp. name x name var. name  
subsp. name var. name  
subsp. name var. name  
subsp. name

My regex looks like this:

(?i).*?\b((?:aff|cf|ssp|subsp|var)[\.\s]+)([a-z-]+).*?(\sx\s+[a-z-]+)?.*?(\svar[\.\s]+[a-z-]+)?.*

Here is a demo.

I'm using the lazy quantifier *? to find the first occurrence of certain anchors (e.g. subsp, x and var) that I can use to match a given pattern. The problem is that I can't get the regex to work for all instances, because (\sx\s+[a-z-]+)? and (\svar[\.\s]+[a-z-]+)? are optional: the patterns they match don't exist in all the strings.

Is there a simple solution to get around this issue?

Wiktor Stribiżew

You can wrap each optional pattern, together with the .*? that precedes it, in an optional non-capturing group. The capturing group inside then becomes obligatory, which forces the regex engine to actually search for the pattern.

That means you need to change every .*?(pattern-to-extract)? into (?:.*?(pattern-to-extract))?. In the original form, the lazy .*? first matches nothing, the optional group is tried only at that position, and if it fails there it matches an empty string and the engine considers the job done. In the wrapped form, the outer group is attempted first, so .*? is expanded one character at a time until the capturing group matches; the outer group is skipped only if the pattern occurs nowhere in the rest of the line.

Use

(?i).*?\b((?:aff|cf|ssp|subsp|var)[.\s]+)([a-z-]+)(?:.*?(\sx\s+[a-z-]+))?(?:.*?(\svar[.\s]+[a-z-]+))?.*
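The question does not say which regex flavor is in use, so here is a minimal sketch that assumes Python's re module. It runs the original and the rewritten pattern on a few made-up strings; the fillers "(foo)" and "[bar]" stand in for the arbitrary […] text, and the extract helper is hypothetical, written only for this comparison.

```python
import re

# Original pattern from the question: each optional group is only tried
# at the position where the lazy .*? currently stands (i.e. after zero chars).
ORIGINAL = re.compile(
    r"(?i).*?\b((?:aff|cf|ssp|subsp|var)[.\s]+)([a-z-]+)"
    r".*?(\sx\s+[a-z-]+)?"
    r".*?(\svar[.\s]+[a-z-]+)?.*"
)

# Rewritten pattern: .*? moves inside an optional non-capturing group,
# so it keeps expanding until the capturing group matches (or the line ends).
FIXED = re.compile(
    r"(?i).*?\b((?:aff|cf|ssp|subsp|var)[.\s]+)([a-z-]+)"
    r"(?:.*?(\sx\s+[a-z-]+))?"
    r"(?:.*?(\svar[.\s]+[a-z-]+))?.*"
)


def extract(pattern, text):
    """Join the capturing groups that took part in the match (hypothetical helper)."""
    m = pattern.match(text)
    if m is None:
        return None
    return "".join(g for g in m.groups() if g is not None)


# Made-up sample strings; "(foo)" and "[bar]" play the role of [...] in the question.
samples = [
    "genus species subsp. alpha (foo) x beta [bar] var. gamma; "
    "genus2 species2 subsp. delta var. epsilon",
    "genus species subsp. alpha (foo) var gamma",
    "genus species subsp. alpha",
]

for s in samples:
    print("original:", extract(ORIGINAL, s))
    print("fixed:   ", extract(FIXED, s))
```

On the first sample the original pattern should stop at "subsp. alpha", because " x beta" does not start right after "alpha", while the rewritten pattern should also pick up " x beta" and " var. gamma". One caveat of the lazy search: if an " x name" part only appears after the "var. name" part, the third group would consume it and the fourth group would then find nothing.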

Note that a dot inside a character class matches a literal dot, so there is no need to escape it.
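A quick way to convince yourself of this, again assuming Python's re module, is to compare the escaped and unescaped classes on a few made-up inputs:

```python
import re

escaped = re.compile(r"[\.\s]+")    # dot escaped inside the class
unescaped = re.compile(r"[.\s]+")   # dot left unescaped

# If the unescaped dot meant "any character", "a" and "x." would match too.
for sample in [". ", " .", "...", "a", "x."]:
    e = escaped.fullmatch(sample) is not None
    u = unescaped.fullmatch(sample) is not None
    print(repr(sample), "escaped:", e, "unescaped:", u)
```

Both classes should agree on every input, accepting only runs of dots and whitespace.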

See the regex demo.
