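I need to find every occurrence of The XXth (?:and XXth)? session of the XX body
It can be any session, and there are several bodies. I came up with a pattern that finds these phrases when each one is the only one in its sentence, but it fails when such text occurs more than once. See the example below: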
import re
test = """1. The thirty-fifth session of the Subsidiary Body for Implementation (SBI) was held at the International
Convention Centre and Durban Exhibition Centre in Durban, South Africa, from 28 November to 3 December 2011. 10.
Forum on the impact of the implementation of response measures at the thirty-fourth and thirty-fifth sessions of the
subsidiary bodies, with the objective of developing a work programme under the Subsidiary Body for Scientific and
Technological Advice and the Subsidiary Body for Implementation to address these impacts, with a view to adopting,
at the seventeenth session of the Conference of the Parties, modalities for the operationalization of the work
program and a possible forum on response measures.[^6] """
pattern = re.compile(r".*(The [\w\s-]* sessions? of the (?:Subsidiary Body for Implementation|Conference of the "
r"Parties|subsidiary bodies))", re.IGNORECASE)
print(pattern.findall(test))
This prints: ['The thirty-fifth session of the Subsidiary Body for Implementation', 'the seventeenth session of the Conference of the Parties']
I would like to get: ['The thirty-fifth session of the Subsidiary Body for Implementation', 'the thirty-fourth and thirty-fifth sessions of the subsidiary bodies', 'the seventeenth session of the Conference of the Parties']
I think the problem is that the pattern is too broad, but I don't know how to restrict it, since the phrases end in different ways...
Any hints on how to improve this result?
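The problem is the "and <NUMERAL>" that can follow the first numeral. Matching whitespace with \s+ rather than a literal space also lets the pattern cross line breaks, such as the one after "sessions of the" in the sample text. Keeping re.IGNORECASE, you can use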
The\s+\S+(?:\s+and\s+\S+)?\s+sessions?\s+of\s+the\s+(?:Subsidiary\s+Body\s+for\s+Implementation|Conference\s+of\s+the\s+Parties|subsidiary\s+bodies)
See the regex demo.
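As a quick check, here is a minimal sketch that applies the suggested pattern to a shortened copy of the sample text. The helper name find_sessions and the whitespace normalization are my own additions, not part of the answer; without that normalization the second match would still contain the line break from the input.

```python
import re

# Shortened copy of the sample text from the question; the line break after
# "sessions of the" is kept on purpose.
text = """1. The thirty-fifth session of the Subsidiary Body for Implementation (SBI) was held in Durban. 10.
Forum on the impact of the implementation of response measures at the thirty-fourth and thirty-fifth sessions of the
subsidiary bodies, with a view to adopting,
at the seventeenth session of the Conference of the Parties, modalities for the work program."""

pattern = re.compile(
    r"The\s+\S+(?:\s+and\s+\S+)?\s+sessions?\s+of\s+the\s+"
    r"(?:Subsidiary\s+Body\s+for\s+Implementation"
    r"|Conference\s+of\s+the\s+Parties"
    r"|subsidiary\s+bodies)",
    re.IGNORECASE,
)


def find_sessions(s):
    """Hypothetical helper: return all matches with whitespace runs collapsed."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return [" ".join(m.split()) for m in pattern.findall(s)]


matches = find_sessions(text)
for m in matches:
    print(m)
```

Because the pattern has no leading .* and no capturing group, each phrase is reported once, in the order it appears in the text.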
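Details:
The - a fixed string
\s+\S+ - one or more whitespace characters followed by one or more non-whitespace characters
(?:\s+and\s+\S+)? - an optional sequence: one or more whitespace characters, and, one or more whitespace characters, then one or more non-whitespace characters
\s+ - one or more whitespace characters
sessions? - session or sessions
\s+of\s+the - one or more whitespace characters, of, one or more whitespace characters, the
\s+ - one or more whitespace characters
(?: - start of a non-capturing group:
    Subsidiary\s+Body\s+for\s+Implementation - Subsidiary + one or more whitespace characters + Body + one or more whitespace characters + for + one or more whitespace characters + Implementation
    | - or
    Conference\s+of\s+the\s+Parties - Conference + one or more whitespace characters + of + one or more whitespace characters + the + one or more whitespace characters + Parties
    | - or
    subsidiary\s+bodies - subsidiary + one or more whitespace characters + bodies
) - end of the group.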
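The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of the same regex in a different layout; the names SESSION_RE, sample, and found are made up for illustration.

```python
import re

SESSION_RE = re.compile(
    r"""
    The                       # fixed string (any case, because of re.IGNORECASE)
    \s+ \S+                   # whitespace, then the ordinal, e.g. thirty-fifth
    (?: \s+ and \s+ \S+ )?    # optional: and + a second ordinal
    \s+ sessions?             # session or sessions
    \s+ of \s+ the \s+        # of the, with any whitespace in between
    (?:                       # one of the body names:
        Subsidiary \s+ Body \s+ for \s+ Implementation
      | Conference \s+ of \s+ the \s+ Parties
      | subsidiary \s+ bodies
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Two phrases in one string, the first one split across a line break.
sample = (
    "at the thirty-fourth and thirty-fifth sessions of the\n"
    "subsidiary bodies, and at the seventeenth session of the Conference of the Parties"
)

# Collapse whitespace so a match that spans a line break prints on one line.
found = [" ".join(m.group(0).split()) for m in SESSION_RE.finditer(sample)]
print(found)
```

In verbose mode the literal spaces in the pattern are ignored, so every required gap has to be written as \s+, which is what the pattern already does.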