Regex split on a newline followed by a capital letter

Posted by Rohan in Dev

Rohan

I've been struggling to split a string with a regular expression in Python.

I have loaded a text file whose contents look like this:

"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch 
 at Kate's house. Kyle went home at 9. \nSome other sentence 
 here\n\u2022Here's a bulleted line"

I would like to get the following output:

['Peter went to the gym; he worked out for two hours','Kyle ate lunch 
at Kate's house. He went home at 9.', 'Some other sentence here', 
'\u2022Here's a bulleted line']

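I want to split my string wherever a newline is followed by either a capital letter or a bullet point.

I've tried tackling the first half of the problem, splitting only on a newline followed by a capital letter.

Here's what I have so far: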

print re.findall(r'\n[A-Z][a-z]+',str,re.M)

This gives me:

[u'\nKyle', u'\nSome']

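That's only the first word of each piece. I've tried variations of this regex, but I can't work out how to get the rest of the text.

I assume that to also split on bullet points, I'd just add an OR alternative to the regex, in the same form as the part that matches a capital letter. Is that the best way to do it?

I hope this makes sense, and I'm sorry if my question is still unclear. :)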

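anubhava

You can use this split call, which splits on a newline only when a lookahead sees a bullet or a capital letter right after it: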

>>> import re
>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)

[u'Peter went to the gym; \nhe worked out for two hours ',
 u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
 u'Some other sentence here',
 u"\u2022Here's a bulleted line"]
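
The output above still contains the inner newline and trailing spaces, whereas the question asks for cleaned-up strings. Here is a minimal Python 3 sketch of the same lookahead split with a whitespace cleanup step added. The names `text` and `split_entries` are made up for illustration, and collapsing all runs of whitespace into one space is an assumption about what the asker wants:

```python
import re

# Sample input taken from the question.
text = (
    "Peter went to the gym; \nhe worked out for two hours \n"
    "Kyle ate lunch at Kate's house. Kyle went home at 9. \n"
    "Some other sentence here\n\u2022Here's a bulleted line"
)


def split_entries(s):
    """Split on a newline that is followed by a bullet or an ASCII capital."""
    # The lookahead (?=...) checks the next character without consuming it,
    # so the capital letter or bullet stays at the start of the next piece.
    parts = re.split("\n(?=[\u2022A-Z])", s)
    # Assumption: leftover newlines and repeated spaces inside a piece
    # should be collapsed to a single space, and the ends trimmed.
    return [" ".join(p.split()) for p in parts]


entries = split_entries(text)
for entry in entries:
    print(entry)
```

Note that `[A-Z]` only covers ASCII capitals, so a line starting with an accented capital such as `É` would not trigger a split with this pattern.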
