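I'm trying to clean up some sentences by removing the tags attached to each word (an underscore followed by a tag, e.g. "_UH"). Basically, I want to remove the underscore and everything after it in each word.
Text: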
['hanks_NNS sir_VBP',
'Oh_UH thanks_NNS to_TO remember_VB']
Desired output:
['hanks sir',
'Oh thanks to remember']
Here is the code I tried:
for i in text:
    k = i.split(" ")
    print(k)
    for z in k:
        if "_" in z:
            j = z.replace("_", '')
            print(j)
Current output:
ThanksNNS
sirVBP
OhUH
thanksNNS
toTO
rememberVB
RemindVB
You can use re.sub(). Match the unwanted substring in each string, then replace it with an empty string:
import re
text = ['hanks_NNS sir_VBP', 'Oh_UH thanks_NNS to_TO remember_VB']
curated_text = [re.sub(r'_\S*', r'', a) for a in text]
print(curated_text)
Output:
['hanks sir', 'Oh thanks to remember']
Regular expression:
_\S* - an underscore followed by zero or more non-whitespace characters
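Note that _\S* removes everything from the first underscore in a word, so a word that itself contains an underscore would be cut short. A rough sketch of a stricter variant follows, assuming the tags are runs of uppercase letters (optionally ending in "$") at the end of a word. The names TAG_SUFFIX and strip_pos_tags are made up for illustration and do not come from the question.

```python
import re

# Assumption: a tag is "_" plus uppercase letters, optionally ending in "$",
# and it is followed by whitespace or the end of the string.
TAG_SUFFIX = re.compile(r'_[A-Z]+\$?(?=\s|$)')

def strip_pos_tags(sentence):
    """Remove trailing _TAG suffixes from every word in the sentence."""
    return TAG_SUFFIX.sub('', sentence)

text = ['hanks_NNS sir_VBP', 'Oh_UH thanks_NNS to_TO remember_VB']
print([strip_pos_tags(s) for s in text])
```

With this pattern, an underscore followed by lowercase letters (as in "my_var_NN") is left alone and only the final "_NN" is removed.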
text = ['hanks_NNS sir_VBP', 'Oh_UH thanks_NNS to_TO remember_VB']
curated_text = []  # Outer container for the cleaned strings.
for i in text:
    d = []  # Inner container for the words of one string.
    for b in i.split():
        c = b.split('_')[0]  # Keep the part before the underscore; discard the tag.
        d.append(c)  # Append the cleaned word to the inner container.
    # Join the words in the inner container and append the result to the outer container.
    curated_text.append(' '.join(d))
print(curated_text)
Output:
['hanks sir', 'Oh thanks to remember']
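Your code only replaces '_' with an empty string, whereas you want to replace '_' and the characters that follow it with an empty string.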
for i in text:
    k = i.split(" ")
    print(k)
    for z in k:
        if "_" in z:
            j = z.replace("_", '')  # <--- 'hanks_NNS' becomes 'hanksNNS'
            print(j)
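One way the loop above could be adjusted is to keep only the part of each word before the underscore, rather than deleting the underscore alone. This is a minimal sketch, assuming the tag always starts at the first underscore in a word. The helper name strip_tags is hypothetical and not part of the original code.

```python
def strip_tags(sentences):
    """Return the sentences with the _TAG suffix removed from every word."""
    cleaned = []
    for sentence in sentences:
        # partition("_") returns (before, separator, after);
        # index 0 is the word without its tag, or the whole word if no "_" is present.
        words = [word.partition("_")[0] for word in sentence.split()]
        cleaned.append(" ".join(words))
    return cleaned

text = ['hanks_NNS sir_VBP', 'Oh_UH thanks_NNS to_TO remember_VB']
print(strip_tags(text))
```

Words without an underscore pass through unchanged, so the "if" check from the original loop is not needed here.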