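I have 3 different columns in different dataframes, as shown below.
Column 1 contains sentence templates, e.g. "He wants to [action] this week".
Column 2 contains pairs of words, e.g. "exercise, swim".
Column 3 contains the type of the word pair, e.g. [action].
I assume there should be something similar to "melt" in R, but I am not sure how to do the replacement.
I would like to create a new column/dataframe that holds all the possible options for each sentence template (one sentence per row):
He wants to exercise this week.
He wants to swim this week.
The number of templates is significantly lower than the number of words I have. There are several types of word pairs (action, description, object, etc.).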
#a simple example of what I would like to achieve
import pandas as pd
#input1
templates = pd.DataFrame(columns=list('AB'))
templates.loc[0] = [1,'He wants to [action] this week']
templates.loc[1] = [2,'She noticed a(n) [object] in the distance']
templates
#input 2
words = pd.DataFrame(columns=list('AB'))
words.loc[0] = ['exercise, swim', 'action']
words.loc[1] = ['bus, shop', 'object']
words
#output
result = pd.DataFrame(columns=list('AB'))
result.loc[0] = [1, 'He wants to exercise this week']
result.loc[1] = [2, 'He wants to swim this week']
result.loc[2] = [3, 'She noticed a(n) bus in the distance']
result.loc[3] = [4, 'She noticed a(n) shop in the distance']
result
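First use Series.str.extract with the words from words['B'] to create a new column holding the matched placeholder, then use Series.map to look up the replacement values: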
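import re
# Build a regex that matches any bracketed word type, e.g. \[action\]|\[object\]
pat = '|'.join(r"\[{}\]".format(re.escape(x)) for x in words['B'])
templates['matched'] = templates['B'].str.extract('(' + pat + ')', expand=False).fillna('')
# Map each matched placeholder to its comma-separated replacement words
templates['repl'] = (templates['matched'].map(words.set_index('B')['A']
                                              .rename(lambda x: '[' + x + ']'))).fillna('')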
print (templates)
   A                                          B   matched            repl
0  1             He wants to [action] this week  [action]  exercise, swim
1  2  She noticed a(n) [object] in the distance  [object]       bus, shop
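To make the extraction step easier to follow, here is a minimal standalone sketch of what the pattern looks like and how str.extract behaves when a template has no placeholder. The names `types`, `s` and the second sample sentence are made up for illustration and are not part of the question's data.

```python
import re
import pandas as pd

# Hypothetical word types, standing in for words['B'].
types = pd.Series(['action', 'object'])

# Escape each type and wrap it in literal brackets, then join with "|".
pat = '|'.join(r"\[{}\]".format(re.escape(t)) for t in types)
print(pat)  # \[action\]|\[object\]

# The second sentence is an invented example without any placeholder.
s = pd.Series(['He wants to [action] this week', 'No placeholder here'])

# Rows without a match come back as NaN, so fillna('') turns them into ''.
matched = s.str.extract('(' + pat + ')', expand=False).fillna('')
print(matched.tolist())  # ['[action]', '']
```

Because unmatched rows end up as empty strings, the later Series.map lookup simply finds nothing for them instead of raising.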
Then do the replacement in a list comprehension:
z = zip(templates['B'],templates['repl'], templates['matched'])
result = pd.DataFrame({'B':[a.replace(c, y) for a,b,c in z for y in b.split(', ')]})
result.insert(0, 'A', result.index + 1)
print (result)
   A                                      B
0  1         He wants to exercise this week
1  2             He wants to swim this week
2  3   She noticed a(n) bus in the distance
3  4  She noticed a(n) shop in the distance
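Since the question mentions R's "melt", a different route may also be worth considering: reshape the word pairs to one word per row with DataFrame.explode and join them to the templates with merge. This is only a sketch of an alternative, not the method used above. It assumes pandas 0.25 or newer (for explode), exactly one placeholder per template, and placeholder names made of word characters; the helper names `long_words`, `tmpl` and `merged` are invented here.

```python
import pandas as pd

templates = pd.DataFrame({
    'A': [1, 2],
    'B': ['He wants to [action] this week',
          'She noticed a(n) [object] in the distance'],
})
words = pd.DataFrame({
    'A': ['exercise, swim', 'bus, shop'],
    'B': ['action', 'object'],
})

# Reshape to one row per (type, word); this is the "melt"-like step.
long_words = (words.assign(word=words['A'].str.split(', '))
                   .explode('word')
                   .rename(columns={'B': 'type'})[['type', 'word']])

# Assumption: each template holds a single [type] placeholder.
tmpl = templates.assign(
    type=templates['B'].str.extract(r'\[(\w+)\]', expand=False))

# Pair every template with every word of its type.
merged = tmpl.merge(long_words, on='type', how='inner')

# Substitute the word for its bracketed placeholder, row by row.
sentences = [t.replace('[' + k + ']', w)
             for t, k, w in zip(merged['B'], merged['type'], merged['word'])]

result = pd.DataFrame({'A': range(1, len(sentences) + 1), 'B': sentences})
print(result)
```

With many word types and comparatively few templates, the join keeps the pairing logic inside pandas; templates whose type has no words are dropped by the inner merge.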