Text pattern recognition in a pandas DataFrame

Posted by M PAUL on Dev

Paul

I am trying to get Python to match text patterns in a pandas DataFrame.

This is what I am doing:

import pandas as pd

words = ['sarcasm', 'irony', 'humor']  # renamed so the built-in list is not shadowed
pattern = '|'.join(words)
pattern2 = '(' + pattern + ')'  # one capture group, no padding spaces

# docs_list is the list containing the snippets
frame = pd.DataFrame(docs_list, columns=['words'])

# Skipping the in-between steps for simplicity of viewing
# expand=False returns a Series, so to_frame() works
cp2 = frame.words.str.extract(pattern2, expand=False)
c2 = cp2.to_frame().fillna("No Matching Word Found")

This gives output like this:

Snips                                     pattern_found    matching_Word
A different type of humor                    True             humor
A different type of sarcasm                  True             sarcasm 
A different type of humor and irony          True             humor
A different type of reason                   False            NA
A type of humor and sarcasm                  True             humor
A type of comedy                             False            NA

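So Python checks the pattern and gives the corresponding output.

Now, here is my problem. As I understand it, as long as Python has not encountered a word from the pattern in a snippet, it keeps checking the whole pattern. As soon as it encounters one part of the pattern, it takes that part and skips the remaining words.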
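The behaviour described above can be reproduced in a few lines. This is a minimal sketch on made-up sample rows (the `sample` frame is hypothetical, modelled on the snippets in the table): `str.extract` returns only the first match it finds scanning the text from left to right, so the word that comes first in the snippet wins, regardless of the order of the words in the pattern.

```python
import pandas as pd

# Hypothetical sample rows, modelled on the snippets shown above.
sample = pd.DataFrame({"words": [
    "A different type of humor and irony",
    "A different type of irony and humor",
    "A type of comedy",
]})

pattern = "(sarcasm|irony|humor)"

# expand=False returns a Series holding the first match per row (NaN if none).
first_match = sample["words"].str.extract(pattern, expand=False)
print(first_match)
```

The first two rows contain the same two words in a different order, and each row yields only whichever word appears first in its text.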

How do I get Python to look for every word rather than only the first matching one, so that the output looks like this?

Snips                                     pattern_found    matching_Word
A different type of humor                    True             humor
A different type of sarcasm                  True             sarcasm 
A different type of humor and irony          True             humor
A different type of humor and irony          True             irony
A different type of reason                   False            NA
A type of humor and sarcasm                  True             humor
A type of humor and sarcasm                  True             sarcasm
A type of comedy                             False            NA

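A simple solution would obviously be to put the pattern words in a list and run a for loop that checks every word against every snippet. But time is a constraint, especially since the dataset I am working with is large and the snippets are fairly long.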

jezrael

For me extractall works: reset_index removes the extra level of the MultiIndex, and a final join attaches the result to the original DataFrame.

L = ['sarcasm','irony','humo', 'humor', 'hum']
#sorting by http://stackoverflow.com/a/4659539/2901002
L.sort()
L.sort(key = len, reverse=True)
print (L)
['sarcasm', 'humor', 'irony', 'humo', 'hum']

pattern2 = r'(?P<COL>{})'.format('|'.join(L))
print (pattern2)
(?P<COL>sarcasm|humor|irony|humo|hum)
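The reason for sorting the longest words first is that regex alternation tries the alternatives from left to right and stops at the first one that matches at a given position. A small sketch with the standard `re` module (the `text` string is just an illustrative sample) shows the difference when one word is a prefix of another:

```python
import re

text = "A different type of humor"

# Shortest first: "hum" matches where "humor" starts, so the prefix wins.
short_first = re.findall("hum|humo|humor", text)

# Longest first: the full word is tried before its prefixes.
long_first = re.findall("humor|humo|hum", text)

print(short_first)  # ['hum']
print(long_first)   # ['humor']
```

If none of the search words is a prefix of another, as with the three words in the question, the ordering makes no difference.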

cp2 = frame.words.str.extractall(pattern2).reset_index(level=1, drop=True)
print (cp2)
       COL
0    humor
1  sarcasm
2    humor
2    irony
4    humor
4  sarcasm

frame = frame.join(cp2['COL']).reset_index(drop=True)
print (frame)
                                 words pattern_found matching_Word      COL
0            A different type of humor          True         humor    humor
1          A different type of sarcasm          True       sarcasm  sarcasm
2  A different type of humor and irony          True         humor    humor
3  A different type of humor and irony          True         humor    irony
4           A different type of reason         False           NaN      NaN
5          A type of humor and sarcasm          True         humor    humor
6          A type of humor and sarcasm          True         humor  sarcasm
7                     A type of comedy         False           NaN      NaN
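For reference, the steps above can be put together into one self-contained script. This is only a sketch under a few assumptions: `docs_list` is filled with the six snippets from the question's table, the named group is called `matching_Word` to mirror the desired output, `re.escape` is added as a precaution in case a search word contains regex metacharacters, and `pattern_found` is derived from the joined column rather than from the steps the question skipped.

```python
import re
import pandas as pd

# Snippets copied from the question's example table.
docs_list = [
    "A different type of humor",
    "A different type of sarcasm",
    "A different type of humor and irony",
    "A different type of reason",
    "A type of humor and sarcasm",
    "A type of comedy",
]
frame = pd.DataFrame(docs_list, columns=["words"])

words = ["sarcasm", "irony", "humor"]
# Longest alternatives first; re.escape guards against regex metacharacters.
alternatives = sorted(words, key=len, reverse=True)
pattern = r"(?P<matching_Word>{})".format("|".join(map(re.escape, alternatives)))

# One row per match; dropping the "match" level keeps the original row label.
matches = frame["words"].str.extractall(pattern).reset_index(level=1, drop=True)

# Left join on the index: rows with several matches are repeated,
# rows without any match get NaN.
result = frame.join(matches).reset_index(drop=True)
result["pattern_found"] = result["matching_Word"].notna()
print(result)
```

Because the join is index-based, a snippet that contains two of the words appears twice in `result`, once per matched word, which is the layout the question asks for.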
