Pandas: find text in a row and assign a dummy variable value based on it

edyvedy13

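I have a dataframe with a text column, df["input"].

I want to create new dummy variables that check whether the df["input"] column contains any of the words in a given list, and that are set to 1 only if all of the previous dummy variables are 0. (The logic is: 1) create a dummy variable equal to zero; 2) replace it with one if the row contains any word from the given list and did not already match a previous list.)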

import pandas as pd

# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle",  "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle', 
                             'medical', 
                             'film biotechnology dentist']})

It looks like:

input
amazon listing subtitle
medical 
film biotechnology dentist

The final dataset should look like this:

input                           listings  scripting  medical
amazon listing subtitle            1         0         0
medical                            0         0         1          
film biotechnology dentist         0         1         0

寡妇

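One possible implementation is to use str.contains in a loop to create the 3 columns, then use argmax (the NumPy counterpart of idxmax) to find the position of the first matching column (that is, the first matching list) in each row, and build the dummy variables from those positions: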

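import numpy as np

# Map each dummy column name to its word list; dict order sets the priority
d = {'listings':listings, 'scripting':scripting, 'medical':medical}
for k,v in d.items():
    # True where the row contains any word of the list (regex alternation)
    df[k] = df['input'].str.contains('|'.join(v))

arr = df[list(d)].to_numpy()
tmp = np.zeros(arr.shape, dtype='int8')
# argmax gives the first True per row; max is False for rows with no match
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
# combine_first brings the 'input' column back next to the dummy columns
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)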
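
As a variation on the same idea, the "first match only" rule can also be written without NumPy indexing by counting matches from left to right. This is only a sketch, not part of the original answer; the names matches, counts, first_only and result are illustrative, and it rebuilds the example data so that it runs on its own.

```python
import pandas as pd

listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})

d = {'listings': listings, 'scripting': scripting, 'medical': medical}

# One boolean column per list, in priority order
matches = pd.DataFrame({k: df['input'].str.contains('|'.join(v))
                        for k, v in d.items()})

# Running count of matches across the columns of each row
counts = matches.astype(int).cumsum(axis=1)

# Keep a cell only if it is a match and it is the first one in its row
first_only = (matches & (counts == 1)).astype('int8')

result = df.join(first_only)
print(result)
```

Because the dummy columns are joined onto df, they keep the priority order of d (listings, scripting, medical) rather than being sorted alphabetically.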

However, in this case a nested for loop may be more efficient:

import re
def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match 
        # if there's a match, break the loop and pad 0s since we don't care if there's another match
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp += [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out

lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))

Output:

                        input listings medical scripting
0     amazon listing subtitle        1       0         0
1                     medical        0       1         0
2  film biotechnology dentist        0       0         1
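
One difference between the two versions is worth noting: the str.contains version matches substrings, while the loop version wraps each word in \b word boundaries. The sketch below illustrates the effect; the sample string is hypothetical and does not come from the original data, and re.escape is an added precaution that the original code does not use.

```python
import re

scripting = ["subtitle", "film", "dubbing"]
text = "filmmaker reel"  # hypothetical input, not from the original data

# Plain alternation, as in the str.contains version: matches inside longer words
substring_hit = bool(re.search('|'.join(scripting), text))

# Word-boundary pattern, as in the loop version: requires a whole-word match
whole_word_hit = any(re.search(fr"\b{re.escape(x)}\b", text) for x in scripting)

print(substring_hit, whole_word_hit)  # True False
```

On the example data in the question both patterns give the same result, so the choice only matters when a listed word can appear inside a longer word.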
