Remove rows from one data frame based on conditions from another data frame in Pandas (Python)

Tanmay Jain

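I have two Pandas data frames in Python, each containing millions of rows. I want to remove the rows of the first data frame that contain words from the second data frame, based on three conditions:

If the word appears at the beginning of the sentence in a row
If the word appears at the end of the sentence in a row
If the word appears in the middle of the sentence in a row (the exact word, not a substring)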

Example:

First data frame:

This is the first sentence
Second this is another sentence
This is the third sentence forth
This is fifth sentence
This is fifth_sentence

Second data frame:

Second
forth
fifth

Expected output:

This is the first sentence
This is fifth_sentence

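Note that I have millions of records in both data frames. How can I process them and export the result in the most efficient way?

I tried the following, but it takes a very long time: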

import pandas as pd
import re

bad_words_file_data = pd.read_csv("words.txt", sep = ",", header = None)
sentences_file_data = pd.read_csv("setences.txt", sep = ".", header = None)

bad_words_index = []
for i in sentences_file_data.index:
    print("Processing Sentence:- ", i, "\n")
    single_sentence = sentences_file_data[0][i]
    for j in bad_words_file_data.index:
        word = bad_words_file_data[0][j]
        if single_sentence.endswith(word) or single_sentence.startswith(word) or word in single_sentence.split(" "):
            bad_words_index.append(i)
            break
            
sentences_file_data = sentences_file_data.drop(index=bad_words_index)
sentences_file_data.to_csv("filtered.txt",header = None, index = False)

Thanks

Sophocles

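You can use the numpy.where function to create a column named "remove" that is set to 1 whenever any of the conditions you outlined is met. First, create a list containing the values of df2.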

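Condition 1: checks whether the cell value starts with any of the values in the list

Condition 2: same as above, but checks whether the cell value ends with any of the values in the list

Condition 3: splits each cell on spaces and checks whether any of the resulting tokens is in your list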
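Before moving to data frames, the three checks can be tried on a single plain string. This is only an illustrative sketch: the helper name `flags` is hypothetical, and the word list is copied from the example in the question.

```python
# Illustrative only: the three checks applied to one plain Python string.
words = ["Second", "forth", "fifth"]  # values taken from the second data frame

def flags(sentence, words):
    """Return (starts_with, ends_with, has_token) for one sentence."""
    starts = sentence.startswith(tuple(words))               # condition 1
    ends = sentence.endswith(tuple(words))                   # condition 2
    token = any(t in words for t in sentence.split(" "))     # condition 3
    return starts, ends, token

print(flags("Second this is another sentence", words))  # (True, False, True)
print(flags("This is fifth_sentence", words))           # (False, False, False)
print(flags("Secondary school", words))                 # (True, False, False)
```

The last call shows that conditions 1 and 2 are prefix and suffix tests rather than whole-word tests, so a sentence starting with "Secondary" is also flagged. Whether that is desired depends on how strictly "exact word" should apply at the start and end of a sentence.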

After that, you can create the new data frame by filtering out the rows marked with 1:

# Imports
import pandas as pd
import numpy as np

# Get the values from df2 in a list
l = list(set(df2['col']))

# Set conditions
c = df['col']

# expand=True keeps the index of df, and axis must be passed by keyword in recent pandas
cond = (c.str.startswith(tuple(l))
        | c.str.endswith(tuple(l))
        | c.str.split(' ', expand=True).isin(l).any(axis=1))

# Assign 1 or 0
df['remove'] = np.where(cond,1,0)

# Create the filtered data frame and drop the helper column
out = df[df['remove'] != 1].drop(columns=['remove'])

Printing out gives:

                          col
0  This is the first sentence
4      This is fifth_sentence
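Since both data frames hold millions of rows, a set lookup per token may scale better than passing a very long tuple to startswith and endswith. The following is a minimal sketch under that assumption, not a measured benchmark; the names `bad_words`, `is_clean` and `out_tokens` are hypothetical. Note that it only matches whole whitespace-separated tokens, so it is stricter than the prefix and suffix checks above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"col": [
    "This is the first sentence",
    "Second this is another sentence",
    "This is the third sentence forth",
    "This is fifth sentence",
    "This is fifth_sentence",
]})
df2 = pd.DataFrame({"col": ["Second", "forth", "fifth"]})

# A set gives constant-time membership tests, however many words there are
bad_words = set(df2["col"])

def is_clean(sentence):
    """True when no whitespace-separated token of the sentence is a bad word."""
    return bad_words.isdisjoint(sentence.split())

# Boolean mask of the rows to keep
keep = df["col"].map(is_clean).astype(bool)
out_tokens = df[keep]
print(out_tokens)
```

On the sample data this keeps rows 0 and 4, matching the expected output. The filtered frame can then be written out with to_csv, as in the question.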

References:

Pandas: select rows where a string starts with any item in a list

Check if a column contains any str from a list

Data frames used:

>>> df.to_dict()

{'col': {0: 'This is the first sentence',
  1: 'Second this is another sentence',
  2: 'This is the third sentence forth',
  3: 'This is fifth sentence',
  4: 'This is fifth_sentence'}}

>>> df2.to_dict()

{'col': {0: 'Second', 1: 'forth', 2: 'fifth'}}
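To reproduce the answer, the two dictionaries above can be passed back to pd.DataFrame. This is a small sketch that only rebuilds the sample data; it assumes nothing beyond the values shown.

```python
import pandas as pd

# Rebuild both data frames from the to_dict() output shown above
df = pd.DataFrame({'col': {0: 'This is the first sentence',
                           1: 'Second this is another sentence',
                           2: 'This is the third sentence forth',
                           3: 'This is fifth sentence',
                           4: 'This is fifth_sentence'}})
df2 = pd.DataFrame({'col': {0: 'Second', 1: 'forth', 2: 'fifth'}})

print(df.shape, df2.shape)  # (5, 1) (3, 1)
```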


Edited on 2021-08-29
