Remove part of a string based on overlapping patterns

德申

I have the following data:

dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short"),
                  some_other_cols = c(1, 2, 2))

In addition, I have the following vector of patterns:

my_patterns <- c("my example", "is my", "my other text")

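What I want to achieve is to remove from dat$x all text that occurs in my_patterns.

I tried the following solution, but the problem is that once one pattern has been removed from the text (here "my example"), my solution can no longer detect the second pattern (here "is my") or the third one.

Wrong solution: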

library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")

dat_new <- dat %>%
  mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))
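To make the failure concrete, here is a minimal sketch of the same alternation. Base R's gsub stands in for str_replace_all only to keep the sketch free of dependencies, and the variable names alternation and leftover are illustrative. The regex engine consumes characters from left to right, so a character that is already part of one match cannot take part in a later, overlapping match:

```r
x <- c("this is my example text",
       "and here is my other text example",
       "my other text is short")
my_patterns <- c("my example", "is my", "my other text")

# Same idea as str_c(my_patterns, collapse = "|"): one alternation of all patterns
alternation <- paste(my_patterns, collapse = "|")

# Matches cannot overlap: once "is my" is consumed in row 1,
# the "my" that "my example" needs is no longer available
leftover <- gsub(alternation, "", x)
print(leftover)
```

In rows 1 and 2 only "is my" is removed, so "example" and "other text" stay in the result.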

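I guess I could do something like loop over all patterns, collect the positions in dat$x where my patterns match, then combine them into a range and delete that range from the text. For example, I would add columns such as start_pattern_1 and end_pattern_1 to my dat data frame. For row 1, I then get 9 (start) and 18 (end) for the first pattern and 6/10 for the second pattern. Next I would need to check whether any end position overlaps with any start position (here start 9 and end 10), merge them into the range 6-18, and remove that range from the text.

The problem is that I might end up with a lot of new start/end columns (possibly a few hundred patterns in my case), and if I have to compare the ranges pairwise for overlaps, my computer will probably crash.

So I wonder how I could make this work or how best to approach it. Maybe (and I hope so) there is a better/more elegant/simpler solution.

Desired output for dat: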

x                                    some_other_cols    short_x
this is my example text              1                  this text
and here is my other text example    2                  and here example
my other text is short               2                  is short

Thanks for your help!

滕西拜

Uwe's comment under the question mentioned a newer option based on str_locate_all, which simplifies the code a lot:

library(stringr)

# Remove the matched parts of a text.
# text: a single string
# positions: a matrix with one row per match (columns: start, end)
remove_matching_parts <- function(text, positions) {
  if (nrow(positions) == 0) return(text)
  ret <- strsplit(text, "")[[1]]
  # Mark every character covered by a match; overlapping matches
  # simply mark the same characters more than once
  for (k in seq_len(nrow(positions))) {
    ret[positions[k, 1]:positions[k, 2]] <- NA
  }
  paste0(ret[!is.na(ret)], collapse = "")
}

# For each string, stack the start/end positions found for all patterns
# into one matrix (the result is a list with one matrix per string)
matches <- lapply(dat$x, function(x) {
  do.call(rbind, str_locate_all(x, my_patterns))
})

# Avoid growing a vector in a for loop: create it beforehand,
# with the same length as the vector we work against
dat$result <- vector("character", length(dat$x))
# Loop over each value to remove the matching parts
for (i in seq_along(dat$x)) {
  dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), matches[[i]])
}
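The range-merging idea from the question can also be sketched directly, without comparing ranges pairwise: sort the ranges by start, then sweep once and open a new group whenever a range starts past everything seen so far. This is only a sketch under that assumption; merge_ranges and remove_ranges are hypothetical helper names, not functions from any package:

```r
# Hypothetical helper: merge overlapping or adjacent [start, end] ranges
merge_ranges <- function(start, end) {
  if (length(start) == 0) {
    return(data.frame(start = integer(0), end = integer(0)))
  }
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  # Largest end among all earlier ranges (nothing precedes the first one)
  prev_max_end <- c(-Inf, head(cummax(end), -1))
  # A new group starts where a range begins past everything seen so far
  group <- cumsum(start > prev_max_end + 1)
  data.frame(start = as.integer(tapply(start, group, min)),
             end   = as.integer(tapply(end, group, max)))
}

# Hypothetical helper: drop the characters covered by the merged ranges
remove_ranges <- function(text, ranges) {
  chars <- strsplit(text, "")[[1]]
  drop <- unlist(Map(seq, ranges$start, ranges$end))
  paste(chars[setdiff(seq_along(chars), drop)], collapse = "")
}

# Row 1 of the question: "my example" spans 9-18, "is my" spans 6-10
merged <- merge_ranges(start = c(9, 6), end = c(18, 10))
shortened <- remove_ranges("this is my example text", merged)
```

For row 1 the two ranges collapse into the single range 6-18. The spaces on either side of the removed range are kept, so the result still contains a double space.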

If you have control over the pattern definitions and can write them by hand, a regex solution does the job:

> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this  text"        "and here  example" " is short"

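The idea is to build a pattern with optional parts (the ? after a grouping parenthesis).

So we roughly have:

(is )? <= optional "is" followed by a space
my  <= literal "my" followed by a space
(other text|example)? <= optional text after "my ", which can be "other text" or (|) "example"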
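The hand-written regex leaves the spaces that surrounded the removed text, which is why the output above shows doubled or leading spaces. A small sketch of one way to tidy them, assuming it is acceptable to collapse every run of spaces in the data (the names removed and cleaned are only illustrative):

```r
x <- c("this is my example text",
       "and here is my other text example",
       "my other text is short")

# Optional "is ", literal "my ", then optionally "other text" or "example"
removed <- gsub("(is )?my (other text|example)?", "", x)

# Collapse runs of spaces into one and trim both ends
cleaned <- trimws(gsub(" +", " ", removed))
print(cleaned)
```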

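If you don't have control over the patterns, things get messier. I hope I have commented the code enough to make it understandable. Given the number of loops involved, don't expect it to be fast: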

# Given data
dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short", "yet another text"),
                  some_other_cols = c(1, 2, 2, 4))

my_patterns <- c("my example", "is my", "my other text")

# Remove the matched parts of a text.
# text: a single string
# positions: a list with one element per pattern, each NULL (no match) or c(start, end)
remove_matching_parts <- function(text, positions) {
  ret <- strsplit(text, "")[[1]]
  for (pos in positions) {
    if (!is.null(pos)) ret[pos[1]:pos[2]] <- NA
  }
  paste0(ret[!is.na(ret)], collapse = "")
}

# Locate the first match of one pattern in each element of a character vector.
# Returns a list with one element per string: NULL (no match) or c(start, end)
match_pat_to_vector <- function(pattern, vector) {
  lapply(regexec(pattern, as.character(vector)), function(m) {
    if (m[1] > -1) {
      # end = start + match.length, i.e. one position past the match, so the
      # character right after the match (a space in this data) is removed too
      c(start = as.numeric(m[1]), end = as.numeric(m[1] + attr(m, "match.length")[1]))
    }
  })
}

# Loop over the patterns: one list per pattern, each with one entry per string
matches <- lapply(my_patterns, match_pat_to_vector, vector = dat$x)

# Avoid growing a vector in a for loop: create it beforehand,
# with the same length as the vector we work against
dat$result <- vector("character", length(dat$x))
# Loop over each value to remove the matching parts
for (i in seq_along(dat$x)) {
  positions_i <- lapply(matches, `[[`, i)  # i-th entry of every pattern
  dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), positions_i)
}

Result after running:

> dat
                                  x some_other_cols           result
1           this is my example text               1        this text
2 and here is my other text example               2 and here example
3            my other text is short               2         is short
4                  yet another text               4 yet another text
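One caveat: regexec reports only the first match of each pattern in a string, and it treats the patterns as regular expressions. If a pattern can occur several times, or may contain regex metacharacters, a variant along the following lines might help. It is only a sketch, using gregexpr with fixed = TRUE; remove_all_matches is a hypothetical name. Unlike the code above, it removes exactly the matched characters, so the surrounding spaces stay in place:

```r
# Hypothetical helper: remove every literal occurrence of every pattern,
# including occurrences that overlap with matches of other patterns
remove_all_matches <- function(text, patterns) {
  chars <- strsplit(text, "")[[1]]
  keep <- rep(TRUE, length(chars))
  for (p in patterns) {
    # fixed = TRUE matches the pattern literally, not as a regex
    hits <- gregexpr(p, text, fixed = TRUE)[[1]]
    if (hits[1] == -1) next  # pattern not found in this text
    lens <- attr(hits, "match.length")
    for (j in seq_along(hits)) {
      keep[hits[j]:(hits[j] + lens[j] - 1)] <- FALSE
    }
  }
  paste(chars[keep], collapse = "")
}

x <- c("this is my example text", "and here is my other text example",
       "my other text is short", "yet another text")
my_patterns <- c("my example", "is my", "my other text")

result <- vapply(x, remove_all_matches, character(1),
                 patterns = my_patterns, USE.NAMES = FALSE)
print(result)
```

Because the mask is built from the original text, the order of the patterns does not matter, and overlapping matches just flag the same characters more than once.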


Edited on 2021-01-22
