Remove parts of a string based on overlapping patterns

德申

I have the following data:

dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short"),
                  some_other_cols = c(1, 2, 2))

In addition, I have the following vector of patterns:

my_patterns <- c("my example", "is my", "my other text")

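What I want to achieve is to remove from dat$x every piece of text that occurs in my_patterns.

I tried the solution below, but the problem is that as soon as one pattern (here "my example") has been removed from a text, my solution can no longer detect the second pattern (here "is my") or the third pattern, because they overlap with it.

Wrong solution: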

library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")

dat_new <- dat %>%
  mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))
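
To make the failure concrete, here is a minimal base-R sketch (not part of the original attempt; the variable names are only illustrative). Matches of an alternation never overlap: the engine consumes the leftmost match, here "is my", and the "my" that "my example" would need is gone by then.

```r
# Illustrative only: the same alternation idea on the first example string
x <- "this is my example text"
two_patterns <- paste(c("my example", "is my"), collapse = "|")

# "is my" (positions 6-10) is matched and removed first;
# "my example" (positions 9-18) overlaps it and is never seen
out <- gsub(two_patterns, "", x)
out
# [1] "this  example text"
```

The leftover "example" is exactly the part that overlapping-aware removal has to catch.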

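I guess I could do something like this: loop over all patterns, collect the positions in dat$x where each pattern matches, combine them into a range, and delete that range from the text. For example, I would add columns such as start_pattern_1, end_pattern_1, and so on to my dat data frame. For row 1, I would then get 9 (start) and 18 (end) for the first pattern and 6/10 for the second pattern. Next, I would need to check whether any end position overlaps with any start position (here start 9 and end 10), merge them into the range 6-18, and delete that range from the text.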

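The problem is that I might end up with many new start/end columns (possibly a few hundred patterns in my case), and if I need to compare the ranges pairwise for overlaps, my computer will probably crash.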
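
As a rough sketch of the range-merging idea described above (the helper name merge_ranges is hypothetical and not from the original post), pairwise comparison can be avoided by sorting the ranges by start and sweeping over them once. The sketch assumes at least one range is passed in.

```r
# Hypothetical helper: merge overlapping start/end ranges in one sorted pass
merge_ranges <- function(start, end) {
  ord <- order(start)
  start <- start[ord]
  end <- end[ord]
  # largest end position seen so far, in start order
  run_end <- cummax(end)
  # a new merged range begins where a start lies beyond every earlier end
  new_group <- c(TRUE, start[-1] > head(run_end, -1))
  grp <- cumsum(new_group)
  data.frame(start = as.vector(tapply(start, grp, min)),
             end   = as.vector(tapply(end, grp, max)))
}

# Row 1 of the example: "my example" spans 9-18, "is my" spans 6-10
merged <- merge_ranges(start = c(9, 6), end = c(18, 10))
merged
#   start end
# 1     6  18
```

Sorting dominates the cost, so a few hundred ranges per string should not be a concern with this approach.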

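So I wonder how I could make this work, or how best to approach this solution. Maybe (and I hope so) there is a better, more elegant, or simpler solution.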

The desired output for dat is:

x                                    some_other_cols    short_x
this is my example text              1                  this text
and here is my other text example    2                  and here example
my other text is short               2                  is short

Thanks for your help!

滕西拜

Uwe mentioned a new option with str_locate_all in a comment under the question, which simplifies the code considerably:

library(stringr)
# Create a function to remove the matching parts of a text
# First argument is the text, second argument is a matrix of start/end positions
remove_matching_parts <- function(text, positions) {
  if (nrow(positions) == 0) return(text)
  ret <- strsplit(text, "")[[1]]
  for (i in seq_len(nrow(positions))) {
    ret[positions[i, 1]:positions[i, 2]] <- NA # mark the matched characters
  }
  paste0(ret[!is.na(ret)], collapse = "")
}

# Loop over the data to locate the patterns
# One list element per row of dat, each a matrix of start/end positions
matches <- lapply(as.character(dat$x), function(x) {
  do.call(rbind, str_locate_all(x, my_patterns)) # bind the list output of str_locate_all into one start/end table
})

# Avoid growing a vector in a for loop: create it beforehand, with the same length as the vector we work on
dat$result <- vector("character", length(dat$x))
# Loop over each value to remove the matching parts
for (i in seq_along(dat$x)) {
  dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), matches[[i]])
}
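
For readers without stringr, the same idea can be sketched in base R with a logical mask. This is an illustration rather than the code above: the function name remove_overlapping is hypothetical, and the patterns are assumed to be literal strings (hence fixed = TRUE).

```r
# Hypothetical base-R variant: flag every character covered by any pattern
remove_overlapping <- function(text, patterns) {
  chars <- strsplit(text, "")[[1]]
  keep <- rep(TRUE, length(chars))
  for (p in patterns) {
    m <- gregexpr(p, text, fixed = TRUE)[[1]] # all match starts of p
    if (m[1] == -1) next                      # pattern not found
    len <- attr(m, "match.length")
    for (k in seq_along(m)) {
      keep[m[k]:(m[k] + len[k] - 1)] <- FALSE # overlaps simply flag twice
    }
  }
  paste(chars[keep], collapse = "")
}

x <- c("this is my example text",
       "and here is my other text example",
       "my other text is short")
my_patterns <- c("my example", "is my", "my other text")

res <- vapply(x, remove_overlapping, character(1),
              patterns = my_patterns, USE.NAMES = FALSE)
res
# [1] "this  text"        "and here  example" " is short"

# Optional cleanup of the spaces left around the removed parts
res_clean <- trimws(gsub(" +", " ", res))
res_clean
# [1] "this text"        "and here example" "is short"
```

Because all positions are located on the untouched text before anything is deleted, overlapping patterns no longer hide one another.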

If you control the pattern definitions and can write them by hand, you can achieve this with a regex solution:

> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this  text"        "and here  example" " is short" 

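The idea is to create a pattern with optional parts (the ? after the grouping parentheses).

So we roughly have:

  • (is )? <= an optional "is" followed by a space
  • my <= a literal "my" followed by a space
  • (other text|example)? <= optional text after "my ", which is either "other text" or (|) "example"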
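
To see which optional parts take part in each match, the matched substrings can be extracted instead of deleted. This is a small illustrative sketch using base R's regexpr and regmatches on the example strings; the variable names are only for illustration.

```r
pattern <- "(is )?my (other text|example)?"
x <- c("this is my example text",
       "and here is my other text example",
       "my other text is short")

# First match per string; the optional groups are included when present
matched <- regmatches(x, regexpr(pattern, x))
matched
# [1] "is my example"    "is my other text" "my other text"
```

In the third string the leading "(is )?" group is simply skipped, which is what makes a single pattern cover all three overlapping cases.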

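If you don't control the patterns, things get messier. I hope I have commented the code well enough for it to be understandable. Given the number of loops it contains, don't expect it to be fast.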

# Given data
dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short","yet another text"),
                some_other_cols = c(1, 2, 2, 4))

my_patterns <- c("my example", "is my", "my other text")

# Create a function to remove the matching parts of a text
# First argument is the text, second argument is a list of c(start, end) vectors (NULL = no match)
remove_matching_parts <- function(text, positions) {
  ret <- strsplit(text, "")[[1]]
  for (pos in positions) {
    if (!is.null(pos)) ret[pos[1]:pos[2]] <- NA # mark the matched characters
  }
  paste0(ret[!is.na(ret)], collapse = "")
}

# Create the matches between a vector and a pattern
# First argument is the pattern to match, second is the vector of characters
# Returns one element per string: c(start, end), or NULL when there is no match
match_pat_to_vector <- function(pattern, vector) {
  lapply(regexec(pattern, vector),
         function(x) {
           if (x[1] > -1) {
             # start + match.length is one past the match, so the character after it (here a space) is removed too
             c(start = as.numeric(x[1]), end = as.numeric(x[1] + attr(x, "match.length")[1]))
           }
         })
}

# Loop over the patterns to create a list of matches
# One entry per pattern, each holding one entry per row of dat
matches <- lapply(my_patterns, match_pat_to_vector, vector = as.character(dat$x))

# Avoid growing a vector in a for loop: create it beforehand, with the same length as the vector we work on
dat$result <- vector("character", length(dat$x))
# Loop over each value to remove the matching parts
for (i in seq_along(dat$x)) {
  dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), lapply(matches, `[[`, i))
}

Result after running:

> dat
                                  x some_other_cols           result
1           this is my example text               1        this text
2 and here is my other text example               2 and here example
3            my other text is short               2         is short
4                  yet another text               4 yet another text
