Filter rows based on a dynamic pattern

Chris Ruehlemann

I have speech data in a data frame df, in the column Orthographic:

df <- data.frame(
  Orthographic = c("this is it at least probably",
                   "well not probably it's not intuitive",
                   "sure no it's I mean it's very intuitive",
                   "I don't mean to be rude but it's anything but you know",
                   "well okay maybe"),
  Repeat = c(NA, "probably", "it's,intuitive", "I,mean,it's", NA),
  Repeat_pattern = c(NA, "\\b(probably)\\b", "\\b(it's|intuitive)\\b", "\\b(I,mean|it's)\\b", 
                     NA))

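I want to filter rows based on a dynamic pattern, namely the occurrence of no, never, or not as whole words, or of n't, before any of the words listed in column Repeat. However, when I combine the pattern \\b(no|never|not)\\b|n't\\b\\s with the alternation patterns in column Repeat_pattern, I get this warning: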

df %>%
   filter(grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), Orthographic))
                             Orthographic         Repeat         Repeat_pattern
1    well not probably it's not intuitive       probably       \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
Warning message:
In grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern),  :
  argument 'pattern' has length > 1 and only the first element will be used

I don't understand why "only the first element will be used", since the two pattern components seem to concatenate just fine:

paste0("\\b(no|never|not)\\b|n't\\b\\s", df$Repeat_pattern)
[1] "\\b(no|never|not)\\b|n't\\b\\sNA"                     "\\b(no|never|not)\\b|n't\\b\\s\\b(probably)\\b"      
[3] "\\b(no|never|not)\\b|n't\\b\\s\\b(it's|intuitive)\\b" "\\b(no|never|not)\\b|n't\\b\\s\\b(I,mean|it's)\\b"   
[5] "\\b(no|never|not)\\b|n't\\b\\sNA"

The expected output is this:

2                   well not probably it's not intuitive       probably       \\b(probably)\\b
3                sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
4 I don't mean to be rude but it's anything but you know    I,mean,it's    \\b(I,mean|it's)\\b

Wiktor Stribiżew

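This looks like a vectorization issue: grepl takes a single pattern, so you need to use stringr::str_detect here instead, which is vectorized over both the string and the pattern.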
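
The difference can be seen on a toy example. This is a minimal sketch, not part of the original answer: the vectors x and patterns are made up for illustration, and the mapply call is only one possible base R equivalent.

```r
library(stringr)

# Hypothetical toy data: one pattern per string
x        <- c("not probably", "no it's fine", "maybe")
patterns <- c("probably", "it's", "never")

# str_detect() pairs string i with pattern i
vectorized <- str_detect(x, patterns)
# TRUE TRUE FALSE

# What grepl() effectively does with several patterns: only the first is used
first_only <- grepl(patterns[1], x)
# TRUE FALSE FALSE

# A base R way to get the pairwise behaviour without stringr
base_equiv <- unname(mapply(grepl, patterns, x, MoreArgs = list(perl = TRUE)))
# TRUE TRUE FALSE
```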

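Also, the negation alternatives are not grouped correctly: all of them, n't included, must sit inside a single group. As written, the top-level | splits the pattern in two, so the first branch \\b(no|never|not)\\b matches any row containing one of those words, regardless of what follows.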
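
To illustrate the grouping point, here is a small sketch using base R's grepl with perl = TRUE. The test sentence is invented for the example; the two patterns are the question's original and the regrouped one.

```r
# Pattern as built in the question: "|" splits it into two independent branches
ungrouped <- "\\b(no|never|not)\\b|n't\\b\\s\\b(probably)\\b"

# All negation alternatives in one non-capturing group, followed by the word
grouped <- "(?:\\bno|\\bnever|\\bnot|n't)\\b.*\\b(probably)\\b"

# Hypothetical sentence: contains "no" but not "probably"
x <- "well okay no idea"

hit_ungrouped <- grepl(ungrouped, x, perl = TRUE)
# TRUE: the first branch matches "no" on its own

hit_grouped <- grepl(grouped, x, perl = TRUE)
# FALSE: a negation must be followed by "probably"
```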

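In addition, the NA values are coerced to the text "NA" and appended to the regex pattern, whereas you seem to want to discard the rows where Repeat_pattern is NA.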
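
The coercion is easy to check in isolation. The following sketch uses an invented test sentence to show how the literal "NA" can produce a false hit; the variable names are illustrative only.

```r
negation       <- "(?:\\bno|\\bnever|\\bnot|n't)\\b.*"
repeat_pattern <- c(NA, "\\b(probably)\\b")

# paste0() turns NA into the two letters "NA"
combined <- paste0(negation, repeat_pattern)

na_survives <- is.na(combined)
# FALSE FALSE: the missing value is gone after pasting

# Hypothetical sentence where the literal "NA" matches inside "NASA"
false_hit <- grepl(combined[1], "I have not joined NASA", perl = TRUE)
# TRUE

# Testing the original column still identifies the rows to drop
to_drop <- is.na(repeat_pattern)
# TRUE FALSE
```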

You can fix your code like this:

library(dplyr)
library(stringr)
df %>%
    filter(ifelse(is.na(Repeat_pattern), FALSE, str_detect(Orthographic, paste0("(?:\\bno|\\bnever|\\bnot|n't)\\b.*", Repeat_pattern))))

Output:

                                            Orthographic         Repeat         Repeat_pattern
1                   well not probably it's not intuitive       probably       \\b(probably)\\b
2                sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
3 I don't mean to be rude but it's anything but you know    I,mean,it's    \\b(I|mean|it's)\\b

I also think the last pattern should be \\b(I|mean|it's)\\b, not \\b(I,mean|it's)\\b.
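
One way to avoid a stray comma like this is to derive the patterns from the Repeat column instead of typing them by hand. The thread does not say how Repeat_pattern was created, so this is only a sketch, and it assumes the listed words contain no regex metacharacters.

```r
Repeat <- c(NA, "probably", "it's,intuitive", "I,mean,it's", NA)

# Replace each comma with "|" and wrap the alternation in \b( ... )\b;
# rows without repeated words stay NA
Repeat_pattern <- ifelse(
  is.na(Repeat),
  NA_character_,
  paste0("\\b(", gsub(",", "|", Repeat, fixed = TRUE), ")\\b")
)

Repeat_pattern[4]
# "\\b(I|mean|it's)\\b"
```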

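If only whitespace may occur between the "no" word and the word from the Repeat column, replace .* with \\s+ in my pattern. I used .* so that the match can occur anywhere to the right of the "no" word.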
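
The two variants differ only when other words stand between the negation and the repeated word. A short sketch with invented sentences (x and rp are not from the question's data):

```r
library(stringr)

# Hypothetical sentences: negation directly before the word vs. one word apart
x  <- c("it's not intuitive", "it's not very intuitive")
rp <- c("\\b(intuitive)\\b", "\\b(intuitive)\\b")

negation <- "(?:\\bno|\\bnever|\\bnot|n't)\\b"

# .* allows anything between the negation and the repeated word
loose <- str_detect(x, paste0(negation, ".*", rp))
# TRUE TRUE

# \s+ allows only whitespace between them
strict <- str_detect(x, paste0(negation, "\\s+", rp))
# TRUE FALSE
```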
