Extracting from scraped data with regex lookarounds is not working

Jeff Henderson

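I'm trying to figure out why one of my regex commands works and the other doesn't. Here are two example strings to extract from. The newline junk left over from scraping is consistent, so I've tried to use that to my advantage: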

"\n\tMenghe a'Nyam\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-5, 215lb (196cm, 
97kg) \n  \n\n  \n\n  \n  \n  \n\n  School: Canisius\n\n\n\n\n\n  More player info\n\n\n\n\n\n"

"\n\tJordan Aaberg\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Guard\n\n\n\n  6-9, 225lb (206cm, 
102kg) \n  \n\n  Hometown: Rothsay, MN\n\n\n\n  \n\n  High School: Rothsay\n\n\n\n  \n  \n  \n\n  
School: North Dakota State\n\n\n\n\n\n  More player info\n\n\n\n\n\n"

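My goal is to pull the data I need out of these strings, such as position (Forward and Guard, respectively) and, most importantly, height (6-5 and 6-9, respectively). The following works for position: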

test <- df %>%
  mutate(position = str_extract(player, "(?<=Position:\n  \n  ).*?(?=\n\n\n\n  \\d-\\d)")) 

However, when I follow a similar approach to add another column for height, it returns NA:

test <- df %>%
  mutate(position = str_extract(player, "(?<=Position:\n  \n  ).*?(?=\n\n\n\n  \\d-\\d)")) %>%
  mutate(height = str_extract(player, "(?<=\\w+\n\n\n\n  ).*?(?=, \\d{3}lb)"))

In case it helps, here is the dput() output for the first 3 rows of the df above:

structure(list(player = c("\n\tMenghe a'Nyam\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-5, 215lb (196cm, 97kg) \n  \n\n  \n\n  \n  \n  \n\n  School: Canisius\n\n\n\n\n\n  More player info\n\n\n\n\n\n"  , 
"\n\tJordan Aaberg\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-9, 225lb (206cm, 102kg) \n  \n\n  Hometown: Rothsay, MN\n\n\n\n  \n\n  High School: Rothsay\n\n\n\n  \n  \n  \n\n  School: North Dakota State\n\n\n\n\n\n  More player info\n\n\n\n\n\n"  , 
"\n\tKarl Aaker\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-5, 210lb (196cm, 95kg) \n  \n\n  Hometown: Reno, NV\n\n\n\n  \n\n  \n  \n  \n\n  School: Portland\n\n\n\n\n\n  More player info\n\n\n\n\n\n"  
), position = c("Forward", "Forward", "Forward"), height = c(NA_character_, 
NA_character_, NA_character_)), row.names = c(NA, 3L), class = "data.frame")    

Wiktor Stribiżew

You can remove the + after \w, because the ICU regex engine does not support patterns of unbounded length inside lookbehinds, and use \s to match any whitespace:

test <- df %>%
  mutate(position = str_extract(player, "(?<=Position:\n  \n  ).*?(?=\n\n\n\n  \\d-\\d)")) %>%
  mutate(height = str_extract(player, "(?<=\\w\n{4}\\s{2}).*?(?=,\\s+\\d{3}lb)"))

See the regex demo.
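To check the pattern locally without the full data frame, here is a minimal sketch that runs it against one of the sample strings from the question. It assumes only that the stringr package is installed; the variable names `player` and `height` are chosen for illustration.

```r
library(stringr)

# One of the sample strings from the question
player <- "\n\tMenghe a'Nyam\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-5, 215lb (196cm, 97kg) \n  \n\n  \n\n  \n  \n  \n\n  School: Canisius\n\n\n\n\n\n  More player info\n\n\n\n\n\n"

# Bounded lookbehind: one word char, four newlines, two whitespace chars.
# The lazy .*? then stops at the first ", <3 digits>lb".
height <- str_extract(player, "(?<=\\w\n{4}\\s{2}).*?(?=,\\s+\\d{3}lb)")

print(height)
```

For this string the extracted value is "6-5".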

Details

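  • (?<=\w\n{4}\s{2}) - immediately before the match, there must be a word char, then 4 newlines, then any 2 whitespace chars
  • .*? - any 0 or more chars other than line break chars, as few as possible
  • (?=,\s+\d{3}lb) - immediately after the match, there must be a comma, one or more whitespace chars, 3 digits, and the lb substring.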
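As a possible alternative (not part of the answer above), capture groups avoid lookbehinds altogether, so the bounded-length restriction never comes into play. The sketch below uses stringr's str_match() on the two sample strings from the question. It assumes the position is always a single word and that the height has the form digit-dash-digits; both are assumptions about the scraped data, and the pattern would need adjusting if they do not hold.

```r
library(stringr)

# The two sample strings from the question
player <- c(
  "\n\tMenghe a'Nyam\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Forward\n\n\n\n  6-5, 215lb (196cm, 97kg) \n  \n\n  \n\n  \n  \n  \n\n  School: Canisius\n\n\n\n\n\n  More player info\n\n\n\n\n\n",
  "\n\tJordan Aaberg\n\t\n\n  \n\n  \n\n  \n\n  \n  Position:\n  \n  Guard\n\n\n\n  6-9, 225lb (206cm, 102kg) \n  \n\n  Hometown: Rothsay, MN\n\n\n\n  \n\n  High School: Rothsay\n\n\n\n  \n  \n  \n\n  School: North Dakota State\n\n\n\n\n\n  More player info\n\n\n\n\n\n"
)

# Group 1: position (assumed to be one word); group 2: height such as 6-5.
# \\s+ absorbs whatever mix of newlines and spaces sits between the fields.
m <- str_match(player, "Position:\\s+(\\w+)\\s+(\\d-\\d+),\\s+\\d{3}lb")

# Column 1 of the result is the full match; the groups start at column 2
position <- m[, 2]
height <- m[, 3]

print(position)
print(height)
```

Inside a pipeline, the same columns could be assigned in a mutate() call. The trade-off is that one pattern now has to match both fields, so a row missing either field yields NA for both.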
