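I'm trying to figure out why one of my regex commands works while the other doesn't. Here are two sample strings I'm extracting from. The newline junk left over from scraping is consistent, so I've tried to take advantage of that as much as possible: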
"\n\tMenghe a'Nyam\n\t\n\n \n\n \n\n \n\n \n Position:\n \n Forward\n\n\n\n 6-5, 215lb (196cm,
97kg) \n \n\n \n\n \n \n \n\n School: Canisius\n\n\n\n\n\n More player info\n\n\n\n\n\n"
"\n\tJordan Aaberg\n\t\n\n \n\n \n\n \n\n \n Position:\n \n Guard\n\n\n\n 6-9, 225lb (206cm,
102kg) \n \n\n Hometown: Rothsay, MN\n\n\n\n \n\n High School: Rothsay\n\n\n\n \n \n \n\n
School: North Dakota State\n\n\n\n\n\n More player info\n\n\n\n\n\n"
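My goal is to pull the data I need out of these strings, such as position (Forward and Guard, respectively) and, most importantly, height (6-5 and 6-9, respectively). The following works: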
library(dplyr)    # %>% and mutate()
library(stringr)  # str_extract()
test <- df %>%
mutate(position = str_extract(player, "(?<=Position:\n \n ).*?(?=\n\n\n\n \\d-\\d)"))
However, when I add another column for height using a similar approach, it returns NA:
test <- df %>%
mutate(position = str_extract(player, "(?<=Position:\n \n ).*?(?=\n\n\n\n \\d-\\d)")) %>%
mutate(height = str_extract(player, "(?<=\\w+\n\n\n\n ).*?(?=, \\d{3}lb)"))
In case it helps, here is the dput() output for the first 3 rows of the resulting df:
structure(list(player = c("\n\tMenghe a'Nyam\n\t\n\n \n\n \n\n \n\n \n Position:\n \n Forward\n\n\n\n 6-5, 215lb (196cm, 97kg) \n \n\n \n\n \n \n \n\n School: Canisius\n\n\n\n\n\n More player info\n\n\n\n\n\n" ,
"\n\tJordan Aaberg\n\t\n\n \n\n \n\n \n\n \n Position:\n \n Forward\n\n\n\n 6-9, 225lb (206cm, 102kg) \n \n\n Hometown: Rothsay, MN\n\n\n\n \n\n High School: Rothsay\n\n\n\n \n \n \n\n School: North Dakota State\n\n\n\n\n\n More player info\n\n\n\n\n\n" ,
"\n\tKarl Aaker\n\t\n\n \n\n \n\n \n\n \n Position:\n \n Forward\n\n\n\n 6-5, 210lb (196cm, 95kg) \n \n\n Hometown: Reno, NV\n\n\n\n \n\n \n \n \n\n School: Portland\n\n\n\n\n\n More player info\n\n\n\n\n\n"
), position = c("Forward", "Forward", "Forward"), height = c(NA_character_,
NA_character_, NA_character_)), row.names = c(NA, 3L), class = "data.frame")
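You can remove the + after \w, because the ICU regex engine (which stringr uses) does not support patterns that can match strings of unbounded length inside lookbehinds, and use \s to match any whitespace: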
test <- df %>%
mutate(position = str_extract(player, "(?<=Position:\n \n ).*?(?=\n\n\n\n \\d-\\d)")) %>%
mutate(height = str_extract(player, "(?<=\\w\n{4}\\s{2}).*?(?=,\\s+\\d{3}lb)"))
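To see the bounded lookbehind at work outside the data frame, here is a minimal sketch on a single toy string. The string `x` and its exact spacing (four newlines followed by exactly two spaces) are assumptions made for illustration, not the real scraped data; only `str_extract()` from stringr is used.

```r
library(stringr)

# Toy input: the spacing is an assumption (4 newlines, then exactly 2 spaces).
x <- "Position:\n \n Forward\n\n\n\n  6-5, 215lb (196cm, 97kg)"

# Fixed-length lookbehind: one word char, 4 newlines, 2 whitespace chars.
# The lazy .*? then stops at the first comma followed by a 3-digit weight in lb.
height <- str_extract(x, "(?<=\\w\n{4}\\s{2}).*?(?=,\\s+\\d{3}lb)")
print(height)
```

Under the stated spacing assumption this should print "6-5". If the real data has a different number of spaces after the newlines, the count in \s{2} has to be adjusted to match.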
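See the regex demo.
Details
(?<=\w\n{4}\s{2}) - immediately before the match there must be a word char, then 4 newlines, then any 2 whitespace chars
.*? - any 0 or more chars other than line break chars, as few as possible
(?=,\s+\d{3}lb) - immediately after the match there must be a comma, 1 or more whitespace chars, 3 digits and the substring lb.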
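Since the lookbehind only serves to locate the height, a whitespace-tolerant variant that avoids lookbehinds altogether is also possible. The following is a sketch of an alternative, not the approach given above. It assumes the height always has the form digits-hyphen-digits directly before a comma and a weight ending in lb, and that the position is a single word after "Position:". The vector `player` is a shortened, hypothetical stand-in for the scraped column.

```r
library(stringr)

# Shortened, hypothetical stand-ins for the scraped strings.
player <- c(
  "\n\tMenghe a'Nyam\n\t\n\n Position:\n \n Forward\n\n\n\n 6-5, 215lb (196cm, 97kg) \n",
  "\n\tJordan Aaberg\n\t\n\n Position:\n \n Guard\n\n\n\n 6-9, 225lb (206cm, 102kg) \n"
)

# Height: digits-hyphen-digits followed by a comma, optional whitespace,
# and a weight in lb (checked by the lookahead, not included in the match).
height <- str_extract(player, "\\d+-\\d+(?=,\\s*\\d+lb)")

# Position: first word after "Position:", taken from capture group 1,
# which is column 2 of the matrix returned by str_match().
position <- str_match(player, "Position:\\s*(\\w+)")[, 2]

print(data.frame(position, height))
```

Because \s* and \s+ absorb any run of spaces and newlines, this version does not depend on the exact amount of whitespace between fields, at the cost of relying on the shape of the values themselves.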