I have a column containing multiple tweets:
ID | Tweet
1 | @ChipotleTweets @ChipotleTweets Becky is very nice
2 | Happy Halloween! I now look forward to $3 booritos at @ChipotleTweets
3 | Considering walking to @.ChipotleTweets in my llama onesie.
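The goal is to remove each '@___' mention, meaning the @ and everything that follows it up to the next space, without removing any text outside that string.
I am currently playing with this code to detect the "@", but if it is not in the first position of the sentence, I am not picking up anything: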
tweet_pattern <- " @\\w+"
Customer <- Customer %>%
  mutate(clean_Tweet = ifelse(str_detect(text, tweet_pattern),
                              str_remove(text, tweet_pattern),
                              NA_character_))
Desired output:
ID | Tweet | cleaned_tweet
1 | @ChipotleTweets @ChipotleTweets Becky is very nice | Becky is very nice
2 | Happy Halloween! I now look forward to $3 booritos at @ChipotleTweets | Happy Halloween! I now look forward to $3 booritos at
3 | Considering walking to @.ChipotleTweets in my llama onesie. | Considering walking to in my llama onesie.
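We can change the pattern to match zero or more whitespace characters (\\s*), followed by @ and one or more non-whitespace characters (\\S+), and use str_remove_all to remove those substrings: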
library(stringr)
library(dplyr)
Customer %>%
mutate(Cleaned_Tweet = str_remove_all(Tweet, "\\s*@\\S+"))
-output
ID Tweet Cleaned_Tweet
1 1 @ChipotleTweets @ChipotleTweets Becky is very nice Becky is very nice
2 2 Happy Halloween! I now look forward to $3 booritos at @ChipotleTweets Happy Halloween! I now look forward to $3 booritos at
3 3 Considering walking to @.ChipotleTweets in my llama onesie. Considering walking to in my llama onesie.
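Note: str_remove removes only the first instance of a match. If a single string contains several matches, it removes the first one and leaves the rest untouched. We need str_remove_all to remove every instance of the matching pattern.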
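To make the difference concrete, here is a minimal sketch that applies both functions to the first tweet from the question. The variable names (x, pattern, first_only, all_removed) are only illustrative and are not part of the original code.

```r
library(stringr)

# First tweet from the question: it contains two mentions
x <- "@ChipotleTweets @ChipotleTweets Becky is very nice"
pattern <- "\\s*@\\S+"

# str_remove() drops only the first match, so the second mention survives
first_only <- str_remove(x, pattern)

# str_remove_all() drops every match of the pattern
all_removed <- str_remove_all(x, pattern)

print(first_only)
print(all_removed)
```

Because the pattern only consumes whitespace that comes before a mention, the space after a mention at the start of the string is left in place, so all_removed still begins with a space.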
Customer <- structure(list(ID = 1:3, Tweet = c("@ChipotleTweets @ChipotleTweets Becky is very nice",
"Happy Halloween! I now look forward to $3 booritos at @ChipotleTweets",
"Considering walking to @.ChipotleTweets in my llama onesie."
)), class = "data.frame", row.names = c(NA, -3L))
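If the leftover leading space in the first row matters, one possible follow-up (an assumption on my part, not something the question asked for) is to wrap the call in str_squish, which trims both ends and collapses repeated internal whitespace. The sketch below uses a plain character vector named tweets, a hypothetical stand-in for the Tweet column, so it runs on its own.

```r
library(stringr)

# Same three tweets as in the sample data
tweets <- c("@ChipotleTweets @ChipotleTweets Becky is very nice",
            "Happy Halloween! I now look forward to $3 booritos at @ChipotleTweets",
            "Considering walking to @.ChipotleTweets in my llama onesie.")

# Remove every mention, then tidy up any whitespace left behind
cleaned <- str_squish(str_remove_all(tweets, "\\s*@\\S+"))

print(cleaned)
```

Inside the dplyr pipeline, the same idea would be mutate(Cleaned_Tweet = str_squish(str_remove_all(Tweet, "\\s*@\\S+"))).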