I'm trying to analyze a list of emails stored in a data frame column (data$Email.Address), and I want to start by splitting each email into parts, so that example1@gmail.com, example2@outlook.org, and example3@comcast.net end up like this:
  email                 firstpart secondpart thirdpart
1 example1@gmail.com    example1  gmail      com
2 example2@outlook.org  example2  outlook    org
3 example3@comcast.net  example3  comcast    net
With my current code, however, I can't match all strings, since some include domains like some-url.com or us.army.mil. This means that example4@us.army.mil shows up as:
  email                 firstpart secondpart thirdpart
4 example4@us.army.mil  example4  us         army
My goal is to read "some-url" or "us.army" as the second part, and "com" or "mil" as the third part, so that it shows up like this:
  email                 firstpart secondpart thirdpart
4 example4@us.army.mil  example4  us.army    mil
Here's the code I have:
library(tidyverse)
library(dplyr)
library(stringr)
library(rebus)
email_pattern <- capture(one_or_more(WRD)) %R%
  "@" %R% capture(one_or_more(x = WRD)) %R%
  DOT %R% capture(one_or_more(WRD))
# Split the emails into parts based on the pattern
email_parts <- str_match(data$Email.Address, pattern = email_pattern)
How can I change the code so that all the domains can be read? Thank you!
Using stringi and data.table's tstrsplit():
library(stringi)
library(data.table)
# Replace the last "." with "@", then split on "@" to get three parts
df[paste0("part", 1:3)] <-
  tstrsplit(stri_replace_last(df$email, fixed = ".", "@"), split = "@")
  email                 part1    part2   part3
1 example1@gmail.com    example1 gmail   com
2 example2@outlook.org  example2 outlook org
3 example3@comcast.net  example3 comcast net
4 example4@us.army.mil  example4 us.army mil
Reproducible data (please provide yourself next time):
df <- data.frame(
  email = c(
    "example1@gmail.com", "example2@outlook.org",
    "example3@comcast.net", "example4@us.army.mil"
  )
)
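If you would rather stay with a regex, here is a minimal base R sketch of the same idea (an alternative to the stringi call above, not a fix of the original rebus pattern). A greedy group takes everything between the "@" and the last dot, so dots and hyphens inside the domain stay in the second part. The `emails` vector, including the some-url.com address, is made up for illustration; the column names follow the question.

```r
# Hypothetical sample addresses, modelled on the ones in the question
emails <- c("example1@gmail.com", "example4@us.army.mil", "example5@some-url.com")

# ([^@]+)  : everything before the "@"
# (.+)     : greedy, so it runs up to the LAST dot of the domain
# ([^.]+)$ : the final dot-free piece (the top-level domain)
pattern <- "^([^@]+)@(.+)\\.([^.]+)$"

# regexec() + regmatches() return, per string, the full match plus each group
m <- regmatches(emails, regexec(pattern, emails))

# Assumes every address matches; a non-matching string yields character(0)
# and would need to be filtered out before binding the rows together
parts <- do.call(rbind, m)
colnames(parts) <- c("email", "firstpart", "secondpart", "thirdpart")
parts <- as.data.frame(parts, stringsAsFactors = FALSE)
```

The same pattern string should also work with stringr::str_match(), which returns a comparable matrix of the full match and the capture groups.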