I am trying to match city/county names (fortunately information on state names are provided) with their corresponding state names and then append their first three phone digit as another column using left_join()
. My initial thought would be to replicate the city/county name column and then replacing them with their state names using sapply()
along with grep()
, then using left_join()
to merge it with phone digit column, but it doesn't seem that my code works.
library(dplyr)
location <- data.frame(location = c('Asortia, New York', 'Buffalo, New York', 'New York, New York', 'Alexandra, Virginia', 'Fairfax, Virginia', 'Baltimore, Maryland', 'Springfield, Maryland'), number = c(100, 200, 300, 400, 500, 600, 700))
state <- data.frame(state = c('New York', 'Virginia', 'Maryland'))
sapply(as.character(state$state), function(i) grep(i, location$location))
### doesn't work! ###
### my desired output would be ###
location number
1 New York 100
2 New York 200
3 New York 300
4 Virginia 400
5 Virginia 500
6 Maryland 600
7 Maryland 700
left_join
to merge the output generated from above with their three digit phone number. For example,
df <- location
names(df)[1] <- 'state'
digit <- data.frame(state = c('New York', 'Virginia', 'Maryland'), digit = c(212, 703, 410))
new_df <- left_join(df, digit, by = 'state')
### the desired output ###
location number digit
1 New York 100 212
2 New York 200 212
3 New York 300 212
4 Virginia 400 703
5 Virginia 500 703
6 Maryland 600 410
7 Maryland 700 410
I have referenced this and this thread, but didn't quite get the clue. Hope someone could help me on this.
## UpdateI found that using grepl
in a for loop
also works, but the processing may be slow if you have large amount of data (the data I'm working on has two million observations).
for (i in state$state) {
location$location[grepl(i, location$location)] <- i
}
May be we can use str_remove
by paste
ing (str_c
) the pattern
vector in 'state' column from 'state' dataset as a regex lookaround to match anything that precedes the vector (to remove)
library(stringr)
library(dplyr)
location %>%
mutate(location = str_remove(location, str_c(".*(?=(",
str_c(state$state, collapse = "|"), "))")))
# location number
#1 New York 100
#2 New York 200
#3 New York 300
#4 Virginia 400
#5 Virginia 500
#6 Maryland 600
#7 Maryland 700
Or another option is to separate
into two column and remove the first
library(tidyr)
location %>%
separate(location, into = c('unwanted', 'location'), sep=",\\s*") %>%
select(-unwanted)
Or if we have a specific pattern, remove the prefix part by matching one or more characters that are not a ,
from the start (^
) followed by ,
and zero or more spaces (\\s*
) as pattern in the str_remove
location %>%
mutate(location = str_remove(location, '^[^,]+,\\s*'))
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments