I am having difficulties with matching multiple strings in r. The dataframe that I have looks like this:
Var1 Var2
1 SJDJWK P04TGI7F3;P030Y7Y11;PE35RV747;Q2UKLVVX4
2 ODJSMDK Q2UKLVVX4;PWER00711;PE35RV747;Q2UKLVVX4
3 JDKSAKDJ PE35RV747;P0F071G1G
I would like to match the strings that are separated with ";" with the values in the following dataframe:
Var_x Var_y
1 P04TGI7F3 good
2 P030Y7Y11 normal
3 PE35RV747 bad
4 Q2UKLVVX4 normal
So that the resulting dataframe looks like:
Var1 Var2 Var3
1 SJDJWK P04TGI7F3;P030Y7Y11;PE35RV747;Q2UKLVVX4 good;normal;bad;normal
2 ODJSMDK Q2UKLVVX4;PWER00711;PE35RV747;Q2UKLVVX4 normal;bad;normal
3 JDKSAKDJ PE35RV747;P0F071G1G bad
So far, I tried to do this with a fuzzy join:
fuzzy_left_join(Data1, Data2, by = c("Var2"="Var_x"), match_fun = str_detect)
This does the job, but it uses a lot of memory (my dataset is very large and R stops working). I was trying to do this with a for loop, but I cannot figure out how to do it. Someone who knows?
Here is an idea via tidyverse
. We separate the rows, merge on the second data frame and again concatenate based on Var1
,
library(tidyverse)
df1 %>%
separate_rows(Var2) %>%
left_join(df2, by = c('Var2' = 'Var_x')) %>%
group_by(Var1) %>%
summarise_all(funs(paste(., collapse = ';')))
which gives,
# A tibble: 3 x 3 Var1 Var2 Var_y <fct> <chr> <chr> 1 JDKSAKDJ PE35RV747;P0F071G1G bad;NA 2 ODJSMDK Q2UKLVVX4;PWER00711;PE35RV747;Q2UKLVVX4 normal;NA;bad;normal 3 SJDJWK P04TGI7F3;P030Y7Y11;PE35RV747;Q2UKLVVX4 good;normal;bad;normal
If you do not want to include NAs
, we can omit before joining (as @akrun mentions), i.e.
df1 %>%
separate_rows(Var2) %>%
filter(Var2 %in% df2$Var_x) %>%
left_join(df2, by = c('Var2' = 'Var_x')) %>%
group_by(Var1) %>%
summarise_all(funs(paste(., collapse = ';')))
which gives,
# A tibble: 3 x 3 Var1 Var2 Var_y <fct> <chr> <chr> 1 JDKSAKDJ PE35RV747 bad 2 ODJSMDK Q2UKLVVX4;PE35RV747;Q2UKLVVX4 normal;bad;normal 3 SJDJWK P04TGI7F3;P030Y7Y11;PE35RV747;Q2UKLVVX4 good;normal;bad;normal
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments