Matching multiple strings and join

PieterD

I am having difficulty matching multiple strings in R. My data frame looks like this:

      Var1                                      Var2
1   SJDJWK   P04TGI7F3;P030Y7Y11;PE35RV747;Q2UKLVVX4
2  ODJSMDK   Q2UKLVVX4;PWER00711;PE35RV747;Q2UKLVVX4
3 JDKSAKDJ                       PE35RV747;P0F071G1G

I would like to match each of the strings separated by ";" in Var2 against the Var_x values in the following data frame:

      Var_x    Var_y
1 P04TGI7F3     good
2 P030Y7Y11   normal
3 PE35RV747      bad
4 Q2UKLVVX4   normal

The resulting data frame should look like this:

      Var1                                      Var2                    Var3
1   SJDJWK   P04TGI7F3;P030Y7Y11;PE35RV747;Q2UKLVVX4  good;normal;bad;normal
2  ODJSMDK   Q2UKLVVX4;PWER00711;PE35RV747;Q2UKLVVX4       normal;bad;normal
3 JDKSAKDJ                       PE35RV747;P0F071G1G                     bad

So far, I tried to do this with a fuzzy join:

fuzzy_left_join(Data1, Data2, by = c("Var2"="Var_x"), match_fun = str_detect)

This does the job, but it uses a lot of memory (my dataset is very large and R stops working). I tried to do this with a for loop instead, but I cannot figure out how. Does anyone know how to do this?

Sotos

Here is an idea using the tidyverse. We split Var2 into one row per code, join to the second data frame, and then paste the values back together per Var1:

library(tidyverse)

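df1 %>% 
 separate_rows(Var2) %>%                        # one row per code
 left_join(df2, by = c('Var2' = 'Var_x')) %>%  # look up Var_y for each code
 group_by(Var1) %>% 
 # funs() is deprecated since dplyr 0.8; across() needs dplyr >= 1.0
 summarise(across(everything(), ~ paste(.x, collapse = ';')))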

which gives,

# A tibble: 3 x 3
  Var1     Var2                                    Var_y                 
  <fct>    <chr>                                   <chr>                 
1 JDKSAKDJ PE35RV747;P0F071G1G                     bad;NA                
2 ODJSMDK  Q2UKLVVX4;PWER00711;PE35RV747;Q2UKLVVX4 normal;NA;bad;normal  
3 SJDJWK   P04TGI7F3;P030Y7Y11;PE35RV747;Q2UKLVVX4 good;normal;bad;normal

If you do not want to include the NAs, you can drop the unmatched codes before joining (as @akrun mentions), i.e.

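df1 %>% 
 separate_rows(Var2) %>% 
 filter(Var2 %in% df2$Var_x) %>%                # keep only codes present in df2
 left_join(df2, by = c('Var2' = 'Var_x')) %>% 
 group_by(Var1) %>% 
 # funs() is deprecated since dplyr 0.8; across() needs dplyr >= 1.0
 summarise(across(everything(), ~ paste(.x, collapse = ';')))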

which gives,

# A tibble: 3 x 3
  Var1     Var2                                    Var_y                 
  <fct>    <chr>                                   <chr>                 
1 JDKSAKDJ PE35RV747                               bad                   
2 ODJSMDK  Q2UKLVVX4;PE35RV747;Q2UKLVVX4           normal;bad;normal     
3 SJDJWK   P04TGI7F3;P030Y7Y11;PE35RV747;Q2UKLVVX4 good;normal;bad;normal
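
Since memory was the concern in the question, the same split, look up, and paste idea can also be sketched in base R with a named lookup vector instead of a join. This is only a sketch under assumptions: the data is rebuilt by hand from the question, the names df1, df2, lookup and Var3 are chosen for illustration, and it has not been benchmarked on a large dataset. Unlike the second pipeline above, it leaves Var2 untouched and only drops unmatched codes from the new column.

```r
# Example data copied from the question (df1/df2 follow the answer's naming).
df1 <- data.frame(
  Var1 = c("SJDJWK", "ODJSMDK", "JDKSAKDJ"),
  Var2 = c("P04TGI7F3;P030Y7Y11;PE35RV747;Q2UKLVVX4",
           "Q2UKLVVX4;PWER00711;PE35RV747;Q2UKLVVX4",
           "PE35RV747;P0F071G1G"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Var_x = c("P04TGI7F3", "P030Y7Y11", "PE35RV747", "Q2UKLVVX4"),
  Var_y = c("good", "normal", "bad", "normal"),
  stringsAsFactors = FALSE
)

# Named character vector: names are the codes, values are the labels.
lookup <- setNames(df2$Var_y, df2$Var_x)

# Split each Var2 string on ";", look up every code, drop codes that have
# no match in df2, and paste the remaining labels back together.
df1$Var3 <- vapply(
  strsplit(df1$Var2, ";", fixed = TRUE),
  function(codes) {
    labels <- unname(lookup[codes])
    paste(labels[!is.na(labels)], collapse = ";")
  },
  character(1)
)

print(df1$Var3)
```

A row whose codes all fail to match ends up with an empty string in Var3; whether that or NA is preferable depends on how the column is used afterwards.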
