I would like to keep only the rows in which one column shares 2 or more words with another column.
I have a dataframe like this:
df <- data.frame(name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte", "Alejandro Martinez Amor", "Iñigo Muruzabal"),
name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio", "Alejandro Martinez de Amor", "Muruzabal, Javier"))
I would like to create a condition that keeps only the rows where the first column (name1) and the second column (name2) have 2 or more words in common. The result I would like to have is:
name1 | name2 |
---|---|
Carlos Lopez Rey | Lopez, Carlos |
Monica Naranjo Garcia | Monica de Naranjo |
Alejandro Martinez Amor | Alejandro Martinez de Amor |
Note that "Antonio Perez Reverte" and "Iñigo Muruzabal" are excluded from the result because their name1 value shares only 1 word with name2.
Split each string into words, count the words the two columns have in common using length(intersect(...)),
and keep only the rows that have at least 2 words in common.
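# Split both columns on commas or whitespace, count the words shared by each
# name1/name2 pair, and keep the rows where that count is at least 2.
result <- subset(df, mapply(function(x, y) length(intersect(x, y)),
                            strsplit(name1, ',|\\s+'), strsplit(name2, ',|\\s+')) >= 2)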
result
# name1 name2
#1 Carlos Lopez Rey Lopez, Carlos
#2 Monica Naranjo Garcia Monica de Naranjo
#4 Alejandro Martinez Amor Alejandro Martinez de Amor
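To see what the one-liner above does per row, the counting step can be pulled out into a small helper. This is only a sketch: `count_shared_words` is a hypothetical name, not part of the original answer, and it swaps the split pattern for `[,[:space:]]+` so that a comma followed by a space is treated as a single separator (the original pattern leaves an empty string between the two words, which happens to be harmless for this data). Matching remains exact and case-sensitive.

```r
# Hypothetical helper (not from the original answer): count the distinct
# words two strings have in common, splitting on runs of commas/whitespace.
count_shared_words <- function(a, b) {
  words_a <- strsplit(a, "[,[:space:]]+")[[1]]
  words_b <- strsplit(b, "[,[:space:]]+")[[1]]
  length(intersect(words_a, words_b))
}

# The original pattern splits ", " twice, leaving an empty token:
strsplit("Lopez, Carlos", ",|\\s+")[[1]]
# [1] "Lopez"  ""       "Carlos"

# The alternative pattern treats ", " as one separator:
strsplit("Lopez, Carlos", "[,[:space:]]+")[[1]]
# [1] "Lopez"  "Carlos"

# Same data as in the question.
df <- data.frame(
  name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte",
            "Alejandro Martinez Amor", "Iñigo Muruzabal"),
  name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio",
            "Alejandro Martinez de Amor", "Muruzabal, Javier")
)

# Number of shared words per row, then the same >= 2 filter as above.
shared <- mapply(count_shared_words, df$name1, df$name2, USE.NAMES = FALSE)
shared
# [1] 2 2 1 3 1
result2 <- df[shared >= 2, ]
```

Keeping the per-row counts in `shared` makes it easy to check why a given row was kept or dropped, or to change the threshold later.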