提取在两个或多个字段中具有重复值但在另一个字段中具有不同值的行

目的是从dataframe / data.table中提取具有以下内容的行:

  • 两个或多个字段(此处为NAME和DOB)中的值相同;
  • 另一个字段(此处为ID)中的不同值

目前,我正在这样做:

library(data.table)

# load the data
customers <- structure(list(
  NAME = c("GEETA SHYAM RAO", "B V RAMANA", "GONTU VENKATARAMANAIAH", 
           "DAMAT RAMAKRISHNA", "MARIAM SUDHAKAR", "VELPURI LAKSHMI SUJATHA", 
           "MOHAMMED LIYAKHAT ALI", "VENKATESHWARAN PONNAMBALAM",
           "DEVARAKONDA SATISH BABU", "GEEDI RAMULU", "KANDU OBULESU",
           "J PARVATHALU(TEMP.SUB-STAFF)", "DOKKA RAJESH", "G TULASIRAM REDDY",
           "MALLELA CHIRANJEEVI", "MANEPALLI VENKATA RAVAMMA", 
           "DOKKA JAGADEESHWAR", "K KRISHNA", "B SUDARSHAN", "B ANNAPURNA", 
           "CHAVVA SHIVA RAMULU", "BIKASH BAHADUR CHITRE", "DARBAR ASHOK", 
           "VEMULAPALLY SANGAMESHWAR RAO", "MOHAMMED ABDUL HAKEEM ANWAR", 
           "MANEPALLI SHIV SHANKAR RAO", "MOHD MISKEEN MOHIUDDIN",
           "KOTLA CHENNAMMA", "NAYAK SURYAKANTH", "GOPIREDDY INDIRA", 
           "MEKALA SREEDEVI", "K KRISHNA", "B V RAMANA", "KUMMARI VENKATESHAM",
           "BHAVANI CONSRUCTIONS", "UPPUTHOLLA KOTAIAH", "YEDIDHA NIRMALA DEVI",
           "MARIAM SUDHAKAR", "B ANNAPURNA", "VELPURI LAKSHMI SUJATHA",
           "DARBAR ASHOK", "AMMANA VISHNU VARDHAN REDDY", "ZAITOON BEE",
           "MOHD CHAND PASHA", "PALERELLA RAMESH", "GEEDI SRINIVAS", 
           "RAMAIAH SADU", "BIMAN BALAIAH", "KOTLA CHENNAMMA", 
           "VENKATESHWARAN PONNAMBALAM"), 
  DOB = c("13-02-1971", "15-01-1960", "01-07-1970", "10-03-1977", 
          "24-01-1954", "28-06-1971", "26-01-1980", "14-04-1969", "23-09-1978", 
          "15-08-1954", "09-10-1984", "20-02-1975", "29-09-1984", "03-03-1975", 
          "26-01-1979", "01-01-1964", "21-01-1954", "01-05-1964", "12-03-1975", 
          "12-12-1962", "10-03-1982", "14-05-1983", "03-01-1950", "04-03-1962", 
          "12-05-1966", "01-06-1960", "10-03-1964", "15-07-1958", "26-06-1979", 
          "02-04-1974", "10-01-1975", "01-05-1964", "15-01-1960", "08-08-1977", 
          NA, "05-04-1981", "29-08-1971", "24-01-1954", "12-12-1962",
          "28-06-1971", "03-01-1950", "23-06-1970", "20-02-1960", "05-07-1975",
          "10-01-1979", "31-08-1982", "10-08-1983", "10-03-1964", 
          "15-07-1958", "14-04-1969"), 
  ID = c(502969, 502902, 502985, 502981, 502475, 502267, 502976, 
         502272, 502977, 502973, 502986, 502978, 502989, 502998, 502967, 
         502971, 502988, 502737, 502995, 502878, 502972, 502984, 502639, 
         502968, 502975, 502970, 502997, 502466, 502991, 502982, 502980, 
         502737, 502902, 502999, 502994, 502987, 502990, 502047, 502877, 
         502251, 502548, 502992, 503000, 502993, 502983, 502974, 502996, 
         502979, 502467, 502290),
  PIN = c(500082, 500032, 500032, 500032, 
          500032, 500084, 500032, 500032, 500032, 500032, 500032, 500084, 
          500032, 500084, 500084, 500032, 5e+05, 500050, 500032, 500084, 
          500032, 500032, 500032, 500050, 500032, 500032, 500045, 500032, 
          500084, 500032, 500032, 500084, 500035, 500084, 500032, 500032, 
          500032, 500032, 500084, 500032, 500084, 500033, 500084, 500032, 
          500032, 500032, 500084, 500032, 500032, 500032)),
  .Names = c("NAME", "DOB", "ID", "PIN"), 
  class = c("data.table", "data.frame"), row.names = c(NA,-50L))

签出数据:

dim(customers)
#[1] 50  4

head(customers)
                      NAME        DOB     ID    PIN
#1:         GEETA SHYAM RAO 13-02-1971 502969 500082
#2:              B V RAMANA 15-01-1960 502902 500032
#3:  GONTU VENKATARAMANAIAH 01-07-1970 502985 500032
#4:       DAMAT RAMAKRISHNA 10-03-1977 502981 500032
#5:         MARIAM SUDHAKAR 24-01-1954 502475 500032
#6: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084

步骤1:在NAME和DOB列中获取具有相同值的行-

dup1 <- customers[, .(ID, PIN, .N), keyby=.(NAME, DOB)][N>1][, -"N"]
dup1
#                          NAME        DOB     ID    PIN
# 1:                B ANNAPURNA 12-12-1962 502878 500084
# 2:                B ANNAPURNA 12-12-1962 502877 500084
# 3:                 B V RAMANA 15-01-1960 502902 500032
# 4:                 B V RAMANA 15-01-1960 502902 500035
# 5:               DARBAR ASHOK 03-01-1950 502639 500032
# 6:               DARBAR ASHOK 03-01-1950 502548 500084
# 7:                  K KRISHNA 01-05-1964 502737 500050
# 8:                  K KRISHNA 01-05-1964 502737 500084
# 9:            KOTLA CHENNAMMA 15-07-1958 502466 500032
#10:            KOTLA CHENNAMMA 15-07-1958 502467 500032
#11:            MARIAM SUDHAKAR 24-01-1954 502475 500032
#12:            MARIAM SUDHAKAR 24-01-1954 502047 500032
#13:    VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
#14:    VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
#15: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
#16: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032

在以上结果中,“ BV RAMANA”和“ K KRISHNA”的ID值在重复行中相同,因此需要删除。

步骤2:在NAME,DOB和ID列中获取具有相同值的行-

dup2 <- dup1[, .(PIN, .N), keyby=.(NAME, DOB, ID)][N>1][, -"N"]
dup2
#         NAME        DOB     ID    PIN
#1: B V RAMANA 15-01-1960 502902 500032
#2: B V RAMANA 15-01-1960 502902 500035
#3:  K KRISHNA 01-05-1964 502737 500050
#4:  K KRISHNA 01-05-1964 502737 500084

第3步:现在从第1步中的行中删除第2步中的行以获得最终结果-

result <- fsetdiff(dup1, dup2)
result
#                          NAME        DOB     ID    PIN
# 1:                B ANNAPURNA 12-12-1962 502878 500084
# 2:                B ANNAPURNA 12-12-1962 502877 500084
# 3:               DARBAR ASHOK 03-01-1950 502639 500032
# 4:               DARBAR ASHOK 03-01-1950 502548 500084
# 5:            KOTLA CHENNAMMA 15-07-1958 502466 500032
# 6:            KOTLA CHENNAMMA 15-07-1958 502467 500032
# 7:            MARIAM SUDHAKAR 24-01-1954 502475 500032
# 8:            MARIAM SUDHAKAR 24-01-1954 502047 500032
# 9:    VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
#10:    VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
#11: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
#12: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032

在上述每种情况下,NAME和DOB列均具有重复值,但ID列中这些重复行的值必然不同。

这是获取结果的三行处理代码,但我确定必须有其他方法。在此示例中,只有四个字段。可以说,将50个以上的字段(即使带有复制粘贴)放在代码中将是一件繁琐的工作。因此,创建可以用作输入的可重用功能真的很酷-

  • 数据框/数据表
  • 仅包含重复值的字段名称的向量
  • 一个必须包含不同值的字段名

并将结果输出为dataframe / data.table。请点子。

旁注:该功能在欺诈分析中被认为非常重要,以至商业软件“ CaseWare IDEA”以“重复密钥排除”的名称提供此功能。实际使用此功能:https : //www.youtube.com/watch?v=XqL4j8UXsKw

坦率

我认为OP的方法已经非常好。但是...

  • 有了j = .N它自己,它将更有效率。有关?GForce详细信息,请参见
  • 我认为OP的两个步骤在OP和链接的视频中描述的“重复密钥排除”任务中没有成功:

    在上述每种情况下,NAME和DOB列均具有重复值,但ID列中这些重复行的值必然不同。

对于OP的两个步骤,有...

bycols = c("NAME", "DOB")
dcol = "ID"

cols = c(bycols, dcol)

w1 = customers[customers[, .N, by=bycols][N > 1L, !"N"], on=bycols, which=TRUE]
customers[w1][!customers[w1, .N, by=cols][N > 1L, !"N"], on=cols]

对于引用的任务...

mDT = customers[!duplicated(customers, by=cols), .N, by=bycols][N > 1L]
customers[mDT[, !"N"], on=bycols]

无论哪种方式,OP的示例

                          NAME        DOB     ID    PIN
 1:            MARIAM SUDHAKAR 24-01-1954 502475 500032
 2:            MARIAM SUDHAKAR 24-01-1954 502047 500032
 3:    VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
 4:    VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
 5: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
 6: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
 7:                B ANNAPURNA 12-12-1962 502878 500084
 8:                B ANNAPURNA 12-12-1962 502877 500084
 9:               DARBAR ASHOK 03-01-1950 502639 500032
10:               DARBAR ASHOK 03-01-1950 502548 500084
11:            KOTLA CHENNAMMA 15-07-1958 502466 500032
12:            KOTLA CHENNAMMA 15-07-1958 502467 500032

mDT 是描述重复项的摘要表,便于浏览:

> mDT
                         NAME        DOB N
1:            MARIAM SUDHAKAR 24-01-1954 2
2:    VELPURI LAKSHMI SUJATHA 28-06-1971 2
3: VENKATESHWARAN PONNAMBALAM 14-04-1969 2
4:                B ANNAPURNA 12-12-1962 2
5:               DARBAR ASHOK 03-01-1950 2
6:            KOTLA CHENNAMMA 15-07-1958 2

由__San __(原始海报)编辑:

修改数据集以显示如何处理Frank先前指出的情况:“ ID = AAB的组将丢失两个As并保留B”

library(data.table)

# load the data
customers <- structure(list(
  NAME = c("GEETA SHYAM RAO", "B V RAMANA", "GONTU VENKATARAMANAIAH", 
           "DAMAT RAMAKRISHNA", "MARIAM SUDHAKAR", "VELPURI LAKSHMI SUJATHA", 
           "MOHAMMED LIYAKHAT ALI", "VENKATESHWARAN PONNAMBALAM", 
           "DEVARAKONDA SATISH BABU", "GEEDI RAMULU", "KANDU OBULESU",
           "B V RAMANA", "DOKKA RAJESH", "G TULASIRAM REDDY", 
           "MALLELA CHIRANJEEVI", "MANEPALLI VENKATA RAVAMMA", 
           "DOKKA JAGADEESHWAR", "K KRISHNA", "B SUDARSHAN", "B ANNAPURNA", 
           "CHAVVA SHIVA RAMULU", "BIKASH BAHADUR CHITRE", "DARBAR ASHOK", 
           "VEMULAPALLY SANGAMESHWAR RAO", "MOHAMMED ABDUL HAKEEM ANWAR", 
           "MANEPALLI SHIV SHANKAR RAO", "MOHD MISKEEN MOHIUDDIN",
           "KOTLA CHENNAMMA", "NAYAK SURYAKANTH", "GOPIREDDY INDIRA", 
           "MEKALA SREEDEVI", "K KRISHNA", "B V RAMANA", 
           "KUMMARI VENKATESHAM", "BHAVANI CONSRUCTIONS", 
           "UPPUTHOLLA KOTAIAH", "YEDIDHA NIRMALA DEVI", "MARIAM SUDHAKAR", 
           "B ANNAPURNA", "VELPURI LAKSHMI SUJATHA", "DARBAR ASHOK", 
           "AMMANA VISHNU VARDHAN REDDY", "ZAITOON BEE", "MOHD CHAND PASHA",
           "PALERELLA RAMESH", "GEEDI SRINIVAS", "RAMAIAH SADU",
           "BIMAN BALAIAH", "KOTLA CHENNAMMA", "VENKATESHWARAN PONNAMBALAM"),
  DOB = c("13-02-1971", "15-01-1960", "01-07-1970", "10-03-1977", 
          "24-01-1954", "28-06-1971", "26-01-1980", "14-04-1969", 
          "23-09-1978", "15-08-1954", "09-10-1984", "15-01-1960", 
          "29-09-1984", "03-03-1975", "26-01-1979", "01-01-1964", 
          "21-01-1954", "01-05-1964", "12-03-1975", "12-12-1962", 
          "10-03-1982", "14-05-1983", "03-01-1950", "04-03-1962", 
          "12-05-1966", "01-06-1960", "10-03-1964", "15-07-1958", 
          "26-06-1979", "02-04-1974", "10-01-1975", "01-05-1964",
          "15-01-1960", "08-08-1977", NA, "05-04-1981", "29-08-1971",
          "24-01-1954", "12-12-1962", "28-06-1971", "03-01-1950",
          "23-06-1970", "20-02-1960", "05-07-1975", "10-01-1979", 
          "31-08-1982", "10-08-1983", "10-03-1964", "15-07-1958",
          "14-04-1969"),
  ID = c(502969, 502902, 502985, 502981, 502475, 502267, 502976, 
         502272, 502977, 502973, 502986, 502910, 502989, 502998, 502967, 
         502971, 502988, 502737, 502995, 502878, 502972, 502984, 502639, 
         502968, 502975, 502970, 502997, 502466, 502991, 502982, 502980, 
         502737, 502902, 502999, 502994, 502987, 502990, 502047, 502877, 
         502251, 502548, 502992, 503000, 502993, 502983, 502974, 502996, 
         502979, 502467, 502290), 
  PIN = c(500082, 500032, 500032, 500032, 500032, 500084, 500032, 500032,
          500032, 500032, 500032, 500033, 500032, 500084, 500084, 500032,
          5e+05, 500050, 500032, 500084, 500032, 500032, 500032, 500050,
          500032, 500032, 500045, 500032, 500084, 500032, 500032, 500084,
          500035, 500084, 500032, 500032, 500032, 500032, 500084, 500032,
          500084, 500033, 500084, 500032, 500032, 500032, 500084, 500032,
          500032, 500032)),
  .Names = c("NAME", "DOB", "ID", "PIN"),
  row.names = c(NA, -50L), class = c("data.table", "data.frame"))

# define function for duplicate key exclusion
dupKeyEx <- function(DT, dup_cols, unique_cols) {
  cols <-  c(dup_cols, unique_cols)
  mDT <-  DT[!duplicated(DT, by=cols), .N, by=dup_cols][N > 1L]
  ans <- unique(DT[mDT[, !"N"], on=dup_cols], by=cols)
  return(ans)
}

# call function
result <- dupKeyEx(customers, c("NAME", "DOB"), "ID")
result

结果告诉我们,BV RAMANA(名称和DOB相同)已获得多个ID,并显示以下不同ID:

                          NAME        DOB     ID    PIN
 1:                B ANNAPURNA 12-12-1962 502877 500084
 2:                B ANNAPURNA 12-12-1962 502878 500084
 3:                 B V RAMANA 15-01-1960 502902 500032
 4:                 B V RAMANA 15-01-1960 502910 500033
 5:               DARBAR ASHOK 03-01-1950 502548 500084
 6:               DARBAR ASHOK 03-01-1950 502639 500032
 7:            KOTLA CHENNAMMA 15-07-1958 502466 500032
 8:            KOTLA CHENNAMMA 15-07-1958 502467 500032
 9:            MARIAM SUDHAKAR 24-01-1954 502047 500032
10:            MARIAM SUDHAKAR 24-01-1954 502475 500032
11:    VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
12:    VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
13: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
14: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032

如果此修改后的数据集使用OP方法,则会丢失两行具有相同BV RAMANA ID的ID,并且结果将仅显示颁发给BV RAMANA的ID(总共三个)中的一个ID,因为使用该方法“ ID = AAB的组将失去两个As并保留B”(引用Frank)。Uwe Block的其他data.table解决方案也是如此。不符合目的的结果如下所示。

                          NAME        DOB     ID    PIN
 1:                B ANNAPURNA 12-12-1962 502878 500084
 2:                B ANNAPURNA 12-12-1962 502877 500084
 3:                 B V RAMANA 15-01-1960 502910 500033
 4:               DARBAR ASHOK 03-01-1950 502639 500032
 5:               DARBAR ASHOK 03-01-1950 502548 500084
 6:            KOTLA CHENNAMMA 15-07-1958 502466 500032
 7:            KOTLA CHENNAMMA 15-07-1958 502467 500032
 8:            MARIAM SUDHAKAR 24-01-1954 502475 500032
 9:            MARIAM SUDHAKAR 24-01-1954 502047 500032
10:    VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
11:    VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
12: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
13: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032

直到有人想到新方法失败的情况,我认为我们已经为“重复键排除”找到了正确的解决方案。Frank的出色捕获和d​​ata.table解决方案。


有关扩展内容以涵盖由于前导或尾随空格而导致值不匹配的情况的扩展,请参见OP的后续部分,如何引用函数中变量中保存的多个列名,使用stringr::str_trim()并得出以下结论:

dupKeyEx <- function(DT, dup_cols, unique_cols) {
  cols <-  c(dup_cols, unique_cols)
  chr_cols <- cols[sapply(DT[, ..cols], is.character)]
  DT[, (chr_cols) := lapply(.SD, stringr::str_trim), .SDcols=chr_cols]
  mDT <-  DT[!duplicated(DT, by=cols), .N, by=dup_cols][N > 1L]
  ans <- unique(DT[mDT[, !"N"], on=dup_cols], by=cols)
  setorderv(ans, c(dup_cols, unique_cols))
  return(ans)
}

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章

如何为一个对象列表的一个字段与另一个具有公共字段的对象的列表的字段设置值?

如果另一个字段具有值(动态),则使用jquery禁用字段

MYSQL连接两个具有相同字段值的表字段,但如果另一个字段与另一个字段不匹配,则仍将包括

如何在另一个字段的任何行中查找具有一个字段内容的MySQL记录

如何编写Kusto查询以仅选择一个字段中具有唯一值的行

选择在一个字段上具有值并且在另一个上的两个值之间的行

确保对象字段在Json Schema的另一个字段中具有值

获取一个字段的最大值,但仅针对另一个字段中具有相同值的其他文档

如何计算具有相同值的文档数,并使用mongoDB中的另一个字段将它们分组?

在SQLite3中,是否有一种方法可以在另一个字段具有相同的值时不允许重复的字段?

如何在Elasticsearch中搜索具有多个值的一个字段?

Rails会验证一个字段值,除非另一个字段具有值

Matlab结构-在一个字段中仅复制具有特定值的元素

在一个字段中具有多个值的MySQL表

在Access VBA中从具有唯一编号的另一个字段值创建串联代码

从数据库中查询,该数据库可能具有一个字段的多个值

MS-Access 2007:查询在另一个字段中具有两个或多个不同值的名称

如何在一个字段或另一个字段中具有值的mongo和group记录中将$ group和$ or组合在一起?

提取具有重复最后一个字段的行

如何读取名称在另一个字段中的行字段的值

仅使用不同 ArrayList 中的一个字段搜索具有多个字段的 ArrayList

为csv文件中的另一个字段标识具有特殊字符的值

SQL - 字段有两个记录值,只想显示另一个字段中基于总计的值之一,然后显示两个值的总计

查找一个字段的实例,其中另一个字段的对应值(应该是 1-1)具有多个值

需要一个 Elasticsearch 查询,将结果限制为在一个字段中具有相同值但在另一个字段中具有不同值的那些

SQL:查询以在一个字段中查找具有共同值的多个项目

SELECT 在两个字段中具有相同值的行,在另一个字段中具有不同的值

通过自联接计数,其中一个字段为空但另一个字段具有值

JQ - 将两个字段合并为一个具有另一个字段名称的数组