目的是从dataframe / data.table中提取具有以下内容的行:
目前,我正在这样做:
library(data.table)
# load the data
customers <- structure(list(
NAME = c("GEETA SHYAM RAO", "B V RAMANA", "GONTU VENKATARAMANAIAH",
"DAMAT RAMAKRISHNA", "MARIAM SUDHAKAR", "VELPURI LAKSHMI SUJATHA",
"MOHAMMED LIYAKHAT ALI", "VENKATESHWARAN PONNAMBALAM",
"DEVARAKONDA SATISH BABU", "GEEDI RAMULU", "KANDU OBULESU",
"J PARVATHALU(TEMP.SUB-STAFF)", "DOKKA RAJESH", "G TULASIRAM REDDY",
"MALLELA CHIRANJEEVI", "MANEPALLI VENKATA RAVAMMA",
"DOKKA JAGADEESHWAR", "K KRISHNA", "B SUDARSHAN", "B ANNAPURNA",
"CHAVVA SHIVA RAMULU", "BIKASH BAHADUR CHITRE", "DARBAR ASHOK",
"VEMULAPALLY SANGAMESHWAR RAO", "MOHAMMED ABDUL HAKEEM ANWAR",
"MANEPALLI SHIV SHANKAR RAO", "MOHD MISKEEN MOHIUDDIN",
"KOTLA CHENNAMMA", "NAYAK SURYAKANTH", "GOPIREDDY INDIRA",
"MEKALA SREEDEVI", "K KRISHNA", "B V RAMANA", "KUMMARI VENKATESHAM",
"BHAVANI CONSRUCTIONS", "UPPUTHOLLA KOTAIAH", "YEDIDHA NIRMALA DEVI",
"MARIAM SUDHAKAR", "B ANNAPURNA", "VELPURI LAKSHMI SUJATHA",
"DARBAR ASHOK", "AMMANA VISHNU VARDHAN REDDY", "ZAITOON BEE",
"MOHD CHAND PASHA", "PALERELLA RAMESH", "GEEDI SRINIVAS",
"RAMAIAH SADU", "BIMAN BALAIAH", "KOTLA CHENNAMMA",
"VENKATESHWARAN PONNAMBALAM"),
DOB = c("13-02-1971", "15-01-1960", "01-07-1970", "10-03-1977",
"24-01-1954", "28-06-1971", "26-01-1980", "14-04-1969", "23-09-1978",
"15-08-1954", "09-10-1984", "20-02-1975", "29-09-1984", "03-03-1975",
"26-01-1979", "01-01-1964", "21-01-1954", "01-05-1964", "12-03-1975",
"12-12-1962", "10-03-1982", "14-05-1983", "03-01-1950", "04-03-1962",
"12-05-1966", "01-06-1960", "10-03-1964", "15-07-1958", "26-06-1979",
"02-04-1974", "10-01-1975", "01-05-1964", "15-01-1960", "08-08-1977",
NA, "05-04-1981", "29-08-1971", "24-01-1954", "12-12-1962",
"28-06-1971", "03-01-1950", "23-06-1970", "20-02-1960", "05-07-1975",
"10-01-1979", "31-08-1982", "10-08-1983", "10-03-1964",
"15-07-1958", "14-04-1969"),
ID = c(502969, 502902, 502985, 502981, 502475, 502267, 502976,
502272, 502977, 502973, 502986, 502978, 502989, 502998, 502967,
502971, 502988, 502737, 502995, 502878, 502972, 502984, 502639,
502968, 502975, 502970, 502997, 502466, 502991, 502982, 502980,
502737, 502902, 502999, 502994, 502987, 502990, 502047, 502877,
502251, 502548, 502992, 503000, 502993, 502983, 502974, 502996,
502979, 502467, 502290),
PIN = c(500082, 500032, 500032, 500032,
500032, 500084, 500032, 500032, 500032, 500032, 500032, 500084,
500032, 500084, 500084, 500032, 5e+05, 500050, 500032, 500084,
500032, 500032, 500032, 500050, 500032, 500032, 500045, 500032,
500084, 500032, 500032, 500084, 500035, 500084, 500032, 500032,
500032, 500032, 500084, 500032, 500084, 500033, 500084, 500032,
500032, 500032, 500084, 500032, 500032, 500032)),
.Names = c("NAME", "DOB", "ID", "PIN"),
class = c("data.table", "data.frame"), row.names = c(NA,-50L))
签出数据:
dim(customers)
#[1] 50 4
head(customers)
NAME DOB ID PIN
#1: GEETA SHYAM RAO 13-02-1971 502969 500082
#2: B V RAMANA 15-01-1960 502902 500032
#3: GONTU VENKATARAMANAIAH 01-07-1970 502985 500032
#4: DAMAT RAMAKRISHNA 10-03-1977 502981 500032
#5: MARIAM SUDHAKAR 24-01-1954 502475 500032
#6: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
步骤1:在NAME和DOB列中获取具有相同值的行-
dup1 <- customers[, .(ID, PIN, .N), keyby=.(NAME, DOB)][N>1][, -"N"]
dup1
# NAME DOB ID PIN
# 1: B ANNAPURNA 12-12-1962 502878 500084
# 2: B ANNAPURNA 12-12-1962 502877 500084
# 3: B V RAMANA 15-01-1960 502902 500032
# 4: B V RAMANA 15-01-1960 502902 500035
# 5: DARBAR ASHOK 03-01-1950 502639 500032
# 6: DARBAR ASHOK 03-01-1950 502548 500084
# 7: K KRISHNA 01-05-1964 502737 500050
# 8: K KRISHNA 01-05-1964 502737 500084
# 9: KOTLA CHENNAMMA 15-07-1958 502466 500032
#10: KOTLA CHENNAMMA 15-07-1958 502467 500032
#11: MARIAM SUDHAKAR 24-01-1954 502475 500032
#12: MARIAM SUDHAKAR 24-01-1954 502047 500032
#13: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
#14: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
#15: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
#16: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
在以上结果中,“ BV RAMANA”和“ K KRISHNA”的ID值在重复行中相同,因此需要删除。
步骤2:在NAME,DOB和ID列中获取具有相同值的行-
dup2 <- dup1[, .(PIN, .N), keyby=.(NAME, DOB, ID)][N>1][, -"N"]
dup2
# NAME DOB ID PIN
#1: B V RAMANA 15-01-1960 502902 500032
#2: B V RAMANA 15-01-1960 502902 500035
#3: K KRISHNA 01-05-1964 502737 500050
#4: K KRISHNA 01-05-1964 502737 500084
第3步:现在从第1步中的行中删除第2步中的行以获得最终结果-
result <- fsetdiff(dup1, dup2)
result
# NAME DOB ID PIN
# 1: B ANNAPURNA 12-12-1962 502878 500084
# 2: B ANNAPURNA 12-12-1962 502877 500084
# 3: DARBAR ASHOK 03-01-1950 502639 500032
# 4: DARBAR ASHOK 03-01-1950 502548 500084
# 5: KOTLA CHENNAMMA 15-07-1958 502466 500032
# 6: KOTLA CHENNAMMA 15-07-1958 502467 500032
# 7: MARIAM SUDHAKAR 24-01-1954 502475 500032
# 8: MARIAM SUDHAKAR 24-01-1954 502047 500032
# 9: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
#10: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
#11: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
#12: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
在上述每种情况下,NAME和DOB列均具有重复值,但ID列中这些重复行的值必然不同。
这是获取结果的三行处理代码,但我确定必须有其他方法。在此示例中,只有四个字段。可以说,将50个以上的字段(即使带有复制粘贴)放在代码中将是一件繁琐的工作。因此,创建可以用作输入的可重用功能真的很酷-
并将结果输出为dataframe / data.table。请点子。
旁注:该功能在欺诈分析中被认为非常重要,以至商业软件“ CaseWare IDEA”以“重复密钥排除”的名称提供此功能。实际使用此功能:https : //www.youtube.com/watch?v=XqL4j8UXsKw
我认为OP的方法已经非常好。但是...
j = .N
它自己,它将更有效率。有关?GForce
详细信息,请参见。我认为OP的两个步骤在OP和链接的视频中描述的“重复密钥排除”任务中没有成功:
在上述每种情况下,NAME和DOB列均具有重复值,但ID列中这些重复行的值必然不同。
对于OP的两个步骤,有...
bycols = c("NAME", "DOB")
dcol = "ID"
cols = c(bycols, dcol)
w1 = customers[customers[, .N, by=bycols][N > 1L, !"N"], on=bycols, which=TRUE]
customers[w1][!customers[w1, .N, by=cols][N > 1L, !"N"], on=cols]
对于引用的任务...
mDT = customers[!duplicated(customers, by=cols), .N, by=bycols][N > 1L]
customers[mDT[, !"N"], on=bycols]
无论哪种方式,OP的示例
NAME DOB ID PIN
1: MARIAM SUDHAKAR 24-01-1954 502475 500032
2: MARIAM SUDHAKAR 24-01-1954 502047 500032
3: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
4: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
5: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
6: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
7: B ANNAPURNA 12-12-1962 502878 500084
8: B ANNAPURNA 12-12-1962 502877 500084
9: DARBAR ASHOK 03-01-1950 502639 500032
10: DARBAR ASHOK 03-01-1950 502548 500084
11: KOTLA CHENNAMMA 15-07-1958 502466 500032
12: KOTLA CHENNAMMA 15-07-1958 502467 500032
mDT
是描述重复项的摘要表,便于浏览:
> mDT
NAME DOB N
1: MARIAM SUDHAKAR 24-01-1954 2
2: VELPURI LAKSHMI SUJATHA 28-06-1971 2
3: VENKATESHWARAN PONNAMBALAM 14-04-1969 2
4: B ANNAPURNA 12-12-1962 2
5: DARBAR ASHOK 03-01-1950 2
6: KOTLA CHENNAMMA 15-07-1958 2
由__San __(原始海报)编辑:
修改数据集以显示如何处理Frank先前指出的情况:“ ID = AAB的组将丢失两个As并保留B”
library(data.table)
# load the data
customers <- structure(list(
NAME = c("GEETA SHYAM RAO", "B V RAMANA", "GONTU VENKATARAMANAIAH",
"DAMAT RAMAKRISHNA", "MARIAM SUDHAKAR", "VELPURI LAKSHMI SUJATHA",
"MOHAMMED LIYAKHAT ALI", "VENKATESHWARAN PONNAMBALAM",
"DEVARAKONDA SATISH BABU", "GEEDI RAMULU", "KANDU OBULESU",
"B V RAMANA", "DOKKA RAJESH", "G TULASIRAM REDDY",
"MALLELA CHIRANJEEVI", "MANEPALLI VENKATA RAVAMMA",
"DOKKA JAGADEESHWAR", "K KRISHNA", "B SUDARSHAN", "B ANNAPURNA",
"CHAVVA SHIVA RAMULU", "BIKASH BAHADUR CHITRE", "DARBAR ASHOK",
"VEMULAPALLY SANGAMESHWAR RAO", "MOHAMMED ABDUL HAKEEM ANWAR",
"MANEPALLI SHIV SHANKAR RAO", "MOHD MISKEEN MOHIUDDIN",
"KOTLA CHENNAMMA", "NAYAK SURYAKANTH", "GOPIREDDY INDIRA",
"MEKALA SREEDEVI", "K KRISHNA", "B V RAMANA",
"KUMMARI VENKATESHAM", "BHAVANI CONSRUCTIONS",
"UPPUTHOLLA KOTAIAH", "YEDIDHA NIRMALA DEVI", "MARIAM SUDHAKAR",
"B ANNAPURNA", "VELPURI LAKSHMI SUJATHA", "DARBAR ASHOK",
"AMMANA VISHNU VARDHAN REDDY", "ZAITOON BEE", "MOHD CHAND PASHA",
"PALERELLA RAMESH", "GEEDI SRINIVAS", "RAMAIAH SADU",
"BIMAN BALAIAH", "KOTLA CHENNAMMA", "VENKATESHWARAN PONNAMBALAM"),
DOB = c("13-02-1971", "15-01-1960", "01-07-1970", "10-03-1977",
"24-01-1954", "28-06-1971", "26-01-1980", "14-04-1969",
"23-09-1978", "15-08-1954", "09-10-1984", "15-01-1960",
"29-09-1984", "03-03-1975", "26-01-1979", "01-01-1964",
"21-01-1954", "01-05-1964", "12-03-1975", "12-12-1962",
"10-03-1982", "14-05-1983", "03-01-1950", "04-03-1962",
"12-05-1966", "01-06-1960", "10-03-1964", "15-07-1958",
"26-06-1979", "02-04-1974", "10-01-1975", "01-05-1964",
"15-01-1960", "08-08-1977", NA, "05-04-1981", "29-08-1971",
"24-01-1954", "12-12-1962", "28-06-1971", "03-01-1950",
"23-06-1970", "20-02-1960", "05-07-1975", "10-01-1979",
"31-08-1982", "10-08-1983", "10-03-1964", "15-07-1958",
"14-04-1969"),
ID = c(502969, 502902, 502985, 502981, 502475, 502267, 502976,
502272, 502977, 502973, 502986, 502910, 502989, 502998, 502967,
502971, 502988, 502737, 502995, 502878, 502972, 502984, 502639,
502968, 502975, 502970, 502997, 502466, 502991, 502982, 502980,
502737, 502902, 502999, 502994, 502987, 502990, 502047, 502877,
502251, 502548, 502992, 503000, 502993, 502983, 502974, 502996,
502979, 502467, 502290),
PIN = c(500082, 500032, 500032, 500032, 500032, 500084, 500032, 500032,
500032, 500032, 500032, 500033, 500032, 500084, 500084, 500032,
5e+05, 500050, 500032, 500084, 500032, 500032, 500032, 500050,
500032, 500032, 500045, 500032, 500084, 500032, 500032, 500084,
500035, 500084, 500032, 500032, 500032, 500032, 500084, 500032,
500084, 500033, 500084, 500032, 500032, 500032, 500084, 500032,
500032, 500032)),
.Names = c("NAME", "DOB", "ID", "PIN"),
row.names = c(NA, -50L), class = c("data.table", "data.frame"))
# define function for duplicate key exclusion
dupKeyEx <- function(DT, dup_cols, unique_cols) {
cols <- c(dup_cols, unique_cols)
mDT <- DT[!duplicated(DT, by=cols), .N, by=dup_cols][N > 1L]
ans <- unique(DT[mDT[, !"N"], on=dup_cols], by=cols)
return(ans)
}
# call function
result <- dupKeyEx(customers, c("NAME", "DOB"), "ID")
result
结果告诉我们,BV RAMANA(名称和DOB相同)已获得多个ID,并显示以下不同ID:
NAME DOB ID PIN
1: B ANNAPURNA 12-12-1962 502877 500084
2: B ANNAPURNA 12-12-1962 502878 500084
3: B V RAMANA 15-01-1960 502902 500032
4: B V RAMANA 15-01-1960 502910 500033
5: DARBAR ASHOK 03-01-1950 502548 500084
6: DARBAR ASHOK 03-01-1950 502639 500032
7: KOTLA CHENNAMMA 15-07-1958 502466 500032
8: KOTLA CHENNAMMA 15-07-1958 502467 500032
9: MARIAM SUDHAKAR 24-01-1954 502047 500032
10: MARIAM SUDHAKAR 24-01-1954 502475 500032
11: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
12: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
13: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
14: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
如果此修改后的数据集使用OP方法,则会丢失两行具有相同BV RAMANA ID的ID,并且结果将仅显示颁发给BV RAMANA的ID(总共三个)中的一个ID,因为使用该方法“ ID = AAB的组将失去两个As并保留B”(引用Frank)。Uwe Block的其他data.table解决方案也是如此。不符合目的的结果如下所示。
NAME DOB ID PIN
1: B ANNAPURNA 12-12-1962 502878 500084
2: B ANNAPURNA 12-12-1962 502877 500084
3: B V RAMANA 15-01-1960 502910 500033
4: DARBAR ASHOK 03-01-1950 502639 500032
5: DARBAR ASHOK 03-01-1950 502548 500084
6: KOTLA CHENNAMMA 15-07-1958 502466 500032
7: KOTLA CHENNAMMA 15-07-1958 502467 500032
8: MARIAM SUDHAKAR 24-01-1954 502475 500032
9: MARIAM SUDHAKAR 24-01-1954 502047 500032
10: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
11: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
12: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
13: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
直到有人想到新方法失败的情况,我认为我们已经为“重复键排除”找到了正确的解决方案。Frank的出色捕获和data.table解决方案。
有关扩展内容以涵盖由于前导或尾随空格而导致值不匹配的情况的扩展,请参见OP的后续部分,如何引用函数中变量中保存的多个列名,使用stringr::str_trim()
并得出以下结论:
dupKeyEx <- function(DT, dup_cols, unique_cols) {
cols <- c(dup_cols, unique_cols)
chr_cols <- cols[sapply(DT[, ..cols], is.character)]
DT[, (chr_cols) := lapply(.SD, stringr::str_trim), .SDcols=chr_cols]
mDT <- DT[!duplicated(DT, by=cols), .N, by=dup_cols][N > 1L]
ans <- unique(DT[mDT[, !"N"], on=dup_cols], by=cols)
setorderv(ans, c(dup_cols, unique_cols))
return(ans)
}
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句