I have two data.tables df
(21 MIO rows) and tmp
(500k rows)
df
has three columns linking an original patent (origpat
) with a reference patent (refpat
), and tying an original classification (mainprim
) to the origpat
.
Below the 30 first lines are shown. Every origpat
, refpat
pair is unique but every origpat
appears between 1 and 300 times and every refpat
between 1 and 3,100 times
dput(df[1:30,-3])
structure(list(origpat = c(4247592, 4247592, 4247592, 4247592,
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592,
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592,
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592,
4247592, 4247592, 4247592, 4247592, 4247592), ref.pat = c(4318978,
4436368, 4358181, 4478622, 4312654, 4293439, 4286061, 4363648,
4406517, 4478623, 4277285, 4375743, 4470520, 4328022, 4248614,
4297139, 4296607, 4296608, 4395271, 4321141, 4294190, 4431420,
4322467, 4285730, 4393138, 4246034, 4251278, 4339174, 4277322,
4290586), mainprim = c("442", "442", "442", "442", "442", "442",
"442", "442", "442", "442", "442", "442", "442", "442", "442",
"442", "442", "442", "442", "442", "442", "442", "442", "442",
"442", "442", "442", "442", "442", "442")), .Names = c("origpat",
"ref.pat", "mainprim"), row.names = c(NA, 30L), class = c("data.table",
"data.frame"))
tmp
contains a list of patents pnum
with their respective primary classifications prim
. All the origpat
and refpat
in df
are pnum
(patent numbers) in tmp
. As example data I give the selection of tmp
data.table that contain all the info relating to the selected df
variables through dput
dput(tmp)
structure(list(pnum = c("4318978", "4318978", "4318978", "4318978",
"4318978", "4318978", "4318978", "4318978", "4436368", "4436368",
"4436368", "4436368", "4358181", "4358181", "4358181", "4358181",
"4478622", "4312654", "4312654", "4312654", "4312654", "4312654",
"4312654", "4293439", "4293439", "4293439", "4293439", "4293439",
"4293439", "4293439", "4293439", "4293439", "4293439", "4293439",
"4293439", "4293439", "4286061", "4286061", "4286061", "4286061",
"4286061", "4286061", "4286061", "4286061", "4363648", "4363648",
"4363648", "4406517", "4478623", "4478623", "4277285", "4375743",
"4375743", "4375743", "4375743", "4470520", "4470520", "4470520",
"4328022", "4328022", "4248614", "4248614", "4248614", "4248614",
"4248614", "4248614", "4297139", "4297139", "4297139", "4297139",
"4297139", "4296607", "4296607", "4296607", "4296607", "4296607",
"4296607", "4296608", "4296608", "4296608", "4296608", "4296608",
"4395271", "4395271", "4395271", "4321141", "4321141", "4321141",
"4321141", "4294190", "4294190", "4294190", "4294190", "4294190",
"4294190", "4431420", "4431420", "4431420", "4431420", "4431420",
"4431420", "4322467", "4322467", "4322467", "4322467", "4322467",
"4322467", "4322467", "4322467", "4322467", "4322467", "4285730",
"4285730", "4393138", "4393138", "4393138", "4393138", "4393138",
"4393138", "4393138", "4246034", "4246034", "4246034", "4246034",
"4251278", "4251278", "4251278", "4339174", "4339174", "4339174",
"4339174", "4277322", "4277322", "4290586", "4290586", "4290586",
"4290586", "4290586", "4247592", "4247592", "4247592", "4247592",
"4247592", "4247592", "4247592", "4247592", "4247592"), prim = c("430",
"430", "430", "430", "430", "430", "430", "430", "340", "385",
"385", "385", "385", "385", "65", "65", "65", "118", "427", "65",
"65", "65", "65", "106", "106", "106", "501", "501", "501", "501",
"501", "516", "516", "516", "516", "516", "435", "435", "435",
"435", "435", "435", "435", "435", "156", "428", "65", "385",
"65", "65", "501", "422", "53", "53", "53", "222", "422", "604",
"65", "65", "385", "385", "65", "65", "65", "65", "106", "106",
"501", "501", "501", "252", "423", "423", "501", "505", "62",
"423", "501", "501", "505", "62", "65", "65", "65", "210", "210",
"210", "435", "118", "118", "118", "118", "118", "118", "106",
"433", "433", "433", "433", "501", "156", "427", "427", "428",
"428", "428", "428", "428", "428", "428", "501", "501", "426",
"426", "426", "435", "435", "435", "435", "428", "501", "501",
"501", "501", "501", "65", "385", "385", "385", "65", "204",
"204", "204", "266", "266", "432", "73", "427", "427", "428",
"442", "442", "442", "442", "8", "8")), .Names = c("pnum", "prim"
), class = c("data.table", "data.frame"), row.names = c(NA, -147L
), .internal.selfref = <pointer: 0x0000000000100788>)
Now, I want to compare the mainprim
(which is linked to the origpat
) with the different prim
variables linked to refpat
.
Code below works but is prohibitively slow.
library(data.table)
df <- data.table(df) ; setkey(df, refpat, origpat)
refs <- unique(df$refpat) # Capture all unique refpat in df (71,000 in entire data.table)
startrow <- 0 # Set loop
overlap <- function(a,b) sum (a == b) / length(b)
df$compare <- NA # overlap values will be inserted here
for (h in 1:length(refs)) {
refclass <- tmp$prim[tmp$pnum == refs[h]] #subgroup of relevant 'prim'
x <- length(df$refpat[df$refpat == refs[h]])
prims <- df$mainprim[startrow:(startrow + x)] # isolate subset from large `df` data.table to reduce memory needed in second loop
for (i in 1:x) {
df$compare[startrow + i] <- overlap(prims[i], refclass)
}
startrow <- startrow + x
print(h)
}
The reason I use two for
loops is to save computer memory. I could use a single one and redetermine refclass
for every row but that made my computer crash within minutes. This loop works but at a speed of being done in about 250 hours. I'm sure there are ways to simply subset the needed rows from tmp
within df
and then repeat this for every origpat
but my data.table skills are not up to the task and I don't find an answer that explains how to make this work on SO or on the data.table pdf files.
Any suggestions are very welcome
EDIt @Frank The specific comparison I want to make is always changing. The main issue is the following. Consider a long df
with two columns of linked pnum
(patent numbers), one called origpat
and the second called ref.pat
. Every column contains multiple repeated pnum
but every combination (on a single row) is unique. It establishes a link between a firm patent and an older patent. This dataset is about 22 MIO rows. Then I have multiple other datatables, e.g. one linking the pnum
to the inventors
, one linking the pnum
to various technology classifications. What I am interested in is finding the fastest way to compare the linked data (e.g. inventors, technology classes) on a pairwise basis, with the pair defined in the rows of the df
(i.e. origpat
and ref.pat
). So far the data.table solution I have is the fastest but it still takes multiple days to complete a single new comparison. Hope this helps
Best idea I came with is:
df[,idx := .I] # Add an index to the data.table to group by row of df
df[,compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
length(tmp[pnum == ref.pat,prim]),by = idx]
Or reusing your overlap
function (still using the idx column):
df[,compare := overlap(
mainprim,
tmp[pnum == ref.pat, prim]),
by=idx]
What it does here is grouping by row and then use columns from Subset Data to get the mainprim
for this row and the subsets of tmp
needed.
If you want to avoid creating the idx
column you can use by=1:nrow(df)
instead but this could slow down the process (using an actual column is quicker in data.table
).
Great improvements by @Docendo:
You can further speed up the process by creating an intermediate variable to store the subset instead of doing the subset twice per row:
df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]
And in case there are duplicated combinations of ref.pat
and mainprim
in df
you could further optimize the performance by using by = list(ref.pat, mainprim)
instead of by = idx
:
df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
by = list(ref.pat, mainprim)]
And another, probably just minimal, improvement could be done by using mean()
instead of sum()/length()
:
df[,compare := mean(tmp[pnum == ref.pat, prim] == mainprim), by = list(ref.pat, mainprim)]
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments