How to select rows from one data.table to apply in another data.table?

SJDS Published at Dev

SJDS

I have two data.tables df (21 MIO rows) and tmp (500k rows)

df has three columns linking an original patent (origpat) with a reference patent (refpat), and tying an original classification (mainprim) to the origpat.

Below the 30 first lines are shown. Every origpat, refpat pair is unique but every origpat appears between 1 and 300 times and every refpat between 1 and 3,100 times

dput(df[1:30,-3])
structure(list(origpat = c(4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592), ref.pat = c(4318978, 
4436368, 4358181, 4478622, 4312654, 4293439, 4286061, 4363648, 
4406517, 4478623, 4277285, 4375743, 4470520, 4328022, 4248614, 
4297139, 4296607, 4296608, 4395271, 4321141, 4294190, 4431420, 
4322467, 4285730, 4393138, 4246034, 4251278, 4339174, 4277322, 
4290586), mainprim = c("442", "442", "442", "442", "442", "442", 
"442", "442", "442", "442", "442", "442", "442", "442", "442", 
"442", "442", "442", "442", "442", "442", "442", "442", "442", 
"442", "442", "442", "442", "442", "442")), .Names = c("origpat", 
"ref.pat", "mainprim"), row.names = c(NA, 30L), class = c("data.table", 
"data.frame"))

tmp contains a list of patents pnum with their respective primary classifications prim. All the origpat and refpat in df are pnum (patent numbers) in tmp. As example data I give the selection of tmp data.table that contain all the info relating to the selected df variables through dput

dput(tmp)
structure(list(pnum = c("4318978", "4318978", "4318978", "4318978", 
"4318978", "4318978", "4318978", "4318978", "4436368", "4436368", 
"4436368", "4436368", "4358181", "4358181", "4358181", "4358181", 
"4478622", "4312654", "4312654", "4312654", "4312654", "4312654", 
"4312654", "4293439", "4293439", "4293439", "4293439", "4293439", 
"4293439", "4293439", "4293439", "4293439", "4293439", "4293439", 
"4293439", "4293439", "4286061", "4286061", "4286061", "4286061", 
"4286061", "4286061", "4286061", "4286061", "4363648", "4363648", 
"4363648", "4406517", "4478623", "4478623", "4277285", "4375743", 
"4375743", "4375743", "4375743", "4470520", "4470520", "4470520", 
"4328022", "4328022", "4248614", "4248614", "4248614", "4248614", 
"4248614", "4248614", "4297139", "4297139", "4297139", "4297139", 
"4297139", "4296607", "4296607", "4296607", "4296607", "4296607", 
"4296607", "4296608", "4296608", "4296608", "4296608", "4296608", 
"4395271", "4395271", "4395271", "4321141", "4321141", "4321141", 
"4321141", "4294190", "4294190", "4294190", "4294190", "4294190", 
"4294190", "4431420", "4431420", "4431420", "4431420", "4431420", 
"4431420", "4322467", "4322467", "4322467", "4322467", "4322467", 
"4322467", "4322467", "4322467", "4322467", "4322467", "4285730", 
"4285730", "4393138", "4393138", "4393138", "4393138", "4393138", 
"4393138", "4393138", "4246034", "4246034", "4246034", "4246034", 
"4251278", "4251278", "4251278", "4339174", "4339174", "4339174", 
"4339174", "4277322", "4277322", "4290586", "4290586", "4290586", 
"4290586", "4290586", "4247592", "4247592", "4247592", "4247592", 
"4247592", "4247592", "4247592", "4247592", "4247592"), prim = c("430", 
"430", "430", "430", "430", "430", "430", "430", "340", "385", 
"385", "385", "385", "385", "65", "65", "65", "118", "427", "65", 
"65", "65", "65", "106", "106", "106", "501", "501", "501", "501", 
"501", "516", "516", "516", "516", "516", "435", "435", "435", 
"435", "435", "435", "435", "435", "156", "428", "65", "385", 
"65", "65", "501", "422", "53", "53", "53", "222", "422", "604", 
"65", "65", "385", "385", "65", "65", "65", "65", "106", "106", 
"501", "501", "501", "252", "423", "423", "501", "505", "62", 
"423", "501", "501", "505", "62", "65", "65", "65", "210", "210", 
"210", "435", "118", "118", "118", "118", "118", "118", "106", 
"433", "433", "433", "433", "501", "156", "427", "427", "428", 
"428", "428", "428", "428", "428", "428", "501", "501", "426", 
"426", "426", "435", "435", "435", "435", "428", "501", "501", 
"501", "501", "501", "65", "385", "385", "385", "65", "204", 
"204", "204", "266", "266", "432", "73", "427", "427", "428", 
"442", "442", "442", "442", "8", "8")), .Names = c("pnum", "prim"
), class = c("data.table", "data.frame"), row.names = c(NA, -147L
), .internal.selfref = <pointer: 0x0000000000100788>)

Now, I want to compare the mainprim (which is linked to the origpat) with the different prim variables linked to refpat.

Code below works but is prohibitively slow.

library(data.table)
df <- data.table(df) ; setkey(df, refpat, origpat) 
refs <- unique(df$refpat) # Capture all unique refpat in df (71,000 in entire data.table)
startrow <- 0 # Set loop    
overlap <- function(a,b) sum (a == b) / length(b)
df$compare <- NA # overlap values will be inserted here

for (h in 1:length(refs)) {
  refclass <- tmp$prim[tmp$pnum == refs[h]] #subgroup of relevant 'prim'
    x <- length(df$refpat[df$refpat == refs[h]])
    prims <- df$mainprim[startrow:(startrow + x)] # isolate subset from large `df` data.table to reduce memory needed in second loop
      for (i in 1:x) {
      df$compare[startrow + i] <- overlap(prims[i], refclass) 
        }
    startrow <- startrow + x
    print(h)
  }

The reason I use two for loops is to save computer memory. I could use a single one and redetermine refclass for every row but that made my computer crash within minutes. This loop works but at a speed of being done in about 250 hours. I'm sure there are ways to simply subset the needed rows from tmp within df and then repeat this for every origpat but my data.table skills are not up to the task and I don't find an answer that explains how to make this work on SO or on the data.table pdf files.

Any suggestions are very welcome

EDIt @Frank The specific comparison I want to make is always changing. The main issue is the following. Consider a long df with two columns of linked pnum (patent numbers), one called origpat and the second called ref.pat. Every column contains multiple repeated pnum but every combination (on a single row) is unique. It establishes a link between a firm patent and an older patent. This dataset is about 22 MIO rows. Then I have multiple other datatables, e.g. one linking the pnum to the inventors, one linking the pnum to various technology classifications. What I am interested in is finding the fastest way to compare the linked data (e.g. inventors, technology classes) on a pairwise basis, with the pair defined in the rows of the df (i.e. origpat and ref.pat). So far the data.table solution I have is the fastest but it still takes multiple days to complete a single new comparison. Hope this helps

Tensibai

Best idea I came with is:

df[,idx := .I] # Add an index to the data.table to group by row of df
df[,compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
     length(tmp[pnum == ref.pat,prim]),by = idx]

Or reusing your overlap function (still using the idx column):

df[,compare := overlap(
                mainprim,
                tmp[pnum == ref.pat, prim]),
    by=idx]

What it does here is grouping by row and then use columns from Subset Data to get the mainprim for this row and the subsets of tmp needed.

If you want to avoid creating the idx column you can use by=1:nrow(df) instead but this could slow down the process (using an actual column is quicker in data.table).

Great improvements by @Docendo:

You can further speed up the process by creating an intermediate variable to store the subset instead of doing the subset twice per row:

df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]

And in case there are duplicated combinations of ref.pat and mainprim in df you could further optimize the performance by using by = list(ref.pat, mainprim) instead of by = idx:

df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
   by = list(ref.pat, mainprim)]

And another, probably just minimal, improvement could be done by using mean() instead of sum()/length():

df[,compare := mean(tmp[pnum == ref.pat, prim] == mainprim), by = list(ref.pat, mainprim)]

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-04-6

Comments

0 comments

How to apply a combination function on data.table rows to extract and record the different possibilities in another data.table?

TOP Ranking

Article

How to select rows from one data.table to apply in another data.table?

How to select rows from one data.table to apply in another data.table?

pump.io port in URL

Loopback Error: connect ECONNREFUSED 127.0.0.1:3306 (MAMP)

Can't pre-populate phone number and message body in SMS link on iPhones when SMS app is not running in the background

How to import an asset in swift using Bundle.main.path() in a react-native native module

Failed to listen on localhost:8000 (reason: Cannot assign requested address)

Spring Boot JPA PostgreSQL Web App - Internal Authentication Error

ngClass error (Can't bind ngClass since it isn't a known property of div) in Angular 11.0.3

Using Response.Redirect with Friendly URLS in ASP.NET

Can a 32-bit antivirus program protect you from 64-bit threats

Double spacing in rmarkdown pdf

How to fix "pickle_module.load(f, **pickle_load_args) _pickle.UnpicklingError: invalid load key, '<'" using YOLOv3?

3D Touch Peek Swipe Like Mail

Bootstrap 5 Static Modal Still Closes when I Click Outside

Assembly definition can't resolve namespaces from external packages

Vector input in shiny R and then use it

Emulator wrong screen resolution in Android Studio 1.3

Svchost high CPU from Microsoft.BingWeather app errors

Graphics Context misaligned on first paint

Python connect to firebird docker database

Is this docker-for-mac password dialog legit?

How to save models trained locally in Amazon SageMaker?