How to select rows from one data.table to apply in another data.table?

SJDS

I have two data.tables df (21 MIO rows) and tmp (500k rows)

df has three columns linking an original patent (origpat) with a reference patent (refpat), and tying an original classification (mainprim) to the origpat.

Below the 30 first lines are shown. Every origpat, refpat pair is unique but every origpat appears between 1 and 300 times and every refpat between 1 and 3,100 times

dput(df[1:30,-3])
structure(list(origpat = c(4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 4247592, 
4247592, 4247592, 4247592, 4247592, 4247592), ref.pat = c(4318978, 
4436368, 4358181, 4478622, 4312654, 4293439, 4286061, 4363648, 
4406517, 4478623, 4277285, 4375743, 4470520, 4328022, 4248614, 
4297139, 4296607, 4296608, 4395271, 4321141, 4294190, 4431420, 
4322467, 4285730, 4393138, 4246034, 4251278, 4339174, 4277322, 
4290586), mainprim = c("442", "442", "442", "442", "442", "442", 
"442", "442", "442", "442", "442", "442", "442", "442", "442", 
"442", "442", "442", "442", "442", "442", "442", "442", "442", 
"442", "442", "442", "442", "442", "442")), .Names = c("origpat", 
"ref.pat", "mainprim"), row.names = c(NA, 30L), class = c("data.table", 
"data.frame"))

tmp contains a list of patents pnum with their respective primary classifications prim. All the origpat and refpat in df are pnum (patent numbers) in tmp. As example data I give the selection of tmp data.table that contain all the info relating to the selected df variables through dput

dput(tmp)
structure(list(pnum = c("4318978", "4318978", "4318978", "4318978", 
"4318978", "4318978", "4318978", "4318978", "4436368", "4436368", 
"4436368", "4436368", "4358181", "4358181", "4358181", "4358181", 
"4478622", "4312654", "4312654", "4312654", "4312654", "4312654", 
"4312654", "4293439", "4293439", "4293439", "4293439", "4293439", 
"4293439", "4293439", "4293439", "4293439", "4293439", "4293439", 
"4293439", "4293439", "4286061", "4286061", "4286061", "4286061", 
"4286061", "4286061", "4286061", "4286061", "4363648", "4363648", 
"4363648", "4406517", "4478623", "4478623", "4277285", "4375743", 
"4375743", "4375743", "4375743", "4470520", "4470520", "4470520", 
"4328022", "4328022", "4248614", "4248614", "4248614", "4248614", 
"4248614", "4248614", "4297139", "4297139", "4297139", "4297139", 
"4297139", "4296607", "4296607", "4296607", "4296607", "4296607", 
"4296607", "4296608", "4296608", "4296608", "4296608", "4296608", 
"4395271", "4395271", "4395271", "4321141", "4321141", "4321141", 
"4321141", "4294190", "4294190", "4294190", "4294190", "4294190", 
"4294190", "4431420", "4431420", "4431420", "4431420", "4431420", 
"4431420", "4322467", "4322467", "4322467", "4322467", "4322467", 
"4322467", "4322467", "4322467", "4322467", "4322467", "4285730", 
"4285730", "4393138", "4393138", "4393138", "4393138", "4393138", 
"4393138", "4393138", "4246034", "4246034", "4246034", "4246034", 
"4251278", "4251278", "4251278", "4339174", "4339174", "4339174", 
"4339174", "4277322", "4277322", "4290586", "4290586", "4290586", 
"4290586", "4290586", "4247592", "4247592", "4247592", "4247592", 
"4247592", "4247592", "4247592", "4247592", "4247592"), prim = c("430", 
"430", "430", "430", "430", "430", "430", "430", "340", "385", 
"385", "385", "385", "385", "65", "65", "65", "118", "427", "65", 
"65", "65", "65", "106", "106", "106", "501", "501", "501", "501", 
"501", "516", "516", "516", "516", "516", "435", "435", "435", 
"435", "435", "435", "435", "435", "156", "428", "65", "385", 
"65", "65", "501", "422", "53", "53", "53", "222", "422", "604", 
"65", "65", "385", "385", "65", "65", "65", "65", "106", "106", 
"501", "501", "501", "252", "423", "423", "501", "505", "62", 
"423", "501", "501", "505", "62", "65", "65", "65", "210", "210", 
"210", "435", "118", "118", "118", "118", "118", "118", "106", 
"433", "433", "433", "433", "501", "156", "427", "427", "428", 
"428", "428", "428", "428", "428", "428", "501", "501", "426", 
"426", "426", "435", "435", "435", "435", "428", "501", "501", 
"501", "501", "501", "65", "385", "385", "385", "65", "204", 
"204", "204", "266", "266", "432", "73", "427", "427", "428", 
"442", "442", "442", "442", "8", "8")), .Names = c("pnum", "prim"
), class = c("data.table", "data.frame"), row.names = c(NA, -147L
), .internal.selfref = <pointer: 0x0000000000100788>)

Now, I want to compare the mainprim (which is linked to the origpat) with the different prim variables linked to refpat.

Code below works but is prohibitively slow.

library(data.table)
df <- data.table(df) ; setkey(df, refpat, origpat) 
refs <- unique(df$refpat) # Capture all unique refpat in df (71,000 in entire data.table)
startrow <- 0 # Set loop    
overlap <- function(a,b) sum (a == b) / length(b)
df$compare <- NA # overlap values will be inserted here

for (h in 1:length(refs)) {
  refclass <- tmp$prim[tmp$pnum == refs[h]] #subgroup of relevant 'prim'
    x <- length(df$refpat[df$refpat == refs[h]])
    prims <- df$mainprim[startrow:(startrow + x)] # isolate subset from large `df` data.table to reduce memory needed in second loop
      for (i in 1:x) {
      df$compare[startrow + i] <- overlap(prims[i], refclass) 
        }
    startrow <- startrow + x
    print(h)
  }

The reason I use two for loops is to save computer memory. I could use a single one and redetermine refclass for every row but that made my computer crash within minutes. This loop works but at a speed of being done in about 250 hours. I'm sure there are ways to simply subset the needed rows from tmp within df and then repeat this for every origpat but my data.table skills are not up to the task and I don't find an answer that explains how to make this work on SO or on the data.table pdf files.

Any suggestions are very welcome

EDIt @Frank The specific comparison I want to make is always changing. The main issue is the following. Consider a long df with two columns of linked pnum (patent numbers), one called origpat and the second called ref.pat. Every column contains multiple repeated pnum but every combination (on a single row) is unique. It establishes a link between a firm patent and an older patent. This dataset is about 22 MIO rows. Then I have multiple other datatables, e.g. one linking the pnum to the inventors, one linking the pnum to various technology classifications. What I am interested in is finding the fastest way to compare the linked data (e.g. inventors, technology classes) on a pairwise basis, with the pair defined in the rows of the df (i.e. origpat and ref.pat). So far the data.table solution I have is the fastest but it still takes multiple days to complete a single new comparison. Hope this helps

Tensibai

Best idea I came with is:

df[,idx := .I] # Add an index to the data.table to group by row of df
df[,compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
     length(tmp[pnum == ref.pat,prim]),by = idx]

Or reusing your overlap function (still using the idx column):

df[,compare := overlap(
                mainprim,
                tmp[pnum == ref.pat, prim]),
    by=idx]

What it does here is grouping by row and then use columns from Subset Data to get the mainprim for this row and the subsets of tmp needed.

If you want to avoid creating the idx column you can use by=1:nrow(df) instead but this could slow down the process (using an actual column is quicker in data.table).


Great improvements by @Docendo:

You can further speed up the process by creating an intermediate variable to store the subset instead of doing the subset twice per row:

df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]

And in case there are duplicated combinations of ref.pat and mainprim in df you could further optimize the performance by using by = list(ref.pat, mainprim) instead of by = idx:

df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
   by = list(ref.pat, mainprim)]

And another, probably just minimal, improvement could be done by using mean() instead of sum()/length():

df[,compare := mean(tmp[pnum == ref.pat, prim] == mainprim), by = list(ref.pat, mainprim)]

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

Select data from one table based on selected rows in another

select rows from table based on data from another table

How To Select All Rows In A Table By Querying Data In Another Table

SQLite: How to retrieve data from column in one table using SELECT to insert retrieved data in another table

How to select data from one table and select count to another table and combine it in MySQL?

How do I insert data from one table to another when there is unequal number of rows in one another?

How to insert specific columns and rows of data to an existing table from one table to another?

How to select all data from one table and records from another table matching data in first selection. All in one query

How to select data(with one decimal type) from a table and insert into another table (with different decimal type)

How to select data from one table and insert it into another table with a new column

Using data.table to select rows by distance from another row

Insert Data from one table to another leaving the already existing rows

How to SELECT none matching rows from one table to another?

Select data from one table with the conditions from another

How to apply a combination function on data.table rows to extract and record the different possibilities in another data.table?

In a SELECT command, how do I use data from one table to specify data in another?

How to select data from a table and insert into another table?

How to select data from a table by referring to another table in MySQL

How to select data from the table not exist in another table sql

How to copy data from one table to another table in Firebase?

How to copy data from one table to another new table in MySQL?

How do i copy data from one table to another table?

How to copy data from one table to another table based on criteria

How to get One Table data on the basis of ID from another table

How to move a field from one table to another table and save the data

How to Replace and Update Data From One Table to Another Table in MySQL

How to transfer data from one table of a database to a table in another database

Append data from one table to another table

Copy Data from one table to another table