R: Fast string split on first delimiter occurrence

screechOwl

I have a file with ~40 million rows, and I need to split each row on the first comma only.

The following, using the stringr function str_split_fixed, works but is very slow.

library(data.table)
library(stringr)

# 40,000 rows (id is recycled); combCol2 has the form "1,a,1,a"
df1 <- data.frame(id = 1:1000, letter1 = rep(letters[sample(1:25, 1000, replace = TRUE)], 40))
df1$combCol1 <- paste0(df1$id, ",", df1$letter1)
df1$combCol2 <- paste0(df1$combCol1, ",", df1$combCol1)

st1 <- str_split_fixed(df1$combCol2, ',', 2)

Any suggestions for a faster way to do this?

A5C1D2H2I1M1N2O1R2T1

Update

The stri_split_fixed function in more recent versions of "stringi" has a simplify argument that can be set to TRUE to return a matrix. Thus, the updated solution would be:

stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
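
As a small illustration of what this call returns, here is a sketch on a made-up three-element vector (the name x and its values are assumptions for the example, not part of the original data). With n = 2 the split stops after the first comma, so any later commas stay in the second column.

```r
library(stringi)

# Toy input (hypothetical values); each string has more than one comma
x <- c("1,a,1,a", "2,b,2,b", "3,c,3,c")

# n = 2 caps the result at two pieces, so only the first comma splits;
# simplify = TRUE returns a character matrix instead of a list
parts <- stri_split_fixed(x, ",", 2, simplify = TRUE)

parts[, 1]  # text before the first comma
parts[, 2]  # everything after it, later commas included
```

The two columns can then be assigned back to the data frame if separate columns are what you need.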

Original answer (with updated benchmarks)

If you are comfortable with the "stringr" syntax and don't want to veer too far from it, but you also want to benefit from a speed boost, try the "stringi" package instead:

library(stringr)
library(stringi)
system.time(temp1 <- str_split_fixed(df1$combCol2, ',', 2))
#    user  system elapsed 
#    3.25    0.00    3.25 
system.time(temp2a <- do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)))
#    user  system elapsed 
#    0.04    0.00    0.05 
system.time(temp2b <- stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE))
#    user  system elapsed 
#    0.01    0.00    0.01

Most "stringr" functions have "stringi" counterparts. As this example shows, though, stri_split_fixed returns a list by default, so without simplify = TRUE you need one extra step (do.call(rbind, ...)) to bind the pieces into a matrix.
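
The difference between the two return shapes can be seen on a small made-up vector (x is a hypothetical example input). This is only a sketch of the list-versus-matrix behaviour, not part of the benchmark:

```r
library(stringi)

x <- c("1,a,1,a", "2,b,2,b")  # hypothetical example input

# Default: a list with one character vector per input string
as_list <- stri_split_fixed(x, ",", 2)

# Extra step: row-bind the list elements into a two-column matrix
bound <- do.call(rbind, as_list)

# simplify = TRUE builds the matrix directly, with no binding step
direct <- stri_split_fixed(x, ",", 2, simplify = TRUE)

all(bound == direct)  # the two matrices hold the same values
```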


Here's how it compares with @RichardScriven's suggestion in the comments:

fun1a <- function() do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2))
fun1b <- function() stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
fun2 <- function() {
  do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2), 
                            invert = TRUE))
} 

library(microbenchmark)
microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  fun1a()  42.72647  46.35848  59.56948  51.94796  69.29920  98.46330    10
#  fun1b()  17.55183  18.59337  20.09049  18.84907  22.09419  26.85343    10
#   fun2() 370.82055 404.23115 434.62582 439.54923 476.02889 480.97912    10
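
Before comparing timings it can be worth confirming that the approaches agree. The following is a minimal sketch on a made-up vector (x is hypothetical) that checks the base R regmatches() route against the "stringi" one. It assumes every string contains at least one comma, as in the example data.

```r
library(stringi)

x <- c("1,a,1,a", "2,b,2,b", "10,z,10,z")  # hypothetical input, all contain a comma

# stringi: split on the first comma only and return a matrix
m_stringi <- stri_split_fixed(x, ",", 2, simplify = TRUE)

# base R: regexpr() finds the first comma; invert = TRUE keeps the text around it
m_base <- do.call(rbind, regmatches(x, regexpr(",", x), invert = TRUE))

stopifnot(all(m_stringi == m_base))  # same pieces from both routes
```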
