How to reduce the size of a preprocessing recipe object in R?

martin_hulin

I am preprocessing a dataset with the R recipes package: a Yeo-Johnson transformation to make the variables more normally distributed, followed by normalization to standardize them. Afterwards I want to reduce the size of the prepped recipe object, so I use the butcher package, but it does not help. I also tried manually removing the 'template' where the data is stored, but the size stays the same. Any idea how to reduce the size for storage and later use? Here is a realistic example of the problem I am facing:


suppressPackageStartupMessages({
library(dplyr)
library(purrr)
library(recipes)
})

# Let's generate skewed numeric data of size 20,000 x 3,000 (my real data has 10x more rows)
n <- 3000

example_list <- 
  1:n %>% 
  map(~abs(rnorm(n = 20000, mean = 0, sd = sample(seq(0.1, 10, length.out = n), size = n))))

names(example_list) <- paste0("col_", 1:n)

example_tibble <- as_tibble(example_list)

# Let's create the preprocessing recipe
new_recipe <- 
  recipe( ~ ., data = example_tibble) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep(strings_as_factors = FALSE, retain = FALSE)

# Let's check the structure and size of the recipe object
butcher::weigh(new_recipe)
#> # A tibble: 9,034 x 2
#>    object                     size
#>    <chr>                     <dbl>
#>  1 steps.terms             480.   
#>  2 steps.terms             480.   
#>  3 steps.lambdas             0.232
#>  4 steps.means               0.232
#>  5 steps.sds                 0.232
#>  6 var_info.variable         0.208
#>  7 term_info.variable        0.208
#>  8 last_term_info.variable   0.208
#>  9 template.col_1            0.160
#> 10 template.col_2            0.160
#> # … with 9,024 more rows

lobstr::obj_size(new_recipe)
#> 481,649,536 B

# Let's try to remove unnecessary parts of the object
new_recipe_butchered <- butcher::butcher(new_recipe, verbose = TRUE)
#> ✖ No memory released. Do not butcher.

# Let's check the size again
lobstr::obj_size(new_recipe_butchered)
#> 481,650,016 B

butcher::weigh(new_recipe_butchered)
#> # A tibble: 9,034 x 2
#>    object                     size
#>    <chr>                     <dbl>
#>  1 steps.terms             480.   
#>  2 steps.lambdas             0.232
#>  3 steps.means               0.232
#>  4 steps.sds                 0.232
#>  5 var_info.variable         0.208
#>  6 term_info.variable        0.208
#>  7 last_term_info.variable   0.208
#>  8 template.col_1            0.160
#>  9 template.col_2            0.160
#> 10 template.col_3            0.160
#> # … with 9,024 more rows

# Let's try to remove the template that holds the data
new_recipe_butchered$template <- NULL

butcher::weigh(new_recipe_butchered)
#> # A tibble: 6,034 x 2
#>    object                      size
#>    <chr>                      <dbl>
#>  1 steps.terms             480.    
#>  2 steps.lambdas             0.232 
#>  3 steps.means               0.232 
#>  4 steps.sds                 0.232 
#>  5 var_info.variable         0.208 
#>  6 term_info.variable        0.208 
#>  7 last_term_info.variable   0.208 
#>  8 var_info.role             0.0241
#>  9 var_info.source           0.0241
#> 10 term_info.role            0.0241
#> # … with 6,024 more rows

# Let's check the size again - still the same
lobstr::obj_size(new_recipe_butchered)
#> 481,650,016 B

Created on 2021-06-17 by the reprex package (v0.3.0)

It seems I am not able to reduce the size. Can someone help?

EmilHvitfeldt

This issue has been resolved in the development version of {butcher}, which you can install with:

# install.packages("devtools")
devtools::install_github("tidymodels/butcher")

{butcher} now removes the environment captured by each step's terms, which accounts for the roughly 480 MB reported for steps.terms above.
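
One plausible reading of why steps.terms weighs about as much as the data itself: the terms are stored as captured expressions that keep a reference to the environment they were created in, so anything living in that environment is measured and saved along with them. The base-R sketch below illustrates that mechanism with a plain formula rather than a recipe. `make_formula` and `serialized_size` are hypothetical helper names, and this only illustrates the general effect; it is not the code {butcher} runs.

```r
# Hypothetical helper: builds a tiny formula inside a function whose
# environment also holds a large vector.
make_formula <- function() {
  big <- rnorm(1e6)  # about 8 MB of doubles in the function's environment
  y ~ x              # the formula keeps a reference to that environment
}

# Hypothetical helper: number of bytes the object would occupy when serialized
# (serialization is what saveRDS() uses under the hood).
serialized_size <- function(x) length(serialize(x, NULL))

f <- make_formula()
size_before <- serialized_size(f)

# Re-point the formula at the empty environment so the large vector
# is no longer reachable from it.
f_small <- f
environment(f_small) <- emptyenv()
size_after <- serialized_size(f_small)

cat("before:", size_before, "bytes\n")
cat("after: ", size_after, "bytes\n")
```

The formula itself is only a few symbols, so nearly all of the "before" size comes from the captured environment. Dropping that reference is safe only when nothing later needs to evaluate the expression in its original environment.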

suppressPackageStartupMessages({
library(dplyr)
library(purrr)
library(recipes)
})

n <- 3000

example_list <- 
  1:n %>% 
  map(~abs(rnorm(n = 20000, mean = 0, sd = sample(seq(0.1, 10, length.out = n), size = n))))

names(example_list) <- paste0("col_", 1:n)

example_tibble <- as_tibble(example_list)

new_recipe <- 
  recipe( ~ ., data = example_tibble) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep(strings_as_factors = FALSE, retain = FALSE)

butcher::weigh(new_recipe)
#> # A tibble: 12,033 x 2
#>    object                      size
#>    <chr>                      <dbl>
#>  1 steps.terms             480.    
#>  2 steps.terms             480.    
#>  3 steps.lambdas             0.232 
#>  4 steps.means               0.232 
#>  5 steps.sds                 0.232 
#>  6 var_info.variable         0.208 
#>  7 term_info.variable        0.208 
#>  8 last_term_info.variable   0.208 
#>  9 var_info.role             0.0241
#> 10 var_info.source           0.0241
#> # … with 12,023 more rows

lobstr::obj_size(new_recipe)
#> 481,985,880 B

new_recipe_butchered <- butcher::butcher(new_recipe, verbose = TRUE)
#> ✓ Memory released: '480,170,888 B'

lobstr::obj_size(new_recipe_butchered)
#> 1,814,992 B

butcher::weigh(new_recipe_butchered)
#> # A tibble: 12,033 x 2
#>    object                    size
#>    <chr>                    <dbl>
#>  1 steps.lambdas           0.232 
#>  2 steps.means             0.232 
#>  3 steps.sds               0.232 
#>  4 var_info.variable       0.208 
#>  5 term_info.variable      0.208 
#>  6 last_term_info.variable 0.208 
#>  7 var_info.role           0.0241
#>  8 var_info.source         0.0241
#>  9 term_info.role          0.0241
#> 10 term_info.source        0.0241
#> # … with 12,023 more rows

Created on 2021-06-17 by the reprex package (v2.0.0)
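
Since the question asks about storage and later use, here is a minimal sketch of saving a butchered recipe and applying it to new data afterwards. It assumes a {butcher} version that includes the fix above, uses the built-in `mtcars` data instead of the large example, and writes to a temporary file; the expectation that a butchered recipe can still be baked is an assumption worth verifying on your own recipe before relying on it.

```r
library(recipes)

# Small stand-in for the real data: same two steps as in the question.
rec <- recipe(~ ., data = mtcars) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep(retain = FALSE)

# Strip the parts that are not needed for applying the recipe.
rec_small <- butcher::butcher(rec)

# Save to disk and read back, as you would between sessions.
path <- tempfile(fileext = ".rds")
saveRDS(rec_small, path)
rec_loaded <- readRDS(path)

# Apply the stored preprocessing to new data.
baked <- bake(rec_loaded, new_data = head(mtcars))
print(dim(baked))
```

Comparing `file.size(path)` for the butchered and unbutchered recipe is a quick way to confirm that the saving carries over to the file on disk.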
