Efficient random sampling in R

From a data frame, I am trying randomly sample 1:20 observations where for
each number of observation I would like to replicate the process 4 times. I
came up with this working solution, but it is very slow since it is
involving coping many times a large data frame because of the crossing()
function. Anyone can point me toward a more efficient solution?

library(tidyverse)



mtcars %>% 

  group_by(cyl) %>% 

  nest() %>% 

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 

  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {



    data %>%

      sample_n(n, replace = TRUE) %>%

      summarise(mean_mpg = mean(mpg)) %>%

      pull(mean_mpg)



  }))

#> # A tibble: 240 x 5

#>      cyl data              n_random_sample n_replicate   res

#>    <dbl> <list>                      <int>       <int> <dbl>

#>  1     6 <tibble [7 × 10]>               1           1  17.8

#>  2     6 <tibble [7 × 10]>               1           2  21  

#>  3     6 <tibble [7 × 10]>               1           3  19.2

#>  4     6 <tibble [7 × 10]>               1           4  18.1

#>  5     6 <tibble [7 × 10]>               2           1  19.6

#>  6     6 <tibble [7 × 10]>               2           2  19.4

#>  7     6 <tibble [7 × 10]>               2           3  19.6

#>  8     6 <tibble [7 × 10]>               2           4  20.4

#>  9     6 <tibble [7 × 10]>               3           1  20.1

#> 10     6 <tibble [7 × 10]>               3           2  18.9

#> # ... with 230 more rows

^{Created on 2018-11-19 by the reprex package (v0.2.1)}

asked Nov 19 '18 at 14:18

Philippe Massicotte

1979

1

Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 '18 at 14:36

add a comment |

library(tidyverse)



mtcars %>% 

  group_by(cyl) %>% 

  nest() %>% 

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 

  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {



    data %>%

      sample_n(n, replace = TRUE) %>%

      summarise(mean_mpg = mean(mpg)) %>%

      pull(mean_mpg)



  }))

#> # A tibble: 240 x 5

#>      cyl data              n_random_sample n_replicate   res

#>    <dbl> <list>                      <int>       <int> <dbl>

#>  1     6 <tibble [7 × 10]>               1           1  17.8

#>  2     6 <tibble [7 × 10]>               1           2  21  

#>  3     6 <tibble [7 × 10]>               1           3  19.2

#>  4     6 <tibble [7 × 10]>               1           4  18.1

#>  5     6 <tibble [7 × 10]>               2           1  19.6

#>  6     6 <tibble [7 × 10]>               2           2  19.4

#>  7     6 <tibble [7 × 10]>               2           3  19.6

#>  8     6 <tibble [7 × 10]>               2           4  20.4

#>  9     6 <tibble [7 × 10]>               3           1  20.1

#> 10     6 <tibble [7 × 10]>               3           2  18.9

#> # ... with 230 more rows

^{Created on 2018-11-19 by the reprex package (v0.2.1)}

asked Nov 19 '18 at 14:18

Philippe Massicotte

1979

1

Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 '18 at 14:36

add a comment |

library(tidyverse)



mtcars %>% 

  group_by(cyl) %>% 

  nest() %>% 

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 

  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {



    data %>%

      sample_n(n, replace = TRUE) %>%

      summarise(mean_mpg = mean(mpg)) %>%

      pull(mean_mpg)



  }))

#> # A tibble: 240 x 5

#>      cyl data              n_random_sample n_replicate   res

#>    <dbl> <list>                      <int>       <int> <dbl>

#>  1     6 <tibble [7 × 10]>               1           1  17.8

#>  2     6 <tibble [7 × 10]>               1           2  21  

#>  3     6 <tibble [7 × 10]>               1           3  19.2

#>  4     6 <tibble [7 × 10]>               1           4  18.1

#>  5     6 <tibble [7 × 10]>               2           1  19.6

#>  6     6 <tibble [7 × 10]>               2           2  19.4

#>  7     6 <tibble [7 × 10]>               2           3  19.6

#>  8     6 <tibble [7 × 10]>               2           4  20.4

#>  9     6 <tibble [7 × 10]>               3           1  20.1

#> 10     6 <tibble [7 × 10]>               3           2  18.9

#> # ... with 230 more rows

^{Created on 2018-11-19 by the reprex package (v0.2.1)}

asked Nov 19 '18 at 14:18

Philippe Massicotte

1979

library(tidyverse)



mtcars %>% 

  group_by(cyl) %>% 

  nest() %>% 

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 

  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {



    data %>%

      sample_n(n, replace = TRUE) %>%

      summarise(mean_mpg = mean(mpg)) %>%

      pull(mean_mpg)



  }))

#> # A tibble: 240 x 5

#>      cyl data              n_random_sample n_replicate   res

#>    <dbl> <list>                      <int>       <int> <dbl>

#>  1     6 <tibble [7 × 10]>               1           1  17.8

#>  2     6 <tibble [7 × 10]>               1           2  21  

#>  3     6 <tibble [7 × 10]>               1           3  19.2

#>  4     6 <tibble [7 × 10]>               1           4  18.1

#>  5     6 <tibble [7 × 10]>               2           1  19.6

#>  6     6 <tibble [7 × 10]>               2           2  19.4

#>  7     6 <tibble [7 × 10]>               2           3  19.6

#>  8     6 <tibble [7 × 10]>               2           4  20.4

#>  9     6 <tibble [7 × 10]>               3           1  20.1

#> 10     6 <tibble [7 × 10]>               3           2  18.9

#> # ... with 230 more rows

^{Created on 2018-11-19 by the reprex package (v0.2.1)}

r data.table tidyverse

asked Nov 19 '18 at 14:18

Philippe Massicotte

1979

asked Nov 19 '18 at 14:18

Philippe Massicotte

1979

asked Nov 19 '18 at 14:18

Philippe Massicotte

1979

asked Nov 19 '18 at 14:18

Philippe Massicotte

1979

asked Nov 19 '18 at 14:18

Philippe Massicotte

1979

1

Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 '18 at 14:36

add a comment |

1

Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 '18 at 14:36

Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 '18 at 14:36

add a comment |

2 Answers
2

active

oldest

votes

This is an alternative solution, which subsets your original dataset and picks a sample of rows using a function, instead of using nest to create the sub-datasets and store them as a list variable and then pick a sample using map:

library(tidyverse)



# create function to sample rows

f = function(c, n) {

  mtcars %>%

    filter(cyl == c) %>%

    sample_n(n, replace = TRUE) %>%

    summarise(mean_mpg = mean(mpg)) %>%

    pull(mean_mpg)

}



# vectorise function

f = Vectorize(f)



# set seed for reproducibility

set.seed(11)



tbl_df(mtcars) %>%

  distinct(cyl) %>%

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%

  mutate(res = f(cyl, n_random_sample))



# # A tibble: 240 x 4

#     cyl n_random_sample n_replicate   res

#   <dbl>           <int>       <int> <dbl>

# 1     6               1           1  21  

# 2     6               1           2  21  

# 3     6               1           3  18.1

# 4     6               1           4  21  

# 5     6               2           1  20.4

# 6     6               2           2  21.2

# 7     6               2           3  20.4

# 8     6               2           4  19.6

# 9     6               3           1  18.4

#10     6               3           2  19.6

# # ... with 230 more rows

answered Nov 19 '18 at 14:51

AntoniosK

12.5k1822

Works as intended, thank you.
– Philippe Massicotte
Nov 19 '18 at 15:47

add a comment |

mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times.

You can follow up with this to name the elements of the list:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result:

head(mm,5)

$`sample.1-1`

              mpg cyl disp  hp drat    wt qsec vs am gear carb

Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2



$`sample.2-1`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6



$`sample.3-1`

             mpg cyl disp hp drat    wt  qsec vs am gear carb

Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2



$`sample.4-1`

               mpg cyl  disp hp drat    wt  qsec vs am gear carb

Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1



$`sample.1-2`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

answered Nov 19 '18 at 14:39

iod

3,5592722

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53376574%2fefficient-random-sampling-in-r%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

library(tidyverse)



# create function to sample rows

f = function(c, n) {

  mtcars %>%

    filter(cyl == c) %>%

    sample_n(n, replace = TRUE) %>%

    summarise(mean_mpg = mean(mpg)) %>%

    pull(mean_mpg)

}



# vectorise function

f = Vectorize(f)



# set seed for reproducibility

set.seed(11)



tbl_df(mtcars) %>%

  distinct(cyl) %>%

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%

  mutate(res = f(cyl, n_random_sample))



# # A tibble: 240 x 4

#     cyl n_random_sample n_replicate   res

#   <dbl>           <int>       <int> <dbl>

# 1     6               1           1  21  

# 2     6               1           2  21  

# 3     6               1           3  18.1

# 4     6               1           4  21  

# 5     6               2           1  20.4

# 6     6               2           2  21.2

# 7     6               2           3  20.4

# 8     6               2           4  19.6

# 9     6               3           1  18.4

#10     6               3           2  19.6

# # ... with 230 more rows

answered Nov 19 '18 at 14:51

AntoniosK

12.5k1822

Works as intended, thank you.
– Philippe Massicotte
Nov 19 '18 at 15:47

add a comment |

library(tidyverse)



# create function to sample rows

f = function(c, n) {

  mtcars %>%

    filter(cyl == c) %>%

    sample_n(n, replace = TRUE) %>%

    summarise(mean_mpg = mean(mpg)) %>%

    pull(mean_mpg)

}



# vectorise function

f = Vectorize(f)



# set seed for reproducibility

set.seed(11)



tbl_df(mtcars) %>%

  distinct(cyl) %>%

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%

  mutate(res = f(cyl, n_random_sample))



# # A tibble: 240 x 4

#     cyl n_random_sample n_replicate   res

#   <dbl>           <int>       <int> <dbl>

# 1     6               1           1  21  

# 2     6               1           2  21  

# 3     6               1           3  18.1

# 4     6               1           4  21  

# 5     6               2           1  20.4

# 6     6               2           2  21.2

# 7     6               2           3  20.4

# 8     6               2           4  19.6

# 9     6               3           1  18.4

#10     6               3           2  19.6

# # ... with 230 more rows

answered Nov 19 '18 at 14:51

AntoniosK

12.5k1822

Works as intended, thank you.
– Philippe Massicotte
Nov 19 '18 at 15:47

add a comment |

library(tidyverse)



# create function to sample rows

f = function(c, n) {

  mtcars %>%

    filter(cyl == c) %>%

    sample_n(n, replace = TRUE) %>%

    summarise(mean_mpg = mean(mpg)) %>%

    pull(mean_mpg)

}



# vectorise function

f = Vectorize(f)



# set seed for reproducibility

set.seed(11)



tbl_df(mtcars) %>%

  distinct(cyl) %>%

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%

  mutate(res = f(cyl, n_random_sample))



# # A tibble: 240 x 4

#     cyl n_random_sample n_replicate   res

#   <dbl>           <int>       <int> <dbl>

# 1     6               1           1  21  

# 2     6               1           2  21  

# 3     6               1           3  18.1

# 4     6               1           4  21  

# 5     6               2           1  20.4

# 6     6               2           2  21.2

# 7     6               2           3  20.4

# 8     6               2           4  19.6

# 9     6               3           1  18.4

#10     6               3           2  19.6

# # ... with 230 more rows

answered Nov 19 '18 at 14:51

AntoniosK

12.5k1822

library(tidyverse)



# create function to sample rows

f = function(c, n) {

  mtcars %>%

    filter(cyl == c) %>%

    sample_n(n, replace = TRUE) %>%

    summarise(mean_mpg = mean(mpg)) %>%

    pull(mean_mpg)

}



# vectorise function

f = Vectorize(f)



# set seed for reproducibility

set.seed(11)



tbl_df(mtcars) %>%

  distinct(cyl) %>%

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%

  mutate(res = f(cyl, n_random_sample))



# # A tibble: 240 x 4

#     cyl n_random_sample n_replicate   res

#   <dbl>           <int>       <int> <dbl>

# 1     6               1           1  21  

# 2     6               1           2  21  

# 3     6               1           3  18.1

# 4     6               1           4  21  

# 5     6               2           1  20.4

# 6     6               2           2  21.2

# 7     6               2           3  20.4

# 8     6               2           4  19.6

# 9     6               3           1  18.4

#10     6               3           2  19.6

# # ... with 230 more rows

answered Nov 19 '18 at 14:51

AntoniosK

12.5k1822

answered Nov 19 '18 at 14:51

AntoniosK

12.5k1822

answered Nov 19 '18 at 14:51

AntoniosK

12.5k1822

answered Nov 19 '18 at 14:51

AntoniosK

12.5k1822

Works as intended, thank you.
– Philippe Massicotte
Nov 19 '18 at 15:47

add a comment |

Works as intended, thank you.
– Philippe Massicotte
Nov 19 '18 at 15:47

Works as intended, thank you.
– Philippe Massicotte
Nov 19 '18 at 15:47

add a comment |

mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times.

You can follow up with this to name the elements of the list:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result:

head(mm,5)

$`sample.1-1`

              mpg cyl disp  hp drat    wt qsec vs am gear carb

Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2



$`sample.2-1`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6



$`sample.3-1`

             mpg cyl disp hp drat    wt  qsec vs am gear carb

Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2



$`sample.4-1`

               mpg cyl  disp hp drat    wt  qsec vs am gear carb

Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1



$`sample.1-2`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

answered Nov 19 '18 at 14:39

iod

3,5592722

add a comment |

mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times.

You can follow up with this to name the elements of the list:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result:

head(mm,5)

$`sample.1-1`

              mpg cyl disp  hp drat    wt qsec vs am gear carb

Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2



$`sample.2-1`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6



$`sample.3-1`

             mpg cyl disp hp drat    wt  qsec vs am gear carb

Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2



$`sample.4-1`

               mpg cyl  disp hp drat    wt  qsec vs am gear carb

Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1



$`sample.1-2`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

answered Nov 19 '18 at 14:39

iod

3,5592722

add a comment |

mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times.

You can follow up with this to name the elements of the list:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result:

head(mm,5)

$`sample.1-1`

              mpg cyl disp  hp drat    wt qsec vs am gear carb

Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2



$`sample.2-1`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6



$`sample.3-1`

             mpg cyl disp hp drat    wt  qsec vs am gear carb

Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2



$`sample.4-1`

               mpg cyl  disp hp drat    wt  qsec vs am gear carb

Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1



$`sample.1-2`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

answered Nov 19 '18 at 14:39

iod

3,5592722

mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times.

You can follow up with this to name the elements of the list:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result:

head(mm,5)

$`sample.1-1`

              mpg cyl disp  hp drat    wt qsec vs am gear carb

Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2



$`sample.2-1`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6



$`sample.3-1`

             mpg cyl disp hp drat    wt  qsec vs am gear carb

Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2



$`sample.4-1`

               mpg cyl  disp hp drat    wt  qsec vs am gear carb

Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1



$`sample.1-2`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

answered Nov 19 '18 at 14:39

iod

3,5592722

answered Nov 19 '18 at 14:39

iod

3,5592722

answered Nov 19 '18 at 14:39

iod

3,5592722

answered Nov 19 '18 at 14:39

iod

3,5592722

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Search This Blog

Ufyukyu