Taking subset of data to do calculation in data.table












0















Let say I have this data.table:



df = data.table(date = c(20180101, 20180102, 20180103, 20180104, 20180105, 20180106, 20180107, 20180108, 20180109, 20180110, 20180111, 20180112, 20180113, 20180114, 20180115, 20180116, 20180117, 20180118), value = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18))


I want to do some calculations (e.g. mean) that is using subset of data. For example: In 20180103, the average will be the sum of (yesterday) 20180102 and (today) 20180103 value ((2+3)/2 = 2.5). This is then rolling until the end of the period.



Result is like this:



    date    mean
20180102 1.5
20180103 2.5
20180104 3.5
20180105 4.5
....


Obviously I can write a for loop, subset the data for each iteration then calculate the mean, store the data and output the result. It is deemed too slow using for loop, and using foreach I don't know how to save the result...





The for loop is like:



datelist = df[, .(date)] 

# initialize the object
data = NA
temp = 0
for (i in 2:nrow(datelist)) {
today = as.numeric(datelist[i])
yesterday = as.numeric(datelist[i-1])

temp = df[date >= yesterday & date <= today]

temp = temp[, .(mean(value))]

temp = cbind(datelist[i], mean = temp$V1)


if (is.na(data)[1]){
data=temp

} else {
data=rbind(data,temp)

}


}


You can see I first subset the data and call it temp then do the calculation (average, use it to do lm, whatever function then stack it into data object)



This is slow and inefficient as I have millions of data point





Is there anyway I can do this in data.table syntax:



result = df[, { data = .SD[date >= yesterday & date <= today]
mean = mean(data$value)
list(mean = mean)}, by=.(date)]


I don't know how to express yesterday and today?? so that yesterday will be, in the for loop case, i-1 and today is i?



What i understand when doing by=.(date) is that data.table will look at each date and calculate whatever function you give in. If I can get the value (i.e. i) of which date the data.table is looking at now, then the value (i-1) will be yesterday...



Thanks










share|improve this question

























  • Do you want the current mean at any given time or for any given time the mean of "today" and "yesterday"?

    – nate
    Jan 1 at 16:39











  • Hi Nate, let me write the for loop out and you will understand what I mean. I don't know how to explain it..

    – Gabriel
    Jan 1 at 16:40






  • 2





    Is this a rolling mean? If so, see e.g. Improve rolling mean usage in data.table or Adaptive moving average - top performance in R

    – Henrik
    Jan 1 at 16:44








  • 1





    "I need to run a rolling regression" ~~> google "rolling regression data.table R site:stackoverflow.com". Among the first few hits, an answer by the data.table author: Is there a fast way to run a rolling regression inside data.table?

    – Henrik
    Jan 1 at 19:19








  • 1





    See also RcppRoll as described e.g. here Rolling regressions in R

    – Henrik
    Jan 1 at 19:27
















0















Let say I have this data.table:



df = data.table(date = c(20180101, 20180102, 20180103, 20180104, 20180105, 20180106, 20180107, 20180108, 20180109, 20180110, 20180111, 20180112, 20180113, 20180114, 20180115, 20180116, 20180117, 20180118), value = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18))


I want to do some calculations (e.g. mean) that is using subset of data. For example: In 20180103, the average will be the sum of (yesterday) 20180102 and (today) 20180103 value ((2+3)/2 = 2.5). This is then rolling until the end of the period.



Result is like this:



    date    mean
20180102 1.5
20180103 2.5
20180104 3.5
20180105 4.5
....


Obviously I can write a for loop, subset the data for each iteration then calculate the mean, store the data and output the result. It is deemed too slow using for loop, and using foreach I don't know how to save the result...





The for loop is like:



datelist = df[, .(date)] 

# initialize the object
data = NA
temp = 0
for (i in 2:nrow(datelist)) {
today = as.numeric(datelist[i])
yesterday = as.numeric(datelist[i-1])

temp = df[date >= yesterday & date <= today]

temp = temp[, .(mean(value))]

temp = cbind(datelist[i], mean = temp$V1)


if (is.na(data)[1]){
data=temp

} else {
data=rbind(data,temp)

}


}


You can see I first subset the data and call it temp then do the calculation (average, use it to do lm, whatever function then stack it into data object)



This is slow and inefficient as I have millions of data point





Is there anyway I can do this in data.table syntax:



result = df[, { data = .SD[date >= yesterday & date <= today]
mean = mean(data$value)
list(mean = mean)}, by=.(date)]


I don't know how to express yesterday and today?? so that yesterday will be, in the for loop case, i-1 and today is i?



What i understand when doing by=.(date) is that data.table will look at each date and calculate whatever function you give in. If I can get the value (i.e. i) of which date the data.table is looking at now, then the value (i-1) will be yesterday...



Thanks










share|improve this question

























  • Do you want the current mean at any given time or for any given time the mean of "today" and "yesterday"?

    – nate
    Jan 1 at 16:39











  • Hi Nate, let me write the for loop out and you will understand what I mean. I don't know how to explain it..

    – Gabriel
    Jan 1 at 16:40






  • 2





    Is this a rolling mean? If so, see e.g. Improve rolling mean usage in data.table or Adaptive moving average - top performance in R

    – Henrik
    Jan 1 at 16:44








  • 1





    "I need to run a rolling regression" ~~> google "rolling regression data.table R site:stackoverflow.com". Among the first few hits, an answer by the data.table author: Is there a fast way to run a rolling regression inside data.table?

    – Henrik
    Jan 1 at 19:19








  • 1





    See also RcppRoll as described e.g. here Rolling regressions in R

    – Henrik
    Jan 1 at 19:27














0












0








0








Let say I have this data.table:



df = data.table(date = c(20180101, 20180102, 20180103, 20180104, 20180105, 20180106, 20180107, 20180108, 20180109, 20180110, 20180111, 20180112, 20180113, 20180114, 20180115, 20180116, 20180117, 20180118), value = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18))


I want to do some calculations (e.g. mean) that is using subset of data. For example: In 20180103, the average will be the sum of (yesterday) 20180102 and (today) 20180103 value ((2+3)/2 = 2.5). This is then rolling until the end of the period.



Result is like this:



    date    mean
20180102 1.5
20180103 2.5
20180104 3.5
20180105 4.5
....


Obviously I can write a for loop, subset the data for each iteration then calculate the mean, store the data and output the result. It is deemed too slow using for loop, and using foreach I don't know how to save the result...





The for loop is like:



datelist = df[, .(date)] 

# initialize the object
data = NA
temp = 0
for (i in 2:nrow(datelist)) {
today = as.numeric(datelist[i])
yesterday = as.numeric(datelist[i-1])

temp = df[date >= yesterday & date <= today]

temp = temp[, .(mean(value))]

temp = cbind(datelist[i], mean = temp$V1)


if (is.na(data)[1]){
data=temp

} else {
data=rbind(data,temp)

}


}


You can see I first subset the data and call it temp then do the calculation (average, use it to do lm, whatever function then stack it into data object)



This is slow and inefficient as I have millions of data point





Is there anyway I can do this in data.table syntax:



result = df[, { data = .SD[date >= yesterday & date <= today]
mean = mean(data$value)
list(mean = mean)}, by=.(date)]


I don't know how to express yesterday and today?? so that yesterday will be, in the for loop case, i-1 and today is i?



What i understand when doing by=.(date) is that data.table will look at each date and calculate whatever function you give in. If I can get the value (i.e. i) of which date the data.table is looking at now, then the value (i-1) will be yesterday...



Thanks










share|improve this question
















Let say I have this data.table:



df = data.table(date = c(20180101, 20180102, 20180103, 20180104, 20180105, 20180106, 20180107, 20180108, 20180109, 20180110, 20180111, 20180112, 20180113, 20180114, 20180115, 20180116, 20180117, 20180118), value = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18))


I want to do some calculations (e.g. mean) that is using subset of data. For example: In 20180103, the average will be the sum of (yesterday) 20180102 and (today) 20180103 value ((2+3)/2 = 2.5). This is then rolling until the end of the period.



Result is like this:



    date    mean
20180102 1.5
20180103 2.5
20180104 3.5
20180105 4.5
....


Obviously I can write a for loop, subset the data for each iteration then calculate the mean, store the data and output the result. It is deemed too slow using for loop, and using foreach I don't know how to save the result...





The for loop is like:



datelist = df[, .(date)] 

# initialize the object
data = NA
temp = 0
for (i in 2:nrow(datelist)) {
today = as.numeric(datelist[i])
yesterday = as.numeric(datelist[i-1])

temp = df[date >= yesterday & date <= today]

temp = temp[, .(mean(value))]

temp = cbind(datelist[i], mean = temp$V1)


if (is.na(data)[1]){
data=temp

} else {
data=rbind(data,temp)

}


}


You can see I first subset the data and call it temp then do the calculation (average, use it to do lm, whatever function then stack it into data object)



This is slow and inefficient as I have millions of data point





Is there anyway I can do this in data.table syntax:



result = df[, { data = .SD[date >= yesterday & date <= today]
mean = mean(data$value)
list(mean = mean)}, by=.(date)]


I don't know how to express yesterday and today?? so that yesterday will be, in the for loop case, i-1 and today is i?



What i understand when doing by=.(date) is that data.table will look at each date and calculate whatever function you give in. If I can get the value (i.e. i) of which date the data.table is looking at now, then the value (i-1) will be yesterday...



Thanks







r data.table






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 1 at 17:02







Gabriel

















asked Jan 1 at 16:24









GabrielGabriel

868




868













  • Do you want the current mean at any given time or for any given time the mean of "today" and "yesterday"?

    – nate
    Jan 1 at 16:39











  • Hi Nate, let me write the for loop out and you will understand what I mean. I don't know how to explain it..

    – Gabriel
    Jan 1 at 16:40






  • 2





    Is this a rolling mean? If so, see e.g. Improve rolling mean usage in data.table or Adaptive moving average - top performance in R

    – Henrik
    Jan 1 at 16:44








  • 1





    "I need to run a rolling regression" ~~> google "rolling regression data.table R site:stackoverflow.com". Among the first few hits, an answer by the data.table author: Is there a fast way to run a rolling regression inside data.table?

    – Henrik
    Jan 1 at 19:19








  • 1





    See also RcppRoll as described e.g. here Rolling regressions in R

    – Henrik
    Jan 1 at 19:27



















  • Do you want the current mean at any given time or for any given time the mean of "today" and "yesterday"?

    – nate
    Jan 1 at 16:39











  • Hi Nate, let me write the for loop out and you will understand what I mean. I don't know how to explain it..

    – Gabriel
    Jan 1 at 16:40






  • 2





    Is this a rolling mean? If so, see e.g. Improve rolling mean usage in data.table or Adaptive moving average - top performance in R

    – Henrik
    Jan 1 at 16:44








  • 1





    "I need to run a rolling regression" ~~> google "rolling regression data.table R site:stackoverflow.com". Among the first few hits, an answer by the data.table author: Is there a fast way to run a rolling regression inside data.table?

    – Henrik
    Jan 1 at 19:19








  • 1





    See also RcppRoll as described e.g. here Rolling regressions in R

    – Henrik
    Jan 1 at 19:27

















Do you want the current mean at any given time or for any given time the mean of "today" and "yesterday"?

– nate
Jan 1 at 16:39





Do you want the current mean at any given time or for any given time the mean of "today" and "yesterday"?

– nate
Jan 1 at 16:39













Hi Nate, let me write the for loop out and you will understand what I mean. I don't know how to explain it..

– Gabriel
Jan 1 at 16:40





Hi Nate, let me write the for loop out and you will understand what I mean. I don't know how to explain it..

– Gabriel
Jan 1 at 16:40




2




2





Is this a rolling mean? If so, see e.g. Improve rolling mean usage in data.table or Adaptive moving average - top performance in R

– Henrik
Jan 1 at 16:44







Is this a rolling mean? If so, see e.g. Improve rolling mean usage in data.table or Adaptive moving average - top performance in R

– Henrik
Jan 1 at 16:44






1




1





"I need to run a rolling regression" ~~> google "rolling regression data.table R site:stackoverflow.com". Among the first few hits, an answer by the data.table author: Is there a fast way to run a rolling regression inside data.table?

– Henrik
Jan 1 at 19:19







"I need to run a rolling regression" ~~> google "rolling regression data.table R site:stackoverflow.com". Among the first few hits, an answer by the data.table author: Is there a fast way to run a rolling regression inside data.table?

– Henrik
Jan 1 at 19:19






1




1





See also RcppRoll as described e.g. here Rolling regressions in R

– Henrik
Jan 1 at 19:27





See also RcppRoll as described e.g. here Rolling regressions in R

– Henrik
Jan 1 at 19:27












3 Answers
3






active

oldest

votes


















2














You can use the shift operator in the data.table j clause:



df[order(date),
rollmean := (value + shift(value, n = 1, type = "lag"))/2]

date value rollmean
1: 20180101 1 NA
2: 20180102 2 1.5
3: 20180103 3 2.5
4: 20180104 4 3.5
5: 20180105 5 4.5
6: 20180106 6 5.5
7: 20180107 7 6.5
8: 20180108 8 7.5
...





share|improve this answer
























  • I have think of this answer but if I need to lag 3 years of data then i need to shift 360*3 times...

    – Gabriel
    Jan 1 at 17:03



















1














Solution



What about something like this



(df$value[-nrow(df)]+df$value[-1] ) / 2
# yields
# [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5


And here creating the data frame



data.table::data.table(date = .subset2(df,1)[-1], 
mean = (df$value[-nrow(df)]+df$value[-1] ) / 2)
# date mean
# 1 20180102 1.5
# 2 20180103 2.5
# 3 20180104 3.5
# 4 20180105 4.5
# 5 20180106 5.5
# ...


with the data you provided.





Benchmarks



Here are some benchmarking figures:



# create a bigger data frame
dfLarge <- data.table::data.table(
date = seq(as.Date('1989-01-01'),as.Date('2019-01-01'),1),
value = 1:10958
)
microbenchmark::microbenchmark(sol = {
data.table::data.table(date = .subset2(dfLarge,1)[-1],
mean = (dfLarge$value[-nrow(dfLarge)]+dfLarge$value[-1] ) / 2)
})
# Unit: microseconds
# expr min lq mean median uq max neval
# sol 367.955 423.203 921.4908 530.781 788.969 22095.85 100




Addendum



If the main topic here isn't the task per se but subsetting efficiently, then specify that to begin with what your aim exactly is (subsetting itself is rather broad a topic, so add detail about the task(s) that need to be done). That way you are more likely to find what you are seeking and other users do not waste any effort.



That being said, here is a link providing some great information about subsetting in R.






share|improve this answer


























  • Thanks Nate let me try this out! I think subset2 is the function I am after.

    – Gabriel
    Jan 1 at 18:06





















0














Staying away from for loops you could use a purrr map function like this:



nvals <- nrow(df) # get the number of rows
vals <- df$value # get the value vector
output <- map(1:nvals, function(x) mean(vals[c(x-1, x)])
output <- unlist(output)
df <- cbind(df, output)


The output vector is:



 1.0  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5


Which is I think what you want.






share|improve this answer
























  • You should probably add the package name (purrr?) as map is not a base function

    – nate
    Jan 1 at 18:08











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53997042%2ftaking-subset-of-data-to-do-calculation-in-data-table%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























3 Answers
3






active

oldest

votes








3 Answers
3






active

oldest

votes









active

oldest

votes






active

oldest

votes









2














You can use the shift operator in the data.table j clause:



df[order(date),
rollmean := (value + shift(value, n = 1, type = "lag"))/2]

date value rollmean
1: 20180101 1 NA
2: 20180102 2 1.5
3: 20180103 3 2.5
4: 20180104 4 3.5
5: 20180105 5 4.5
6: 20180106 6 5.5
7: 20180107 7 6.5
8: 20180108 8 7.5
...





share|improve this answer
























  • I have think of this answer but if I need to lag 3 years of data then i need to shift 360*3 times...

    – Gabriel
    Jan 1 at 17:03
















2














You can use the shift operator in the data.table j clause:



df[order(date),
rollmean := (value + shift(value, n = 1, type = "lag"))/2]

date value rollmean
1: 20180101 1 NA
2: 20180102 2 1.5
3: 20180103 3 2.5
4: 20180104 4 3.5
5: 20180105 5 4.5
6: 20180106 6 5.5
7: 20180107 7 6.5
8: 20180108 8 7.5
...





share|improve this answer
























  • I have think of this answer but if I need to lag 3 years of data then i need to shift 360*3 times...

    – Gabriel
    Jan 1 at 17:03














2












2








2







You can use the shift operator in the data.table j clause:



df[order(date),
rollmean := (value + shift(value, n = 1, type = "lag"))/2]

date value rollmean
1: 20180101 1 NA
2: 20180102 2 1.5
3: 20180103 3 2.5
4: 20180104 4 3.5
5: 20180105 5 4.5
6: 20180106 6 5.5
7: 20180107 7 6.5
8: 20180108 8 7.5
...





share|improve this answer













You can use the shift operator in the data.table j clause:



df[order(date),
rollmean := (value + shift(value, n = 1, type = "lag"))/2]

date value rollmean
1: 20180101 1 NA
2: 20180102 2 1.5
3: 20180103 3 2.5
4: 20180104 4 3.5
5: 20180105 5 4.5
6: 20180106 6 5.5
7: 20180107 7 6.5
8: 20180108 8 7.5
...






share|improve this answer












share|improve this answer



share|improve this answer










answered Jan 1 at 16:47









zackzack

3,3641322




3,3641322













  • I have think of this answer but if I need to lag 3 years of data then i need to shift 360*3 times...

    – Gabriel
    Jan 1 at 17:03



















  • I have think of this answer but if I need to lag 3 years of data then i need to shift 360*3 times...

    – Gabriel
    Jan 1 at 17:03

















I have think of this answer but if I need to lag 3 years of data then i need to shift 360*3 times...

– Gabriel
Jan 1 at 17:03





I have think of this answer but if I need to lag 3 years of data then i need to shift 360*3 times...

– Gabriel
Jan 1 at 17:03













1














Solution



What about something like this



(df$value[-nrow(df)]+df$value[-1] ) / 2
# yields
# [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5


And here creating the data frame



data.table::data.table(date = .subset2(df,1)[-1], 
mean = (df$value[-nrow(df)]+df$value[-1] ) / 2)
# date mean
# 1 20180102 1.5
# 2 20180103 2.5
# 3 20180104 3.5
# 4 20180105 4.5
# 5 20180106 5.5
# ...


with the data you provided.





Benchmarks



Here are some benchmarking figures:



# create a bigger data frame
dfLarge <- data.table::data.table(
date = seq(as.Date('1989-01-01'),as.Date('2019-01-01'),1),
value = 1:10958
)
microbenchmark::microbenchmark(sol = {
data.table::data.table(date = .subset2(dfLarge,1)[-1],
mean = (dfLarge$value[-nrow(dfLarge)]+dfLarge$value[-1] ) / 2)
})
# Unit: microseconds
# expr min lq mean median uq max neval
# sol 367.955 423.203 921.4908 530.781 788.969 22095.85 100




Addendum



If the main topic here isn't the task per se but subsetting efficiently, then specify that to begin with what your aim exactly is (subsetting itself is rather broad a topic, so add detail about the task(s) that need to be done). That way you are more likely to find what you are seeking and other users do not waste any effort.



That being said, here is a link providing some great information about subsetting in R.






share|improve this answer


























  • Thanks Nate let me try this out! I think subset2 is the function I am after.

    – Gabriel
    Jan 1 at 18:06


















1














Solution



What about something like this



(df$value[-nrow(df)]+df$value[-1] ) / 2
# yields
# [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5


And here creating the data frame



data.table::data.table(date = .subset2(df,1)[-1], 
mean = (df$value[-nrow(df)]+df$value[-1] ) / 2)
# date mean
# 1 20180102 1.5
# 2 20180103 2.5
# 3 20180104 3.5
# 4 20180105 4.5
# 5 20180106 5.5
# ...


with the data you provided.





Benchmarks



Here are some benchmarking figures:



# create a bigger data frame
dfLarge <- data.table::data.table(
date = seq(as.Date('1989-01-01'),as.Date('2019-01-01'),1),
value = 1:10958
)
microbenchmark::microbenchmark(sol = {
data.table::data.table(date = .subset2(dfLarge,1)[-1],
mean = (dfLarge$value[-nrow(dfLarge)]+dfLarge$value[-1] ) / 2)
})
# Unit: microseconds
# expr min lq mean median uq max neval
# sol 367.955 423.203 921.4908 530.781 788.969 22095.85 100




Addendum



If the main topic here isn't the task per se but subsetting efficiently, then specify that to begin with what your aim exactly is (subsetting itself is rather broad a topic, so add detail about the task(s) that need to be done). That way you are more likely to find what you are seeking and other users do not waste any effort.



That being said, here is a link providing some great information about subsetting in R.






share|improve this answer


























  • Thanks Nate let me try this out! I think subset2 is the function I am after.

    – Gabriel
    Jan 1 at 18:06
















1












1








1







Solution



What about something like this



(df$value[-nrow(df)]+df$value[-1] ) / 2
# yields
# [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5


And here creating the data frame



data.table::data.table(date = .subset2(df,1)[-1], 
mean = (df$value[-nrow(df)]+df$value[-1] ) / 2)
# date mean
# 1 20180102 1.5
# 2 20180103 2.5
# 3 20180104 3.5
# 4 20180105 4.5
# 5 20180106 5.5
# ...


with the data you provided.





Benchmarks



Here are some benchmarking figures:



# create a bigger data frame
dfLarge <- data.table::data.table(
date = seq(as.Date('1989-01-01'),as.Date('2019-01-01'),1),
value = 1:10958
)
microbenchmark::microbenchmark(sol = {
data.table::data.table(date = .subset2(dfLarge,1)[-1],
mean = (dfLarge$value[-nrow(dfLarge)]+dfLarge$value[-1] ) / 2)
})
# Unit: microseconds
# expr min lq mean median uq max neval
# sol 367.955 423.203 921.4908 530.781 788.969 22095.85 100




Addendum



If the main topic here isn't the task per se but subsetting efficiently, then specify that to begin with what your aim exactly is (subsetting itself is rather broad a topic, so add detail about the task(s) that need to be done). That way you are more likely to find what you are seeking and other users do not waste any effort.



That being said, here is a link providing some great information about subsetting in R.






share|improve this answer















Solution



What about something like this



(df$value[-nrow(df)]+df$value[-1] ) / 2
# yields
# [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5


And here creating the data frame



data.table::data.table(date = .subset2(df,1)[-1], 
mean = (df$value[-nrow(df)]+df$value[-1] ) / 2)
# date mean
# 1 20180102 1.5
# 2 20180103 2.5
# 3 20180104 3.5
# 4 20180105 4.5
# 5 20180106 5.5
# ...


with the data you provided.





Benchmarks



Here are some benchmarking figures:



# create a bigger data frame
dfLarge <- data.table::data.table(
date = seq(as.Date('1989-01-01'),as.Date('2019-01-01'),1),
value = 1:10958
)
microbenchmark::microbenchmark(sol = {
data.table::data.table(date = .subset2(dfLarge,1)[-1],
mean = (dfLarge$value[-nrow(dfLarge)]+dfLarge$value[-1] ) / 2)
})
# Unit: microseconds
# expr min lq mean median uq max neval
# sol 367.955 423.203 921.4908 530.781 788.969 22095.85 100




Addendum



If the main topic here isn't the task per se but subsetting efficiently, then specify that to begin with what your aim exactly is (subsetting itself is rather broad a topic, so add detail about the task(s) that need to be done). That way you are more likely to find what you are seeking and other users do not waste any effort.



That being said, here is a link providing some great information about subsetting in R.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 1 at 18:29

























answered Jan 1 at 16:35









natenate

3,1411321




3,1411321













  • Thanks Nate let me try this out! I think subset2 is the function I am after.

    – Gabriel
    Jan 1 at 18:06





















  • Thanks Nate let me try this out! I think subset2 is the function I am after.

    – Gabriel
    Jan 1 at 18:06



















Thanks Nate let me try this out! I think subset2 is the function I am after.

– Gabriel
Jan 1 at 18:06







Thanks Nate let me try this out! I think subset2 is the function I am after.

– Gabriel
Jan 1 at 18:06













0














Staying away from for loops you could use a purrr map function like this:



nvals <- nrow(df) # get the number of rows
vals <- df$value # get the value vector
output <- map(1:nvals, function(x) mean(vals[c(x-1, x)])
output <- unlist(output)
df <- cbind(df, output)


The output vector is:



 1.0  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5


Which is I think what you want.






share|improve this answer
























  • You should probably add the package name (purrr?) as map is not a base function

    – nate
    Jan 1 at 18:08
















0














Staying away from for loops you could use a purrr map function like this:



nvals <- nrow(df) # get the number of rows
vals <- df$value # get the value vector
output <- map(1:nvals, function(x) mean(vals[c(x-1, x)])
output <- unlist(output)
df <- cbind(df, output)


The output vector is:



 1.0  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5


Which is I think what you want.






share|improve this answer
























  • You should probably add the package name (purrr?) as map is not a base function

    – nate
    Jan 1 at 18:08














0












0








0







Staying away from for loops you could use a purrr map function like this:



nvals <- nrow(df) # get the number of rows
vals <- df$value # get the value vector
output <- map(1:nvals, function(x) mean(vals[c(x-1, x)])
output <- unlist(output)
df <- cbind(df, output)


The output vector is:



 1.0  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5


Which is I think what you want.






share|improve this answer













Staying away from for loops you could use a purrr map function like this:



nvals <- nrow(df) # get the number of rows
vals <- df$value # get the value vector
output <- map(1:nvals, function(x) mean(vals[c(x-1, x)])
output <- unlist(output)
df <- cbind(df, output)


The output vector is:



 1.0  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5


Which is I think what you want.







share|improve this answer












share|improve this answer



share|improve this answer










answered Jan 1 at 17:27









SteveMSteveM

190211




190211













  • You should probably add the package name (purrr?) as map is not a base function

    – nate
    Jan 1 at 18:08



















  • You should probably add the package name (purrr?) as map is not a base function

    – nate
    Jan 1 at 18:08

















You should probably add the package name (purrr?) as map is not a base function

– nate
Jan 1 at 18:08





You should probably add the package name (purrr?) as map is not a base function

– nate
Jan 1 at 18:08


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53997042%2ftaking-subset-of-data-to-do-calculation-in-data-table%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

MongoDB - Not Authorized To Execute Command

How to fix TextFormField cause rebuild widget in Flutter

Npm cannot find a required file even through it is in the searched directory