Simplify code for getting multiple disease proportions in the population
I have data which looks like this
df <- data.frame (
cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))
My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)
What I want is a dataframe with the proportion of people from the population who have and do not have each condition.
This is what I have done to generate the output I want:
1) Construct the proportion of the population which have each condition individually
Prop_cancer <- df %>%
group_by(cancer) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "cancer") %>%
rename(Y_N = cancer)
Prop_CVD <- df %>%
group_by(CVD) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "CVD") %>%
rename(Y_N = CVD)
Prop_diab <- df %>%
group_by(diab) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "diab") %>%
rename(Y_N = diab)
Prop_stroke <- df %>%
group_by(stroke) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "stroke") %>%
rename(Y_N = stroke)
Prop_asthma <- df %>%
group_by(asthma) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "asthma") %>%
rename(Y_N = asthma)
Add these all together
Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)
I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?
I considered creating a new variable 'condition' in the original dataframe, through an ifelse
statement, but this does not allow a person to have more than one condition, and the conditions take precedence in the order I specify them.
Would be grateful for advice on how to simplify this code so that it's not so long.
r group-by dplyr bind
add a comment |
I have data which looks like this
df <- data.frame (
cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))
My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)
What I want is a dataframe with the proportion of people from the population who have and do not have each condition.
This is what I have done to generate the output I want:
1) Construct the proportion of the population which have each condition individually
Prop_cancer <- df %>%
group_by(cancer) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "cancer") %>%
rename(Y_N = cancer)
Prop_CVD <- df %>%
group_by(CVD) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "CVD") %>%
rename(Y_N = CVD)
Prop_diab <- df %>%
group_by(diab) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "diab") %>%
rename(Y_N = diab)
Prop_stroke <- df %>%
group_by(stroke) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "stroke") %>%
rename(Y_N = stroke)
Prop_asthma <- df %>%
group_by(asthma) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "asthma") %>%
rename(Y_N = asthma)
Add these all together
Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)
I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?
I considered creating a new variable 'condition' in the original dataframe, through an ifelse
statement, but this does not allow a person to have more than one condition, and the conditions take precedence in the order I specify them.
Would be grateful for advice on how to simplify this code so that it's not so long.
r group-by dplyr bind
add a comment |
I have data which looks like this
df <- data.frame (
cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))
My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)
What I want is a dataframe with the proportion of people from the population who have and do not have each condition.
This is what I have done to generate the output I want:
1) Construct the proportion of the population which have each condition individually
Prop_cancer <- df %>%
group_by(cancer) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "cancer") %>%
rename(Y_N = cancer)
Prop_CVD <- df %>%
group_by(CVD) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "CVD") %>%
rename(Y_N = CVD)
Prop_diab <- df %>%
group_by(diab) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "diab") %>%
rename(Y_N = diab)
Prop_stroke <- df %>%
group_by(stroke) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "stroke") %>%
rename(Y_N = stroke)
Prop_asthma <- df %>%
group_by(asthma) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "asthma") %>%
rename(Y_N = asthma)
Add these all together
Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)
I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?
I considered creating a new variable 'condition' in the original dataframe, through an ifelse
statement, but this does not allow a person to have more than one condition, and the conditions take precedence in the order I specify them.
Would be grateful for advice on how to simplify this code so that it's not so long.
r group-by dplyr bind
I have data which looks like this
df <- data.frame (
cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))
My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)
What I want is a dataframe with the proportion of people from the population who have and do not have each condition.
This is what I have done to generate the output I want:
1) Construct the proportion of the population which have each condition individually
Prop_cancer <- df %>%
group_by(cancer) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "cancer") %>%
rename(Y_N = cancer)
Prop_CVD <- df %>%
group_by(CVD) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "CVD") %>%
rename(Y_N = CVD)
Prop_diab <- df %>%
group_by(diab) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "diab") %>%
rename(Y_N = diab)
Prop_stroke <- df %>%
group_by(stroke) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "stroke") %>%
rename(Y_N = stroke)
Prop_asthma <- df %>%
group_by(asthma) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "asthma") %>%
rename(Y_N = asthma)
Add these all together
Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)
I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?
I considered creating a new variable 'condition' in the original dataframe, through an ifelse
statement, but this does not allow a person to have more than one condition, and the conditions take precedence in the order I specify them.
Would be grateful for advice on how to simplify this code so that it's not so long.
r group-by dplyr bind
r group-by dplyr bind
edited Nov 21 '18 at 1:13


Ronak Shah
37k104161
37k104161
asked Nov 21 '18 at 0:56
Laura Laura
1517
1517
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
With tidyverse
we can use gather
to collapse the dataframe into long format in key
, value
pairs then group_by
them and count the ratio in each group.
library(tidyverse)
df %>%
gather() %>%
group_by(key, value) %>%
summarise(freq = n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(freq = freq/sum(freq) * 100)
# key value freq
# <chr> <dbl> <dbl>
# 1 CVD 0 70
# 2 CVD 1 30
# 3 asthma 0 80
# 4 asthma 1 20
# 5 cancer 0 80
# 6 cancer 1 20
# 7 diab 0 70
# 8 diab 1 30
# 9 stroke 0 60
#10 stroke 1 40
Note - I have ignored LTC_count
column since that doesn't seem to be involved in the calculation.
Or we can reduce some steps by using count
as suggested by @Jake Kaupp
df %>%
gather() %>%
count(key, value) %>%
group_by(key) %>%
mutate(n = n/sum(n) * 100)
add a comment |
Percent population with a given disease:
colSums(df) / nrow(df) * 100
#cancer CVD diab stroke asthma LTC_count
#20 30 30 40 20 150
add a comment |
Using dplyr
this can be done in a single line, without gathering and whatnot:
df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
cancer CVD diab stroke asthma
1 0.2 0.3 0.3 0.4 0.2
If we want both the yes and the no frequencies:
bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")
id cancer CVD diab stroke asthma
1 Y 20 30 30 40 20
2 N 80 70 70 60 80
In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:
df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")
df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]
df1$condition<-gsub("_count","",df1$condition)
condition count freq Y_N
1 cancer 2 20 Y
2 cancer 8 80 N
3 CVD 3 30 Y
4 CVD 7 70 N
5 diab 3 30 Y
6 diab 7 70 N
7 stroke 4 40 Y
8 stroke 6 60 N
9 asthma 2 20 Y
10 asthma 8 80 N
This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?
– Laura
Nov 21 '18 at 3:32
I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.
– iod
Nov 21 '18 at 4:13
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53403837%2fsimplify-code-for-getting-multiple-disease-proportions-in-the-population%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
With tidyverse
we can use gather
to collapse the dataframe into long format in key
, value
pairs then group_by
them and count the ratio in each group.
library(tidyverse)
df %>%
gather() %>%
group_by(key, value) %>%
summarise(freq = n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(freq = freq/sum(freq) * 100)
# key value freq
# <chr> <dbl> <dbl>
# 1 CVD 0 70
# 2 CVD 1 30
# 3 asthma 0 80
# 4 asthma 1 20
# 5 cancer 0 80
# 6 cancer 1 20
# 7 diab 0 70
# 8 diab 1 30
# 9 stroke 0 60
#10 stroke 1 40
Note - I have ignored LTC_count
column since that doesn't seem to be involved in the calculation.
Or we can reduce some steps by using count
as suggested by @Jake Kaupp
df %>%
gather() %>%
count(key, value) %>%
group_by(key) %>%
mutate(n = n/sum(n) * 100)
add a comment |
With tidyverse
we can use gather
to collapse the dataframe into long format in key
, value
pairs then group_by
them and count the ratio in each group.
library(tidyverse)
df %>%
gather() %>%
group_by(key, value) %>%
summarise(freq = n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(freq = freq/sum(freq) * 100)
# key value freq
# <chr> <dbl> <dbl>
# 1 CVD 0 70
# 2 CVD 1 30
# 3 asthma 0 80
# 4 asthma 1 20
# 5 cancer 0 80
# 6 cancer 1 20
# 7 diab 0 70
# 8 diab 1 30
# 9 stroke 0 60
#10 stroke 1 40
Note - I have ignored LTC_count
column since that doesn't seem to be involved in the calculation.
Or we can reduce some steps by using count
as suggested by @Jake Kaupp
df %>%
gather() %>%
count(key, value) %>%
group_by(key) %>%
mutate(n = n/sum(n) * 100)
add a comment |
With tidyverse
we can use gather
to collapse the dataframe into long format in key
, value
pairs then group_by
them and count the ratio in each group.
library(tidyverse)
df %>%
gather() %>%
group_by(key, value) %>%
summarise(freq = n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(freq = freq/sum(freq) * 100)
# key value freq
# <chr> <dbl> <dbl>
# 1 CVD 0 70
# 2 CVD 1 30
# 3 asthma 0 80
# 4 asthma 1 20
# 5 cancer 0 80
# 6 cancer 1 20
# 7 diab 0 70
# 8 diab 1 30
# 9 stroke 0 60
#10 stroke 1 40
Note - I have ignored LTC_count
column since that doesn't seem to be involved in the calculation.
Or we can reduce some steps by using count
as suggested by @Jake Kaupp
df %>%
gather() %>%
count(key, value) %>%
group_by(key) %>%
mutate(n = n/sum(n) * 100)
With tidyverse
we can use gather
to collapse the dataframe into long format in key
, value
pairs then group_by
them and count the ratio in each group.
library(tidyverse)
df %>%
gather() %>%
group_by(key, value) %>%
summarise(freq = n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(freq = freq/sum(freq) * 100)
# key value freq
# <chr> <dbl> <dbl>
# 1 CVD 0 70
# 2 CVD 1 30
# 3 asthma 0 80
# 4 asthma 1 20
# 5 cancer 0 80
# 6 cancer 1 20
# 7 diab 0 70
# 8 diab 1 30
# 9 stroke 0 60
#10 stroke 1 40
Note - I have ignored LTC_count
column since that doesn't seem to be involved in the calculation.
Or we can reduce some steps by using count
as suggested by @Jake Kaupp
df %>%
gather() %>%
count(key, value) %>%
group_by(key) %>%
mutate(n = n/sum(n) * 100)
edited Nov 21 '18 at 1:29
answered Nov 21 '18 at 1:08


Ronak ShahRonak Shah
37k104161
37k104161
add a comment |
add a comment |
Percent population with a given disease:
colSums(df) / nrow(df) * 100
#cancer CVD diab stroke asthma LTC_count
#20 30 30 40 20 150
add a comment |
Percent population with a given disease:
colSums(df) / nrow(df) * 100
#cancer CVD diab stroke asthma LTC_count
#20 30 30 40 20 150
add a comment |
Percent population with a given disease:
colSums(df) / nrow(df) * 100
#cancer CVD diab stroke asthma LTC_count
#20 30 30 40 20 150
Percent population with a given disease:
colSums(df) / nrow(df) * 100
#cancer CVD diab stroke asthma LTC_count
#20 30 30 40 20 150
answered Nov 21 '18 at 0:58
12b345b6b7812b345b6b78
782115
782115
add a comment |
add a comment |
Using dplyr
this can be done in a single line, without gathering and whatnot:
df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
cancer CVD diab stroke asthma
1 0.2 0.3 0.3 0.4 0.2
If we want both the yes and the no frequencies:
bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")
id cancer CVD diab stroke asthma
1 Y 20 30 30 40 20
2 N 80 70 70 60 80
In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:
df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")
df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]
df1$condition<-gsub("_count","",df1$condition)
condition count freq Y_N
1 cancer 2 20 Y
2 cancer 8 80 N
3 CVD 3 30 Y
4 CVD 7 70 N
5 diab 3 30 Y
6 diab 7 70 N
7 stroke 4 40 Y
8 stroke 6 60 N
9 asthma 2 20 Y
10 asthma 8 80 N
This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?
– Laura
Nov 21 '18 at 3:32
I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.
– iod
Nov 21 '18 at 4:13
add a comment |
Using dplyr
this can be done in a single line, without gathering and whatnot:
df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
cancer CVD diab stroke asthma
1 0.2 0.3 0.3 0.4 0.2
If we want both the yes and the no frequencies:
bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")
id cancer CVD diab stroke asthma
1 Y 20 30 30 40 20
2 N 80 70 70 60 80
In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:
df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")
df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]
df1$condition<-gsub("_count","",df1$condition)
condition count freq Y_N
1 cancer 2 20 Y
2 cancer 8 80 N
3 CVD 3 30 Y
4 CVD 7 70 N
5 diab 3 30 Y
6 diab 7 70 N
7 stroke 4 40 Y
8 stroke 6 60 N
9 asthma 2 20 Y
10 asthma 8 80 N
This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?
– Laura
Nov 21 '18 at 3:32
I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.
– iod
Nov 21 '18 at 4:13
add a comment |
Using dplyr
this can be done in a single line, without gathering and whatnot:
df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
cancer CVD diab stroke asthma
1 0.2 0.3 0.3 0.4 0.2
If we want both the yes and the no frequencies:
bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")
id cancer CVD diab stroke asthma
1 Y 20 30 30 40 20
2 N 80 70 70 60 80
In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:
df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")
df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]
df1$condition<-gsub("_count","",df1$condition)
condition count freq Y_N
1 cancer 2 20 Y
2 cancer 8 80 N
3 CVD 3 30 Y
4 CVD 7 70 N
5 diab 3 30 Y
6 diab 7 70 N
7 stroke 4 40 Y
8 stroke 6 60 N
9 asthma 2 20 Y
10 asthma 8 80 N
Using dplyr
this can be done in a single line, without gathering and whatnot:
df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
cancer CVD diab stroke asthma
1 0.2 0.3 0.3 0.4 0.2
If we want both the yes and the no frequencies:
bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")
id cancer CVD diab stroke asthma
1 Y 20 30 30 40 20
2 N 80 70 70 60 80
In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:
df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")
df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]
df1$condition<-gsub("_count","",df1$condition)
condition count freq Y_N
1 cancer 2 20 Y
2 cancer 8 80 N
3 CVD 3 30 Y
4 CVD 7 70 N
5 diab 3 30 Y
6 diab 7 70 N
7 stroke 4 40 Y
8 stroke 6 60 N
9 asthma 2 20 Y
10 asthma 8 80 N
edited Nov 21 '18 at 4:12
answered Nov 21 '18 at 2:53
iodiod
3,8232722
3,8232722
This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?
– Laura
Nov 21 '18 at 3:32
I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.
– iod
Nov 21 '18 at 4:13
add a comment |
This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?
– Laura
Nov 21 '18 at 3:32
I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.
– iod
Nov 21 '18 at 4:13
This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?
– Laura
Nov 21 '18 at 3:32
This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?
– Laura
Nov 21 '18 at 3:32
I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.
– iod
Nov 21 '18 at 4:13
I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.
– iod
Nov 21 '18 at 4:13
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53403837%2fsimplify-code-for-getting-multiple-disease-proportions-in-the-population%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown