Simplify code for getting multiple disease proportions in the population

I have data which looks like this

df <- data.frame (

cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),

CVD =    c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),

diab =   c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),

stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),

asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),

LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)

What I want is a dataframe with the proportion of people from the population who have and do not have each condition.

This is what I have done to generate the output I want:

1) Construct the proportion of the population which have each condition individually

Prop_cancer <- df %>%

group_by(cancer) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "cancer") %>%

rename(Y_N = cancer) 



Prop_CVD <- df %>%

group_by(CVD) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "CVD") %>%

rename(Y_N = CVD)



Prop_diab <- df %>%

group_by(diab) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "diab") %>%

rename(Y_N = diab)



Prop_stroke <- df %>%

group_by(stroke) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "stroke") %>%

rename(Y_N = stroke)



Prop_asthma <- df %>%

group_by(asthma) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "asthma") %>%

rename(Y_N = asthma)

Add these all together

Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)

I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?

I considered creating a new variable 'condition' in the original dataframe, through an ifelse statement, but this does not allow a person to have more than one condition, and the conditions take precedence in the order I specify them.

Would be grateful for advice on how to simplify this code so that it's not so long.

edited Nov 21 '18 at 1:13

Ronak Shah

37k104161

asked Nov 21 '18 at 0:56

Laura

1517

add a comment |

I have data which looks like this

df <- data.frame (

cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),

CVD =    c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),

diab =   c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),

stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),

asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),

LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)

What I want is a dataframe with the proportion of people from the population who have and do not have each condition.

This is what I have done to generate the output I want:

1) Construct the proportion of the population which have each condition individually

Prop_cancer <- df %>%

group_by(cancer) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "cancer") %>%

rename(Y_N = cancer) 



Prop_CVD <- df %>%

group_by(CVD) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "CVD") %>%

rename(Y_N = CVD)



Prop_diab <- df %>%

group_by(diab) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "diab") %>%

rename(Y_N = diab)



Prop_stroke <- df %>%

group_by(stroke) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "stroke") %>%

rename(Y_N = stroke)



Prop_asthma <- df %>%

group_by(asthma) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "asthma") %>%

rename(Y_N = asthma)

Add these all together

Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)

I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?

Would be grateful for advice on how to simplify this code so that it's not so long.

edited Nov 21 '18 at 1:13

Ronak Shah

37k104161

asked Nov 21 '18 at 0:56

Laura

1517

add a comment |

I have data which looks like this

df <- data.frame (

cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),

CVD =    c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),

diab =   c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),

stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),

asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),

LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)

What I want is a dataframe with the proportion of people from the population who have and do not have each condition.

This is what I have done to generate the output I want:

1) Construct the proportion of the population which have each condition individually

Prop_cancer <- df %>%

group_by(cancer) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "cancer") %>%

rename(Y_N = cancer) 



Prop_CVD <- df %>%

group_by(CVD) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "CVD") %>%

rename(Y_N = CVD)



Prop_diab <- df %>%

group_by(diab) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "diab") %>%

rename(Y_N = diab)



Prop_stroke <- df %>%

group_by(stroke) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "stroke") %>%

rename(Y_N = stroke)



Prop_asthma <- df %>%

group_by(asthma) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "asthma") %>%

rename(Y_N = asthma)

Add these all together

Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)

I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?

Would be grateful for advice on how to simplify this code so that it's not so long.

edited Nov 21 '18 at 1:13

Ronak Shah

37k104161

asked Nov 21 '18 at 0:56

Laura

1517

I have data which looks like this

df <- data.frame (

cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),

CVD =    c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),

diab =   c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),

stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),

asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),

LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)

What I want is a dataframe with the proportion of people from the population who have and do not have each condition.

This is what I have done to generate the output I want:

1) Construct the proportion of the population which have each condition individually

Prop_cancer <- df %>%

group_by(cancer) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "cancer") %>%

rename(Y_N = cancer) 



Prop_CVD <- df %>%

group_by(CVD) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "CVD") %>%

rename(Y_N = CVD)



Prop_diab <- df %>%

group_by(diab) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "diab") %>%

rename(Y_N = diab)



Prop_stroke <- df %>%

group_by(stroke) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "stroke") %>%

rename(Y_N = stroke)



Prop_asthma <- df %>%

group_by(asthma) %>%

summarise(count = n()) %>%

mutate(freq = round((count / sum(count))*100, digits = 1)) %>%

mutate(condition = "asthma") %>%

rename(Y_N = asthma)

Add these all together

Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)

I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?

Would be grateful for advice on how to simplify this code so that it's not so long.

r group-by dplyr bind

edited Nov 21 '18 at 1:13

Ronak Shah

37k104161

asked Nov 21 '18 at 0:56

Laura

1517

edited Nov 21 '18 at 1:13

Ronak Shah

37k104161

asked Nov 21 '18 at 0:56

Laura

1517

edited Nov 21 '18 at 1:13

Ronak Shah

37k104161

edited Nov 21 '18 at 1:13

Ronak Shah

37k104161

edited Nov 21 '18 at 1:13

Ronak Shah

37k104161

asked Nov 21 '18 at 0:56

Laura

1517

asked Nov 21 '18 at 0:56

Laura

1517

asked Nov 21 '18 at 0:56

Laura

1517

add a comment |

3 Answers
3

active

oldest

votes

With tidyverse we can use gather to collapse the dataframe into long format in key, value pairs then group_by them and count the ratio in each group.

library(tidyverse)



df %>%

  gather() %>%

  group_by(key, value) %>%

  summarise(freq = n()) %>%

  ungroup() %>%

  group_by(key) %>%

  mutate(freq = freq/sum(freq) * 100)





#   key    value  freq

#   <chr>  <dbl> <dbl>

# 1 CVD        0    70

# 2 CVD        1    30

# 3 asthma     0    80

# 4 asthma     1    20

# 5 cancer     0    80

# 6 cancer     1    20

# 7 diab       0    70

# 8 diab       1    30

# 9 stroke     0    60

#10 stroke     1    40

Note - I have ignored LTC_count column since that doesn't seem to be involved in the calculation.

Or we can reduce some steps by using count as suggested by @Jake Kaupp

df %>%

  gather() %>%

  count(key, value) %>%

  group_by(key) %>%

  mutate(n = n/sum(n) * 100)

edited Nov 21 '18 at 1:29

answered Nov 21 '18 at 1:08

Ronak Shah

37k104161

add a comment |

Percent population with a given disease:

colSums(df) / nrow(df) * 100

#cancer       CVD      diab    stroke    asthma LTC_count 

#20        30        30        40        20       150

answered Nov 21 '18 at 0:58

12b345b6b78

782115

add a comment |

Using dplyr this can be done in a single line, without gathering and whatnot:

df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))

  cancer CVD diab stroke asthma

1    0.2 0.3  0.3    0.4    0.2

If we want both the yes and the no frequencies:

bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 

  "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")



  id cancer CVD diab stroke asthma

1  Y     20  30   30     40     20

2  N     80  70   70     60     80

In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:

df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 

                 "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")



df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),

          select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]



df1$condition<-gsub("_count","",df1$condition)



   condition count freq Y_N

1     cancer     2   20   Y

2     cancer     8   80   N

3        CVD     3   30   Y

4        CVD     7   70   N

5       diab     3   30   Y

6       diab     7   70   N

7     stroke     4   40   Y

8     stroke     6   60   N

9     asthma     2   20   Y

10    asthma     8   80   N

edited Nov 21 '18 at 4:12

answered Nov 21 '18 at 2:53

iod

3,8232722

This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

– Laura
Nov 21 '18 at 3:32

I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

– iod
Nov 21 '18 at 4:13

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53403837%2fsimplify-code-for-getting-multiple-disease-proportions-in-the-population%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

With tidyverse we can use gather to collapse the dataframe into long format in key, value pairs then group_by them and count the ratio in each group.

library(tidyverse)



df %>%

  gather() %>%

  group_by(key, value) %>%

  summarise(freq = n()) %>%

  ungroup() %>%

  group_by(key) %>%

  mutate(freq = freq/sum(freq) * 100)





#   key    value  freq

#   <chr>  <dbl> <dbl>

# 1 CVD        0    70

# 2 CVD        1    30

# 3 asthma     0    80

# 4 asthma     1    20

# 5 cancer     0    80

# 6 cancer     1    20

# 7 diab       0    70

# 8 diab       1    30

# 9 stroke     0    60

#10 stroke     1    40

Note - I have ignored LTC_count column since that doesn't seem to be involved in the calculation.

Or we can reduce some steps by using count as suggested by @Jake Kaupp

df %>%

  gather() %>%

  count(key, value) %>%

  group_by(key) %>%

  mutate(n = n/sum(n) * 100)

edited Nov 21 '18 at 1:29

answered Nov 21 '18 at 1:08

Ronak Shah

37k104161

add a comment |

With tidyverse we can use gather to collapse the dataframe into long format in key, value pairs then group_by them and count the ratio in each group.

library(tidyverse)



df %>%

  gather() %>%

  group_by(key, value) %>%

  summarise(freq = n()) %>%

  ungroup() %>%

  group_by(key) %>%

  mutate(freq = freq/sum(freq) * 100)





#   key    value  freq

#   <chr>  <dbl> <dbl>

# 1 CVD        0    70

# 2 CVD        1    30

# 3 asthma     0    80

# 4 asthma     1    20

# 5 cancer     0    80

# 6 cancer     1    20

# 7 diab       0    70

# 8 diab       1    30

# 9 stroke     0    60

#10 stroke     1    40

Note - I have ignored LTC_count column since that doesn't seem to be involved in the calculation.

Or we can reduce some steps by using count as suggested by @Jake Kaupp

df %>%

  gather() %>%

  count(key, value) %>%

  group_by(key) %>%

  mutate(n = n/sum(n) * 100)

edited Nov 21 '18 at 1:29

answered Nov 21 '18 at 1:08

Ronak Shah

37k104161

add a comment |

With tidyverse we can use gather to collapse the dataframe into long format in key, value pairs then group_by them and count the ratio in each group.

library(tidyverse)



df %>%

  gather() %>%

  group_by(key, value) %>%

  summarise(freq = n()) %>%

  ungroup() %>%

  group_by(key) %>%

  mutate(freq = freq/sum(freq) * 100)





#   key    value  freq

#   <chr>  <dbl> <dbl>

# 1 CVD        0    70

# 2 CVD        1    30

# 3 asthma     0    80

# 4 asthma     1    20

# 5 cancer     0    80

# 6 cancer     1    20

# 7 diab       0    70

# 8 diab       1    30

# 9 stroke     0    60

#10 stroke     1    40

Note - I have ignored LTC_count column since that doesn't seem to be involved in the calculation.

Or we can reduce some steps by using count as suggested by @Jake Kaupp

df %>%

  gather() %>%

  count(key, value) %>%

  group_by(key) %>%

  mutate(n = n/sum(n) * 100)

edited Nov 21 '18 at 1:29

answered Nov 21 '18 at 1:08

Ronak Shah

37k104161

With tidyverse we can use gather to collapse the dataframe into long format in key, value pairs then group_by them and count the ratio in each group.

library(tidyverse)



df %>%

  gather() %>%

  group_by(key, value) %>%

  summarise(freq = n()) %>%

  ungroup() %>%

  group_by(key) %>%

  mutate(freq = freq/sum(freq) * 100)





#   key    value  freq

#   <chr>  <dbl> <dbl>

# 1 CVD        0    70

# 2 CVD        1    30

# 3 asthma     0    80

# 4 asthma     1    20

# 5 cancer     0    80

# 6 cancer     1    20

# 7 diab       0    70

# 8 diab       1    30

# 9 stroke     0    60

#10 stroke     1    40

Note - I have ignored LTC_count column since that doesn't seem to be involved in the calculation.

Or we can reduce some steps by using count as suggested by @Jake Kaupp

df %>%

  gather() %>%

  count(key, value) %>%

  group_by(key) %>%

  mutate(n = n/sum(n) * 100)

edited Nov 21 '18 at 1:29

answered Nov 21 '18 at 1:08

Ronak Shah

37k104161

edited Nov 21 '18 at 1:29

answered Nov 21 '18 at 1:08

Ronak Shah

37k104161

answered Nov 21 '18 at 1:08

Ronak Shah

37k104161

answered Nov 21 '18 at 1:08

Ronak Shah

37k104161

add a comment |

Percent population with a given disease:

colSums(df) / nrow(df) * 100

#cancer       CVD      diab    stroke    asthma LTC_count 

#20        30        30        40        20       150

answered Nov 21 '18 at 0:58

12b345b6b78

782115

add a comment |

Percent population with a given disease:

colSums(df) / nrow(df) * 100

#cancer       CVD      diab    stroke    asthma LTC_count 

#20        30        30        40        20       150

answered Nov 21 '18 at 0:58

12b345b6b78

782115

add a comment |

Percent population with a given disease:

colSums(df) / nrow(df) * 100

#cancer       CVD      diab    stroke    asthma LTC_count 

#20        30        30        40        20       150

answered Nov 21 '18 at 0:58

12b345b6b78

782115

Percent population with a given disease:

colSums(df) / nrow(df) * 100

#cancer       CVD      diab    stroke    asthma LTC_count 

#20        30        30        40        20       150

answered Nov 21 '18 at 0:58

12b345b6b78

782115

answered Nov 21 '18 at 0:58

12b345b6b78

782115

answered Nov 21 '18 at 0:58

12b345b6b78

782115

answered Nov 21 '18 at 0:58

12b345b6b78

782115

add a comment |

Using dplyr this can be done in a single line, without gathering and whatnot:

df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))

  cancer CVD diab stroke asthma

1    0.2 0.3  0.3    0.4    0.2

If we want both the yes and the no frequencies:

bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 

  "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")



  id cancer CVD diab stroke asthma

1  Y     20  30   30     40     20

2  N     80  70   70     60     80

In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:

df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 

                 "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")



df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),

          select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]



df1$condition<-gsub("_count","",df1$condition)



   condition count freq Y_N

1     cancer     2   20   Y

2     cancer     8   80   N

3        CVD     3   30   Y

4        CVD     7   70   N

5       diab     3   30   Y

6       diab     7   70   N

7     stroke     4   40   Y

8     stroke     6   60   N

9     asthma     2   20   Y

10    asthma     8   80   N

edited Nov 21 '18 at 4:12

answered Nov 21 '18 at 2:53

iod

3,8232722

This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

– Laura
Nov 21 '18 at 3:32

I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

– iod
Nov 21 '18 at 4:13

add a comment |

Using dplyr this can be done in a single line, without gathering and whatnot:

df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))

  cancer CVD diab stroke asthma

1    0.2 0.3  0.3    0.4    0.2

If we want both the yes and the no frequencies:

bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 

  "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")



  id cancer CVD diab stroke asthma

1  Y     20  30   30     40     20

2  N     80  70   70     60     80

In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:

df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 

                 "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")



df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),

          select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]



df1$condition<-gsub("_count","",df1$condition)



   condition count freq Y_N

1     cancer     2   20   Y

2     cancer     8   80   N

3        CVD     3   30   Y

4        CVD     7   70   N

5       diab     3   30   Y

6       diab     7   70   N

7     stroke     4   40   Y

8     stroke     6   60   N

9     asthma     2   20   Y

10    asthma     8   80   N

edited Nov 21 '18 at 4:12

answered Nov 21 '18 at 2:53

iod

3,8232722

This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

– Laura
Nov 21 '18 at 3:32

I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

– iod
Nov 21 '18 at 4:13

add a comment |

Using dplyr this can be done in a single line, without gathering and whatnot:

df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))

  cancer CVD diab stroke asthma

1    0.2 0.3  0.3    0.4    0.2

If we want both the yes and the no frequencies:

bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 

  "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")



  id cancer CVD diab stroke asthma

1  Y     20  30   30     40     20

2  N     80  70   70     60     80

In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:

df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 

                 "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")



df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),

          select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]



df1$condition<-gsub("_count","",df1$condition)



   condition count freq Y_N

1     cancer     2   20   Y

2     cancer     8   80   N

3        CVD     3   30   Y

4        CVD     7   70   N

5       diab     3   30   Y

6       diab     7   70   N

7     stroke     4   40   Y

8     stroke     6   60   N

9     asthma     2   20   Y

10    asthma     8   80   N

edited Nov 21 '18 at 4:12

answered Nov 21 '18 at 2:53

iod

3,8232722

Using dplyr this can be done in a single line, without gathering and whatnot:

df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))

  cancer CVD diab stroke asthma

1    0.2 0.3  0.3    0.4    0.2

If we want both the yes and the no frequencies:

bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 

  "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")



  id cancer CVD diab stroke asthma

1  Y     20  30   30     40     20

2  N     80  70   70     60     80

In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:

df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 

                 "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")



df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),

          select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]



df1$condition<-gsub("_count","",df1$condition)



   condition count freq Y_N

1     cancer     2   20   Y

2     cancer     8   80   N

3        CVD     3   30   Y

4        CVD     7   70   N

5       diab     3   30   Y

6       diab     7   70   N

7     stroke     4   40   Y

8     stroke     6   60   N

9     asthma     2   20   Y

10    asthma     8   80   N

edited Nov 21 '18 at 4:12

answered Nov 21 '18 at 2:53

iod

3,8232722

edited Nov 21 '18 at 4:12

answered Nov 21 '18 at 2:53

iod

3,8232722

answered Nov 21 '18 at 2:53

iod

3,8232722

answered Nov 21 '18 at 2:53

iod

3,8232722

This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

– Laura
Nov 21 '18 at 3:32

I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

– iod
Nov 21 '18 at 4:13

add a comment |

This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

– Laura
Nov 21 '18 at 3:32

I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

– iod
Nov 21 '18 at 4:13

This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

– Laura
Nov 21 '18 at 3:32

I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

– iod
Nov 21 '18 at 4:13

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Search This Blog

Ufyukyu