Simplify code for getting multiple disease proportions in the population












0















I have data which looks like this



df <- data.frame (
cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))


My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)



What I want is a dataframe with the proportion of people from the population who have and do not have each condition.



This is what I have done to generate the output I want:



1) Construct the proportion of the population which have each condition individually



Prop_cancer <- df %>%
group_by(cancer) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "cancer") %>%
rename(Y_N = cancer)

Prop_CVD <- df %>%
group_by(CVD) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "CVD") %>%
rename(Y_N = CVD)

Prop_diab <- df %>%
group_by(diab) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "diab") %>%
rename(Y_N = diab)

Prop_stroke <- df %>%
group_by(stroke) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "stroke") %>%
rename(Y_N = stroke)

Prop_asthma <- df %>%
group_by(asthma) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "asthma") %>%
rename(Y_N = asthma)


Add these all together



Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)


I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?



I considered creating a new variable 'condition' in the original dataframe, through an ifelse statement, but this does not allow a person to have more than one condition, and the conditions take precedence in the order I specify them.



Would be grateful for advice on how to simplify this code so that it's not so long.










share|improve this question





























    0















    I have data which looks like this



    df <- data.frame (
    cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
    CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
    diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
    stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
    asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
    LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))


    My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)



    What I want is a dataframe with the proportion of people from the population who have and do not have each condition.



    This is what I have done to generate the output I want:



    1) Construct the proportion of the population which have each condition individually



    Prop_cancer <- df %>%
    group_by(cancer) %>%
    summarise(count = n()) %>%
    mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
    mutate(condition = "cancer") %>%
    rename(Y_N = cancer)

    Prop_CVD <- df %>%
    group_by(CVD) %>%
    summarise(count = n()) %>%
    mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
    mutate(condition = "CVD") %>%
    rename(Y_N = CVD)

    Prop_diab <- df %>%
    group_by(diab) %>%
    summarise(count = n()) %>%
    mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
    mutate(condition = "diab") %>%
    rename(Y_N = diab)

    Prop_stroke <- df %>%
    group_by(stroke) %>%
    summarise(count = n()) %>%
    mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
    mutate(condition = "stroke") %>%
    rename(Y_N = stroke)

    Prop_asthma <- df %>%
    group_by(asthma) %>%
    summarise(count = n()) %>%
    mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
    mutate(condition = "asthma") %>%
    rename(Y_N = asthma)


    Add these all together



    Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)


    I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?



    I considered creating a new variable 'condition' in the original dataframe, through an ifelse statement, but this does not allow a person to have more than one condition, and the conditions take precedence in the order I specify them.



    Would be grateful for advice on how to simplify this code so that it's not so long.










    share|improve this question



























      0












      0








      0








      I have data which looks like this



      df <- data.frame (
      cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
      CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
      diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
      stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
      asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
      LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))


      My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)



      What I want is a dataframe with the proportion of people from the population who have and do not have each condition.



      This is what I have done to generate the output I want:



      1) Construct the proportion of the population which have each condition individually



      Prop_cancer <- df %>%
      group_by(cancer) %>%
      summarise(count = n()) %>%
      mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
      mutate(condition = "cancer") %>%
      rename(Y_N = cancer)

      Prop_CVD <- df %>%
      group_by(CVD) %>%
      summarise(count = n()) %>%
      mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
      mutate(condition = "CVD") %>%
      rename(Y_N = CVD)

      Prop_diab <- df %>%
      group_by(diab) %>%
      summarise(count = n()) %>%
      mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
      mutate(condition = "diab") %>%
      rename(Y_N = diab)

      Prop_stroke <- df %>%
      group_by(stroke) %>%
      summarise(count = n()) %>%
      mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
      mutate(condition = "stroke") %>%
      rename(Y_N = stroke)

      Prop_asthma <- df %>%
      group_by(asthma) %>%
      summarise(count = n()) %>%
      mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
      mutate(condition = "asthma") %>%
      rename(Y_N = asthma)


      Add these all together



      Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)


      I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?



      I considered creating a new variable 'condition' in the original dataframe, through an ifelse statement, but this does not allow a person to have more than one condition, and the conditions take precedence in the order I specify them.



      Would be grateful for advice on how to simplify this code so that it's not so long.










      share|improve this question
















      I have data which looks like this



      df <- data.frame (
      cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
      CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
      diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
      stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
      asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
      LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))


      My data is much larger, approx. 1 million rows. Each row is a person, and the variables correspond to the diseases that person has (1 = yes)



      What I want is a dataframe with the proportion of people from the population who have and do not have each condition.



      This is what I have done to generate the output I want:



      1) Construct the proportion of the population which have each condition individually



      Prop_cancer <- df %>%
      group_by(cancer) %>%
      summarise(count = n()) %>%
      mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
      mutate(condition = "cancer") %>%
      rename(Y_N = cancer)

      Prop_CVD <- df %>%
      group_by(CVD) %>%
      summarise(count = n()) %>%
      mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
      mutate(condition = "CVD") %>%
      rename(Y_N = CVD)

      Prop_diab <- df %>%
      group_by(diab) %>%
      summarise(count = n()) %>%
      mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
      mutate(condition = "diab") %>%
      rename(Y_N = diab)

      Prop_stroke <- df %>%
      group_by(stroke) %>%
      summarise(count = n()) %>%
      mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
      mutate(condition = "stroke") %>%
      rename(Y_N = stroke)

      Prop_asthma <- df %>%
      group_by(asthma) %>%
      summarise(count = n()) %>%
      mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
      mutate(condition = "asthma") %>%
      rename(Y_N = asthma)


      Add these all together



      Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)


      I've got a large number of conditions, and a lot of data. Is there an easier / faster way to do this?



      I considered creating a new variable 'condition' in the original dataframe, through an ifelse statement, but this does not allow a person to have more than one condition, and the conditions take precedence in the order I specify them.



      Would be grateful for advice on how to simplify this code so that it's not so long.







      r group-by dplyr bind






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 21 '18 at 1:13









      Ronak Shah

      37k104161




      37k104161










      asked Nov 21 '18 at 0:56









      Laura Laura

      1517




      1517
























          3 Answers
          3






          active

          oldest

          votes


















          1














          With tidyverse we can use gather to collapse the dataframe into long format in key, value pairs then group_by them and count the ratio in each group.



          library(tidyverse)

          df %>%
          gather() %>%
          group_by(key, value) %>%
          summarise(freq = n()) %>%
          ungroup() %>%
          group_by(key) %>%
          mutate(freq = freq/sum(freq) * 100)


          # key value freq
          # <chr> <dbl> <dbl>
          # 1 CVD 0 70
          # 2 CVD 1 30
          # 3 asthma 0 80
          # 4 asthma 1 20
          # 5 cancer 0 80
          # 6 cancer 1 20
          # 7 diab 0 70
          # 8 diab 1 30
          # 9 stroke 0 60
          #10 stroke 1 40


          Note - I have ignored LTC_count column since that doesn't seem to be involved in the calculation.





          Or we can reduce some steps by using count as suggested by @Jake Kaupp



          df %>%
          gather() %>%
          count(key, value) %>%
          group_by(key) %>%
          mutate(n = n/sum(n) * 100)





          share|improve this answer

































            1














            Percent population with a given disease:



            colSums(df) / nrow(df) * 100
            #cancer CVD diab stroke asthma LTC_count
            #20 30 30 40 20 150





            share|improve this answer































              1














              Using dplyr this can be done in a single line, without gathering and whatnot:



              df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
              cancer CVD diab stroke asthma
              1 0.2 0.3 0.3 0.4 0.2


              If we want both the yes and the no frequencies:



              bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 
              "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")

              id cancer CVD diab stroke asthma
              1 Y 20 30 30 40 20
              2 N 80 70 70 60 80


              In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:



              df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 
              "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")

              df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
              select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]

              df1$condition<-gsub("_count","",df1$condition)

              condition count freq Y_N
              1 cancer 2 20 Y
              2 cancer 8 80 N
              3 CVD 3 30 Y
              4 CVD 7 70 N
              5 diab 3 30 Y
              6 diab 7 70 N
              7 stroke 4 40 Y
              8 stroke 6 60 N
              9 asthma 2 20 Y
              10 asthma 8 80 N





              share|improve this answer


























              • This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

                – Laura
                Nov 21 '18 at 3:32













              • I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

                – iod
                Nov 21 '18 at 4:13













              Your Answer






              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "1"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53403837%2fsimplify-code-for-getting-multiple-disease-proportions-in-the-population%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              3 Answers
              3






              active

              oldest

              votes








              3 Answers
              3






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              1














              With tidyverse we can use gather to collapse the dataframe into long format in key, value pairs then group_by them and count the ratio in each group.



              library(tidyverse)

              df %>%
              gather() %>%
              group_by(key, value) %>%
              summarise(freq = n()) %>%
              ungroup() %>%
              group_by(key) %>%
              mutate(freq = freq/sum(freq) * 100)


              # key value freq
              # <chr> <dbl> <dbl>
              # 1 CVD 0 70
              # 2 CVD 1 30
              # 3 asthma 0 80
              # 4 asthma 1 20
              # 5 cancer 0 80
              # 6 cancer 1 20
              # 7 diab 0 70
              # 8 diab 1 30
              # 9 stroke 0 60
              #10 stroke 1 40


              Note - I have ignored LTC_count column since that doesn't seem to be involved in the calculation.





              Or we can reduce some steps by using count as suggested by @Jake Kaupp



              df %>%
              gather() %>%
              count(key, value) %>%
              group_by(key) %>%
              mutate(n = n/sum(n) * 100)





              share|improve this answer






























                1














                With tidyverse we can use gather to collapse the dataframe into long format in key, value pairs then group_by them and count the ratio in each group.



                library(tidyverse)

                df %>%
                gather() %>%
                group_by(key, value) %>%
                summarise(freq = n()) %>%
                ungroup() %>%
                group_by(key) %>%
                mutate(freq = freq/sum(freq) * 100)


                # key value freq
                # <chr> <dbl> <dbl>
                # 1 CVD 0 70
                # 2 CVD 1 30
                # 3 asthma 0 80
                # 4 asthma 1 20
                # 5 cancer 0 80
                # 6 cancer 1 20
                # 7 diab 0 70
                # 8 diab 1 30
                # 9 stroke 0 60
                #10 stroke 1 40


                Note - I have ignored LTC_count column since that doesn't seem to be involved in the calculation.





                Or we can reduce some steps by using count as suggested by @Jake Kaupp



                df %>%
                gather() %>%
                count(key, value) %>%
                group_by(key) %>%
                mutate(n = n/sum(n) * 100)





                share|improve this answer




























                  1












                  1








                  1







                  With tidyverse we can use gather to collapse the dataframe into long format in key, value pairs then group_by them and count the ratio in each group.



                  library(tidyverse)

                  df %>%
                  gather() %>%
                  group_by(key, value) %>%
                  summarise(freq = n()) %>%
                  ungroup() %>%
                  group_by(key) %>%
                  mutate(freq = freq/sum(freq) * 100)


                  # key value freq
                  # <chr> <dbl> <dbl>
                  # 1 CVD 0 70
                  # 2 CVD 1 30
                  # 3 asthma 0 80
                  # 4 asthma 1 20
                  # 5 cancer 0 80
                  # 6 cancer 1 20
                  # 7 diab 0 70
                  # 8 diab 1 30
                  # 9 stroke 0 60
                  #10 stroke 1 40


                  Note - I have ignored LTC_count column since that doesn't seem to be involved in the calculation.





                  Or we can reduce some steps by using count as suggested by @Jake Kaupp



                  df %>%
                  gather() %>%
                  count(key, value) %>%
                  group_by(key) %>%
                  mutate(n = n/sum(n) * 100)





                  share|improve this answer















                  With tidyverse we can use gather to collapse the dataframe into long format in key, value pairs then group_by them and count the ratio in each group.



                  library(tidyverse)

                  df %>%
                  gather() %>%
                  group_by(key, value) %>%
                  summarise(freq = n()) %>%
                  ungroup() %>%
                  group_by(key) %>%
                  mutate(freq = freq/sum(freq) * 100)


                  # key value freq
                  # <chr> <dbl> <dbl>
                  # 1 CVD 0 70
                  # 2 CVD 1 30
                  # 3 asthma 0 80
                  # 4 asthma 1 20
                  # 5 cancer 0 80
                  # 6 cancer 1 20
                  # 7 diab 0 70
                  # 8 diab 1 30
                  # 9 stroke 0 60
                  #10 stroke 1 40


                  Note - I have ignored LTC_count column since that doesn't seem to be involved in the calculation.





                  Or we can reduce some steps by using count as suggested by @Jake Kaupp



                  df %>%
                  gather() %>%
                  count(key, value) %>%
                  group_by(key) %>%
                  mutate(n = n/sum(n) * 100)






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Nov 21 '18 at 1:29

























                  answered Nov 21 '18 at 1:08









                  Ronak ShahRonak Shah

                  37k104161




                  37k104161

























                      1














                      Percent population with a given disease:



                      colSums(df) / nrow(df) * 100
                      #cancer CVD diab stroke asthma LTC_count
                      #20 30 30 40 20 150





                      share|improve this answer




























                        1














                        Percent population with a given disease:



                        colSums(df) / nrow(df) * 100
                        #cancer CVD diab stroke asthma LTC_count
                        #20 30 30 40 20 150





                        share|improve this answer


























                          1












                          1








                          1







                          Percent population with a given disease:



                          colSums(df) / nrow(df) * 100
                          #cancer CVD diab stroke asthma LTC_count
                          #20 30 30 40 20 150





                          share|improve this answer













                          Percent population with a given disease:



                          colSums(df) / nrow(df) * 100
                          #cancer CVD diab stroke asthma LTC_count
                          #20 30 30 40 20 150






                          share|improve this answer












                          share|improve this answer



                          share|improve this answer










                          answered Nov 21 '18 at 0:58









                          12b345b6b7812b345b6b78

                          782115




                          782115























                              1














                              Using dplyr this can be done in a single line, without gathering and whatnot:



                              df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
                              cancer CVD diab stroke asthma
                              1 0.2 0.3 0.3 0.4 0.2


                              If we want both the yes and the no frequencies:



                              bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 
                              "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")

                              id cancer CVD diab stroke asthma
                              1 Y 20 30 30 40 20
                              2 N 80 70 70 60 80


                              In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:



                              df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 
                              "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")

                              df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
                              select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]

                              df1$condition<-gsub("_count","",df1$condition)

                              condition count freq Y_N
                              1 cancer 2 20 Y
                              2 cancer 8 80 N
                              3 CVD 3 30 Y
                              4 CVD 7 70 N
                              5 diab 3 30 Y
                              6 diab 7 70 N
                              7 stroke 4 40 Y
                              8 stroke 6 60 N
                              9 asthma 2 20 Y
                              10 asthma 8 80 N





                              share|improve this answer


























                              • This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

                                – Laura
                                Nov 21 '18 at 3:32













                              • I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

                                – iod
                                Nov 21 '18 at 4:13


















                              1














                              Using dplyr this can be done in a single line, without gathering and whatnot:



                              df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
                              cancer CVD diab stroke asthma
                              1 0.2 0.3 0.3 0.4 0.2


                              If we want both the yes and the no frequencies:



                              bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 
                              "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")

                              id cancer CVD diab stroke asthma
                              1 Y 20 30 30 40 20
                              2 N 80 70 70 60 80


                              In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:



                              df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 
                              "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")

                              df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
                              select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]

                              df1$condition<-gsub("_count","",df1$condition)

                              condition count freq Y_N
                              1 cancer 2 20 Y
                              2 cancer 8 80 N
                              3 CVD 3 30 Y
                              4 CVD 7 70 N
                              5 diab 3 30 Y
                              6 diab 7 70 N
                              7 stroke 4 40 Y
                              8 stroke 6 60 N
                              9 asthma 2 20 Y
                              10 asthma 8 80 N





                              share|improve this answer


























                              • This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

                                – Laura
                                Nov 21 '18 at 3:32













                              • I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

                                – iod
                                Nov 21 '18 at 4:13
















                              1












                              1








                              1







                              Using dplyr this can be done in a single line, without gathering and whatnot:



                              df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
                              cancer CVD diab stroke asthma
                              1 0.2 0.3 0.3 0.4 0.2


                              If we want both the yes and the no frequencies:



                              bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 
                              "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")

                              id cancer CVD diab stroke asthma
                              1 Y 20 30 30 40 20
                              2 N 80 70 70 60 80


                              In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:



                              df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 
                              "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")

                              df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
                              select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]

                              df1$condition<-gsub("_count","",df1$condition)

                              condition count freq Y_N
                              1 cancer 2 20 Y
                              2 cancer 8 80 N
                              3 CVD 3 30 Y
                              4 CVD 7 70 N
                              5 diab 3 30 Y
                              6 diab 7 70 N
                              7 stroke 4 40 Y
                              8 stroke 6 60 N
                              9 asthma 2 20 Y
                              10 asthma 8 80 N





                              share|improve this answer















                              Using dplyr this can be done in a single line, without gathering and whatnot:



                              df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
                              cancer CVD diab stroke asthma
                              1 0.2 0.3 0.3 0.4 0.2


                              If we want both the yes and the no frequencies:



                              bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 
                              "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")

                              id cancer CVD diab stroke asthma
                              1 Y 20 30 30 40 20
                              2 N 80 70 70 60 80


                              In response to your request for a long dataset, I could do the below, but frankly if you want this, you're better off with @Ronak's solultion:



                              df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 
                              "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")

                              df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
                              select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]

                              df1$condition<-gsub("_count","",df1$condition)

                              condition count freq Y_N
                              1 cancer 2 20 Y
                              2 cancer 8 80 N
                              3 CVD 3 30 Y
                              4 CVD 7 70 N
                              5 diab 3 30 Y
                              6 diab 7 70 N
                              7 stroke 4 40 Y
                              8 stroke 6 60 N
                              9 asthma 2 20 Y
                              10 asthma 8 80 N






                              share|improve this answer














                              share|improve this answer



                              share|improve this answer








                              edited Nov 21 '18 at 4:12

























                              answered Nov 21 '18 at 2:53









                              iodiod

                              3,8232722




                              3,8232722













                              • This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

                                – Laura
                                Nov 21 '18 at 3:32













                              • I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

                                – iod
                                Nov 21 '18 at 4:13





















                              • This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

                                – Laura
                                Nov 21 '18 at 3:32













                              • I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

                                – iod
                                Nov 21 '18 at 4:13



















                              This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

                              – Laura
                              Nov 21 '18 at 3:32







                              This is amazing, thank you. This works great. But is it possible to have a long dataset whereby there is one column called "condition", another called "count", another called "freq", and the final one "Y_N"?

                              – Laura
                              Nov 21 '18 at 3:32















                              I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

                              – iod
                              Nov 21 '18 at 4:13







                              I created a solution based on this fulfilling your request, but frankly at this point @Ronak's solution makes more sense.

                              – iod
                              Nov 21 '18 at 4:13




















                              draft saved

                              draft discarded




















































                              Thanks for contributing an answer to Stack Overflow!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53403837%2fsimplify-code-for-getting-multiple-disease-proportions-in-the-population%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              MongoDB - Not Authorized To Execute Command

                              in spring boot 2.1 many test slices are not allowed anymore due to multiple @BootstrapWith

                              Npm cannot find a required file even through it is in the searched directory