Object size increases hugely when transposing a data frame
I have a data frame of ca. 50,000 RNA transcripts in rows, with 10,000 different samples in columns. The size of the data frame is 4.9GB.



I then have to transpose the data in order to subset it properly later:



df <- data.frame(t(df))



After the transpose, the object size has ballooned to 70GB. Why is this happening? Should transposing the data really change the object size that much?



str() of the first 20 columns:



str(df[1:20])
Classes 'tbl_df', 'tbl' and 'data.frame': 56202 obs. of 20 variables:
$ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "ENSG00000223972.4" "ENSG00000227232.4" "ENSG00000243485.2" "ENSG00000237613.2" ...
$ Description : chr "DDX11L1" "WASH7P" "MIR1302-11" "FAM138A" ...
$ GTEX-1117F-0226-SM-5GZZ7: num 0.1082 21.4 0.1602 0.0505 0 ...
$ GTEX-111CU-1826-SM-5GZYN: num 0.1158 11.03 0.0643 0 0 ...
$ GTEX-111FC-0226-SM-5N9B8: num 0.021 16.75 0.0467 0.0295 0 ...
$ GTEX-111VG-2326-SM-5N9BK: num 0.0233 8.172 0 0.0326 0 ...
$ GTEX-111YS-2426-SM-5GZZQ: num 0 7.658 0.0586 0 0 ...
$ GTEX-1122O-2026-SM-5NQ91: num 0.0464 9.372 0 0 0 ...
$ GTEX-1128S-2126-SM-5H12U: num 0.0308 10.08 0.1367 0.0861 0.1108 ...
$ GTEX-113IC-0226-SM-5HL5C: num 0.0936 13.56 0.2079 0.131 0.0562 ...
$ GTEX-117YX-2226-SM-5EGJJ: num 0.121 9.889 0.0537 0.0677 0 ...
$ GTEX-11DXW-0326-SM-5H11W: num 0.0286 9.121 0.0635 0 0 ...
$ GTEX-11DXX-2326-SM-5Q5A2: num 0 6.698 0.0508 0.032 0 ...
$ GTEX-11DZ1-0226-SM-5A5KF: num 0.0237 9.835 0 0.0664 0 ...
$ GTEX-11EI6-0226-SM-5EQ64: num 0.0802 13.1 0 0 0 ...
$ GTEX-11EM3-2326-SM-5H12B: num 0.0223 8.904 0.0496 0.0625 0.0402 ...
$ GTEX-11EMC-2826-SM-5PNY6: num 0.0189 16.59 0 0.0265 0.034 ...
$ GTEX-11EQ8-0226-SM-5EQ5G: num 0.0931 15.1 0.0689 0.0869 0 ...
$ GTEX-11EQ9-2526-SM-5HL66: num 0.0777 9.838 0 0 0 ...
      r dataframe memory transpose

      asked Nov 21 '18 at 9:47 by Phil D · edited Nov 21 '18 at 18:11 by Henrik

          1 Answer
          First, you write that:




          I then have to transpose this dataset in order to subset it properly later,




          To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think it could be of general interest to dissect the issue.





          The increase in object size is most likely due to the fact that the class of the object changes when it is transposed, combined with the fact that objects of different classes take different amounts of memory per element.



          I will try to illustrate this with some examples. We begin with the change of class.



          Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:



          # set number of rows and columns
          nr <- 5
          nc <- 5

          set.seed(1)
          d <- data.frame(x = sample(letters, nr, replace = TRUE),
                          y = sample(letters, nr, replace = TRUE),
                          matrix(runif(nr * nc), nrow = nr),
                          stringsAsFactors = FALSE)


          Transpose it:



          d_t <- t(d)


          Check the structure of the original data and its transposed sibling:



          str(d)
          # 'data.frame': 5 obs. of 7 variables:
          # $ x : chr "g" "j" "o" "x" ...
          # $ y : chr "x" "y" "r" "q" ...
          # $ X1: num 0.206 0.177 0.687 0.384 0.77
          # $ X2: num 0.498 0.718 0.992 0.38 0.777
          # $ X3: num 0.935 0.212 0.652 0.126 0.267
          # $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
          # $ X5: num 0.482 0.6 0.494 0.186 0.827

          str(d_t)
          # chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
          # - attr(*, "dimnames")=List of 2
          # ..$ : chr [1:7] "x" "y" "X1" "X2" ...
          # ..$ : NULL


          The data frame has become a character matrix. How did this happen? Check the help text for the transpose method for data frames, ?t.data.frame:




          A data frame is first coerced to a matrix: see as.matrix.




          OK, see ?as.matrix:




          The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]




          Whereas a data frame is a list where each column can be of a different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of the transpose. You then coerce the matrix back to a data frame, where all columns are character (or factor, depending on your stringsAsFactors setting) - check str(data.frame(d_t)).
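          To see that end state concretely, here is a minimal check, re-creating the toy data from above (stringsAsFactors = FALSE is forced throughout so the result is the same on older R versions): coercing the transposed matrix back to a data frame leaves every column as character.

```r
set.seed(1)
nr <- 5; nc <- 5

# same toy data frame as above: two character columns, five numeric
d <- data.frame(x = sample(letters, nr, replace = TRUE),
                y = sample(letters, nr, replace = TRUE),
                matrix(runif(nr * nc), nrow = nr),
                stringsAsFactors = FALSE)

# transpose (-> character matrix), then coerce back to a data frame
d_t_df <- data.frame(t(d), stringsAsFactors = FALSE)

unique(sapply(d_t_df, class))
# "character" - the numeric values are now strings in every column
```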





          As a second step, compare the sizes of the different objects. Start with the data frame and its transpose, as created above:



          # original data frame
          object.size(d)
          # 2360 bytes

          # transposed df - a character matrix
          object.size(d_t)
          # 3280 bytes


          The transposed object is clearly larger. If we increase the number of rows and the number of numeric columns to mimic your data better (re-running the code above with the new nr and nc), the relative difference is even larger:



          nr <- 56202
          nc <- 20

          object.size(d)
          # 9897712 bytes
          object.size(d_t)
          # 78299656 bytes




          Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:



          onedigit_int <- sample(1:9, 1e4, replace = TRUE)
          onedigit_num <- as.numeric(onedigit_int)
          onedigit_char <- as.character(onedigit_int)

          object.size(onedigit_int)
          # 40048 bytes

          object.size(onedigit_num)
          # 80048 bytes

          object.size(onedigit_char)
          # 80552 bytes


          For single digits/characters, integer vectors occupy 4 bytes per element, while numeric and character vectors occupy 8 bytes per element; the single-character vector requires no more memory than the numeric vector. Does this mean we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Not yet - we need to check what happens with multi-digit values (which you seem to have) and their corresponding multi-character strings:



          multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
          multidigit_num <- as.numeric(multidigit_int)
          multidigit_char <- as.character(multidigit_int)

          object.size(multidigit_int)
          # 40048 bytes

          object.size(multidigit_num)
          # 80048 bytes

          object.size(multidigit_char)
          # 637360 bytes


          The integer vector still occupies 4 bytes per element and the numeric vector still 8 bytes per element. The character vector, however, grows with string length: each element holds a pointer plus the string data itself, so longer strings cost more memory per element.



          Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.



          Transposing a data frame with columns of different classes is very rarely sensible. And if all columns are of the same class, we might as well use a matrix from the start.
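          A minimal sketch of that advice, using the toy data from above (the ID columns x and y here stand in for ID columns like your Name/Description): keep the identifiers aside and transpose only the numeric part as a matrix, so no coercion to character ever happens.

```r
set.seed(1)
nr <- 5; nc <- 5
d <- data.frame(x = sample(letters, nr, replace = TRUE),
                y = sample(letters, nr, replace = TRUE),
                matrix(runif(nr * nc), nrow = nr),
                stringsAsFactors = FALSE)

m <- as.matrix(d[, -(1:2)])   # numeric columns only
rownames(m) <- d$x            # carry the IDs along as row names
m_t <- t(m)                   # transpose of a numeric matrix

storage.mode(m_t)             # still "double": no character coercion
```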





          Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham.

          • Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.

            – Phil D
            Nov 21 '18 at 17:08








          • Hi @PhilD Thanks for your feedback. Regarding your need (or not ;) ) for the transpose, without seeing your data and a more thorough description of what you are trying to achieve, I can only guess (hence my "may"). Still, if you allow me to guess, an alternative you may consider is to melt your data from wide to long. Then merge/subset. Cheers

            – Henrik
            Nov 21 '18 at 17:43
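
          The melt suggestion from the comment above can be sketched in base R (the gene and sample names below are made up, not the real GTEx headers): once the data are in long format, subsetting by sample is a plain filter and no transpose is needed.

```r
# toy wide table: one row per gene, one column per sample
wide <- data.frame(Name  = c("g1", "g2"),
                   sampA = c(0.10, 21.4),
                   sampB = c(0.12, 11.0))

# reshape wide -> long with base R
long <- reshape(wide, direction = "long",
                varying = c("sampA", "sampB"),
                v.names = "tpm", timevar = "sample",
                times   = c("sampA", "sampB"), idvar = "Name")

subset(long, sample == "sampA")   # one row per gene, no transpose
```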











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53409246%2fobject-size-increases-hugely-when-transposing-a-data-frame%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          5














          First, you write that:




          I then have to transpose this dataset in order to subset it properly later,




          To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think could be of general interest to dissect the issue.





          The increase in object size is most likely due to that the class of the object before and after transposing has changed, together with the fact that objects of different class have different size.



          I will try to illustrate this with some examples. We begin with the change of class.



          Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:



          # set number of rows and columns
          nr <- 5
          nc <- 5

          set.seed(1)
          d <- data.frame(x = sample(letters, nr, replace = TRUE),
          y = sample(letters, nr, replace = TRUE),
          matrix(runif(nr * nc), nrow = nr),
          stringsAsFactors = FALSE)


          Transpose it:



          d_t <- t(d)


          Check the structure of the original data and its transposed sibling:



          str(d)
          # 'data.frame': 5 obs. of 7 variables:
          # $ x : chr "g" "j" "o" "x" ...
          # $ y : chr "x" "y" "r" "q" ...
          # $ X1: num 0.206 0.177 0.687 0.384 0.77
          # $ X2: num 0.498 0.718 0.992 0.38 0.777
          # $ X3: num 0.935 0.212 0.652 0.126 0.267
          # $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
          # $ X5: num 0.482 0.6 0.494 0.186 0.827

          str(d_t)
          # chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
          # - attr(*, "dimnames")=List of 2
          # ..$ : chr [1:7] "x" "y" "X1" "X2" ...
          # ..$ : NULL


          The data frame has became a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame:




          A data frame is first coerced to a matrix: see as.matrix.




          OK, see ?as.matrix:




          The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]




          Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of transpose. Then you coerce the matrix to data frame, where all columns are character (or factor, depending on your stringsAsFactors setting) - check str(data.frame(d_t)).





          In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:



          # original data frame
          object.size(d)
          # 2360 bytes

          # transposed df - a character matrix
          object.size(d_t)
          # 3280 bytes


          The transposed object is clearly larger. If we increase the number rows and the number of numeric columns to mimic your data better, the relative difference is even larger:



          nr <- 56202
          nc <- 20

          object.size(d)
          # 9897712 bytes
          object.size(d_t)
          # 78299656 bytes




          Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:



          onedigit_int <- sample(1:9, 1e4, replace = TRUE)
          onedigit_num <- as.numeric(onedigit_int)
          onedigit_char <- as.character(onedigit_int)

          object.size(onedigit_int)
          # 40048 bytes

          object.size(onedigit_num)
          # 80048 bytes

          object.size(onedigit_char)
          # 80552 bytes


          For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors with multi-digits (which you seem to have) and their corresponding vectors of multi-character strings:



          multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
          multidigit_num <- as.numeric(multidigit_int)
          multidigit_char <- as.character(multidigit_int)

          object.size(multidigit_int)
          # 40048 bytes

          object.size(multidigit_num)
          # 80048 bytes

          object.size(multidigit_char)
          # 637360 bytes


          The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.



          Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.



          Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.





          Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham






          share|improve this answer


























          • Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.

            – Phil D
            Nov 21 '18 at 17:08








          • 1





            Hi @PhilD Thanks for your feedback. Regarding your need (or not ;) ) for transpose, without seeing your data and a more thorough description on what you try to aceive, I can only guess (hence my "may"). Still, if you allow me to guess, it sounds that an alternative you may consider is to melt your data from wide to long. Then merge/subset. Cheers

            – Henrik
            Nov 21 '18 at 17:43
















          5














          First, you write that:




          I then have to transpose this dataset in order to subset it properly later,




          To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think could be of general interest to dissect the issue.





          The increase in object size is most likely due to that the class of the object before and after transposing has changed, together with the fact that objects of different class have different size.



          I will try to illustrate this with some examples. We begin with the change of class.



          Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:



          # set number of rows and columns
          nr <- 5
          nc <- 5

          set.seed(1)
          d <- data.frame(x = sample(letters, nr, replace = TRUE),
          y = sample(letters, nr, replace = TRUE),
          matrix(runif(nr * nc), nrow = nr),
          stringsAsFactors = FALSE)


          Transpose it:



          d_t <- t(d)


          Check the structure of the original data and its transposed sibling:



          str(d)
          # 'data.frame': 5 obs. of 7 variables:
          # $ x : chr "g" "j" "o" "x" ...
          # $ y : chr "x" "y" "r" "q" ...
          # $ X1: num 0.206 0.177 0.687 0.384 0.77
          # $ X2: num 0.498 0.718 0.992 0.38 0.777
          # $ X3: num 0.935 0.212 0.652 0.126 0.267
          # $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
          # $ X5: num 0.482 0.6 0.494 0.186 0.827

          str(d_t)
          # chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
          # - attr(*, "dimnames")=List of 2
          # ..$ : chr [1:7] "x" "y" "X1" "X2" ...
          # ..$ : NULL


          The data frame has became a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame:




          A data frame is first coerced to a matrix: see as.matrix.




          OK, see ?as.matrix:




          The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]




          Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of transpose. Then you coerce the matrix to data frame, where all columns are character (or factor, depending on your stringsAsFactors setting) - check str(data.frame(d_t)).





          In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:



          # original data frame
          object.size(d)
          # 2360 bytes

          # transposed df - a character matrix
          object.size(d_t)
          # 3280 bytes


          The transposed object is clearly larger. If we increase the number rows and the number of numeric columns to mimic your data better, the relative difference is even larger:



          nr <- 56202
          nc <- 20

          object.size(d)
          # 9897712 bytes
          object.size(d_t)
          # 78299656 bytes




          Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:



          onedigit_int <- sample(1:9, 1e4, replace = TRUE)
          onedigit_num <- as.numeric(onedigit_int)
          onedigit_char <- as.character(onedigit_int)

          object.size(onedigit_int)
          # 40048 bytes

          object.size(onedigit_num)
          # 80048 bytes

          object.size(onedigit_char)
          # 80552 bytes


          For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors with multi-digits (which you seem to have) and their corresponding vectors of multi-character strings:



          multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
          multidigit_num <- as.numeric(multidigit_int)
          multidigit_char <- as.character(multidigit_int)

          object.size(multidigit_int)
          # 40048 bytes

          object.size(multidigit_num)
          # 80048 bytes

          object.size(multidigit_char)
          # 637360 bytes


          The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.



          Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.



          Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.





          Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham






          share|improve this answer


























          • Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.

            – Phil D
            Nov 21 '18 at 17:08








          • 1





            Hi @PhilD Thanks for your feedback. Regarding your need (or not ;) ) for transpose, without seeing your data and a more thorough description on what you try to aceive, I can only guess (hence my "may"). Still, if you allow me to guess, it sounds that an alternative you may consider is to melt your data from wide to long. Then merge/subset. Cheers

            – Henrik
            Nov 21 '18 at 17:43














          5












          5








          5







          First, you write that:




          I then have to transpose this dataset in order to subset it properly later,




          To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think could be of general interest to dissect the issue.





          The increase in object size is most likely due to that the class of the object before and after transposing has changed, together with the fact that objects of different class have different size.



          I will try to illustrate this with some examples. We begin with the change of class.



          Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:



          # set number of rows and columns
          nr <- 5
          nc <- 5

          set.seed(1)
          d <- data.frame(x = sample(letters, nr, replace = TRUE),
          y = sample(letters, nr, replace = TRUE),
          matrix(runif(nr * nc), nrow = nr),
          stringsAsFactors = FALSE)


          Transpose it:



          d_t <- t(d)


          Check the structure of the original data and its transposed sibling:



          str(d)
          # 'data.frame': 5 obs. of 7 variables:
          # $ x : chr "g" "j" "o" "x" ...
          # $ y : chr "x" "y" "r" "q" ...
          # $ X1: num 0.206 0.177 0.687 0.384 0.77
          # $ X2: num 0.498 0.718 0.992 0.38 0.777
          # $ X3: num 0.935 0.212 0.652 0.126 0.267
          # $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
          # $ X5: num 0.482 0.6 0.494 0.186 0.827

          str(d_t)
          # chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
          # - attr(*, "dimnames")=List of 2
          # ..$ : chr [1:7] "x" "y" "X1" "X2" ...
          # ..$ : NULL


          The data frame has become a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame:




          A data frame is first coerced to a matrix: see as.matrix.




          OK, see ?as.matrix:




          The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]




          Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of transpose. Then you coerce the matrix to data frame, where all columns are character (or factor, depending on your stringsAsFactors setting) - check str(data.frame(d_t)).
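          If the transposed orientation really is needed, one way to avoid the coercion is to set the character columns aside and transpose only the numeric block. A sketch, assuming (as in the str() output in the question) that the leading columns are character IDs:

```r
# Toy data with two character ID columns followed by numeric columns
nr <- 5
set.seed(1)
d <- data.frame(x = sample(letters, nr, replace = TRUE),
                y = sample(letters, nr, replace = TRUE),
                matrix(runif(nr * 5), nrow = nr),
                stringsAsFactors = FALSE)

# Drop the character columns before transposing, so as.matrix
# produces a numeric matrix and no coercion to character occurs
m <- t(as.matrix(d[-(1:2)]))
colnames(m) <- d$x  # carry the row IDs along as column names

is.numeric(m)  # TRUE
```

          The result stays a compact numeric matrix; the IDs live in the dimnames instead of in a character column.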





          In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:



          # original data frame
          object.size(d)
          # 2360 bytes

          # transposed df - a character matrix
          object.size(d_t)
          # 3280 bytes


          The transposed object is clearly larger. If we increase the number of rows and the number of numeric columns to mimic your data better, the relative difference is even larger:



          nr <- 56202
          nc <- 20

          object.size(d)
          # 9897712 bytes
          object.size(d_t)
          # 78299656 bytes




          Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:



          onedigit_int <- sample(1:9, 1e4, replace = TRUE)
          onedigit_num <- as.numeric(onedigit_int)
          onedigit_char <- as.character(onedigit_int)

          object.size(onedigit_int)
          # 40048 bytes

          object.size(onedigit_num)
          # 80048 bytes

          object.size(onedigit_char)
          # 80552 bytes


          For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors of multi-digit values (which you seem to have) and their corresponding vectors of multi-character strings:



          multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
          multidigit_num <- as.numeric(multidigit_int)
          multidigit_char <- as.character(multidigit_int)

          object.size(multidigit_int)
          # 40048 bytes

          object.size(multidigit_num)
          # 80048 bytes

          object.size(multidigit_char)
          # 637360 bytes


          The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.
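          A sketch of why: a character vector stores an 8-byte pointer per element into R's global string cache, so repeating one string costs little extra, while many distinct strings (as produced by coercing many distinct numbers) each add their own cached copy:

```r
# Same length, same element width, very different memory footprints
x_same   <- rep("1234567", 1e4)              # one cached string, 1e4 pointers
x_unique <- as.character(sample(1:1e6, 1e4)) # ~1e4 distinct cached strings

object.size(x_same)    # close to a numeric vector of the same length
object.size(x_unique)  # much larger: one cache entry per unique string
```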



          Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.



          Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.





          Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham














          edited Nov 22 '18 at 10:22

























          answered Nov 21 '18 at 15:13









          Henrik













          • Thanks for the reply! And you were right in that it was the object type that was the problem. I did actually manage to get the size down myself as I was wrangling the data all afternoon and converting it back to numbers and factors, but your explanation has helped me understand WHY this makes such a difference. You said that you didn't think I would need to transpose it in the first place, but I was doing this as I wanted to merge in with another file in order to subset on tissue type. The only way I know how to subset needs the samples in rows and genes in columns, hence the transpose.

            – Phil D
            Nov 21 '18 at 17:08













            Hi @PhilD Thanks for your feedback. Regarding your need (or not ;) ) for transpose, without seeing your data and a more thorough description of what you try to achieve, I can only guess (hence my "may"). Still, if you allow me to guess, it sounds like an alternative you may consider is to melt your data from wide to long. Then merge/subset. Cheers

            – Henrik
            Nov 21 '18 at 17:43


















