How does numpy's memmap copy-on-write mode work?

I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode='c'). Since nothing is written to the original array on disk, I expected it to have to store all changes in memory, and thus to run out of memory if you modify every single element. To my surprise, it didn't.



I am trying to reduce the memory usage of my machine learning scripts, which I run on a shared cluster (the less memory each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 GB). My hope is to use np.memmap to work with these arrays with little memory (< 4 GB available).



However, each instance might modify the data differently (e.g. it might choose to normalize the input data differently each time). This has implications for storage space: if I use r+ mode, then normalizing the array in my script will permanently change the stored array.
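To make this concrete, here is a small-scale sketch (the tiny file name demo_rplus.npy is just for illustration) showing that writes through an r+ memmap do persist to the file on disk:

```python
import numpy as np

# Save a small array to disk (stand-in for the real 8 GB data).
np.save('demo_rplus.npy', np.zeros(4, dtype='float32'))

# Open it with writeback enabled and "normalize" it in place.
m = np.load('demo_rplus.npy', mmap_mode='r+')
m[:] = 1.0
m.flush()
del m

# The stored file has been permanently changed.
print(np.load('demo_rplus.npy'))  # [1. 1. 1. 1.]
```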



Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use 'c' mode (copy-on-write) to open the arrays. But then where do my changes go? Are they kept only in memory? If so, won't changing the whole array run me out of memory on a small-memory system?
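For reference, a tiny sketch (file name demo_cow.npy is just for illustration) confirming the basic copy-on-write semantics: c-mode writes are visible to the process but never reach the file:

```python
import numpy as np

np.save('demo_cow.npy', np.zeros(4, dtype='float32'))

c = np.load('demo_cow.npy', mmap_mode='c')
c[:] = 7.0                      # modify every element
print(c)                        # [7. 7. 7. 7.]  -- visible in this process
del c

print(np.load('demo_cow.npy'))  # [0. 0. 0. 0.]  -- file on disk untouched
```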



Here's an example of a test that I expected to fail:



On a large-memory system, create the array:



import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
nbytes = a.size * a.itemsize  # same as a.nbytes
print('{} GB'.format(nbytes / GB))
print('{} GiB'.format(nbytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB


Now, on a machine with just 2 GB of memory, this fails as expected:



a = np.load('a.npy')


But both of these succeed, as expected:



a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')


Issue 1: I run out of memory running this code, which modifies the memmapped array (it fails regardless of r+/c mode):



for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i, :] = i * np.arange(a.shape[1])


Why does this fail (and especially, why does it fail even in r+ mode, where it can write to disk)? I thought memmap only loaded pieces of the array into memory at a time?



Issue 2: When I force numpy to flush the changes every once in a while, both r+ and c mode successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything in c mode. The changes aren't written to disk, so they must be kept in memory, and yet all the changes, which must total over 3 GB, don't cause out-of-memory errors. How?



for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i, :] = i * np.arange(a.shape[1])
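As a small-scale check of what flush() does in c mode (tiny file demo_flush.npy, for illustration), the disk file appears to stay untouched even after an explicit flush:

```python
import numpy as np

np.save('demo_flush.npy', np.zeros(4, dtype='float32'))

c = np.load('demo_flush.npy', mmap_mode='c')
c[:] = 3.0
c.flush()   # for a copy-on-write mapping this does not appear to write the file
del c

print(np.load('demo_flush.npy'))  # still all zeros
```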

python numpy

asked Jan 2 at 21:53 by Amir (edited Jan 3 at 3:00)