How does numpy's memmap copy-on-write mode work?

I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode='c'). Since nothing is written to the original array on disk, I expected it to have to store all changes in memory, and thus to run out of memory if you modify every single element. To my surprise, it didn't.



I am trying to reduce the memory usage of my machine learning scripts, which I run on a shared cluster (the less memory each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 GB). My hope is to use np.memmap to work with these arrays with little memory (< 4 GB available).



However, each instance might modify the data differently (e.g. it might choose to normalize the input data differently each time). This has implications for storage space: if I use r+ mode, then normalizing the array in my script will permanently change the stored array.
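To make this concrete, here is a small-scale sketch (the tiny file name demo_rplus.npy is just for illustration) showing that writes through an r+ memmap do persist to the file on disk:

```python
import numpy as np

# Save a small array to disk (stand-in for the real 8 GB data).
np.save('demo_rplus.npy', np.zeros(4, dtype='float32'))

# Open it with writeback enabled and "normalize" it in place.
m = np.load('demo_rplus.npy', mmap_mode='r+')
m[:] = 1.0
m.flush()
del m

# The stored file has been permanently changed.
print(np.load('demo_rplus.npy'))  # [1. 1. 1. 1.]
```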



Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use 'c' mode (copy-on-write) to open the arrays. But then where do my changes go? Are they kept only in memory? If so, won't changing the whole array run me out of memory on a small-memory system?
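For reference, a tiny sketch (file name demo_cow.npy is just for illustration) confirming the basic copy-on-write semantics: c-mode writes are visible to the process but never reach the file:

```python
import numpy as np

np.save('demo_cow.npy', np.zeros(4, dtype='float32'))

c = np.load('demo_cow.npy', mmap_mode='c')
c[:] = 7.0                      # modify every element
print(c)                        # [7. 7. 7. 7.]  -- visible in this process
del c

print(np.load('demo_cow.npy'))  # [0. 0. 0. 0.]  -- file on disk untouched
```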



Here's an example of a test that I expected to fail:



On a large-memory system, create the array:



import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
nbytes = a.size * a.itemsize  # same as a.nbytes
print('{} GB'.format(nbytes / GB))
print('{} GiB'.format(nbytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB


Now, on a machine with just 2 GB of memory, this fails as expected:



a = np.load('a.npy')


But both of these succeed, as expected:



a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')


Issue 1: I run out of memory running this code, which modifies the memmapped array (it fails regardless of r+/c mode):



for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i, :] = i * np.arange(a.shape[1])


Why does this fail (and especially, why does it fail even in r+ mode, where it can write to disk)? I thought memmap only loaded pieces of the array into memory at a time?



Issue 2: When I force numpy to flush the changes every once in a while, both r+ and c mode successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything in c mode. The changes aren't written to disk, so they must be kept in memory, and yet all the changes, which must total over 3 GB, don't cause out-of-memory errors. How?



for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i, :] = i * np.arange(a.shape[1])
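As a small-scale check of what flush() does in c mode (tiny file demo_flush.npy, for illustration), the disk file appears to stay untouched even after an explicit flush:

```python
import numpy as np

np.save('demo_flush.npy', np.zeros(4, dtype='float32'))

c = np.load('demo_flush.npy', mmap_mode='c')
c[:] = 3.0
c.flush()   # for a copy-on-write mapping this does not appear to write the file
del c

print(np.load('demo_flush.npy'))  # still all zeros
```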

python numpy

asked Jan 2 at 21:53 by Amir (edited Jan 3 at 3:00)