Memory Leak Merging 2 Pandas Dataframe












I am not the first one to ask about this issue, but my problem seems a bit different from what I have seen. Basically, I am trying to merge a dataframe A.shape = [900k, 6] with B.shape = [600k, 6], and I get a memory error with pd.merge(A, B, how='left', left_on=['col_1', 'col_2'], right_on=['col_1', 'col_2']). (I have 5 GB of RAM available for the computation.)



To try to circumvent it, I tried to do the merge iteratively, as below. For the first 6 iterations the computation time only increases slightly from one iteration to the next, but at the 7th one memory usage explodes:



import gc

import pandas as pd
from tqdm import tqdm

def partial_merge(A, partial_B):
    return pd.merge(A, partial_B,
                    how='left',
                    left_on=['col_1', 'col_2'],
                    right_on=['col_1', 'col_2'])

# Merge B into A in 10k-row slices, forcing a garbage collection after each one.
low_bound = 0
for high_bound in tqdm(range(0, merged_stats.shape[0], 10000)):
    A = partial_merge(A, B.iloc[low_bound:high_bound, :])
    gc.collect()
    low_bound = high_bound


Any ideas?
I am reaching the point where I am going to write these 2 dataframes to a database and do the join there.
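Before splitting the merge, it can be worth shrinking the operands themselves, since pandas defaults to 64-bit dtypes. Below is a minimal sketch of that idea using randomly generated stand-ins for A and B (the column names `a_val` and `b_val` and the data are hypothetical, not the asker's real frames): downcast the numeric columns, then do the merge in one shot.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for A (900k rows) and B (600k rows, unique key pairs).
rng = np.random.default_rng(0)
A = pd.DataFrame({
    "col_1": rng.integers(0, 1000, 900_000),
    "col_2": rng.integers(0, 1000, 900_000),
    "a_val": rng.random(900_000),
})
B = pd.DataFrame({
    "col_1": rng.integers(0, 1000, 600_000),
    "col_2": rng.integers(0, 1000, 600_000),
    "b_val": rng.random(600_000),
}).drop_duplicates(["col_1", "col_2"])

# Downcast every integer column to the smallest dtype that can hold it,
# shrinking both frames (and the merge's intermediate hash table) before merging.
for df in (A, B):
    for col in df.select_dtypes("integer"):
        df[col] = pd.to_numeric(df[col], downcast="integer")

merged = A.merge(B, how="left", on=["col_1", "col_2"])
print(merged.shape)
```

With B unique on the key pair, the left merge keeps exactly one row per row of A, so the output is no larger than A plus B's value columns.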










  • What type of merge do you have? If this is a many:many merge, then 5 GB of RAM might not be enough to hold everything in memory, depending upon the duplication. If it's 1:1, then perhaps just concat?

    – ALollz
    Jan 2 at 17:57











  • On the A dataframe, I can have multiple occurrences of the same ['col_1', 'col_2'] pair, while on the B dataframe, I have only one of each. The concat would otherwise have been a good idea that I did not even think of.

    – Yohan Obadia
    Jan 2 at 18:42
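Since B carries each ['col_1', 'col_2'] pair at most once, this left merge is effectively a many-to-one lookup. A small sketch with toy stand-in frames (all values hypothetical), using merge's `validate` option to make pandas enforce that assumption rather than silently exploding the row count:

```python
import pandas as pd

# Toy frames: A may repeat key pairs; B holds each pair at most once.
A = pd.DataFrame({"col_1": [1, 1, 2], "col_2": [10, 10, 20], "a_val": [0.1, 0.2, 0.3]})
B = pd.DataFrame({"col_1": [1, 2], "col_2": [10, 20], "b_val": ["x", "y"]})

# validate="m:1" raises MergeError if B unexpectedly contains duplicate key
# pairs, which is the usual cause of a merge blowing up in memory.
out = A.merge(B, how="left", on=["col_1", "col_2"], validate="m:1")
print(out)
```

If `validate` ever raises here, the merge was not the many-to-one lookup it was assumed to be, and the memory blow-up has a data explanation rather than a pandas one.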











  • I can add that I tried to run the join inside an SQLite database, outside of Python, and got the result in 30 seconds without any memory explosion. However, when I wrote a pure sqlite3 solution doing the same query, I got the memory error again.

    – Yohan Obadia
    Jan 2 at 18:44
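One way to drive that SQLite join from pandas is sketched below, with toy stand-in frames (the table and file names are made up). Writing both frames to an on-disk database keeps the join itself out of Python's memory, and reading the result back with `chunksize` avoids materialising the whole joined frame in the cursor at once:

```python
import sqlite3
import pandas as pd

# Toy stand-ins for the real frames.
A = pd.DataFrame({"col_1": [1, 1, 2], "col_2": [10, 10, 20], "a_val": [0.1, 0.2, 0.3]})
B = pd.DataFrame({"col_1": [1, 2], "col_2": [10, 20], "b_val": ["x", "y"]})

con = sqlite3.connect("join_scratch.db")  # on-disk DB so the join spills to disk
A.to_sql("a", con, index=False, if_exists="replace")
B.to_sql("b", con, index=False, if_exists="replace")
con.execute("CREATE INDEX IF NOT EXISTS idx_b ON b (col_1, col_2)")

query = """
SELECT a.*, b.b_val
FROM a
LEFT JOIN b ON a.col_1 = b.col_1 AND a.col_2 = b.col_2
"""
# Stream the result back in chunks, then reassemble.
chunks = pd.read_sql(query, con, chunksize=100_000)
result = pd.concat(chunks, ignore_index=True)
con.close()
print(result.shape)
```

The index on b's key columns is what keeps the join itself fast; without it, SQLite falls back to a nested-loop scan.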


















pandas merge memory-leaks






asked Jan 2 at 17:53









Yohan Obadia












