Memory Leak Merging 2 Pandas Dataframe












I am not the first one to ask about this issue, but my problem seems a bit different from what I have seen. Basically, I am trying to merge a dataframe A.shape = [900k, 6] with B.shape = [600k, 6], and I get a memory error with pd.merge(A, B, how='left', left_on=['col_1', 'col_2'], right_on=['col_1', 'col_2']). (I have 5 GB of RAM available for the computation.)



To try to circumvent it, I tried to do the merge iteratively, as below. For the first 6 iterations the computation time only increases slightly from one iteration to the next, but at the 7th one memory usage explodes:



import gc

import pandas as pd
from tqdm import tqdm

def partial_merge(A, partial_B):
    return pd.merge(A, partial_B,
                    how='left',
                    left_on=['col_1', 'col_2'],
                    right_on=['col_1', 'col_2'])

# Merge B into A in 10k-row slices, forcing a garbage collection after each one.
low_bound = 0
for high_bound in tqdm(range(0, merged_stats.shape[0], 10000)):
    A = partial_merge(A, B.iloc[low_bound:high_bound, :])
    gc.collect()
    low_bound = high_bound


Any ideas?
I am reaching the point where I am going to write these 2 dataframes to a database and do the join there.
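Before splitting the merge, it can be worth shrinking the operands themselves, since pandas defaults to 64-bit dtypes. Below is a minimal sketch of that idea using randomly generated stand-ins for A and B (the column names `a_val` and `b_val` and the data are hypothetical, not the asker's real frames): downcast the numeric columns, then do the merge in one shot.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for A (900k rows) and B (600k rows, unique key pairs).
rng = np.random.default_rng(0)
A = pd.DataFrame({
    "col_1": rng.integers(0, 1000, 900_000),
    "col_2": rng.integers(0, 1000, 900_000),
    "a_val": rng.random(900_000),
})
B = pd.DataFrame({
    "col_1": rng.integers(0, 1000, 600_000),
    "col_2": rng.integers(0, 1000, 600_000),
    "b_val": rng.random(600_000),
}).drop_duplicates(["col_1", "col_2"])

# Downcast every integer column to the smallest dtype that can hold it,
# shrinking both frames (and the merge's intermediate hash table) before merging.
for df in (A, B):
    for col in df.select_dtypes("integer"):
        df[col] = pd.to_numeric(df[col], downcast="integer")

merged = A.merge(B, how="left", on=["col_1", "col_2"])
print(merged.shape)
```

With B unique on the key pair, the left merge keeps exactly one row per row of A, so the output is no larger than A plus B's value columns.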










  • What type of merge do you have? If this is a many:many merge, then 5 GB of RAM might not be enough to hold everything in memory, depending upon the duplication. If it's 1:1, then perhaps just concat?

    – ALollz
    Jan 2 at 17:57











  • On the A dataframe, I can have multiple occurrences of the same ['col_1', 'col_2'] pair, while on the B dataframe, I have only one of each. The concat would otherwise have been a good idea that I did not even think of.

    – Yohan Obadia
    Jan 2 at 18:42
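Since B carries each ['col_1', 'col_2'] pair at most once, this left merge is effectively a many-to-one lookup. A small sketch with toy stand-in frames (all values hypothetical), using merge's `validate` option to make pandas enforce that assumption rather than silently exploding the row count:

```python
import pandas as pd

# Toy frames: A may repeat key pairs; B holds each pair at most once.
A = pd.DataFrame({"col_1": [1, 1, 2], "col_2": [10, 10, 20], "a_val": [0.1, 0.2, 0.3]})
B = pd.DataFrame({"col_1": [1, 2], "col_2": [10, 20], "b_val": ["x", "y"]})

# validate="m:1" raises MergeError if B unexpectedly contains duplicate key
# pairs, which is the usual cause of a merge blowing up in memory.
out = A.merge(B, how="left", on=["col_1", "col_2"], validate="m:1")
print(out)
```

If `validate` ever raises here, the merge was not the many-to-one lookup it was assumed to be, and the memory blow-up has a data explanation rather than a pandas one.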











  • I can add that I tried to run the join inside an SQLite database, outside of Python, and got the result in 30 seconds without any memory explosion. However, when I wrote a pure sqlite3 solution doing the same query, I got the memory error again.

    – Yohan Obadia
    Jan 2 at 18:44
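One way to drive that SQLite join from pandas is sketched below, with toy stand-in frames (the table and file names are made up). Writing both frames to an on-disk database keeps the join itself out of Python's memory, and reading the result back with `chunksize` avoids materialising the whole joined frame in the cursor at once:

```python
import sqlite3
import pandas as pd

# Toy stand-ins for the real frames.
A = pd.DataFrame({"col_1": [1, 1, 2], "col_2": [10, 10, 20], "a_val": [0.1, 0.2, 0.3]})
B = pd.DataFrame({"col_1": [1, 2], "col_2": [10, 20], "b_val": ["x", "y"]})

con = sqlite3.connect("join_scratch.db")  # on-disk DB so the join spills to disk
A.to_sql("a", con, index=False, if_exists="replace")
B.to_sql("b", con, index=False, if_exists="replace")
con.execute("CREATE INDEX IF NOT EXISTS idx_b ON b (col_1, col_2)")

query = """
SELECT a.*, b.b_val
FROM a
LEFT JOIN b ON a.col_1 = b.col_1 AND a.col_2 = b.col_2
"""
# Stream the result back in chunks, then reassemble.
chunks = pd.read_sql(query, con, chunksize=100_000)
result = pd.concat(chunks, ignore_index=True)
con.close()
print(result.shape)
```

The index on b's key columns is what keeps the join itself fast; without it, SQLite falls back to a nested-loop scan.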


















pandas merge memory-leaks






asked Jan 2 at 17:53









Yohan Obadia












