Python pandas remove duplicate rows that have a column value “NaN”

I need to remove rows that have NaN values in them but are also duplicates. For example, this table:

         A    B    C
    0  foo    2    3
    1  foo  nan  nan
    2  foo    1    4
    3  bar  nan  nan
    4  foo  nan  nan

Should become this:

         A    B    C
    0  foo    2    3
    2  foo    1    4
    3  bar  nan  nan

How can I do that?

python pandas duplicates

asked Nov 21 '18 at 13:26
Lame Fanello

  • drop_duplicates doesn't drop rows whose other column values differ, and dropna drops every row that has NaN values. I need to drop every duplicate of a specific column where the row has a NaN value.

    – Lame Fanello
    Nov 21 '18 at 13:31
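
To make the limitation described in this comment concrete, here is a minimal sketch that rebuilds the example table (the column values are taken from the question; the variable names `deduped` and `no_nan` are only illustrative) and applies each method on its own:

```python
import numpy as np
import pandas as pd

# Example table from the question.
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "foo"],
    "B": [2, np.nan, 1, np.nan, np.nan],
    "C": [3, np.nan, 4, np.nan, np.nan],
})

# drop_duplicates compares whole rows by default and keeps the first copy,
# so one of the two identical "foo NaN NaN" rows (index 1) survives.
deduped = df.drop_duplicates()
print(deduped.index.tolist())

# dropna removes every row that contains NaN, including the only "bar" row.
no_nan = df.dropna()
print(no_nan.index.tolist())
```

Neither result matches the desired rows 0, 2 and 3, which is why the two conditions have to be combined.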












2 Answers

Use boolean indexing:

    df = df[~df['A'].duplicated(keep=False) | df[['B','C']].notnull().any(axis=1)]
    print(df)
         A    B    C
    0  foo  2.0  3.0
    2  foo  1.0  4.0
    3  bar  NaN  NaN

Explanation (the masks below are evaluated on the original df, before the reassignment above):

Test column A for values that are not duplicated. duplicated(keep=False) marks every occurrence of a repeated value, and ~ inverts the boolean mask:

    print(~df['A'].duplicated(keep=False))
    0    False
    1    False
    2    False
    3     True
    4    False
    Name: A, dtype: bool

Check for non-missing values in columns B and C:

    print(df[['B','C']].notnull())
           B      C
    0   True   True
    1  False  False
    2   True   True
    3  False  False
    4  False  False

Then test for at least one True per row with DataFrame.any:

    print(df[['B','C']].notnull().any(axis=1))
    0     True
    1    False
    2     True
    3    False
    4    False
    dtype: bool

Chain the two masks together with | (bitwise OR):

    print(~df['A'].duplicated(keep=False) | df[['B','C']].notnull().any(axis=1))
    0     True
    1    False
    2     True
    3     True
    4    False
    dtype: bool
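
As a self-contained sketch of the same boolean-indexing idea, the two masks can be wrapped in a small helper. The function name `drop_nan_duplicates` and its parameters `key` and `value_cols` are hypothetical and not part of pandas or of the original answer:

```python
import numpy as np
import pandas as pd


def drop_nan_duplicates(df, key, value_cols):
    """Keep rows whose `key` is unique or that have data in `value_cols`.

    Hypothetical helper: rows are dropped only when their key value is
    repeated AND every column in `value_cols` is NaN.
    """
    is_unique_key = ~df[key].duplicated(keep=False)
    has_data = df[value_cols].notna().any(axis=1)
    return df[is_unique_key | has_data]


# Example table from the question.
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "foo"],
    "B": [2, np.nan, 1, np.nan, np.nan],
    "C": [3, np.nan, 4, np.nan, np.nan],
})

result = drop_nan_duplicates(df, key="A", value_cols=["B", "C"])
print(result)
```

One consequence of this mask worth keeping in mind: if a key appears several times and every one of its rows is all-NaN, all of those rows are removed, because none of them is unique and none of them carries data.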






answered Nov 21 '18 at 13:31, edited Nov 21 '18 at 13:37
jezrael

  • To the downvoter: what's wrong with this answer?

    – timgeb
    Nov 21 '18 at 14:02

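Slightly different to jezrael's solution:

    >>> df
         A    B    C
    0  foo  2.0  3.0
    1  foo  NaN  NaN
    2  foo  1.0  4.0
    3  bar  NaN  NaN
    4  foo  NaN  NaN
    >>>
    >>> df.drop(index=df[df.duplicated(keep=False)].isnull().any(axis=1).index)
         A    B    C
    0  foo  2.0  3.0
    2  foo  1.0  4.0
    3  bar  NaN  NaN

Here duplicated is applied to whole rows rather than to column A alone. Note that .index returns the labels of all fully duplicated rows, so the isnull().any(axis=1) result is not actually used as a filter; this works for the example because every duplicated row contains NaN. The axis is passed by keyword because newer pandas versions no longer accept it positionally.

Steps:

    >>> df.duplicated(keep=False)
    0    False
    1     True
    2    False
    3    False
    4     True
    dtype: bool
    >>>
    >>> df[df.duplicated(keep=False)]
         A   B   C
    1  foo NaN NaN
    4  foo NaN NaN
    >>>
    >>> df[df.duplicated(keep=False)].isnull()
           A     B     C
    1  False  True  True
    4  False  True  True
    >>>
    >>> df[df.duplicated(keep=False)].isnull().any(axis=1).index
    Int64Index([1, 4], dtype='int64')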
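
If the whole-row approach should only drop duplicated rows that really contain NaN, the boolean result of isnull().any(axis=1) can be applied as a mask before taking the index. The following is a sketch under that assumption, not the original author's code; the two "baz" rows are hypothetical additions to the question's table, included to show a duplicated row without NaN being kept:

```python
import numpy as np
import pandas as pd

# Question's table plus two hypothetical, fully duplicated "baz" rows without NaN.
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "foo", "baz", "baz"],
    "B": [2, np.nan, 1, np.nan, np.nan, 5, 5],
    "C": [3, np.nan, 4, np.nan, np.nan, 6, 6],
})

# All rows that have an identical twin somewhere in the frame.
dupes = df[df.duplicated(keep=False)]

# Of those, only the rows that contain at least one NaN.
nan_dupes = dupes[dupes.isnull().any(axis=1)]

# Drop just the duplicated rows that contain NaN.
result = df.drop(index=nan_dupes.index)
print(result.index.tolist())
```

With this variant the duplicated "baz" rows survive, whereas dropping every label in `dupes.index` would remove them as well.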






answered Nov 21 '18 at 13:40
timgeb

  • If you downvoted this answer, I would appreciate it if you left a comment so that I can improve the answer.

    – timgeb
    Nov 21 '18 at 14:01