Python pandas remove duplicate rows that have a column value “NaN”

I need to drop rows that have NaN values in them but are also duplicates. For example, this table:

    A   B   C
0   foo 2   3
1   foo nan nan
2   foo 1   4
3   bar nan nan
4   foo nan nan

should become this:

    A   B   C
0   foo 2   3
2   foo 1   4
3   bar nan nan

How can I do that?

asked Nov 21 '18 at 13:26

Lame Fanello


drop_duplicates doesn't drop rows whose other column values differ, and dropna drops every row that has NaN values. I need to drop every duplicate of a specific column where the row has a NaN value.

– Lame Fanello
Nov 21 '18 at 13:31
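The behaviour of the two built-in methods mentioned in the comment above (`DataFrame.drop_duplicates` and `DataFrame.dropna`) can be checked with a minimal sketch. The way the frame is constructed here, with `np.nan` for the missing values, is an assumption about the asker's data.

```python
import numpy as np
import pandas as pd

# Example frame from the question (construction is an assumption).
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "foo"],
    "B": [2, np.nan, 1, np.nan, np.nan],
    "C": [3, np.nan, 4, np.nan, np.nan],
})

# drop_duplicates() only removes rows that repeat an earlier row in full,
# so row 4 goes but the first all-NaN "foo" row (index 1) stays.
deduped = df.drop_duplicates()
print(deduped.index.tolist())

# dropna() removes every row containing NaN, including the lone "bar" row.
no_nan = df.dropna()
print(no_nan.index.tolist())
```

Neither result matches the wanted rows 0, 2 and 3, which is why a combined condition is needed.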

python pandas duplicates


2 Answers

Use boolean indexing:

df = df[~df['A'].duplicated(keep=False) | df[['B','C']].notnull().any(axis=1)]

print (df)
     A    B    C
0  foo  2.0  3.0
2  foo  1.0  4.0
3  bar  NaN  NaN
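For anyone who wants to run the one-liner above end to end, here is a self-contained sketch. The frame construction and the intermediate names `unique_a` and `has_value` are assumptions added for illustration; `notna` is used as the equivalent alias of `notnull`.

```python
import numpy as np
import pandas as pd

# Example frame from the question (construction is an assumption).
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "foo"],
    "B": [2, np.nan, 1, np.nan, np.nan],
    "C": [3, np.nan, 4, np.nan, np.nan],
})

# True where the value in A occurs only once in the whole column.
unique_a = ~df["A"].duplicated(keep=False)
# True where at least one of B, C holds a non-missing value.
has_value = df[["B", "C"]].notna().any(axis=1)

# Keep a row if its key is unique OR it carries real data.
result = df[unique_a | has_value]
print(result)
```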

Explanation:

Test column A for values that are not duplicated. duplicated with keep=False marks every repeated value, and ~ inverts the boolean mask:

print (~df['A'].duplicated(keep=False))
0    False
1    False
2    False
3     True
4    False
Name: A, dtype: bool

Check for non-missing values in columns B and C:

print (df[['B','C']].notnull())
       B      C
0   True   True
1  False  False
2   True   True
3  False  False
4  False  False

Then test for at least one True per row with DataFrame.any:

print (df[['B','C']].notnull().any(axis=1))
0     True
1    False
2     True
3    False
4    False
dtype: bool

Chain both masks together with | for bitwise OR:

print (~df['A'].duplicated(keep=False) | df[['B','C']].notnull().any(axis=1))
0     True
1    False
2     True
3     True
4    False
dtype: bool
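The same condition can be wrapped in a small reusable function. This is only a sketch: the name `drop_nan_duplicates` and its parameters are hypothetical, and it expresses the mask in the logically equivalent form "drop rows whose key is repeated and whose value columns are all missing".

```python
import numpy as np
import pandas as pd


def drop_nan_duplicates(frame, key, value_cols):
    """Drop rows whose key is repeated and whose value columns are all NaN.

    Hypothetical helper, not part of pandas.
    """
    repeated_key = frame[key].duplicated(keep=False)
    all_missing = frame[value_cols].isna().all(axis=1)
    # ~(a & b) is the same as (~a | ~b), i.e. the mask used above.
    return frame[~(repeated_key & all_missing)]


# Example frame from the question (construction is an assumption).
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "foo"],
    "B": [2, np.nan, 1, np.nan, np.nan],
    "C": [3, np.nan, 4, np.nan, np.nan],
})

cleaned = drop_nan_duplicates(df, key="A", value_cols=["B", "C"])
print(cleaned)
```

One thing to keep in mind: if a key appears several times and every one of its rows is all NaN, all of those rows are dropped, so the key disappears from the result. The one-liner above behaves the same way.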

edited Nov 21 '18 at 13:37

answered Nov 21 '18 at 13:31

jezrael


To the downvoter: what's wrong with this answer?

– timgeb
Nov 21 '18 at 14:02


Slightly different to jezrael's solution:

>>> df
     A    B    C
0  foo  2.0  3.0
1  foo  NaN  NaN
2  foo  1.0  4.0
3  bar  NaN  NaN
4  foo  NaN  NaN
>>> df.drop(index=df[df.duplicated(keep=False)].isnull().any(axis=1).index)
     A    B    C
0  foo  2.0  3.0
2  foo  1.0  4.0
3  bar  NaN  NaN

Steps:

>>> df.duplicated(keep=False)
0    False
1     True
2    False
3    False
4     True
dtype: bool
>>> df[df.duplicated(keep=False)]
     A   B   C
1  foo NaN NaN
4  foo NaN NaN
>>> df[df.duplicated(keep=False)].isnull()
       A     B     C
1  False  True  True
4  False  True  True
>>> df[df.duplicated(keep=False)].isnull().any(axis=1).index
Int64Index([1, 4], dtype='int64')
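One caveat worth checking on your own data: taking `.index` of the boolean Series returns the labels of every fully duplicated row, whatever the value of the mask. On the example frame this makes no difference, because the only duplicated rows are the NaN ones. The sketch below adds two hypothetical `baz` rows (not in the question) that are complete duplicates without NaN, and filters by the mask before taking the index.

```python
import numpy as np
import pandas as pd

# Question's frame plus two hypothetical complete duplicates ("baz").
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "foo", "baz", "baz"],
    "B": [2, np.nan, 1, np.nan, np.nan, 5, 5],
    "C": [3, np.nan, 4, np.nan, np.nan, 6, 6],
})

dupes = df[df.duplicated(keep=False)]      # fully duplicated rows
nan_mask = dupes.isnull().any(axis=1)      # True only where a NaN is present

# .index ignores the mask values and lists every duplicated row.
all_dupe_labels = nan_mask.index.tolist()
print(all_dupe_labels)

# Selecting the True entries first limits the drop to the NaN duplicates.
result = df.drop(index=nan_mask[nan_mask].index)
print(result.index.tolist())
```

Also note that this variant compares whole rows, while the boolean-indexing answer looks for repeats in column A only, so the two can differ on other inputs.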

answered Nov 21 '18 at 13:40

timgeb


If you downvoted this answer, I would appreciate it if you left a comment so that I can improve the answer.

– timgeb
Nov 21 '18 at 14:01
