Finding out and logging the failed validation condition in pandas
I have a dataframe df,
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
I have put a data validation on plan_year
and name
columns like below,
m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))
m1 = (df1[['name']].notnull().all(axis=1))
I am getting the valid dataframe with below ,
df1 = df[m1 & m4]
I can get the rows which are not present in df1(the rows which are invalid)
merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']
I want to keep track as to which row failed due to which validation.
I want to get a dataframe with all the invalid data dataframe to look something like below-
plan_year name metal_level_name Failed message
0 20118 Gold Heritage Plus 1500 - 02 Gold Failed due to wrong plan_year
1 2018 NaN Platinum name column cannot be null
Can someone help me with this please.
python pandas dataframe
add a comment |
I have a dataframe df,
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
I have put a data validation on plan_year
and name
columns like below,
m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))
m1 = (df1[['name']].notnull().all(axis=1))
I am getting the valid dataframe with below ,
df1 = df[m1 & m4]
I can get the rows which are not present in df1(the rows which are invalid)
merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']
I want to keep track as to which row failed due to which validation.
I want to get a dataframe with all the invalid data dataframe to look something like below-
plan_year name metal_level_name Failed message
0 20118 Gold Heritage Plus 1500 - 02 Gold Failed due to wrong plan_year
1 2018 NaN Platinum name column cannot be null
Can someone help me with this please.
python pandas dataframe
add a comment |
I have a dataframe df,
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
I have put a data validation on plan_year
and name
columns like below,
m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))
m1 = (df1[['name']].notnull().all(axis=1))
I am getting the valid dataframe with below ,
df1 = df[m1 & m4]
I can get the rows which are not present in df1(the rows which are invalid)
merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']
I want to keep track as to which row failed due to which validation.
I want to get a dataframe with all the invalid data dataframe to look something like below-
plan_year name metal_level_name Failed message
0 20118 Gold Heritage Plus 1500 - 02 Gold Failed due to wrong plan_year
1 2018 NaN Platinum name column cannot be null
Can someone help me with this please.
python pandas dataframe
I have a dataframe df,
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
I have put a data validation on plan_year
and name
columns like below,
m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))
m1 = (df1[['name']].notnull().all(axis=1))
I am getting the valid dataframe with below ,
df1 = df[m1 & m4]
I can get the rows which are not present in df1(the rows which are invalid)
merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']
I want to keep track as to which row failed due to which validation.
I want to get a dataframe with all the invalid data dataframe to look something like below-
plan_year name metal_level_name Failed message
0 20118 Gold Heritage Plus 1500 - 02 Gold Failed due to wrong plan_year
1 2018 NaN Platinum name column cannot be null
Can someone help me with this please.
python pandas dataframe
python pandas dataframe
asked Nov 21 '18 at 7:20
user1896796user1896796
129217
129217
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
You can use numpy.select
with inverting boolena masks by ~
:
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2 OK
df1 = df[df['Failed message'] != 'OK']
print (df1)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
EDIT: For multiple error messages create new DataFrame
by concat
and then matrix multiple it by columns names with separator by dot
and last remove separator from rigth side by rstrip
:
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
print (df1)
name column cannot be null Failed due to wrong plan_year
0 False True
1 True False
2 False False
1 True True
df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2
1 name column cannot be null, Failed due to wron...
df1 = df[df['Failed message'] != '']
print (df1)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
1 name column cannot be null, Failed due to wron...
Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?
– user1896796
Nov 21 '18 at 7:37
@user1896796 - so need join error messages, right?
– jezrael
Nov 21 '18 at 7:38
yes .. I guess so .I need both the error messages.
– user1896796
Nov 21 '18 at 7:40
@user1896796 - Check edited answer.
– jezrael
Nov 21 '18 at 7:47
One thing . like previously we were segregating withdf1 = df[df['Failed message'] != 'OK']
. How do we segregate now?
– user1896796
Nov 21 '18 at 7:55
|
show 1 more comment
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53407035%2ffinding-out-and-logging-the-failed-validation-condition-in-pandas%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
You can use numpy.select
with inverting boolena masks by ~
:
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2 OK
df1 = df[df['Failed message'] != 'OK']
print (df1)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
EDIT: For multiple error messages create new DataFrame
by concat
and then matrix multiple it by columns names with separator by dot
and last remove separator from rigth side by rstrip
:
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
print (df1)
name column cannot be null Failed due to wrong plan_year
0 False True
1 True False
2 False False
1 True True
df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2
1 name column cannot be null, Failed due to wron...
df1 = df[df['Failed message'] != '']
print (df1)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
1 name column cannot be null, Failed due to wron...
Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?
– user1896796
Nov 21 '18 at 7:37
@user1896796 - so need join error messages, right?
– jezrael
Nov 21 '18 at 7:38
yes .. I guess so .I need both the error messages.
– user1896796
Nov 21 '18 at 7:40
@user1896796 - Check edited answer.
– jezrael
Nov 21 '18 at 7:47
One thing . like previously we were segregating withdf1 = df[df['Failed message'] != 'OK']
. How do we segregate now?
– user1896796
Nov 21 '18 at 7:55
|
show 1 more comment
You can use numpy.select
with inverting boolena masks by ~
:
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2 OK
df1 = df[df['Failed message'] != 'OK']
print (df1)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
EDIT: For multiple error messages create new DataFrame
by concat
and then matrix multiple it by columns names with separator by dot
and last remove separator from rigth side by rstrip
:
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
print (df1)
name column cannot be null Failed due to wrong plan_year
0 False True
1 True False
2 False False
1 True True
df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2
1 name column cannot be null, Failed due to wron...
df1 = df[df['Failed message'] != '']
print (df1)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
1 name column cannot be null, Failed due to wron...
Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?
– user1896796
Nov 21 '18 at 7:37
@user1896796 - so need join error messages, right?
– jezrael
Nov 21 '18 at 7:38
yes .. I guess so .I need both the error messages.
– user1896796
Nov 21 '18 at 7:40
@user1896796 - Check edited answer.
– jezrael
Nov 21 '18 at 7:47
One thing . like previously we were segregating withdf1 = df[df['Failed message'] != 'OK']
. How do we segregate now?
– user1896796
Nov 21 '18 at 7:55
|
show 1 more comment
You can use numpy.select
with inverting boolena masks by ~
:
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2 OK
df1 = df[df['Failed message'] != 'OK']
print (df1)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
EDIT: For multiple error messages create new DataFrame
by concat
and then matrix multiple it by columns names with separator by dot
and last remove separator from rigth side by rstrip
:
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
print (df1)
name column cannot be null Failed due to wrong plan_year
0 False True
1 True False
2 False False
1 True True
df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2
1 name column cannot be null, Failed due to wron...
df1 = df[df['Failed message'] != '']
print (df1)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
1 name column cannot be null, Failed due to wron...
You can use numpy.select
with inverting boolena masks by ~
:
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2 OK
df1 = df[df['Failed message'] != 'OK']
print (df1)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
EDIT: For multiple error messages create new DataFrame
by concat
and then matrix multiple it by columns names with separator by dot
and last remove separator from rigth side by rstrip
:
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
print (df1)
name column cannot be null Failed due to wrong plan_year
0 False True
1 True False
2 False False
1 True True
df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
print (df)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
2
1 name column cannot be null, Failed due to wron...
df1 = df[df['Failed message'] != '']
print (df1)
plan_year name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
1 20148 NaN Platinum
Failed message
0 Failed due to wrong plan_year
1 name column cannot be null
1 name column cannot be null, Failed due to wron...
edited Nov 21 '18 at 7:52
answered Nov 21 '18 at 7:27
jezraeljezrael
333k24274351
333k24274351
Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?
– user1896796
Nov 21 '18 at 7:37
@user1896796 - so need join error messages, right?
– jezrael
Nov 21 '18 at 7:38
yes .. I guess so .I need both the error messages.
– user1896796
Nov 21 '18 at 7:40
@user1896796 - Check edited answer.
– jezrael
Nov 21 '18 at 7:47
One thing . like previously we were segregating withdf1 = df[df['Failed message'] != 'OK']
. How do we segregate now?
– user1896796
Nov 21 '18 at 7:55
|
show 1 more comment
Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?
– user1896796
Nov 21 '18 at 7:37
@user1896796 - so need join error messages, right?
– jezrael
Nov 21 '18 at 7:38
yes .. I guess so .I need both the error messages.
– user1896796
Nov 21 '18 at 7:40
@user1896796 - Check edited answer.
– jezrael
Nov 21 '18 at 7:47
One thing . like previously we were segregating withdf1 = df[df['Failed message'] != 'OK']
. How do we segregate now?
– user1896796
Nov 21 '18 at 7:55
Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?
– user1896796
Nov 21 '18 at 7:37
Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?
– user1896796
Nov 21 '18 at 7:37
@user1896796 - so need join error messages, right?
– jezrael
Nov 21 '18 at 7:38
@user1896796 - so need join error messages, right?
– jezrael
Nov 21 '18 at 7:38
yes .. I guess so .I need both the error messages.
– user1896796
Nov 21 '18 at 7:40
yes .. I guess so .I need both the error messages.
– user1896796
Nov 21 '18 at 7:40
@user1896796 - Check edited answer.
– jezrael
Nov 21 '18 at 7:47
@user1896796 - Check edited answer.
– jezrael
Nov 21 '18 at 7:47
One thing . like previously we were segregating with
df1 = df[df['Failed message'] != 'OK']
. How do we segregate now?– user1896796
Nov 21 '18 at 7:55
One thing . like previously we were segregating with
df1 = df[df['Failed message'] != 'OK']
. How do we segregate now?– user1896796
Nov 21 '18 at 7:55
|
show 1 more comment
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53407035%2ffinding-out-and-logging-the-failed-validation-condition-in-pandas%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown