Finding out and logging the failed validation condition in pandas












1















I have a dataframe df,



      plan_year                                    name metal_level_name
0 20118 Gold Heritage Plus 1500 - 02 Gold
1 2018 NaN Platinum
2 2018 Gold Heritage Plus 2000 - 01 Gold


I have put a data validation on plan_year and name columns like below,



m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))

m1 = (df1[['name']].notnull().all(axis=1))


I am getting the valid dataframe with below ,



df1 = df[m1 & m4]


I can get the rows which are not present in df1(the rows which are invalid)



merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']


I want to keep track as to which row failed due to which validation.



I want to get a dataframe with all the invalid data dataframe to look something like below-



 plan_year                                    name metal_level_name    Failed message
0 20118 Gold Heritage Plus 1500 - 02 Gold Failed due to wrong plan_year
1 2018 NaN Platinum name column cannot be null


Can someone help me with this please.










share|improve this question



























    1















    I have a dataframe df,



          plan_year                                    name metal_level_name
    0 20118 Gold Heritage Plus 1500 - 02 Gold
    1 2018 NaN Platinum
    2 2018 Gold Heritage Plus 2000 - 01 Gold


    I have put a data validation on plan_year and name columns like below,



    m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))

    m1 = (df1[['name']].notnull().all(axis=1))


    I am getting the valid dataframe with below ,



    df1 = df[m1 & m4]


    I can get the rows which are not present in df1(the rows which are invalid)



    merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
    merged[merged['_merge'] == 'left_only']


    I want to keep track as to which row failed due to which validation.



    I want to get a dataframe with all the invalid data dataframe to look something like below-



     plan_year                                    name metal_level_name    Failed message
    0 20118 Gold Heritage Plus 1500 - 02 Gold Failed due to wrong plan_year
    1 2018 NaN Platinum name column cannot be null


    Can someone help me with this please.










    share|improve this question

























      1












      1








      1








      I have a dataframe df,



            plan_year                                    name metal_level_name
      0 20118 Gold Heritage Plus 1500 - 02 Gold
      1 2018 NaN Platinum
      2 2018 Gold Heritage Plus 2000 - 01 Gold


      I have put a data validation on plan_year and name columns like below,



      m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))

      m1 = (df1[['name']].notnull().all(axis=1))


      I am getting the valid dataframe with below ,



      df1 = df[m1 & m4]


      I can get the rows which are not present in df1(the rows which are invalid)



      merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
      merged[merged['_merge'] == 'left_only']


      I want to keep track as to which row failed due to which validation.



      I want to get a dataframe with all the invalid data dataframe to look something like below-



       plan_year                                    name metal_level_name    Failed message
      0 20118 Gold Heritage Plus 1500 - 02 Gold Failed due to wrong plan_year
      1 2018 NaN Platinum name column cannot be null


      Can someone help me with this please.










      share|improve this question














      I have a dataframe df,



            plan_year                                    name metal_level_name
      0 20118 Gold Heritage Plus 1500 - 02 Gold
      1 2018 NaN Platinum
      2 2018 Gold Heritage Plus 2000 - 01 Gold


      I have put a data validation on plan_year and name columns like below,



      m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))

      m1 = (df1[['name']].notnull().all(axis=1))


      I am getting the valid dataframe with below ,



      df1 = df[m1 & m4]


      I can get the rows which are not present in df1(the rows which are invalid)



      merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
      merged[merged['_merge'] == 'left_only']


      I want to keep track as to which row failed due to which validation.



      I want to get a dataframe with all the invalid data dataframe to look something like below-



       plan_year                                    name metal_level_name    Failed message
      0 20118 Gold Heritage Plus 1500 - 02 Gold Failed due to wrong plan_year
      1 2018 NaN Platinum name column cannot be null


      Can someone help me with this please.







      python pandas dataframe






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 21 '18 at 7:20









      user1896796user1896796

      129217




      129217
























          1 Answer
          1






          active

          oldest

          votes


















          2














          You can use numpy.select with inverting boolena masks by ~:



          message1 = 'name column cannot be null'
          message4 = 'Failed due to wrong plan_year'


          df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
          print (df)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          2 OK




          df1 = df[df['Failed message'] != 'OK']
          print (df1)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null


          EDIT: For multiple error messages create new DataFrame by concat and then matrix multiple it by columns names with separator by dot and last remove separator from rigth side by rstrip:



          print (df)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold
          1 20148 NaN Platinum

          message1 = 'name column cannot be null'
          message4 = 'Failed due to wrong plan_year'

          df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
          print (df1)
          name column cannot be null Failed due to wrong plan_year
          0 False True
          1 True False
          2 False False
          1 True True


          df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
          print (df)

          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold
          1 20148 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          2
          1 name column cannot be null, Failed due to wron...




          df1 = df[df['Failed message'] != '']
          print (df1)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          1 20148 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          1 name column cannot be null, Failed due to wron...





          share|improve this answer


























          • Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?

            – user1896796
            Nov 21 '18 at 7:37













          • @user1896796 - so need join error messages, right?

            – jezrael
            Nov 21 '18 at 7:38











          • yes .. I guess so .I need both the error messages.

            – user1896796
            Nov 21 '18 at 7:40











          • @user1896796 - Check edited answer.

            – jezrael
            Nov 21 '18 at 7:47











          • One thing . like previously we were segregating with df1 = df[df['Failed message'] != 'OK']. How do we segregate now?

            – user1896796
            Nov 21 '18 at 7:55











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53407035%2ffinding-out-and-logging-the-failed-validation-condition-in-pandas%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          2














          You can use numpy.select with inverting boolena masks by ~:



          message1 = 'name column cannot be null'
          message4 = 'Failed due to wrong plan_year'


          df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
          print (df)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          2 OK




          df1 = df[df['Failed message'] != 'OK']
          print (df1)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null


          EDIT: For multiple error messages create new DataFrame by concat and then matrix multiple it by columns names with separator by dot and last remove separator from rigth side by rstrip:



          print (df)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold
          1 20148 NaN Platinum

          message1 = 'name column cannot be null'
          message4 = 'Failed due to wrong plan_year'

          df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
          print (df1)
          name column cannot be null Failed due to wrong plan_year
          0 False True
          1 True False
          2 False False
          1 True True


          df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
          print (df)

          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold
          1 20148 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          2
          1 name column cannot be null, Failed due to wron...




          df1 = df[df['Failed message'] != '']
          print (df1)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          1 20148 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          1 name column cannot be null, Failed due to wron...





          share|improve this answer


























          • Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?

            – user1896796
            Nov 21 '18 at 7:37













          • @user1896796 - so need join error messages, right?

            – jezrael
            Nov 21 '18 at 7:38











          • yes .. I guess so .I need both the error messages.

            – user1896796
            Nov 21 '18 at 7:40











          • @user1896796 - Check edited answer.

            – jezrael
            Nov 21 '18 at 7:47











          • One thing . like previously we were segregating with df1 = df[df['Failed message'] != 'OK']. How do we segregate now?

            – user1896796
            Nov 21 '18 at 7:55
















          2














          You can use numpy.select with inverting boolena masks by ~:



          message1 = 'name column cannot be null'
          message4 = 'Failed due to wrong plan_year'


          df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
          print (df)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          2 OK




          df1 = df[df['Failed message'] != 'OK']
          print (df1)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null


          EDIT: For multiple error messages create new DataFrame by concat and then matrix multiple it by columns names with separator by dot and last remove separator from rigth side by rstrip:



          print (df)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold
          1 20148 NaN Platinum

          message1 = 'name column cannot be null'
          message4 = 'Failed due to wrong plan_year'

          df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
          print (df1)
          name column cannot be null Failed due to wrong plan_year
          0 False True
          1 True False
          2 False False
          1 True True


          df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
          print (df)

          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold
          1 20148 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          2
          1 name column cannot be null, Failed due to wron...




          df1 = df[df['Failed message'] != '']
          print (df1)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          1 20148 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          1 name column cannot be null, Failed due to wron...





          share|improve this answer


























          • Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?

            – user1896796
            Nov 21 '18 at 7:37













          • @user1896796 - so need join error messages, right?

            – jezrael
            Nov 21 '18 at 7:38











          • yes .. I guess so .I need both the error messages.

            – user1896796
            Nov 21 '18 at 7:40











          • @user1896796 - Check edited answer.

            – jezrael
            Nov 21 '18 at 7:47











          • One thing . like previously we were segregating with df1 = df[df['Failed message'] != 'OK']. How do we segregate now?

            – user1896796
            Nov 21 '18 at 7:55














          2












          2








          2







          You can use numpy.select with inverting boolena masks by ~:



          message1 = 'name column cannot be null'
          message4 = 'Failed due to wrong plan_year'


          df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
          print (df)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          2 OK




          df1 = df[df['Failed message'] != 'OK']
          print (df1)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null


          EDIT: For multiple error messages create new DataFrame by concat and then matrix multiple it by columns names with separator by dot and last remove separator from rigth side by rstrip:



          print (df)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold
          1 20148 NaN Platinum

          message1 = 'name column cannot be null'
          message4 = 'Failed due to wrong plan_year'

          df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
          print (df1)
          name column cannot be null Failed due to wrong plan_year
          0 False True
          1 True False
          2 False False
          1 True True


          df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
          print (df)

          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold
          1 20148 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          2
          1 name column cannot be null, Failed due to wron...




          df1 = df[df['Failed message'] != '']
          print (df1)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          1 20148 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          1 name column cannot be null, Failed due to wron...





          share|improve this answer















          You can use numpy.select with inverting boolena masks by ~:



          message1 = 'name column cannot be null'
          message4 = 'Failed due to wrong plan_year'


          df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
          print (df)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          2 OK




          df1 = df[df['Failed message'] != 'OK']
          print (df1)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null


          EDIT: For multiple error messages create new DataFrame by concat and then matrix multiple it by columns names with separator by dot and last remove separator from rigth side by rstrip:



          print (df)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold
          1 20148 NaN Platinum

          message1 = 'name column cannot be null'
          message4 = 'Failed due to wrong plan_year'

          df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
          print (df1)
          name column cannot be null Failed due to wrong plan_year
          0 False True
          1 True False
          2 False False
          1 True True


          df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
          print (df)

          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          2 2018 Gold Heritage Plus 2000 - 01 Gold
          1 20148 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          2
          1 name column cannot be null, Failed due to wron...




          df1 = df[df['Failed message'] != '']
          print (df1)
          plan_year name metal_level_name
          0 20118 Gold Heritage Plus 1500 - 02 Gold
          1 2018 NaN Platinum
          1 20148 NaN Platinum

          Failed message
          0 Failed due to wrong plan_year
          1 name column cannot be null
          1 name column cannot be null, Failed due to wron...






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 21 '18 at 7:52

























          answered Nov 21 '18 at 7:27









          jezraeljezrael

          333k24274351




          333k24274351













          • Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?

            – user1896796
            Nov 21 '18 at 7:37













          • @user1896796 - so need join error messages, right?

            – jezrael
            Nov 21 '18 at 7:38











          • yes .. I guess so .I need both the error messages.

            – user1896796
            Nov 21 '18 at 7:40











          • @user1896796 - Check edited answer.

            – jezrael
            Nov 21 '18 at 7:47











          • One thing . like previously we were segregating with df1 = df[df['Failed message'] != 'OK']. How do we segregate now?

            – user1896796
            Nov 21 '18 at 7:55



















          • Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?

            – user1896796
            Nov 21 '18 at 7:37













          • @user1896796 - so need join error messages, right?

            – jezrael
            Nov 21 '18 at 7:38











          • yes .. I guess so .I need both the error messages.

            – user1896796
            Nov 21 '18 at 7:40











          • @user1896796 - Check edited answer.

            – jezrael
            Nov 21 '18 at 7:47











          • One thing . like previously we were segregating with df1 = df[df['Failed message'] != 'OK']. How do we segregate now?

            – user1896796
            Nov 21 '18 at 7:55

















          Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?

          – user1896796
          Nov 21 '18 at 7:37







          Thanks Jezrael. In case, plan_year and name columns have invalid data , it is taking name message. Though I can work this. can we handle this situation?

          – user1896796
          Nov 21 '18 at 7:37















          @user1896796 - so need join error messages, right?

          – jezrael
          Nov 21 '18 at 7:38





          @user1896796 - so need join error messages, right?

          – jezrael
          Nov 21 '18 at 7:38













          yes .. I guess so .I need both the error messages.

          – user1896796
          Nov 21 '18 at 7:40





          yes .. I guess so .I need both the error messages.

          – user1896796
          Nov 21 '18 at 7:40













          @user1896796 - Check edited answer.

          – jezrael
          Nov 21 '18 at 7:47





          @user1896796 - Check edited answer.

          – jezrael
          Nov 21 '18 at 7:47













          One thing . like previously we were segregating with df1 = df[df['Failed message'] != 'OK']. How do we segregate now?

          – user1896796
          Nov 21 '18 at 7:55





          One thing . like previously we were segregating with df1 = df[df['Failed message'] != 'OK']. How do we segregate now?

          – user1896796
          Nov 21 '18 at 7:55


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53407035%2ffinding-out-and-logging-the-failed-validation-condition-in-pandas%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Can a sorcerer learn a 5th-level spell early by creating spell slots using the Font of Magic feature?

          ts Property 'filter' does not exist on type '{}'

          mat-slide-toggle shouldn't change it's state when I click cancel in confirmation window