How can I find specified string matching filter patterns with Pandas












I have a pandas DataFrame called tf with a column titled "Keywords" that contains space-separated keywords:



   Name     ...  Keywords
0  Jonas 0  ...  Archie Betty
1  Jonas 1  ...  Archie
2  Jonas 2  ...  Chris Betty Archie
3  Jonas 3  ...  Betty Chris
4  Jonas 4  ...  Daisy
5  Jonas 5  ...  NaN
6  Jonas 5  ...  Chris Archie


As an input I want to provide a set of strings to filter the rows by these keywords. I thought about using a list:



list = ["Chris", "Betty"]


I found out that I can filter rows if I join the list into a single string with the entries separated by "|":



t="|".join(list)



and look for matches in that column with:



tf[tf["Keywords"].str.contains(t, na=False)]



This filters by finding ANY matching content, so the output is:



   Name     ...  Keywords
0  Jonas 0  ...  Archie Betty
2  Jonas 2  ...  Chris Betty Archie
3  Jonas 3  ...  Betty Chris
6  Jonas 5  ...  Chris Archie
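
For reference, the ANY-match filter described above can be reproduced with a small self-contained sketch. The sample frame is rebuilt from the table in the question (the columns hidden behind "..." are left out), the list is renamed to keywords, and re.escape is an addition that is not in the original code; it only matters if a keyword contains regex metacharacters.

```python
import re

import pandas as pd

# Sample data rebuilt from the question; only the two visible columns are included.
tf = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3", "Jonas 4", "Jonas 5", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie", "Betty Chris",
                 "Daisy", None, "Chris Archie"],
})

keywords = ["Chris", "Betty"]  # renamed from `list` so the built-in is not shadowed

# Escape each keyword so characters such as "+" or "." are matched literally.
pattern = "|".join(re.escape(k) for k in keywords)

# na=False turns missing cells into False instead of NaN in the mask.
any_match = tf[tf["Keywords"].str.contains(pattern, na=False)]
print(any_match)
```

This keeps the rows with index 0, 2, 3 and 6, matching the output shown above.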


What I want instead is:




  1. filtering by containing ONLY the list entries and


  2. filtering by containing AT LEAST the list entries



For 1. the result should be



3 Jonas 3 ... Betty Chris



For 2. the result should be:



2  Jonas 2  ...  Chris Betty Archie
3  Jonas 3  ...  Betty Chris


I found out that the following basically did the trick for 2.



a = tf["Keywords"].str.contains("Chris")
b = tf["Keywords"].str.contains("Betty")
tf[a&b]


However, I need a generic solution, because the length of the list and its entries may vary. I had a clumsy idea to use a loop that intersects each pair of consecutive list entries, but that didn't work:



i = 0
while i < len(list)-1:
    a = tf["Keywords"].str.contains(list[i])
    b = tf["Keywords"].str.contains(list[i+1])
    tf = a & b
    i += 1


I appreciate your help.
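
One likely reason the loop fails is that `tf = a & b` rebinds tf to a boolean mask, so the DataFrame itself is lost after the first pass. A minimal sketch of a generic "AT LEAST" filter, assuming the sample data from the question and plain substring matching, builds one mask per keyword and combines them with functools.reduce. The names keywords, masks and at_least are illustrative.

```python
from functools import reduce
import operator

import pandas as pd

# Sample data rebuilt from the question; only the two visible columns are included.
tf = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3", "Jonas 4", "Jonas 5", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie", "Betty Chris",
                 "Daisy", None, "Chris Archie"],
})

keywords = ["Chris", "Betty"]

# One boolean mask per keyword; regex=False treats each keyword as a literal substring.
masks = [tf["Keywords"].str.contains(k, regex=False, na=False) for k in keywords]

# AND all masks together. The all-True start value keeps every row for an empty list.
start = pd.Series(True, index=tf.index)
mask = reduce(operator.and_, masks, start)

# Filter into a new name instead of overwriting tf.
at_least = tf[mask]
print(at_least)
```

With the sample data this keeps rows 2 and 3. Because it matches substrings, "Chris" would also match a keyword such as "Christine".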










      python pandas
















      asked Nov 20 '18 at 11:57 by Jonas
























          4 Answers














          Note:

          Don't use list as a variable name, because it shadows the Python built-in list type.





          Solution if every keyword is a single word (no spaces inside a keyword):

          You can split each cell on whitespace and convert the words to sets, which makes it possible to compare them with a set built from the list L:



          import numpy as np

          L = ["Chris", "Betty"]
          s = set(L)

          # one set of words per row; non-string cells (NaN) become empty sets
          arr = np.array([set(x.split()) if isinstance(x, str) else set() for x in tf["Keywords"]])
          print (arr)
          [{'Archie', 'Betty'} {'Archie'} {'Chris', 'Archie', 'Betty'}
          {'Chris', 'Betty'} {'Daisy'} set() {'Chris', 'Archie'}]

          # ONLY the list entries: the row's set equals s
          df1 = tf[arr == s]
          print (df1)
          Name Keywords
          3 Jonas 3 Betty Chris

          # AT LEAST the list entries: the row's set is a superset of s
          df2 = tf[arr >= s]
          print (df2)
          Name Keywords
          2 Jonas 2 Chris Betty Archie
          3 Jonas 3 Betty Chris
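
The same set comparisons can also be written without NumPy. The following is only a sketch under the assumption that every keyword is a single word; filter_by_keywords and its mode names ("any", "all", "only") are hypothetical and not part of pandas or of the answer above.

```python
import pandas as pd


def filter_by_keywords(frame, column, wanted, mode="all"):
    """Hypothetical helper: filter rows by whitespace-separated keywords.

    mode="any"  keeps rows sharing at least one keyword with `wanted`,
    mode="all"  keeps rows containing at least all of `wanted`,
    mode="only" keeps rows whose keywords are exactly `wanted`.
    """
    wanted = set(wanted)
    # One set of words per row; missing (non-string) cells become empty sets.
    sets = frame[column].map(lambda x: set(x.split()) if isinstance(x, str) else set())
    if mode == "any":
        mask = sets.map(lambda s: bool(s & wanted))
    elif mode == "all":
        mask = sets.map(lambda s: s >= wanted)
    elif mode == "only":
        mask = sets.map(lambda s: s == wanted)
    else:
        raise ValueError("mode must be 'any', 'all' or 'only'")
    return frame[mask]


# Sample data rebuilt from the question; only the two visible columns are included.
tf = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3", "Jonas 4", "Jonas 5", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie", "Betty Chris",
                 "Daisy", None, "Chris Archie"],
})

only_rows = filter_by_keywords(tf, "Keywords", ["Chris", "Betty"], mode="only")
all_rows = filter_by_keywords(tf, "Keywords", ["Chris", "Betty"], mode="all")
any_rows = filter_by_keywords(tf, "Keywords", ["Chris", "Betty"], mode="any")
print(only_rows, all_rows, any_rows, sep="\n\n")
```

On the sample data, "only" keeps row 3, "all" keeps rows 2 and 3, and "any" keeps rows 0, 2, 3 and 6.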




          A more general solution that also works when a keyword consists of several words:



          print (tf)
          Name Keywords
          0 Jonas 0 Archie Betty
          1 Jonas 1 Archie
          2 Jonas 2 Chris Betty Archie
          3 Jonas 3 Betty Chris
          4 Jonas 4 Daisy Chris Archie Betty
          5 Jonas 5 NaN
          6 Jonas 5 Chris Archie Betty

          L = ["Chris Archie", "Betty"]
          s = set(L)

          # create pattern with word boundaries
          pat = '|'.join(r"\b{}\b".format(x) for x in L)

          # extract all keywords and convert to sets
          a = tf['Keywords'].str.findall('(' + pat + ')')
          a = np.array([set(x) if isinstance(x, list) else set() for x in a])
          # remove all matched keywords and strip any remaining whitespace
          b = tf['Keywords'].str.replace(pat, '', regex=True).str.strip()

          # keep rows where nothing is left after the removal and the matched set equals s
          df1 = tf[(b == '') & (a == s)]
          print (df1)
          Name Keywords
          6 Jonas 5 Chris Archie Betty

          # same as in the single-word solution
          df2 = tf[a >= s]
          print (df2)
          Name Keywords
          4 Jonas 4 Daisy Chris Archie Betty
          6 Jonas 5 Chris Archie Betty






























          • Nice! Is there also a "matching any list entry" solution as mentioned in my post with your approach?

            – Jonas
            Nov 21 '18 at 13:36













          • @Jonas - yes, use df1 = tf[a.astype(bool)]

            – jezrael
            Nov 21 '18 at 13:42

































          I think this is closer to what you're looking for. The cells of a pandas DataFrame can actually contain lists:



          import pandas

          # Create a test dataframe
          df = pandas.DataFrame(
              [
                  {"name": "A", "keywords": "Something SomethingElse"},
                  {"name": "B", "keywords": "SomethingElse Tada"},
                  {"name": "C", "keywords": "Something SomethingElse AndAnother"},
              ]
          )

          # Split the keywords INSIDE the cell, so each cell holds a list of words
          df["keywords"] = df["keywords"].apply(lambda row: row.split(" "))

          # Keep rows in which at least one keyword is among the filter terms
          filter_terms = ["Something"]
          filtered = df.loc[df["keywords"].apply(lambda row: any(term in filter_terms for term in row))]

          # Show the filtered results
          print(filtered)





























          • Thank you, interesting! Not initially a solution to my problem, but if I replace the "any" inside "filtered" with "all", then it looks like this is a solution to my problem nr. 1!

            – Jonas
            Nov 21 '18 at 12:24
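
A note on the any/all swap mentioned in the comment: with all, a row passes when every keyword in the cell is one of the filter terms, so a row holding only a subset of the terms passes too. The sketch below uses made-up data to show how that differs from an exact match. The frame and the names subset and exact are illustrative.

```python
import pandas as pd

# Made-up data: row B holds only one of the two filter terms.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "keywords": ["Chris Betty", "Betty", "Chris Betty Archie", "Archie"],
})
cells = df["keywords"].str.split()  # each cell becomes a list of words
filter_terms = {"Chris", "Betty"}

# all(...): every keyword in the cell is an allowed term (cell is a subset of the terms).
subset = cells.apply(lambda row: all(term in filter_terms for term in row))

# Equality: the cell holds exactly the filter terms, no more and no fewer.
exact = cells.apply(lambda row: set(row) == filter_terms)

print(df[subset]["name"].tolist())
print(df[exact]["name"].tolist())
```

Here the all-based filter keeps A and B, while the equality check keeps only A. On the data in the question both give the same single row, because no row holds just "Chris" or just "Betty".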



































          This just builds on the approach you outlined in your post.



          A simulated DataFrame:



          >>> df
          Name Keywords
          0 Jonas 0 Archie Betty
          1 Jonas 1 Archie
          2 Jonas 2 Chris Betty Archie
          3 Jonas 3 Betty Chris
          4 Jonas 4 Daisy
          5 Jonas 5 NaN


          Using str.contains with the names separated by |:



          >>> df[df.Keywords.str.contains("Chris|Betty", na=False)]
          Name Keywords
          0 Jonas 0 Archie Betty
          2 Jonas 2 Chris Betty Archie
          3 Jonas 3 Betty Chris


          Now, if we have several names to search for, we can build the regex by joining the words in pattern with |:



          >>> pattern
          ['Chris', 'Betty']

          >>> df[df.Keywords.str.contains('|'.join(pattern), na=False)]
          Name Keywords
          0 Jonas 0 Archie Betty
          2 Jonas 2 Chris Betty Archie
          3 Jonas 3 Betty Chris
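
Joining with | gives an OR match. If every name must be present, one option that stays with str.contains is to chain a lookahead per name. This is a sketch, assuming whole-word matching is wanted and that each cell is a single line of text; the names lookaheads and at_least are illustrative.

```python
import re

import pandas as pd

# Simulated frame as above, with a missing value in the last row.
df = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3", "Jonas 4", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie", "Betty Chris",
                 "Daisy", None],
})

pattern = ["Chris", "Betty"]

# One lookahead per name: each asserts that the name occurs somewhere in the cell.
lookaheads = "".join(r"(?=.*\b{}\b)".format(re.escape(p)) for p in pattern)

at_least = df[df.Keywords.str.contains(lookaheads, na=False)]
print(at_least)
```

With the simulated frame this keeps rows 2 and 3, the rows that contain both names.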



















            def compset(x, mylist):
                # non-string cells (e.g. NaN) cannot match
                if not isinstance(x, str):
                    return False
                y = set(x.lower().split())
                # at least all entries: y >= mylist; for an exact match use y == mylist
                return y >= mylist

            mylist = set('chris betty'.lower().split())

            df['Keywords'].apply(compset, args=(mylist,))





              The first answer (set comparison with NumPy) was posted by jezrael: answered Nov 20 '18 at 12:06, edited Nov 20 '18 at 14:36.













              answered Nov 20 '18 at 12:10
              Gijs Wobben

              • Thank you, interesting! Not initially a solution to my problem, but if I replace the "any" inside "filtered" with "all", then it looks like this is a solution to my problem nr. 1!

                – Jonas
                Nov 21 '18 at 12:24
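
A note on the `any` to `all` swap mentioned in the comment: `all(term in filter_terms for term in row)` checks that every keyword in the row is one of the filter terms, so a row holding only "Betty" would pass as well. If the row has to contain exactly the filter terms, comparing sets is one option. The following is only a sketch under assumptions: the frame is rebuilt from the sample in the question, and the names `wanted` and `keyword_sets` are made up for illustration.

```python
import pandas as pd

# Sample data modeled on the question (column names taken from the post)
tf = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3", "Jonas 4", "Jonas 5", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie", "Betty Chris",
                 "Daisy", None, "Chris Archie"],
})

# Hypothetical name for the filter terms, held as a set
wanted = {"Chris", "Betty"}

# Turn each cell into a set of keywords; missing values become an empty set
keyword_sets = tf["Keywords"].apply(
    lambda cell: set(cell.split()) if isinstance(cell, str) else set()
)

# 1. ONLY the list entries: the row's keyword set equals the wanted set
only = tf[keyword_sets.apply(lambda s: s == wanted)]

# 2. AT LEAST the list entries: the row's keyword set is a superset of the wanted set
at_least = tf[keyword_sets.apply(lambda s: s >= wanted)]

print(only)
print(at_least)
```

Because whole keywords are compared rather than substrings, a filter term like "Chris" would not match a keyword such as "Christine".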





















To build on the approach you already outlined in your post, here is a simulated DataFrame:

>>> df
      Name            Keywords
0  Jonas 0        Archie Betty
1  Jonas 1              Archie
2  Jonas 2  Chris Betty Archie
3  Jonas 3         Betty Chris
4  Jonas 4               Daisy
5  Jonas 5                 NaN

Using str.contains with the names separated by |, which matches rows containing any of the names:

>>> df[df.Keywords.str.contains("Chris|Betty", na=False)]
      Name            Keywords
0  Jonas 0        Archie Betty
2  Jonas 2  Chris Betty Archie
3  Jonas 3         Betty Chris

If the search terms are held in a list, build the regex by joining the list entries with |:

>>> pattern
['Chris', 'Betty']

>>> df[df.Keywords.str.contains('|'.join(pattern), na=False)]
      Name            Keywords
0  Jonas 0        Archie Betty
2  Jonas 2  Chris Betty Archie
3  Jonas 3         Betty Chris
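
The `|` pattern above keeps rows that contain any of the names. For the "AT LEAST the list entries" case from the question, one possible generalization of the `a & b` idea is to build one boolean mask per name and AND them all together. This is a sketch, not a tested recipe for every dataset: it assumes a non-empty `pattern` list, rebuilds the simulated frame from above, and adds `\b` word boundaries and `re.escape` (my additions, not part of the original answer) so that "Chris" does not match inside a longer word.

```python
import re
from functools import reduce
from operator import and_

import pandas as pd

# Simulated frame, same rows as shown above
df = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3", "Jonas 4", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie", "Betty Chris",
                 "Daisy", None],
})

pattern = ["Chris", "Betty"]  # assumed to be non-empty

# One boolean mask per name; na=False treats missing cells as "no match"
masks = [
    df["Keywords"].str.contains(rf"\b{re.escape(word)}\b", na=False)
    for word in pattern
]

# Combine all masks with a logical AND, regardless of how many names there are
at_least = df[reduce(and_, masks)]

print(at_least)
```

The list length can vary freely here, since `reduce` folds however many masks there are into a single one.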





answered Nov 20 '18 at 16:29
pygo

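def compset(x, mylist, exact=False):
    # Non-string cells (e.g. NaN) never match
    if not isinstance(x, str):
        return False
    # Compare case-insensitively on whole keywords
    y = set(x.lower().split())
    # exact=True: ONLY the list entries; otherwise AT LEAST the list entries
    return y == mylist if exact else mylist <= y

mylist = set('chris betty'.lower().split())

# Boolean mask over the rows; use df[mask] to filter
mask = df['Keywords'].apply(compset, args=(mylist,))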





answered Nov 29 '18 at 13:25
shantanuo