How can I find specified string matching filter patterns with Pandas

I have a pandas DataFrame called tf with a column titled "Keywords" that contains space-separated keywords:

Name         ...                    Keywords

0  Jonas 0         ...                Archie Betty

1  Jonas 1         ...                      Archie

2  Jonas 2         ...          Chris Betty Archie

3  Jonas 3         ...                 Betty Chris

4  Jonas 4         ...                       Daisy

5  Jonas 5         ...                         NaN

6  Jonas 5         ...                Chris Archie

As an input I want to provide a set of strings to filter the rows by these keywords. I thought about using a list:

list = ["Chris", "Betty"]

I found out that I can filter rows if I join the list entries into a single string, separated by "|":

t="|".join(list)

and look for matches in that column with:

tf[tf["Keywords"].str.contains(t, na=False)]

This filters by finding ANY matching content, so the output is:

Name         ...                    Keywords

0  Jonas 0         ...                Archie Betty

2  Jonas 2         ...          Chris Betty Archie

3  Jonas 3         ...                 Betty Chris

6  Jonas 5         ...                Chris Archie

What I want instead is:

1. filtering by containing ONLY the list entries, and

2. filtering by containing AT LEAST the list entries.

For 1. the result should be:

3  Jonas 3         ...                 Betty Chris

For 2. the result should be:

2  Jonas 2         ...          Chris Betty Archie

3  Jonas 3         ...                 Betty Chris

I found that the following basically does the trick for case 2:

a = tf["Keywords"].str.contains("Chris")

b = tf["Keywords"].str.contains("Betty")

tf[a&b]

However, I need a generic solution, since the list length and its entries may vary. I had a clumsy idea of looping over the list and intersecting each pair of consecutive entries, but that didn't work:

i = 0

while i < len(list)-1:

    a = tf["Keywords"].str.contains(list[i])

    b = tf["Keywords"].str.contains(list[i+1])

    tf = a & b

    i += 1

I appreciate your help.

asked Nov 20 '18 at 11:57

Jonas

403

python pandas

4 Answers

Note:

Don't use list as a variable name, because it shadows Python's built-in list type.

Solution if every keyword is a single word (no spaces inside a keyword):

You can split each cell on whitespace and convert the words to a set, then compare it against the set built from the list L:

import numpy as np

L = ["Chris", "Betty"]

s = set(L)



arr = np.array([set(x.split()) if isinstance(x, str) else set() for x in tf["Keywords"]])

print (arr)

[{'Archie', 'Betty'} {'Archie'} {'Chris', 'Archie', 'Betty'}

 {'Chris', 'Betty'} {'Daisy'} set() {'Chris', 'Archie'}]



df1 = tf[arr == s]

print (df1)

      Name     Keywords

3  Jonas 3  Betty Chris



df2 = tf[arr >= s]

print (df2)

      Name            Keywords

2  Jonas 2  Chris Betty Archie

3  Jonas 3         Betty Chris
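For readers who want to run the idea end to end, here is a minimal self-contained sketch of the same set comparison. It rebuilds the question's table by hand and uses Series.apply instead of a NumPy object array; the helper name keyword_sets and the variable names are made up for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question's table (NaN marks a missing entry).
tf = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3",
             "Jonas 4", "Jonas 5", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie",
                 "Betty Chris", "Daisy", np.nan, "Chris Archie"],
})


def keyword_sets(series):
    """Turn each cell into a set of words; non-string cells (NaN) become empty sets."""
    return series.apply(lambda x: set(x.split()) if isinstance(x, str) else set())


wanted = {"Chris", "Betty"}
sets = keyword_sets(tf["Keywords"])

only_mask = sets.apply(lambda s: s == wanted)       # exactly the wanted keywords
at_least_mask = sets.apply(lambda s: s >= wanted)   # superset of the wanted keywords
any_mask = sets.apply(lambda s: bool(s & wanted))   # shares at least one keyword

only_rows = tf[only_mask]
at_least_rows = tf[at_least_mask]
any_rows = tf[any_mask]
```

Because the comparison works on whole words, a keyword such as "Chris" would not match a cell containing "Christine", unlike a plain substring search.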

A more general solution that also works when a keyword consists of several words:

print (tf)

      Name                  Keywords

0  Jonas 0              Archie Betty

1  Jonas 1                    Archie

2  Jonas 2        Chris Betty Archie

3  Jonas 3               Betty Chris

4  Jonas 4  Daisy Chris Archie Betty

5  Jonas 5                       NaN

6  Jonas 5        Chris Archie Betty



L = ["Chris Archie", "Betty"]

s = set(L)



# create pattern with word boundaries

pat = '|'.join(r"\b{}\b".format(x) for x in L)



# extract all keywords and convert to sets

a = tf['Keywords'].str.findall('('+ pat + ')')

a = np.array([set(x) if isinstance(x, list) else set() for x in a])

# remove all matched keywords and strip any remaining whitespace

b = tf['Keywords'].str.replace(pat, '', regex=True).str.strip()



# keep rows where nothing is left after the replace and the matched set equals s

df1 = tf[(b == '') & (a == s)]

print (df1)

      Name            Keywords

6  Jonas 5  Chris Archie Betty



# same as in the single-word solution

df2 = tf[a >= s]

print (df2)

      Name                  Keywords

4  Jonas 4  Daisy Chris Archie Betty

6  Jonas 5        Chris Archie Betty
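The multi-word approach can also be wrapped in a small function. The sketch below is one possible packaging, not the answer's own code: the function name match_keywords is hypothetical, and two details are added as assumptions, namely re.escape (in case a keyword contains regex metacharacters) and sorting the alternatives longest first (so a longer phrase is tried before a shorter one it contains).

```python
import re

import numpy as np
import pandas as pd

# Sample data mirroring the second table in the answer above.
tf = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3",
             "Jonas 4", "Jonas 5", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie", "Betty Chris",
                 "Daisy Chris Archie Betty", np.nan, "Chris Archie Betty"],
})


def match_keywords(series, keywords):
    """Return (only_mask, at_least_mask) for keywords that may contain spaces."""
    wanted = set(keywords)
    # Longest alternatives first; re.escape neutralises regex metacharacters.
    ordered = sorted(wanted, key=len, reverse=True)
    pat = "|".join(r"\b{}\b".format(re.escape(k)) for k in ordered)

    # All keyword hits per cell as a set; NaN cells yield an empty set.
    found = series.str.findall(pat)
    found_sets = found.apply(lambda x: set(x) if isinstance(x, list) else set())

    # Whatever remains after deleting the hits must be empty for an "only" match.
    leftover = series.str.replace(pat, "", regex=True).str.strip()

    at_least = found_sets.apply(lambda s: s >= wanted)
    only = at_least & (leftover == "")
    return only, at_least


only, at_least = match_keywords(tf["Keywords"], ["Chris Archie", "Betty"])
only_rows = tf[only]
at_least_rows = tf[at_least]
```

Note that the \b word boundaries assume each keyword starts and ends with a word character; keywords that begin or end with punctuation would need a different delimiter check.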

edited Nov 20 '18 at 14:36

answered Nov 20 '18 at 12:06

jezrael

328k23270348

Nice! Is there also a "matching any list entry" solution as mentioned in my post with your approach?

– Jonas
Nov 21 '18 at 13:36

@Jonas - yes, use df1 = tf[a.astype(bool)]

– jezrael
Nov 21 '18 at 13:42

I think this is closer to what you're looking for. pandas DataFrame cells can actually contain lists:

import pandas



# Create a test dataframe

df = pandas.DataFrame(

    [

        {"name": "A", "keywords": "Something SomethingElse"},

        {"name": "B", "keywords": "SomethingElse Tada"},

        {"name": "C", "keywords": "Something SomethingElse AndAnother"},

    ]

)



# Split the keywords INSIDE the cell

df["keywords"] = df["keywords"].apply(lambda row: row.split(" "))



# Filter for a specific keyword

filter_terms = ["Something"]

filtered = df.loc[df["keywords"].apply(lambda row: any([term in filter_terms for term in row]))]



# Show the filtered results

print(filtered)

answered Nov 20 '18 at 12:10

Gijs Wobben

515

Thank you, interesting! Not initially a solution to my problem, but if I replace the "any" inside "filtered" with "all", then it looks like this is a solution to my problem nr. 1!

– Jonas
Nov 21 '18 at 12:24
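On the any/all swap mentioned in the comment above, a small sketch may help show what each variant tests. With all, a row passes when every keyword in the cell is one of the filter terms, so a cell holding fewer terms than the filter also passes; if "only" is meant as "exactly these terms", comparing sets for equality is stricter. Row "D" and the two-term filter list are hypothetical additions made purely to show the difference.

```python
import pandas as pd

df = pd.DataFrame([
    {"name": "A", "keywords": "Something SomethingElse"},
    {"name": "B", "keywords": "SomethingElse Tada"},
    {"name": "C", "keywords": "Something SomethingElse AndAnother"},
    {"name": "D", "keywords": "Something"},  # hypothetical extra row
])

# Split the keywords inside each cell into a list of words.
df["keywords"] = df["keywords"].str.split()

filter_terms = ["Something", "SomethingElse"]
wanted = set(filter_terms)

# all(...): every keyword in the cell is a filter term (the cell is a subset).
subset_mask = df["keywords"].apply(lambda row: all(term in wanted for term in row))

# Set equality: the cell holds exactly the filter terms, no more and no fewer.
exact_mask = df["keywords"].apply(lambda row: set(row) == wanted)

subset_names = df.loc[subset_mask, "name"].tolist()
exact_names = df.loc[exact_mask, "name"].tolist()
```

Here subset_names includes row "D" while exact_names does not, which is the case to watch for when choosing between the two.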

Just adding to the approach you described in your post.

A simulated DataFrame:

>>> df

      Name            Keywords

0  Jonas 0        Archie Betty

1  Jonas 1              Archie

2  Jonas 2  Chris Betty Archie

3  Jonas 3         Betty Chris

4  Jonas 4               Daisy

5  Jonas 5                 NaN

Using str.contains with the names separated by |:

>>> df[df.Keywords.str.contains("Chris|Betty", na=False)]

      Name            Keywords

0  Jonas 0        Archie Betty

2  Jonas 2  Chris Betty Archie

3  Jonas 3         Betty Chris

Now, if there are several names to search for, build the regex by joining the entries of the pattern list with |:

>>> pattern

['Chris', 'Betty']



>>> df[df.Keywords.str.contains('|'.join(pattern), na=False)]

      Name            Keywords

0  Jonas 0        Archie Betty

2  Jonas 2  Chris Betty Archie

3  Jonas 3         Betty Chris
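Joining with | gives the "any" match. For the "at least all entries" case, the same str.contains idea can be generalised by building one mask per entry and AND-ing them together. The sketch below assumes the question's seven-row table and uses functools.reduce; it is one way to do it, not the only one. It also avoids the problem in the question's loop, where tf = a & b overwrites the DataFrame with a boolean mask so the next tf["Keywords"] lookup fails.

```python
from functools import reduce
import operator

import numpy as np
import pandas as pd

# Sample data mirroring the question's table (NaN marks a missing entry).
df = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3",
             "Jonas 4", "Jonas 5", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie",
                 "Betty Chris", "Daisy", np.nan, "Chris Archie"],
})

pattern = ["Chris", "Betty"]

# One boolean mask per search term; na=False counts NaN cells as "no match",
# regex=False treats each term as a literal substring.
masks = [df["Keywords"].str.contains(term, na=False, regex=False)
         for term in pattern]

# AND all masks together: a row survives only if it contains every term.
at_least = reduce(operator.and_, masks)
result = df[at_least]
```

This is a substring test, so "Chris" would also match "Christine", and an empty pattern list makes reduce raise a TypeError; a set-based comparison avoids the first issue.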

answered Nov 20 '18 at 16:29

pygo

2,8081619

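def compset(x, mylist):

    # NaN cells are floats, not strings, so treat them as "no match"

    if not isinstance(x, str):

        return False

    y = set(x.lower().split())

    # True when every search term occurs in the cell ("at least" match);

    # compare y == mylist instead for an exact ("only") match

    return len(y.intersection(mylist)) == len(mylist)



mylist = set('chris betty'.lower().split())



df['Keywords'].apply(compset, args=(mylist,))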

answered Nov 29 '18 at 13:25

shantanuo

11.7k56153256
