How can I find specified string matching filter patterns with Pandas
I have a pandas DataFrame called tf which has a column titled "Keywords" containing space-separated keywords:
Name ... Keywords
0 Jonas 0 ... Archie Betty
1 Jonas 1 ... Archie
2 Jonas 2 ... Chris Betty Archie
3 Jonas 3 ... Betty Chris
4 Jonas 4 ... Daisy
5 Jonas 5 ... NaN
6 Jonas 5 ... Chris Archie
As an input I want to provide a set of strings to filter the rows by these keywords. I thought about using a list:
list = ["Chris", "Betty"]
I found out that I can filter rows if I turn the list into a string with the entries separated by "|":
t="|".join(list)
and look for matches in that column with:
tf[tf["Keywords"].str.contains(t, na=False)]
This filters by finding ANY matching content, so the output is:
Name ... Keywords
0 Jonas 0 ... Archie Betty
2 Jonas 2 ... Chris Betty Archie
3 Jonas 3 ... Betty Chris
6 Jonas 5 ... Chris Archie
What I want instead is:
1. filtering by containing ONLY the list entries, and
2. filtering by containing AT LEAST the list entries
For 1. the result should be
3 Jonas 3 ... Betty Chris
For 2. the result should be:
2 Jonas 2 ... Chris Betty Archie
3 Jonas 3 ... Betty Chris
I found out that the following basically did the trick for 2.
a = tf["Keywords"].str.contains("Chris")
b = tf["Keywords"].str.contains("Betty")
tf[a&b]
However, I need a generic solution, as the list length and its entries may vary. I had a clumsy idea of using a loop to intersect each pair of consecutive list entries, but that didn't work:
i = 0
while i < len(list)-1:
    a = tf["Keywords"].str.contains(list[i])
    b = tf["Keywords"].str.contains(list[i+1])
    tf = a & b
    i += 1
I appreciate your help.
python pandas
asked Nov 20 '18 at 11:57
Jonas
4 Answers
Note: Don't use list as a variable name, because it shadows the Python built-in.
Solution if every keyword is a single word (no spaces inside a keyword):
You can split each value on whitespace and convert the words to sets, which can then be compared with a set built from the list L:
import numpy as np
L = ["Chris", "Betty"]
s = set(L)
arr = np.array([set(x.split()) if isinstance(x, str) else set() for x in tf["Keywords"]])
print (arr)
[{'Archie', 'Betty'} {'Archie'} {'Chris', 'Archie', 'Betty'}
{'Chris', 'Betty'} {'Daisy'} set() {'Chris', 'Archie'}]
df1 = tf[arr == s]
print (df1)
Name Keywords
3 Jonas 3 Betty Chris
df2 = tf[arr >= s]
print (df2)
Name Keywords
2 Jonas 2 Chris Betty Archie
3 Jonas 3 Betty Chris
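The three filters (any, at least, only) can also be wrapped in one reusable helper. This is only a sketch under the same single-word assumption; the function name filter_keywords and its mode argument are made up for illustration and are not part of pandas. The sample frame is rebuilt from the question's data.

```python
import numpy as np
import pandas as pd


def filter_keywords(df, wanted, mode, column="Keywords"):
    """Hypothetical helper: return the rows of df whose keyword cell matches `wanted`.

    mode: "any"  -> at least one wanted keyword is present
          "all"  -> every wanted keyword is present (extras allowed)
          "only" -> the cell holds exactly the wanted keywords
    """
    wanted = set(wanted)
    tests = {
        "any": lambda words: bool(words & wanted),
        "all": lambda words: wanted <= words,
        "only": lambda words: words == wanted,
    }
    # Split each cell on whitespace; NaN and other non-strings become an empty set.
    word_sets = [set(x.split()) if isinstance(x, str) else set() for x in df[column]]
    mask = np.array([tests[mode](words) for words in word_sets], dtype=bool)
    return df[mask]


# Sample data copied from the question.
tf = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3", "Jonas 4", "Jonas 5", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie", "Betty Chris",
                 "Daisy", np.nan, "Chris Archie"],
})

only_rows = filter_keywords(tf, ["Chris", "Betty"], "only")
at_least_rows = filter_keywords(tf, ["Chris", "Betty"], "all")
any_rows = filter_keywords(tf, ["Chris", "Betty"], "any")
```

Because the comparison is done on sets, the order of the words in a cell and the length of the search list do not matter.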
More general solution working with multiple words in keywords:
print (tf)
Name Keywords
0 Jonas 0 Archie Betty
1 Jonas 1 Archie
2 Jonas 2 Chris Betty Archie
3 Jonas 3 Betty Chris
4 Jonas 4 Daisy Chris Archie Betty
5 Jonas 5 NaN
6 Jonas 5 Chris Archie Betty
L = ["Chris Archie", "Betty"]
s = set(L)
# create a pattern with word boundaries
pat = '|'.join(r"\b{}\b".format(x) for x in L)
# extract all keywords and convert to sets
a = tf['Keywords'].str.findall('('+ pat + ')')
a = np.array([set(x) if isinstance(x, list) else set() for x in a])
# remove all matched keywords and strip any leftover whitespace
b = tf['Keywords'].str.replace(pat, '', regex=True).str.strip()
# keep rows where nothing is left after the replace and the matched set equals s
df1 = tf[(b == '') & (a == s)]
print (df1)
Name Keywords
6 Jonas 5 Chris Archie Betty
# same as in the single-word solution
df2 = tf[a >= s]
print (df2)
Name Keywords
4 Jonas 4 Daisy Chris Archie Betty
6 Jonas 5 Chris Archie Betty
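If a keyword may contain regex metacharacters such as a dot, the joined pattern can match more than intended. A minimal sketch of the same findall/replace idea on plain strings, using re.escape from the standard library; the helper names build_pattern, matched_keywords and leftover are made up for illustration, and the "J.R" keyword is hypothetical test data.

```python
import re


def build_pattern(keywords):
    """Join keywords into one alternation, escaping regex metacharacters."""
    return "|".join(r"\b{}\b".format(re.escape(k)) for k in keywords)


def matched_keywords(text, keywords):
    """Return the set of keywords found in text as whole words."""
    return set(re.findall(build_pattern(keywords), text))


def leftover(text, keywords):
    """Return what remains of text after removing every matched keyword."""
    return re.sub(build_pattern(keywords), "", text).strip()


L = ["Chris Archie", "Betty"]

# A cell that holds only the keywords leaves nothing behind.
exact_cell = "Chris Archie Betty"
is_exact = matched_keywords(exact_cell, L) == set(L) and leftover(exact_cell, L) == ""

# A cell with an extra word still contains all keywords, but is not exact.
extra_cell = "Daisy Chris Archie Betty"
has_all = matched_keywords(extra_cell, L) >= set(L)
rest = leftover(extra_cell, L)
```

The resulting pattern string can be passed to str.findall and str.replace in place of pat. Note that \b only behaves as expected when each keyword starts and ends with a word character.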
Nice! Is there also a "matching any list entry" solution as mentioned in my post with your approach?
– Jonas
Nov 21 '18 at 13:36
@Jonas - yes, use df1 = tf[a.astype(bool)]
– jezrael
Nov 21 '18 at 13:42
I think this is closer to what you're looking for. Pandas DataFrame cells can actually contain lists:
import pandas
# Create a test dataframe
df = pandas.DataFrame(
    [
        {"name": "A", "keywords": "Something SomethingElse"},
        {"name": "B", "keywords": "SomethingElse Tada"},
        {"name": "C", "keywords": "Something SomethingElse AndAnother"},
    ]
)
# Split the keywords INSIDE the cell
df["keywords"] = df["keywords"].apply(lambda row: row.split(" "))
# Filter for a specific keyword
filter_terms = ["Something"]
filtered = df.loc[df["keywords"].apply(lambda row: any([term in filter_terms for term in row]))]
# Show the filtered results
print(filtered)
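With list cells, the choice of test decides which rows survive. The sketch below compares four variants side by side. The three-row frame is made-up illustration data, not the data from the question. Note that all(term in filter_terms for term in row) checks that the row is a subset of the filter terms, so a row holding just one of the terms passes as well; set equality is the stricter "only" test.

```python
import pandas as pd

# Made-up illustration data.
df = pd.DataFrame(
    [
        {"name": "A", "keywords": "Chris Betty"},
        {"name": "B", "keywords": "Chris"},
        {"name": "C", "keywords": "Chris Betty Archie"},
    ]
)
df["keywords"] = df["keywords"].str.split()
filter_terms = {"Chris", "Betty"}

# At least one term of the row is a filter term.
any_match = df["keywords"].apply(lambda row: any(t in filter_terms for t in row))
# Every term of the row is a filter term (row is a subset of the filter).
subset = df["keywords"].apply(lambda row: all(t in filter_terms for t in row))
# The row holds exactly the filter terms.
exact = df["keywords"].apply(lambda row: set(row) == filter_terms)
# The row holds all filter terms, extras allowed.
superset = df["keywords"].apply(lambda row: filter_terms.issubset(row))

names_any = df.loc[any_match, "name"].tolist()
names_subset = df.loc[subset, "name"].tolist()
names_exact = df.loc[exact, "name"].tolist()
names_superset = df.loc[superset, "name"].tolist()
```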
Thank you, interesting! Not initially a solution to my problem, but if I replace the "any" inside "filtered" with "all", then it looks like this is a solution to my problem nr. 1!
– Jonas
Nov 21 '18 at 12:24
Just adding to the approach you described in your post, with a simulated DataFrame:
>>> df
Name Keywords
0 Jonas 0 Archie Betty
1 Jonas 1 Archie
2 Jonas 2 Chris Betty Archie
3 Jonas 3 Betty Chris
4 Jonas 4 Daisy
5 Jonas 5 NaN
Using str.contains with the names separated by |:
>>> df[df.Keywords.str.contains("Chris|Betty", na=False)]
Name Keywords
0 Jonas 0 Archie Betty
2 Jonas 2 Chris Betty Archie
3 Jonas 3 Betty Chris
Now, if we have several names to search for, we can build the regex by joining the words in pattern with |:
>>> pattern
['Chris', 'Betty']
>>> df[df.Keywords.str.contains('|'.join(pattern), na=False)]
Name Keywords
0 Jonas 0 Archie Betty
2 Jonas 2 Chris Betty Archie
3 Jonas 3 Betty Chris
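Joining with | gives the ANY filter. For the AT LEAST filter, one option is to build one str.contains mask per word and AND them all together, which works for a list of any length. This is a sketch, not the only way to do it; the variable names are illustrative, and the word boundaries (\b) plus re.escape are an added assumption so that "Chris" does not match inside a longer word.

```python
import operator
import re
from functools import reduce

import numpy as np
import pandas as pd

# Simulated DataFrame, same rows as above.
df = pd.DataFrame({
    "Name": ["Jonas 0", "Jonas 1", "Jonas 2", "Jonas 3", "Jonas 4", "Jonas 5"],
    "Keywords": ["Archie Betty", "Archie", "Chris Betty Archie", "Betty Chris",
                 "Daisy", np.nan],
})
pattern = ["Chris", "Betty"]

# One boolean mask per search word; na=False turns NaN cells into False.
masks = [
    df["Keywords"].str.contains(r"\b{}\b".format(re.escape(word)), na=False)
    for word in pattern
]

# AND all masks together; the all-True start value also covers an empty list.
all_mask = reduce(operator.and_, masks, pd.Series(True, index=df.index))
result = df[all_mask]
```

The masks are kept in their own list rather than assigned back to the DataFrame variable, so the original frame is still available for the final selection.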
def compset(x, mylist):
    # NaN cells are floats, not strings, so they never match
    if not isinstance(x, str):
        return False
    y = set(x.lower().split())
    # True when at least two of the search terms appear in the cell
    return len(y.intersection(mylist)) > 1

mylist = set('chris betty'.lower().split())
df['Keywords'].apply(compset, args=(mylist,))
edited Nov 20 '18 at 14:36
answered Nov 20 '18 at 12:06


jezrael
answered Nov 20 '18 at 12:10
Gijs Wobben
answered Nov 20 '18 at 16:29


pygo
answered Nov 29 '18 at 13:25
shantanuo