How to count 500 most common words in pandas dataframe
I have a dataframe with 500 texts in a column called Text (one text per row), and I want to count the most common words across all texts.
So far I have tried (both methods from Stack Overflow):
pd.Series(' '.join(df['Text']).lower().split()).value_counts()[:100]
and
Counter(" ".join(df["Text"]).split()).most_common(100)
Both gave me the following error:
TypeError: sequence item 0: expected str instance, list found
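A minimal example (with made-up token lists in place of my real data) that reproduces the error:

```python
import pandas as pd

# hypothetical stand-in for my dataframe: the Text column holds token lists
df = pd.DataFrame({'Text': [['business', 'date', 'jan'], ['group', 'date', 'feb']]})

try:
    ' '.join(df['Text'])
except TypeError as err:
    print(err)  # sequence item 0: expected str instance, list found
```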
I have also tried the Counter method directly with
df.Text.apply(Counter())
which gave me the word count within each text, and I altered it so it returned the most common words per text. But I want the overall most common words.
Here is a sample of the dataframe (the texts are already lowercased, cleaned of punctuation, tokenized, and stripped of stop words):
Datum File File_type Text length len_cleaned_text
Datum
2000-01-27 2000-01-27 _04.txt _04 [business, date, jan, heineken, starts, integr... 396 220
Edit: code to reproduce it:
for file in file_list:
    name = file[len(input_path):]
    date = name[11:17]
    type_1 = name[17:20]
    with open(file, "r", encoding="utf-8", errors="surrogateescape") as rfile:
        text = rfile.read()
        text = text.encode('utf-8', 'ignore')
        text = text.decode('utf-8', 'ignore')
    a = {"File": name, "Text": text, 'the': count_the, 'Datum': date, 'File_type': type_1, 'length': length}
    result_list.append(a)
New cell:
df['Text'] = df['Text'].str.lower()
p = re.compile(r'[^\w\s]+')
d = re.compile(r'\d+')
for index, row in df.iterrows():
    df['Text'] = df['Text'].str.replace('\n', ' ')
    df['Text'] = df['Text'].str.replace('################################ end of story 1 ##############################', '')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]
df['Text'] = [d.sub('', x) for x in df['Text'].tolist()]
df['Text'] = df['Text'].apply(word_tokenize)
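For reference, here is what the two patterns do to a single raw string (the sample string is invented):

```python
import re

p = re.compile(r'[^\w\s]+')  # strip punctuation (keeps word chars and whitespace)
d = re.compile(r'\d+')       # strip digits

raw = "Business date: Jan 27, 2000.\nHeineken starts integration!"
text = raw.lower().replace('\n', ' ')
text = p.sub('', text)
text = d.sub('', text)
print(text.split())  # ['business', 'date', 'jan', 'heineken', 'starts', 'integration']
```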
Datum File File_type Text length the
Datum
2000-01-27 2000-01-27 0864820040_000127_04.txt _04 [business, date, jan, heineken, starts, integr... 396 0
2000-02-01 2000-02-01 0910068040_000201_04.txt _04 [group, english, cns, date, feb, bat, acquisit... 305 0
2000-05-03 2000-05-03 1070448040_000503_04.txt _04 [date, may, cobham, plc, cob, acquisitionsdisp... 701 0
2000-05-11 2000-05-11 0865985020_000511_04.txt _04 [business, date, may, swedish, match, complete... 439 0
2000-11-28 2000-11-28 1067252020_001128_04.txt _04 [date, nov, intec, telecom, sys, itl, doc, pla... 158 0
2000-12-18 2000-12-18 1963867040_001218_04.txt _04 [associated, press, apw, date, dec, volvo, div... 367 0
2000-12-19 2000-12-19 1065767020_001219_04.txt _04 [date, dec, spirent, plc, spt, acquisition, co... 414 0
2000-12-21 2000-12-21 1076829040_001221_04.txt _04 [bloomberg, news, bn, date, dec, eni, ceo, cfo... 271 0
2001-02-06 2001-02-06 1084749020_010206_04.txt _04 [date, feb, chemring, group, plc, chg, acquisi... 130 0
2001-02-15 2001-02-15 1063497040_010215_04.txt _04 [date, feb, electrolux, ab, elxb, acquisition,... 420 0
And a description of the dataframe:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 557 entries, 2000-01-27 to 2017-10-06
Data columns (total 13 columns):
Datum 557 non-null datetime64[ns]
File 557 non-null object
File_type 557 non-null object
Text 557 non-null object
customers 557 non-null int64
grwoth 557 non-null int64
human 557 non-null int64
intagibles 557 non-null int64
length 557 non-null int64
synergies 557 non-null int64
technology 557 non-null int64
the 557 non-null int64
len_cleaned_text 557 non-null int64
dtypes: datetime64[ns](1), int64(9), object(3)
memory usage: 60.9+ KB
Thanks in advance
python python-3.x pandas
Could you provide a sample Dataframe and a mcve? – Scotty1- Nov 21 '18 at 9:52
medium.com/@cristhianboujon/… if you want to try – iamklaus Nov 21 '18 at 9:59
edited Nov 21 '18 at 10:23
asked Nov 21 '18 at 9:51 by user10395806
3 Answers
Ok, I got it. Your df['Text'] consists of lists of tokens. So you can do this:
full_list = []                   # list containing all words of all texts
for elmnt in df['Text']:         # loop over the lists in df
    full_list += elmnt           # append the elements of each list to the full list
val_counts = pd.Series(full_list).value_counts()  # make a temporary Series to count
This solution avoids using too many list comprehensions and thus keeps the code easy to read and understand. Furthermore, no additional modules like re or collections are needed.
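For example, on a small invented frame:

```python
import pandas as pd

# two hypothetical token lists standing in for the real Text column
df = pd.DataFrame({'Text': [['date', 'jan', 'heineken'], ['date', 'feb', 'bat']]})

full_list = []
for elmnt in df['Text']:
    full_list += elmnt

val_counts = pd.Series(full_list).value_counts()
print(val_counts.idxmax(), val_counts.max())  # date 2
```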
tried value counts as well... it counted each text as one – user10395806 Nov 21 '18 at 10:03
It works with the sample dataframe you provided. Could you please provide a sample dataframe which reproduces the first 10 rows of your dataframe, including the code snippet needed to construct it? – Scotty1- Nov 21 '18 at 10:08
Here is my version: I convert the column values into a list, then build a flat list of words, clean it, and count it:
import ast
import collections

your_text_list = df['Text'].tolist()
your_text_list_nan_rm = [x for x in your_text_list if str(x) != 'nan']  # drop NaN rows
flat_list = [inner for item in your_text_list_nan_rm for inner in ast.literal_eval(item)]
counter = collections.Counter(flat_list)
top_words = counter.most_common(100)
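Note that ast.literal_eval assumes each cell is the string representation of a list (e.g. after a CSV round-trip); if the cells are already Python lists, that step can be skipped. A small sketch under the string-cell assumption:

```python
import ast
import collections

# hypothetical cells as they would look after saving/reloading via CSV
cells = ["['date', 'jan']", "['date', 'feb']"]

flat_list = [word for cell in cells for word in ast.literal_eval(cell)]
counter = collections.Counter(flat_list)
print(counter.most_common(1))  # [('date', 2)]
```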
error: 'list' object has no attribute 'split' – user10395806 Nov 21 '18 at 10:08
You can do it via the apply and Counter.update methods:
from collections import Counter

counter = Counter()
df = pd.DataFrame({'Text': values})   # 'values' holds the sample token lists shown below
_ = df['Text'].apply(lambda x: counter.update(x))
counter.most_common(10)
Out:
[('Amy', 3), ('was', 3), ('hated', 2),
('Kamal', 2), ('her', 2), ('and', 2),
('she', 2), ('She', 2), ('sent', 2), ('text', 2)]
Where df['Text']
is:
0 [Amy, normally, hated, Monday, mornings, but, ...
1 [Kamal, was, in, her, art, class, and, she, li...
2 [She, was, waiting, outside, the, classroom, w...
3 [Hi, Amy, Your, mum, sent, me, a, text]
4 [You, forgot, your, inhaler]
5 [Why, don’t, you, turn, your, phone, on, Amy, ...
6 [She, never, sent, text, messages, and, she, h...
Name: Text, dtype: object
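A self-contained run of the same approach, with a couple of invented token lists standing in for values:

```python
from collections import Counter

import pandas as pd

values = [['Amy', 'hated', 'Monday'], ['Amy', 'sent', 'a', 'text']]
counter = Counter()
df = pd.DataFrame({'Text': values})
df['Text'].apply(counter.update)  # each row's token list updates the shared counter
print(counter.most_common(1))  # [('Amy', 2)]
```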
edited Nov 21 '18 at 10:26, answered Nov 21 '18 at 9:58 by Scotty1-
edited Nov 21 '18 at 10:16, answered Nov 21 '18 at 10:05 by RoyaumeIX
answered Nov 21 '18 at 11:51 by Mikhail Stepanov