How to count 500 most common words in pandas dataframe



























I have a dataframe with 500 texts in a column called Text (one text per row), and I want to count the most common words across all texts.



So far I have tried (both methods from Stack Overflow):



pd.Series(' '.join(df['Text']).lower().split()).value_counts()[:100]


and



Counter(" ".join(df["Text"]).split()).most_common(100)


Both gave me the following error:




TypeError: sequence item 0: expected str instance, list found
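That error is easy to reproduce: after tokenization each cell of Text holds a list, and `str.join` refuses non-string elements. A minimal sketch on toy data (not the original frame):

```python
import pandas as pd

# Toy frame whose Text column holds token lists, as after word_tokenize
df = pd.DataFrame({'Text': [['foo', 'bar'], ['bar', 'baz']]})

try:
    ' '.join(df['Text'])  # each element is a list, not a str
except TypeError as e:
    print(e)  # sequence item 0: expected str instance, list found

# Joining works once every row is first collapsed to a single string
joined = ' '.join(' '.join(tokens) for tokens in df['Text'])
print(joined)  # foo bar bar baz
```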




I have also tried the Counter approach simply with



df.Text.apply(Counter()) 


which gave me the word count in each text; I also altered the Counter approach so that it returned the most common words in each text.



But I want the overall most common words.



Here is a sample of the dataframe (the text is already lowercased, cleaned of punctuation, tokenized, and stop words are removed):



    Datum   File    File_type                                         Text                         length    len_cleaned_text
Datum
2000-01-27 2000-01-27 _04.txt _04 [business, date, jan, heineken, starts, integr... 396 220


Edit: code to reproduce it:



for file in file_list:
    name = file[len(input_path):]
    date = name[11:17]
    type_1 = name[17:20]

    with open(file, "r", encoding="utf-8", errors="surrogateescape") as rfile:
        text = rfile.read()
    text = text.encode('utf-8', 'ignore')
    text = text.decode('utf-8', 'ignore')
    a = {"File": name, "Text": text, 'the': count_the, 'Datum': date, 'File_type': type_1, 'length': length}
    result_list.append(a)
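Assuming `result_list` collects one dict per file as in the loop above, the dataframe itself would then be built along these lines (a sketch with a made-up filename; `count_the` and `length` come from elsewhere in the notebook, so fixed stand-in values are used here):

```python
import pandas as pd

# Hypothetical stand-in for the per-file dicts built in the loop above
result_list = [
    {'File': '0864820040_000127_04.txt', 'Text': 'business date jan heineken ...',
     'the': 0, 'Datum': '000127', 'File_type': '_04', 'length': 396},
]
df = pd.DataFrame(result_list)
# The Datum slice is yymmdd, so it parses with %y%m%d
df['Datum'] = pd.to_datetime(df['Datum'], format='%y%m%d')
print(df.dtypes)
```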


New cell:



df['Text'] = df['Text'].str.lower()
p = re.compile(r'[^\w\s]+')  # punctuation
d = re.compile(r'\d+')       # digits
df['Text'] = df['Text'].str.replace('\n', ' ')
df['Text'] = df['Text'].str.replace('################################ end of story 1 ##############################', '')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]
df['Text'] = [d.sub('', x) for x in df['Text'].tolist()]
df['Text'] = df['Text'].apply(word_tokenize)
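The cleaning chain can be checked on a single string (using `[^\w\s]+` for punctuation and `\d+` for digits, which appear to be the intended patterns — the backslashes were lost in posting):

```python
import re

p = re.compile(r'[^\w\s]+')  # strip punctuation
d = re.compile(r'\d+')       # strip digits

raw = "Business Date: Jan 27, 2000.\nHeineken starts integration!"
clean = d.sub('', p.sub('', raw.lower().replace('\n', ' ')))
print(clean.split())  # ['business', 'date', 'jan', 'heineken', 'starts', 'integration']
```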


Datum File File_type Text length the
Datum
2000-01-27 2000-01-27 0864820040_000127_04.txt _04 [business, date, jan, heineken, starts, integr... 396 0
2000-02-01 2000-02-01 0910068040_000201_04.txt _04 [group, english, cns, date, feb, bat, acquisit... 305 0
2000-05-03 2000-05-03 1070448040_000503_04.txt _04 [date, may, cobham, plc, cob, acquisitionsdisp... 701 0
2000-05-11 2000-05-11 0865985020_000511_04.txt _04 [business, date, may, swedish, match, complete... 439 0
2000-11-28 2000-11-28 1067252020_001128_04.txt _04 [date, nov, intec, telecom, sys, itl, doc, pla... 158 0
2000-12-18 2000-12-18 1963867040_001218_04.txt _04 [associated, press, apw, date, dec, volvo, div... 367 0
2000-12-19 2000-12-19 1065767020_001219_04.txt _04 [date, dec, spirent, plc, spt, acquisition, co... 414 0
2000-12-21 2000-12-21 1076829040_001221_04.txt _04 [bloomberg, news, bn, date, dec, eni, ceo, cfo... 271 0
2001-02-06 2001-02-06 1084749020_010206_04.txt _04 [date, feb, chemring, group, plc, chg, acquisi... 130 0
2001-02-15 2001-02-15 1063497040_010215_04.txt _04 [date, feb, electrolux, ab, elxb, acquisition,... 420 0


And a description of the dataframe:



<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 557 entries, 2000-01-27 to 2017-10-06
Data columns (total 13 columns):
Datum 557 non-null datetime64[ns]
File 557 non-null object
File_type 557 non-null object
Text 557 non-null object
customers 557 non-null int64
grwoth 557 non-null int64
human 557 non-null int64
intagibles 557 non-null int64
length 557 non-null int64
synergies 557 non-null int64
technology 557 non-null int64
the 557 non-null int64
len_cleaned_text 557 non-null int64
dtypes: datetime64[ns](1), int64(9), object(3)
memory usage: 60.9+ KB


Thanks in advance










  • Could you provide a sample DataFrame and an MCVE?

    – Scotty1-
    Nov 21 '18 at 9:52











  • medium.com/@cristhianboujon/… if you want to try

    – iamklaus
    Nov 21 '18 at 9:59
















python python-3.x pandas






asked Nov 21 '18 at 9:51, edited Nov 21 '18 at 10:23 – user10395806













3 Answers






































Ok, I got it. Your df['Text'] consists of lists of texts. So you can do this:



full_list = []                  # list containing all words of all texts
for elmnt in df['Text']:        # loop over the lists in df
    full_list += elmnt          # append elements of the lists to the full list

val_counts = pd.Series(full_list).value_counts()  # make a temporary Series to count


This solution avoids using too many list comprehensions and thus keeps the code easy to read and understand. Furthermore, no additional modules such as re or collections are needed.
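The loop can also be collapsed into a single flattening comprehension; a compact equivalent (same assumption that every Text cell is a list of tokens; toy data shown):

```python
import pandas as pd

df = pd.DataFrame({'Text': [['date', 'jan'], ['date', 'feb']]})

# Flatten all token lists into one Series, then count
val_counts = pd.Series([w for tokens in df['Text'] for w in tokens]).value_counts()
print(val_counts.head(500))  # the 500 most common words
```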






– answered by Scotty1-, Nov 21 '18 at 9:58 (edited Nov 21 '18 at 10:26)


























  • tried value counts as well... it counted each text as one

    – user10395806
    Nov 21 '18 at 10:03











  • It works with the sample dataframe you provided. Could you please provide a sample dataframe which reproduces the first 10 rows of your dataframe? Including the code snippet needed to construct the dataframe.

    – Scotty1-
    Nov 21 '18 at 10:08

































Here is my version where I convert the column values into a list, then I make a list of words, clean it, and you have your counter:



import ast
import collections

your_text_list = df['Text'].tolist()
your_text_list_nan_rm = [x for x in your_text_list if str(x) != 'nan']
flat_list = [inner for item in your_text_list_nan_rm for inner in ast.literal_eval(item)]

counter = collections.Counter(flat_list)
top_words = counter.most_common(100)
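Note that `ast.literal_eval` only applies when each cell is the *string* representation of a list (e.g. after the frame has been saved to and reloaded from CSV); with cells that are already lists it would fail. A sketch of the stringified case:

```python
import ast
import collections

# Cells as they would look after a CSV round-trip of a token-list column
cells = ["['date', 'jan']", "['date', 'feb']"]
flat = [w for c in cells for w in ast.literal_eval(c)]
print(collections.Counter(flat).most_common(1))  # [('date', 2)]
```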





– answered by RoyaumeIX, Nov 21 '18 at 10:05 (edited Nov 21 '18 at 10:16)


























  • error: 'list' object has no attribute 'split'

    – user10395806
    Nov 21 '18 at 10:08

































You can do it via the apply and Counter.update methods:



from collections import Counter

counter = Counter()
df = pd.DataFrame({'Text': values})
_ = df['Text'].apply(lambda x: counter.update(x))

counter.most_common(10)
Out:

[('Amy', 3), ('was', 3), ('hated', 2),
('Kamal', 2), ('her', 2), ('and', 2),
('she', 2), ('She', 2), ('sent', 2), ('text', 2)]


Where df['Text'] is:



0    [Amy, normally, hated, Monday, mornings, but, ...
1 [Kamal, was, in, her, art, class, and, she, li...
2 [She, was, waiting, outside, the, classroom, w...
3 [Hi, Amy, Your, mum, sent, me, a, text]
4 [You, forgot, your, inhaler]
5 [Why, don’t, you, turn, your, phone, on, Amy, ...
6 [She, never, sent, text, messages, and, she, h...
Name: Text, dtype: object
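An equivalent without the side-effecting lambda is to build one Counter per row and sum them (a sketch on toy data; noticeably slower on many rows than a single update loop, since each addition copies the accumulated Counter):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'Text': [['amy', 'was'], ['amy', 'sent']]})

# Sum per-row Counters into one overall Counter
total = sum((Counter(tokens) for tokens in df['Text']), Counter())
print(total.most_common(1))  # [('amy', 2)]
```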





– answered by Mikhail Stepanov, Nov 21 '18 at 11:51






















