How to count 500 most common words in pandas dataframe
I have a dataframe with 500 texts in a column called Text (one text per row), and I want to count the most common words across all texts.
So far I have tried (both methods from Stack Overflow):
pd.Series(' '.join(df['Text']).lower().split()).value_counts()[:100]
and
Counter(" ".join(df["Text"]).split()).most_common(100)
Both gave me the following error:
TypeError: sequence item 0: expected str instance, list found
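A minimal example (with made-up token lists in place of my real data) that reproduces the error:

```python
import pandas as pd

# hypothetical stand-in for my dataframe: the Text column holds token lists
df = pd.DataFrame({'Text': [['business', 'date', 'jan'], ['group', 'date', 'feb']]})

try:
    ' '.join(df['Text'])
except TypeError as err:
    print(err)  # sequence item 0: expected str instance, list found
```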
I have also tried the Counter method directly with
df.Text.apply(Counter())
which gave me the word count within each text, and I altered it so it returned the most common words per text. But I want the overall most common words.
Here is a sample of the dataframe (the texts are already lowercased, cleaned of punctuation, tokenized, and stripped of stop words):
Datum File File_type Text length len_cleaned_text
Datum
2000-01-27 2000-01-27 _04.txt _04 [business, date, jan, heineken, starts, integr... 396 220
Edit: code to reproduce it:
for file in file_list:
    name = file[len(input_path):]
    date = name[11:17]
    type_1 = name[17:20]
    with open(file, "r", encoding="utf-8", errors="surrogateescape") as rfile:
        text = rfile.read()
        text = text.encode('utf-8', 'ignore')
        text = text.decode('utf-8', 'ignore')
    a = {"File": name, "Text": text, 'the': count_the, 'Datum': date, 'File_type': type_1, 'length': length}
    result_list.append(a)
New cell:
df['Text'] = df['Text'].str.lower()
p = re.compile(r'[^\w\s]+')
d = re.compile(r'\d+')
for index, row in df.iterrows():
    df['Text'] = df['Text'].str.replace('\n', ' ')
    df['Text'] = df['Text'].str.replace('################################ end of story 1 ##############################', '')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]
df['Text'] = [d.sub('', x) for x in df['Text'].tolist()]
df['Text'] = df['Text'].apply(word_tokenize)
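For reference, here is what the two patterns do to a single raw string (the sample string is invented):

```python
import re

p = re.compile(r'[^\w\s]+')  # strip punctuation (keeps word chars and whitespace)
d = re.compile(r'\d+')       # strip digits

raw = "Business date: Jan 27, 2000.\nHeineken starts integration!"
text = raw.lower().replace('\n', ' ')
text = p.sub('', text)
text = d.sub('', text)
print(text.split())  # ['business', 'date', 'jan', 'heineken', 'starts', 'integration']
```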
Datum File File_type Text length the
Datum
2000-01-27 2000-01-27 0864820040_000127_04.txt _04 [business, date, jan, heineken, starts, integr... 396 0
2000-02-01 2000-02-01 0910068040_000201_04.txt _04 [group, english, cns, date, feb, bat, acquisit... 305 0
2000-05-03 2000-05-03 1070448040_000503_04.txt _04 [date, may, cobham, plc, cob, acquisitionsdisp... 701 0
2000-05-11 2000-05-11 0865985020_000511_04.txt _04 [business, date, may, swedish, match, complete... 439 0
2000-11-28 2000-11-28 1067252020_001128_04.txt _04 [date, nov, intec, telecom, sys, itl, doc, pla... 158 0
2000-12-18 2000-12-18 1963867040_001218_04.txt _04 [associated, press, apw, date, dec, volvo, div... 367 0
2000-12-19 2000-12-19 1065767020_001219_04.txt _04 [date, dec, spirent, plc, spt, acquisition, co... 414 0
2000-12-21 2000-12-21 1076829040_001221_04.txt _04 [bloomberg, news, bn, date, dec, eni, ceo, cfo... 271 0
2001-02-06 2001-02-06 1084749020_010206_04.txt _04 [date, feb, chemring, group, plc, chg, acquisi... 130 0
2001-02-15 2001-02-15 1063497040_010215_04.txt _04 [date, feb, electrolux, ab, elxb, acquisition,... 420 0
And a description of the dataframe:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 557 entries, 2000-01-27 to 2017-10-06
Data columns (total 13 columns):
Datum 557 non-null datetime64[ns]
File 557 non-null object
File_type 557 non-null object
Text 557 non-null object
customers 557 non-null int64
grwoth 557 non-null int64
human 557 non-null int64
intagibles 557 non-null int64
length 557 non-null int64
synergies 557 non-null int64
technology 557 non-null int64
the 557 non-null int64
len_cleaned_text 557 non-null int64
dtypes: datetime64[ns](1), int64(9), object(3)
memory usage: 60.9+ KB
Thanks in advance
python python-3.x pandas
Could you provide a sample Dataframe and a mcve? – Scotty1- Nov 21 '18 at 9:52
medium.com/@cristhianboujon/… if you want to try – iamklaus Nov 21 '18 at 9:59
edited Nov 21 '18 at 10:23
asked Nov 21 '18 at 9:51 by user10395806
3 Answers
Ok, I got it. Your df['Text'] consists of lists of tokens. So you can do this:
full_list = []                   # list containing all words of all texts
for elmnt in df['Text']:         # loop over the lists in df
    full_list += elmnt           # append the elements of each list to the full list
val_counts = pd.Series(full_list).value_counts()  # make a temporary Series to count
This solution avoids using too many list comprehensions and thus keeps the code easy to read and understand. Furthermore, no additional modules like re or collections are needed.
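For example, on a small invented frame:

```python
import pandas as pd

# two hypothetical token lists standing in for the real Text column
df = pd.DataFrame({'Text': [['date', 'jan', 'heineken'], ['date', 'feb', 'bat']]})

full_list = []
for elmnt in df['Text']:
    full_list += elmnt

val_counts = pd.Series(full_list).value_counts()
print(val_counts.idxmax(), val_counts.max())  # date 2
```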
tried value counts as well... it counted each text as one – user10395806 Nov 21 '18 at 10:03
It works with the sample dataframe you provided. Could you please provide a sample dataframe which reproduces the first 10 rows of your dataframe, including the code snippet needed to construct it? – Scotty1- Nov 21 '18 at 10:08
Here is my version: I convert the column values into a list, then build a flat list of words, clean it, and count it:
import ast
import collections

your_text_list = df['Text'].tolist()
your_text_list_nan_rm = [x for x in your_text_list if str(x) != 'nan']  # drop NaN rows
flat_list = [inner for item in your_text_list_nan_rm for inner in ast.literal_eval(item)]
counter = collections.Counter(flat_list)
top_words = counter.most_common(100)
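Note that ast.literal_eval assumes each cell is the string representation of a list (e.g. after a CSV round-trip); if the cells are already Python lists, that step can be skipped. A small sketch under the string-cell assumption:

```python
import ast
import collections

# hypothetical cells as they would look after saving/reloading via CSV
cells = ["['date', 'jan']", "['date', 'feb']"]

flat_list = [word for cell in cells for word in ast.literal_eval(cell)]
counter = collections.Counter(flat_list)
print(counter.most_common(1))  # [('date', 2)]
```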
error: 'list' object has no attribute 'split' – user10395806 Nov 21 '18 at 10:08
You can do it via the apply and Counter.update methods:
from collections import Counter

counter = Counter()
df = pd.DataFrame({'Text': values})   # 'values' holds the sample token lists shown below
_ = df['Text'].apply(lambda x: counter.update(x))
counter.most_common(10)
Out:
[('Amy', 3), ('was', 3), ('hated', 2),
('Kamal', 2), ('her', 2), ('and', 2),
('she', 2), ('She', 2), ('sent', 2), ('text', 2)]
Where df['Text']
is:
0 [Amy, normally, hated, Monday, mornings, but, ...
1 [Kamal, was, in, her, art, class, and, she, li...
2 [She, was, waiting, outside, the, classroom, w...
3 [Hi, Amy, Your, mum, sent, me, a, text]
4 [You, forgot, your, inhaler]
5 [Why, don’t, you, turn, your, phone, on, Amy, ...
6 [She, never, sent, text, messages, and, she, h...
Name: Text, dtype: object
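A self-contained run of the same approach, with a couple of invented token lists standing in for values:

```python
from collections import Counter

import pandas as pd

values = [['Amy', 'hated', 'Monday'], ['Amy', 'sent', 'a', 'text']]
counter = Counter()
df = pd.DataFrame({'Text': values})
df['Text'].apply(counter.update)  # each row's token list updates the shared counter
print(counter.most_common(1))  # [('Amy', 2)]
```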
edited Nov 21 '18 at 10:26, answered Nov 21 '18 at 9:58 by Scotty1-
edited Nov 21 '18 at 10:16, answered Nov 21 '18 at 10:05 by RoyaumeIX
answered Nov 21 '18 at 11:51 by Mikhail Stepanov