Python - From list of list of tokens to bag of words
I am struggling with computing bag of words. I have a pandas dataframe with a textual column that I tokenize, remove stop words from, and stem. In the end, for each document, I have a list of strings.

My ultimate goal is to compute a bag of words for this column. I've seen that scikit-learn has a function to do that, but it works on strings, not on lists of strings.

I am doing the preprocessing myself with NLTK and would like to keep it that way...

Is there a way to compute bag of words based on a list of lists of tokens? E.g., something like this:

["hello", "world"]
["hello", "stackoverflow", "hello"]

should be converted into

[1, 1, 0]
[2, 0, 1]

with vocabulary:

["hello", "world", "stackoverflow"]
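For concreteness, the transform described above can be sketched in plain Python with collections.Counter (a minimal sketch, not from the original post; `docs`, `vocab`, and `bow` are illustrative names):

```python
from collections import Counter

docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]

# Build the vocabulary in first-seen order
vocab = []
for doc in docs:
    for tok in doc:
        if tok not in vocab:
            vocab.append(tok)

# One count vector per document, aligned to the vocabulary
counts = [Counter(doc) for doc in docs]
bow = [[c[w] for w in vocab] for c in counts]

print(vocab)  # ['hello', 'world', 'stackoverflow']
print(bow)    # [[1, 1, 0], [2, 0, 1]]
```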
  • Have you found the solution yet?

    – Mr. Wizard
    Jun 10 '18 at 0:58
python pandas scikit-learn nlp nltk
edited Jan 27 '18 at 10:25 – Vasilis G.
asked Jan 27 '18 at 9:34 – Florian
3 Answers
You can create a DataFrame by filtering each token list with Counter and then convert the result back to lists:

from collections import Counter
import pandas as pd

df = pd.DataFrame({'text': [["hello", "world"],
                            ["hello", "stackoverflow", "hello"]]})

L = ["hello", "world", "stackoverflow"]

f = lambda x: Counter([y for y in x if y in L])
df['new'] = (pd.DataFrame(df['text'].apply(f).values.tolist())
               .fillna(0)
               .astype(int)
               .reindex(columns=L)
               .values
               .tolist())
print(df)

                            text        new
0                 [hello, world]  [1, 1, 0]
1  [hello, stackoverflow, hello]  [2, 0, 1]
answered Jan 27 '18 at 9:45 – jezrael
sklearn.feature_extraction.text.CountVectorizer can help a lot. Here's the example from the official documentation:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

You can get the feature names with the method vectorizer.get_feature_names().
answered Jan 27 '18 at 10:05 – Zhangjian
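As a side note (not part of the original answer): CountVectorizer can also consume pre-tokenized documents directly if you pass a callable as its analyzer, which bypasses the built-in preprocessing and tokenization. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]

# A callable analyzer receives each raw document and returns its features,
# so already-tokenized lists can be passed through unchanged
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # ['hello', 'stackoverflow', 'world']
print(X.toarray())                     # [[1 0 1]
                                       #  [2 1 0]]
```

This avoids the join-then-retokenize round trip, which matters if tokens can contain spaces.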
Using sklearn.feature_extraction.text.CountVectorizer:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'text': [['hello', 'world'],
                            ['hello', 'stackoverflow', 'hello']]})

# Join tokens into a single string, as required by CountVectorizer
df['text'] = df['text'].apply(lambda x: ' '.join(x))

vectorizer = CountVectorizer(lowercase=False)
x = vectorizer.fit_transform(df['text'].values)

print(vectorizer.get_feature_names())
print(x.toarray())

Output:

['hello', 'stackoverflow', 'world']
[[1 0 1]
 [2 1 0]]
answered Dec 31 '18 at 19:28, edited Jan 1 at 16:39 – Ryan Suarez
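A pandas-only alternative (not from the original answers, and assuming pandas >= 0.25 for Series.explode): explode the token lists and cross-tabulate tokens against the document index.

```python
import pandas as pd

df = pd.DataFrame({'text': [['hello', 'world'],
                            ['hello', 'stackoverflow', 'hello']]})

# explode() yields one row per (document index, token) pair;
# crosstab then counts each token per document
s = df['text'].explode()
bow = pd.crosstab(s.index, s)
print(bow)
```

The columns come out in sorted order (hello, stackoverflow, world), so the two documents map to [1, 0, 1] and [2, 1, 0].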