Python - From list of list of tokens to bag of words
I am struggling with computing a bag of words. I have a pandas DataFrame with a textual column that I tokenize, strip of stop words, and stem. In the end, for each document, I have a list of strings.
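For reference, a minimal sketch of the preprocessing described above (my own illustration with hypothetical column names, assuming NLTK's word_tokenize, the English stop-word list, and the Porter stemmer; the actual steps may differ):

import pandas as pd
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize    # requires nltk.download('punkt')

df = pd.DataFrame({'text': ['Hello world!', 'Hello Stack Overflow, hello.']})

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    tokens = word_tokenize(text.lower())
    # Keep alphabetic, non-stop-word tokens and stem them
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

df['tokens'] = df['text'].apply(preprocess)   # each row is now a list of strings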
My ultimate goal is to compute a bag of words for this column. I've seen that scikit-learn has a function to do that, but it works on strings, not on lists of strings. I am doing the preprocessing myself with NLTK and would like to keep it that way.
Is there a way to compute a bag of words based on a list of lists of tokens? E.g., something like this:
["hello", "world"]
["hello", "stackoverflow", "hello"]
should be converted into
[1, 1, 0]
[2, 0, 1]
with vocabulary:
["hello", "world", "stackoverflow"]
python pandas scikit-learn nlp nltk
asked Jan 27 '18 at 9:34 by Florian, edited Jan 27 '18 at 10:25 by Vasilis G.
Have you found the solution yet?
– Mr. Wizard, Jun 10 '18 at 0:58
3 Answers
You can create a DataFrame by filtering each token list with Counter and then converting the result to lists:
import pandas as pd
from collections import Counter

df = pd.DataFrame({'text': [["hello", "world"],
                            ["hello", "stackoverflow", "hello"]]})

L = ["hello", "world", "stackoverflow"]

# Count only the tokens that are in the vocabulary L
f = lambda x: Counter([y for y in x if y in L])

df['new'] = (pd.DataFrame(df['text'].apply(f).values.tolist())
               .fillna(0)
               .astype(int)
               .reindex(columns=L)
               .values
               .tolist())
print(df)
                            text        new
0                 [hello, world]  [1, 1, 0]
1  [hello, stackoverflow, hello]  [2, 0, 1]

answered Jan 27 '18 at 9:45 by jezrael
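If you would rather not hardcode the vocabulary, one option (my own addition, not part of the original answer) is to derive L from the token lists themselves:

# Hypothetical extension: build the vocabulary from the data,
# preserving first-seen order (dicts keep insertion order in Python 3.7+)
L = list(dict.fromkeys(tok for doc in df['text'] for tok in doc))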
sklearn.feature_extraction.text.CountVectorizer can help a lot. Here's the example from the official documentation:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
X.toarray()
# array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
#        [0, 1, 0, 1, 0, 2, 1, 0, 1],
#        [1, 0, 0, 0, 1, 0, 1, 1, 0],
#        [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
You can get the feature names with the method vectorizer.get_feature_names().

answered Jan 27 '18 at 10:05 by Zhangjian
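Since the question starts from lists of tokens rather than raw strings, a minimal sketch (my own addition, not from this answer) is to give CountVectorizer an identity analyzer so it consumes the pre-tokenized lists directly:

from sklearn.feature_extraction.text import CountVectorizer

docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]

# A callable analyzer bypasses CountVectorizer's own preprocessing
# and tokenization, so each document is used as-is
vectorizer = CountVectorizer(analyzer=lambda doc: doc)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names())  # ['hello', 'stackoverflow', 'world']
print(X.toarray())                     # [[1 0 1]
                                       #  [2 1 0]]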
Using sklearn.feature_extraction.text.CountVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'text': [['hello', 'world'],
                            ['hello', 'stackoverflow', 'hello']]})

# Join the tokens back into a single string, as required by CountVectorizer
df['text'] = df['text'].apply(lambda x: ' '.join(x))

vectorizer = CountVectorizer(lowercase=False)
x = vectorizer.fit_transform(df['text'].values)

print(vectorizer.get_feature_names())
print(x.toarray())
Output:
['hello', 'stackoverflow', 'world']
[[1 0 1]
 [2 1 0]]

answered Dec 31 '18 at 19:28 by Ryan Suarez, edited Jan 1 at 16:39
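If re-joining and re-splitting the tokens feels wasteful, a related sketch (my own addition, not part of this answer) feeds per-document Counters to scikit-learn's DictVectorizer instead:

from collections import Counter
from sklearn.feature_extraction import DictVectorizer

docs = [['hello', 'world'], ['hello', 'stackoverflow', 'hello']]

# Each Counter is a {token: count} mapping, which DictVectorizer consumes directly
v = DictVectorizer(sparse=False, dtype=int)
X = v.fit_transform(Counter(doc) for doc in docs)

print(v.get_feature_names())  # ['hello', 'stackoverflow', 'world']
print(X)                      # [[1 0 1]
                              #  [2 1 0]]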