Python - From list of list of tokens to bag of words
I am struggling with computing bag of words. I have a pandas dataframe with a textual column that I tokenize, remove stop words from, and stem. In the end, for each document, I have a list of strings.

My ultimate goal is to compute a bag of words for this column. I've seen that scikit-learn has a function to do that, but it works on strings, not on lists of strings.

I am doing the preprocessing myself with NLTK and would like to keep it that way...

Is there a way to compute bag of words based on a list of lists of tokens? E.g., something like this:

["hello", "world"]
["hello", "stackoverflow", "hello"]

should be converted into

[1, 1, 0]
[2, 0, 1]

with vocabulary:

["hello", "world", "stackoverflow"]
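For concreteness, the transform described above can be sketched in plain Python with collections.Counter (a minimal sketch, not from the original post; `docs`, `vocab`, and `bow` are illustrative names):

```python
from collections import Counter

docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]

# Build the vocabulary in first-seen order
vocab = []
for doc in docs:
    for tok in doc:
        if tok not in vocab:
            vocab.append(tok)

# One count vector per document, aligned to the vocabulary
counts = [Counter(doc) for doc in docs]
bow = [[c[w] for w in vocab] for c in counts]

print(vocab)  # ['hello', 'world', 'stackoverflow']
print(bow)    # [[1, 1, 0], [2, 0, 1]]
```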
  • Have you found the solution yet?

    – Mr. Wizard
    Jun 10 '18 at 0:58
python pandas scikit-learn nlp nltk
edited Jan 27 '18 at 10:25 – Vasilis G.
asked Jan 27 '18 at 9:34 – Florian
3 Answers
You can create a DataFrame by filtering each token list with Counter and then convert the result back to lists:

from collections import Counter
import pandas as pd

df = pd.DataFrame({'text': [["hello", "world"],
                            ["hello", "stackoverflow", "hello"]]})

L = ["hello", "world", "stackoverflow"]

f = lambda x: Counter([y for y in x if y in L])
df['new'] = (pd.DataFrame(df['text'].apply(f).values.tolist())
               .fillna(0)
               .astype(int)
               .reindex(columns=L)
               .values
               .tolist())
print(df)

                            text        new
0                 [hello, world]  [1, 1, 0]
1  [hello, stackoverflow, hello]  [2, 0, 1]
answered Jan 27 '18 at 9:45 – jezrael
sklearn.feature_extraction.text.CountVectorizer can help a lot. Here's the example from the official documentation:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

You can get the feature names with the method vectorizer.get_feature_names().
answered Jan 27 '18 at 10:05 – Zhangjian
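As a side note (not part of the original answer): CountVectorizer can also consume pre-tokenized documents directly if you pass a callable as its analyzer, which bypasses the built-in preprocessing and tokenization. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]

# A callable analyzer receives each raw document and returns its features,
# so already-tokenized lists can be passed through unchanged
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # ['hello', 'stackoverflow', 'world']
print(X.toarray())                     # [[1 0 1]
                                       #  [2 1 0]]
```

This avoids the join-then-retokenize round trip, which matters if tokens can contain spaces.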
Using sklearn.feature_extraction.text.CountVectorizer:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'text': [['hello', 'world'],
                            ['hello', 'stackoverflow', 'hello']]})

# Join tokens into a single string, as required by CountVectorizer
df['text'] = df['text'].apply(lambda x: ' '.join(x))

vectorizer = CountVectorizer(lowercase=False)
x = vectorizer.fit_transform(df['text'].values)

print(vectorizer.get_feature_names())
print(x.toarray())

Output:

['hello', 'stackoverflow', 'world']
[[1 0 1]
 [2 1 0]]
answered Dec 31 '18 at 19:28, edited Jan 1 at 16:39 – Ryan Suarez
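A pandas-only alternative (not from the original answers, and assuming pandas >= 0.25 for Series.explode): explode the token lists and cross-tabulate tokens against the document index.

```python
import pandas as pd

df = pd.DataFrame({'text': [['hello', 'world'],
                            ['hello', 'stackoverflow', 'hello']]})

# explode() yields one row per (document index, token) pair;
# crosstab then counts each token per document
s = df['text'].explode()
bow = pd.crosstab(s.index, s)
print(bow)
```

The columns come out in sorted order (hello, stackoverflow, world), so the two documents map to [1, 0, 1] and [2, 1, 0].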