How can I use SVM for text classification without using TF-IDF?



























I want to classify every incoming message. I work with Persian texts. I have already implemented a text classifier with Naive Bayes. I did not use TF-IDF because every single feature is important to me, but I did use some tricks to delete stop words and punctuation to get better accuracy.



I want to implement a text classifier with SVM, but after a lot of searching, everything I found uses a Pipeline with TF-IDF, like below:



model = Pipeline([('vectorizer', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced")))])


Now, how can I use SVM without TF-IDF?



thanks



































  • Could you provide more detail about the model you are trying to build? What are your features? Words? Are you using a bag of words of the message as your data?

    – thebeancounter
    Jan 1 at 15:05











  • My model includes the body (the text of the messages) and a label. I have 6 labels. Yes, my data is made of words that form sentences. @thebeancounter

    – hadi javanmard
    Jan 1 at 17:36


















python svm














edited Jan 1 at 17:34









Aaron_ab











asked Jan 1 at 14:34









hadi javanmard




























1 Answer














See here for the sklearn page about SVM; it has a section on multiclass classification using SVM. You first have to convert your texts into feature vectors (numeric, if you wish to use SVM). If you would like to use bag of words, you could use this SO question and this manual page of sklearn.



You can use pre-written Python code to create a BOW from your texts, doing something like the following. Mind you, I gathered the relevant information for the OP myself, since the question was unclear and not compatible with SO standards, so you might need to adapt the code a bit to fit your exact usage.



>>> from sklearn.feature_extraction.text import CountVectorizer

>>> vectorizer = CountVectorizer()
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=...'(?u)\b\w\w+\b',
tokenizer=None, vocabulary=None)


>>> corpus = [
... 'This is the first document.',
... 'This is the second second document.',
... 'And the third one.',
... 'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
with 19 stored elements in Compressed Sparse ... format>
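
To see what the bag-of-words matrix actually holds, here is a minimal sketch that reuses the same corpus and prints the learned vocabulary next to the raw counts for one document. The variable names vocab and dense are only illustrative; the values are plain term counts with no TF-IDF weighting applied.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # scipy sparse matrix of shape (4, 9)

# vocabulary_ maps each token to its column index; sort tokens by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Dense view of the raw term counts (fine for small data, costly for large corpora)
dense = X.toarray()

print(vocab)
print(dense[1])  # counts for the second document; 'second' appears twice
```

Each column is one word and each row is one message, so nothing is rescaled or dropped unless you ask the vectorizer to do so.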


Then you might need to convert X into a dense matrix (this depends on the sklearn version; the SVM estimators in sklearn also accept sparse input).
Then you could feed X into an SVM model, which you can create like so:



>>> from sklearn import svm
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
>>> clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovo', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> clf.decision_function_shape = "ovr"
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes
4
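
Putting the two pieces together, one way to skip TF-IDF is simply to leave the TfidfTransformer step out of the pipeline from the question, so the SVM is trained on the raw counts. The sketch below is only an illustration under assumptions: the texts and labels are made-up toy data, and LinearSVC is used directly (it handles multiclass problems one-vs-rest by default, so the OneVsRestClassifier wrapper from the question can be kept or dropped).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; replace with your own messages and labels.
texts = [
    'cheap pills buy now',
    'buy cheap watches now',
    'meeting at noon tomorrow',
    'project meeting moved to noon',
]
labels = ['spam', 'spam', 'work', 'work']

model = Pipeline([
    # Raw term counts only: there is no TfidfTransformer step here
    ('vectorizer', CountVectorizer()),
    ('clf', LinearSVC(class_weight='balanced')),
])

model.fit(texts, labels)
pred = model.predict(['buy cheap pills', 'meeting tomorrow'])
print(pred)
```

The pipeline passes the sparse count matrix straight to the classifier, so no manual conversion is needed. For Persian text you would probably also want to pass your own stop-word list or tokenizer to CountVectorizer.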





        answered Jan 2 at 10:35









        thebeancounter
































