Update machine learning models in DataFrame-based MLlib (PySpark 2.2.0)


























I have built a machine learning model based on clustering, and now I want to update it periodically (on a daily basis) with new data. I am using PySpark MLlib and cannot find any method in Spark for this.

Note that the required method, partial_fit, is available in scikit-learn, but not in Spark.

I am not in favor of appending the new data and rebuilding the model every day, as this will grow the data size and be computationally expensive.

Please suggest an effective way to update the model, or to do online learning, with Spark MLlib.
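For context, scikit-learn's partial_fit (e.g. on MiniBatchKMeans) updates cluster centroids incrementally instead of refitting from scratch. Below is a toy, pure-Python sketch of that idea (1-D points, per-centroid counts, step size 1/count) — an illustration only, not the scikit-learn implementation and not Spark code:

```python
# Toy sketch of the incremental centroid update behind a
# partial_fit-style API (1-D, pure Python). Illustrative only.

def assign(x, centroids):
    """Index of the closest centroid to point x."""
    return min(range(len(centroids)), key=lambda i: (x - centroids[i]) ** 2)

def partial_fit(centroids, counts, batch):
    """Update centroids in place with one mini-batch of points."""
    for x in batch:
        i = assign(x, centroids)
        counts[i] += 1
        # Move the centroid toward x with a decaying step size 1/count.
        centroids[i] += (x - centroids[i]) / counts[i]
    return centroids, counts

centroids, counts = [0.0, 10.0], [1, 1]
partial_fit(centroids, counts, [0.2, 0.4, 9.8])   # day 1's new data
partial_fit(centroids, counts, [0.1, 10.2])       # day 2: no rebuild needed
```

Each daily batch only touches the centroids and counts, which is exactly the property the DataFrame-based Spark ML estimators lack.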






























  • In the general case you cannot. Some models (especially in the old API) have methods that enable such a process, but this is the exception, not the rule, and it applies only to a small subset of iterative algorithms. There are also a few legacy streaming implementations (regression models, k-means).
    – user6910411
    Nov 19 '18 at 12:34












  • See StreamingLinearAlgorithm, StreamingKMeans, and parameters like initialWeights in LinearRegressionWithSGD.run.
    – user6910411
    Nov 19 '18 at 12:43












  • @user6910411 Thanks for the comments. Could you please suggest how models are updated in industry (particularly with online learning) when dealing with massive data?
    – bioinformatician
    Nov 22 '18 at 8:41








  • I concur with @user6910411: this is not possible with Apache Spark. And for the record, scikit-learn and other machine learning libraries can scale given the right amount of resources; you don't always need Spark.
    – eliasah
    Dec 11 '18 at 14:54
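The legacy StreamingKMeans mentioned in the comments (pyspark.mllib.clustering) keeps a weight per centroid and a decay factor so stale data is gradually forgotten. A simplified pure-Python sketch of that decayed update for a single centroid — the exact weight bookkeeping in Spark differs slightly, and all names here are mine:

```python
# Simplified sketch of the per-centroid update behind Spark's
# StreamingKMeansModel, applied to one batch of 1-D points already
# assigned to this centroid. Illustrative, not the Spark source.

def streaming_kmeans_update(centroid, weight, batch, decay=0.9):
    """Blend the old centroid with the batch contribution, discounting
    the old weight by `decay` so older data is forgotten over time."""
    m = len(batch)                      # points assigned this batch
    if m == 0:
        return centroid, weight * decay
    batch_sum = sum(batch)
    new_weight = weight * decay + m
    new_centroid = (centroid * weight * decay + batch_sum) / new_weight
    return new_centroid, new_weight

c, w = 5.0, 10.0
c, w = streaming_kmeans_update(c, w, [6.0, 6.0], decay=0.5)
# old contribution: 5*10*0.5 = 25; batch sum: 12; new weight: 10*0.5+2 = 7
# new centroid: 37/7 ≈ 5.2857
```

With decay=1.0 this degenerates to a plain running mean; with decay near 0 the centroid tracks only the most recent batches.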





































































































































apache-spark machine-learning pyspark cluster-analysis apache-spark-ml














































asked Nov 19 '18 at 12:05 – bioinformatician
edited Dec 13 '18 at 16:39 – desertnaut










































1 Answer
































You cannot update arbitrary models.

For a few select models this works. For some others it works if you accept a loss in accuracy. But for the rest, the only way is to rebuild the model completely.

Take support vector machines, for example. The model stores only the support vectors; to update it, you would also need all the non-support vectors in order to find the new optimal model.

That is why it is fairly common to rebuild models every night, for example.

Streaming is quite overrated, online k-means in particular. On large data each new point has next to zero effect on the centroids, so you may just as well run a batch job every night; the streaming implementations are largely academic toys.
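If you do rebuild every night, you can at least warm-start from the previous model (pyspark.mllib's KMeans.train takes an initialModel parameter for this purpose). A toy, pure-Python sketch of the idea — batch Lloyd iterations seeded with yesterday's centroids, so convergence on the new day's data is fast:

```python
# Toy sketch of a warm-started nightly rebuild (1-D, pure Python):
# seed Lloyd's algorithm with yesterday's centroids instead of a
# random initialization. Illustrative only, not Spark code.

def lloyd(points, centroids, iterations=5):
    """Plain batch k-means, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            i = min(range(len(centroids)),
                    key=lambda j: (x - centroids[j]) ** 2)
            clusters[i].append(x)
        # Update step: move each centroid to its cluster mean.
        for i, pts in enumerate(clusters):
            if pts:
                centroids[i] = sum(pts) / len(pts)
    return centroids

yesterday = [0.1, 10.2]                     # yesterday's model
today = [0.0, 0.2, 0.4, 9.8, 10.0, 10.4]    # today's data
model = lloyd(today, yesterday)             # starts near the optimum
```

Warm-starting does not avoid the batch pass over the data, but it usually cuts the number of iterations sharply compared with re-initializing from scratch.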




















































answered Dec 16 '18 at 20:04 – Anony-Mousse





























