Update machine learning models in DataFrame-based MLlib (PySpark 2.2.0)


























I have built a machine learning model based on clustering, and now I want to update it periodically (on a daily basis) with new data. I am using PySpark MLlib and cannot find any method in Spark for this.

Note that the required method, partial_fit, is available in scikit-learn, but not in Spark.

I am not in favor of appending the new data and rebuilding the model every day, as this will grow the data size and be computationally expensive.

Please suggest an effective way to update the model, or to do online learning, with Spark MLlib.
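For context, scikit-learn's partial_fit (e.g. on MiniBatchKMeans) updates cluster centroids incrementally instead of refitting from scratch. Below is a toy, pure-Python sketch of that idea (1-D points, per-centroid counts, step size 1/count) — an illustration only, not the scikit-learn implementation and not Spark code:

```python
# Toy sketch of the incremental centroid update behind a
# partial_fit-style API (1-D, pure Python). Illustrative only.

def assign(x, centroids):
    """Index of the closest centroid to point x."""
    return min(range(len(centroids)), key=lambda i: (x - centroids[i]) ** 2)

def partial_fit(centroids, counts, batch):
    """Update centroids in place with one mini-batch of points."""
    for x in batch:
        i = assign(x, centroids)
        counts[i] += 1
        # Move the centroid toward x with a decaying step size 1/count.
        centroids[i] += (x - centroids[i]) / counts[i]
    return centroids, counts

centroids, counts = [0.0, 10.0], [1, 1]
partial_fit(centroids, counts, [0.2, 0.4, 9.8])   # day 1's new data
partial_fit(centroids, counts, [0.1, 10.2])       # day 2: no rebuild needed
```

Each daily batch only touches the centroids and counts, which is exactly the property the DataFrame-based Spark ML estimators lack.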






























  • In the general case you cannot. Some models (especially in the old API) have methods that enable such a process, but this is the exception, not the rule, and it applies only to a small subset of iterative algorithms. There are also a few legacy streaming implementations (regression models, k-means).
    – user6910411
    Nov 19 '18 at 12:34












  • See StreamingLinearAlgorithm, StreamingKMeans, and parameters like initialWeights in LinearRegressionWithSGD.run.
    – user6910411
    Nov 19 '18 at 12:43












  • @user6910411 Thanks for the comments. Could you please suggest how models are updated in industry (particularly with online learning) when dealing with massive data?
    – bioinformatician
    Nov 22 '18 at 8:41








  • I concur with @user6910411: this is not possible with Apache Spark. And for the record, scikit-learn and other machine learning libraries can scale given the right amount of resources; you don't always need Spark.
    – eliasah
    Dec 11 '18 at 14:54
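The legacy StreamingKMeans mentioned in the comments (pyspark.mllib.clustering) keeps a weight per centroid and a decay factor so stale data is gradually forgotten. A simplified pure-Python sketch of that decayed update for a single centroid — the exact weight bookkeeping in Spark differs slightly, and all names here are mine:

```python
# Simplified sketch of the per-centroid update behind Spark's
# StreamingKMeansModel, applied to one batch of 1-D points already
# assigned to this centroid. Illustrative, not the Spark source.

def streaming_kmeans_update(centroid, weight, batch, decay=0.9):
    """Blend the old centroid with the batch contribution, discounting
    the old weight by `decay` so older data is forgotten over time."""
    m = len(batch)                      # points assigned this batch
    if m == 0:
        return centroid, weight * decay
    batch_sum = sum(batch)
    new_weight = weight * decay + m
    new_centroid = (centroid * weight * decay + batch_sum) / new_weight
    return new_centroid, new_weight

c, w = 5.0, 10.0
c, w = streaming_kmeans_update(c, w, [6.0, 6.0], decay=0.5)
# old contribution: 5*10*0.5 = 25; batch sum: 12; new weight: 10*0.5+2 = 7
# new centroid: 37/7 ≈ 5.2857
```

With decay=1.0 this degenerates to a plain running mean; with decay near 0 the centroid tracks only the most recent batches.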





































































































































apache-spark machine-learning pyspark cluster-analysis apache-spark-ml














































asked Nov 19 '18 at 12:05 – bioinformatician
edited Dec 13 '18 at 16:39 – desertnaut










































1 Answer
































You cannot update arbitrary models.

For a few select models this works. For some others it works if you accept a loss in accuracy. But for the rest, the only way is to rebuild the model completely.

Take support vector machines, for example. The model stores only the support vectors; to update it, you would also need all the non-support vectors in order to find the new optimal model.

That is why it is fairly common to rebuild models every night, for example.

Streaming is quite overrated, online k-means in particular. On large data each new point has next to zero effect on the centroids, so you may just as well run a batch job every night; the streaming implementations are largely academic toys.
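If you do rebuild every night, you can at least warm-start from the previous model (pyspark.mllib's KMeans.train takes an initialModel parameter for this purpose). A toy, pure-Python sketch of the idea — batch Lloyd iterations seeded with yesterday's centroids, so convergence on the new day's data is fast:

```python
# Toy sketch of a warm-started nightly rebuild (1-D, pure Python):
# seed Lloyd's algorithm with yesterday's centroids instead of a
# random initialization. Illustrative only, not Spark code.

def lloyd(points, centroids, iterations=5):
    """Plain batch k-means, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            i = min(range(len(centroids)),
                    key=lambda j: (x - centroids[j]) ** 2)
            clusters[i].append(x)
        # Update step: move each centroid to its cluster mean.
        for i, pts in enumerate(clusters):
            if pts:
                centroids[i] = sum(pts) / len(pts)
    return centroids

yesterday = [0.1, 10.2]                     # yesterday's model
today = [0.0, 0.2, 0.4, 9.8, 10.0, 10.4]    # today's data
model = lloyd(today, yesterday)             # starts near the optimum
```

Warm-starting does not avoid the batch pass over the data, but it usually cuts the number of iterations sharply compared with re-initializing from scratch.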




















































answered Dec 16 '18 at 20:04 – Anony-Mousse





























