Spark: Best way to build RDD line by line
I have an iterative function that generates some data, and I want the output in an RDD. What is the best way to build it? I can come up with two solutions:
1)
var results = sparkSession.sparkContext.emptyRDD[String]
for (value <- values) {
  results = results.union(sparkSession.sparkContext.parallelize[String](Seq(performSomething(value))))
}
2)
var results = Seq.empty[String]
for (value <- values) {
  results = results :+ performSomething(value)
}
sparkSession.sparkContext.parallelize[String](results)
I guess the first approach will be slower but will probably reduce memory consumption on the driver, while the second approach will be faster, but all the data will sit on the driver before parallelizing. Am I correct?
Is there a 3rd, better approach?
Thanks!
scala apache-spark rdd
asked Nov 19 '18 at 17:31 by MitakaJ9 · edited Nov 19 '18 at 17:33 by cricket_007
parallelize would be the way to go, but why not just parallelize values, then map(performSomething) over that?
– cricket_007, Nov 19 '18 at 17:34
RDDs aren't intended to be built iteratively, and since values fits in driver memory, why not just parallelize them? Also note that since Spark is lazy and only builds a DAG of your computation until it is run, the first method will consume roughly the same memory as the second (unless performSomething dramatically increases the size of your data). Thus, the best solution would be val results = sc.parallelize(values).map(performSomething), as @cricket_007 pointed out.
– Luis Miguel Mejía Suárez, Nov 19 '18 at 17:59
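[Editor's note] A minimal, runnable sketch of the approach recommended in this comment; the input collection and the body of performSomething below are placeholder assumptions, not the asker's actual code:

import org.apache.spark.sql.SparkSession

object BuildRddSketch {
  // Placeholder for the asker's function: assumed to map one input String
  // to one (possibly much larger) output String.
  def performSomething(value: String): String = value.toUpperCase

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("build-rdd-sketch")
      .master("local[*]")
      .getOrCreate()

    val values = Seq("a", "b", "c") // stand-in for the asker's input collection

    // Ship the small input collection to the executors once; performSomething
    // then runs in parallel on the workers, and nothing is materialized back
    // on the driver until an action such as collect() or saveAsTextFile() runs.
    val results = spark.sparkContext.parallelize(values).map(performSomething)

    results.collect().foreach(println)
    spark.stop()
  }
}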
Thanks for the input. Yes, performSomething greatly increases the size of each input. I am not sure whether performSomething will be Serializable so that an RDD can be made out of the values; otherwise this would be the best bet. Anyway, I will test tomorrow. From all this, it seems the best bet is either choice 2) or directly making values an RDD.
– MitakaJ9, Nov 19 '18 at 21:00
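[Editor's note] Regarding the serializability worry in this comment: a common pattern is to keep the function in a standalone top-level object, so Spark ships only a lightweight function reference to the executors rather than a whole enclosing class instance. A sketch under that assumption, reusing spark and values from the previous sketch (the function body is hypothetical):

object Transformations extends Serializable {
  // Hypothetical stand-in for the asker's performSomething.
  def performSomething(value: String): String = value.reverse
}

// Methods on a top-level object don't drag a non-serializable outer
// instance into the task closure, which avoids Task-not-serializable errors.
val results = spark.sparkContext
  .parallelize(values)
  .map(Transformations.performSomething)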