Spark: Best way to build RDD line by line












I have an iterative function that generates some data, and I want the output in an RDD. What is the best way to build it? I can come up with two solutions:



1)



var results = sparkSession.sparkContext.emptyRDD[String]
for (value <- values) {
  results = results.union(sparkSession.sparkContext.parallelize[String](Seq(performSomething(value))))
}


2)



var results = Seq.empty[String]
for (value <- values) {
  results = results :+ performSomething(value)
}
sparkSession.sparkContext.parallelize[String](results)


I guess the first approach will be slower but will probably reduce memory consumption on the driver, while the second approach will be faster, but all the data will sit on the driver before parallelizing. Am I correct?



Is there a third, better approach?



Thanks!










scala apache-spark rdd

asked Nov 19 '18 at 17:31 by MitakaJ9; edited Nov 19 '18 at 17:33 by cricket_007
  • parallelize would be the way to go, but why not just parallelize values, then map(performSomething) over that?
    – cricket_007, Nov 19 '18 at 17:34

  • RDDs aren't intended to be built iteratively, and since values fits in the driver's memory, why not just parallelize it? Also note that since Spark is lazy and only builds a DAG of your computation until an action runs, the first method will consume roughly the same memory as the second (unless performSomething dramatically increases the size of your data). Thus, the best solution would be val results = sc.parallelize(values).map(performSomething), as @cricket_007 pointed out.
    – Luis Miguel Mejía Suárez, Nov 19 '18 at 17:59

  • Thanks for the input. Yes, performSomething increases the size of each input a lot. I am not sure whether performSomething will be Serializable, which is needed to run it over an RDD of the values; otherwise this would be the best bet. Anyway, I will test tomorrow. From all of this, it seems the best bet is either choice 2) or making values an RDD directly.
    – MitakaJ9, Nov 19 '18 at 21:00
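
For reference, here is a minimal, self-contained sketch of the approach suggested in the comments above. The values collection and the performSomething function are stand-ins for the asker's actual data and logic (assumptions, not from the original post); note that the function is shipped to the executors, so it must be serializable:

import org.apache.spark.sql.SparkSession

object BuildRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("build-rdd-sketch")
      .master("local[*]")  // local mode for a quick test
      .getOrCreate()

    // Hypothetical stand-ins: replace with the real driver-side input
    // and the real transformation. Spark serializes the function and
    // ships it to the executors.
    val values: Seq[String] = Seq("a", "b", "c")
    def performSomething(v: String): String = v.toUpperCase

    // Parallelize the small driver-side collection once, then run the
    // transformation on the executors rather than on the driver.
    val results = spark.sparkContext
      .parallelize(values)
      .map(performSomething)

    results.collect().foreach(println)
    spark.stop()
  }
}

Compared to option 1, this creates a single RDD instead of a long chain of union calls; compared to option 2, the work of performSomething runs on the executors instead of accumulating on the driver.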















