spark - RDD process twice after persist












I made an RDD and created another RDD from the original, like below.



val RDD2 = RDD1.map({
  println("RDD1")
  ....
}).persist(StorageLevel.MEMORY_AND_DISK)

RDD2.foreach({
  println("RDD2")
  ...
})
...so on..


I expected RDD1's processing to run ONLY once, because it is saved in memory or on disk by the persist method.



BUT somehow "RDD1" is printed again after "RDD2" has been printed, like below.



RDD1
RDD1
RDD1
RDD1
RDD2
RDD2
RDD2
RDD2
RDD2
RDD1 -- the RDD1 processing repeats. WHY?
RDD1
RDD1
RDD1
RDD2
RDD2
RDD2
RDD2
RDD2


































  • When the first batch of "RDD1" lines has all been printed, I can guarantee that the RDD1 processing is done. It does the same work twice.

    – DK2
    Nov 22 '18 at 1:28











  • I guess you did some collect action at the end? spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/… also returns an RDD.

    – David S.
    Nov 22 '18 at 1:31











  • @davidshen84 yes, I did collectAsMap() at the end of the code

    – DK2
    Nov 22 '18 at 1:33











  • For both RDD1 and RDD2? Remember the two major concepts in Spark: transformations and actions. A transformation has no effect on the RDD until you run an action.

    – David S.
    Nov 22 '18 at 1:44
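
To illustrate the transformation/action distinction from the last comment, here is a minimal sketch for spark-shell (assuming the shell's predefined sc): the println inside map runs only once an action is invoked.

// Transformation only: Spark records the lineage but runs no job,
// so nothing is printed here.
val mapped = sc.parallelize(1 to 3).map { x =>
  println(s"computing $x")
  x + 1
}

// Action: only now is the map closure actually executed on the data.
mapped.count()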
















1 Answer

This is the expected behaviour of Spark. Like most operations, persist in Spark is also a lazy operation. So even if you add persist to the first RDD, Spark doesn't cache the data unless you run an action after the persist call. The map operation is not an action in Spark; it is also lazy.



The way to force the caching is to add a count action right after the persist on RDD2:



val RDD2 = RDD1.map({
  println("RDD1")
  ....
}).persist(StorageLevel.MEMORY_AND_DISK)

RDD2.count // Forces the caching


Now, if you run any other operation, RDD2 won't be recomputed.
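
For reference, a self-contained version of that fix, sketched for spark-shell (sc predefined; the input data here is made up): because count runs right after persist, later actions should reuse the cached partitions instead of re-running the map.

import org.apache.spark.storage.StorageLevel

val rdd1 = sc.parallelize(1 to 4)          // hypothetical input

val rdd2 = rdd1.map { x =>
  println("RDD1")                          // should appear only while the cache is being built
  x * 2
}.persist(StorageLevel.MEMORY_AND_DISK)

rdd2.count()                               // action right after persist: computes and caches RDD2

rdd2.foreach(_ => println("RDD2"))         // served from the cache, no further "RDD1" lines
val result = rdd2.map(v => (v, v)).collectAsMap() // later actions also reuse the cached data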






























  • Is there any configuration option we can use to force caching, i.e. without calling action methods?

    – mangusta
    Nov 22 '18 at 3:01











  • No, this is by design in Spark. You have to force caching by invoking an action on the RDD.

    – Avishek Bhattacharya
    Nov 22 '18 at 3:05











  • from Spark RDD documentation webpage: "By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it"

    – mangusta
    Nov 25 '18 at 14:37











  • It's a bit confusing: the page says that persist causes the RDD to be cached for future use, while you propose that it is a lazy operation.

    – mangusta
    Nov 25 '18 at 14:38













  • Well, it persists in the cache, but not immediately. It persists only after an action is called on it. For reference, please see the book jaceklaskowski.gitbooks.io/mastering-spark-sql/…. It explicitly mentions that the Spark cache is a lazy operation.

    – Avishek Bhattacharya
    Nov 25 '18 at 14:46
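
One way to observe the laziness discussed in these comments, sketched for spark-shell (sc predefined; getRDDStorageInfo is a developer API, so its exact shape may differ across Spark versions):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

println(rdd.getStorageLevel)          // the storage level is set immediately by persist...
println(sc.getRDDStorageInfo.length)  // ...but no blocks are cached yet, so this is typically 0

rdd.count()                           // first action materializes the cached partitions

sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.numCachedPartitions} cached partitions, ${info.memSize} bytes in memory")
}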










