Spark - RDD processed twice after persist
I made an RDD and created another RDD from the original, like below.
val RDD2 = RDD1.map { x =>
  println("RDD1")
  ....
}.persist(StorageLevel.MEMORY_AND_DISK)

RDD2.foreach { x =>
  println("RDD2")
  ...
}
...and so on...
I expected the RDD1 map function to run ONLY once, because its result is saved to memory or disk by the persist method.
BUT somehow "RDD1" is printed again after "RDD2" has been printed, like below.
RDD1
RDD1
RDD1
RDD1
RDD2
RDD2
RDD2
RDD2
RDD2
RDD1 -- the RDD1 processing repeats. WHY?
RDD1
RDD1
RDD1
RDD2
RDD2
RDD2
RDD2
RDD2
apache-spark
asked Nov 22 '18 at 1:06 by DK2 (edited Nov 22 '18 at 1:16)
when first "RDD1" is printed all, I can guarantee that process with RDD1 is done. It does same work twice.
– DK2
Nov 22 '18 at 1:28
I guess you did some collect action at the end? spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/… also returns an RDD.
– David S.
Nov 22 '18 at 1:31
@davidshen84 yes, I called collectAsMap() at the end of the code
– DK2
Nov 22 '18 at 1:33
For both RDD1 and RDD2? Remember the two major concepts in Spark: transformations and actions. A transformation has no effect on the RDD until you run an action.
– David S.
Nov 22 '18 at 1:44
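The transformation/action distinction is easy to see in isolation. A minimal sketch, assuming a local SparkContext (the app name, master URL, and value names here are illustrative, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

// map is a transformation: nothing runs yet, nothing is printed.
val mapped = sc.parallelize(1 to 3).map { x =>
  println(s"computing $x")  // executes only when an action triggers a job
  x * 2
}

println("before any action")  // printed first: the map above has not run

// collect is an action: it triggers the actual computation,
// and only now do the "computing ..." lines appear.
println(mapped.collect().mkString(", "))

sc.stop()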
1 Answer
This is the expected behaviour of Spark. Like most operations, persist in Spark is lazy too: even though you mark the first RDD with persist, Spark doesn't cache the data until some action runs after the persist call. map is not an action in Spark; it is also lazy.
The way to force the caching is to run a count action after persisting RDD2:
val RDD2 = RDD1.map { x =>
  println("RDD1")
  ....
}.persist(StorageLevel.MEMORY_AND_DISK)

RDD2.count() // forces the caching
Now, if you run any other operation on RDD2, it won't be recomputed.
answered Nov 22 '18 at 2:26 – Avishek Bhattacharya
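Putting the whole fix together: a self-contained sketch of the pattern the answer describes (the SparkContext parameter and value names are assumed for illustration; this is not the original poster's code):

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def demo(sc: SparkContext): Unit = {
  val RDD1 = sc.parallelize(1 to 5)

  val RDD2 = RDD1.map { x =>
    println(s"RDD1 computing $x")  // side effect to observe recomputation
    x * 2
  }.persist(StorageLevel.MEMORY_AND_DISK)

  RDD2.count()  // first action: runs the map once and fills the cache

  RDD2.foreach(x => println(s"RDD2 $x"))  // served from the cache: no new "RDD1 ..." lines
  println(RDD2.map(x => (x, x)).collectAsMap())  // also served from the cache
}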
Is there any configuration option we can use to force caching, i.e. without calling action methods?
– mangusta
Nov 22 '18 at 3:01
No, this is by design in Spark. You have to force caching by invoking an action on the RDD.
– Avishek Bhattacharya
Nov 22 '18 at 3:05
from Spark RDD documentation webpage: "By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it"
– mangusta
Nov 25 '18 at 14:37
It's a bit confusing: the webpage says that persist causes the RDD to be cached for future use, while you propose that it is a lazy operation.
– mangusta
Nov 25 '18 at 14:38
Well, it persists to the cache, but not immediately; it is cached only after an action is called on it. For reference, please see jaceklaskowski.gitbooks.io/mastering-spark-sql/…, which explicitly mentions that the Spark cache is a lazy operation.
– Avishek Bhattacharya
Nov 25 '18 at 14:46
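One way to observe this laziness directly: a rough sketch, assuming an existing SparkContext sc (the exact toDebugString output varies by Spark version):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

println(rdd.getStorageLevel)  // reflects MEMORY_AND_DISK immediately: persist marks the RDD right away
println(rdd.toDebugString)    // but no CachedPartitions entry yet: nothing has been materialized

rdd.count()                   // the first action materializes the cache

println(rdd.toDebugString)    // now reports CachedPartitions: the data is actually cached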