Spark - RDD processed twice after persist
I made an RDD and created another RDD from the original, like below.
val RDD2 = RDD1.map { x =>
  println("RDD1")
  ....
}.persist(StorageLevel.MEMORY_AND_DISK)

RDD2.foreach { x =>
  println("RDD2")
  ...
}
...and so on...
I expected the RDD1 map function to run ONLY once, because its result is saved to memory or disk by the persist method.
BUT somehow "RDD1" is printed again after "RDD2" has been printed, like below.
RDD1
RDD1
RDD1
RDD1
RDD2
RDD2
RDD2
RDD2
RDD2
RDD1 -- the RDD1 processing repeats. WHY?
RDD1
RDD1
RDD1
RDD2
RDD2
RDD2
RDD2
RDD2
apache-spark
asked Nov 22 '18 at 1:06 by DK2 (edited Nov 22 '18 at 1:16)
when first "RDD1" is printed all, I can guarantee that process with RDD1 is done. It does same work twice.
– DK2
Nov 22 '18 at 1:28
I guess you did some collect action at the end? spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/… also returns an RDD.
– David S.
Nov 22 '18 at 1:31
@davidshen84 yes, I called collectAsMap() at the end of the code
– DK2
Nov 22 '18 at 1:33
For both RDD1 and RDD2? Remember the two major concepts in Spark: transformations and actions. A transformation has no effect on the RDD until you run an action.
– David S.
Nov 22 '18 at 1:44
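The transformation/action distinction is easy to see in isolation. A minimal sketch, assuming a local SparkContext (the app name, master URL, and value names here are illustrative, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

// map is a transformation: nothing runs yet, nothing is printed.
val mapped = sc.parallelize(1 to 3).map { x =>
  println(s"computing $x")  // executes only when an action triggers a job
  x * 2
}

println("before any action")  // printed first: the map above has not run

// collect is an action: it triggers the actual computation,
// and only now do the "computing ..." lines appear.
println(mapped.collect().mkString(", "))

sc.stop()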
1 Answer
This is the expected behaviour of Spark. Like most operations, persist in Spark is lazy too: even though you mark the first RDD with persist, Spark doesn't cache the data until some action runs after the persist call. map is not an action in Spark; it is also lazy.
The way to force the caching is to run a count action after persisting RDD2:
val RDD2 = RDD1.map { x =>
  println("RDD1")
  ....
}.persist(StorageLevel.MEMORY_AND_DISK)

RDD2.count() // forces the caching
Now, if you run any other operation on RDD2, it won't be recomputed.
answered Nov 22 '18 at 2:26 – Avishek Bhattacharya
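Putting the whole fix together: a self-contained sketch of the pattern the answer describes (the SparkContext parameter and value names are assumed for illustration; this is not the original poster's code):

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def demo(sc: SparkContext): Unit = {
  val RDD1 = sc.parallelize(1 to 5)

  val RDD2 = RDD1.map { x =>
    println(s"RDD1 computing $x")  // side effect to observe recomputation
    x * 2
  }.persist(StorageLevel.MEMORY_AND_DISK)

  RDD2.count()  // first action: runs the map once and fills the cache

  RDD2.foreach(x => println(s"RDD2 $x"))  // served from the cache: no new "RDD1 ..." lines
  println(RDD2.map(x => (x, x)).collectAsMap())  // also served from the cache
}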
Is there any configuration option we can use to force caching, i.e. without calling action methods?
– mangusta
Nov 22 '18 at 3:01
No, this is by design in Spark. You have to force caching by invoking an action on the RDD.
– Avishek Bhattacharya
Nov 22 '18 at 3:05
from Spark RDD documentation webpage: "By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it"
– mangusta
Nov 25 '18 at 14:37
It's a bit confusing: the webpage says that persist causes the RDD to be cached for future use, while you propose that it is a lazy operation.
– mangusta
Nov 25 '18 at 14:38
Well, it persists to the cache, but not immediately; it is cached only after an action is called on it. For reference, please see jaceklaskowski.gitbooks.io/mastering-spark-sql/…, which explicitly mentions that the Spark cache is a lazy operation.
– Avishek Bhattacharya
Nov 25 '18 at 14:46
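One way to observe this laziness directly: a rough sketch, assuming an existing SparkContext sc (the exact toDebugString output varies by Spark version):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

println(rdd.getStorageLevel)  // reflects MEMORY_AND_DISK immediately: persist marks the RDD right away
println(rdd.toDebugString)    // but no CachedPartitions entry yet: nothing has been materialized

rdd.count()                   // the first action materializes the cache

println(rdd.toDebugString)    // now reports CachedPartitions: the data is actually cached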