If DataFrames in Spark are immutable, why are we able to modify them with operations such as withColumn()?
This is probably a basic question originating from my ignorance. I have been working with PySpark for a few weeks now and do not have much programming experience to start with.
My understanding is that in Spark, RDDs, DataFrames, and Datasets are all immutable, which, as I understand it, means you cannot change the data. If so, why are we able to edit a DataFrame's existing column using withColumn()?
Tags: apache-spark, pyspark
I think when you use withColumn you actually create a new dataframe rather than modifying the current one. – Ali AzG, Nov 19 '18 at 12:11
asked Nov 19 '18 at 11:56 by AntonyP
2 Answers
As per the Spark architecture, a DataFrame is built on top of RDDs, which are immutable in nature; hence DataFrames are immutable as well.
Regarding withColumn, or any other operation for that matter: when you apply such an operation to a DataFrame, it generates a new DataFrame instead of updating the existing one.
However, since Python is a dynamically typed language, you can overwrite the value of the previous reference. Hence, when you execute a statement like
df = df.withColumn(...)
it generates another DataFrame and assigns it to the reference "df".
To verify this, you can use the id() method of the underlying RDD: df.rdd.id() will give you the unique identifier of the RDD backing your DataFrame.
I hope the above explanation helps.
Regards,
Neeraj
– answered Nov 19 '18 at 13:30 by neeraj bhadani
You aren't; the documentation explicitly says:
    Returns a new Dataset by adding a column or replacing the existing column that has the same name.
If you keep a variable referring to the dataframe you called withColumn on, it won't have the new column.
– answered Nov 19 '18 at 12:13 by Alexey Romanov
Doesn't something like this work in PySpark? dataframe = dataframe.withColumn("col1", when(col("col1") == "val1", "V").otherwise(col("col1"))) Not too sure about the syntax. – AntonyP, Nov 19 '18 at 12:20
You can reassign the variable, but that doesn't mean the original value changes, any more than integers are mutable because you can write i = i + 1. By contrast, Python lists are mutable: stackoverflow.com/questions/24292174/are-python-lists-mutable. – Alexey Romanov, Nov 19 '18 at 12:27
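The reassignment-versus-mutation distinction drawn in the comment above can be demonstrated in plain Python, without Spark:

```python
# Reassignment: the name i is rebound to a brand-new int object;
# the int 1 itself was never changed (ints are immutable).
i = 1
id_before = id(i)
i = i + 1
int_rebound = id(i) != id_before

# Mutation: the list object is changed in place;
# the name xs still refers to the same object.
xs = [1, 2]
list_id_before = id(xs)
xs.append(3)
list_same_object = id(xs) == list_id_before

print(i, int_rebound)        # 2 True
print(xs, list_same_object)  # [1, 2, 3] True
```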