If dataframes in Spark are immutable, why are we able to modify it with operations such as withColumn()?


























This is probably a stupid question originating from my ignorance. I have been working on PySpark for a few weeks now and do not have much programming experience to start with.



My understanding is that in Spark, RDDs, Dataframes, and Datasets are all immutable - which, again I understand, means you cannot change the data. If so, why are we able to edit a Dataframe's existing column using withColumn()?














    I think when you use withColumn you actually create a new dataframe, not modifying the current dataframe.
    – Ali AzG
    Nov 19 '18 at 12:11


















apache-spark pyspark






asked Nov 19 '18 at 11:56









AntonyP

356













2 Answers
































As per the Spark architecture, DataFrames are built on top of RDDs, which are immutable in nature; hence DataFrames are immutable in nature as well.

Regarding withColumn (or any other operation, for that matter): when you apply such an operation to a DataFrame, it will generate a new DataFrame instead of updating the existing one.

However, since Python is a dynamically typed language, you can overwrite the value of the previous reference. Hence, when you execute the statement below

df = df.withColumn()

it will generate another DataFrame and assign it to the reference "df".

In order to verify this, you can use the id() method of the underlying RDD to get the unique identifier of your DataFrame.

df.rdd.id()

will give you the unique identifier for your DataFrame.

I hope the above explanation helps.

Regards,

Neeraj
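If you don't have a Spark session at hand, the rebinding behaviour described above can be observed with any immutable Python value. The snippet below is not Spark: the tuple is a hypothetical stand-in for a DataFrame, and the id() comparison mirrors the df.rdd.id() check from the answer.

```python
# Plain-Python analogy (no Spark required): tuples are immutable, like
# DataFrames, so every "modification" builds a brand-new object.

df_like = (1, 2, 3)        # hypothetical stand-in for a DataFrame
before = df_like           # keep a handle on the original object

df_like = df_like + (4,)   # analogous to df = df.withColumn(...)

assert id(df_like) != id(before)   # a new object was created
assert before == (1, 2, 3)         # the original is untouched
assert df_like == (1, 2, 3, 4)     # the name now points at the new one
```

With a real DataFrame the pattern is the same: after df = df.withColumn(...), df.rdd.id() reports a new identifier, while any other variable still pointing at the old DataFrame keeps the old one.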







    You aren't; the documentation explicitly says




    Returns a new Dataset by adding a column or replacing the existing column that has the same name.




    If you keep a variable referring to the dataframe you called withColumn on, it won't have the new column.
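The contract quoted from the documentation can be sketched with a toy class. MiniFrame and with_column are made-up names, not part of Spark; the point is only that the method copies rather than mutates, so the object it was called on never gains the column.

```python
# Toy sketch (not Spark) of "returns a new Dataset by adding a column".

class MiniFrame:
    """Hypothetical stand-in for a DataFrame; columns kept in a dict."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def with_column(self, name, values):
        # Copy-then-extend: self is never mutated.
        new_cols = dict(self._columns)
        new_cols[name] = values
        return MiniFrame(new_cols)

    @property
    def columns(self):
        return sorted(self._columns)

old = MiniFrame({"a": [1, 2]})
new = old.with_column("b", [3, 4])

assert old.columns == ["a"]        # the original still has no "b"
assert new.columns == ["a", "b"]   # only the returned frame has it
```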






    • Doesn't something like this work in PySpark? dataframe = dataframe.withColumn("col1", when(col("col1") == "val1", "V").otherwise(col("col1"))) Not too sure about the syntax.
      – AntonyP
      Nov 19 '18 at 12:20












      You can reassign the variable, but that doesn't mean the original value changes. Any more than integers are mutable because you can write i = i + 1. By contrast, python lists are mutable: stackoverflow.com/questions/24292174/are-python-lists-mutable.
      – Alexey Romanov
      Nov 19 '18 at 12:27
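The distinction drawn in the comment, rebinding a name versus mutating an object, can be checked directly in plain Python:

```python
# Rebinding is not mutation: other references to the old value are unaffected.
i = 1
j = i
i = i + 1          # rebinds i to a new int; j still sees the old value
assert j == 1

# Lists, by contrast, ARE mutable: every reference sees the change.
xs = [1, 2]
ys = xs
xs.append(3)       # mutates the one shared list in place
assert ys == [1, 2, 3]
```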


























answered Nov 19 '18 at 13:30 by neeraj bhadani

























answered Nov 19 '18 at 12:13 by Alexey Romanov











