Optimal way to save a Spark SQL DataFrame to S3 using information stored in it












I have DataFrames with data like:



    channel  eventId1                  eventId2                eventTs        eventTs2       serialNumber  someCode
    Web-DTB  akefTEdZhXt8EqzLKXNt1Wjg  akTEdZhXt8EqzLKXNt1Wjg  1545502751154  1545502766731  4             rfs
    Web-DTB  3ycLHHrbEkBJ.piYNyI7u55w  3ycLHHEkBJ.piYNyI7u55w  1545502766247  1545502767800  4             njs
    Web-DTB  3ycL4rHHEkBJ.piYNyI7u55w  3ycLHHEkBJ.piYNyI7u55w  1545502766247  1545502767800  4             null


I need to save this data to an S3 path that looks like:



  s3://test/data/ABC/hb/eventTs/[eventTs]/uploadTime_[eventTs2]/*.json.gz


How can I proceed with this? I need to extract data from the partitions to write to the S3 path (the path is a function of eventTs and eventTs2 present in the DataFrames):



df.write.partitionBy("eventTs","eventTs2").format("json").save("s3://test/data/ABC/hb????")


I guess I could iterate over each row in the DataFrame, extract the path, and save to S3, but I do not want to do that.



Is there any way to group the DataFrame by eventTs and eventTs2 and then save it to the full S3 path? Is there something more optimal?










      scala apache-spark amazon-s3 apache-spark-sql






      edited Jan 2 at 10:59
      asked Jan 2 at 7:35

      GothamGirl
























          1 Answer














          Spark supports Hive-style partitions. If the number of distinct values of eventTs and eventTs2 is small, partitioning is a good way to solve this.



          Check the Scala doc for more information on partitionBy.



          Example usage:



          val someDF = Seq((1, "bat", "marvel"), (2, "mouse", "disney"), (3, "horse", "animal"), (1, "batman", "marvel"), (2, "tom", "disney") ).toDF("id", "name", "place")
          someDF.write.partitionBy("id", "name").orc("/tmp/somedf")


          If you write the DataFrame with partitionBy on "id" and "name", the following directory structure is created.



          /tmp/somedf/id=1/name=bat
          /tmp/somedf/id=1/name=batman

          /tmp/somedf/id=2/name=mouse
          /tmp/somedf/id=2/name=tom

          /tmp/somedf/id=3/name=horse


          The partition columns become directories: all rows where id equals 1 and name equals "bat" are saved under /tmp/somedf/id=1/name=bat. The order of the columns passed to partitionBy determines the order of the directories.



          In your case, the partitions will be on eventTs and eventTs2.



          // Same assumption: spark.implicits._ in scope for toDF.
          val someDF = Seq(
            ("Web-DTB", "akefTEdZhXt8EqzLKXNt1Wjg", "akTEdZhXt8EqzLKXNt1Wjg", "1545502751154", "1545502766731", 4, "rfs"),
            ("Web-DTB", "3ycLHHrbEkBJ.piYNyI7u55w", "3ycLHHEkBJ.piYNyI7u55w", "1545502766247", "1545502767800", 4, "njs"),
            ("Web-DTB", "3ycL4rHHEkBJ.piYNyI7u55w", "3ycLHHEkBJ.piYNyI7u55w", "1545502766247", "1545502767800", 4, "null")
          ).toDF("channel", "eventId1", "eventId2", "eventTs", "eventTs2", "serialNumber", "someCode")

          someDF.write.partitionBy("eventTs", "eventTs2").orc("/tmp/someDF")


          This creates a directory structure like the following.



          /tmp/someDF/eventTs=1545502766247/eventTs2=1545502767800
          /tmp/someDF/eventTs=1545502751154/eventTs2=1545502766731
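
          For the S3 case in the question, a minimal sketch of my own (not part of the original answer): the same partitionBy write aimed at the question's base prefix, producing gzip-compressed JSON part files. It assumes the cluster's Hadoop S3 connector is configured for the s3:// scheme, and note that the directories come out in Hive-style key=value form rather than the literal eventTs/[eventTs]/uploadTime_[eventTs2] layout.

          // Sketch only: write gzip-compressed JSON under the base prefix and let
          // partitionBy build the eventTs / eventTs2 directories.
          someDF.write
            .partitionBy("eventTs", "eventTs2")
            .option("compression", "gzip")   // part files end in .json.gz
            .mode("append")
            .json("s3://test/data/ABC/hb")

          // Resulting layout:
          //   s3://test/data/ABC/hb/eventTs=1545502751154/eventTs2=1545502766731/part-....json.gz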





          edited Jan 2 at 9:09
          answered Jan 2 at 8:58

          Sudev Ambadi
          • except that I am looking at storing in S3. I am aware of this partitioning logic. Looking for a simple and clean way of doing this in S3.

            – GothamGirl
            Jan 2 at 9:07













          • Partitioning will be the simplest, IMHO, unless the number of distinct values in eventTs/eventTs2 runs into the thousands and you end up with thousands of partitions in the dataframe; in that case you will create thousands of very tiny files per partition.

            – Sudev Ambadi
            Jan 2 at 9:16











          • Edited the question to be clearer on where I am stuck. The S3 path is a function of eventTs and eventTs2, and I doubt that we can save to S3 without specifying the full path the way you are storing in HDFS.

            – GothamGirl
            Jan 2 at 9:38













          • You can save to S3 without specifying the full path, give it a try.

            – Sudev Ambadi
            Jan 2 at 9:55











          • Works! Don't know why I thought it wouldn't. Is there any way to drop a column after partitioning on it, or to use a UDF in partitionBy? I do not want to format the dates from epoch into new yyyyMMdd columns and increase the data size, and I can't drop the original columns because I need their granularity; I have 3 partitioning columns which are different from the original data columns.

            – GothamGirl
            Jan 2 at 12:38
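
          A sketch of my own addressing the last comment (not from the thread): with Spark's file-based sources, the columns passed to partitionBy are encoded in the directory names and are not written inside the data files, so deriving a formatted column purely for the layout should not inflate the stored data. The column name eventDay below is illustrative.

          import org.apache.spark.sql.functions.{col, from_unixtime}

          // Derive a yyyyMMdd partition column from the epoch-millis eventTs purely for layout;
          // it ends up in the directory names, not in the written records.
          val withDay = someDF.withColumn(
            "eventDay", from_unixtime((col("eventTs") / 1000).cast("long"), "yyyyMMdd"))

          withDay.write
            .partitionBy("eventDay", "eventTs2")
            .option("compression", "gzip")
            .json("s3://test/data/ABC/hb")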












