Optimal way to save Spark SQL dataframes to S3 using information stored in them
I have dataframes with data like:
channel eventId1 eventId2 eventTs eventTs2 serialNumber someCode
Web-DTB akefTEdZhXt8EqzLKXNt1Wjg akTEdZhXt8EqzLKXNt1Wjg 1545502751154 1545502766731 4 rfs
Web-DTB 3ycLHHrbEkBJ.piYNyI7u55w 3ycLHHEkBJ.piYNyI7u55w 1545502766247 1545502767800 4 njs
Web-DTB 3ycL4rHHEkBJ.piYNyI7u55w 3ycLHHEkBJ.piYNyI7u55w 1545502766247 1545502767800 4 null
I need to save this data to an S3 path that looks like:
s3://test/data/ABC/hb/eventTs/[eventTs]/uploadTime_[eventTs2]/*.json.gz
How can I proceed, given that the S3 path has to be built from the eventTs and eventTs2 values present in the dataframes? This is as far as I got:
df.write.partitionBy("eventTs","eventTs2").format("json").save("s3://test/data/ABC/hb????")
I guess I could iterate over each row in the dataframe, extract the path and save to S3, but I do not want to do that.
Is there any way to group the dataframes by eventTs and eventTs2 and then save them to the full S3 path? Is there something more optimal?
scala apache-spark amazon-s3 apache-spark-sql
asked Jan 2 at 7:35 by GothamGirl (edited Jan 2 at 10:59)
1 Answer
Spark supports Hive-style partitions. If the number of distinct values of eventTs and eventTs2 is small, partitioning is a good way to solve this.
Check the Scala doc for more information on partitionBy.
Example usage:
import spark.implicits._   // needed for toDF on a local Seq

val someDF = Seq(
  (1, "bat", "marvel"),
  (2, "mouse", "disney"),
  (3, "horse", "animal"),
  (1, "batman", "marvel"),
  (2, "tom", "disney")
).toDF("id", "name", "place")

someDF.write.partitionBy("id", "name").orc("/tmp/somedf")
If you write the dataframe with partitionBy on "id" and "name", the following directory structure will be created.
/tmp/somedf/id=1/name=bat
/tmp/somedf/id=1/name=batman
/tmp/somedf/id=2/name=mouse
/tmp/somedf/id=2/name=tom
/tmp/somedf/id=3/name=horse
The first and second partition columns become directories: all rows where id equals 1 and name is bat are saved under /tmp/somedf/id=1/name=bat. The order of the columns passed to partitionBy decides the order of the directories.
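As a quick check (a sketch, assuming a SparkSession named spark and the spark.implicits._ import from above), you can read the partitioned output back from the base path; Spark reconstructs the id and name columns from the directory names, and a filter on them only scans the matching directories:

// Read the partitioned ORC output back; partition columns come from the directory names.
val readBack = spark.read.orc("/tmp/somedf")

// Partition pruning: only the id=1 directories are scanned.
readBack.where($"id" === 1).show()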
In your case, the partitions will be on eventTs and eventTs2.
val someDF = Seq(
  ("Web-DTB", "akefTEdZhXt8EqzLKXNt1Wjg", "akTEdZhXt8EqzLKXNt1Wjg", "1545502751154", "1545502766731", 4, "rfs"),
  ("Web-DTB", "3ycLHHrbEkBJ.piYNyI7u55w", "3ycLHHEkBJ.piYNyI7u55w", "1545502766247", "1545502767800", 4, "njs"),
  ("Web-DTB", "3ycL4rHHEkBJ.piYNyI7u55w", "3ycLHHEkBJ.piYNyI7u55w", "1545502766247", "1545502767800", 4, "null")
).toDF("channel", "eventId1", "eventId2", "eventTs", "eventTs2", "serialNumber", "someCode")

someDF.write.partitionBy("eventTs", "eventTs2").orc("/tmp/someDF")
This creates the following directory structure.
/tmp/someDF/eventTs=1545502766247/eventTs2=1545502767800
/tmp/someDF/eventTs=1545502751154/eventTs2=1545502766731
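Mapping this back to the question, here is a minimal sketch of the S3 write (assuming the Hive-style eventTs=.../eventTs2=... layout is acceptable in place of the exact uploadTime_[eventTs2] naming, and that your S3 connector accepts the s3:// scheme; on plain Apache Hadoop you may need s3a:// instead):

// Write gzipped JSON, partitioned by eventTs and eventTs2.
// Produces objects like s3://test/data/ABC/hb/eventTs=1545502751154/eventTs2=1545502766731/part-....json.gz
someDF.write
  .partitionBy("eventTs", "eventTs2")
  .option("compression", "gzip")
  .mode("append")          // or "overwrite", depending on the pipeline
  .json("s3://test/data/ABC/hb")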
answered Jan 2 at 8:58 by Sudev Ambadi (edited Jan 2 at 9:09)
Except that I am looking at storing in S3. I am aware of this partitioning logic; I am looking for a simple and clean way of doing this in S3.
– GothamGirl
Jan 2 at 9:07
Partitioning will be the simplest, IMHO, unless the number of distinct values in eventTs/eventTs2 runs into the thousands and you end up with thousands of partitions in the dataframe; in that case you will create thousands of very tiny files per partition.
– Sudev Ambadi
Jan 2 at 9:16
Edited the question to be clearer about where I am stuck. The S3 path is a function of eventTs and eventTs2, and I doubt that we can save to S3 without specifying the full path the way you are storing in HDFS.
– GothamGirl
Jan 2 at 9:38
You can save to S3 without specifying the full path; give it a try.
– Sudev Ambadi
Jan 2 at 9:55
Works! Don't know why I thought it wouldn't. Is there any way to drop a column after partitioning based on it, or to use a UDF in partitionBy? I do not want to format the epoch dates into new yyyyMMdd columns and increase the data size, and I can't drop the original columns because of granularity, as I have 3 partitioning columns which are different from the original data columns.
– GothamGirl
Jan 2 at 12:38
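On the follow-up about dropping partition columns: partitionBy does not duplicate data, because the partition columns are written only into the directory names, not into the data files, so partitioning on derived columns does not grow the stored data while the original columns stay in the files. A sketch under that assumption (the eventDay column and the yyyyMMdd format are illustrative, not from the thread):

import org.apache.spark.sql.functions.{col, from_unixtime}

// Derive a day column from the epoch-millis eventTs; it ends up only in the directory path,
// while the original eventTs column remains inside the data files.
val withDay = someDF.withColumn(
  "eventDay",
  from_unixtime((col("eventTs").cast("long") / 1000).cast("long"), "yyyyMMdd")
)

withDay.write
  .partitionBy("eventDay")
  .option("compression", "gzip")
  .json("s3://test/data/ABC/hb")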