Writing a Java RDD to multiple folders with varying schemas
If I have a Spark DataFrame, I can very easily partition the data across multiple folders when writing, using the partitionBy() method of DataFrameWriter.
But my starting point is actually a JavaRDD. Again, it is straightforward to convert a JavaRDD to a DataFrame using the createDataFrame(JavaRDD<?> rdd, Class<?> clsT) method of the SparkSession class.
For this to work, clsT must be a JavaBean-style class. So far so good; this works for me.
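Put together, the conversion and partitioned write described above might look roughly like this. This is only a sketch of the API usage, not something I have run; MyBean, beanRdd, and the output path are placeholder assumptions:

```java
// Sketch: assumes a JavaBean class MyBean (getters/setters for
// partitionFieldOne, partitionFieldTwo, id, data) and an existing JavaRDD<MyBean>.
SparkSession spark = SparkSession.builder()
    .appName("partitioned-write")
    .getOrCreate();

Dataset<Row> df = spark.createDataFrame(beanRdd, MyBean.class);

// One output folder per (partitionFieldOne, partitionFieldTwo) combination,
// but note: every folder shares the single schema inferred from MyBean.
df.write()
  .partitionBy("partitionFieldOne", "partitionFieldTwo")
  .parquet("/output/path");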
Now, my bean class has the following fields, with example values:
String partitionFieldOne = "a";
String partitionFieldTwo = "b";
String id = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
String data = "field1: value1, field2: value2, field3: value3, ...";
Now, the issue I am having is as follows.
When writing in this partitioned fashion, all bean objects written to the same folder have the same set of field names in the data String (field1a, field2a, field3a, etc.), while bean objects written to a different folder have a different set (field1b, field2b, field3b, etc.).
This means I could save considerable disk space if I could promote those field names into a schema. At present, all data points written to a given file (I am writing to Parquet, but I am sure the question holds for other formats too) share the same set of fields, so roughly half of the information in the data String (the field names) is repeated for every data point.
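To make the redundancy concrete, here is a small plain-Java illustration (the parse helper and the "name: value" format are assumptions based on the example data String above): two records headed for the same folder carry identical key sets, which is exactly the information a per-folder schema could store once.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldRepetition {
    // Parse a "field1: value1, field2: value2" style data String into an ordered map.
    // The key set is the schema information that is currently repeated per record.
    static Map<String, String> parse(String data) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : data.split(",\\s*")) {
            String[] kv = pair.split(":\\s*", 2);
            fields.put(kv[0], kv[1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Two records destined for the same folder: same field names, different values.
        Map<String, String> r1 = parse("field1a: 10, field2a: 20");
        Map<String, String> r2 = parse("field1a: 30, field2a: 40");
        System.out.println(r1.keySet().equals(r2.keySet())); // → true
    }
}
```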
The field names (they are actually longs representing timestamps) only become known at run time and will evolve over time, so I cannot create a family of Java bean classes, one per possible set of fields.
What I want to know is: is it possible to write in a partitioned fashion while having a different schema for each individual partition?
I know that I could partition my RDD (in this paragraph only, I am using "partition" in the usual Spark sense of which data points live on which nodes), and then use the foreachPartition() method of JavaRDD to write each file in each folder separately, but I would like to know whether the DataFrameWriter allows varying schemas across write partitions.
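The grouping step behind that workaround can be illustrated without Spark at all. Below is a plain-Java sketch (the Rec type and the field-name extraction are illustrative assumptions, not part of any Spark API) that buckets records by their partition key and derives a distinct field-name set per bucket; each bucket would then be written to its own folder with its own Parquet schema:

```java
import java.util.*;
import java.util.stream.*;

public class PerPartitionSchemas {
    // Illustrative record: a partition key plus a "name: value, ..." data String.
    record Rec(String partitionKey, String data) {}

    // Group records by partition key and collect the field names seen in each group.
    static Map<String, Set<String>> schemasByPartition(List<Rec> recs) {
        return recs.stream().collect(Collectors.groupingBy(
            Rec::partitionKey,
            Collectors.flatMapping(
                r -> Arrays.stream(r.data().split(",\\s*"))
                           .map(p -> p.split(":\\s*", 2)[0]),
                Collectors.toCollection(TreeSet::new))));
    }

    public static void main(String[] args) {
        List<Rec> recs = List.of(
            new Rec("a", "field1a: 1, field2a: 2"),
            new Rec("a", "field1a: 3, field2a: 4"),
            new Rec("b", "field1b: 5, field2b: 6"));
        // Sort keys for a deterministic printout.
        System.out.println(new TreeMap<>(schemasByPartition(recs)));
        // → {a=[field1a, field2a], b=[field1b, field2b]}
    }
}
```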
Please let me know if I have explained this badly
apache-spark rdd
asked Jan 3 at 16:23 by SiLaf