Writing a Java RDD to multiple folders with varying schemas
If I have a Spark DataFrame, I can very easily partition the data across multiple folders when writing, using the partitionBy() method of DataFrameWriter.
But my starting point is actually a JavaRDD. Again, it is straightforward to convert a JavaRDD to a DataFrame using the createDataFrame(JavaRDD<?> rdd, Class<?> clsT) method of the SparkSession class.
For this to work, clsT must be a JavaBean-style class. So far so good; this works for me.
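Put together, the conversion and partitioned write described above might look roughly like this. This is only a sketch of the API usage, not something I have run; MyBean, beanRdd, and the output path are placeholder assumptions:

```java
// Sketch: assumes a JavaBean class MyBean (getters/setters for
// partitionFieldOne, partitionFieldTwo, id, data) and an existing JavaRDD<MyBean>.
SparkSession spark = SparkSession.builder()
    .appName("partitioned-write")
    .getOrCreate();

Dataset<Row> df = spark.createDataFrame(beanRdd, MyBean.class);

// One output folder per (partitionFieldOne, partitionFieldTwo) combination,
// but note: every folder shares the single schema inferred from MyBean.
df.write()
  .partitionBy("partitionFieldOne", "partitionFieldTwo")
  .parquet("/output/path");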
Now, my bean class has the following fields, with example values:
String partitionFieldOne = "a";
String partitionFieldTwo = "b";
String id = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
String data = "field1: value1, field2: value2, field3: value3, ...";
Now, the issue I am having is as follows.
When writing in this partitioned fashion, all bean objects written to the same folder have the same set of field names in the data String (field1a, field2a, field3a, etc.), while bean objects written to a different folder have a different set (field1b, field2b, field3b, etc.).
This means I could save considerable disk space if I could promote those field names into a schema. At present, all data points written to a given file (I am writing to Parquet, but I am sure the question holds for other formats too) share the same set of fields, so roughly half of the information in the data String (the field names) is repeated for every data point.
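To make the redundancy concrete, here is a small plain-Java illustration (the parse helper and the "name: value" format are assumptions based on the example data String above): two records headed for the same folder carry identical key sets, which is exactly the information a per-folder schema could store once.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldRepetition {
    // Parse a "field1: value1, field2: value2" style data String into an ordered map.
    // The key set is the schema information that is currently repeated per record.
    static Map<String, String> parse(String data) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : data.split(",\\s*")) {
            String[] kv = pair.split(":\\s*", 2);
            fields.put(kv[0], kv[1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Two records destined for the same folder: same field names, different values.
        Map<String, String> r1 = parse("field1a: 10, field2a: 20");
        Map<String, String> r2 = parse("field1a: 30, field2a: 40");
        System.out.println(r1.keySet().equals(r2.keySet())); // → true
    }
}
```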
The field names (they are actually longs representing timestamps) only become known at run time and will evolve over time, so I cannot create a family of Java bean classes, one per possible set of fields.
What I want to know is: is it possible to write in a partitioned fashion while having a different schema for each individual partition?
I know that I could partition my RDD (in this paragraph only, I am using "partition" in the usual Spark sense of which data points live on which nodes), and then use the foreachPartition() method of JavaRDD to write each file in each folder separately, but I would like to know whether the DataFrameWriter allows varying schemas across write partitions.
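The grouping step behind that workaround can be illustrated without Spark at all. Below is a plain-Java sketch (the Rec type and the field-name extraction are illustrative assumptions, not part of any Spark API) that buckets records by their partition key and derives a distinct field-name set per bucket; each bucket would then be written to its own folder with its own Parquet schema:

```java
import java.util.*;
import java.util.stream.*;

public class PerPartitionSchemas {
    // Illustrative record: a partition key plus a "name: value, ..." data String.
    record Rec(String partitionKey, String data) {}

    // Group records by partition key and collect the field names seen in each group.
    static Map<String, Set<String>> schemasByPartition(List<Rec> recs) {
        return recs.stream().collect(Collectors.groupingBy(
            Rec::partitionKey,
            Collectors.flatMapping(
                r -> Arrays.stream(r.data().split(",\\s*"))
                           .map(p -> p.split(":\\s*", 2)[0]),
                Collectors.toCollection(TreeSet::new))));
    }

    public static void main(String[] args) {
        List<Rec> recs = List.of(
            new Rec("a", "field1a: 1, field2a: 2"),
            new Rec("a", "field1a: 3, field2a: 4"),
            new Rec("b", "field1b: 5, field2b: 6"));
        // Sort keys for a deterministic printout.
        System.out.println(new TreeMap<>(schemasByPartition(recs)));
        // → {a=[field1a, field2a], b=[field1b, field2b]}
    }
}
```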
Please let me know if I have explained this badly
apache-spark rdd
asked Jan 3 at 16:23 by SiLaf